pdf compressing software?
I have a couple of 300 MB+ PDF files (scanned books, OCR'd) which I would like to shrink while maintaining text quality. I do not care much about the photos. What would be the best approach?
- Dingo
- Posts: 1437
- Joined: Tue 11 Dec 2007, 17:48
- Location: somewhere at the end of rainbow...
- Contact:
Re: pdf compressing software?
Dromeno wrote: I have a couple of 300mb+ pdf files (scanned books, OCR'd) which I would like to shrink while maintaining text quality. I do not care much about the photos. What would be the best approach?
So what you have is really a PDF wrapper around scanned images.
A good way to decrease file size while keeping quality and readability is to reduce the color depth.
If your scanned pages are in color or grayscale, you can shrink them a great deal by converting the text to black and white and compressing it with the encoder Adam Langley designed for the Google Books project, alongside JPEG 2000 for grayscale details:
Jbig2enc
- http://dokupuppylinux.info/programs:encoders
you need:
- python (I use Python 2.5 in Puppy 3.01: http://dokupuppylinux.info/programs:python)
- pdf.py (a small Python script that puts all the JBIG2-encoded images into a PDF): http://dokupuppylinux.info/programs:encoders
HOWTO:
1° - extract all images from the PDF at their native resolution, if you no longer have the originals outside the PDF (this can be done with pdfimages from xpdf-utils or poppler-utils)
2° - encode all the images with jbig2enc and wrap them in a PDF with pdf.py
jbig2 -s -p -v *.fileextension && pdf.py output>file.pdf
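The two steps above could be strung together roughly as follows. This is only a sketch: it assumes pdfimages, jbig2 (from jbig2enc) and pdf.py are all on the PATH, `page` is an arbitrary output prefix, and the function name and the `DRY_RUN` switch are made up for illustration.

```shell
# Sketch of the extract -> encode -> wrap workflow described above.
# Assumptions: pdfimages, jbig2 (jbig2enc) and pdf.py are on PATH;
# "page" is an arbitrary prefix; DRY_RUN=1 only prints the commands.

compress_scan() {
    in_pdf=$1
    out_pdf=$2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "pdfimages $in_pdf page"
        echo "jbig2 -s -p -v page-*"
        echo "pdf.py output > $out_pdf"
        return 0
    fi
    # 1. extract the embedded images at their native resolution
    pdfimages "$in_pdf" page || return 1
    # 2. JBIG2-encode them in symbol mode, producing PDF-ready fragments
    #    (jbig2 writes its results under the default basename "output")
    jbig2 -s -p -v page-* || return 1
    # 3. wrap the encoded pages in a single PDF
    pdf.py output > "$out_pdf"
}
```

Calling `compress_scan book.pdf book-small.pdf` in an empty working directory keeps the `page-*` glob from picking up unrelated files; with `DRY_RUN=1` set it just lists what it would run.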
original scanned image (taken with a Nikon D3200): 539 KB, 2230x3777
black and white image encoded as PDF with jbig2enc: 28 KB, at the same 2230x3777 dimensions
http://ge.tt/7jxwM301/v/0 (encoded pdf)
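To get a feel for why dropping the color depth helps so much, here is a small back-of-the-envelope calculation of the raw, uncompressed size of a 2230x3777 page at different bit depths. The `raw_bytes` helper is hypothetical, and real files are smaller because both the original and the JBIG2 version are compressed; it only shows the headroom gained before the encoder even starts.

```shell
# Raw (uncompressed) bytes for a w x h page at a given bit depth.
# Each row is padded up to a whole number of bytes.
raw_bytes() {
    w=$1; h=$2; bits=$3
    echo $(( (w * bits + 7) / 8 * h ))
}

raw_bytes 2230 3777 24   # 24-bit color:      25268130
raw_bytes 2230 3777 8    # 8-bit grayscale:    8422710
raw_bytes 2230 3777 1    # 1-bit black/white:  1053783
```

Going from 24-bit color to 1-bit black and white cuts the raw data by a factor of about 24 before any compression is applied.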
replace .co.cc with .info to get access to stuff I posted in forum
dropbox 2GB free
OpenOffice for Puppy Linux
epub
Dingo
Thanks for your fast help. But I do not understand everything yet.
Yes, it is easy to split the PDFs into text and images and then compress the images. But how do I recombine those back into a new compressed PDF with embedded text? Of course I can OCR that PDF again, but I expect to lose text quality then.
BTW, my end goal is to produce an EPUB from that PDF (with Calibre).
- Dingo
Re: epub
Dromeno wrote: how do I recombine those back into a new compressed PDF with embedded text?
With hocr2pdf from the ExactImage utilities:
- http://www.exactcode.com/site/open_sour ... /hocr2pdf/
in this case a possible workflow is:
1° - extract all scanned images from wrapping pdf with pdfimages
pdfimages file.pdf 0
gm mogrify -format tiff -dither None -compress Group4 -threshold *value* *.fileext
4° - combine the hOCR output and the TIFF files into one multipage PDF with hocr2pdf
If you want to take a look at hocr2pdf, you need to find precompiled binaries: I tried many times to build the ExactImage utilities in Puppy Linux, but the resulting executables, even though they built cleanly, failed to open any file I submitted for testing.
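For a single page, the OCR and recombination part of this workflow might look like the sketch below. The thread does not say which OCR engine produces the hOCR file, so using tesseract here is purely an assumption; the file names, the function name and the `DRY_RUN` switch are hypothetical too. Only the `hocr2pdf -i image -o output < file.hocr` invocation is the hocr2pdf part.

```shell
# Sketch: turn one thresholded TIFF page into a PDF with a hidden text layer.
# Assumptions: tesseract is the OCR engine (not named in the thread) and
# writes <base>.hocr; hocr2pdf comes from ExactImage; DRY_RUN=1 only prints.

ocr_page() {
    tif=$1              # e.g. page-000.tiff produced by the gm mogrify step
    base=${tif%.*}      # strip the extension to get the output basename
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "tesseract $tif $base hocr"
        echo "hocr2pdf -i $tif -o $base.pdf < $base.hocr"
        return 0
    fi
    # OCR the page and write the recognized text with positions as hOCR
    tesseract "$tif" "$base" hocr || return 1
    # place the page image and the hOCR text together in a one-page PDF
    hocr2pdf -i "$tif" -o "$base.pdf" < "$base.hocr"
}
```

Running `ocr_page` over every TIFF yields one PDF per page; joining those into a single book is a separate step not shown here.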
Dromeno wrote: Of course I can OCR that pdf but I expect to lose text quality then.
This is generally the field of application of Adobe ClearScan, proprietary software that shrinks scanned PDFs by vectorizing the raster text in book scans, creating a custom font from the recognized text, and using this custom subsetted font to represent the text.
http://acrobatusers.com/tutorials/bette ... oks-better
But if you dislike Adobe or, like me, hate this monopolizing company, a possible alternative is
smoothscan
https://natecraun.net/projects/smoothscan/
"smoothscan is a tool to convert scanned text into a vectorized output form."
The source is available for building. The last time I tried to build smoothscan I ran into problems with the fontforge Python dependencies; it seems these dependencies have since been removed, so there is a chance the build will now succeed.
I just tried compressing a 923 kB pdf file with Xarchive. The result was a 405 kB .tar.gz file which seemed to uncompress to the original pdf file just fine.
Instructions on how I used Xarchive to do this are here.
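The same kind of archive can be made from the command line. This is a minimal sketch assuming tar with gzip support is available; `pack_pdf` is a made-up helper name. Note that this wraps the PDF in a compressed archive for storage or transfer, so the PDF has to be unpacked again before it can be read.

```shell
# Pack a single PDF into <name>.pdf.tar.gz next to the original.
pack_pdf() {
    pdf=$1
    # -C switches to the PDF's directory so the archive stores only the file name
    tar -czf "$pdf.tar.gz" -C "$(dirname "$pdf")" "$(basename "$pdf")"
}
```

For example, `pack_pdf book.pdf` writes `book.pdf.tar.gz`, and `tar -xzf book.pdf.tar.gz` restores the original file.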