pdf compressing software?

Dromeno
Posts: 534
Joined: Fri 12 Sep 2008, 07:01

pdf compressing software?

#1 Post by Dromeno »

I have a couple of 300 MB+ pdf files (scanned books, OCR'd) which I would like to shrink while maintaining text quality. I do not care much about the photos. What would be the best approach?

Dingo
Posts: 1437
Joined: Tue 11 Dec 2007, 17:48
Location: somewhere at the end of rainbow...

Re: pdf compressing software?

#2 Post by Dingo »

Dromeno wrote: I have a couple of 300 MB+ pdf files (scanned books, OCR'd) which I would like to shrink while maintaining text quality. I do not care much about the photos. What would be the best approach?
So what you have is really a pdf wrapper around scanned images.

A good way to reduce file size while keeping quality and readability is to reduce the color depth.
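To get a feel for how much color depth matters, here is a rough calculation of the raw (uncompressed) pixel data for one 2230x3777 page at different depths. It is only a sketch: it ignores row padding, file headers and any compression, and the function name raw_bytes is made up for illustration.

```shell
#!/bin/sh
# raw_bytes WIDTH HEIGHT BITS_PER_PIXEL
# Prints the size in bytes of the uncompressed pixel data,
# rounded up to a whole byte (no row padding, no headers).
raw_bytes() {
    echo $(( ($1 * $2 * $3 + 7) / 8 ))
}

raw_bytes 2230 3777 24   # 24-bit color      -> 25268130
raw_bytes 2230 3777 8    # 8-bit grayscale   -> 8422710
raw_bytes 2230 3777 1    # 1-bit black/white -> 1052839
```

Going from 24-bit color to 1-bit black and white cuts the raw data by a factor of 24 before any encoder has done its work.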

If your scanned pages are in color or grayscale, you can shrink them dramatically with the encoder Adam Langley designed for the Google Books project, which compresses scanned text in black and white (alongside JPEG 2000 for grayscale details).

Jbig2enc
- http://dokupuppylinux.info/programs:encoders

you need:

- python:
I use python 2.5 in puppy 3.01 (http://dokupuppylinux.info/programs:python)
- pdf.py (a small python script that puts all jbig2-encoded images into a pdf)
http://dokupuppylinux.info/programs:encoders



HOWTO:

1° - extract all images from the pdf (if you don't have the originals outside the pdf) at their native resolution (this can be done with pdfimages from xpdfutils or from poppler-utils)

2° - encode all images with jbig2enc

Code: Select all

jbig2 -s -p -v *.fileextension && pdf.py output>file.pdf
a sample:

original scanned image (taken with nikon d3200) 539 KB - 2230x3777

black and white image and encoded pdf (with jbig2enc) 28KB! keeping same size: 2230x3777
http://ge.tt/7jxwM301/v/0 (encoded pdf)
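
The two HOWTO steps above could be wrapped in a small script. This is only a minimal sketch: it assumes pdfimages, jbig2 and pdf.py are all on the PATH, that pdfimages writes the page images in a format jbig2 can read, and that the working directory holds no other files whose names start with page-. The function names, the page prefix and the DRY_RUN switch are made up for illustration.

```shell
#!/bin/sh
# run CMD [ARGS...]: execute a command, or only print it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# shrink_scanned_pdf INPUT.pdf OUTPUT.pdf
shrink_scanned_pdf() {
    in_pdf=$1
    out_pdf=$2

    # step 1: extract the embedded images; "page" is an arbitrary
    # file name prefix chosen for this sketch
    run pdfimages "$in_pdf" page || return 1

    # step 2: encode every extracted image with jbig2enc, using the
    # same flags as in the post
    run jbig2 -s -p -v page-* || return 1

    # wrap the encoder output ("output" is its default base name)
    # into a single pdf
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "pdf.py output > $out_pdf"
    else
        pdf.py output > "$out_pdf"
    fi
}
```

Calling it as `DRY_RUN=1 shrink_scanned_pdf book.pdf book-small.pdf` only prints the three commands, which is a cheap way to check them before running on a 300 MB file.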
replace .co.cc with .info to get access to stuff I posted in forum
dropbox 2GB free
OpenOffice for Puppy Linux

Dromeno
Posts: 534
Joined: Fri 12 Sep 2008, 07:01

epub

#3 Post by Dromeno »

Dingo

Thx for your fast help. But I do not understand everything yet.

Yes, it is easy to split the PDFs into text + images and then compress the images. But how do I recombine those back into a new compressed PDF with embedded text? Of course I can OCR that PDF, but I expect to lose text quality then.

BTW, my end goal is to produce an epub from that PDF (with Calibre).

Dingo
Posts: 1437
Joined: Tue 11 Dec 2007, 17:48
Location: somewhere at the end of rainbow...

Re: epub

#4 Post by Dingo »

Dromeno wrote: how do I recombine those back into a new compressed PDF with embedded text?
with hocr2pdf from exactimage utils
- http://www.exactcode.com/site/open_sour ... /hocr2pdf/

in this case a possible workflow is:

1° - extract all scanned images from wrapping pdf with pdfimages

Code: Select all

pdfimages file.pdf 0
2° - convert to b/w tiff without dithering (this can be done with graphicsmagick, which is lighter and faster than imagemagick)

Code: Select all

gm mogrify -format tiff +dither -compress Group4 -threshold *value*  *.fileext
3° - run OCR on the resulting tiff images with hocr-capable software such as tesseract, which can produce hocr output for reuse with hocr2pdf
4° - combine the hocr output and the tiff files into one multipage pdf with hocr2pdf
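
Steps 3° and 4° might look roughly like this. It is a sketch under several assumptions: tesseract is recent enough to write a .hocr file (older versions write .html instead), hocr2pdf is called once per page image, and the single-page pdfs are then joined with pdfunite from poppler-utils, which is a tool swapped in here and not mentioned elsewhere in this thread. The function name is made up, and file names must not contain spaces.

```shell
#!/bin/sh
# ocr_pages_to_pdf OUTPUT.pdf PAGE1.tiff [PAGE2.tiff ...]
ocr_pages_to_pdf() {
    out_pdf=$1
    shift
    page_pdfs=""

    for tiff in "$@"; do
        base=${tiff%.*}

        # step 3: OCR the b/w tiff; "hocr" selects hOCR output,
        # written to "$base.hocr" by recent tesseract versions
        tesseract "$tiff" "$base" hocr || return 1

        # step 4: put the recognized text behind the page image;
        # hocr2pdf reads the hOCR data from standard input
        hocr2pdf -i "$tiff" -o "$base.pdf" < "$base.hocr" || return 1

        page_pdfs="$page_pdfs $base.pdf"
    done

    # join the single-page pdfs (word splitting of $page_pdfs is intended)
    pdfunite $page_pdfs "$out_pdf"
}
```

For example, `ocr_pages_to_pdf book.pdf 0-000.tiff 0-001.tiff` would leave one searchable pdf per page plus the combined book.pdf.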

If you want to try hocr2pdf, you will need to find precompiled binaries: I tried many times to build the exactimage utils in Puppy Linux, and although the build completed without errors, the resulting executables failed to open any file I gave them for testing.
Dromeno wrote: Of course I can OCR that pdf but I expect to lose text quality then.
This is generally the use case for Adobe ClearScan, a proprietary software that shrinks scanned pdfs by vectorizing the raster text in book scans: it creates a custom font from the recognized text and uses this custom subsetted font to represent the text.
http://acrobatusers.com/tutorials/bette ... oks-better

But if you dislike Adobe or, like me, hate this monopolizing company, a possible alternative is
smoothscan
https://natecraun.net/projects/smoothscan/
smoothscan is a tool that converts scanned text into a vectorized output form.
The source is available for building. The last time I tried to build smoothscan I ran into problems with the fontforge python dependencies; these dependencies now seem to have been removed, so there is a chance the build will go fine.

Flash
Official Dog Handler
Posts: 13071
Joined: Wed 04 May 2005, 16:04
Location: Arizona USA

#5 Post by Flash »

I just tried compressing a 923 kB pdf file with Xarchive. The result was a 405 kB .tar.gz file which seemed to uncompress to the original pdf file just fine.
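
For anyone without Xarchive, a rough command-line equivalent might look like the sketch below. This is an assumption about what the GUI does, since the options it passes to tar are not shown in this thread, and the function name is made up. Keep in mind that the archive has to be unpacked again before the pdf can be read.

```shell
#!/bin/sh
# pack_pdf FILE.pdf
# Writes FILE.pdf.tar.gz next to the pdf and prints both sizes in bytes.
pack_pdf() {
    pdf=$1
    tar -czf "$pdf.tar.gz" "$pdf" || return 1
    echo "original: $(wc -c < "$pdf") bytes"
    echo "archive:  $(wc -c < "$pdf.tar.gz") bytes"
}
```

Running `pack_pdf book.pdf` shows how much gzip saves for that particular file.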

Instructions on how I used Xarchive to do this are here.
