pdf compressing software?

Dromeno · #1 Post by **Dromeno** » Mon 18 Nov 2013, 11:30

I have a couple of 300 MB+ PDF files (scanned books, OCR'd) which I would like to shrink while maintaining text quality. I do not care much about the photos. What would be the best approach?

Dingo · #2 Post by **Dingo** » Mon 18 Nov 2013, 13:33

Dromeno wrote:I have a couple of 300 MB+ PDF files (scanned books, OCR'd) which I would like to shrink while maintaining text quality. I do not care much about the photos. What would be the best approach?

so what you have is, really, a pdf wrapper around scanned images

A good way to decrease file size while keeping quality and readability is to reduce the color depth

if your scanned pages are in color/grayscale, you can shrink them a great deal by using the encoder Adam Langley designed for the Google Books project to compress black and white scanned text, alongside JPEG 2000 for grayscale details

Jbig2enc
- http://dokupuppylinux.info/programs:encoders

you need:

- python:
I use python 2.5 in puppy 3.01 (http://dokupuppylinux.info/programs:python)
- pdf.py (a small python script that puts all the jbig2 encoded images in a pdf)
http://dokupuppylinux.info/programs:encoders

HOWTO:

1° - extract all images from the pdf (if you don't have the originals outside the pdf) at their native resolution (this can be done with pdfimages from xpdf-utils or from poppler-utils)

2° - encode all images with jbig2enc

Code: Select all

jbig2 -s -p -v *.fileextension && pdf.py output > file.pdf
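The two steps above can be strung together roughly like this. It is only a sketch, assuming pdfimages, jbig2 and pdf.py are all on your PATH and that pdf.py is executable; the function name `shrink_scanned_pdf` and the `page` prefix are made up for illustration.

```shell
#!/bin/sh
# Sketch only: extract the page images from a scanned pdf, then pack them
# with jbig2enc. Assumes pdfimages, jbig2 and pdf.py are on PATH.
# shrink_scanned_pdf and the "page" prefix are illustrative names.
shrink_scanned_pdf() {
    in_pdf=$1
    out_pdf=$2
    workdir=$(mktemp -d) || return 1

    # Step 1: write every embedded image as page-NNN.pbm/.ppm, unscaled.
    pdfimages "$in_pdf" "$workdir/page" || return 1

    # Step 2: with -s -p, jbig2 writes output.sym and output.NNNN files into
    # the current directory; pdf.py then wraps them into a pdf on stdout.
    # Anything jbig2 prints is sent to stderr so it cannot end up in the pdf.
    ( cd "$workdir" && jbig2 -s -p -v page-* >&2 && pdf.py output ) > "$out_pdf"
}
```

It would be called as `shrink_scanned_pdf book.pdf book-small.pdf`. Color or grayscale extracts get reduced to black and white by the encoder, so check a few pages of the result before deleting anything.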

a sample:

original scanned image (taken with nikon d3200) 539 KB - 2230x3777

black and white image encoded into a pdf (with jbig2enc): 28 KB! keeping the same dimensions: 2230x3777

http://ge.tt/7jxwM301/v/0 (encoded pdf)
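To put the sample numbers in perspective, here is the arithmetic as a small sketch. It assumes KB means 1024 bytes and, for the last figure, that the photo would be 24-bit RGB if stored uncompressed; neither is stated above.

```shell
#!/bin/sh
# Figures from the sample: 539 KB original, 28 KB after jbig2enc, 2230x3777 px.
orig_kb=539
new_kb=28
width=2230
height=3777

# Shell arithmetic is integer only, so values are scaled to keep decimals.
ratio_x100=$(( orig_kb * 100 / new_kb ))                   # ratio times 100
saved_permille=$(( (orig_kb - new_kb) * 1000 / orig_kb ))  # saving in 1/1000
pixels=$(( width * height ))
raw_1bpp_kb=$(( pixels / 8 / 1024 ))   # uncompressed 1-bit bitmap (KB = 1024 bytes assumed)
raw_24bpp_kb=$(( pixels * 3 / 1024 ))  # uncompressed 24-bit RGB (assumed depth)

printf 'compression ratio: %d.%02dx\n' $(( ratio_x100 / 100 )) $(( ratio_x100 % 100 ))
printf 'space saved: %d.%d%%\n' $(( saved_permille / 10 )) $(( saved_permille % 10 ))
printf 'raw 1-bit page: %d KB, raw 24-bit page: %d KB\n' "$raw_1bpp_kb" "$raw_24bpp_kb"
```

So even a plain uncompressed 1-bit bitmap of this page would be far bigger than the 28 KB result; most of the saving comes from the JBIG2 coding on top of the reduced color depth.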

Dromeno · #3 Post by **Dromeno** » Mon 18 Nov 2013, 15:34

Dingo

Thx for your fast help. But I do not understand everything yet.

Yes, it is easy to split the PDFs into text + images and then compress the images. But how do I recombine those into a new compressed PDF with embedded text? Of course I can OCR that PDF again, but I expect to lose text quality then.

BTW, my end goal is to produce an epub from that PDF (with Calibre).

Dingo · #4 Post by **Dingo** » Mon 18 Nov 2013, 19:04

Dromeno wrote:how do I recombine those back into a new compressed PDF with embedded text?

with hocr2pdf from the exactimage utils
- http://www.exactcode.com/site/open_sour ... /hocr2pdf/

in this case a possible workflow is:

1° - extract all scanned images from wrapping pdf with pdfimages

Code: Select all

pdfimages file.pdf 0

2° - convert to b/w tiff without dithering (this can be done with graphicsmagick, which is lighter and faster than imagemagick)

Code: Select all

gm mogrify -format tiff -dither None -compress Group4 -threshold *value*  *.fileext

3° - perform ocr on the resulting tiff images with hocr capable software like tesseract, which can produce hocr output for hocr2pdf to reuse
4° - combine the hocr output and the tiff files into one multipage pdf with hocr2pdf
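Put together, the four steps could look something like the sketch below. It is untested against a real book and rests on several assumptions: gm, tesseract, hocr2pdf and pdfunite (from poppler-utils, used here to join the per-page pdfs) are installed, the function name `ocr_rebuild_pdf` and the 50% default threshold are made up, and the gm options are a variant of the command above that forces grayscale before thresholding.

```shell
#!/bin/sh
# Sketch of steps 1-4. Assumes pdfimages, gm, tesseract, hocr2pdf and
# pdfunite are on PATH. ocr_rebuild_pdf, the "page" prefix and the 50%
# default threshold are illustrative choices, not tested values.
ocr_rebuild_pdf() {
    in_pdf=$1
    out_pdf=$2
    threshold=${3:-50%}
    workdir=$(mktemp -d) || return 1

    # 1. Extract the scanned images (page-000.ppm, page-001.ppm, ...).
    pdfimages "$in_pdf" "$workdir/page" || return 1

    # 2. Convert to black and white tiff with Group4 compression.
    gm mogrify -format tiff -colorspace GRAY -threshold "$threshold" \
        -compress Group4 "$workdir"/page-*.p?m || return 1

    # 3 and 4. OCR every page to hocr, then lay that text over the image.
    for tif in "$workdir"/page-*.tiff; do
        base=${tif%.tiff}
        tesseract "$tif" "$base" hocr || return 1
        hocr=$base.hocr
        # Some older tesseract releases name the hocr output .html instead.
        [ -f "$hocr" ] || hocr=$base.html
        hocr2pdf -i "$tif" -o "$base.pdf" < "$hocr" || return 1
    done

    # Join the single-page pdfs; the glob sorts them in page order as long
    # as the zero-padded numbering from pdfimages holds.
    pdfunite "$workdir"/page-*.pdf "$out_pdf"
}
```

Usage would be `ocr_rebuild_pdf book.pdf book-ocr.pdf 60%`. The threshold decides how much faint print survives, so try a few values on a single page first.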

if you want to try hocr2pdf, you need to find precompiled binaries: I tried many times to build the exactimage utils in Puppy Linux, but the resulting executables, even though they built cleanly, failed to open any file I submitted for testing

Dromeno wrote:Of course I can OCR that PDF again, but I expect to lose text quality then.

this is generally the field of application of Adobe ClearScan, a proprietary Adobe Acrobat feature that shrinks the size of scanned pdfs by vectorizing the raster text in book scans: it creates a custom font from the recognized text and uses this custom subsetted font to represent the text
http://acrobatusers.com/tutorials/bette ... oks-better

but if you dislike Adobe or, like me, hate this monopolizing company, a possible alternative is
smoothscan
https://natecraun.net/projects/smoothscan/

smoothscan is a tool to convert scanned text into a vectorized output form.

the source is available for building. The last time I tried to build smoothscan I ran into problems with the fontforge python dependencies; it seems these dependencies have since been removed, so there is a chance it will build fine now

Flash · #5 Post by **Flash** » Mon 18 Nov 2013, 20:02

I just tried compressing a 923 kB pdf file with Xarchive. The result was a 405 kB .tar.gz file which seemed to uncompress to the original pdf file just fine.

Instructions on how I used Xarchive to do this are here.
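For anyone who prefers the command line, roughly the same test can be done with tar. This is a sketch rather than what Xarchive does internally; `tar_gz_roundtrip` is a made-up name, and it assumes a tar that supports -z and -O.

```shell
#!/bin/sh
# Sketch: pack a pdf into a .tar.gz, report both sizes, and verify that
# unpacking gives back exactly the original bytes.
# tar_gz_roundtrip is an illustrative name; needs tar (with -z, -O), wc, cmp.
tar_gz_roundtrip() {
    pdf=$1
    archive=$pdf.tar.gz
    tar -czf "$archive" "$pdf" || return 1

    orig_bytes=$(( $(wc -c < "$pdf") ))
    packed_bytes=$(( $(wc -c < "$archive") ))
    echo "original: $orig_bytes bytes"
    echo "archive:  $packed_bytes bytes"

    # -O sends the extracted file to stdout so cmp can check it byte by byte.
    if tar -xzOf "$archive" "$pdf" | cmp -s - "$pdf"; then
        echo "round trip: identical"
    else
        echo "round trip: DIFFERENT"
        return 1
    fi
}
```

Keep in mind the archive has to be unpacked again before a pdf reader can open the file, so this helps with storage and transfer rather than with the size of the pdf you actually read.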

(old)Puppy Linux Discussion Forum
