Page 5 of 6

Posted: Fri 02 Feb 2018, 06:41
by disciple
http://www.willus.com/k2pdfopt/pdf_conversion.shtml
A page with links to a bunch of pdf related tools, particularly for making pdfs more usable on ereaders etc.
Probably not much new there unless you are interested in the ereader type stuff.

Posted: Fri 02 Feb 2018, 06:49
by disciple
Just a comment on compressing pdfs (generated ones, probably not scanned ones): I have found that using the compression option in tools like qpdf or pdftk often makes the files a bit bigger (as well as often deleting bookmarks etc)! The best tool for this (and many other things) is pdfsam - e.g. use the merge option and select the options to compress the output file, and to keep the existing bookmarks.

Also note that if you stick to the free versions of pdfsam, you may want to use the current generation for some tasks because of its new capabilities, and the previous generation (which incidentally didn't suffer from the rubbish gui font blurring!) for the "visual reorder" feature, if pdfshuffler, pdfmod, flexipdf etc aren't giving you good results.

Posted: Wed 28 Mar 2018, 04:28
by disciple
This has a command line interface and a nice simple gui (unfortunately gtk3) if you're looking to do simple booklet / n-up stuff... it will of course have the same problems on some pdfs as other tools using pypdf.
Bookletimposer is a utility to achieve some basic imposition on PDF documents, especially designed to work on booklets.

Bookletimposer is implemented as a commandline and GTK+ interface to pdfimposer, a reusable python module built on top of pyPdf2.
https://kjo.herbesfolles.org/bookletimposer/

Posted: Wed 28 Mar 2018, 04:31
by disciple
Here is another "free" viewer that might be worth checking out if you are after something with similar capabilities to the Windows Adobe Reader:
https://www.qoppa.com/pdfstudioviewer/
No nice crisp aliased fonts though :(
Cross-platform (Java) and relatively huge...
Ability to create annotations is supposed to be coming this April or May, so you may want to hold off until then.

Posted: Wed 28 Mar 2018, 05:02
by disciple
In terms of more complex booklet / n-up stuff, I previously mentioned PDF bookbinder and pdfbooklet at http://www.murga-linux.com/puppy/viewtopic.php?t=61579, but I think not in this thread.
Pdfbooklet also uses PyPDF2, while PDF bookbinder is written in Java.

Posted: Mon 29 Oct 2018, 08:24
by disciple
I mentioned my own joinPdf on the first page of this thread, but didn't provide a link.
It provides a gui and a command line tool for assembling documents based on file and path names: you give it one or more inputs and it sorts them, and if an input is a folder it recursively searches for pdfs inside. I recently made some significant improvements to both sorting and performance, by utilising different backend tools if available.

Posted: Mon 29 Oct 2018, 08:28
by disciple
If anyone wants to put images into pdfs (e.g. create a report appendix with a series of photos, one on each page, specifying margins etc), I'm getting good results with a python command line tool https://gitlab.mister-muffin.de/josch/img2pdf/. I think I selected it for the control of margins and because it operates losslessly with jpeg images.
I possibly need to look again for a solution that will also caption the photos.

Posted: Mon 29 Oct 2018, 08:33
by disciple
For converting html reports to pdf I am using the perl html2ps, which is great for my purposes, but no longer developed, and doesn't understand much css etc. I believe the php html2ps is much more capable, but I haven't tried it.

Unfortunately the perl html2ps homepage has recently gone offline, but you can see its documentation online e.g. http://web.mit.edu/outland/share/lib/ht ... ml2ps.html, and presumably find the source easily enough e.g. at Debian.

Posted: Mon 29 Oct 2018, 08:46
by disciple
I've been using pdftk to add watermarks and background stamps to pdfs, which is great except AFAIK it can't cope well with a mixture of page sizes and orientations (unless you give it a stamp pdf with a separate page for each page in the main input pdf).
I use it mainly for adding a header logo, which I want at the top right of the page.
I am going to try using the python pdfrw, which has an example script doing almost exactly what I want.

Fixing corrupted pdfs

Posted: Thu 08 Nov 2018, 05:57
by disciple
I found stamping with pdfrw slightly corrupts certain pdfs produced by "save as" from Microsoft Word 2016.

Investigating ways to fix either the corrupted file or the input file, I found I can:
- process the corrupted pdf with either pdftocairo (1) or gs (2), or
- process the original input pdf with pdftocairo (3) or gs (4), before processing it with pdfrw

Other tools like mutool and qpdf failed to fix this particular problem.
Using gs is a little slower than pdfrw, but I notice it also shrinks the filesize of some pdfs, significantly more than pdftocairo does.

Code:

 pdftocairo -pdf 0s.pdf 1.pdf

 gs -o out.pdf -sDEVICE=pdfwrite in.pdf
EDIT:
Note that:
- In some cases both pdftocairo and gs make files bigger, sometimes MUCH bigger. In ghostscript's case I think this is mainly due to:
1. gs embedding fonts that are missing in the original pdf.
2. gs reencoding images if you use e.g. -dPDFSETTINGS=/prepress, which I believe actually upsamples images, meaning filesize increases about 25% on average.

If you don't specify -dPDFSETTINGS then filesize decreases about 5% on average, compared to a 35% increase with pdftocairo. But some individual files still increase hugely, and these will often be bigger than if processed with pdftocairo. So if you want to process a bunch of files to make sure they aren't corrupt, it is best to use both tools and then keep the smallest resulting file.

The other thing to be aware of is that if any of your files can't be processed because it needs a password to open, pdftocairo will skip it, but gs will write a blank output file.
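The "run both tools and keep the smallest result" idea above could be scripted roughly as follows. This is only a sketch: the function names, the `cairo.pdf` / `gs.pdf` working filenames and the `workdir` argument are made up for illustration, and the two command lines are simply the ones quoted earlier in this post. It skips missing and zero-byte outputs, but it does NOT detect a gs output that is a valid but blank pdf; such a file would probably win on size, so password-protected inputs need to be filtered out beforehand.

```python
import shutil
import subprocess
from pathlib import Path


def repair_commands(src, workdir):
    """Return {output_path: command} for the pdftocairo and gs routes."""
    workdir = Path(workdir)
    cairo_out = workdir / "cairo.pdf"  # hypothetical working filenames
    gs_out = workdir / "gs.pdf"
    return {
        cairo_out: ["pdftocairo", "-pdf", str(src), str(cairo_out)],
        gs_out: ["gs", "-o", str(gs_out), "-sDEVICE=pdfwrite", str(src)],
    }


def pick_smallest(candidates):
    """Return the smallest existing, non-empty candidate file, or None."""
    usable = [p for p in map(Path, candidates)
              if p.is_file() and p.stat().st_size > 0]
    return min(usable, key=lambda p: p.stat().st_size) if usable else None


def repair(src, dest, workdir):
    """Run both tools on src and copy the smaller output to dest."""
    commands = repair_commands(src, workdir)
    for cmd in commands.values():
        # A tool that fails simply leaves no usable candidate behind.
        subprocess.run(cmd, check=False)
    best = pick_smallest(commands)  # iterating the dict yields the output paths
    if best is not None:
        shutil.copyfile(best, dest)
    return best
```

Something like `repair("in.pdf", "fixed.pdf", "/tmp/work")` would then leave the smaller of the two outputs in `fixed.pdf`, provided both tools are installed and the working directory exists.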

attach files to a pdf

Posted: Thu 08 Nov 2018, 06:12
by disciple
If you are interested in "pdf portfolios" i.e. pdfs with attached files, pdfdetach can extract them. But mutool can do that and also attach them in the first place.

mutool does several other useful things - I haven't used it extensively, but I assume it is good as it is a brother of mupdf.

Posted: Thu 08 Nov 2018, 06:17
by disciple
Origami-pdf is another tool that may be worth mentioning, although it is in ruby, and I'm not sure if it is currently being developed anywhere:
Features
• Create PDF documents from scratch.
• Parse existing documents, modify them and recompile them.
• Explore documents at the object level, going deep into the document structure, uncompressing PDF object streams and deobfuscating names and strings.
• High-level operations, such as encryption/decryption, signature, file attachments...
• A GTK interface to quickly browse into the document contents.
There is a similar python project that seems to be dead: https://github.com/jesparza/peepdf
And a Java one that is alive: https://github.com/itext/rups/

Posted: Fri 09 Nov 2018, 05:21
by disciple
Since I'm recording my most important knowledge about pdfs in this thread:

Qpdf is the best option if you need to remove restrictions (e.g. can't print, can't edit, or can't copy text) from pdfs, which in most cases doesn't require a password. N.B. this doesn't help if you have a pdf where the text has not been encoded according to a standard character set (which was a problem using the old "cups-pdf" virtual printer in Puppy, i.e. copied text was gibberish because the characters in the pdf had all been randomly remapped).

Posted: Sun 11 Nov 2018, 23:56
by disciple
Pdfshuffler users may want to note that the version some have treated as an unofficial upstream has now forked as pdfarranger, I guess because there has been a little activity on the original upstream lately.

Re: attach files to a pdf

Posted: Thu 18 Jul 2019, 09:24
by disciple
disciple wrote:If you are interested in "pdf portfolios" i.e. pdfs with attached files, pdfdetach can extract them. But mutool can do that and also attach them in the first place.
Poppler now also has a pdfattach.

Sejda has an option to unpack attachments, and an option to create a "portfolio/collection of attachments". I'm not sure whether or not that is actually different from attaching a file with mutool or pdfattach.

Posted: Thu 18 Jul 2019, 09:51
by disciple
Some PDFs have named pages i.e. if you open them in some viewers (e.g. Adobe's) instead of displaying the logical page number they display a number in a different format (perhaps i, ii, iii... in the preface and 1, 2, 3 in the main body of the document), and/or some other text.

I spent quite some time looking for tools which can handle this. I have now discovered that these are called "page labels", and sejda has a tool to apply them to a pdf. I'm not sure if there are any free tools which preserve page labels (e.g. when splitting or merging pdfs), or can list the labels of pages in a pdf. Apparently poppler has supported page labels for a very long time, but tools like pdfunite don't seem to preserve them...

A document can contain more than one page with the same label, which I guess complicates things, and I think rather than being attached to individual pages, they are defined as a kind of metadata that says "starting from this logical page, number using this format".
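To make the "starting from this logical page, number using this format" idea concrete, here is a small sketch that expands such ranges into one label per page. The tuple layout `(first_page, style, prefix, start)` and the style names `"decimal"`, `"roman"` and `"none"` are invented for this illustration and are not what any particular tool uses. It also assumes the first range starts at page 1 and that the ranges are sorted.

```python
def to_roman(n):
    """Lower-case roman numeral for a positive integer."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
             (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
             (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


def expand_labels(ranges, page_count):
    """Turn label ranges into a list with one label per logical page.

    ranges: sorted list of (first_page, style, prefix, start), where
    first_page is 1-based and each range runs until the next one begins.
    """
    labels = []
    for i, (first, style, prefix, start) in enumerate(ranges):
        last = ranges[i + 1][0] - 1 if i + 1 < len(ranges) else page_count
        for offset in range(last - first + 1):
            number = start + offset
            if style == "decimal":
                text = str(number)
            elif style == "roman":
                text = to_roman(number)
            else:  # "none": the label is just the prefix
                text = ""
            labels.append(prefix + text)
    return labels


# Preface numbered i, ii, iii, then the main body restarting at 1.
print(expand_labels([(1, "roman", "", 1), (4, "decimal", "", 1)], 6))
# Two ranges with the same prefix and start give duplicate labels.
print(expand_labels([(1, "decimal", "A-", 1), (3, "decimal", "A-", 1)], 4))
```

The second call shows why duplicate labels are possible: nothing stops two ranges from using the same prefix and starting number.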

What I would really like is a way to create bookmarks matching the page labels, and vice versa, and to split a document based on page labels, or the page label prefix.

Posted: Thu 18 Jul 2019, 10:22
by disciple
disciple wrote:I'm not sure if there are any free tools which preserve page labels (e.g. when splitting or merging pdfs), or can list the labels of pages in a pdf
...
What I would really like is a way to create bookmarks matching the page labels, and vice versa, and to split a document based on page labels, or the page label prefix.
Ah, knowing the right terminology helps.
You can get information about the page labels with pdftk:

Code:

# pdftk Drawing1.pdf dump_data
InfoBegin
InfoKey: ModDate
InfoValue: D:20190718221353
InfoBegin
InfoKey: CreationDate
InfoValue: D:20190718221353
InfoBegin
InfoKey: Title
InfoValue: sill name (2)
InfoBegin
InfoKey: Creator
InfoValue: AutoCAD 2019 - English 2019 (23.0s (LMS Tech))
InfoBegin
InfoKey: Producer
InfoValue: pdfplot15.hdi 15.00.152.000
NumberOfPages: 2
BookmarkBegin
BookmarkTitle: Sheets and Views
BookmarkLevel: 1
BookmarkPageNumber: 0
BookmarkBegin
BookmarkTitle: Random name
BookmarkLevel: 2
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: sill name (2)
BookmarkLevel: 2
BookmarkPageNumber: 2
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 1191 842
PageMediaDimensions: 1191 842
PageMediaBegin
PageMediaNumber: 2
PageMediaRotation: 0
PageMediaRect: 0 0 1191 842
PageMediaDimensions: 1191 842
PageLabelBegin
PageLabelNewIndex: 1
PageLabelStart: 1
PageLabelPrefix: [1] Random name
PageLabelNumStyle: NoNumber
PageLabelBegin
PageLabelNewIndex: 2
PageLabelStart: 1
PageLabelPrefix: [2] sill name (2)
PageLabelNumStyle: NoNumber
So it wouldn't be too hard to script a solution for the splitting, or listing the page labels.
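As a rough starting point for such a script, the sketch below pulls the PageLabel records out of dump_data output like the listing above. The function names are made up, and the parsing assumes only the "Key: value" layout visible in that listing; `dump_data()` just shells out to the same pdftk command and is not called here.

```python
import subprocess


def dump_data(pdf_path):
    """Run 'pdftk <file> dump_data' and return its text output."""
    result = subprocess.run(["pdftk", pdf_path, "dump_data"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def parse_page_labels(dump_text):
    """Collect each PageLabelBegin block as a dict such as {'NewIndex': '1', ...}."""
    records = []
    current = None
    for line in dump_text.splitlines():
        line = line.strip()
        if line == "PageLabelBegin":
            current = {}
            records.append(current)
        elif line.startswith("PageLabel") and current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key[len("PageLabel"):]] = value.strip()
        elif not line.startswith("PageLabel"):
            current = None  # left the PageLabel section
    return records


# The last part of the dump_data listing shown above.
sample = """PageMediaDimensions: 1191 842
PageLabelBegin
PageLabelNewIndex: 1
PageLabelStart: 1
PageLabelPrefix: [1] Random name
PageLabelNumStyle: NoNumber
PageLabelBegin
PageLabelNewIndex: 2
PageLabelStart: 1
PageLabelPrefix: [2] sill name (2)
PageLabelNumStyle: NoNumber
"""

for record in parse_page_labels(sample):
    print(record["NewIndex"], record.get("Prefix", ""))
```

Each record's NewIndex is the logical page where that label range starts, so the same list could drive a split (e.g. one output file per range).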

Posted: Thu 18 Jul 2019, 10:29
by disciple
disciple wrote:
disciple wrote:Does anybody by any chance know of a Linux program to change the default view settings in a pdf e.g. change from "continuous view" to "single page" view, or "fit width" to "100%" or "fit page"?
I haven't tried it, but I think there's a good chance Softmaker's new "Flexipdf Basic" would run in Wine. This functionality is provided, albeit in a rather strange place: File>Preferences>Loading, in the bottom section.
Ah, sejda looks like the best command line option to change settings like this.

Posted: Wed 31 Jul 2019, 12:02
by disciple
It looks like the latest alternative to pdfmod, pdfshuffler etc is pdfslicer.
It uses gtkmm3 :(
The backend is qpdf, so I imagine it will do the best job at handling the most pdfs.

Posted: Mon 05 Aug 2019, 06:30
by disciple
disciple wrote:Pdfshuffler users may want to note that the version some have treated as an unofficial upstream has now forked as pdfarranger, I guess because there has been a little activity on the original upstream lately.
Also note that the latest version of pdfarranger uses pikepdf (a python interface to libqpdf) as its backend if it is installed, rather than PyPDF2. I believe this will be better (check out the matrix on the pikepdf web page comparing it to PyPDF2), although I haven't done any testing.

The latest version also introduces undo/redo.