search multiple pdf files for a string

For discussions about programming, programming questions/advice, and projects that don't really have anything to do with Puppy.
tlchost
Posts: 2057
Joined: Sun 05 Aug 2007, 23:26
Location: Baltimore, Maryland USA

search multiple pdf files for a string

#1 Post by tlchost »

Trying to assist a local museum in searching pdf files. Need a solution to search multiple pdf files for a specific string.

Users are not technical, so it would be helpful for the script to look in a specific location, e.g. /mnt/home/newsletter, and search for a string, e.g. 'Howard Hughes', in a case-insensitive manner. The user could choose output to the screen or to a text file.

Thanks

Burn_IT
Posts: 3650
Joined: Sat 12 Aug 2006, 19:25
Location: Tamworth UK

#2 Post by Burn_IT »

You might find that a lot more difficult than it seems at first.
PDF is an optimised (usually compressed) format, so you would probably need to extract the text from every document before you can search it.
"Just think of it as leaving early to avoid the rush" - T Pratchett

mikeslr
Posts: 3890
Joined: Mon 16 Jun 2008, 21:20
Location: 500 seconds from Sol

#3 Post by mikeslr »

Or maybe not: https://www.online-tech-tips.com/comput ... s-at-once/

Of course, Burn_IT is correct if the goal is to write your own program, especially as this was posted to the Programming section. But I'm still working on my first cup of coffee (still more right-brain gestalt than left-brain analytical), so I saw the highlight of the problem as "Users are not technical". PDFs have been around for ages, so somebody must have thought about batch-searching them before. I Googled expecting to find a technical discussion; the above was the first result found.

Why re-invent the wheel? I use FoxitReader all the time. Didn't know it had a batch search capability.

fabrice_035
Posts: 765
Joined: Mon 28 Apr 2014, 17:54
Location: Bretagne / France

#4 Post by fabrice_035 »

Hello,

You can try https://www.howtogeek.com/228531/how-to ... -in-linux/
(How to Convert a PDF File to Editable Text Using the Command Line in Linux)

https://poppler.freedesktop.org/

Or install poppler-utils with PPM (Puppy Package Manager).
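
As a rough illustration of the approach in that article: once pdftotext is installed, it writes the extracted text to standard output when the output file is given as `-`, so the text can be piped straight into grep. The helper name `pdf_contains` and the example path are made up for this sketch.

```shell
#!/bin/bash
# Sketch only: pdf_contains is a hypothetical helper name, not part of Poppler.
# Returns success (0) if the PDF's text layer contains the string, ignoring case.
pdf_contains() {
    local file="$1" needle="$2"
    # "-" as the output file tells pdftotext to write to standard output;
    # grep -q stays silent, -i ignores case, -F matches the string literally.
    pdftotext "$file" - 2>/dev/null | grep -q -i -F -- "$needle"
}

# Example call (the file name is an assumption):
# pdf_contains /mnt/home/newsletter/1998-03.pdf 'Howard Hughes' && echo "found"
```

This only sees text that is actually stored as text in the PDF; scanned pages will come back empty.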

Bionicpup64-8.0 _ Kernel 5.4.27-64oz _ Asus Rog GL752

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#5 Post by technosaurus »

MuPDF has most of what you need in its tools, along with a JavaScript interface. The source is pretty easy to follow if you have done any C programming, so you could implement your own tools for specific purposes using the existing ones as a template.
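
For reference, MuPDF's command-line tool `mutool` can already dump a PDF's text without writing any C. A minimal sketch, assuming `mutool` is installed and that `mutool draw -F txt` prints plain text to standard output when no `-o` file is given; the helper name `mupdf_grep` is invented here.

```shell
#!/bin/bash
# Sketch only: mupdf_grep is a hypothetical helper, not one of MuPDF's own tools.
# Prints the lines of a PDF's extracted text that contain the string (case-insensitive).
mupdf_grep() {
    local needle="$1" file="$2"
    # -F txt selects plain-text output; with no -o option the text goes to stdout.
    mutool draw -F txt "$file" 2>/dev/null | grep -i -F -- "$needle"
}

# Example call (the file name is an assumption):
# mupdf_grep 'Howard Hughes' /mnt/home/newsletter/1998-03.pdf
```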
Check out my github repositories (https://github.com/technosaurus). I may eventually get around to updating my blogspot (http://bashismal.blogspot.com).

fabrice_035
Posts: 765
Joined: Mon 28 Apr 2014, 17:54
Location: Bretagne / France

#6 Post by fabrice_035 »

To complete my answer, here is a (simple) program:

Code:

#!/bin/bash
# 
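# -> pdftotext is required ( https://poppler.freedesktop.org/ )
# -> Install via PPM / Puppy Package Manager ( poppler-utils )
#
# Search for text in PDF files with the pdftotext tool
# Arguments: $1 = text to search for, $2 = folder to search (optional)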

trap ctrl_c INT

# mktemp creates a unique temporary file safely
temp=$(mktemp /tmp/pdfsearch.XXXXXX)

IFS=$'\n'

sortir() {

rm -f "$temp"
echo -e "\n Bye."
exit
}
export -f sortir

function ctrl_c() {
        sortir
}


# Check for the binary directly: whereis can also list man pages,
# which makes a comparison against its output fail.
if [ ! -x /usr/bin/pdftotext ] ; then

echo "/usr/bin/pdftotext not found. End."

sortir

fi

path="$2"

if [ -z "$path" ] ; then
echo "1) Using default path $PWD; you can also specify a folder as the second argument."
path="$PWD"
else
echo "Search path: $path"
fi

if [ ! -d "$path" ]; then
echo "Directory not found!"
sortir
fi

if [ -z "$1" ] ; then
echo "2) This tool searches for text in PDF files: enter a search string!"
sortir
else
echo "Search \"$1\" in all .pdf files "
fi


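# IFS is set to newline above, so file names containing spaces stay intact
files=$(find "$path" -iname '*.pdf')

for file in $files
do

echo -e ">Look in:$file"

/usr/bin/pdftotext "$file" "$temp"
# -i makes the search case-insensitive; quoting keeps multi-word text as one pattern
result=$(grep -i -- "$1" "$temp")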

if [ "$result" != "" ] ; then
echo -e "- Found \"$1\" in  $file \nPress [Enter] to continue, [o]pen pdf or e[x]it" ; read -r x
 if [ "$x" = "o" ] ; then
defaultpdfviewer "$file" &
 fi
 
  if [ "$x" = "x" ] ; then
  sortir
  fi

fi
done

sortir


rufwoof
Posts: 3690
Joined: Mon 24 Feb 2014, 17:47

#7 Post by rufwoof »

fabrice_035, does that include PDFs where the text content is stored as images? (I suspect not, as I don't see any OCR-type links/code.)
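
For scanned newsletters, one common workaround is to render each page to an image and run OCR on it before searching. A minimal sketch, assuming `pdftoppm` (from poppler-utils) and the `tesseract` OCR engine are installed; neither the helper name `ocr_pdf_text` nor the 300 dpi setting comes from this thread, and OCR output will contain recognition errors.

```shell
#!/bin/bash
# Sketch only: ocr_pdf_text is a hypothetical helper name.
# Prints the OCR'd text of every page of a scanned PDF to standard output.
ocr_pdf_text() {
    local file="$1" workdir page
    workdir=$(mktemp -d) || return 1
    # Render each page to a PNG named page-<number>.png at 300 dpi.
    pdftoppm -r 300 -png "$file" "$workdir/page"
    for page in "$workdir"/page-*.png; do
        [ -e "$page" ] || continue      # nothing was rendered
        # "stdout" as the output base makes tesseract print the recognised text.
        tesseract "$page" stdout 2>/dev/null
    done
    rm -rf "$workdir"
}

# Example call (the file name is an assumption):
# ocr_pdf_text /mnt/home/newsletter/scan.pdf | grep -i 'Howard Hughes'
```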

Burn_IT
Posts: 3650
Joined: Sat 12 Aug 2006, 19:25
Location: Tamworth UK

#8 Post by Burn_IT »

You cannot compare a text string with a graphic and expect to get meaningful results.

tlchost
Posts: 2057
Joined: Sun 05 Aug 2007, 23:26
Location: Baltimore, Maryland USA

#9 Post by tlchost »

Thanks, folks. I'll give some of the suggestions a whirl and let you know which one worked for the museum.

CatDude
Posts: 1563
Joined: Wed 03 Jan 2007, 17:49
Location: UK

#10 Post by CatDude »

Hi
I find this quite useful myself: https://pdfgrep.org/

slavvo67
Posts: 1610
Joined: Sat 13 Oct 2012, 02:07
Location: The other Mr. 305

#11 Post by slavvo67 »

CatDude beat me to it. If the PDF is searchable, pdfgrep is a great solution. It should be in most repositories. It comes in RU Xerus by default because I find it so handy.
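
To tie this back to the original request, here is a minimal sketch of a wrapper around pdfgrep that searches one folder case-insensitively and sends the matches either to the screen or to a text file. The function name `search_pdfs` and its argument order are invented for illustration; the flags used are `-r` (recurse into the folder), `-i` (ignore case), and `-n` (show page numbers).

```shell
#!/bin/bash
# Sketch only: search_pdfs is a hypothetical wrapper, not part of pdfgrep.
# Usage: search_pdfs "text" folder [output-file]
search_pdfs() {
    local needle="$1" dir="$2" outfile="${3:-}"
    if [ -z "$needle" ] || [ ! -d "$dir" ]; then
        echo 'Usage: search_pdfs "text" folder [output-file]' >&2
        return 2
    fi
    if [ -n "$outfile" ]; then
        # Save the matches to a text file instead of printing them.
        pdfgrep -r -i -n -- "$needle" "$dir" > "$outfile"
    else
        pdfgrep -r -i -n -- "$needle" "$dir"
    fi
}

# Example calls (folder from the first post; the output path is an assumption):
# search_pdfs 'Howard Hughes' /mnt/home/newsletter
# search_pdfs 'Howard Hughes' /mnt/home/newsletter /tmp/results.txt
```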

All the best,

Slavvo67

tlchost
Posts: 2057
Joined: Sun 05 Aug 2007, 23:26
Location: Baltimore, Maryland USA

#12 Post by tlchost »

The plot thickens with the search in PDF files. Once the folks at the museum used the advanced search in Acrobat, they posed the question:

If the pdf files were online, would there be a way to search those files?

I can give them all the space they need on one of my servers, but I simply don't have the skills to write a script that would allow the online search. Anyone know of such a script?

Thanks

My elementary internet search skills have not turned one up.
