Puppy Linux Discussion Forum
Puppy HOME page : puppylinux.com
"THE" alternative forum : puppylinux.info
 
 Forum index » Off-Topic Area » Programming
search multiple pdf files for a string
tlchost

Joined: 05 Aug 2007
Posts: 2111
Location: Baltimore, Maryland USA

PostPosted: Thu 29 Nov 2018, 05:33    Post subject:  search multiple pdf files for a string  

Trying to assist a local museum in searching pdf files. Need a solution to search multiple pdf files for a specific string.

Users are not technical, so it would be helpful for the script to look in a specific location, e.g. /mnt/home/newsletter, and search for a string, e.g. 'Howard Hughes', in a case-insensitive manner. The user could select output to the screen or to a text file.

Thanks
Burn_IT


Joined: 12 Aug 2006
Posts: 3571
Location: Tamworth UK

PostPosted: Thu 29 Nov 2018, 09:56    Post subject:  

You might find that a lot more difficult than it seems at first.
PDF is an optimised format and you would probably need to expand every document before you can search it.

_________________
"Just think of it as leaving early to avoid the rush" - T Pratchett
mikeslr


Joined: 16 Jun 2008
Posts: 3420
Location: 500 seconds from Sol

PostPosted: Thu 29 Nov 2018, 10:37    Post subject:  

or maybe not: https://www.online-tech-tips.com/computer-tips/how-to-search-for-text-inside-multiple-pdf-files-at-once/

Of course, Burn_IT is correct if the goal is to write your own program, especially as this was posted to the Programming section. But I'm still working on my first cup of coffee (still more right-brain gestalt than left-brain analytical), so the part of the problem that stood out to me was "Users are not technical". PDFs have been around for ages, so somebody must have thought about batch-searching them before. I Googled expecting to find a technical discussion; the link above was the first result.

Why re-invent the wheel? I use FoxitReader all the time. Didn't know it had a batch search capability.
fabrice_035


Joined: 28 Apr 2014
Posts: 642
Location: Bretagne / France

PostPosted: Thu 29 Nov 2018, 10:54    Post subject:  

Hello,

You can try https://www.howtogeek.com/228531/how-to-convert-a-pdf-file-to-editable-text-using-the-command-line-in-linux/
(How to Convert a PDF File to Editable Text Using the Command Line in Linux)

https://poppler.freedesktop.org/

Or install poppler-utils with PPM (Puppy Package Manager).
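
A minimal sketch of the pdftotext approach from the links above, assuming poppler-utils is installed. It extracts the text of one PDF to standard output and checks it for a string, ignoring case. The helper name `pdf_contains` is made up for this illustration and is not part of Poppler.

```shell
#!/bin/sh
# Sketch only: pdf_contains is a hypothetical helper, not a Poppler command.
# Returns success (0) when the text of a PDF contains the given string.
pdf_contains() {
    needle="$1"
    file="$2"
    # "-" tells pdftotext to write the extracted text to standard output.
    # grep: -q quiet, -i ignore case, -F treat the string literally.
    pdftotext "$file" - 2>/dev/null | grep -qiF -- "$needle"
}

# Example (file name is a placeholder):
#   pdf_contains "Howard Hughes" /mnt/home/newsletter/issue1.pdf && echo "found"
```

Because the text goes straight through a pipe, no temporary file is needed. This only works for PDFs that actually contain a text layer.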


_________________
xenialpup 7.5 / Linux Kernel: 4.4.95 / Window Manager: JWM v2.3.7
technosaurus


Joined: 18 May 2008
Posts: 4870
Location: Blue Springs, MO

PostPosted: Thu 29 Nov 2018, 16:10    Post subject:  

MuPDF has most of what you need in its tools, along with a JavaScript interface. The source is pretty easy to follow if you have done any C programming, so you could implement your own tools for specific purposes using the existing ones as a template.
_________________
Check out my github repositories. I may eventually get around to updating my blogspot.
fabrice_035


Joined: 28 Apr 2014
Posts: 642
Location: Bretagne / France

PostPosted: Thu 29 Nov 2018, 17:11    Post subject:  

To complete my answer, here is a (simple) program:

Code:

#!/bin/bash
#
# Search for text in PDF files using the pdftotext tool.
# -> pdftotext is required (part of Poppler: https://poppler.freedesktop.org/)
# -> In Puppy, install poppler-utils with PPM (Puppy Package Manager)
#
# Usage: <this script> "search text" [folder]

# Temporary file that receives the extracted text of each PDF
temp=$(mktemp /tmp/pdfsearch.XXXXXX) || exit 1

# Remove the temporary file and quit
sortir() {
    rm -f "$temp"
    echo -e "\n Bye."
    exit
}

# Also clean up when the user presses Ctrl+C
trap sortir INT

# Check that pdftotext is installed somewhere in PATH
if ! command -v pdftotext >/dev/null 2>&1 ; then
    echo "pdftotext not found. End."
    sortir
fi

path="$2"

if [ -z "$path" ] ; then
    echo "1) Using default path $PWD; you can also specify a folder."
    path="$PWD"
else
    echo "Search path: $path"
fi

if [ ! -d "$path" ] ; then
    echo "Directory not found!"
    sortir
fi

if [ -z "$1" ] ; then
    echo "2) This tool searches for text in PDF files: enter a search string!"
    sortir
else
    echo "Searching for \"$1\" in all .pdf files"
fi

# The file list is read on descriptor 3 so that 'read x' below
# still reads the user's answer from the keyboard.
while IFS= read -r file <&3
do
    echo ">Look in: $file"

    # Skip files that pdftotext cannot convert
    pdftotext "$file" "$temp" 2>/dev/null || continue

    # -q: quiet, -i: case-insensitive, -F: treat the search text literally
    if grep -qiF -- "$1" "$temp" ; then
        echo -e "- Found \"$1\" in $file \nPress [Enter] to continue, [o]pen pdf or e[x]it"
        read -r x
        if [ "$x" = "o" ] ; then
            defaultpdfviewer "$file" &
        fi
        if [ "$x" = "x" ] ; then
            sortir
        fi
    fi
done 3< <(find "$path" -iname '*.pdf')

sortir


_________________
xenialpup 7.5 / Linux Kernel: 4.4.95 / Window Manager: JWM v2.3.7
rufwoof


Joined: 24 Feb 2014
Posts: 3574

PostPosted: Thu 29 Nov 2018, 19:04    Post subject:  

fabrice_035, does that include PDFs where the text content is stored as images? (I suspect not, as I don't see any OCR-type links or code.)
_________________
( ͡° ͜ʖ ͡°) :wq
Fatdog multi-session usb

echo url|sed -e 's/^/(c/' -e 's/$/ hashbang.sh)/'|sh
Burn_IT


Joined: 12 Aug 2006
Posts: 3571
Location: Tamworth UK

PostPosted: Thu 29 Nov 2018, 19:36    Post subject:  

You cannot compare a text string with a graphic and expect to get meaningful results.
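
One way to find out which files are affected, sketched under the assumption that pdftotext (poppler-utils) is installed: a PDF made only of scanned images usually yields no letters or digits when converted to text. The function names `has_text_layer` and `list_image_only` are invented for this example, and the check is a heuristic, not a guarantee.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers.

# Succeeds when pdftotext extracts at least one letter or digit.
has_text_layer() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Print every PDF under a folder that appears to have no text layer.
# Reads one path per line, so file names containing newlines are not handled.
list_image_only() {
    find "$1" -iname '*.pdf' | while IFS= read -r f; do
        has_text_layer "$f" || echo "$f"
    done
}

# Example (folder is a placeholder):
#   list_image_only /mnt/home/newsletter
```

Files listed this way would need an OCR step before any text search can find anything in them. What counts as a letter in `[[:alnum:]]` depends on the locale, so non-Latin text may need a different test.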
_________________
"Just think of it as leaving early to avoid the rush" - T Pratchett
tlchost

Joined: 05 Aug 2007
Posts: 2111
Location: Baltimore, Maryland USA

PostPosted: Fri 30 Nov 2018, 04:37    Post subject:  

Thanks folks....I'll give some of the suggestions a whirl and let you know which one worked for the museum.
CatDude


Joined: 03 Jan 2007
Posts: 1573
Location: UK

PostPosted: Fri 30 Nov 2018, 07:29    Post subject:  

Hi
I find this quite useful myself: https://pdfgrep.org/

slavvo67

Joined: 12 Oct 2012
Posts: 1613
Location: The other Mr. 305

PostPosted: Sat 01 Dec 2018, 12:00    Post subject:  

CatDude beat me to it. If the PDF is searchable, pdfgrep is a great solution. It should be in most repositories. It comes in RU Xerus by default because I find it so handy.
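
A minimal sketch of how pdfgrep could cover the original request (fixed folder, case-insensitive, output to screen or to a text file), assuming pdfgrep is installed. The wrapper name `search_newsletters` is hypothetical; the options used are pdfgrep's `-i` (ignore case), `-r` (recurse into the folder) and `-n` (print the page number).

```shell
#!/bin/sh
# Sketch only: search_newsletters is a hypothetical wrapper around pdfgrep.
#   $1 = search string (pdfgrep treats it as a regular expression)
#   $2 = folder to search
#   $3 = optional text file for the results; without it, results go to the screen
search_newsletters() {
    if [ -n "$3" ]; then
        pdfgrep -i -r -n -- "$1" "$2" > "$3"
    else
        pdfgrep -i -r -n -- "$1" "$2"
    fi
}

# Examples (paths are placeholders):
#   search_newsletters "Howard Hughes" /mnt/home/newsletter
#   search_newsletters "Howard Hughes" /mnt/home/newsletter /tmp/results.txt
```

Each result line shows the file, the page number and the matching text, which may be easier for non-technical users to read than a bare list of file names.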

All the best,

Slavvo67
tlchost

Joined: 05 Aug 2007
Posts: 2111
Location: Baltimore, Maryland USA

PostPosted: Sun 02 Dec 2018, 08:56    Post subject:  

The plot thickens with the search in PDF files. Once the folks at the museum used the advanced search in Acrobat, they posed the question:

If the pdf files were online, would there be a way to search those files?

I can give them all the space they need on one of my servers, but I simply don't have the skills to write a script that would allow the online search. Anyone know of such a script?

thanks

My elementary internet search skills have not turned one up.
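
Not a complete online search, but a sketch of the server-side part it would probably need, assuming pdftotext and grep are available on the server. The idea is to convert every PDF to a plain-text copy once, then search those copies, which is much faster than converting on every query. The names `build_index` and `search_index` and the folder layout are assumptions for illustration; the web page or form that calls the search is not shown.

```shell
#!/bin/sh
# Sketch only: build_index and search_index are hypothetical helpers.

# build_index SRC_DIR INDEX_DIR
# Writes one text file per PDF, named <pdf name>.txt, into INDEX_DIR.
# PDFs with the same name in different subfolders would overwrite each other.
build_index() {
    src="$1"
    idx="$2"
    mkdir -p "$idx" || return 1
    find "$src" -iname '*.pdf' | while IFS= read -r pdf; do
        name=$(basename "$pdf")
        pdftotext "$pdf" "$idx/$name.txt" 2>/dev/null
    done
}

# search_index STRING INDEX_DIR
# Prints the names of the PDFs whose text contains STRING, ignoring case.
search_index() {
    grep -rilF -- "$1" "$2" | sed -e 's/\.txt$//' -e 's|.*/||'
}

# Example (paths are placeholders):
#   build_index /mnt/home/newsletter /tmp/newsletter-index
#   search_index "Howard Hughes" /tmp/newsletter-index
```

The index would need to be rebuilt whenever PDFs are added. Any script that takes a search string from a web form should pass it to grep as a quoted argument, as above, and never build a command line out of it.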