How to get OCR software to work?

Using applications, configuring, problems
fixit

How to get OCR software to work?

#1 Post by fixit »

I would like scans from a book converted to text.

I installed PeasyScan and Tesseract.

Pic2Txt says Tesseract is not installed. It is.

PeasyScan says no scanner is installed, yet scangearmp scans just fine.

I don't know what to check next.

rcrsn51
Posts: 13096
Joined: Tue 05 Sep 2006, 13:50
Location: Stratford, Ontario

Re: OCR software

#2 Post by rcrsn51 »

fixit wrote: PeasyScan says no scanner is installed, yet scangearmp scans just fine.
Your Canon printer is not a SANE-compatible model, so it only works with the Canon scangearmp program.

fixit wrote: Pic2Txt says Tesseract is not installed. It is.
Go to a command line and type: tesseract

Where did you get the tesseract package?

fixit wrote: I installed PeasyScan
What Puppy version are you using? Most recent ones already have PeasyScan.
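
On the scanner side, one way to see what SANE itself can detect is `scanimage -L`, the device-listing option of the standard SANE command-line tool. The sketch below wraps it in a hypothetical helper, `count_sane_devices`. It assumes that each detected scanner is reported on a line beginning with `device`, which is the usual format but worth confirming on your own system. A count of zero would suggest that any front end depending on SANE will not find the scanner either.

```sh
#!/bin/sh
# Hypothetical helper: count the scanners that SANE can detect.
# Assumption: `scanimage -L` prints one line starting with "device" per scanner.
count_sane_devices() {
    if ! command -v scanimage >/dev/null 2>&1; then
        echo "scanimage not found on PATH" >&2
        return 127
    fi
    # grep -c prints the number of matching lines (0 if there are none).
    # "|| true" keeps a zero count from being treated as a failure.
    scanimage -L 2>/dev/null | grep -c '^device' || true
}
```

Calling `count_sane_devices` at a prompt prints a single number.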

fixit

Re: OCR software

#3 Post by fixit »

My primary printer is a Brother HL-2240.

I used the tesseract from here.

http://murga-linux.com/puppy/viewtopic.php?t=51507

Puppy 5.6.0

#4 Post by rcrsn51 »

I gave you one crucial test to perform but you failed to report the result. So there is nothing else I can do for you.

fixit

#5 Post by fixit »

# tesseract
tesseract:Error:Usage:tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]

#

Galbi
Posts: 1098
Joined: Wed 21 Sep 2011, 22:32
Location: Bs.As. - Argentina.

#6 Post by Galbi »

fixit wrote:# tesseract
tesseract:Error:Usage:tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]

#
It seems that tesseract is a command-line app.
I think you have to give tesseract the name of the file(s) you have previously scanned.

Quoted from the Tesseract ReadMe:
Running Tesseract

Tesseract is a command-line program, so first open a terminal or command prompt. The command is used like this:

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

So basic usage to do OCR on an image called 'myscan.png' and save the result to 'out.txt' would be:

tesseract myscan.png out
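
The two steps discussed so far (confirm that the shell can find tesseract, then run the basic usage from the ReadMe) can be combined into a small wrapper. This is only a sketch: `ocr_page` is a hypothetical name, and the only Tesseract behaviour it relies on is the `tesseract imagename outputbase` form quoted above.

```sh
#!/bin/sh
# Hypothetical wrapper around the basic usage quoted above:
#   tesseract imagename outputbase
ocr_page() {
    image="$1"
    outbase="$2"
    # Step 1: check that the shell can find tesseract at all.
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract is not on PATH" >&2
        return 127
    fi
    # Step 2: run the OCR; per the ReadMe, the text lands in OUTBASE.txt.
    tesseract "$image" "$outbase"
}
```

With a real scan this would be called as `ocr_page myscan.png out`, which should leave the recognised text in out.txt.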

fixit

#7 Post by fixit »

Guess a lot of folks are still mad at me.

rcrsn51 asked me for the answer to that command line.

I am trying to get a scanned image converted to printable text.

My goal is to use only Linux as an OS.

So far, Puppy has done a great job. :-)

tesseract -man yielded this.

http://code.google.com/p/tesseract-ocr/

Linux

Tesseract is available directly from many Linux distributions. The package is generally called 'tesseract' or 'tesseract-ocr' - search your distribution's repositories to find it. Packages are also generally available for language training data (search the repositories,) but if not you will need to download the appropriate training data, unpack it, and copy the .traineddata file into the 'tessdata' directory, probably /usr/share/tesseract-ocr/tessdata or /usr/share/tessdata.

If Tesseract isn't available for your distribution, or you want to use a newer version than they offer, you can compile your own. Note that older versions of Tesseract only supported processing .tiff files.

It doesn't seem to be user friendly.
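
The quoted notes name two likely locations for the language data. A minimal sketch of checking them is below; `find_traineddata` is a hypothetical helper, and the two default directories are simply the ones mentioned in the quote, so a particular Puppy package may use a different path.

```sh
#!/bin/sh
# Hypothetical helper: print the first directory that holds LANG.traineddata.
# Usage: find_traineddata LANG [DIR...]
find_traineddata() {
    lang="$1"
    shift
    # Default to the two locations named in the quoted notes.
    if [ "$#" -eq 0 ]; then
        set -- /usr/share/tesseract-ocr/tessdata /usr/share/tessdata
    fi
    for dir in "$@"; do
        if [ -f "$dir/$lang.traineddata" ]; then
            echo "$dir"
            return 0
        fi
    done
    echo "no $lang.traineddata found" >&2
    return 1
}
```

For English data, `find_traineddata eng` prints the matching directory or reports that none was found.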

#8 Post by rcrsn51 »

I installed Tesseract v3.00 from here and ran it with Pic2Txt. It worked fine.

fixit

#9 Post by fixit »

Thanks.

It's working, but the conversion is pretty unusable.

I know the difficulties involved in converting pictures to text.

#10 Post by rcrsn51 »

In my tests, you need at least 300 DPI resolution in the scanned image for OCR to work properly.
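
If a scan's resolution is unknown, it can be estimated from the image's pixel width and the physical width of the page. The sketch below assumes both numbers are already known (most image viewers report the pixel size); `effective_dpi` and `dpi_ok` are hypothetical names, and the 300 default comes from the rule of thumb above.

```sh
#!/bin/sh
# Hypothetical helper: effective DPI = pixels along one edge / inches along it.
# Usage: effective_dpi PIXELS INCHES
effective_dpi() {
    awk -v px="$1" -v inches="$2" 'BEGIN { printf "%d\n", px / inches }'
}

# Succeeds when the scan reaches a minimum DPI (third argument, default 300).
# Usage: dpi_ok PIXELS INCHES [MIN_DPI]
dpi_ok() {
    dpi=$(effective_dpi "$1" "$2")
    [ "$dpi" -ge "${3:-300}" ]
}
```

For a page 8.5 inches wide scanned at 2550 pixels across, `effective_dpi 2550 8.5` prints 300.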

Dingo
Posts: 1437
Joined: Tue 11 Dec 2007, 17:48
Location: somewhere at the end of rainbow...

#11 Post by Dingo »

Personally, I have found the free OCR features provided by Tracker Software's PDF-XChange Viewer very valuable:

http://www.tracker-software.com/product/downloads
http://www.tracker-software.com/pdf-xchange-viewer-ocr (OCR modules)

They are especially useful to me because I usually build a PDF from scans and then run it through OCR software to get a searchable PDF with a hidden text layer.

It works very well with Wine.

You first need to install PDF-XChange Viewer, then the OCR modules for the languages you want. Once they are installed, you can look in your /root/.wine folder and move the tracker folder to another location if you don't want to use up too much space in the Puppy savefile.

fixit

#12 Post by fixit »

rcrsn51 wrote:In my tests, you need at least 300 DPI resolution in the scanned image for OCR to work properly.
Thanks.

When I scanned at 400 dpi, the conversion was pretty accurate.

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#13 Post by MochiMoppel »

rcrsn51 wrote: I installed Tesseract v3.00 from here and ran it with Pic2Txt. It worked fine.
It does, but it doesn't work with the linked language files. The files at http://code.google.com/p/tesseract-ocr/downloads/list are all for version 3.02; could this be the reason? Even the new eng.traineddata (20.8M) doesn't work. The final "Text saved" message appears, but no text file is created :cry:

Any chance to get v3.00 language files anywhere?
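
A success message with no text file behind it can be confirmed independently of the front end. This is a minimal sketch, assuming the output base passed to tesseract is known and that the text is written to that base plus .txt, as in the ReadMe usage quoted earlier in the thread. `check_ocr_output` is a hypothetical name.

```sh
#!/bin/sh
# Hypothetical check: succeed only if OUTBASE.txt exists and is not empty.
# Usage: check_ocr_output OUTBASE
check_ocr_output() {
    outfile="$1.txt"
    if [ ! -e "$outfile" ]; then
        echo "missing: $outfile" >&2
        return 1
    fi
    if [ ! -s "$outfile" ]; then
        echo "empty: $outfile" >&2
        return 2
    fi
    echo "ok: $outfile"
}
```

Running `tesseract myscan.png out` followed by `check_ocr_output out` separates "no file was written" from "a file was written but contains nothing".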

#14 Post by rcrsn51 »

MochiMoppel wrote:Any chance to get v3.00 language files anywhere?
From here:
You may need to view the second page to find a 3.00 version for your language.
I have also built Tesseract v3.02 but the PET is larger and does NOT contain a language file. However, I once posted v3.01 and the download link eventually went dead from lack of use.

If anyone wants to test v3.02, they can send me a PM.

#15 Post by MochiMoppel »

Oops...thanks! :oops:
