tesseract-ocr optical character recognition

disciple
Posts: 6984
Joined: Sun 21 May 2006, 01:46
Location: Auckland, New Zealand

#1 Post by disciple »

Tesseract is the most accurate Open Source character recognition engine, but it has no layout analysis. If you intend to scan pages with parallel columns, you should use Ocropus (here), which uses the Tesseract engine.
If not, use Tesseract on its own: you will have to get rid of all the unnecessary line breaks yourself, but the actual character recognition is better than with Ocropus.
In my tests tesseract was almost 100% accurate, except it missed a few spaces, and a few symmetrical apostrophes ' were turned into left-hand side single quotes ‘.
This is much better than the OCR engine included in Microsoft Office, and MUCH better than gocr :)
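
Both cleanup jobs mentioned above (joining the hard line breaks and turning the stray ‘ back into ') can be scripted. A rough sketch, assuming the text file is UTF-8 and that a blank line marks a paragraph break; the function names are made up for illustration and are not part of tesseract:

```shell
# Join hard-wrapped lines into one line per paragraph.
# Assumes a blank line separates paragraphs.
unwrap_paragraphs() {
    awk '
        /^[[:space:]]*$/ { if (buf != "") print buf; print ""; buf = ""; next }
        { buf = (buf == "" ? $0 : buf " " $0) }
        END { if (buf != "") print buf }
    '
}

# Turn left-hand single quotes back into plain apostrophes.
straighten_quotes() {
    sed "s/‘/'/g"
}

# Usage: unwrap_paragraphs < output_file.txt | straighten_quotes > cleaned.txt
```

It will also flatten anything that was meant to stay on separate lines (poetry, tables), so check the result by eye.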

1. Install from here (512 KB)
2. Copy everything from /local to /usr/local, then you can delete /local (I made a mistake packaging it... I'll package version 3 sometime and get it right).
3. Download your language file (English, French, German, Dutch, Spanish, Italian, Portuguese, Vietnamese or Old German) from http://code.google.com/p/tesseract-ocr/downloads/list (900 to 1400 KB)
YOU WANT THE "LANGUAGE DATA" file, not the "SOURCE TRAINING DATA" file. If you have a different language, you'd better make the data yourself :)
4. Extract the language file into /usr/local/share (so the stuff should end up in /usr/local/share/tessdata).
5. Run with e.g.

Code:

tesseract /path/scan.tif /path/output_file
or pipe it to something with a spellcheck just in case :)
It will automatically append a .txt extension to the output.
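
Since tesseract adds the .txt itself, a small wrapper can hand it the image name minus the extension so the text lands next to the scan. A sketch only, assuming tesseract is on the PATH; `ocr_base` and `ocr_file` are hypothetical helper names:

```shell
# Strip the .tif extension to get the output base name.
ocr_base() {
    printf '%s\n' "${1%.tif}"
}

# OCR one image; tesseract itself appends .txt to the base name.
ocr_file() {
    tesseract "$1" "$(ocr_base "$1")"
}

# Usage: ocr_file /path/scan.tif    (should write /path/scan.txt)
```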

It ONLY works with uncompressed and G3-compressed TIFFs. I disabled libtiff support because of a bug that they tell me will be fixed in the next version. Xnview, nconvert, ImageMagick convert, and probably other things can make these. I'm guessing Xsane does too. The Gimp can't (or at least couldn't :) )
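
For example, ImageMagick's convert can write an uncompressed TIFF if you switch compression off. A minimal sketch, assuming ImageMagick is installed; `to_plain_tif` is a made-up name:

```shell
# Convert any image ImageMagick can read into an uncompressed TIFF.
# $1 = input image, $2 = output .tif
to_plain_tif() {
    convert "$1" -compress none "$2"
}

# Usage: to_plain_tif /path/scan.png /path/scan.tif
```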

FYI I had compile problems with 2.03, so we're waiting for 2.04 :)
Last edited by disciple on Tue 12 Jan 2010, 12:54, edited 3 times in total.
Do you know a good gtkdialog program? Please post a link here

Classic Puppy quotes

ROOT FOREVER
GTK2 FOREVER

WhoDo
Posts: 4428
Joined: Wed 12 Jul 2006, 01:58
Location: Lake Macquarie NSW Australia

#2 Post by WhoDo »

disciple wrote: I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 11MB, and I'm sick of trying to find decent free file hosts).
Why not get Tom or Will to upload it at puppylinux.org, or PM caneri to host it at puppylinux.ca? Either would do.

Both locations have plenty of free space and don't charge for downloads in .pet or .pup formats. No need to bother with the adware hosts for trusted Puppy developers/compilers like yourself these days.
Actions speak louder than words ... and they usually work when words don't!
SIP:whodo@proxy01.sipphone.com; whodo@realsip.com

HairyWill
Posts: 2928
Joined: Fri 26 May 2006, 23:29
Location: Southampton, UK

#3 Post by HairyWill »

Puppylinux.org doesn't host packages at the moment. I think we did this to ensure that the site would not run out of transfer quota.
I'm sure Caneri can help.
Will
contribute: community website (http://www.puppylinux.org), screenshots (http://tinyurl.com/6c3nm6), puplets (http://tinyurl.com/6j2gbz), wiki (http://tinyurl.com/57gykn), rss (http://tinyurl.com/5dgr83)

#4 Post by disciple »

OK, we'll see about that.
Oops. That was a pretty bad typo. I meant "over 1MB" :)

Dingo
Posts: 1437
Joined: Tue 11 Dec 2007, 17:48
Location: somewhere at the end of rainbow...

#5 Post by Dingo »

disciple wrote: I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 1MB, and I'm sick of trying to find decent free file hosts).
http://www.filefront.com/
replace .co.cc with .info to get access to stuff I posted in forum
dropbox 2GB free
OpenOffice for Puppy Linux

lluamco
Posts: 208
Joined: Fri 16 Mar 2007, 09:10
Location: Banyoles, Spain

#6 Post by lluamco »

disciple wrote: I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 1MB, and I'm sick of trying to find decent free file hosts).
Hello disciple.
MU is always very kind to host large files. Please read
http://www.murga-linux.com/puppy/viewto ... 9400#99400
to know how to proceed.
Cheers,
Lluis

#7 Post by disciple »

OK I uploaded it and updated the first post.
It turned out I COULD get it under 1MB, but not OCRopus (see link), so thanks Caneri :)

#8 Post by Dingo »

Thanks. I linked both topics and mirrored them on dokupuppy:

http://puppylover.netsons.org/dokupuppy/programs:ocr

#9 Post by disciple »

BTW I was mistaken. Ocropus does not have a gui, but does have a complex set of Lua scripts :)

#10 Post by disciple »

Here are some OCR proofing aids that should be useful; probably more so if you are doing a lot of OCR:

http://gutcheck.sourceforge.net/
Gutcheck is a plain-text checking program that specializes in reporting the problems that spellcheckers don't--errors like mismatched quotes, misplaced punctuation, unintended blank lines. It is specifically tuned for checking texts for submission to Project Gutenberg, though I hope it can be useful elsewhere as well.
The common OCR error of mistaking a "b" for a "h" and vice versa used to lead to horrible things with the words "he" and "be". With the vast improvement in OCR programs in the last few years, this is not the nightmare it used to be.

jeebies detects common he/be errors by a simple lookup table. I really need to add some extra intelligence; I have a set of heuristics that I used previously, and I will probably get the time to plug them in at some point. For now, it's quick and does have some value, especially in checking older texts. It needs its lookup table, which is in the files he.jee and be.jee
Gutspell: I made a very enthusiastic start on this, but I need a big dictionary with possible parts of speech listed for every word to do the next thing with it, and I never got around to doing that.

Now, it simply lists every word that isn't in its dictionary that occurs only once. Still, as a superfast check, it does still catch some typos. It has a bad habit of obsessing on one word sometimes, and reporting lots of instances. I must fix that one day. Its dictionary is the file gutspell.dic
If someone is keen, it would be worth getting Guiguts working, which is a Perl/tk gui for these tools and aspell/ispell. In spite of what the gutcheck site implies, Guiguts is not Windows-only. The trickier part would be packaging perl/tk for Puppy.
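
The check Gutspell is described as doing (list every word that isn't in the dictionary and occurs only once) can be roughed out with standard tools if you just want a quick pass. This is not Gutspell's code, only a sketch: the word list is assumed to be one lowercase word per line, and `rare_unknown_words` is a made-up name.

```shell
# Print words that occur exactly once in the text on stdin
# and are missing from the word list given as $1.
rare_unknown_words() {
    # Split into one word per line, fold case, drop empty lines,
    # keep words seen exactly once, then drop words found in the list.
    tr -cs '[:alpha:]' '\n' |
        tr '[:upper:]' '[:lower:]' |
        sed '/^$/d' |
        sort |
        uniq -u |
        grep -vxF -f "$1"
}

# Usage: rare_unknown_words /path/wordlist.txt < /path/output_file.txt
```

It splits on apostrophes, so "don't" shows up as "don" and "t"; good enough for spotting typos, not for anything clever.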
Attachments
gutcheck.zip
(35.44 KiB) Downloaded 1177 times

unpaper - post-processing scanned and photocopied book pages

#11 Post by disciple »

This should also be useful before you do the OCR.

Unpaper is a tool for straightening pages and removing black edges, including in the middle, where you have photocopied an open book!

I haven't tested it, and it is at an early stage of development, but it certainly looks good :)

You'll need to figure out how to convert your images to and from .pnm
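
For the .pnm round trip, ImageMagick's convert picks the format from the file extension, so something like this might do. I have not run it against this unpaper package; it assumes ImageMagick is installed and uses unpaper's plain "unpaper input output" form. `clean_scan` is a made-up name:

```shell
# scan.tif -> scan.pnm -> unpaper -> scan-clean.tif (uncompressed)
clean_scan() {
    base="${1%.tif}"
    convert "$1" "$base.pnm" &&
        unpaper "$base.pnm" "$base-clean.pnm" &&
        convert "$base-clean.pnm" -compress none "$base-clean.tif"
}

# Usage: clean_scan /path/page.tif    (should write /path/page-clean.tif)
```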
Attachments
unpaper-0.3.pet
(29.67 KiB) Downloaded 1337 times

jrb
Posts: 1536
Joined: Tue 11 Dec 2007, 19:56
Location: Smithers, BC, Canada

#12 Post by jrb »

I have built ch-tesseract-2.01-OCR-en.sfs, an English version of tesseract. Tesseract_OCR is placed on the right click menu. If you right click on a .tif file it will produce a text file with the same name in a few seconds. However, it is very fussy about these .tif files. You may have to open them in mtpaint or another graphics program and resave them. Even the training files required this. After that, however, it seems to work very well.

I have also placed a menu item on the Documents menu which opens a text file with these same instructions.

Packages for other major languages are available and can be easily built.

Let me know how it works for you. J

#13 Post by disciple »

To download that sfs use the username "puppy" and password "linux" - I had to fill it in several times for some reason (unless the last time I changed it and put a capital or something?).
jrb wrote: However it is very fussy about these .tif files
That should change in 2.04 or 3, which were both expected to be out already... so they should be out soon :)

Dromeno
Posts: 534
Joined: Fri 12 Sep 2008, 07:01

Scansoft Omnipage via wine in puppy

#14 Post by Dromeno »

OCR is one of those fields where Windows applications still outshine the Linux ones. But fortunately for us Puppy users, Scansoft Omnipage (my favorite) works via wine. It even works as a 'portable' application: just copy the Omnipage files from C:\Program Files to some external device.

#15 Post by disciple »

So I gather that is better because it deals with layout? Tesseract is noticeably more accurate than most of the Windows products I've tried (some products are as accurate, and I suspect that one would be); where it is lacking is layout analysis.
They say that produces perfectly formatted documents, but how editable are they really? I've never tried any software that produces output that is formatted to match the original well and is also readily editable - it tends to be like copying text from a pdf.

ndujoe1
Posts: 851
Joined: Mon 05 Dec 2005, 01:06

tesseract image requirements

#16 Post by ndujoe1 »

Since tesseract will only operate with uncompressed TIFF files, you need just a few extra steps to achieve compatibility with xsane.

Go to: click Preferences --> Setup --> Filetype

for the TIFF options

Set compression rate to 1

in the next three TIFF dialog boxes select no compression.

click OK

click Preferences again and select SAVE settings.

When scanning a file for OCR in the XSANE menu I select type: TIFF

color: gray
enter 300 for scan resolution

And save the filename with extension .tif, not .tiff.

Then when finished you invoke tesseract from the command line with

tesseract filename.tif outputname
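
If some scans were already saved as .tiff, they can be renamed in one go before running tesseract. A small sketch; `fix_tiff_ext` is a made-up name and it only touches files directly inside the directory you give it:

```shell
# Rename every *.tiff in a directory to *.tif.
fix_tiff_ext() {
    for f in "$1"/*.tiff; do
        # Skip the unexpanded pattern when nothing matched.
        [ -e "$f" ] || continue
        mv -- "$f" "${f%.tiff}.tif"
    done
}

# Usage: fix_tiff_ext /path/to/scans
```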

#17 Post by disciple »

Come on people, why did no one report before now that the package was broken? :oops:
Or did it work in older versions of Puppy? Maybe petget was different...

tesseract

#18 Post by ndujoe1 »

It is not broken. I forgot to post that you need to move the tesseract location from /local/tesseract to /usr/local/tesseract. Then you will be able to reference it from the command line. It works well on my machine.
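
For anyone who would rather merge the whole misplaced /local tree into /usr/local in one step, a copy like this should do it. A sketch only, to be run as root; `merge_tree` is a made-up name, and it leaves the original in place so you can delete it yourself once everything works:

```shell
# Copy the contents of one directory tree into another,
# keeping permissions and any files already in the target.
# $1 = source tree, $2 = existing target directory
merge_tree() {
    cp -a "$1"/. "$2"/
}

# Usage: merge_tree /local /usr/local
```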

#19 Post by disciple »

Yes, I know the build isn't broken, and neither are your instructions... but my package is.
I obviously packaged it wrong... unless my package somehow got replaced by a different, broken one.

zygo
Posts: 243
Joined: Sat 08 Apr 2006, 20:15
Location: UK

#20 Post by zygo »

I'm using Puppy 431. I read only the first post in this thread and got it working, after a fashion: the command simply returned the dots per pixel and size of the image file. A 1-byte file was made containing a newline character. No error on the command line. Not even in /var/log/messages. Check for dependencies from the menu lists none.

Now I see ndujoe1 says it needs xsane. Which xsane pet from the official Puppy 4 repo should I use and does that need sane?
