tesseract-ocr optical character recognition

disciple · #1 Post by **disciple** » Tue 23 Sep 2008, 06:00

Tesseract is the most accurate open-source character recognition engine, but it has no layout analysis. If you intend to scan pages with parallel columns, you should use Ocropus (here), which uses the Tesseract engine.
If you use Tesseract on its own, you will have to remove all the unnecessary line breaks from its output, but the actual character recognition is better than with Ocropus.
In my tests Tesseract was almost 100% accurate, except that it missed a few spaces, and a few symmetrical apostrophes ' were turned into left-hand single quotes ‘.
This is much better than the OCR engine included in Microsoft Office, and MUCH better than gocr.
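
The line-break cleanup and the quote substitution mentioned above could be scripted along these lines. This is only a sketch: it assumes a blank line in Tesseract's output marks a paragraph boundary, the function names `unwrap_paragraphs` and `straighten_quotes` are made up for illustration, and it does not rejoin words that were hyphenated across lines.

```shell
# Join the hard line breaks inside each paragraph into single spaces.
# Assumption: blank lines separate paragraphs in the OCR output.
unwrap_paragraphs() {
    awk '
        /^[[:space:]]*$/ {
            # Paragraph boundary: flush what we have, keep the blank line.
            if (buf != "") print buf
            print ""
            buf = ""
            next
        }
        {
            if (buf == "") buf = $0
            else buf = buf " " $0
        }
        END { if (buf != "") print buf }
    ' "$@"
}

# Turn curly single quotes back into plain apostrophes.
# Two separate expressions, so it also works in a non-UTF-8 locale.
straighten_quotes() {
    sed -e "s/‘/'/g" -e "s/’/'/g" "$@"
}
```

Something like `unwrap_paragraphs output_file.txt | straighten_quotes > cleaned.txt` would then give one line per paragraph.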

1. Install from here (512 KB).
2. Copy everything from /local to /usr/local, then you can delete /local (I made a mistake packaging it... I'll package version 3 sometime and get it right).
3. Download your language file (English, French, German, Dutch, Spanish, Italian, Portuguese, Vietnamese or Old German) from http://code.google.com/p/tesseract-ocr/downloads/list (900 to 1400 KB).
YOU WANT THE "LANGUAGE DATA" file, not the "SOURCE TRAINING DATA" file. If you need a different language, you'd better make the files yourself.

4. Extract the language file into /usr/local/share (so the contents should end up in /usr/local/share/tessdata).
5. Run with e.g.

tesseract /path/scan.tif /path/output_file

or pipe it to something with a spellcheck, just in case.

It will automatically append a .txt extension to the output.
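
If you have a whole directory of scans, the same command can be wrapped in a loop. A rough sketch, assuming every file name has an extension; `output_base` and `ocr_all` are hypothetical helper names, and the only tesseract invocation used is the two-argument form shown above.

```shell
# Strip the extension to get the output base name.
# tesseract appends ".txt" itself, so /path/scan.tif -> /path/scan -> /path/scan.txt
# Assumption: the file name contains a dot; a dot in a directory name
# with no extension on the file would confuse this.
output_base() {
    printf '%s\n' "${1%.*}"
}

# Run tesseract on each image given, reporting failures but carrying on.
ocr_all() {
    for img in "$@"; do
        tesseract "$img" "$(output_base "$img")" || echo "failed: $img" >&2
    done
}
```

For example, `ocr_all /path/*.tif` should leave a .txt file next to each scan.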

It ONLY works with uncompressed and G3-compressed TIFFs, because I disabled libtiff support to avoid a bug that they tell me will be fixed in the next version. XnView, nconvert, ImageMagick's convert, and probably other tools can make these. I'm guessing XSane can too. The GIMP can't (or at least couldn't).
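
With ImageMagick, writing an uncompressed TIFF comes down to its `-compress none` option. A minimal sketch, assuming `convert` is installed; the `to_plain_tiff` name, the `-plain.tif` suffix and the `DRY_RUN` switch are my own inventions for illustration.

```shell
# Convert an image to an uncompressed TIFF with ImageMagick.
# Usage: to_plain_tiff input [output]
# With DRY_RUN=1 the command is printed instead of executed.
to_plain_tiff() {
    src=$1
    dst=${2:-${1%.*}-plain.tif}   # default name is a made-up convention
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo convert "$src" -compress none "$dst"
    else
        convert "$src" -compress none "$dst"
    fi
}
```

So `to_plain_tiff scan.png` should produce scan-plain.tif, which can then be handed to tesseract.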

FYI, I had compile problems with 2.03, so we're waiting for 2.04.

WhoDo · #2 Post by **WhoDo** » Tue 23 Sep 2008, 09:03

disciple wrote: I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 11MB, and I'm sick of trying to find decent free file hosts).

Why not get Tom or Will to upload it at puppylinux.org, or PM caneri to host it at puppylinux.ca ... either way.

Both locations have plenty of free space and don't charge for downloads in .pet or .pup formats. No need to bother with the adware hosts for trusted Puppy developers/compilers like yourself these days.

HairyWill · #3 Post by **HairyWill** » Tue 23 Sep 2008, 11:20

Puppylinux.org doesn't host packages at the moment; I think we did this to ensure that the site would not run out of transfer quota.
I'm sure Caneri can help.

disciple · #4 Post by **disciple** » Tue 23 Sep 2008, 11:37

OK, we'll see about that.
Oops. That was a pretty bad typo. I meant "over 1MB"

Dingo · #5 Post by **Dingo** » Tue 23 Sep 2008, 17:23

disciple wrote: I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 1MB, and I'm sick of trying to find decent free file hosts).

http://www.filefront.com/

lluamco · #6 Post by **lluamco** » Wed 24 Sep 2008, 07:59

disciple wrote: I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 1MB, and I'm sick of trying to find decent free file hosts).

Hello disciple.
MU is always very kind to host large files. Please read
http://www.murga-linux.com/puppy/viewto ... 9400#99400
to know how to proceed.
Cheers,
Lluis

disciple · #7 Post by **disciple** » Thu 25 Sep 2008, 12:25

OK I uploaded it and updated the first post.
It turned out I COULD get it under 1MB, but not OCRopus (see link), so thanks Caneri

Dingo · #8 Post by **Dingo** » Thu 25 Sep 2008, 12:29

Thanks. I linked both topics and mirrored them on dokupuppy:

http://puppylover.netsons.org/dokupuppy/programs:ocr

disciple · #9 Post by **disciple** » Thu 25 Sep 2008, 12:30

BTW I was mistaken. Ocropus does not have a GUI, but does have a complex set of Lua scripts.

disciple · #10 Post by **disciple** » Sat 28 Feb 2009, 23:51

Here are some OCR proofing aids that should be useful; probably more so if you are doing a lot of ocr:

http://gutcheck.sourceforge.net/

Gutcheck is a plain-text checking program that specializes in reporting the problems that spellcheckers don't--errors like mismatched quotes, misplaced punctuation, unintended blank lines. It is specifically tuned for checking texts for submission to Project Gutenberg, though I hope it can be useful elsewhere as well.

The common OCR error of mistaking a "b" for a "h" and vice versa used to lead to horrible things with the words "he" and "be". With the vast improvement in OCR programs in the last few years, this is not the nightmare it used to be.

jeebies detects common he/be errors by a simple lookup table. I really need to add some extra intelligence; I have a set of heuristics that I used previously, and I will probably get the time to plug them in at some point. For now, it's quick and does have some value, especially in checking older texts. It needs its lookup table, which is in the files he.jee and be.jee

Gutspell: I made a very enthusiastic start on this, but I need a big dictionary with possible parts of speech listed for every word to do the next thing with it, and I never got around to doing that.

Now, it simply lists every word that isn't in its dictionary that occurs only once. Still, as a superfast check, it does still catch some typos. It has a bad habit of obsessing on one word sometimes, and reporting lots of instances. I must fix that one day. Its dictionary is the file gutspell.dic

If someone is keen, it would be worth getting Guiguts working, which is a Perl/tk gui for these tools and aspell/ispell. In spite of what the gutcheck site implies, Guiguts is not Windows-only. The trickier part would be packaging perl/tk for Puppy.

disciple · #11 Post by **disciple** » Sat 11 Apr 2009, 12:13

This should also be useful before you do the OCR.

Unpaper is a tool for straightening pages and removing black edges, including in the middle, where you have photocopied an open book!

I haven't tested it, and it is at an early stage of development, but it certainly looks good.

You'll need to figure out how to convert your images to and from .pnm.
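
One possible way to do the .pnm round trip is ImageMagick's convert, which picks the format from the file extension. The following is an untested sketch, not a recipe: it assumes unpaper accepts an input file followed by an output file (check its help for your version), and the `run` and `clean_page` helpers and the `-clean` suffix are made up for illustration.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# TIFF -> PNM -> unpaper -> uncompressed TIFF, stopping at the first failure.
clean_page() {
    base=${1%.*}
    run convert "$1" "$base.pnm" &&
    run unpaper "$base.pnm" "$base-clean.pnm" &&
    run convert "$base-clean.pnm" -compress none "$base-clean.tif"
}
```

Running `DRY_RUN=1; clean_page page.tif` first shows the three commands without touching any files.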

jrb · #12 Post by **jrb** » Thu 23 Apr 2009, 21:03

I have built ch-tesseract-2.01-OCR-en.sfs, an English version of tesseract. Tesseract_OCR is placed on the right-click menu. If you right-click on a .tif file, it will produce a text file with the same name in a few seconds. However, it is very fussy about these .tif files. You may have to open them in mtpaint or another graphics program and resave them. Even the training files required this. After that, however, it seems to work very well.

I have also placed a menu item on the Documents menu which opens a text file with these same instructions.

Packages for other major languages are available and can be easily built.

Let me know how it works for you. J

disciple · #13 Post by **disciple** » Fri 24 Apr 2009, 06:42

To download that sfs use the username "puppy" and password "linux" - I had to fill it in several times for some reason (unless the last time I changed it and put a capital or something?).

jrb wrote: However it is very fussy about these .tif files

That should change in 2.04 or 3, which were both expected to be out already... so they should be out soon.

Dromeno · #14 Post by **Dromeno** » Fri 24 Apr 2009, 08:46

OCR is one of those fields where Windows applications still outshine the Linux ones. But fortunately for us Puppy users, Scansoft Omnipage (my favorite) works via Wine. It even works as a 'portable' application (just copy the Omnipage files from C:\Program Files to some external device).

disciple · #15 Post by **disciple** » Sat 25 Apr 2009, 00:13

So I gather that is better because it deals with layout? Tesseract is noticeably more accurate than any of the Windows products I've tried (some products are as accurate, and I suspect that one would be); where it is lacking is layout analysis.
They say that produces perfectly formatted documents, but how editable are they really? I've never tried any software that produces output that is formatted to match the original well and is also readily editable - it tends to be like copying text from a pdf.

ndujoe1 · #16 Post by **ndujoe1** » Wed 14 Oct 2009, 01:40

Since tesseract will only operate with uncompressed TIFF files, you need just a few extra steps to achieve compatibility with xsane.

Go to: Preferences --> Setup --> Filetype

For the TIFF options:

Set compression rate to 1.

In the next three TIFF dialog boxes select no compression.

Click OK.

Click Preferences again and select SAVE settings.

When scanning a file for OCR in the XSANE menu I select type: TIFF

color: gray
Enter 300 for the scan resolution.

And save the file with the extension .tif, not .tiff.

Then when finished you invoke tesseract from the command line with

tesseract filename.tif outputname
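
If some scans were already saved as .tiff, they can be renamed in bulk before running tesseract. A small sketch; `fix_tiff_ext` is a hypothetical helper name, and it deliberately refuses to overwrite an existing .tif.

```shell
# Rename every *.tiff argument to *.tif, leaving other files alone.
fix_tiff_ext() {
    for f in "$@"; do
        case $f in
            *.tiff)
                new=${f%.tiff}.tif
                if [ -e "$new" ]; then
                    # Never clobber an existing file.
                    echo "skipping $f: $new already exists" >&2
                    continue
                fi
                mv -- "$f" "$new"
                ;;
        esac
    done
}
```

For example, `fix_tiff_ext ~/scans/*` renames only the files that end in .tiff.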

disciple · #17 Post by **disciple** » Tue 12 Jan 2010, 12:42

Come on people, why did no one report before now that the package was broken?

Or did it work in older versions of Puppy? Maybe petget was different...

ndujoe1 · #18 Post by **ndujoe1** » Tue 12 Jan 2010, 15:04

It is not broken. I forgot to post that you need to move tesseract from /local/tesseract to /usr/local/tesseract. Then you will be able to reference it from the command line. It works well on my machine.

disciple · #19 Post by **disciple** » Wed 13 Jan 2010, 08:05

Yes, I know the build isn't broken, and neither are your instructions... but my package is.
I obviously packaged it wrong... unless my package somehow got replaced by a different, broken one.

zygo · #20 Post by **zygo** » Wed 13 Jan 2010, 15:55

I'm using Puppy 431. I read only the first post in this thread and got it working -- after a fashion -- the command simply returned the dots per pixel and size of the image file. A 1-byte file was made containing a newline character. No error on the command line. Not even in /var/log/messages. "Check for dependencies" from the menu lists none.

Now I see ndujoe1 says it needs xsane. Which xsane pet from the official Puppy 4 repo should I use and does that need sane?

(old)Puppy Linux Discussion Forum
