OCRopus 0.2 optical character recognition + layout analysis

disciple
Posts: 6984
Joined: Sun 21 May 2006, 01:46
Location: Auckland, New Zealand

#1 Post by disciple »

This is an OCR program that uses the Tesseract engine but adds layout analysis. In my tests, "layout analysis" seems to mean only that it recognizes columns of text, reads each of them and then strings them all together, including headers, footers, etc. It is noticeably less accurate than Tesseract on its own, and probably only a little better than the OCR engine in Microsoft Office. That is still much better than gocr :)
I recommend using Tesseract itself (here) unless you intend to scan pages with parallel columns.
It produces an HTML file whose text copies and pastes nicely from a browser (probably not Dillo, as I don't think it supports UTF-8) into a word processor.

1. Install from here (1121 KB).
2. Download your language file (English, French, German, Dutch, Spanish, Italian, Portuguese, Vietnamese or Old German) from http://code.google.com/p/tesseract-ocr/downloads/list (900 to 1400 KB).
YOU WANT THE "LANGUAGE DATA" file, not the "SOURCE TRAINING DATA" file. If you need a different language, you'll have to make the data yourself ;)
3. Extract the language file into /usr/local/share (so the files end up in /usr/local/share/tessdata).
4. Run with e.g.
ocroscript rec-tess /path/some_scan.png > /other_path/scan.html
N.B. This doesn't work with TIFFs, as I disabled libtiff support in Tesseract because of a bug that they tell me will be fixed in the next version. Convert them to another format first.
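
The run step and the TIFF caveat above could be wrapped up roughly as follows. This is only a sketch under assumptions: the helper names (html_name_for, is_tiff, run, ocr_page) are made up for illustration, and the TIFF-to-PNG conversion assumes ImageMagick's convert is installed, which is not part of this package. With DRY_RUN=1 the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch only: wraps the "ocroscript rec-tess" call from step 4.
# All function names here are hypothetical helpers, not part of OCRopus.

# Print the HTML output name for a scan, e.g. /path/page.png -> page.html
html_name_for() {
    base=$(basename "$1")
    printf '%s.html\n' "${base%.*}"
}

# Succeed if the file name looks like a TIFF (unsupported by this build).
is_tiff() {
    case "$1" in
        *.tif|*.tiff|*.TIF|*.TIFF) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# ocr_page SCAN OUTDIR: convert a TIFF to PNG first, then run OCRopus.
ocr_page() {
    scan=$1
    outdir=$2
    if is_tiff "$scan"; then
        png="$outdir/$(basename "${scan%.*}").png"
        # Assumption: ImageMagick's convert does the format conversion.
        run convert "$scan" "$png" || return 1
        scan=$png
    fi
    out="$outdir/$(html_name_for "$scan")"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'ocroscript rec-tess %s > %s\n' "$scan" "$out"
    else
        ocroscript rec-tess "$scan" > "$out"
    fi
}
```

For example, `DRY_RUN=1 ocr_page /scans/page.tif /tmp` shows the conversion and OCR commands without touching any files.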

OCRopus can also be compiled against a language modelling program and a program for making vector images of diagrams in a scan. I didn't look hard, but there doesn't seem to be a ready-to-go way to use these (or aspell, which I think I did compile against), so I didn't bother.

There were also two HUGE files produced by the install that I didn't include for the same reason: a US dictionary and a file for neural network modelling.

BTW, unlike Tesseract, I think OCRopus converts images to black and white, so there is no advantage in using colour images.
Do you know a good gtkdialog program? Please post a link here

Classic Puppy quotes

ROOT FOREVER
GTK2 FOREVER

Extra OCR related tools

#2 Post by disciple »

Also check out the extra tools I posted in the Tesseract thread.
http://www.murga-linux.com/puppy/viewto ... 332#279332
and the following post.

miriam
Posts: 373
Joined: Wed 06 Dec 2006, 23:46
Location: Queensland, Australia

#3 Post by miriam »

A newer version of OCRopus is out that no longer uses tesseract -- it uses its own character recognition software instead. I haven't tried it yet.

The new tesseract 3.00 detects columns. I've compiled it, but haven't learned how to make pets yet. It feels slower than tesseract 2, probably because it is looking for columns and perhaps other things, though I'm running it on a much, much, much slower computer since my nice fast computer died (wah!).
A life! Cool! Where can I download one of those from?

#4 Post by disciple »

miriam wrote:A newer version of OCRopus is out that no longer uses tesseract -- it uses its own character recognition software instead. I haven't tried it yet.
Where is it out? You haven't built from trunk?
miriam wrote:The new tesseract 3.00 detects columns. I've compiled it, but haven't learned how to make pets yet.
It is easy. Instead of running `make install` run `new2dir make install`.
miriam wrote:It feels slower than tesseract 2, probably because it is looking for columns and perhaps other things, though I'm running it on a much, much, much slower computer since my nice fast computer died (wah!).
I thought they said it was actually faster, but I could be wrong.
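
For anyone following along, the packaging step might look something like this. It is a rough sketch that assumes a conventional ./configure and make source tree, which nobody in this thread has confirmed for tesseract 3.00; build_pet_tree and run are made-up helper names. With DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# Sketch only: build from source and capture the install with new2dir.
# build_pet_tree and run are hypothetical helpers; the ./configure and
# make steps are an assumption about the source tree.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# build_pet_tree SRCDIR
# The body runs in a subshell so the caller's working directory is unchanged.
build_pet_tree() (
    cd "$1" || exit 1
    run ./configure || exit 1
    run make || exit 1
    # Instead of plain "make install": let new2dir capture the installed files.
    run new2dir make install
)
```

As far as I know, the directory that new2dir leaves behind is what you then turn into a .pet.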

The other OCR project worth trying is Cuneiform. The last version before they open-sourced it (free, but Windows-only and with a Russian interface) did an exceptional job, recognising layout, formatting and tables. Unfortunately they haven't managed to open-source the table recognition yet :(
If you do want to use the old Russian version, its interface is laid out identically to the current Windows release (which is available in English). I've taken screenshots of the English release to find my way around the Russian version.

There are a few GUIs around for the Linux version.
http://symmetrica.net/cuneiform-linux/yagf-en.html (Qt + aspell; looks good)
http://en.altlinux.org/Cuneiform-Qt (Qt)
http://code.google.com/p/cuneiform-gui/ (Java)
http://code.google.com/p/simplegui4cuneiform/ (zenity dialogs)
http://wiki.ubuntuusers.de/Cuneiform-Li ... g-in-XSane (script to integrate with XSane and ImageMagick)
EDIT: also https://code.google.com/p/ocrfeeder/ (Py/GTK), which I mentioned in the Tesseract thread and which can use Cuneiform.
Last edited by disciple on Sun 13 Feb 2011, 06:04, edited 1 time in total.

rcrsn51
Posts: 13096
Joined: Tue 05 Sep 2006, 13:50
Location: Stratford, Ontario

#5 Post by rcrsn51 »

Tesseract 3 is twice as big as Tesseract 2 when packaged as a PET and is somewhat slower. But it does a good job of detecting columns. And it's compatible with Peasyscan.

Sit Heel Speak
Posts: 2595
Joined: Fri 31 Mar 2006, 03:22
Location: downwind

#6 Post by Sit Heel Speak »

In case anyone does wish to attempt building OCRopus from trunk, I have just posted a .pet of mercurial, here.

#7 Post by miriam »

Wow, thanks disciple. Now that I have the name of the new2dir command I looked it up and have learned lots about making pets, including the really essential part that I didn't know about dir2pet. I've been compiling programs for my various machines for many years. In future I'll make pets for Puppy and share them around. Yay!

I haven't actually tried OCRopus yet; I simply read the details on their Google Code page http://code.google.com/p/ocropus/ and on the Wikipedia page http://en.wikipedia.org/wiki/OCRopus
However, I expect I will try it in the near future. I've long felt that neural networks are the only sensible way to get reliable OCR. I'm especially interested that OCRopus can have the code to read handwriting enabled (it is disabled by default).

It is a bit hard to tell how tesseract 3 compares with the older tesseract 2 because my current machine (until I get a newer one) is soooo slow.

Thanks for the info about Cuneiform, but I'm very reluctant to put any effort into getting Wine working on my machine after having finally rid myself of the last traces of pesky M$ stuff. :)

#8 Post by disciple »

Please note: the only reason anyone would want to run the old Windows version of cuneiform is for table recognition. I believe the Linux version should be just as capable apart from that.

#9 Post by miriam »

Oops, sorry. My bad. My eyes glazed over at the mention of Windows and I am a little embarrassed to admit I didn't read the part that followed properly.

Much more interesting than I thought. Thank you.
Downloading now... Yeow! 25MB! Big.
But it does sound like a very cool program.
http://www.cuneiform.ru/eng/
http://en.wikipedia.org/wiki/CuneiForm_(software)

Weird... the text of this post disappeared, but is here when I edit it... I deleted it all and added stuff back in line by line.
Huh. It is the Wikipedia address. I can't make it a link -- the parentheses probably confuse the bulletin board software.

greengeek
Posts: 5789
Joined: Tue 20 Jul 2010, 09:34
Location: Republic of Novo Zelande

#10 Post by greengeek »

Hi, does anyone have an OCR package that may work on Slacko 5.6, please? I will be happy with anything that works, no matter how restrictive (i.e. I don't need table recognition or anything fancy, just the ability to recognise/analyze an image of a few words in a single font).
cheers!

#11 Post by rcrsn51 »

Peasyscan has supported Tesseract OCR since 2010.

#12 Post by greengeek »

Many thanks rcrsn51!! I already have peasyscan grafted into my Slacko 5.6 derivative, so I was able to grab Tesseract and pic2txt from your link here, and now I have OCR. Fantastic!
Thanks so much.
(ps: I am getting really good results if I use mtPaint to scale the image up before feeding the file to pic2txt. Upscaling seems to improve recognition accuracy greatly. I also played with gamma, brightness and the "sharpen" function, but none of those helped as much as simply making the image bigger, which particularly seemed to help with recognition of spaces between words.)
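
The upscaling tip can also be tried from the command line. Here is a rough sketch, assuming ImageMagick's convert in place of mtPaint and calling tesseract directly rather than through pic2txt (and assuming a Tesseract build that reads the image format used). The names upscaled_name, upscale_and_ocr and run, and the 200% default, are made up for illustration; the best factor will depend on the image. With DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# Sketch only: enlarge an image, then OCR the enlarged copy.
# Function names and the 200% default are hypothetical choices.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# upscaled_name FILE PERCENT: words.png 150 -> words-150pct.png
upscaled_name() {
    printf '%s-%spct.%s\n' "${1%.*}" "$2" "${1##*.}"
}

# upscale_and_ocr IMAGE [PERCENT]
upscale_and_ocr() {
    image=$1
    percent=${2:-200}
    big=$(upscaled_name "$image" "$percent")
    # ImageMagick accepts a percentage as the -resize geometry.
    run convert "$image" -resize "${percent}%" "$big" || return 1
    # tesseract takes an output base name and adds .txt itself.
    run tesseract "$big" "${big%.*}"
}
```

For example, `upscale_and_ocr words.png 150` would write words-150pct.png and then words-150pct.txt.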

Pelo

Puppyocr works with Slacko 5.5, sure

#13 Post by Pelo »

greengeek, Puppyocr works with Slacko 5.5, for sure.
I have converted a dozen documents to text with it, so Puppyocr should work in Slacko 5.6.
About scaling documents: on the contrary, when I enlarged them hoping to do better, Puppyocr recognised them less well. Strange.
I use Puppyocr on judgements from the French Revolution, typed up when the first typewriters came into use.

Re: Puppyocr works with Slacko 5.5, sure

#14 Post by rcrsn51 »

Pelo wrote:About scaling documents: on the contrary, when I enlarged them hoping to do better, Puppyocr recognised them less well.
I tried this with the latest pic2txt and Tesseract 3. It worked, but only a small up-scaling was needed, like 110%.

It would depend a lot on the quality of the image.

Pelo

In English, Puppyocr has an easier job

#15 Post by Pelo »

The main weakness of Puppyocr is accents and punctuation. For sure, Puppyocr has an easier job in English. Test sentence: Où vais-je aller à la pêche ? (French for "Where shall I go fishing?")
See the remarks of Oui, our dear Franco-German colleague; he wants a 64-bit OCR :) OCR requested by Oui.
Attachments
puppyocr.jpg
Not so bad.
