All times are UTC - 4 |
disciple
Joined: 20 May 2006 Posts: 6178 Location: Auckland, New Zealand
|
Posted: Thu 25 Sep 2008, 08:00 Post subject:
OCRopus 0.2 optical character recognition + layout analysis Subject description: Uses tesseract engine. Much better than gocr |
|
This is an OCR program that uses the tesseract engine but adds layout analysis. In my tests, "layout analysis" seems to mean only that it recognizes columns of text, reads each of them, and then strings them all together, including headers, footers etc. It is noticeably less accurate than Tesseract on its own, and probably only a little better than the OCR engine in Microsoft Office. That is still much better than gocr.
I recommend using Tesseract itself (here) unless you intend to scan pages with parallel columns.
It produces an HTML file that copies and pastes nicely from a browser (I think not Dillo, as it hasn't got UTF support) to a word processor.
1. Install from here (1121kb).
2. Download your language file (English, French, German, Dutch, Spanish, Italian, Portuguese, Vietnamese or Old German) from http://code.google.com/p/tesseract-ocr/downloads/list (900 to 1400 kb)
YOU WANT THE "LANGUAGE DATA" file, not the "SOURCE TRAINING DATA" file. If your language isn't listed, you will have to make the language data yourself.
3. Extract the language file into /usr/local/share (so that the files end up in /usr/local/share/tessdata).
4. Run with e.g.
`ocroscript rec-tess /path/some_scan.png > /other_path/scan.html`
N.B. This build doesn't work with TIFFs, because I disabled libtiff support in tesseract to avoid a bug that they tell me will be fixed in the next version. Convert TIFFs to another format first.
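Steps 3 and 4, together with the TIFF caveat, can be sketched as a small shell helper. This is only a sketch under a few assumptions: ImageMagick's `convert` is installed for the TIFF-to-PNG step, and the helper names `run`, `stem` and `ocr_scan` plus the `DRY_RUN` switch are made up for illustration. The tessdata path is the one from step 3.

```shell
#!/bin/sh
# Sketch: OCR one scan with ocroscript, converting TIFFs to PNG first.
# Assumptions: ImageMagick's `convert` is available; the helper names and
# the DRY_RUN switch are illustrative and not part of OCRopus.

TESSDATA=/usr/local/share/tessdata   # where step 3 puts the language data

# Print "+ command" when DRY_RUN=1, otherwise execute the command.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# File name without directory and extension: /scans/page1.tif -> page1
stem() {
    base=${1##*/}
    echo "${base%.*}"
}

# ocr_scan INPUT OUTDIR
ocr_scan() {
    src=$1
    outdir=$2
    name=$(stem "$src")

    # Refuse to run for real if the language data is missing.
    if [ "${DRY_RUN:-0}" != 1 ] && [ ! -d "$TESSDATA" ]; then
        echo "no language data in $TESSDATA (see step 3)" >&2
        return 1
    fi

    case $src in
        *.tif|*.tiff|*.TIF|*.TIFF)
            # This build cannot read TIFFs, so convert to PNG first.
            run convert "$src" "$outdir/$name.png" || return 1
            src=$outdir/$name.png
            ;;
    esac

    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ ocroscript rec-tess $src > $outdir/$name.html"
    else
        ocroscript rec-tess "$src" > "$outdir/$name.html"
    fi
}
```

Running `DRY_RUN=1 ocr_scan /path/some_scan.tif /other_path` only prints the commands, which is a cheap way to check the paths before doing the real run.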
Ocropus can also be compiled against a language modelling program and a program for making vector images of diagrams in a scan. I didn't look hard, but there doesn't seem to be a ready-to-go way to use these (or aspell, which I think I did compile against), so I didn't bother.
There were also two HUGE files produced by the install that I didn't include for the same reason - a US dictionary and a file for neural network modelling.
BTW unlike tesseract, I think OCRopus converts images to black-and-white, so there is no advantage in using colour images.
_________________ DEATH TO SPREADSHEETS
- - -
Classic Puppy quotes
- - -
Beware the demented serfers!
disciple
Joined: 20 May 2006 Posts: 6178 Location: Auckland, New Zealand
|
Posted: Sat 11 Apr 2009, 08:19 Post subject:
Extra OCR related tools |
|
Also check out the extra tools I posted in the Tesseract thread.
http://www.murga-linux.com/puppy/viewtopic.php?p=279332#279332
and the following post.
miriam

Joined: 06 Dec 2006 Posts: 255 Location: Queensland, Australia
|
Posted: Sat 12 Feb 2011, 19:13 Post subject:
|
|
A newer version of OCRopus is out that no longer uses tesseract -- it uses its own character recognition software instead. I haven't tried it yet.
The new tesseract 3.00 detects columns. I've compiled it, but haven't learned how to make pets yet. It feels slower than tesseract 2, probably because it is looking for columns and perhaps other things, though I'm running it on a much, much, much slower computer since my nice fast computer died (wah!).
_________________ A life! Cool! Where can I download one of those from?
disciple
Joined: 20 May 2006 Posts: 6178 Location: Auckland, New Zealand
|
Posted: Sat 12 Feb 2011, 19:51 Post subject:
|
|
miriam wrote: A newer version of OCRopus is out that no longer uses tesseract -- it uses its own character recognition software instead. I haven't tried it yet.
Where is it out? You haven't built from trunk?
Quote: The new tesseract 3.00 detects columns. I've compiled it, but haven't learned how to make pets yet.
It is easy. Instead of running `make install` run `new2dir make install`.
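In full, the build-and-package sequence might look like the sketch below. It assumes an autotools-style source tree (`./configure`, `make`) and a `/usr` prefix; the `build_pet` and `run` helpers, the `DRY_RUN` switch and the package directory name are made up for illustration. `new2dir` asks a few questions interactively and names the package directory itself, so check what it reports before passing that directory to `dir2pet` (the second packaging command, which comes up later in this thread).

```shell
#!/bin/sh
# Sketch: build from source and package the result as a PET on Puppy.
# new2dir and dir2pet are Puppy's packaging scripts. The helper names,
# the DRY_RUN switch and the package directory are hypothetical.

# Print "+ command" when DRY_RUN=1, otherwise execute the command.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# build_pet PKGDIR   (run from inside the unpacked source tree)
# PKGDIR is the directory that new2dir creates; its name here is a guess.
build_pet() {
    pkgdir=$1
    run ./configure --prefix=/usr || return 1
    run make || return 1
    # new2dir wraps the install step, replacing a plain `make install`.
    run new2dir make install || return 1
    # Turn the directory produced by new2dir into a .pet file.
    run dir2pet "$pkgdir"
}
```

A dry run such as `DRY_RUN=1 build_pet ../tesseract-3.00-i486` (directory name invented) just lists the four commands in order.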
Quote: It feels slower than tesseract 2, probably because it is looking for columns and perhaps other things, though I'm running it on a much, much, much slower computer since my nice fast computer died (wah!).
I thought they said it was actually faster, but I could be wrong.
The other OCR project worth trying is cuneiform. The last version before they open-sourced it (free, but Windows only, and with a Russian interface) did an exceptional job, recognising layout, formatting and tables. Unfortunately they haven't managed to open-source the table recognition yet.
If you do want to use the old Russian version, the interface is actually identical to the current Windows release (which is available in English). I've taken screenshots of this to find my way around the Russian version.
There are a few GUIs around for the Linux version.
http://symmetrica.net/cuneiform-linux/yagf-en.html (Qt + aspell; looks good)
http://en.altlinux.org/Cuneiform-Qt (Qt)
http://code.google.com/p/cuneiform-gui/ (Java)
http://code.google.com/p/simplegui4cuneiform/ (zenity dialogs)
http://wiki.ubuntuusers.de/Cuneiform-Linux?highlight=cuneiform#Einbindung-in-XSane (script to integrate with Xsane and Imagemagick)
EDIT: also https://code.google.com/p/ocrfeeder/ (Py/GTK), which I mentioned in the tesseract thread, can use cuneiform.
Last edited by disciple on Sun 13 Feb 2011, 02:04; edited 1 time in total
rcrsn51

Joined: 05 Sep 2006 Posts: 7736 Location: Stratford, Ontario
|
Posted: Sat 12 Feb 2011, 21:22 Post subject:
|
|
Tesseract 3 is twice as big as Tesseract 2 when packaged as a PET and is somewhat slower. But it does a good job of detecting columns. And it's compatible with Peasyscan.
Sit Heel Speak

Joined: 30 Mar 2006 Posts: 2595 Location: downwind
|
Posted: Sat 12 Feb 2011, 22:39 Post subject:
|
|
In case anyone does wish to attempt building OCRopus from trunk, I have just posted a .pet of mercurial, here.
miriam

Joined: 06 Dec 2006 Posts: 255 Location: Queensland, Australia
|
Posted: Sun 13 Feb 2011, 01:00 Post subject:
|
|
Wow, thanks disciple. Now that I have the name of the new2dir command I looked it up and have learned lots about making pets, including the really essential part that I didn't know about dir2pet. I've been compiling programs for my various machines for many years. In future I'll make pets for Puppy and share them around. Yay!
I haven't actually tried OCRopus yet. I simply read the details on their GoogleCode page (http://code.google.com/p/ocropus/) and on the Wikipedia page (http://en.wikipedia.org/wiki/OCRopus). However, I expect I will try it in the near future. I've long felt that neural networks are the only sensible way to get reliable OCR. I'm especially interested that OCRopus can have the code to read handwriting enabled (it is disabled by default).
It is a bit hard to tell how tesseract 3 compares with the older tesseract 2 because my current machine (until I get a newer one) is soooo slow.
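Since impressions of speed differ in this thread, one way to settle it on a given machine is to time both versions on the same scan. A rough sketch, assuming both builds are installed side by side and that `date +%s` is available; the `time_cmd` helper and the binary paths in the usage note are hypothetical.

```shell
#!/bin/sh
# Sketch: time an arbitrary command in whole seconds.
# The helper name is made up; `date +%s` is widely available but not
# guaranteed on every system.

# time_cmd LABEL COMMAND [ARGS...]
time_cmd() {
    label=$1
    shift
    start=$(date +%s)
    "$@" >/dev/null 2>&1      # discard the command's own output
    status=$?
    end=$(date +%s)
    echo "$label: $((end - start)) s (exit $status)"
}
```

For example, `time_cmd tess2 /opt/tesseract2/bin/tesseract scan.tif out2` followed by `time_cmd tess3 /opt/tesseract3/bin/tesseract scan.tif out3` (both paths invented) prints one line per run. Whole-second resolution is coarse, but on a slow machine that is probably enough to see a difference.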
Thanks for the info about cuneiform, but I'm very reluctant to put any effort into getting Wine working on my machine after having finally rid myself of the last traces of pesky M$ stuff.
disciple
Joined: 20 May 2006 Posts: 6178 Location: Auckland, New Zealand
|
Posted: Sun 13 Feb 2011, 01:56 Post subject:
|
|
Please note: the only reason anyone would want to run the old Windows version of cuneiform is for table recognition. I believe the Linux version should be just as capable apart from that.
miriam

Joined: 06 Dec 2006 Posts: 255 Location: Queensland, Australia
|
Posted: Sun 13 Feb 2011, 05:57 Post subject:
Subject description: oops |
|
Oops, sorry. My bad. My eyes glazed over at the mention of Windows and I am a little embarrassed to admit I didn't read the part that followed properly.
Much more interesting than I thought. Thank you.
Downloading now... Yeow! 25MB! Big.
But it does sound like a very cool program.
http://www.cuneiform.ru/eng/
http://en.wikipedia.org/wiki/CuneiForm_(software)
Weird... the text of this post disappeared, but is here when I edit it... I deleted it all and added stuff back in line by line.
Huh. It is the Wikipedia address. I can't make it a link -- the parentheses probably confuse the bulletin board software.