Puppy Linux Discussion Forum
Puppy HOME page : puppylinux.com
"THE" alternative forum : puppylinux.info
 
All times are UTC - 4
 Forum index » Advanced Topics » Additional Software (PETs, n' stuff) » Documents
OCRopus 0.2 optical character recognition + layout analysis
disciple

Joined: 20 May 2006
Posts: 6378
Location: Auckland, New Zealand

PostPosted: Thu 25 Sep 2008, 08:00    Post subject:  OCRopus 0.2 optical character recognition + layout analysis
Subject description: Uses tesseract engine. Much better than gocr
 

This is an OCR program that uses the tesseract engine and adds layout analysis. In my tests, "layout analysis" just seems to mean that it recognizes columns of text, reads each of them, and then strings them all together, including headers, footers etc. It is noticeably less accurate than Tesseract on its own, and probably only a little better than the OCR engine in Microsoft Office. That is still much better than gocr :)
I recommend using Tesseract itself (here) unless you intend to scan pages with parallel columns.
It produces an HTML file that copies and pastes nicely from a browser (probably not Dillo, as I think it lacks UTF-8 support) into a word processor.

1. Install from here (1121kb).
2. Download your language file (English, French, German, Dutch, Spanish, Italian, Portuguese, Vietnamese or Old German) from http://code.google.com/p/tesseract-ocr/downloads/list (900 to 1400 kb).
YOU WANT THE "LANGUAGE DATA" file, not the "SOURCE TRAINING DATA" file. If you need a different language, you'll have to make the files yourself ;)
3. Extract the language file into /usr/local/share (the contents should end up in /usr/local/share/tessdata).
4. Run with e.g.
Quote:
ocroscript rec-tess /path/some_scan.png > /other_path/scan.html
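
Step 3 can be sketched roughly as below. This is only an illustration, not part of the package: the function name `install_lang` and the archive name in the usage line are made up, and it assumes the language file is a gzipped tarball with a top-level tessdata/ directory, which is what step 3 implies.

```shell
#!/bin/sh
# Sketch: unpack a tesseract language archive so its files land in
# <prefix>/tessdata. install_lang is a hypothetical helper name.

install_lang() {
    archive=$1
    prefix=${2:-/usr/local/share}
    mkdir -p "$prefix" || return 1
    # Assumption: gzipped tarball containing a top-level tessdata/ directory.
    tar -xzf "$archive" -C "$prefix" || return 1
    # Confirm the files ended up where step 3 says they should.
    if [ -d "$prefix/tessdata" ]; then
        ls "$prefix/tessdata"
    else
        echo "no tessdata directory under $prefix" >&2
        return 1
    fi
}
```

Usage would be something like `install_lang /path/your_language_file.tar.gz`, which lists the installed files if everything went where it should.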


N.B. It doesn't work with TIFFs, as I disabled libtiff support in tesseract because of a bug that they tell me will be fixed in the next version. Convert them to another format first.
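
A possible way to handle the TIFF limitation in a script is sketched below. It assumes ImageMagick's `convert` is installed and that the TIFFs are single-page; the function names `html_path` and `ocr_scan` and the directory layout are hypothetical, not something shipped with the package.

```shell
#!/bin/sh
# Sketch: run ocroscript on one scan, converting TIFFs to PNG first.
# Assumes ImageMagick's `convert` is available and TIFFs are single-page.

# Derive the output HTML path, e.g. /scans/page1.png -> <outdir>/page1.html
html_path() {
    scan=$1
    outdir=$2
    base=$(basename "$scan")
    printf '%s/%s.html\n' "$outdir" "${base%.*}"
}

ocr_scan() {
    scan=$1
    outdir=$2
    case "$scan" in
        *.tif|*.tiff|*.TIF|*.TIFF)
            # This tesseract build has no libtiff support, so convert first.
            png="$outdir/$(basename "${scan%.*}").png"
            convert "$scan" "$png" || return 1
            scan=$png
            ;;
    esac
    # Same invocation as in step 4, with the HTML written next to the PNG.
    ocroscript rec-tess "$scan" > "$(html_path "$scan" "$outdir")"
}
```

For a whole directory, something like `for f in /path/scans/*; do ocr_scan "$f" /other_path; done` should do.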

Ocropus can also be compiled against a language modelling program and a program for making vector images of diagrams in a scan. I didn't look hard, but there doesn't seem to be a ready-to-go way to use these (or aspell, which I think I did compile against), so I didn't bother.

There were also two HUGE files produced by the install that I didn't include for the same reason - a US dictionary and a file for neural network modelling.

BTW unlike tesseract, I think ocropus converts to black-and-white, so there is no advantage in colour images.

_________________
DEATH TO SPREADSHEETS
- - -
Classic Puppy quotes
- - -
Beware the demented serfers!
disciple

Joined: 20 May 2006
Posts: 6378
Location: Auckland, New Zealand

PostPosted: Sat 11 Apr 2009, 08:19    Post subject: Extra OCR related tools  

Also check out the extra tools I posted in the Tesseract thread.
http://www.murga-linux.com/puppy/viewtopic.php?p=279332#279332
and the following post.

miriam


Joined: 06 Dec 2006
Posts: 268
Location: Queensland, Australia

PostPosted: Sat 12 Feb 2011, 19:13    Post subject:  

A newer version of OCRopus is out that no longer uses tesseract -- it uses its own character recognition software instead. I haven't tried it yet.

The new tesseract 3.00 detects columns. I've compiled it, but haven't learned how to make pets yet. It feels slower than tesseract 2, probably because it is looking for columns and perhaps other things, though I'm running it on a much, much, much slower computer since my nice fast computer died (wah!).

_________________
A life! Cool! Where can I download one of those from?
disciple

Joined: 20 May 2006
Posts: 6378
Location: Auckland, New Zealand

PostPosted: Sat 12 Feb 2011, 19:51    Post subject:  

miriam wrote:
A newer version of OCRopus is out that no longer uses tesseract -- it uses its own character recognition software instead. I haven't tried it yet.

Where is it out? You haven't built from trunk?
Quote:
The new tesseract 3.00 detects columns. I've compiled it, but haven't learned how to make pets yet.

It is easy. Instead of running `make install` run `new2dir make install`.
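
Roughly, the whole sequence might look like the sketch below. Treat it as an outline only: the `./configure` and `make` steps are the usual source-build steps and are assumed here, the directory names are hypothetical, and the name of the package directory that new2dir creates should be taken from its own output rather than from this example. dir2pet is the companion Puppy tool that turns that directory into a .pet. Setting DRY_RUN=1 just prints the commands.

```shell
#!/bin/sh
# Sketch: build from source and package the result as a PET.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_pet() {
    srcdir=$1   # hypothetical unpacked source directory
    pkgdir=$2   # directory created by new2dir (name is an assumption)
    (
        cd "$srcdir" || exit 1
        run ./configure || exit 1
        run make || exit 1
        # new2dir records what "make install" installs into a package directory.
        run new2dir make install || exit 1
    ) || return 1
    # dir2pet packs that directory into a .pet file.
    run dir2pet "$pkgdir"
}
```

For example, `DRY_RUN=1 build_pet /path/to/source some-package-dir` shows the four commands without running anything.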
Quote:
It feels slower than tesseract 2, probably because it is looking for columns and perhaps other things, though I'm running it on a much, much, much slower computer since my nice fast computer died (wah!).

I thought they said it was actually faster, but I could be wrong.

The other OCR project worth trying is cuneiform. The last version before they open-sourced it (free, but Windows only, and with a Russian interface) did an exceptional job, recognising layout, formatting and tables. Unfortunately they haven't managed to open-source the table recognition yet :(
If you do want to use the old Russian version, the interface is actually identical to the current Windows release (which is available in English). I've taken screenshots of this to find my way around the Russian version.

There are a few GUIs around for the Linux version.
http://symmetrica.net/cuneiform-linux/yagf-en.html (Qt + aspell, looks good)
http://en.altlinux.org/Cuneiform-Qt (Qt)
http://code.google.com/p/cuneiform-gui/ (Java)
http://code.google.com/p/simplegui4cuneiform/ (zenity dialogs)
http://wiki.ubuntuusers.de/Cuneiform-Linux?highlight=cuneiform#Einbindung-in-XSane (script to integrate with Xsane and Imagemagick)
EDIT also https://code.google.com/p/ocrfeeder/ (Py/GTK) which I mentioned in the tesseract thread can use cuneiform.


Last edited by disciple on Sun 13 Feb 2011, 02:04; edited 1 time in total
rcrsn51


Joined: 05 Sep 2006
Posts: 8557
Location: Stratford, Ontario

PostPosted: Sat 12 Feb 2011, 21:22    Post subject:  

Tesseract 3 is twice as big as Tesseract 2 when packaged as a PET and is somewhat slower. But it does a good job of detecting columns. And it's compatible with Peasyscan.
Sit Heel Speak


Joined: 30 Mar 2006
Posts: 2595
Location: downwind

PostPosted: Sat 12 Feb 2011, 22:39    Post subject:  

In case anyone does wish to attempt building OCRopus from trunk, I have just posted a .pet of mercurial, here.
miriam


Joined: 06 Dec 2006
Posts: 268
Location: Queensland, Australia

PostPosted: Sun 13 Feb 2011, 01:00    Post subject:  

Wow, thanks disciple. Now that I have the name of the new2dir command I looked it up and have learned lots about making pets, including the really essential part that I didn't know about dir2pet. I've been compiling programs for my various machines for many years. In future I'll make pets for Puppy and share them around. Yay!

I haven't actually tried OCRopus yet. I simply read the details on their Google Code page http://code.google.com/p/ocropus/ and on the Wikipedia page http://en.wikipedia.org/wiki/OCRopus but I expect I will try it in the near future. I've long felt that neural networks are the only sensible way to get reliable OCR. I'm especially interested that OCRopus has code for reading handwriting that can be enabled (it is disabled by default).

It is a bit hard to tell how tesseract 3 compares with the older tesseract 2 because my current machine (until I get a newer one) is soooo slow.

Thanks for the info about cuneiform, but I'm very reluctant to put any effort into getting Wine working on my machine after having finally rid myself of the last traces of pesky M$ stuff. :)

disciple

Joined: 20 May 2006
Posts: 6378
Location: Auckland, New Zealand

PostPosted: Sun 13 Feb 2011, 01:56    Post subject:  

Please note: the only reason anyone would want to run the old Windows version of cuneiform is for table recognition. I believe the Linux version should be just as capable apart from that.
miriam


Joined: 06 Dec 2006
Posts: 268
Location: Queensland, Australia

PostPosted: Sun 13 Feb 2011, 05:57    Post subject:
Subject description: oops
 

Oops, sorry. My bad. My eyes glazed over at the mention of Windows and I am a little embarrassed to admit I didn't read the part that followed properly.

Much more interesting than I thought. Thank you.
Downloading now... Yeow! 25MB! Big.
But it does sound like a very cool program.
http://www.cuneiform.ru/eng/
http://en.wikipedia.org/wiki/CuneiForm_(software)

Weird... the text of this post disappeared, but is here when I edit it... I deleted it all and added stuff back in line by line.
Huh. It is the Wikipedia address. I can't make it a link -- the parentheses probably confuse the bulletin board software.
