Puppy Linux Discussion Forum
Puppy HOME page : puppylinux.com
"THE" alternative forum : puppylinux.info
 
OCRopus 0.2 optical character recognition + layout analysis
disciple

Joined: 20 May 2006
Posts: 6449
Location: Auckland, New Zealand

Posted: Thu 25 Sep 2008, 08:00    Subject: OCRopus 0.2 optical character recognition + layout analysis
Subtitle: Uses tesseract engine. Much better than gocr
 

This is an OCR program that uses the Tesseract engine and adds layout analysis. In my tests, "layout analysis" seems to mean only that it recognizes columns of text, reads each of them, and then strings them all together, including headers, footers, etc. It is noticeably less accurate than Tesseract on its own, and probably only a little better than the OCR engine in Microsoft Office. That is still much better than gocr :)
I recommend using Tesseract itself (here) unless you intend to scan pages with parallel columns.
It produces an HTML file that copies and pastes nicely from a browser into a word processor (probably not from Dillo, as I don't think it has UTF support).

1. Install from here (1121 KB).
2. Download your language file (English, French, German, Dutch, Spanish, Italian, Portuguese, Vietnamese or Old German) from http://code.google.com/p/tesseract-ocr/downloads/list (900 to 1400 KB).
You want the "language data" file, not the "source training data" file. If you need a different language, you'll have to make the files yourself ;)
3. Extract the language file into /usr/local/share, so that its contents end up in /usr/local/share/tessdata.
4. Run it with, e.g.:
Quote:
ocroscript rec-tess /path/some_scan.png > /other_path/scan.html
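
For a whole folder of scans, the same command could be wrapped in a loop. This is only a rough sketch: the function names and the TESSDATA_DIR variable are made up for illustration, and it assumes the language data sits where step 3 puts it.

```shell
#!/bin/sh
# Sketch only: TESSDATA_DIR and both helper names are assumptions for
# illustration, not part of OCRopus or Tesseract.
TESSDATA_DIR="${TESSDATA_DIR:-/usr/local/share/tessdata}"

# Print the HTML output path for a scan,
# e.g. scans/page1.png + out -> out/page1.html
html_path_for() {
    base=$(basename "$1")
    printf '%s/%s.html\n' "$2" "${base%.*}"
}

# Run OCRopus on every PNG in a directory, writing one HTML file per scan.
ocr_dir() {
    in_dir="$1"
    out_dir="$2"
    if [ ! -d "$TESSDATA_DIR" ]; then
        echo "No language data in $TESSDATA_DIR (see step 3)" >&2
        return 1
    fi
    mkdir -p "$out_dir" || return 1
    for scan in "$in_dir"/*.png; do
        [ -e "$scan" ] || continue   # nothing matched the pattern
        ocroscript rec-tess "$scan" > "$(html_path_for "$scan" "$out_dir")"
    done
}

# Usage: ocr_dir /path/to/scans /other_path/html
```

Each scan gets its own HTML file, so one bad page does not spoil the output for the rest.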


N.B. This build doesn't work with TIFFs, as I disabled libtiff support in Tesseract because of a bug that they tell me will be fixed in the next version. Convert your scans to another format first.
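
One way to do that conversion is with ImageMagick's `convert`, assuming it is installed. The helper names below are hypothetical, and the sketch assumes single-page TIFFs (a multi-page TIFF would need more care).

```shell
#!/bin/sh
# Sketch only: the helper names are made up for illustration, and this
# assumes ImageMagick's `convert` is installed.

# Print the PNG path for a TIFF path (same location, new extension).
png_path_for() {
    printf '%s.png\n' "${1%.*}"
}

# Convert one single-page TIFF to PNG, then run OCRopus on the PNG.
ocr_tiff() {
    tiff="$1"
    html="$2"
    png=$(png_path_for "$tiff")
    convert "$tiff" "$png" || return 1
    ocroscript rec-tess "$png" > "$html"
}

# Usage: ocr_tiff /path/some_scan.tif /other_path/scan.html
```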

OCRopus can also be compiled against a language modelling program and a program for making vector images of diagrams in a scan. I didn't look hard, but there doesn't seem to be a ready-to-go way to use these (or aspell, which I think I did compile against), so I didn't bother.

The install also produced two HUGE files that I left out for the same reason: a US dictionary and a file for neural network modelling.

BTW, unlike Tesseract, I think OCRopus converts images to black-and-white, so there is no advantage in using colour images.

_________________
DEATH TO SPREADSHEETS
- - -
Classic Puppy quotes
- - -
Beware the demented serfers!
disciple

Joined: 20 May 2006
Posts: 6449
Location: Auckland, New Zealand

Posted: Sat 11 Apr 2009, 08:19    Subject: Extra OCR related tools

Also check out the extra tools I posted in the Tesseract thread.
http://www.murga-linux.com/puppy/viewtopic.php?p=279332#279332
and the following post.

miriam


Joined: 06 Dec 2006
Posts: 281
Location: Queensland, Australia

Posted: Sat 12 Feb 2011, 19:13

A newer version of OCRopus is out that no longer uses tesseract -- it uses its own character recognition software instead. I haven't tried it yet.

The new tesseract 3.00 detects columns. I've compiled it, but haven't learned how to make pets yet. It feels slower than tesseract 2, probably because it is looking for columns and perhaps other things, though I'm running it on a much, much, much slower computer since my nice fast computer died (wah!).

_________________
A life! Cool! Where can I download one of those from?
disciple

Joined: 20 May 2006
Posts: 6449
Location: Auckland, New Zealand

Posted: Sat 12 Feb 2011, 19:51

miriam wrote:
A newer version of OCRopus is out that no longer uses tesseract -- it uses its own character recognition software instead. I haven't tried it yet.

Where is it out? You haven't built from trunk?
Quote:
The new tesseract 3.00 detects columns. I've compiled it, but haven't learned how to make pets yet.

It is easy. Instead of running `make install` run `new2dir make install`.
Quote:
It feels slower than tesseract 2, probably because it is looking for columns and perhaps other things, though I'm running it on a much, much, much slower computer since my nice fast computer died (wah!).

I thought they said it was actually faster, but I could be wrong.

The other OCR project worth trying is cuneiform. The last version before they open-sourced it (free, but Windows only, with a Russian interface) did an exceptional job of recognising layout, formatting and tables. Unfortunately they haven't managed to open-source the table recognition yet :(
If you do want to use the old Russian version, its interface is identical to that of the current Windows release (which is available in English). I've taken screenshots of the English one to find my way around the Russian version.

There are a few GUIs around for the Linux version:
http://symmetrica.net/cuneiform-linux/yagf-en.html (Qt + aspell, looks good)
http://en.altlinux.org/Cuneiform-Qt (Qt)
http://code.google.com/p/cuneiform-gui/ (Java)
http://code.google.com/p/simplegui4cuneiform/ (zenity dialogs)
http://wiki.ubuntuusers.de/Cuneiform-Linux?highlight=cuneiform#Einbindung-in-XSane (script to integrate with XSane and ImageMagick)
EDIT: also https://code.google.com/p/ocrfeeder/ (Py/GTK), which I mentioned in the tesseract thread, can use cuneiform.

rcrsn51


Joined: 05 Sep 2006
Posts: 9203
Location: Stratford, Ontario

Posted: Sat 12 Feb 2011, 21:22

Tesseract 3 is twice as big as Tesseract 2 when packaged as a PET and is somewhat slower. But it does a good job of detecting columns. And it's compatible with Peasyscan.
Sit Heel Speak


Joined: 30 Mar 2006
Posts: 2595
Location: downwind

Posted: Sat 12 Feb 2011, 22:39

In case anyone does wish to attempt building OCRopus from trunk, I have just posted a .pet of mercurial, here.
miriam


Joined: 06 Dec 2006
Posts: 281
Location: Queensland, Australia

Posted: Sun 13 Feb 2011, 01:00

Wow, thanks disciple. Now that I have the name of the new2dir command I looked it up and have learned lots about making pets, including the really essential part that I didn't know about dir2pet. I've been compiling programs for my various machines for many years. In future I'll make pets for Puppy and share them around. Yay!
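
For anyone following along, the new2dir and dir2pet steps might be chained roughly like this. It is a sketch under assumptions: the helper names are hypothetical, both Puppy scripts ask questions interactively, and the guess that new2dir creates a sibling directory named after the source directory plus the architecture should be checked against what it actually reports.

```shell
#!/bin/sh
# Sketch only: pkg_dir_for and make_pet are made-up helper names.
# The directory naming (source dir name + "-" + architecture) is an
# assumption about new2dir; verify it against new2dir's own output.

# Print the package directory name expected for a source tree,
# e.g. /root/src/tesseract-3.00 + i486 -> tesseract-3.00-i486
pkg_dir_for() {
    printf '%s-%s\n' "$(basename "$1")" "$2"
}

# Install an already-compiled source tree through new2dir, then
# package the resulting directory as a PET with dir2pet.
make_pet() {
    src_dir="$1"
    arch="$2"
    pkg_dir=$(pkg_dir_for "$src_dir" "$arch")
    parent=$(dirname "$src_dir")
    ( cd "$src_dir" && new2dir make install ) || return 1
    if [ ! -d "$parent/$pkg_dir" ]; then
        echo "Expected $parent/$pkg_dir; check what new2dir created" >&2
        return 1
    fi
    ( cd "$parent" && dir2pet "$pkg_dir" )
}

# Usage: make_pet /root/src/tesseract-3.00 i486
```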

I haven't actually tried OCRopus yet. I simply read the details on their Google Code page (http://code.google.com/p/ocropus/) and on the Wikipedia page (http://en.wikipedia.org/wiki/OCRopus), but I expect I will try it in the near future. I've long felt that neural networks are the only sensible way to get reliable OCR. I'm especially interested that OCRopus can have its handwriting-recognition code enabled (it is disabled by default).

It is a bit hard to tell how tesseract 3 compares with the older tesseract 2 because my current machine (until I get a newer one) is soooo slow.

Thanks for the info about cuneiform, but I'm very reluctant to put any effort into getting Wine working on my machine after having finally rid myself of the last traces of pesky M$ stuff. :)

disciple

Joined: 20 May 2006
Posts: 6449
Location: Auckland, New Zealand

Posted: Sun 13 Feb 2011, 01:56

Please note: the only reason anyone would want to run the old Windows version of cuneiform is for table recognition. I believe the Linux version should be just as capable apart from that.
miriam


Joined: 06 Dec 2006
Posts: 281
Location: Queensland, Australia

Posted: Sun 13 Feb 2011, 05:57
Subtitle: oops
 

Oops, sorry. My bad. My eyes glazed over at the mention of Windows and I am a little embarrassed to admit I didn't read the part that followed properly.

Much more interesting than I thought. Thank you.
Downloading now... Yeow! 25MB! Big.
But it does sound like a very cool program.
http://www.cuneiform.ru/eng/
http://en.wikipedia.org/wiki/CuneiForm_(software)

Weird... the text of this post disappeared, but is here when I edit it... I deleted it all and added stuff back in line by line.
Huh. It is the Wikipedia address. I can't make it a link -- the parentheses probably confuse the bulletin board software.
