All times are UTC - 4 |
Page 1 of 3 [32 Posts] |
Author |
Message |
disciple
Joined: 20 May 2006 Posts: 6781 Location: Auckland, New Zealand
|
Posted: Tue 23 Sep 2008, 02:00 Post subject:
tesseract-ocr optical character recognition Subject description: MUCH more accurate than gocr |
|
Tesseract is the most accurate open-source character recognition engine, but it has no layout analysis. If you intend to scan pages with parallel columns, you should use OCRopus (here), which uses the Tesseract engine.
If not, use Tesseract directly: you will have to get rid of all the unnecessary line breaks in its output, but the actual character recognition is better than with OCRopus.
In my tests Tesseract was almost 100% accurate, except that it missed a few spaces, and a few symmetrical apostrophes ' were turned into left-hand single quotes ‘
This is much better than the OCR engine included in Microsoft Office, and MUCH better than gocr
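The clean-up mentioned above (removing the extra line breaks and the stray left-hand quotes) could be scripted roughly like this. It is only a sketch: the function names are made up for illustration, and it assumes the output text is UTF-8 with paragraphs separated by blank lines. It does not try to rejoin words hyphenated across lines.

```shell
#!/bin/sh
# Sketch: tidy Tesseract's plain-text output (reads stdin, writes stdout).
# Assumption: paragraphs in the .txt file are separated by blank lines.

# Join the lines of each paragraph into one line; blank lines stay as breaks.
unwrap_paragraphs() {
    awk '
        /^[ \t]*$/ { if (buf != "") print buf; print ""; buf = ""; next }
        { buf = (buf == "" ? $0 : buf " " $0) }
        END { if (buf != "") print buf }
    '
}

# Turn left-hand single quotes back into plain apostrophes.
fix_apostrophes() {
    sed "s/‘/'/g"
}
```

For example: unwrap_paragraphs < /path/output_file.txt | fix_apostrophes > /path/clean.txt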
1. Install from here (512 KB)
2. Copy everything from /local to /usr/local, then you can delete /local (I made a mistake packaging it... I'll package version 3 sometime and get it right).
3. Download your language file (English, French, German, Dutch, Spanish, Italian, Portuguese, Vietnamese or Old German) from http://code.google.com/p/tesseract-ocr/downloads/list (900 to 1400 KB)
YOU WANT THE "LANGUAGE DATA" file, not the "SOURCE TRAINING DATA" file. If you need a different language, you will have to make the language data yourself.
4. Extract the language file into /usr/local/share (so the files end up in /usr/local/share/tessdata).
5. Run with e.g.
Code: | tesseract /path/scan.tif /path/output_file |
or pipe it to something with a spellcheck just in case
It will automatically append a .txt extension to the output.
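The extract and run steps above could be wrapped up something like this. It is a sketch only: the function names are made up, and it assumes the language file is a gzipped tarball with a top-level tessdata/ directory (check what you actually downloaded). TESSERACT can be overridden, e.g. set to echo for a dry run.

```shell
#!/bin/sh
# Sketch of the "extract the language file" and "run" steps.
TESSERACT=${TESSERACT:-tesseract}

# Unpack a language-data tarball so that its tessdata/ directory lands in
# the given prefix (default /usr/local/share).
# Assumption: the archive is .tar.gz and contains tessdata/ at the top level.
install_lang_data() {
    archive=$1
    prefix=${2:-/usr/local/share}
    mkdir -p "$prefix" && tar -xzf "$archive" -C "$prefix"
}

# OCR every .tif in a directory. The second argument to tesseract is the
# output name without extension; tesseract appends .txt itself.
ocr_dir() {
    dir=$1
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue    # no .tif files: skip the literal glob
        "$TESSERACT" "$img" "${img%.tif}"
    done
}
```

For example: ocr_dir /path/scans leaves /path/scans/page1.txt next to /path/scans/page1.tif.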
It ONLY works with uncompressed and G3-compressed TIFFs, because I disabled libtiff support to avoid a bug that they tell me will be fixed in the next version. XnView, nconvert, ImageMagick's convert, and probably other tools can make these. I'm guessing XSane can too. The GIMP can't (or at least couldn't)
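With ImageMagick, making an uncompressed TIFF should look roughly like this. It is a sketch: the function name is made up, and CONVERT is just an override hook (on ImageMagick 7 the command is magick rather than convert). I haven't checked whether this build also cares about bit depth.

```shell
#!/bin/sh
# Sketch: re-save any image as an uncompressed TIFF using ImageMagick.
CONVERT=${CONVERT:-convert}

# Usage: to_plain_tiff input-image output.tif
to_plain_tiff() {
    # -compress none tells ImageMagick to write the TIFF without compression
    "$CONVERT" "$1" -compress none "$2"
}
```

For example: to_plain_tiff /path/scan.png /path/scan.tif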
FYI I had compile problems with 2.03, so we're waiting for 2.04
_________________ If you have or know of a good gtkdialog application, please post a link here
Classic Puppy quotes
ROOT FOREVER
Last edited by disciple on Tue 12 Jan 2010, 08:54; edited 3 times in total
WhoDo

Joined: 11 Jul 2006 Posts: 4440 Location: Lake Macquarie NSW Australia
|
Posted: Tue 23 Sep 2008, 05:03 Post subject:
Re: tesseract-ocr optical character recognition Subject description: MUCH more accurate than gocr |
|
disciple wrote: | I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 11MB, and I'm sick of trying to find decent free file hosts). |
Why not get Tom or Will to upload it at puppylinux.org, or PM caneri to host it at puppylinux.ca ... either way.
Both locations have plenty of free space and don't charge for downloads in .pet or .pup formats. No need to bother with the adware hosts for trusted Puppy developers/compilers like yourself these days.
_________________ Actions speak louder than words ... and they usually work when words don't!
SIP:whodo@proxy01.sipphone.com; whodo@realsip.com
HairyWill

Joined: 26 May 2006 Posts: 2946 Location: Southampton, UK
|
Posted: Tue 23 Sep 2008, 07:20 Post subject:
|
|
Puppylinux.org doesn't host packages at the moment; I think we did this to ensure that the site would not run out of transfer quota.
I'm sure Caneri can help.
_________________ Will
contribute: community website, screenshots, puplets, wiki, rss
disciple
Joined: 20 May 2006 Posts: 6781 Location: Auckland, New Zealand
|
Posted: Tue 23 Sep 2008, 07:37 Post subject:
|
|
OK, we'll see about that.
Oops. That was a pretty bad typo. I meant "over 1MB"
Dingo

Joined: 11 Dec 2007 Posts: 1434 Location: somewhere at the end of rainbow...
|
Posted: Tue 23 Sep 2008, 13:23 Post subject:
Re: tesseract-ocr optical character recognition Subject description: MUCH more accurate than gocr |
|
disciple wrote: | I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 1MB, and I'm sick of trying to find decent free file hosts).
|
http://www.filefront.com/
_________________ replace .co.cc with .info to get access to stuff I posted in forum
dropbox 2GB free
OpenOffice for Puppy Linux
lluamco
Joined: 16 Mar 2007 Posts: 207 Location: Banyoles, Spain
|
Posted: Wed 24 Sep 2008, 03:59 Post subject:
Re: tesseract-ocr optical character recognition Subject description: MUCH more accurate than gocr |
|
disciple wrote: | I compiled tesseract 2.01, and can upload a package if someone wants to host it (It's over 1MB, and I'm sick of trying to find decent free file hosts).
|
Hello disciple.
MU is always kind enough to host large files. Please read
http://www.murga-linux.com/puppy/viewtopic.php?p=99400#99400
to find out how to proceed.
Cheers,
Lluis
disciple
Joined: 20 May 2006 Posts: 6781 Location: Auckland, New Zealand
|
Posted: Thu 25 Sep 2008, 08:25 Post subject:
tesseract-ocr optical character recognition Subject description: MUCH more accurate than gocr |
|
OK I uploaded it and updated the first post.
It turned out I COULD get it under 1MB, but not OCRopus (see link), so thanks Caneri
Dingo

Joined: 11 Dec 2007 Posts: 1434 Location: somewhere at the end of rainbow...
|
Posted: Thu 25 Sep 2008, 08:29 Post subject:
|
|
Thanks. I have linked both topics and mirrored them on dokupuppy:
http://puppylover.netsons.org/dokupuppy/programs:ocr
disciple
Joined: 20 May 2006 Posts: 6781 Location: Auckland, New Zealand
|
Posted: Thu 25 Sep 2008, 08:30 Post subject:
|
|
BTW I was mistaken. Ocropus does not have a gui, but does have a complex set of Lua scripts
disciple
Joined: 20 May 2006 Posts: 6781 Location: Auckland, New Zealand
|
Posted: Sat 28 Feb 2009, 19:51 Post subject:
|
|
Here are some OCR proofing aids that should be useful, probably more so if you are doing a lot of OCR:
http://gutcheck.sourceforge.net/
Quote: | Gutcheck is a plain-text checking program that specializes in reporting the problems that spellcheckers don't--errors like mismatched quotes, misplaced punctuation, unintended blank lines. It is specifically tuned for checking texts for submission to Project Gutenberg, though I hope it can be useful elsewhere as well. |
Quote: | The common OCR error of mistaking a "b" for a "h" and vice versa used to lead to horrible things with the words "he" and "be". With the vast improvement in OCR programs in the last few years, this is not the nightmare it used to be.
jeebies detects common he/be errors by a simple lookup table. I really need to add some extra intelligence; I have a set of heuristics that I used previously, and I will probably get the time to plug them in at some point. For now, it's quick and does have some value, especially in checking older texts. It needs its lookup table, which is in the files he.jee and be.jee |
Quote: | Gutspell: I made a very enthusiastic start on this, but I need a big dictionary with possible parts of speech listed for every word to do the next thing with it, and I never got around to doing that.
Now, it simply lists every word that isn't in its dictionary that occurs only once. Still, as a superfast check, it does still catch some typos. It has a bad habit of obsessing on one word sometimes, and reporting lots of instances. I must fix that one day. Its dictionary is the file gutspell.dic |
If someone is keen, it would be worth getting Guiguts working, which is a Perl/Tk GUI for these tools and aspell/ispell. In spite of what the gutcheck site implies, Guiguts is not Windows-only. The trickier part would be packaging Perl/Tk for Puppy.
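To give an idea of the kind of check gutspell describes (words that are not in the dictionary and occur only once), here is a rough equivalent with standard tools. This is NOT how gutspell is implemented, just a sketch; the function name is made up, and the word list is assumed to be a plain file with one lowercase word per line. Words with apostrophes get split, so expect some noise.

```shell
#!/bin/sh
# Sketch: list words that occur exactly once in the text on stdin and are
# not in the word list given as $1 (one lowercase word per line).
rare_unknown_words() {
    tr -cs 'A-Za-z' '\n' |            # one word per line, letters only
        tr 'A-Z' 'a-z' |              # lowercase everything
        sort | uniq -c |              # count occurrences of each word
        awk '$1 == 1 && $2 != "" { print $2 }' |   # keep words seen once
        grep -vxF -f "$1"             # drop words found in the word list
}
```

For example: rare_unknown_words /path/words.txt < /path/output_file.txt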
Attachment: gutcheck.zip (35.44 KB, downloaded 1043 times)
disciple
Joined: 20 May 2006 Posts: 6781 Location: Auckland, New Zealand
|
Posted: Sat 11 Apr 2009, 08:13 Post subject:
unpaper - post-processing scanned and photocopied book pages Subject description: Straighten pages and remove black edges |
|
This should also be useful before you do the OCR.
Unpaper is a tool for straightening pages and removing black edges, including in the middle, where you have photocopied an open book!
I haven't tested it, and it is at an early stage of development, but it certainly looks good
You'll need to figure out how to convert your images to and from .pnm
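One way to do the .pnm round trip is with ImageMagick's convert. The following is an untested sketch: the function name and temporary file names are made up, it relies on unpaper's basic "input file, output file" usage, and CONVERT/UNPAPER are just override hooks. It writes an uncompressed TIFF at the end so the result can go straight into Tesseract.

```shell
#!/bin/sh
# Sketch: image -> .pnm -> unpaper -> .pnm -> uncompressed TIFF.
CONVERT=${CONVERT:-convert}
UNPAPER=${UNPAPER:-unpaper}

# Usage: clean_scan input-image output.tif
clean_scan() {
    src=$1
    dst=$2
    tmp=${dst%.*}                     # temp files sit next to the output
    "$CONVERT" "$src" "$tmp.in.pnm" &&
        "$UNPAPER" "$tmp.in.pnm" "$tmp.out.pnm" &&
        "$CONVERT" "$tmp.out.pnm" -compress none "$dst" &&
        rm -f "$tmp.in.pnm" "$tmp.out.pnm"
}
```

For example: clean_scan /path/scan.tif /path/scan-clean.tif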
Attachment: unpaper-0.3.pet (29.67 KB, downloaded 1173 times)
jrb

Joined: 11 Dec 2007 Posts: 1103 Location: Smithers, BC, Canada
|
Posted: Thu 23 Apr 2009, 17:03 Post subject:
|
|
I have built ch-tesseract-2.01-OCR-en.sfs, an English version of Tesseract. Tesseract_OCR is placed on the right-click menu. If you right-click on a .tif file, it will produce a text file with the same name in a few seconds. However, it is very fussy about these .tif files. You may have to open them in mtpaint or another graphics program and resave them. Even the training files required this. After that, however, it seems to work very well.
I have also placed a menu item on the Documents menu which opens a text file with these same instructions.
Packages for other major languages are available and can be easily built.
Let me know how it works for you. J
disciple
Joined: 20 May 2006 Posts: 6781 Location: Auckland, New Zealand
|
Posted: Fri 24 Apr 2009, 02:42 Post subject:
|
|
To download that sfs use the username "puppy" and password "linux" - I had to fill it in several times for some reason (unless the last time I changed it and put a capital or something?).
Quote: | However it is very fussy about these .tif files |
That should change in 2.04 or 3, which were both expected to be out already... so they should be out soon
Dromeno
Joined: 12 Sep 2008 Posts: 543
|
Posted: Fri 24 Apr 2009, 04:46 Post subject:
Scansoft Omnipage via wine in puppy Subject description: not open source unfortunately |
|
OCR is one of those fields where Windows applications still outshine the Linux ones. But fortunately for us Puppy users, Scansoft Omnipage (my favorite) works via Wine. And it even works as 'portable' (just copy the Omnipage files from C:\Program files to some external device).
disciple
Joined: 20 May 2006 Posts: 6781 Location: Auckland, New Zealand
|
Posted: Fri 24 Apr 2009, 20:13 Post subject:
|
|
So I gather that is better because it deals with layout? Tesseract is noticeably more accurate than most of the Windows products I've tried (some products are as accurate, and I suspect that one would be); where it is lacking is layout analysis.
They say that produces perfectly formatted documents, but how editable are they really? I've never tried any software that produces output that is formatted to match the original well and is also readily editable - it tends to be like copying text from a pdf.