Entry 53304 (Berkeley CSUA MOTD)

2009/8/27-9/9 [Computer/SW/OS/OsX] UID:53304 Activity:nil

8/26    Any suggestions on a good OCR program for either OS X or Windows that
        will work on scanned documents saved as PDF?  Preferably free?
        Thanks, scottyg
        \_ Check Abbyy or Scansoft.  Not free.
           \_ Thanks...I think I'd prefer a free or opensource piece of
              software unless there is a huge difference in quality.  I
              bought a kindle 2 for my girlfriend so that she wouldn't have
              to lug around 100 lbs of textbooks.  But the textbooks aren't
              out in ebook form yet.  I've scanned a few on the office scanner,
              now I'd like to make them searchable.  This project is already
              getting expensive :)  -scottyg
              \_ I think there is a huge quality difference, but if you must
                 have OSS:
                 http://code.google.com/p/tesseract-ocr
                 I use Acrobat Pro; it works and is easy to get hold of.

Cache (2411 bytes)

code.google.com/p/tesseract-ocr -> code.google.com/p/tesseract-ocr/mhasnat

Background
The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. The source code will read a binary, grey or color image and output text. A TIFF reader is built in that will read uncompressed TIFF images, or libtiff can be added to read compressed images.

Supported Platforms
The developers are regularly testing on the following platforms:
* Ubuntu 6.06 (x86/32, x86/64)
* Ubuntu 6.10 (x86/32, x86/64)
* Windows (x86/32) with Visual C++ Express 2008

Additionally, we believe that the code should be running on these other platforms, but we don't have the resources to test on them regularly:
* recent Linux distributions (x86/32, x86/64)
* Mac OS X (x86, PPC)

People have reported success with Cygwin on Windows, but this is not a tested platform. If you're interested in supporting other platforms or languages, please get in touch with Ray Smith.

Roadmap
Version 2.04 release is now available for download and contains the following new features:
* Many reported issues fixed, especially portability issues: 1, 63, 67, 71, 76, 79, 81, 82, 84, 106, 108, 111, 112, 128, 129, 130, 133, 135, 142, 143, 145, 146, 147, 153, 154, 160, 165, 169, 170, 175, 177, 187, 192, 195, 199, 201, 205, 209.

The release candidate will be available from the downloads page soon, after further testing. Even the Windows executables tarball is incomplete as language files are required.

The upcoming 3.00 release will probably include:
* Page layout analysis.

Core Developers
The core developer on the project is Ray Smith (theraysmith).

OCRopus project, for which Tesseract is one of the pluggable OCR engines; OCRopus also provides layout analysis and statistical language modeling.

Migration
As you have probably noticed, the Tesseract project has migrated from SourceForge to Google hosting. We were actually happy with SourceForge hosting, but since we needed to move from CVS to Subversion anyway, it seemed to make sense to move to Google hosting at the same time. We had planned on announcing the migration first and spending some time on it, but it turned out to be so quick and easy that we were done the same day. If you have questions or concerns about this migration, please contact Ray Smith. The major difference is that there is no discussion forum.
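
For the textbook project in the thread, a rough sketch of batch-running Tesseract over scanned pages might look like the following. It assumes the pages have already been exported from the PDF as one TIFF per page (that step is not shown, and per the project page the built-in reader only handles uncompressed TIFF unless libtiff is added), that a `tesseract` binary is on the PATH, and that English language data is installed as `eng`. The directory names and the helper names `tesseract_command` and `ocr_pages` are made up for illustration. The output is plain text per page, not a searchable PDF.

```python
import subprocess
from pathlib import Path


def tesseract_command(image: Path, out_dir: Path, lang: str = "eng") -> list[str]:
    """Build the argv for one page: tesseract <image> <output base> -l <lang>."""
    # Tesseract appends ".txt" to the output base on its own,
    # so the base is passed without an extension.
    out_base = out_dir / image.stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_pages(page_dir: Path, out_dir: Path, lang: str = "eng") -> list[Path]:
    """Run Tesseract on every *.tif in page_dir; return the expected text files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    # Sorting keeps pages in order when files are named page001.tif, page002.tif, ...
    for image in sorted(page_dir.glob("*.tif")):
        # check=True raises CalledProcessError if Tesseract exits with an error.
        subprocess.run(tesseract_command(image, out_dir, lang), check=True)
        outputs.append(out_dir / (image.stem + ".txt"))
    return outputs
```

Calling `ocr_pages(Path("scans"), Path("ocr_text"))` would then write one text file per page into `ocr_text`, which can be searched with ordinary tools. Whether the accuracy is good enough for textbooks is the open question raised in the thread.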