Tesseract is an optical character recognition (OCR) engine originally developed by Hewlett-Packard. In 2005 it was open-sourced under the Apache license, and its development is now supported by Google. Version 3.0 was released in September 2010; among other things, it adds support for the Polish language.
The wiki on the Tesseract website is a bit messy, which is why I decided to describe my experience building and installing Tesseract 3.0. I was working on Ubuntu 10.10 Server Edition, deployed on a virtual machine created with Oracle VirtualBox.
First, I installed build-essential and autoconf:
sudo apt-get install build-essential
sudo apt-get install autoconf
The next step, according to the Tesseract wiki, is to install the dependencies:
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
sudo apt-get install zlib1g-dev
Please note that the name of the zlib1g-dev package is misspelled in the wiki.
I tried to install the libleptonica package (Leptonica is also a required dependency) from the default Ubuntu repositories, but Tesseract’s ./configure script did not recognize that it was installed. To work around that, I downloaded the sources of Leptonica 1.6.7 from its Google Code website and then followed the fairly standard build process:
./configure
make
sudo make install
sudo ldconfig
The next step was to download tesseract-3.00.tar.gz from the Tesseract project website. Uncompress the archive, go to the tesseract-3.0 directory and invoke:
./configure
After invoking ./configure, check config_auto.h to see whether the dependencies were recognised by the ./configure script. The header file should contain a #define for each of HAVE_LIBLEPT, HAVE_LIBPNG, HAVE_LIBTIFF, HAVE_LIBJPEG and HAVE_ZLIB.
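Checking the header by eye is easy to get wrong, so here is a minimal sketch of how the check could be scripted. The helper name check_config_defines is my own invention and not part of Tesseract. It assumes the header uses the usual autoconf layout, where a recognised feature appears as an active "#define" line and an unrecognised one as a commented-out "#undef".

```shell
# check_config_defines is a hypothetical helper, not part of Tesseract.
# It reports which of the expected feature macros are defined in a header
# and returns non-zero if any of them is missing.
check_config_defines() {
    header="$1"
    missing=0
    for macro in HAVE_LIBLEPT HAVE_LIBPNG HAVE_LIBTIFF HAVE_LIBJPEG HAVE_ZLIB; do
        # Match an active "#define MACRO" line only; a commented-out
        # "/* #undef MACRO */" line does not count.
        if grep -Eq "^[[:space:]]*#[[:space:]]*define[[:space:]]+${macro}([[:space:]]|\$)" "$header"; then
            echo "found:   $macro"
        else
            echo "missing: $macro"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from the Tesseract source directory after ./configure:
# check_config_defines config_auto.h
```

If any macro is reported as missing, the corresponding library was not detected and is worth fixing before running make.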
Then build and install Tesseract:
make
sudo make install
sudo ldconfig
Without ldconfig you might experience problems with launching Tesseract.
Download the language files of your choice from the Tesseract website, uncompress them and place them in your tessdata folder (by default /usr/local/share/tessdata).
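As a rough sketch, the uncompress-and-place step could look like the helper below. It assumes the language file was downloaded as a single gzip-compressed file; the name eng.traineddata.gz is only an example, so check what your download is actually called. install_traineddata is a hypothetical name of my own.

```shell
# install_traineddata is a hypothetical helper: it unpacks one
# gzip-compressed language file into a tessdata directory.
install_traineddata() {
    archive="$1"                                 # e.g. eng.traineddata.gz (assumed name)
    tessdata="${2:-/usr/local/share/tessdata}"   # default location mentioned above
    target="$tessdata/$(basename "$archive" .gz)"
    mkdir -p "$tessdata" || return 1
    # Decompress to the target and keep the downloaded archive untouched.
    gunzip -c "$archive" > "$target" || return 1
    echo "$target"
}

# Usage with a scratch directory:
# install_traineddata eng.traineddata.gz ./tessdata
```

Writing to the default /usr/local/share/tessdata location usually requires root privileges, so in that case run the commands from a root shell or unpack to a scratch directory first and copy the result with sudo.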
Now run the OCR using:
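tesseract phototest.tiff out -l eng
more out.txt
Tesseract appends .txt to the output name itself, so the second argument is given without an extension.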
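If you have more than one image, the same command can be wrapped in a loop. The sketch below is only an illustration: ocr_all is a hypothetical helper name, and it assumes all images sit in one directory with a .tiff extension. Tesseract adds .txt to the output base it is given, so the base is derived by stripping the image extension.

```shell
# ocr_all is a hypothetical helper: it runs Tesseract on every .tiff file
# in a directory and writes the text next to each image.
ocr_all() {
    dir="$1"
    lang="${2:-eng}"                     # language code, eng unless specified
    for image in "$dir"/*.tiff; do
        [ -e "$image" ] || continue      # skip the literal pattern when nothing matches
        base="${image%.tiff}"            # Tesseract appends .txt to this output base
        tesseract "$image" "$base" -l "$lang" || return 1
    done
}

# Usage: recognise all .tiff files in ./scans as Polish
# ocr_all ./scans pol
```

The language code passed as the second argument must match a language file that is present in your tessdata folder.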
Hope that this will be helpful.