Tag Archives: tesseract

Digitlab

Digitlab is a specially adapted operating system based on Ubuntu Linux. It was created to provide a complete system for digitising collections using free and widely available tools. The system is based on Ubuntu 12.04 LTS and was prepared with a tool called Remastersys. It can be downloaded as an ISO image and tried out by writing it to a USB flash drive or DVD, without installing it on the computer.

Digitlab is one of the results of the ACCESS IT Plus project. The installed programs include, among others: ScanTailor (a tool for post-processing scans), gscan2pdf (scanning and creation of PDF and DjVu files, with support for Tesseract), magicktiler (a tool for creating Zoomify images), and the Tesseract OCR engine with support for the Polish language, German Gothic fonts and much more. Apart from the utilities mentioned above, users will also find sample digital libraries built on DSpace, GreenStone and dLibra software. The full list of software installed in Digitlab is published here.

To try Digitlab you need a carrier of at least 4 GB and software such as Ubuntu Startup Disk Creator (Linux), Universal USB Installer (Windows), or any other tool that can create a bootable carrier from the downloaded ISO image. In both of these programs, apart from writing the system image, you can create a persistent storage area in which all changes made by the user are saved when the system is booted from the USB flash drive. When creating this area, remember that the larger it is, the longer the system takes to boot from the flash drive. Besides booting from a flash drive, Digitlab can also be installed on a computer and used as its main operating system.

The default language of the system is English. Some other languages are also installed: Croatian, Serbian, Greek, Albanian, Turkish and Polish. All applications that were not installed from system packages are placed in the directory “/usr/apps/”. The system can be used successfully for training and for daily work. The ISO image can be downloaded here.

The image used in this post comes from the journal Missye Katolickie. It was published in 1882 and is available in the Digital Library of Greater Poland.

Comparison of FineReader and Tesseract OCR engines – report

Today we published a report comparing the FineReader and Tesseract OCR engines. Both tools were tested on Polish historical documents (printed before 1850) from various Polish digital libraries. The comparison covered both gothic and antiqua documents, as well as noisy and clean images. Both engines were trained appropriately before the comparison.

The comparison of OCR results shows no single winner: neither engine consistently outperforms the other. However, we tried to point out the differences between FineReader and Tesseract, along with their advantages and disadvantages. We invite you to read the report for details of our approach and the results obtained.

All test cases are based on the ground truth data produced within the IMPACT project. The comparison itself was part of the pilot work conducted in the course of the IMPACT project extension in the first half of 2012. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

The full report is available for download on the PSNC Digital Libraries Team website dedicated to the IMPACT project results.

Tesseract 3.0 installation on Ubuntu 10.10 server

Tesseract is an optical character recognition (OCR) engine originally developed by Hewlett Packard. In 2005 it was open sourced under the Apache license, and its development is now supported by Google. Version 3.0 was released in September 2010; among other things, it offers support for the Polish language.

The wiki on the Tesseract website is a bit messy, which is why I decided to describe my experience with building and installing Tesseract 3.0. I was working on Ubuntu 10.10 Server Edition, deployed on a virtual machine created with Oracle VirtualBox.

First, I installed build-essential and autoconf:

sudo apt-get install build-essential
sudo apt-get install autoconf

The next step, according to the Tesseract wiki, is to install the dependencies:

sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
sudo apt-get install zlib1g-dev

Please note that the name of the zlib1g-dev package is misspelled in the wiki.
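Before moving on, it may be worth confirming that the packages above really ended up installed. The following is only a minimal sketch: the function name check_packages is made up for this post, and it assumes a Debian or Ubuntu system where dpkg-query is available.

```shell
# check_packages: report which of the build packages named above are installed.
# (Hypothetical helper, not part of Tesseract or Ubuntu.)
check_packages() {
    for pkg in build-essential autoconf libpng12-dev libjpeg62-dev libtiff4-dev zlib1g-dev; do
        # dpkg-query prints "install ok installed" for a fully installed package;
        # errors for unknown packages are discarded and count as missing.
        if dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null | grep -q "install ok installed"; then
            echo "$pkg: installed"
        else
            echo "$pkg: MISSING"
        fi
    done
}
```

Calling check_packages prints one line per package, so a MISSING entry is easy to spot before you start building.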

I tried to install the libleptonica package (Leptonica is also a required dependency) from the default Ubuntu repositories, but Tesseract’s ./configure script did not recognise that it was installed. To cope with that, I downloaded the sources of Leptonica 1.6.7 from its Google Code website and then followed a fairly standard build process:

./configure
make
sudo make install
sudo ldconfig

The next step was to download tesseract-3.00.tar.gz from the Tesseract project website. Uncompress the archive, go to the tesseract-3.0 directory and invoke:

./runautoconf
./configure

After invoking ./configure, check config_auto.h to see whether the dependencies were recognised by the ./configure script. The header file should contain a #define for each of HAVE_LIBLEPT, HAVE_LIBPNG, HAVE_LIBTIFF, HAVE_LIBJPEG and HAVE_ZLIB.
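Instead of reading the header by eye, the check can be scripted. This is a rough sketch: check_config is a hypothetical helper written for this post, and it assumes the recognised dependencies appear as ordinary "#define NAME ..." lines, while unrecognised ones are left as commented-out #undef lines.

```shell
# check_config FILE: print the status of each expected #define and
# return non-zero if any of them is missing. (Hypothetical helper.)
check_config() {
    file=$1
    missing=0
    for macro in HAVE_LIBLEPT HAVE_LIBPNG HAVE_LIBTIFF HAVE_LIBJPEG HAVE_ZLIB; do
        # Matches lines such as "#define HAVE_LIBPNG 1"; a commented-out
        # "/* #undef HAVE_LIBPNG */" line does not match.
        if grep -Eq "^[[:space:]]*#[[:space:]]*define[[:space:]]+$macro([[:space:]]|\$)" "$file"; then
            echo "$macro: found"
        else
            echo "$macro: NOT found"
            missing=1
        fi
    done
    return $missing
}
```

From the Tesseract source directory you would run check_config config_auto.h. If any line says NOT found, it is probably better to fix the dependency and rerun ./configure before going on to make.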

make
sudo make install
sudo ldconfig

Without ldconfig you might experience problems with launching Tesseract.

Download the languages of your choice from the Tesseract website, uncompress them, and place them in your tessdata folder (by default /usr/local/share/tessdata).
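The uncompress-and-place step could look roughly like the sketch below. It rests on assumptions: install_traineddata is a hypothetical helper, and the downloaded file is assumed to be a single gzip-compressed file named like pol.traineddata.gz. If your download is a tar archive, you would need tar -xzf instead.

```shell
# install_traineddata ARCHIVE [TESSDATA_DIR]
# Decompress a gzip-compressed language file into the tessdata folder.
# (Hypothetical helper; the <lang>.traineddata.gz naming is an assumption.)
install_traineddata() {
    archive=$1
    dest=${2:-/usr/local/share/tessdata}
    # Strip the .gz suffix to get the target file name
    name=$(basename "$archive" .gz)
    # -c writes to stdout, so the downloaded archive is left untouched
    gunzip -c "$archive" > "$dest/$name"
}
```

Writing to /usr/local/share/tessdata normally requires root privileges, and sudo cannot call a shell function directly. One option is to unpack into a directory you own and then copy the result over with sudo cp.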
Now run the OCR using:

# Tesseract appends .txt to the output base name, so this writes out.txt
tesseract phototest.tiff out -l eng
more out.txt
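If you have more than one scan, the same command can be wrapped in a loop. This is just a sketch under a few assumptions: ocr_dir is a hypothetical helper, the images are assumed to have a .tif or .tiff extension, and the TESSERACT variable is my own addition so that the loop can be dry-run without touching any files.

```shell
# ocr_dir DIR [LANG]: run Tesseract on every .tif/.tiff file in DIR.
# (Hypothetical helper; LANG defaults to eng.)
ocr_dir() {
    dir=$1
    lang=${2:-eng}
    for img in "$dir"/*.tif "$dir"/*.tiff; do
        [ -e "$img" ] || continue      # skip glob patterns that matched nothing
        base=${img%.*}                 # output base: image path without extension
        # Tesseract appends .txt to the output base, giving e.g. scan1.txt
        ${TESSERACT:-tesseract} "$img" "$base" -l "$lang"
    done
}
```

For example, ocr_dir scans pol would process everything in a directory called scans with the Polish language data, while TESSERACT=echo ocr_dir scans pol only prints the arguments that would be passed.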

I hope this is helpful.