Comparison of FineReader and Tesseract OCR engines – report

Today we published a report comparing the FineReader and Tesseract OCR engines. Both tools were tested on Polish historical documents (printed before 1850) from various Polish digital libraries. The comparison covered both gothic and antiqua documents, as well as noisy and clean images. Both engines were appropriately trained before the comparison was conducted.

The OCR results show no single winner: neither engine consistently outperforms the other. We nevertheless tried to point out the differences between FineReader and Tesseract, along with their advantages and disadvantages. We invite you to read the report for details of our approach and the results obtained.
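
The evaluation method actually used is described in the report itself. Purely as an illustration, one common way to score OCR output against ground truth is character accuracy derived from the edit (Levenshtein) distance. The sketch below is an assumption made for this example, not the procedure from the report; the function names and the sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Share of ground-truth characters recognised correctly, clipped at 0."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    # Hypothetical sample strings, not taken from the IMPACT dataset.
    truth = "Kazanie na niedziele"
    recognised = "Kazanie na niedzicle"  # one substituted character
    print(f"{character_accuracy(truth, recognised):.2%}")
```

Running the same function over the output of each engine for the same ground-truth page would give one figure per engine that can be compared directly.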

All test cases are based on the ground truth data produced within the IMPACT project. The comparison itself was part of the pilot work conducted in the course of the IMPACT project extension in the first half of 2012. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

The full report is available for download on the PSNC Digital Libraries Team website dedicated to the IMPACT project results.

New textual resources developed within the IMPACT project

Today we made available new textual resources from the IMPACT project Polish Digital Libraries dataset. The new resources comprise 478 pages of full text with coordinates at the region, line, word and glyph levels. This is important textual material for research, especially on optical character recognition algorithms. The quality of the developed resources is approx. 99.95%. All of them are available for download at http://dl.psnc.pl/activities/projekty/impact/results/.

These resources were developed for the pilot work carried out by Poznań Supercomputing and Networking Center in the course of the IMPACT project. The pilot concerned the comparison of two well-known OCR engines: FineReader 10 CE and Tesseract 3.0.

IMPACT event: Project Outcomes

We would like to invite you to attend the IMPACT event “Project Outcomes” on 26 June 2012, which will take place at the KB National Library of the Netherlands in The Hague. At this event, the IMPACT project outcomes will be presented by IMPACT staff, along with results of several pilots that have been conducted with some of the tools at IMPACT libraries in early 2012.
The IMPACT project (January 2008 – June 2012) is a European research project focused on innovating OCR software and language technology to improve the digitisation of historical printed text. IMPACT is led by the KB National Library of the Netherlands. Our group of partners includes several major European national libraries, universities, research centres and two private sector companies (ABBYY and IBM Haifa). IMPACT recently launched the IMPACT Centre of Competence (www.digitisation.eu), a productive network of experts in digitisation that will build on the research and development of partners from the IMPACT project and continue to improve access to text.

At the end of the project in June 2012, IMPACT is presenting the following results:

    • The improved commercial OCR engine ABBYY FineReader 10 (the IMPACT FineReader)
    • IBM’s Adaptive OCR engine with the CONCERT tool for OCR correction
    • Computer lexica for 9 European languages and tools for lexicon building
    • A digitisation framework for demonstrating and evaluating tools and results
    • An invaluable dataset which can foster further research activities
    • The Functional Extension Parser capable of decoding layout elements of books
    • A postcorrection tool with text and error profiler
    • Novel approaches to preprocessing and OCR for future development
    • The IMPACT Centre of Competence for digitisation
Attendance at this event is free of charge, but we kindly ask you to register in advance through http://impactocr.eventbrite.com/. The programme will be made available through this page in the near future.



Full text versions of Polish historical documents available for download!

Activities performed by the PSNC Digital Libraries Team within the IMPACT project resulted in a set of full text versions of selected Polish historical documents from four digital libraries in Poland. Altogether 4,693 files were processed; the corresponding full text versions contain 6,890,677 characters. The master files total around 16.5 GB, the full text around 300 MB, and the full text with additional information around 700 MB.

Details of the data, and the data available for download, can be accessed via the IMPACT results website dedicated to PSNC Digital Libraries Team activities.

PSNC joined the European IMPACT project

At the beginning of February 2010, PSNC joined the European IMPACT (Improving Access to Text) project. The PSNC Digital Libraries Team is responsible for the work assigned to PSNC within the project.

IMPACT is a four-year project (2008-2012) financed under the EU 7th Framework Programme. In 2010 the project entered its second phase, which extended the IMPACT consortium with new partners from France, Spain and Poland. IMPACT aims to significantly improve access to historical text and to provide all European players, such as libraries and cultural institutions, but also companies, decision-making bodies and funding agencies, with high-level information on the mass digitisation and transformation of historical texts.

IMPACT is coordinated by the National Library of the Netherlands; the sub-project leaders are the British Library, the University of Innsbruck and the Austrian National Library.

Poland is represented in the project by the PSNC Digital Libraries Team (PSNC DLT) and the Department of Formal Linguistics of the University of Warsaw (DFL UW). PSNC DLT is responsible for coordinating the work led by the Polish partners, for demonstration work on Polish historical documents using the OCR and information retrieval tools created within the IMPACT project, for leading dissemination activities in Poland, and for supporting the project in building a network of competence centres. DFL UW is responsible for the language work on Polish historical documents, namely building historical lexicons intended to improve OCR and information retrieval on historical documents.

More information can be found at the IMPACT project home page – http://www.impact-project.eu/