Tag Archives: ocr

New version of the Virtual Transcription Laboratory portal

A few days ago a new version of the Virtual Transcription Laboratory portal was released (http://wlt.synat.pcss.pl). It brings new functions and conveniences, and all errors reported by users have been fixed.

The most important changes are:

  • the transcription editor supports multi-column documents, e.g. newspapers (this option is available in new projects),
  • a line verification mechanism was added: each line now carries information on whether it has been reviewed,
  • the import of TIFF files was sped up,
  • transcription results can be downloaded in text format,
  • the transcription editor has a link to preview the whole scan,
  • line numbers were added to the transcription editor view,
  • a line can be moved to a given position in the transcription view (by entering its number),
  • an e-mail is sent to the project owner when batch OCR finishes,
  • information about the author of each change was added to the history view,
  • the project author is now an optional field in the new project form.

A detailed release note with all changes and adjustments can be found here.

Apart from the changes mentioned above, we have launched a suggestions and improvements forum where you can propose improvements to VTL and vote for other users’ ideas (it is available here). You can enter the forum using the orange “Your suggestion” tab, which appears in the upper right corner of the VTL site. We strongly encourage you to submit your ideas and vote for those already on the forum. Further work on VTL will focus on the functions that appeal most to users.

New functions of the Virtual Transcription Laboratory portal

Source: http://pl.wikipedia.org/wiki/Plik:Escribano.jpg

We are pleased to announce the release of the newest version of the Virtual Transcription Laboratory (http://wlt.synat.pcss.pl).

This version contains new functions and fixes for errors reported by users. The most important changes are:

  • the possibility to export the results of work as an EPUB file,
  • sharing a project only with chosen VTL users,
  • support for scans in TIFF format (after uploading they are automatically converted to PNG at 300 DPI),
  • changes in the transcription editor dialogue,
  • a number of corrections in the output hOCR files.

A full list of changes with screenshots can be found on our wiki:
https://confluence.man.poznan.pl/community/display/WLT/Note+about+release+from+2013-03-25

Next stage in beta testing of VTL

On Friday, 15 February 2013, we released a number of new functions and improvements in the Virtual Transcription Laboratory portal (http://wlt.synat.pcss.pl).

These are the most prominent ones:

  • a noticeable improvement in the performance and stability of the whole portal,
  • a change in the way the transcription editing history is stored,
  • import of existing DjVu publications on the basis of an OAI identifier (this feature is described in the end-user documentation),
  • batch OCR for all files in a project,
  • notifications showing whether changes made in the transcription editor were saved,
  • many minor improvements and fixes for bugs reported by users,
  • the first version of the user documentation has been published (http://confluence.man.poznan.pl/community/display/WLT).

A few months have passed since the BETA release of VTL. We would like to thank everyone for their feedback ;-). After the initial release it became clear that serious changes had to be made in the portal engine. The most important one was the change in the way a transcription is represented and stored in the database. This was a major undertaking, but it resulted in a significant improvement in performance and stability.

In the near future two new functions will be added:

  • export of project results in EPUB format,
  • the possibility to upload TIFF files into the project (they will be automatically converted to PNG files at 300 DPI).

Authors of the post: Bogna Wróż, Adam Dudczak

Digitlab

Digitlab is a specially adapted operating system based on Ubuntu Linux. It was created to provide a complete system for digitising collections with free and widely available tools. The system is based on Ubuntu 12.04 LTS and was prepared with a tool called Remastersys. It can be downloaded as an ISO image and tried out by writing it to a USB flash drive or DVD, without having to install it on the computer.

Digitlab is one of the results of the ACCESS IT Plus project. The installed programs include, among others: ScanTailor (a tool for post-processing scans), gscan2pdf (scanning and creation of PDF and DjVu files, with support for Tesseract), magicktiler (a tool for creating Zoomify images), and the Tesseract OCR engine with support for Polish, German Gothic fonts and much more. Apart from the utilities mentioned above, users will also find example digital libraries built on DSpace, GreenStone and dLibra software. The full list of software installed in Digitlab is published here.

To try Digitlab, you need a medium of at least 4 GB and software such as Ubuntu Startup Disk Creator (Linux), Universal USB Installer (Windows) or any other tool that can create a bootable medium from the downloaded ISO image. In both of these programs, apart from writing the system image, you can create a persistent storage space in which all changes made by the user are saved when the system is booted from the flash drive. When creating this space, keep in mind that the larger it is, the longer the system takes to boot from the flash drive. Besides booting from a flash drive, Digitlab can also be installed on a computer and used as the main operating system.

The default language of the system is English. Several other languages are also installed: Croatian, Serbian, Greek, Albanian, Turkish and Polish. All applications that were not installed from system packages were placed in the “/usr/apps/” directory. The system can be successfully used for scholarships and daily work. The ISO image can be downloaded here.

The image used in this post comes from the journal Missye Katolickie. It was published in 1882 and is available in the Digital Library of Greater Poland.

New textual resources developed within the IMPACT project

Today we have made available new textual resources from the IMPACT project’s Polish Digital Libraries dataset. The new resources comprise 478 pages of full text with coordinates given at the region, line, word and glyph levels. This is important material for research, especially on optical character recognition algorithms. The quality of the developed resources is approx. 99.95%. All of them are available for download at http://dl.psnc.pl/activities/projekty/impact/results/.

These resources have been developed for the pilot work done by Poznań Supercomputing and Networking Center in the course of the IMPACT project. The pilot compared two well-known OCR engines: FineReader 10 CE and Tesseract 3.0.

IMPACT event: Project Outcomes

We would like to invite you to attend the IMPACT event “Project Outcomes” on 26 June 2012, which will take place at the KB National Library of the Netherlands in The Hague. At this event, the IMPACT project outcomes will be presented by IMPACT staff, along with results of several pilots that have been conducted with some of the tools at IMPACT libraries in early 2012.
The IMPACT project (January 2008 – June 2012) is a European research project focused on innovating OCR software and language technology to improve the digitisation of historical printed text. IMPACT is led by the KB National Library of the Netherlands. Our group of partners includes several major European national libraries, universities, research centres and two private sector companies (ABBYY and IBM Haifa). IMPACT recently launched the IMPACT Centre of Competence (www.digitisation.eu), a productive network of experts in digitisation that will build on the research and development of partners from the IMPACT project and continue to improve access to text.

At the end of the project in June 2012, IMPACT is presenting the following results:

    • The improved commercial OCR engine ABBYY FineReader 10 (the IMPACT FineReader)
    • IBM’s Adaptive OCR engine with the CONCERT tool for OCR correction
    • Computerlexica for 9 European languages and tools for lexicon building
    • A digitisation framework for demonstrating and evaluating tools and results
    • An invaluable dataset which can foster further research activities
    • The Functional Extension Parser capable of decoding layout elements of books
    • A postcorrection tool with text and error profiler
    • Novel Approaches to preprocessing and OCR for future development
    • The IMPACT Centre of Competence for digitisation
Attendance at this event is free of charge, but we kindly ask you to register in advance through http://impactocr.eventbrite.com/. The programme will be made available through this page in the near future.

 

 

Tesseract 3.0 installation on Ubuntu 10.10 server

Tesseract is an optical character recognition (OCR) engine originally developed by Hewlett Packard; in 2005 it was open sourced under the Apache license. Its development is now supported by Google. Version 3.0 was released in September 2010 and, among other things, adds support for the Polish language.

The wiki on the Tesseract website is a bit messy, which is why I decided to describe my experience with building and installing Tesseract 3.0. I was working on Ubuntu 10.10 server edition, deployed on a virtual machine created using Oracle VirtualBox.

First, I installed build-essential and autoconf:

sudo apt-get install build-essential
sudo apt-get install autoconf

The next step, according to the Tesseract wiki, is to install the dependencies:

sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
sudo apt-get install zlib1g-dev

Please note that the name of the zlib1g-dev package is misspelled in the wiki.

I tried to install the libleptonica package (Leptonica is also a required dependency) from the default Ubuntu repositories, but Tesseract’s ./configure script did not recognise that it was installed. To cope with that, I downloaded the sources of Leptonica 1.6.7 from its Google Code website and then followed the rather standard build process:

./configure
make
sudo make install
sudo ldconfig
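
To confirm that the linker cache now lists the freshly built library, a check along these lines may help. This is only a sketch: lib_in_cache is a helper name made up for illustration, and it assumes the shared library is installed under the name liblept.

```shell
# Hypothetical helper: succeeds if the dynamic linker cache lists
# a shared library whose file name starts with "<name>.so".
lib_in_cache() {
    ldconfig -p | grep "^[[:space:]]*$1\.so" > /dev/null
}

# Example use (assumes Leptonica installs as liblept):
#   lib_in_cache liblept && echo "Leptonica is visible to the linker"
```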

The next step was to download tesseract-3.00.tar.gz from the Tesseract project website. Uncompress the archive, go to the tesseract-3.0 directory and invoke:

./runautoconf
./configure

After invoking ./configure, you should check in config_auto.h whether the dependencies were recognised by the ./configure script. The header file should contain a #define for each of HAVE_LIBLEPT, HAVE_LIBPNG, HAVE_LIBTIFF, HAVE_LIBJPEG and HAVE_ZLIB.
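
This check can be scripted instead of reading the header by eye. The sketch below is one possible way to do it; check_defines is a hypothetical helper name (not part of Tesseract), and it only looks for active #define lines, so commented-out "#undef" entries count as missing.

```shell
# Hypothetical helper: report which dependency macros are defined
# in the given header. Returns non-zero if any of them is missing.
check_defines() {
    header="$1"
    missing=0
    for macro in HAVE_LIBLEPT HAVE_LIBPNG HAVE_LIBTIFF HAVE_LIBJPEG HAVE_ZLIB; do
        if grep -Eq "^[[:space:]]*#[[:space:]]*define[[:space:]]+${macro}([[:space:]]|\$)" "$header"; then
            echo "ok:      $macro"
        else
            echo "MISSING: $macro"
            missing=1
        fi
    done
    return "$missing"
}

# Example use, from the Tesseract source directory after ./configure:
#   check_defines config_auto.h
```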

make
sudo make install
sudo ldconfig

Without ldconfig you might experience problems with launching Tesseract.

Download the languages of your choice from the Tesseract website and place them (uncompress them first) in your tessdata folder (by default /usr/local/share/tessdata).
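
Before running the OCR it may be worth checking that the language file actually landed in the right place. A minimal sketch, assuming the uncompressed language data is a file named <lang>.traineddata (the naming used by Tesseract 3.0 language packs); lang_installed is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if <tessdata>/<lang>.traineddata
# exists and is not empty. The tessdata directory defaults to the
# location mentioned above.
lang_installed() {
    lang="$1"
    tessdata="${2:-/usr/local/share/tessdata}"
    [ -s "$tessdata/$lang.traineddata" ]
}

# Example use:
#   lang_installed pol || echo "Polish language data not found"
```
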
Now run the OCR using:

# Tesseract appends .txt to the output name, so "out" produces out.txt
tesseract phototest.tiff out -l eng
more out.txt
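
To process a whole directory of scans in one go, the single command can be wrapped in a loop. This is a sketch under a few assumptions: the scans are TIFF files with a .tif or .tiff extension, the text output should land next to each image, and ocr_dir is a made-up helper name.

```shell
# Hypothetical helper: run Tesseract on every TIFF file in a directory.
# Usage: ocr_dir <directory> [language]   (language defaults to eng)
ocr_dir() {
    dir="$1"
    lang="${2:-eng}"
    for img in "$dir"/*.tif "$dir"/*.tiff; do
        # Skip patterns that matched no file
        [ -e "$img" ] || continue
        # Output base is the image path without its extension;
        # Tesseract adds .txt itself
        tesseract "$img" "${img%.*}" -l "$lang"
    done
}

# Example use:
#   ocr_dir ./scans pol
```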

Hope that this will be helpful.