Tag Archives: SYNAT

New version of the Virtual Transcription Laboratory portal

A few days ago, a new version of the Virtual Transcription Laboratory portal was deployed (http://wlt.synat.pcss.pl). It adds several new functions and conveniences and fixes all errors reported by users.

The most important changes:

  • the transcription editor supports multi-column documents, e.g. newspapers (this option is available in new projects),
  • a line verification mechanism was added: each line now carries information on whether it has been reviewed,
  • the import of TIFF files was sped up,
  • transcription results can be downloaded in text format,
  • the transcription editor has a link to preview the whole scan,
  • line numbers were added to the transcription editor view,
  • lines can be moved to a given position in the transcription view (by entering its number),
  • an e-mail is sent to the project owner when the batch OCR finishes,
  • the author of each change is shown in the history view,
  • the project author is now an optional field in the new project form.

A detailed release note listing all changes and fixes can be found here.

Apart from the changes mentioned above, we have launched a suggestions and improvements forum where you can propose improvements to VTL and vote for other users’ ideas (it is available here). You can enter the forum through the orange “Your suggestion” tab in the upper right corner of the VTL site. We strongly encourage you to report your ideas and vote for those already on the forum. Further work on VTL will focus on the functions that users find most appealing.

New functions of the Virtual Transcription Laboratory portal

Source: http://pl.wikipedia.org/wiki/Plik:Escribano.jpg

We are glad to announce the release of the newest version of the Virtual Transcription Laboratory (http://wlt.synat.pcss.pl).

This version contains new functions and fixes for errors reported by users. The most important changes are:

  • export of project results to an EPUB file,
  • sharing a project only with chosen VTL users,
  • support for scans in TIFF format (after upload they are automatically converted to PNG at 300 DPI),
  • changes in the transcription editor dialogue,
  • a number of corrections in the output hOCR files.

The full list of changes with screenshots can be found on our wiki:
https://confluence.man.poznan.pl/community/display/WLT/Note+about+release+from+2013-03-25

Next stage in beta testing of VTL

On Friday, 15 February 2013, we released a number of new functions and improvements in the Virtual Transcription Laboratory (http://wlt.synat.pcss.pl) portal.

These are the most prominent ones:

  • a noticeable improvement in the performance and stability of the whole portal,
  • a change in the way the transcription editing history is stored,
  • import of an existing DjVu publication on the basis of its OAI identifier (this feature is described in the end-user documentation),
  • batch OCR for all files in a project,
  • notifications showing whether changes made in the transcription editor were saved,
  • many minor improvements and fixes for bugs reported by users,
  • the first version of the user documentation has been published (http://confluence.man.poznan.pl/community/display/WLT).

A few months have passed since the BETA release of VTL. We would like to thank everyone for their feedback ;-). After the initial release it became clear that serious changes had to be made in the portal engine. The most important was the change in the way a transcription is represented and stored in the database. This was a major change, but it resulted in a significant improvement in performance and stability.

In the near future two new functions will be added:

  • export of project results in EPUB format,
  • the possibility to upload TIFF files into the project (they will be automatically converted to PNG files at 300 DPI).

Authors of the post: Bogna Wróż, Adam Dudczak

Culture 2.0: Digital Archives – Tool Shop

On October 26-27, the Polish National Audiovisual Institute (NiNA) organized the annual Culture 2.0 conference/festival. The attendees were given a chance to participate in a range of lectures, workshops, games and demonstrations. The full conference programme is available at the conference website. PSNC was the event’s partner, Platon TV recorded the event, and the Digital Libraries Team was responsible for operating the “Digital Archives Tool Shop”.

What was the idea behind the Tool Shop? Digital libraries, archives and museums are usually associated with big institutions and their priceless, historical collections. But each one of us can stumble upon some family mementoes – old documents, photos or postcards – hidden in a long-forgotten drawer, but worthy of preservation and display. Our goal was to show the visitors how to create a digital (e.g. family) archive using widely available tools: a simple scanner and camera, open source software, and how to make it accessible online in accordance with digital librarianship canons and guidelines. The Tool Shop consisted of three stands: “Scanning and Processing”, “Tran2|>rip>ion”, and “Let Everyone See!”

Photo: Justyna Walkowska

At the first stand (Scanning and Processing) the visitors were invited to digitize materials they brought from home. We presented the scanning process and its result, and explained how the quality of the result can be improved after it finds its way to a computer disk. All tasks at this stage were performed using the DigitLab system, with tools such as ScanTailor, gScan2PDF, Tesseract or SimpleScan. We treated the task as a kind of exam for DigitLab, and we think it passed with flying colours. Direct contact with users is a great opportunity for every tool creator. The comments and suggestions we received will be reflected in the next release of the system.

Stand no. 2, the one with the peculiar name (Tran2|>rip>ion), presented the “Virtual Transcription Laboratory”. VTL is a portal which allows users to create full-text versions (transcriptions) of textual documents. As demonstrated at the first stand, the result of a digitization process is a graphics file – the digital representation of the document that was scanned. However, it does not contain the text of the document in a form a computer can process. The textual contents are necessary to create effective search mechanisms, to enhance the document’s visibility online, and to open new research possibilities. Using VTL, the conference visitors were able to automatically convert their scans to digital text in the OCR (Optical Character Recognition) process. VTL also makes it possible for users to co-edit the automatically recognized text, correcting any errors made by the program. VTL brings together automatic and crowdsourcing methods, thanks to which librarians, researchers and hobbyists can create high-quality textual representations of historical documents.

At the last stand the visitors were encouraged to make their digitized resources available online in the same way professional librarians do. We taught them how to create a private archive using tools such as Omeka, and also presented the publication process of the biggest Polish digital libraries (which mostly use our dLibra software). As the next step, we explained how to check who is linking to our online resources and how to monitor their usage with free tools. A significant number of visitors had not heard about the Digital Libraries Federation or Europeana, so we put some effort into describing those portals’ functions and goals.

For us this event offered a priceless opportunity to test our solutions in direct interactions with the users. Those were two very busy days, and unfortunately we did not have much time to participate in lectures or workshops happening in the same place. We did manage to look around Level 2.0 (that is, the 2nd floor, on which we were located), where different installations were presented. One of our favourites was Waldemar Węgrzyn’s Electrolibrary, in which a traditional book was used as the interface to an enriched electronic version.


This short post is far from being a complete description of what conference participants were able to see. We hope that the recorded lectures will be made available soon, giving us the chance to catch up and see what we missed. 😉

First Polish THATCamp

The first Polish THATCamp will take place on 24-25 October 2012, alongside the “Zwrot Cyfrowy w humanistyce Internet Nowe Media-Kultura 2.0” conference in Lublin. The event is organized by the Polish THATCamp coalition and will be held at the headquarters of the NN Theater in the Old Town in Lublin (Grodzka 21). Poznań Supercomputing and Networking Center is an official partner of the event.

THATCamp (The Humanities And Technology Camp, http://www.thatcamp.org) is a series of meetings, organized all over the world, for people interested in the use of new technologies in the humanities, sociology, and the activities of academic and artistic institutions (universities, galleries, archives, libraries and museums). Participation in these events is free.

THATCamp dates back to 2008, when it was organized for the first time in the USA by the Center for History and New Media (CHNM) at George Mason University.

More information about the event can be found here (in Polish).

Post authors: Bogna Wróż, Adam Dudczak

The Europeana Libraries and TEL Joint Meeting in Bucharest

The Italian Church in Bucharest

The joint meeting of the Europeana Libraries project and TEL (The European Library) was held in Bucharest on 21-23 May. The theme of the meeting was “Looking to the future: how do we place our service at the heart of Europe’s research communities?” Some videos from the meeting are available here.

PSNC participates (as an external expert) in Work Package 5 of the Europeana Libraries project, whose main objective is to “enhance the searchability of existing library-domain content in Europeana by defining transformations from ESE metadata to EDM and establishing best practice taking account of the different types of library contributing to Europeana”.

Internally, the Europeana portal is switching to a new data representation schema, called EDM (Europeana Data Model). The main difference between EDM and the previously used ESE (Europeana Semantic Elements) is that EDM is an ontology-based format, more oriented towards the Semantic Web and Linked Open Data. EDM makes a clear distinction between the physical resource (e.g. a painting, or an old print), called the Provided Cultural Heritage Object, and the so-called Web Resource, which is a digital representation of the object, possibly one of many. This distinction has not always been clear in ESE. Also, ultimately EDM should become an event-oriented ontology, similar in this respect to CIDOC CRM.

For libraries, EDM poses somewhat different challenges than for museums. One of the first questions to decide was whether the Provided Cultural Heritage Object is the Item (a physical copy of a book) or the Expression (an abstract edition of a book), to use FRBR terms.

Before the meeting, institutions participating in WP5 were asked to prepare a manual mapping of a number of chosen metadata records from their collections to EDM. The goal of this exercise was to reveal doubts or inconsistencies in the mapping or in the EDM libraries profile (with separate profiles for monographs and series), thus validating the profile. In the next step, TEL will prepare automatic mapping descriptions for the provided internal library formats, and test them against new metadata records and the new metadata aggregation infrastructure.

PSNC’s participation in the Europeana Libraries project is related to the development of methods for semantic integration of cultural heritage objects’ metadata, which is part of stage A10 of the SYNAT project.

“CIDOC 2011 – Knowledge Management and Museums” Conference

The “CIDOC 2011 – Knowledge Management and Museums” conference took place in Sibiu in Romania on September 4-9, 2011. The conference is an annual event, organized by ICOM-CIDOC, that is the Committee for Documentation at the International Council of Museums.

The conference participants came from very different but cooperating communities: museologists, librarians, programmers and museum software vendors, researchers in the field of ontologies and the Semantic Web, and also people and institutions concerned with museum documentation standards.

The conference included meetings of CIDOC working groups:

  • Archaeological Sites
  • Conceptual Reference Model Special Interest Group
  • Co-reference
  • Data Harvesting and Interchange
  • Digital preservation
  • Documentation Standards
  • Information Centres
  • Multimedia
  • Transdisciplinary Approaches in Documentation

A number of topics raised at the conference are closely connected with PSNC’s work in the SYNAT project. The most prominent ones were:

  • LIDO (Lightweight Information Describing Objects) specification (www.lido-schema.org/) for description of museum resources made available online
  • recommendation to use persistent, unique identifiers (URIs) of museum resources
  • FRBRoo ontology which merges CIDOC CRM and FRBR (Functional Requirements for Bibliographic Records) to properly describe digital resources online (www.nla.gov.au/lis/stndrds/grps/acoc/tillett2004.ppt, http://www.frbr.org/categories/frbroo)
  • Wiss-ki system presentation (http://wiss-ki.eu/, http://www8.informatik.uni-erlangen.de/transdisc/hohmann_cidoc09_wisski-2.pdf). The goals and assumptions of the project are very close to those of SYNAT. Some of the solutions already used there might be applicable in SYNAT.

The next CIDOC conference will take place in June 2012 in Helsinki. Additionally, the CIDOC “summer school” for people taking care of museum documentation is planned for the holiday period of 2012.

Open Repositories 2011 conference

The Open Repositories 2011 conference was held on June 6-11. It is an important international event for exchanging information about the development, management and application of digital repositories.

Over 300 participants from over 20 countries had the opportunity to hear lectures by such prominent representatives of the IT and digital libraries communities as Jim Jagielski and Clifford Lynch. Conference sessions were dedicated to various topics related to digital repositories, including the Semantic Web, tools and standards, long-term preservation and social networks.

The conference opening speech was given by Jim Jagielski, president of the Apache Software Foundation. He described how open-source communities organise themselves and collaborate. He underlined that open-source projects are developed mainly by volunteers, and that the key aspect of cooperation is trust between project participants. Bradley McLean from DuraSpace identified key trends for the future of digital repositories: mobile technologies, long-term preservation, cloud computing, and mashups. Richard Rodgers from M.I.T. Libraries presented the ORCID initiative, whose aim is to create a central registry of researchers to solve the problem of author ambiguity.

Many tools, systems and initiatives related to digital repositories were also presented at the conference: Memento, Hathi Trust, DAR, FITS, OTS-Schemas, BatchBuilder, ReDBox and Mint, Exhibit, Fascinator, Recollection, SWORD, CUPID.

At the conference, Tomasz Parkoła from PSNC presented a poster describing the concept of building multiple virtual digital repositories mapped over collections of a shared digital library. Digital repositories are currently an important new trend in the network of Polish digital libraries. The main aim is to increase the online visibility of contemporary Open Access research works. This kind of activity is also supported by the Integrated Knowledge System developed by PSNC within the SYNAT project.

Tesseract 3.0 installation on Ubuntu 10.10 server

Tesseract is an optical character recognition (OCR) engine originally developed by Hewlett Packard; in 2005 it was open sourced under the Apache license. Its development is now supported by Google. Version 3.0 was released in September 2010; among other things, this version offers support for the Polish language.

The wiki on the Tesseract website is a bit messy, which is why I decided to describe my experience with building and installing Tesseract 3.0. I was working on Ubuntu 10.10 server edition, deployed on a virtual machine created using Oracle VirtualBox.

First, I installed build-essential and autoconf:

sudo apt-get install build-essential
sudo apt-get install autoconf

The next step, according to the Tesseract wiki, is to install the dependencies:

sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
sudo apt-get install zlib1g-dev

Please note that the name of the zlib1g-dev package is misspelled in the wiki.

I tried to install the libleptonica package (Leptonica is also a required dependency) from the default Ubuntu repositories, but Tesseract’s ./configure script did not recognize that it was installed. To cope with that, I downloaded the sources of Leptonica 1.6.7 from its Google Code website and then followed the rather standard build process:

./configure
make
sudo make install
sudo ldconfig

The next step was downloading tesseract-3.00.tar.gz from the Tesseract project website. Uncompress the archive, go to the tesseract-3.0 directory and invoke:

./runautoconf
./configure

After invoking ./configure you should check in config_auto.h whether the dependencies were recognised by the ./configure script. The header file should contain a #define for each of HAVE_LIBLEPT, HAVE_LIBPNG, HAVE_LIBTIFF, HAVE_LIBJPEG and HAVE_ZLIB.
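
Checking the header by eye is easy to get wrong, so here is a minimal sketch of how the check could be scripted. The function name check_defines is my own invention, not part of Tesseract, and I am assuming that recognised dependencies appear as plain "#define NAME" lines while unrecognised ones are commented out or absent:

```shell
#!/bin/sh
# check_defines is a hypothetical helper, not part of the Tesseract sources.
# It reports which dependency macros are defined in a configure-generated header.
# Usage: check_defines path/to/config_auto.h
check_defines() {
    header="$1"
    missing=0
    for macro in HAVE_LIBLEPT HAVE_LIBPNG HAVE_LIBTIFF HAVE_LIBJPEG HAVE_ZLIB; do
        # Match an active "#define MACRO" line; commented-out lines do not match.
        if grep -Eq "^[[:space:]]*#[[:space:]]*define[[:space:]]+${macro}([[:space:]]|\$)" "$header"; then
            echo "found:   $macro"
        else
            echo "MISSING: $macro"
            missing=1
        fi
    done
    # Non-zero exit status when at least one macro is missing.
    return "$missing"
}
```

Run it as check_defines config_auto.h in the Tesseract source directory. If any line says MISSING, install the corresponding library and rerun ./configure before going further.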

make
sudo make install
sudo ldconfig

Without ldconfig you might experience problems with launching Tesseract.

Download the language data of your choice from the Tesseract website and place the files (uncompress them first) in your tessdata folder (by default /usr/local/share/tessdata).
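
If you want to make sure a language really landed in the right place, a small check like the one below might help. It is only a sketch: has_language is a made-up helper name, and it assumes that the data file for a language is named LANG.traineddata, which is how the 3.x language packages I downloaded were laid out:

```shell
# has_language is a hypothetical helper, not a Tesseract command.
# It checks that LANG.traineddata exists in a tessdata directory.
# Usage: has_language LANG [TESSDATA_DIR]
has_language() {
    lang="$1"
    # Default location taken from this post; adjust if you used another prefix.
    tessdata="${2:-/usr/local/share/tessdata}"
    if [ -f "$tessdata/$lang.traineddata" ]; then
        echo "ok: $tessdata/$lang.traineddata"
    else
        echo "missing: $tessdata/$lang.traineddata" >&2
        return 1
    fi
}
```

For example, has_language pol checks the default folder for the Polish data file.
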
Now run the OCR using:

# Tesseract appends .txt to the output name, so this writes out.txt
tesseract phototest.tiff out -l eng
more out.txt
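
For more than one scan, the same command can be wrapped in a loop. The following is a rough sketch rather than anything from the Tesseract documentation: ocr_dir is a name I made up, and it assumes your scans have the .tiff extension and that the tesseract binary is on your PATH:

```shell
# ocr_dir is a hypothetical helper that runs tesseract on every .tiff file
# in a directory and writes NAME.txt next to each NAME.tiff.
# Usage: ocr_dir DIRECTORY [LANG]
ocr_dir() {
    dir="$1"
    lang="${2:-eng}"
    for image in "$dir"/*.tiff; do
        # Skip the unexpanded pattern when the directory has no matching files.
        [ -e "$image" ] || continue
        # Tesseract appends .txt to the output base, so strip the extension.
        tesseract "$image" "${image%.tiff}" -l "$lang" || echo "failed: $image" >&2
    done
}
```

Calling ocr_dir scans pol would process every TIFF file in the scans directory using the Polish language data. A failed file is reported on standard error and the loop carries on with the next one.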

I hope this is helpful.