All posts by Tomasz Parkoła

Open Repositories 2012 conference

The Open Repositories 2012 conference and its accompanying workshops were held in Edinburgh (Scotland) on 9-13 July 2012. The rich conference and workshop programme, as well as the large number of participants, confirms the importance of the Open Repositories series.
The workshops consisted of several sessions. Especially interesting in the context of digital libraries were those related to data and text mining and to long-term preservation. The data mining workshops mainly covered search engines, semantic search, metadata and data aggregation, information extraction from texts, and text-oriented workflow systems. Various topics and systems were presented, including:

Various tools were used during the development of the above systems, e.g. TextCat (http://odur.let.rug.nl/vannoord/TextCat/), U-Compare (http://u-compare.org/), OSCAR4 (https://bitbucket.org/wwmm/oscar4/wiki/Home), ANTLR (http://www.antlr.org/), MAUI (http://code.google.com/p/maui-indexer/), KEA (http://www.nzdl.org/Kea/), Sesame (http://www.openrdf.org/index.jsp), H2 (http://www.h2database.com/).

The workshops on long-term preservation focused mainly on the Trident system and its capabilities. They presented the most important aspects of long-term preservation, including the identification of files that should be migrated or normalised, as well as tools that can be used to create a long-term preservation workflow (Kepler (https://kepler-project.org/), Taverna (http://www.taverna.org.uk/), Ptolemy II (http://ptolemy.eecs.berkeley.edu/ptolemyII/), Triana (http://www.trianacode.org/)).
The conference itself lasted three days. Various topics were raised and a number of interesting papers were presented, e.g.:
  • “Build to scale” – a presentation that shows how to build a search system based on Apache Solr that handles 250M records and returns results in 2 seconds or less.
  • “Inter-repository Linking of Research Objects with Webtracks” – a presentation that describes the InteRCom protocol for exchanging semantic information between repositories.
  • “ResourceSync: Web-based Resource Synchronization” – a presentation of a protocol for data synchronisation. It is based on experience from the OAI-PMH and OAI-ORE protocols.
  • “Griffith’s Research Data Evolution Journey: Enabling data capture, management, aggregation, discovery and reuse.” – a description of the research infrastructure at Griffith University, including semantic tools such as VIVO (http://sourceforge.net/apps/mediawiki/vivo/) and VITRO (http://vitro.mannlib.cornell.edu/).
  • “Multivio, a flexible solution for in-browser access to digital content” – a presentation that describes a multi-purpose viewer for PDF, GIF, JPEG and PNG that can understand Dublin Core, MARC21, MODS and METS.
  • “ORCID update and why you should use ORCIDs in your repository” – a presentation that shows the current status of ORCID (http://about.orcid.org/), a system for researcher identification.
  • “Digital Preservation Network, Saving the Scholarly Record Together” – a presentation on an initiative among several institutions in the USA focused on building a heterogeneous system for long-term preservation (http://d-p-n.org/).
During the conference a representative of Poznań Supercomputing and Networking Center presented a paper entitled “dArceo services: advancing long-term preservation” and described long-term preservation services that focus on text, image and audiovisual content and are dedicated to Polish scientific and cultural heritage institutions. We invite you to visit the OR2012 website (http://or2012.ed.ac.uk/) and view the available presentations.

Europeana Newspapers project – survey

The Europeana Newspapers project has published a survey to learn about the digitised newspapers available in Europe. The survey is aimed at institutions that are not currently part of the Europeana Newspapers project. As stated on the Europeana Newspapers website, the survey has three purposes:

  1. To get a clear idea of the extent of newspaper digitisation within Europe
  2. To record the relevant metadata in the Berlin State Library’s Catalogue of Serials (http://www.zeitschriftendatenbank.de/) and as part of the central index of newspapers being created by The European Library (http://www.theeuropeanlibrary.org/)
  3. To help locate 10 additional partners to join the project

The survey is available at http://www.surveymonkey.com/s/BQ28579 and is open until 31 July 2012.

SEEDI conference – summary

At the end of May 2012 the 7th SEEDI (South-Eastern European Digitisation Initiative) conference was held in Ljubljana, Slovenia. The conference addresses issues related to digitisation and focuses on the south-eastern part of Europe.

Many interesting talks were given during the conference. The keynote speaker, Jill Cousins (Executive Director of the Europeana Foundation), introduced a new vision of the Europeana portal, which will be based on cloud computing technologies and focused on users' needs and an improved user experience. Other sessions covered public-private partnerships in digitisation, based on the experience gained by the Austrian National Library; the conclusions and crucial aspects of such projects were described. A representative of the Czech National Library presented their approach to coordinating mass digitisation projects and avoiding duplication of digitisation efforts. Various digitisation projects were presented, including those carried out in Serbia, Croatia and Slovenia. The topic of digitisation workflow management and long-term preservation was tackled by a representative of Poznań Supercomputing and Networking Center, who introduced the dLibra and dMuseion tools as well as the dLab tool for digitisation workflow management and the dArceo services for long-term preservation. OCR challenges were also presented, including the problems with OCR of historical documents such as old printed books or even manuscripts.

The conference program with additional information is available at: http://www.nuk.uni-lj.si/nukeng4.asp?id=483558290

New textual resources developed within the IMPACT project

Today we have made available new textual resources from the IMPACT project's Polish Digital Libraries dataset. The new resources include 478 pages of full text with coordinate details down to the region, line, word and glyph levels. This is important textual material for research, especially on optical character recognition algorithms. The quality of the developed resources is approx. 99.95%. All of them are available for download at http://dl.psnc.pl/activities/projekty/impact/results/.

These resources were developed for the pilot work carried out by Poznań Supercomputing and Networking Center in the course of the IMPACT project. The pilot compared two well-known OCR engines: FineReader 10 CE and Tesseract 3.0.
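
The post does not say how the two engines were compared or how the quality figure above was obtained. As an illustration only, one common way to score OCR output against ground-truth text is character accuracy based on edit distance. The sketch below assumes that metric; the function names and the sample strings are hypothetical and are not taken from the IMPACT pilot.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Share of ground-truth characters recognised correctly (assumed metric)."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(ground_truth, ocr_output)
    # More errors than characters would give a negative value, so clamp at 0.
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    # Hypothetical sample: one ground-truth line and two imagined engine outputs.
    truth = "Biblioteka cyfrowa"
    outputs = {
        "engine A": "Biblioteka cyfrowa",
        "engine B": "Bibliotcka cyfrcwa",
    }
    for name, text in outputs.items():
        print(f"{name}: {character_accuracy(truth, text):.2%}")
```

Running the same function over every ground-truth page and averaging, weighted by page length, would give a single figure per engine that can be compared directly.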

IMPACT event: Project Outcomes

We would like to invite you to attend the IMPACT event “Project Outcomes” on 26 June 2012, which will take place at the KB National Library of the Netherlands in The Hague. At this event, the IMPACT project outcomes will be presented by IMPACT staff, along with results of several pilots that have been conducted with some of the tools at IMPACT libraries in early 2012.
The IMPACT project (January 2008 – June 2012) is a European research project focused on innovating OCR software and language technology to improve the digitisation of historical printed text. IMPACT is led by the KB National Library of the Netherlands. Our group of partners includes several major European national libraries, universities, research centres and two private sector companies (ABBYY and IBM Haifa). IMPACT recently launched the IMPACT Centre of Competence (www.digitisation.eu), a productive network of experts in digitisation that will build on the research and development of partners from the IMPACT project and continue to improve access to text.

At the end of the project in June 2012, IMPACT is presenting the following results:

    • The improved commercial OCR engine ABBYY FineReader 10 (the IMPACT FineReader)
    • IBM’s Adaptive OCR engine with the CONCERT tool for OCR correction
    • Computer lexica for 9 European languages and tools for lexicon building
    • A digitisation framework for demonstrating and evaluating tools and results
    • An invaluable dataset which can foster further research activities
    • The Functional Extension Parser capable of decoding layout elements of books
    • A postcorrection tool with text and error profiler
    • Novel approaches to preprocessing and OCR for future development
    • The IMPACT Centre of Competence for digitisation
Attendance at this event is free of charge, but we kindly ask you to register in advance through http://impactocr.eventbrite.com/. The programme will be made available through this page in the near future.

Full text versions of Polish historical documents available for download!

The work carried out by the PSNC Digital Libraries Team within the IMPACT project resulted in a set of full-text versions of selected Polish historical documents from four digital libraries in Poland. Altogether 4,693 files were processed, and the corresponding full-text versions contain 6,890,677 characters. The master files take up around 16.5 GB, the full text around 300 MB, and the full text with additional information around 700 MB.

Details about the data, and the data available for download, can be found on the IMPACT results website dedicated to the activities of the PSNC Digital Libraries Team.

IMPACT Centre of Competence in 2012

Early in 2012 the IMPACT Centre of Competence will take its final shape and will be able to cooperate with and help everyone interested in mass digitisation. Because the initiative is an outcome of the European IMPACT project, the project partners play the key roles in setting up the centre.

Five of the IMPACT project partners declared their willingness to support the IMPACT Centre of Competence and therefore became premium members with representatives on the management board. The University of Alicante and the Miguel de Cervantes Digital Library are the leaders. Poznań Supercomputing and Networking Center, the National Library of the Netherlands, the National Library of France and the Institute for Dutch Lexicology are premium members. PSNC is also responsible for hosting the website of the IMPACT Centre of Competence.

We encourage you to participate in the IMPACT Centre of Competence, either as a leading premium member or as a basic member that benefits from the resources and tools developed within the IMPACT project. Details can be found on the official IMPACT Centre of Competence website: www.digitisation.eu.

IMPACT project conference – Digitisation & OCR

We would like to invite you to the final conference of the IMPACT project, “Digitisation & OCR: Better, faster, cheaper. Solutions of the IMPACT Centre of Competence and future challenges” that will take place on 24-25 October 2011 at the British Library in London. At this conference IMPACT will present the final project results, along with related research in the field of OCR and language technology.

This event will also mark the official launch of the IMPACT Centre of Competence. This Centre is focused on making digitisation of historical printed text in Europe better, faster, cheaper by sharing expertise and providing access to tools for all parts of the digitisation workflow, as well as tools, services and facilities for further advancement of the State of the Art in this field.

The programme for the conference is now online on the conference webpage. Highlights include: Khalil Rouhana (European Commission), Michael Fuchs (ABBYY Europe), Paul Fogel (California Digital Library), Clemens Neudecker (National Library of the Netherlands), Asaf Tzadok (IBM Haifa Research Lab), Majlis Bremer-Laamanen (National Library of Finland), Katrien Depuydt (INL), Klaus Schulz (University of Munich) and Stephen Krauwer (CLARIN coordinator, University of Utrecht).

More programme updates will be announced through the conference webpage and Twitter (hashtag: #impactconf2011).

Open Repositories 2011 conference

The Open Repositories 2011 conference was held on June 6-11. It is an important international event for exchanging information about the development, management and application of digital repositories.

Over 300 participants from over 20 countries had the opportunity to hear lectures by such prominent representatives of the IT and digital libraries communities as Jim Jagielski and Clifford Lynch. Conference sessions were dedicated to various topics related to digital repositories, including the semantic web, tools and standards, long-term preservation and social networks.

The opening speech was given by Jim Jagielski, president of the Apache Software Foundation, who described how open-source communities are organised and how they collaborate. He underlined that open-source projects are developed mainly by volunteers, and that the key aspect of cooperation is trust between project participants. Bradley McLean from DuraSpace identified key trends for the future of digital repositories: mobile technologies, long-term preservation, cloud computing, and mashups. Richard Rodgers from M.I.T. Libraries presented the ORCID initiative, whose aim is to create a central registry for researchers to solve the problem of author ambiguity.

Many tools, systems and initiatives related to digital repositories were also presented at the conference: Memento, Hathi Trust, DAR, FITS, OTS-Schemas, BatchBuilder, ReDBox and Mint, Exhibit, Fascinator, Recollection, SWORD, CUPID.

At the conference, Tomasz Parkoła from PSNC presented a poster describing the concept of building multiple virtual digital repositories mapped over the collections of a shared digital library. Digital repositories are currently an important new trend in the network of Polish digital libraries. The main aim is to increase the online visibility of contemporary Open Access research works. This kind of activity is also supported by the Integrated Knowledge System developed by PSNC within the SYNAT project.