All posts by Justyna Walkowska

Simple Visualisation of Small RDF Graphs

One of the things that we have created in the SYNAT project is an RDF knowledge base that contains information about resources from:

  • the Digital Libraries Federation,
  • NUKAT (a union catalogue of Polish academic libraries),
  • the National Museum in Warsaw,
  • and, soon, the National Museum in Kraków.

The metadata is obtained from the sources in their original formats (that is, PLMET, MARC 21, the Mona system format, and CDWA-Lite), and is transformed to the target format (CIDOC CRM / FRBRoo) using the jMet2Ont mapper and a set of rules expressed in XML.

The RDF data can be queried using the SPARQL language, viewed with a browser at the Linked Open Data endpoint (well, it is going to be open soon) and is processed by the SYNAT portal. Still, sometimes we want to look at the data after mapping to understand how a record has been mapped, especially when we detect that the resulting graph is disconnected.

There are a few RDF visualisation tools, but not all of them produce satisfying results. Here is how we visualise the graphs:

  1. The graph produced by the mapping process is serialized as RDF/XML. An example can be found here.
  2. A very simple XSLT transformation (available here) is run against the RDF/XML file, producing PlantUML source code. The transformation can be applied with a tool like Saxon.
  3. PlantUML is a simple tool to create UML diagrams based on text files. Here is the automatically generated PlantUML code.
  4. Finally, PlantUML is used to generate a diagram like this:
PlantUML-generated (class) diagram showing relation between RDF/XML entities

So, to go from a test.xml file to the test.png diagram:

Transform.exe -xsl:RDF2PUML.xsl -s:test.xml -o:test.puml

java -jar plantuml.jar  test.puml
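
For readers without an XSLT processor at hand, the same idea can be sketched in Python. This is not the RDF2PUML.xsl stylesheet mentioned above, only a minimal illustration under some assumptions: the RDF/XML is "flat" (each resource is a top-level element with rdf:about), blank nodes are skipped, and the function name, the sample data, and the choice of a PlantUML object diagram are all made up for this example.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def local_name(uri_or_tag: str) -> str:
    """Return the fragment after the last '#', '/' or '}' (used as a short label)."""
    return re.split(r"[#/}]", uri_or_tag.rstrip("/#"))[-1]


def rdfxml_to_plantuml(rdf_xml: str) -> str:
    """Turn flat RDF/XML into PlantUML object-diagram source (illustrative only)."""
    root = ET.fromstring(rdf_xml)
    aliases = {}  # resource URI -> short PlantUML alias (n1, n2, ...)

    def alias(uri: str) -> str:
        if uri not in aliases:
            aliases[uri] = f"n{len(aliases) + 1}"
        return aliases[uri]

    fields, arrows = [], []
    for node in root:
        subject = node.get(f"{{{RDF_NS}}}about")
        if subject is None:
            continue  # assumption: blank nodes are ignored in this sketch
        s = alias(subject)
        for prop in node:
            name = local_name(prop.tag)
            target = prop.get(f"{{{RDF_NS}}}resource")
            if target is not None:
                # Object property: draw a labelled arrow between two resources.
                arrows.append(f"{s} --> {alias(target)} : {name}")
            elif prop.text and prop.text.strip():
                # Literal property: show it as a field of the object.
                fields.append(f"{s} : {name} = {prop.text.strip()}")

    lines = ["@startuml"]
    lines += [f'object "{local_name(uri)}" as {a}' for uri, a in aliases.items()]
    lines += fields + arrows + ["@enduml"]
    return "\n".join(lines)


# Hypothetical sample data, not taken from the SYNAT knowledge base.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/vocab#">
  <rdf:Description rdf:about="http://example.org/book1">
    <ex:title>Heart of Darkness</ex:title>
    <ex:createdBy rdf:resource="http://example.org/conrad"/>
  </rdf:Description>
</rdf:RDF>"""

if __name__ == "__main__":
    print(rdfxml_to_plantuml(SAMPLE))
```

The printed text can be saved as test.puml and passed to plantuml.jar exactly as in the second command above. A disconnected graph shows up as objects with no arrows between them.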

We thought this might be useful 🙂

Semantic Web in Libraries Conference

The Semantic Web in Libraries conference, organized by HBZ and ZBW, was held in Cologne in Germany on November 26-28. The first day was dedicated to workshops. Participants were able to choose from: Introduction to Linked Open Data, Metadata Provenance, and a PhD Workshop. The remaining two days were “normal” conference days with lectures and presentations.

The major topics of the conference were:

  • lessons learned from publishing the first LOD datasets,
  • library metadata enrichment and integration,
  • authority files,
  • integration of LOD and Semantic Web technologies into productive library environments,
  • new cataloguing models,
  • tasks and challenges for the future.

The conference finished with a panel discussion and a “lightning talks” session, in which participants were encouraged to give short speeches (of up to five minutes) to describe a problem, present results, or look for project partners.

The programme also covered the new BIBFRAME data model proposed by the Library of Congress. It consists of the following main classes:

  • Creative Work: reflects a conceptual essence of the cataloging item,
  • Instance: an individual, material embodiment of the Work,
  • Authority: a resource reflecting key authority concepts (e.g. people, places),
  • Annotation: decorates other BIBFRAME resources with additional information (e.g. reviews, holdings).

The presentations have been recorded. The videos are available here.

Cologne

Polish Collections in Europeana conference in Toruń

Toruń, AD 1641

The Polish Collections in Europeana conference was organized in the medieval city of Toruń on October 18-19 by the International Centre for Information Management Systems and Services ICIMSS.

The opening speech, The Decision to Digitise, was given by Eleanor Kenny of the Europeana Foundation. The remaining presentations, delivered in Polish, may be divided into the following four categories:

  • Presentation of Europeana-related projects
  • National IT infrastructure for cultural heritage resources
  • Support from the Ministry of Culture and National Heritage for digitization projects
  • Problems and needs of Polish cultural heritage institutions

Two presentations were given by representatives of The Ministry of Culture and National Heritage: The Digitization Strategy of The Ministry of Culture and National Heritage (Anna Duńczyk-Szulc) and The Project of a Ministerial Portal Dedicated to Cultural Heritage Resources Digitization (Agata Bratek). The portal is to be launched at the beginning of 2013.

A number of Europeana-related projects were presented, including:

  • Europeana Photography (Europeana Photography – Documentation of the First Century of Photography, Marta Miskowiec, Museum of History of Photography in Cracow, Piotr Kożurno, ICIMSS)
  • Athena (Athena and Athena Plus – Projects Encouraging Museums to Cooperate with Europeana, Maria Śliwińska, ICIMSS)
  • Judaica Europeana (Judaica Europeana – Digitizing Jewish Cultural Heritage in Europe, Edyta Kurek, Jewish Historical Institute, Warsaw)
  • APEX (Polish Archives’ Participation in the APEX Project, Anna Matejak, Head Office of State Archives, Warsaw)

Representatives of a number of big Polish institutions presented their current activities, including those related to Europeana:

  • National Institute of Museology and Collections Protection (National Institute of Museology and Collections Protection, Its Activities and Plans Concerning Museum Objects Digitization, Anna Kuśmidrowicz, Monika Jędralska)
  • National Audiovisual Institute (National Audiovisual Institute’s Digitization Support: Europeana Awareness Project Case Study, Jarosław Czuba)
  • The National Library of Poland (The National Library’s Participation in the Ongoing Europeana Projects, Katarzyna Ślaska)

Poznań Supercomputing and Networking Center prepared a presentation entitled The Digital Libraries Federation: Supporting Institutions of Culture in Making Their Resources Available Online, Metadata Aggregation for Europeana (Marcin Werla, Justyna Walkowska), which is available here (in Polish). In the presentation we describe the role of the Polish Digital Libraries Federation in the Polish digital heritage resources environment and in the context of the Polish IT infrastructure for researchers and science. We also present our cooperation with Europeana, including a number of projects in which we have been or will soon be involved.

The problems section was opened by a presentation prepared by prof. Folga-Januszewska, Problems Concerning the Delivery of Polish Museums Collections to Europeana. The representatives of smaller institutions were interested in obtaining information on funding for digitization projects.

A very important issue was Europeana’s new Data Exchange Agreement. A set of materials and opinions on this subject in the context of Polish law is available here: http://fbc.pionier.net.pl/pro/dla-dostawcow-danych/wspolpraca-z-zewnetrznymi-serwisami/wspolpraca-z-europeana/. The agreement, based on Creative Commons 0, is quite problematic in Polish law. It is not possible to waive copyright in Poland, and licenses can only be granted for enumerated fields of exploitation. The current ministerial directive is to send to Europeana only those metadata records or parts of records which are not copyrighted. This means, for example, excluding the conservation-restoration description of an object’s state. Very good news for all European readers is that the deputy director Katarzyna Ślaska announced that the National Library of Poland has decided to sign the agreement.

Another recurring subject was the need to translate (by a group of GLAM experts) the documentation of the most popular metadata description formats into Polish, so that they are unambiguous and used consistently by institutions.

The conference was open to the general public, and there were a few people interested in publishing their private collections online. One of them was Piotr Grzywacz from Tuchola, who runs the private Hunting Signals Museum.

TPDL 2012: Theory and Practice of Digital Libraries

The Theory and Practice of Digital Libraries conference (formerly known as the European Conference on Digital Libraries, ECDL) was held in Paphos (Cyprus) on September 23-27, 2012. PSNC presented two papers which can be found in the conference proceedings published as Lecture Notes in Computer Science (7489):

The former paper describes the prototype of the Virtual Transcription Laboratory created by PSNC as part of the SYNAT project. The work described in the paper included performing experiments whose goal was to train an OCR engine to automatically recognize text in digital scans of old documents (Polish texts printed between the 16th and 17th centuries). The paper explains the rationale behind the prototype, its possibilities, and new development directions.

The latter paper concerns the issue of transforming data described using traditional metadata schemas (such as MARC 21 or Dublin Core) to ontological formats, designed to exist in the Semantic Web and Linked Open Data environment. The paper describes requirements for languages expressing such mapping rules and the tools that implement them. It also briefly presents the jMet2Ont mapping tool.

Maa – Palaeokastro Museum

For us, the conference started on Sunday with a so-called doctoral consortium. A doctoral consortium is a meeting during which each PhD student is assigned a mentor who is obliged to read (before the meeting) an extended abstract of the planned PhD thesis, and to prepare a list of comments and questions. During the meeting, each student presents their work and results to date. The mentor is expected to facilitate discussion after the presentation. Such an event is very beneficial for the students who are offered a chance to learn experts’ opinion on the strong and weak points of the research, all in a safe and friendly environment (the meeting is closed to the public).

The main conference lasted three days, Monday to Wednesday.

An outstanding keynote speech was Cathy Marshall’s (Microsoft Research) Whose content is it anyway? Social media, personal data, and the fate of our digital legacy. The author raised a number of interesting issues concerning the transience of digital media, the expectations of the general user, and how the situation has been changed by social media such as Twitter or Facebook. The talk was well prepared and full of surprising points, turnabouts, and inspiring conclusions.

The same subject appeared in a presentation by Hany M. SalahEldeen and Michael L. Nelson of Old Dominion University. In their paper entitled Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost? the authors analyzed archival contents of social media corresponding to six important events in the last few years (including the Egyptian revolution, the H1N1 pandemic, and Michael Jackson’s death). It turns out that after a year 11% of content linked from social portals is no longer available. One more year means another dozen percent of dead links. The full paper is available at the arXiv.org pages. This study was considered important also by traditional media, including the BBC.

A large number of papers were dedicated to machine learning applications. Digital library data seem to be a perfect field to apply and test machine learning algorithms. Another very interesting talk was Finding Quality Issues in SKOS Vocabularies (Christian Mader, Bernhard Haslhofer, Antoine Isaac). The authors defined a set of quality indicators and good practices for thesauri encoded in the SKOS format, and also created a validation tool called qSKOS.

One of the most interesting events during the conference was the poster and demo session. The best demo contest was won by FrbrVis: An Information Visualization Approach to Presenting FRBR Work Families (Tanja Mercun, Maja Zumer, and Trond Aalberg). The authors, aware of the fact that more and more libraries and metadata aggregators are thinking about introducing the FRBR model, set themselves the task of designing an effective way of displaying FRBR data, so that the user could benefit from the model without feeling overwhelmed by it. They proposed four interface options, and then performed usability testing on a large number of users. Two graphical representations were picked as favourites: a concentric one (called sun burst) and a hierarchical one. An unexpected conclusion was that the graph-based representation (popular in the Semantic Web world due to the very nature of RDF data), even though considered attractive at first glance, proved difficult to use. A notable poster was presented in this session by the already mentioned Hany M. SalahEldeen, who studied the temporal intention of users publishing links to online resources in social networks.

Thursday was the day of workshops. Conference participants were given the following choice:

The NKOS workshop was dedicated mainly to the ISO 25964 thesaurus standard and its relation to SKOS. Only the first part of the standard is ready as of now. The documents describing the standard are not available for free, but a number of materials can be downloaded from the ISO 25964 webpage, including the XML schema (xsd) definition.

The archives workshop included a Semantic Technologies & Ontologies session in which Vladimir Alexiev of Ontotext gave a very interesting presentation about CIDOC CRM Search Based on Fundamental Relations and OWLIM Rules. Mentioning the FORTH (Foundation for Research and Technology – Hellas) A New Framework for Querying Semantic Networks study, he presented a model of searching which translates the 82 classes and 142 properties of CIDOC CRM to a smaller number of so-called fundamental classes (e.g. Person, Place) and properties, making the search much easier. Ontotext is the producer of the RDF repository called OWLIM. The presentation also described a set of OWLIM reasoning rules producing the simplified model.

In the shortest of the workshops, on supporting users’ exploration (additional materials available here), the participants had a chance to listen to a talk by David Haskiya (Europeana Foundation) about Europeana’s existing and planned features supporting users’ exploration of resources. The workshop ended with an interesting panel discussion in which the most prominent subjects were the needs and expectations of current and future users of digital libraries, especially in the context of the youngest generation (see the video below).

The conference was held in a beautiful and historically significant corner of Europe which unfortunately is very hard to reach from Poland. Last year’s location (Berlin) was easier to get to for most of the participants. Next year the conference is to be held in Malta.

Cypriot cuisine

Post authors: Adam Dudczak, Justyna Walkowska, Marcin Werla

CIDOC 2012: Enriching Cultural Heritage

The Helsinki Cathedral, minutes before midnight.

The CIDOC 2012: Enriching Cultural Heritage conference was held in Helsinki (the World Design Capital this year) on June 10-14. The conference is organized annually by CIDOC/ICOM, the International Committee for Documentation at the International Council of Museums. Last year the conference was held in Sibiu, Romania – a short post about it is available here.

PSNC is interested in the work of CIDOC because we have started using the CIDOC CRM model as the main ontology to organize metadata stored in a Semantic Web knowledge base we have built in the SYNAT project. The knowledge base contains information about resources of different types (currently: library, catalogue and museum), described with different metadata schemas. We needed a description format to which we could map the existing heterogeneous records. CIDOC CRM (Conceptual Reference Model) provides definitions and a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation, so it was a natural choice. Also, there exists an OWL implementation of CIDOC CRM, very useful in the Semantic Web environment.

While trying to map library metadata to CIDOC CRM we realized that the representation of books is more complicated than the representation of museum objects (even though, of course, museums can store books, and libraries can have old volumes that have historical value). You can find more information about this issue in the team’s publication list.

First we tried to cope with the problem by introducing our own CIDOC CRM extensions (mostly subclasses and subproperties), but then we switched to FRBRoo. FRBRoo is an extension of CIDOC CRM created by the CIDOC committee that is also compliant with the FRBR (Functional Requirements for Bibliographic Records) model as specified by IFLA (International Federation of Library Associations). The most distinctive feature of FRBR is the description of a publication (e.g. a book) on four levels:

  • work (e.g. ‘Heart of Darkness’ by Joseph Conrad),
  • expression (the intellectual content of the first English edition of ‘Heart of Darkness’),
  • manifestation (all physical copies of the edition, as a set),
  • item (a particular exemplar from the set).
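
The nesting of the four levels can be sketched as a simple containment hierarchy. This is only an illustration of the FRBR idea in Python, not the FRBRoo class structure and not code from our mapper; all class and field names, and the copy identifiers, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    copy_id: str  # hypothetical identifier of one physical copy


@dataclass
class Manifestation:
    description: str  # the set of all physical copies of one edition
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    description: str  # the intellectual content of one realisation of the work
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str
    author: str
    expressions: List[Expression] = field(default_factory=list)


def count_items(work: Work) -> int:
    """Count the physical copies reachable from a work through all four levels."""
    return sum(len(m.items) for e in work.expressions for m in e.manifestations)


# Invented example data following the list above.
heart = Work(
    title="Heart of Darkness",
    author="Joseph Conrad",
    expressions=[
        Expression(
            "intellectual content of the first English edition",
            [Manifestation("all physical copies of that edition",
                           [Item("copy-1"), Item("copy-2")])],
        )
    ],
)

if __name__ == "__main__":
    print(count_items(heart))
```

A flat metadata record usually mixes statements about all four levels, which is one reason why mapping library records is harder than it first appears.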

During this CIDOC conference we presented our results: the description of mapping from MARC 21 and PLMET schemas to FRBRoo and the challenges related to this process.

The conference included a number of workshops (with a very interesting CIDOC CRM/FRBRoo/EDM/CRM Dig one by Martin Doerr), CIDOC working groups meetings, keynotes, and ‘regular’ presentations. The main themes of the conference were:

  • Co-operation & exchange,
  • Social media,
  • Semantic Web,
  • Digital technologies and intangible cultural heritage,
  • Innovations in documentation,
  • Multilingualism and regional cultures.

The Europeana Libraries and TEL Joint Meeting in Bucharest

The Italian Church in Bucharest

The Europeana Libraries project and TEL (The European Library) joint meeting was held in Bucharest on May 21-23. The theme of the meeting was Looking to the future: how do we place our service at the heart of Europe’s research communities? Some videos from the meetings are available here.

PSNC participates (as an external expert) in Work Package 5 of the Europeana Libraries project, whose main objective is to Enhance the searchability of existing library-domain content in Europeana by defining transformations from ESE metadata to EDM and establishing best practice taking account of the different types of library contributing to Europeana.

Internally, the Europeana portal is switching to a new data representation schema, called EDM (Europeana Data Model). The main difference between EDM and the previously used ESE (Europeana Semantic Elements) is that EDM is an ontology-based format, more oriented towards the Semantic Web and Linked Open Data. EDM makes a clear distinction between the physical resource (e.g. a painting, or an old print), called Provided Cultural Heritage Object, and the so-called Web Resource, which is a digital representation of the object, possibly one of many. This distinction has not always been clear in ESE. Also, ultimately EDM should become an event-oriented ontology, similar in this aspect to CIDOC CRM.

In the case of libraries, EDM poses challenges somewhat different from those faced by museums. One of the first questions to decide upon was whether the Provided Cultural Heritage Object is the Item (physical copy of a book) or the Expression (an abstract edition of a book), referring to FRBR terms.

Before the meeting, institutions participating in WP5 were asked to prepare a manual mapping of a number of chosen metadata records from their collections to EDM. The goal of this exercise was to express doubts or inconsistencies in the mapping or the EDM libraries profile (separate profiles for monographs and series), thus validating the profile. In the next step, TEL will prepare automatic mapping descriptions for the provided internal library formats, and test them against new metadata records and the new metadata aggregation infrastructure.

PSNC’s participation in the Europeana Libraries project is related to the development of methods for semantic integration of cultural heritage objects’ metadata, which is part of stage A10 of the SYNAT project.

jMet2Ont: From XML-based Metadata to Ontology-based Formats

We have just released jMet2Ont: http://fbc.pionier.net.pl/pro/jmet2ont/

jMet2Ont is a tool that transforms XML-based metadata, either flat (like Dublin Core) or hierarchical (like MARC/XML), into ontology-governed RDF triples (in CIDOC CRM, EDM, or any other ontology expressed in OWL).

You do not have to do any programming to use the mapper. What you have to do is prepare a mapping rules XML file – the syntax of this file is described in detail in the user documentation section of the project’s website.

For now the mapper is run as a command-line tool, but future development may include adding a graphical user interface. If you have any questions, do not hesitate to contact the responsible developers.

Mapping of MARC 21 Bibliographic Records to PLMET

Following this link, you will find the definition of mapping from a MARC 21 Bibliographic Record to a PLMET record – unfortunately, at this moment only in Polish. The mapping has been defined by Leszek Śnieżko from the NUKAT Center (NUKAT is the union catalog of Polish research libraries).

We would like to encourage libraries using MARC 21 internally (together with other interested entities) to consult the mapping proposal.

The Difference between FROM and FROM NAMED in SPARQL, and the SeRQL Alternative

The SYNAT project involves intense use of semantic web technologies. We store data in an RDF repository (OWLIM), and use the SPARQL and SeRQL languages to query the data. The former is considered a standard; the latter, proposed by Aduna (producer of the Sesame RDF repository), is easier to use, at least for some of us.

Last week we realized that, given how often we use SPARQL, it was about time to fully understand the difference between FROM and FROM NAMED. It turned out that finding a reliable and complete source providing this information was not that easy, so we decided to create this post (based on this post on a team member’s private blog) to clear things up.

It seems that one of the bigger problems is the name itself. Both FROM and FROM NAMED concern named graphs, which we hold responsible for a lot of misunderstandings around those clauses. Below is a concise Q&A section that describes the situation.

If you do not declare FROM or FROM NAMED, what exactly do you query?
You query the active graph. The active graph does not need to be the default graph! In OWLIM, for instance, the active graph is the whole of the repository.

Example (from the SPARQL specification):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?nameX ?nameY ?nickY
WHERE
  { ?x foaf:knows ?y ;
       foaf:name ?nameX .
    ?y foaf:name ?nameY .
    OPTIONAL { ?y foaf:nick ?nickY }
  }

What is the active graph?
It is the graph(s) that is queried when FROM and FROM NAMED are not used. It might be the default graph, the whole repository contents… or possibly something else, depending on the implementation.

What is the default graph?
The default graph is the graph without a name, or without a context. This is the graph whose triples are in fact triples and not quads.

What does the FROM clause change?
If you use the FROM clause, you restrict the set of graphs that are queried. Only the named graph(s) given in the FROM clause(s) will be considered while matching the template.

Example. Only triples from the <http://example.org/foaf/aliceFoaf> graph will be used.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT  ?name
FROM    <http://example.org/foaf/aliceFoaf>
WHERE   { ?x foaf:name ?name }

What does the FROM NAMED clause change?
If you use FROM NAMED, every graph name used in a GRAPH pattern of the query (whether a variable or a URI) will be matched only against the graphs given in the FROM NAMED clause(s).

Example (which combines FROM and FROM NAMED). ?g will be matched either to <http://example.org/alice> or to <http://example.org/bob>, but to no other named graph.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?who ?g ?mbox
FROM <http://example.org/dft.ttl>
FROM NAMED <http://example.org/alice>
FROM NAMED <http://example.org/bob>
WHERE
{
   ?g dc:publisher ?who .
   GRAPH ?g { ?x foaf:mbox ?mbox }
}

Can you combine FROM and FROM NAMED?
Yes, see the question above. In the example, the triple pattern inside the GRAPH block has to be found in one of the graphs given in the FROM NAMED clauses, and the triple pattern outside the GRAPH block will be matched against the graph given in the FROM clause.
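
To make the roles of the two clauses more tangible, here is a toy model in Python. It follows the dataset construction described in the SPARQL specification (graphs listed in FROM are merged into the default graph of the query, graphs listed in FROM NAMED stay addressable by name inside GRAPH) and evaluates only the single query shape used in the example. It is a sketch, not how OWLIM or any other repository is implemented; the store contents, the function names, and the extra "carol" graph are invented for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

# A toy repository: graph name -> set of triples (all data is hypothetical).
STORE: Dict[str, Set[Triple]] = {
    "http://example.org/dft.ttl": {
        ("http://example.org/alice", "dc:publisher", "Alice Hacker"),
        ("http://example.org/bob", "dc:publisher", "Bob Hacker"),
        ("http://example.org/carol", "dc:publisher", "Carol"),
    },
    "http://example.org/alice": {("_:a", "foaf:mbox", "mailto:alice@example.org")},
    "http://example.org/bob": {("_:b", "foaf:mbox", "mailto:bob@example.org")},
    "http://example.org/carol": {("_:c", "foaf:mbox", "mailto:carol@example.org")},
}


def build_dataset(from_graphs: Iterable[str], from_named: Iterable[str]):
    """Build the query dataset: FROM graphs are merged, FROM NAMED graphs keep their names."""
    default: Set[Triple] = set()
    for name in from_graphs:
        default |= STORE.get(name, set())
    named = {name: STORE.get(name, set()) for name in from_named}
    return default, named


def run_example_query(default: Set[Triple], named: Dict[str, Set[Triple]]) -> List[Tuple[str, str, str]]:
    """Evaluate:  ?g dc:publisher ?who .  GRAPH ?g { ?x foaf:mbox ?mbox }"""
    rows = []
    for g, predicate, who in default:  # pattern outside GRAPH: default graph only
        if predicate != "dc:publisher":
            continue
        for _x, predicate2, mbox in named.get(g, set()):  # ?g ranges over named graphs only
            if predicate2 == "foaf:mbox":
                rows.append((who, g, mbox))
    return sorted(rows)


if __name__ == "__main__":
    default, named = build_dataset(
        from_graphs=["http://example.org/dft.ttl"],
        from_named=["http://example.org/alice", "http://example.org/bob"],
    )
    for row in run_example_query(default, named):
        print(row)
```

In this model the "carol" graph exists in the store and has a publisher in the default graph, yet it never appears in the results, because it was not listed in a FROM NAMED clause.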

What if there is only one FROM NAMED clause?
Then the following two queries yield the same results:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?who ?mbox
FROM <http://example.org/dft.ttl>
FROM NAMED <http://example.org/alice>
WHERE
{
   ?g dc:publisher ?who .
   GRAPH ?g { ?x foaf:mbox ?mbox }
}

is equal to

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?who ?mbox
FROM <http://example.org/dft.ttl>
WHERE
{
   ?g dc:publisher ?who .
   GRAPH <http://example.org/alice> { ?x foaf:mbox ?mbox }
}

Is it easier in SeRQL?
Yes. Both FROM and FROM NAMED are done using the FROM CONTEXT clause (example from the SeRQL specification):

SELECT name, mbox
FROM CONTEXT <http://example.org/context/graph2>
     {x} foaf:name {name};
         foaf:mbox {mbox}
USING NAMESPACE
foaf = <http://xmlns.com/foaf/0.1/>

“CIDOC 2011 – Knowledge Management and Museums” Conference

The “CIDOC 2011 – Knowledge Management and Museums” conference took place in Sibiu in Romania on September 4-9, 2011. The conference is an annual event, organized by ICOM-CIDOC, that is the Committee for Documentation at the International Council of Museums.

The conference participants came from very different, but cooperating environments: museologists, librarians, programmers and museum software vendors, researchers in the field of ontologies and semantic web, and also people and institutions concerned with museum documentation standards.

The conference included meetings of CIDOC working groups:

  • Archaeological Sites
  • Conceptual Reference Model Special Interest Group
  • Co-reference
  • Data Harvesting and Interchange
  • Digital preservation
  • Documentation Standards
  • Information Centres
  • Multimedia
  • Transdisciplinary Approaches in Documentation

A number of topics were raised at the conference which are tightly connected with PSNC’s work in the SYNAT project. The most prominent ones were:

  • LIDO (Lightweight Information Describing Objects) specification (www.lido-schema.org/) for description of museum resources made available online
  • recommendation to use persistent, unique identifiers (URIs) of museum resources
  • FRBRoo ontology which merges CIDOC CRM and FRBR (Functional Requirements for Bibliographic Records) to properly describe digital resources online (www.nla.gov.au/lis/stndrds/grps/acoc/tillett2004.ppt, http://www.frbr.org/categories/frbroo)
  • Wiss-ki system presentation (http://wiss-ki.eu/, http://www8.informatik.uni-erlangen.de/transdisc/hohmann_cidoc09_wisski-2.pdf). The goals and assumptions of the project are very close to those of SYNAT. Some of the already used solutions might possibly be used in SYNAT.

The next CIDOC conference will take place in June 2012 in Helsinki. Additionally, the CIDOC “summer school” for people taking care of museum documentation is planned for the holiday period of 2012.