Who put the D in the cloud?

LOD is the abbreviation for Linked Open Data. You will often find it in the context of a cloud: the LOD cloud. This refers to the body of openly accessible and interlinked data of the Semantic Web – usually RDF documents. But how does the data get into the cloud?

Direct entering of the triples

One possibility to get data into the LOD cloud is to enter the triples directly. Semantic MediaWiki, an extension of the well-known wiki software MediaWiki, and the Wikidata project work like this. Entry dialogs try to make data entry easier. However, users still need to know a lot about the technical background, and it requires some imagination to envision the future benefit of the time and labour put into these projects right now.
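To make concrete what such a directly entered statement amounts to, here is a minimal sketch of a single RDF triple rendered in N-Triples syntax. The URIs are invented placeholders, not real DBpedia or Wikidata identifiers:

```python
# A single RDF statement is a (subject, predicate, object) triple.
# N-Triples is one of the simplest textual serializations of it.
# All URIs below are illustrative examples.

def to_ntriples(subject, predicate, obj):
    """Render one triple in N-Triples syntax, object as a plain literal."""
    return f'<{subject}> <{predicate}> "{obj}" .'

triple = to_ntriples(
    "http://example.org/resource/Kiel_University",    # subject
    "http://example.org/property/foundingYear",       # predicate (hypothetical)
    "1665",                                           # object literal
)
print(triple)
```

Whatever entry dialog a wiki offers, this line of text is ultimately what has to be produced for the LOD cloud.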

Generation from database content

Several data collections have been created specifically to offer free information for everyone. Among these projects are WordNet, which offers information about the English language, and GeoNames, which provides information about places. Internally and for exports, a customized format is usually used. Since the information is stored in a database, it is not too difficult to export it as RDF triples as well.
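Such an export can be sketched in a few lines: each database row becomes a subject, and each column is mapped to a predicate. The rows, URIs and predicate names below are invented for illustration and only loosely modelled on a GeoNames-like table of places:

```python
# Sketch of a database-to-RDF export. Rows are simulated as dicts;
# in practice they would come from a SQL query. All URIs are invented.

rows = [
    {"id": 2891122, "name": "Kiel"},
    {"id": 2911298, "name": "Hamburg"},
]

# Map each exported column to a (hypothetical) RDF predicate URI.
PREDICATES = {
    "name": "http://example.org/ontology/name",
}

def export_row(row):
    """Turn one database row into a list of N-Triples lines."""
    subject = f"http://example.org/place/{row['id']}"
    return [
        f'<{subject}> <{PREDICATES[col]}> "{row[col]}" .'
        for col in PREDICATES
        if col in row
    ]

for row in rows:
    print("\n".join(export_row(row)))
```

The only design decisions are the subject URI scheme and the column-to-predicate mapping; the rest is mechanical.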

Generation from databases with other primary purposes

If structured information is already available in databases, it is relatively easy to generate RDF triples for the LOD cloud. Taking a look at the map of the LOD cloud, we quickly find some examples: Last.FM wants to bring music to its customers, the core business of the New York Times is news, Flickr makes its money with photographs, and the BBC is probably more interested in the production of documentaries and TV series than in RDF triples. However, since the structured data is already there, the contribution to the LOD cloud comes about quite incidentally.

Extraction from suitably structured web pages

DBpedia extracts “semantic” information from the infoboxes of Wikipedia articles in several Wikipedia language versions. This works because Wikipedia infoboxes have a rather simple predicate=value syntax. If you add the article’s name as the subject, you get a triple. Here you can see the infobox for Universität Kiel from the German Wikipedia:

{{Infobox Hochschule
| Name = Christian-Albrechts-Universität zu Kiel
| Logo = Siegel der CAU.png
| Motto = Pax optima rerum<br /><small>([lat.]: ''Der Frieden ist das beste der Güter)''</small>
| Gründungsdatum = 1665
| Ort = [[Kiel]]
| Bundesland = [[Schleswig-Holstein]]
| Staat = [[Deutschland]]
| Leitung          = [[Gerhard Fouquet]]
| Leitungstitel    = Präsident
| Studentenzahl    = 24.189 <small>''(WS 2011/12)''</small>[http://www.uni-kiel.de/ueberblick/statistik/eckdaten.shtml CAU: Statistische Eckdaten]. Abgerufen am 10. September 2012
| Mitarbeiterzahl  = 3.328 <small>''(2011)''</small> <small>''(ohne Klinikum)''</small>
| davon Professoren= 391
| Trägerschaft     = staatlich
| Jahresetat = 228,6 Mio €
| Website          = [http://www.uni-kiel.de/ www.uni-kiel.de]
}}

This method faces several problems:

  • As you can see in the example, the values are mixed with free text, HTML code and MediaWiki markup. These have to be removed during processing, e.g., to output an integer for the number of students.
  • The attribute names somehow have to be mapped to URIs of RDF predicates. Different language versions of Wikipedia use different attribute names, yet they all have to be mapped to one and the same URI.
  • Different language versions of Wikipedia might contain different values for the same attribute. In these cases the resulting document will contain multiple, conflicting values.
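The first two problems can be illustrated with a small extraction sketch. The markup cleanup and the German-attribute-to-URI mapping below are simplified assumptions for illustration, not DBpedia's actual extraction rules:

```python
import re

# Simplified infobox extraction, loosely following the DBpedia idea.
# The attribute-to-URI mapping and the cleanup rules are illustrative only.

MAPPING = {  # German attribute name -> invented predicate URI
    "Gründungsdatum": "http://example.org/property/foundingYear",
    "Studentenzahl": "http://example.org/property/numberOfStudents",
}

def clean(value):
    """Strip a little HTML/wiki markup and keep a leading number if present."""
    value = re.sub(r"<[^>]+>.*", "", value)   # drop HTML tags and trailing notes
    value = re.sub(r"\[\[|\]\]", "", value)   # unwrap [[links]]
    match = re.match(r"[\d.]+", value.strip())
    if match:                                 # e.g. "24.189" -> 24189
        return int(match.group().rstrip(".").replace(".", ""))
    return value.strip()

def extract(subject, infobox_text):
    """Collect (subject, predicate, value) triples from '| name = value' lines."""
    triples = []
    for line in infobox_text.splitlines():
        m = re.match(r"\|\s*([^=]+?)\s*=\s*(.+)", line)
        if m and m.group(1) in MAPPING:
            triples.append((subject, MAPPING[m.group(1)], clean(m.group(2))))
    return triples

infobox = """| Gründungsdatum = 1665
| Studentenzahl    = 24.189 <small>''(WS 2011/12)''</small>"""

print(extract("http://example.org/resource/Universität_Kiel", infobox))
```

The third problem, conflicting values across language versions, cannot be solved by parsing at all; it requires a policy decision about which source to trust.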

More clouds on the horizon

However, there is much more knowledge out there that could be made accessible to the LOD cloud. Although I have no numbers at hand, I suspect that for many disciplines (e.g. history) much more information is available in printed than in digital form. Technical aids for transferring this information from printed sources into “semantic” form might allow entirely new research opportunities in the respective disciplines, but also interesting new connections in general.

Even if the information is available digitally or even online, it is not always easy to include it in the LOD cloud. There are countless hand-written, at most semi-structured web pages. Information on these pages can at best be found using full-text search. With human readers as the main audience, this data evades automatic processing. It should be possible to make this kind of information more useful by applying suitable (semi-)automatic extraction programs.

If the information is digitally available in the form of more or less structured files, it can also be opened up for the LOD cloud. What is required is a formal description of the file format and transformation rules for generating RDF triples or ontologies from the file’s content. The files can be quite different in nature:

  • text files containing tables with fixed column width or column separators
  • Excel or Opendocument spreadsheets
  • proprietary and structured file formats such as GEDCOM, which for example are interesting for demographic research
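For the simplest case in this list, a column-separated text file, such a transformation rule can be sketched directly. The file layout, subject scheme and predicate URIs are assumptions for illustration, not taken from any real dataset:

```python
import io

# Sketch: turning a column-separated text table into RDF triples.
# Convention assumed here: first row = column names, first column = local id.

def table_to_triples(fileobj, subject_base, predicate_base, sep="\t"):
    """Emit one N-Triples line per (row, non-id column) pair."""
    header = fileobj.readline().rstrip("\n").split(sep)
    triples = []
    for line in fileobj:
        cells = line.rstrip("\n").split(sep)
        subject = subject_base + cells[0]
        for name, value in zip(header[1:], cells[1:]):
            triples.append(f'<{subject}> <{predicate_base}{name}> "{value}" .')
    return triples

data = io.StringIO("id\tname\tyear\nkiel\tKiel\t1665\n")
for t in table_to_triples(data, "http://example.org/place/",
                          "http://example.org/property/"):
    print(t)
```

Formats like GEDCOM need a real parser instead of line splitting, but the overall shape is the same: a formal description of the input plus rules that map its records to subjects, predicates and objects.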

As part of my research I would like to explore, evaluate, and further develop such methods to prepare data for the LOD cloud.
