News and Updates on the KRR Group

Source: Semantic Web world for you
Last week, on the afternoon of November 22, I co-organized a tutorial about Linked Data aimed at researchers from digital humanities. The objective was to give a basic introduction to the core principles and to do that in a very hands-on setting, so that everyone can get a concrete experience with publishing Linked Data. To [...]

This week we received notification from the EU that the LDBC project has been granted. We think this is great news. The LDBC project is a STREP that will run until Q2 2015. LDBC stands for Linked Data Benchmark Council; linked data here of course comprises RDF data management, but also includes the emerging class of graph database systems.

The mission of the LDBC project is to establish a long-term, independent association of RDF and graph database companies that defines benchmarks, specifies benchmarking practices and publishes officially vetted benchmark results. Beyond the project partners, many commercial vendors of RDF and graph database systems have already expressed their interest in joining this council (once we have founded the legal entity, which will still take a few months).

The motivation behind the project is to show the strengths (and weaknesses) of RDF and graph database technologies to the wider IT community pondering the adoption of these technologies, by enabling comparisons between the various products, but also with established relational database technologies. Also, by establishing competition on these benchmarks, LDBC aims to foster technical progress in RDF and graph database systems.

The LDBC project partners include, from the RDF database community, Ontotext and Openlink; from the graph database side, there is Neo Technologies (of neo4j fame), and Sparsity is indirectly involved through academic project partner UPC (Barcelona). The other project partners are the University of Innsbruck, FORTH, VU University Amsterdam and Technical University Munich (TUM). The academic partners will help to provide the council with an initial set of benchmarks.

The technical topics of interest for benchmarking are:

  • complex analytical queries for both graph and RDF
  • graph analysis algorithms and traversals
  • large-scale reasoning on RDF data
  • transaction performance
  • systems support for data integration and provenance

The use-case scenarios for these are:

  • social networking (e.g. marketing companies)
  • dynamic publishing (e.g. BBC)
  • telecommunication network analysis
  • bioinformatics data integration (e.g. OpenPhacts)

LDBC interacts with users of graph and RDF technologies through its Technical User Community (TUC), and the TUC is holding its first users workshop in Barcelona next week, Nov 19-20 (http://www.ldbc.eu:8090/display/TUC/First+TUC+meeting+Nov+2012), on the premises of UPC. The main reason for users to engage with the TUC is the chance to influence the benchmarking agenda of the LDBC. Talk to us, and RDF vendors might start competing in how to best solve your problems! Even if the Barcelona meeting is on too short notice, please drop us a note if you want to be involved in the TUC or know people who should.

Finally, please fill in the questionnaire (http://goo.gl/PwGtK) to tell us about your usage of (and problems with) RDF or graph database technologies. We will be looking at the questionnaire results received by Friday, November 16 to help set the agenda for the users meeting, so contributions already this week would be highly appreciated.

Thanks for your time, also on behalf of the full LDBC consortium,

Peter Boncz (scientific director LDBC)
Paul Groth
Frank van Harmelen

Source: Data2Semantics

Complexity metrics form the backbone of graph analysis. Centrality, betweenness, assortativity and scale freeness are just a handful of selections from a large and quickly growing literature. It seems that every purpose has its own notion of complexity. Can we find a way to tie these disparate notions together?

Algorithmic statistics provides an answer. It posits that any useful property induced from data can be used to compress it, that is, to store it more efficiently. If I know that my network is scale free, or that a set of points is distributed normally, that information will allow me to come up with a more efficient representation of the data. If it does not, the property we have learned is of no use.

This notion allows us to see data compression, learning and complexity analysis as simply three names for the same thing. The less a dataset can be compressed, the more complex it is; the more it can be compressed, the more useful our induced information is.
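
As a rough, purely illustrative sketch of this view (not from the original post), one can use an off-the-shelf compressor as a crude stand-in for an ideal one and read the achieved compression ratio as a complexity estimate:

```python
import random
import zlib

def compression_complexity(data: bytes) -> float:
    """Crude complexity estimate: compressed size relative to original size."""
    return len(zlib.compress(data, 9)) / len(data)

# A highly regular "dataset" compresses well (low complexity estimate) ...
regular = b"AB" * 5000
# ... while incompressible random bytes score close to 1 (high complexity).
random.seed(42)
noise = bytes(random.getrandbits(8) for _ in range(10_000))

print(compression_complexity(regular))  # much smaller than 1
print(compression_complexity(noise))    # close to (or slightly above) 1
```

Here the compressor plays the role of the learned property: the more structure it exploits, the shorter the description of the data.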

But we can go further than just complexity. Occam’s razor tells us that the simplest explanation is often the best. Algorithmic statistics provides us with a more precise version. If our data is the result of a computational process, and we have found a short description of it, then with high probability the model that allowed that compression is also a description of the process that generated our data. And that is ultimately what semantics is, a description of a generating process. Whether it’s the mental state that led to a linguistic expression, or the provenance trail that turned one form of data into another. When we talk about semantics, we are usually discussing computational processes generating data.

Practically, algorithmic statistics will give us a means to turn any family of network models (from frequent subgraphs to graph grammars) into a family of statistics. If the network model is powerful enough, the statistics should be able to capture any existing property of complex graphs, including scale freeness, assortativity or fractal scaling.

Source: Data2Semantics

TabLinker, introduced in an earlier post, is a spreadsheet-to-RDF converter. It takes Excel/CSV files as input and produces enriched RDF graphs with cell contents, properties and annotations, using the DataCube and Open Annotation vocabularies.
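
The exact shape of TabLinker's output is not spelled out in this post, but as a hypothetical illustration of the kind of DataCube triples a single data cell could give rise to (using rdflib and made-up namespaces):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")  # W3C RDF Data Cube vocabulary
EX = Namespace("http://example.org/census/")          # hypothetical dataset namespace

g = Graph()
g.bind("qb", QB)
g.bind("ex", EX)

# One observation per data cell; dimensions come from the marked-up headers.
obs = EX["observation/sheet1_B4"]
g.add((obs, RDF.type, QB.Observation))
g.add((obs, EX.municipality, EX["Amsterdam"]))                    # row header
g.add((obs, EX.occupation, EX["carpenter"]))                      # column header
g.add((obs, EX.population, Literal(1295, datatype=XSD.integer)))  # cell value

print(g.serialize(format="turtle"))
```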

TabLinker interprets spreadsheets based on hand-made markup using a small set of predefined styles (e.g. it needs to know what the header cells are). Work package 6 is currently investigating whether and how we can perform this step automatically.

Features:

  • Raw, model-agnostic conversion from spreadsheets to RDF
  • Interactive spreadsheet marking within Excel
  • Automatic annotation recognition and export with OA
  • Round-trip conversion: revive the original spreadsheet files from the produced RDF (UnTabLinker)

In Data2Semantics, we have used TabLinker to publish linked socio-historical data, converting the historical Dutch censuses (1795-1971) to RDF (see slides).

Social historians are actively doing research using these datasets, producing rich annotations that correct or reinterpret data; these annotations are very useful when checking dataset quality and consistency (see model). The published RDF is ready to query and visualize via SPARQL.

Source: Data2Semantics

Part of work package 2 is developing machine learning techniques to automatically enrich linked data. The web of data has become so large that maintaining it by hand is no longer possible. In contrast to existing learning techniques for the semantic web, we aim to apply our techniques directly to the linked data.

We use kernel-based machine learning techniques, which can deal well with structured data such as RDF graphs. Different graph kernels exist, typically developed in the bioinformatics domain, so which kernels are most suited to RDF is still an open question. A big advantage of the graph kernel approach is that relatively little preprocessing/feature selection of the RDF graph is necessary, and graph kernels can be applied to a wide range of tasks, such as property prediction, link prediction, node clustering, node ranking, etc.
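
To make the idea concrete, here is a minimal sketch (not the project's actual code, which lives in the d2s-tools repository linked below) of a very simple "label histogram" kernel between the neighbourhoods of two RDF resources; the kernels we study, such as Weisfeiler-Lehman variants, are considerably richer:

```python
from collections import Counter

from rdflib import Graph, URIRef

def neighbourhood_labels(g: Graph, node: URIRef, depth: int = 2) -> Counter:
    """Count the predicates and objects reachable from `node` within `depth` hops."""
    counts, frontier = Counter(), {node}
    for _ in range(depth):
        next_frontier = set()
        for s in frontier:
            for p, o in g.predicate_objects(s):
                counts[p] += 1
                counts[o] += 1
                if isinstance(o, URIRef):
                    next_frontier.add(o)
        frontier = next_frontier
    return counts

def label_kernel(g: Graph, a: URIRef, b: URIRef) -> int:
    """Dot product of the two label histograms: a (very) simple graph kernel."""
    ca, cb = neighbourhood_labels(g, a), neighbourhood_labels(g, b)
    return sum(ca[k] * cb[k] for k in ca.keys() & cb.keys())
```

Such a kernel can be plugged directly into any kernel method (e.g. an SVM) to do property prediction or clustering over RDF resources.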

Currently, our research focuses on:

  • which graph kernels are best suited to RDF,
  • what part of the RDF graph do we need for the graph kernel,
  • which tasks are well suited to being solved using kernels.

A paper with the most recent results is currently under submission at SDM 2013. Code for different graph kernels and for redoing our experiments is available at: https://github.com/Data2Semantics/d2s-tools.

Source: Data2Semantics

Linkitup is a Web-based dashboard for enrichment of research output published via the Figshare.com repository service. For license terms, see below.

Linkitup currently does two things:

  • it takes metadata entered through Figshare.com and tries to find equivalent terms, categories, persons or entities on the Linked Data cloud and several Web 2.0 services.
  • it extracts references from publications in Figshare.com, and tries to find the corresponding Digital Object Identifier (DOI).
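
For the second step, a DOI lookup could for instance go through the public CrossRef REST API; the sketch below (using the requests library) is only an illustration and not Linkitup's actual plugin code:

```python
import requests

def find_doi(reference_text):
    """Look up the most likely DOI for a free-text bibliographic reference."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": reference_text, "rows": 1},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return items[0]["DOI"] if items else None
```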

Linkitup is developed as part of our strategy to bring technology for adding semantics to research data to actual users.

Linkitup currently contains five plugins:

  • Wikipedia/DBPedia linking to tags and categories
  • Linking of authors to the DBLP bibliography
  • CrossRef linking of papers to DOIs through bibliographic reference extraction
  • Elsevier Linked Data Repository linking to tags and categories
  • ORCID linking to authors

Using Figshare allows Data2Semantics to:

  • tap into a wealth of research data already published,
  • provide state-of-the-art data enrichment services on a prominent platform with a significant user base,
  • bring RDF data publishing to a mainstream platform, and
  • remove the need for a separate Data2Semantics data repository.

Linkitup feeds the enriched metadata back as links to the original article in Figshare, but also builds an RDF representation of the metadata that can be downloaded separately, or published as research output on Figshare.

We aim to extend Linkitup to connect to other research repositories such as EASY and the Dataverse Network.

A live version of Linkitup is available at http://linkitup.data2semantics.org. Note that the software is still in beta! You will need a Figshare login and some data published in Figshare to get started.

More information, including installation instructions, is available from GitHub.

Source: Data2Semantics

Work package 5 of Data2Semantics focuses on the reconstruction of provenance information. Provenance is a hot topic in many domains, as it is believed that accurate provenance information can benefit measures of trust and quality. In science, this is certainly true. Provenance information in the form of citations is a centuries-old practice. However, this is not enough:

  • What data is a scientific conclusion based on?
  • How was that data gathered?
  • Was the data altered in the process?
  • Did the authors cite all sources?
  • What part of the cited paper is used in the article?

Detailed provenance of scientific articles, presentations, data and images is often missing. Data2Semantics investigates the possibilities for reconstructing provenance information.

Over the past year, we have implemented a pipeline for provenance reconstruction (see picture) that is based on four steps: preprocessing, hypothesis generation, hypothesis pruning, and aggregation and ranking. Each of these steps performs multiple analyses on a document, which can be run in parallel.
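
Schematically, the pipeline looks roughly like the sketch below (hypothetical function and attribute names, such as analysis functions returning hypothesis objects with a confidence attribute; the actual implementation differs):

```python
from concurrent.futures import ThreadPoolExecutor

def reconstruct_provenance(documents, analyses, threshold=0.5):
    """Run per-document analyses in parallel, then prune, aggregate and rank."""
    hypotheses = []
    with ThreadPoolExecutor() as pool:
        for doc in documents:
            # preprocessing + hypothesis generation: one task per analysis
            futures = [pool.submit(analysis, doc) for analysis in analyses]
            for future in futures:
                hypotheses.extend(future.result())
    # hypothesis pruning: drop candidate dependency relations below a threshold
    pruned = [h for h in hypotheses if h.confidence >= threshold]
    # aggregation and ranking of the surviving relations
    return sorted(pruned, key=lambda h: h.confidence, reverse=True)
```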

Our first results, running this pipeline on a Dropbox folder (including version history), are encouraging. We achieve an F-score of 0.7 when we compare generated dependency relations to manually specified ones (our gold standard).

Sara Magliacane has a paper accepted at the ISWC 2012 Doctoral Consortium, where she explains her approach and methodology.

Source: Data2Semantics

What if Dolly was only a partial replica?

Work package 3 of Data2Semantics centers around the problem of ranking Linked Data.

Over the past year, we have identified partial replication as a core use case for ranking.

The amount of linked data is rapidly growing, especially in the life sciences and medical domain. At the same time, this data is increasingly dynamic: information is added and removed at a much more granular level than before. Where until recently the Linked Data cloud grew one dataset at a time, these datasets are now 'live', and can change as fast as every minute.

For most Linked Data applications, the cloud, but also the individual datasets, are too big to handle. In some cases, such as the clinical decision support use case of Data2Semantics, one may need to support offline access from a tablet computer. Another use case is semantic web development, which requires a testing environment: running your experiment on a full dataset would simply take too much time.

The question then is: how can we make a proper selection from the original dataset that is sufficiently representative for the application's purpose, while at the same time ensuring timely synchronisation with the original dataset whenever updates take place?
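
As a naive baseline, and purely our own illustration rather than the approach under investigation, one could build a partial replica by keeping only the triples within a few hops of the resources an application actually needs:

```python
from rdflib import Graph, URIRef

def partial_replica(full: Graph, seeds, depth: int = 2) -> Graph:
    """Copy all triples reachable within `depth` hops of the seed resources."""
    replica, frontier = Graph(), set(seeds)
    for _ in range(depth):
        next_frontier = set()
        for s in frontier:
            for p, o in full.predicate_objects(s):
                replica.add((s, p, o))
                if isinstance(o, URIRef):
                    next_frontier.add(o)
        frontier = next_frontier
    return replica
```

Ranking then decides which seeds and triples matter most, and the synchronisation question is when and how to refresh this subset as the original dataset changes.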

A first version of the partial replication implementation was integrated in the latest version of the Hubble CDS Prototype.

Laurens Rietveld has an accepted paper at the ISWC 2012 doctoral consortium, where he explains his research approach and methodology. You can find the paper at http://www.mendeley.com/research/replication-for-linked-data/

Source: Data2Semantics

For over a year now, Data2Semantics has been organizing biweekly lunches for all COMMIT projects running at VU Amsterdam under the header ‘COMMIT@VU’. These projects are SEALINCMedia, EWIDS, METIS, e-Infrastructure Virtualization for e-Science Applications, Data2Semantics, and eFoodLab.

On October 29th, we had a lively discussion about opportunities to collaborate across projects (see photo).

Concretely:

  • SEALINCMedia and Data2Semantics will collaborate on trust and provenance. Trust and provenance analysis technology developed in SEALINCMedia can benefit from the extensive provenance graphs constructed in Data2Semantics, and vice versa.
  • There is potential for integrating the Amalgame vocabulary alignment toolkit (SEALINCMedia) into the Linkitup service of Data2Semantics. Also, Amalgame can play a role in the census use case of Data2Semantics (aligning vocabularies of occupations through time).
  • Both the SEALINCMedia and Data2Semantics projects are working on annotation methodologies and tools. Both adopt the Open Annotation model, and require multi-layered annotations (i.e. annotations on annotations).
  • eFoodlab and Data2Semantics are both working on annotating and interpreting spreadsheet data. We already had a joint meeting last year on this topic, but it makes sense to revive this collaboration.
  • We decided to gather the vocabularies and datasets used by the various projects, to make clearer where expertise in using these vocabularies lies. Knowing at a glance who else is using a particular dataset or vocabulary can be very helpful: you know whose door to knock on if you have questions or want to share experiences.
A first version of the COMMIT Data and Vocabulary catalog is already online at: