News and Updates on the KRR Group

Linkitup: Lightweight Data Enrichment

Posted by data2semantics in collaboration | computer science | large scale | semantic web | vu university amsterdam

Source: Data2Semantics

Linkitup is a Web-based dashboard for enrichment of research output published via the Figshare repository service. For license terms, see below.

Linkitup currently does two things:

  • it takes metadata entered through Figshare and tries to find equivalent terms, categories, persons or entities on the Linked Data cloud and several Web 2.0 services.
  • it extracts references from publications in Figshare, and tries to find the corresponding Digital Object Identifier (DOI).

Linkitup is developed as part of our strategy to bring technology for adding semantics to research data to actual users.

Linkitup currently contains five plugins:

  • Wikipedia/DBPedia linking to tags and categories
  • Linking of authors to the DBLP bibliography
  • CrossRef linking of papers to DOIs through bibliographic reference extraction
  • Elsevier Linked Data Repository linking to tags and categories
  • ORCID linking to authors

Using Figshare allows Data2Semantics to:

  • tap into a wealth of research data already published,
  • provide state-of-the-art data enrichment services on a prominent platform with a significant user base,
  • bring RDF data publishing to a mainstream platform, and
  • do without a separate Data2Semantics data repository.

Linkitup feeds the enriched metadata back as links to the original article in Figshare, but also builds an RDF representation of the metadata that can be downloaded separately, or published as research output on Figshare.
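
As a rough illustration of what such an RDF representation could look like, the sketch below serializes a few tag-to-entity links as N-Triples. The tag URIs, the link targets, and the choice of `skos:closeMatch` as the linking predicate are assumptions made for this example; they are not taken from Linkitup itself.

```python
# Hypothetical sketch: serialize enrichment links as N-Triples.
# The URIs and the predicate are illustrative assumptions,
# not Linkitup's actual vocabulary or output format.

SKOS_CLOSE_MATCH = "http://www.w3.org/2004/02/skos/core#closeMatch"


def links_to_ntriples(links):
    """Return one N-Triples line per (subject_uri, target_uri) pair."""
    lines = []
    for subject, target in sorted(set(links)):  # de-duplicate, stable order
        lines.append(f"<{subject}> <{SKOS_CLOSE_MATCH}> <{target}> .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(links_to_ntriples([
        ("http://example.org/tag/semantic-web",
         "http://dbpedia.org/resource/Semantic_Web"),
        ("http://example.org/tag/linked-data",
         "http://dbpedia.org/resource/Linked_data"),
    ]))
```

Each printed line is a complete triple, so the output can be loaded into any RDF store that reads N-Triples.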

We aim to extend Linkitup to connect to other research repositories such as EASY and the Dataverse Network.

A live version of Linkitup is available online. Note that the software is still in beta! You will need a Figshare login and some data published in Figshare to get started.

More information, including installation instructions, is available from GitHub.



Update: Provenance Reconstruction

Posted by data2semantics in collaboration | computer science | large scale | semantic web | vu university amsterdam

Source: Data2Semantics

Work package 5 of Data2Semantics focuses on the reconstruction of provenance information. Provenance is a hot topic in many domains, as it is believed that accurate provenance information can benefit measures of trust and quality. In science, this is certainly true. Provenance information in the form of citations is a centuries-old practice. However, this is not enough:

  • What data is a scientific conclusion based on?
  • How was that data gathered?
  • Was the data altered in the process?
  • Did the authors cite all sources?
  • What part of the cited paper is used in the article?

Detailed provenance of scientific articles, presentations, data and images is often missing. Data2Semantics investigates the possibilities for reconstructing provenance information.

Over the past year, we have implemented a pipeline for provenance reconstruction (see picture) that is based on four steps: preprocessing, hypothesis generation, hypothesis pruning, and aggregation and ranking. Each of these steps performs multiple analyses on a document that can be run in parallel.








Our first results, running this pipeline on a Dropbox folder (including version history), are encouraging. We achieve an F-score of 0.7 when we compare generated dependency relations to manually specified ones (our gold standard).
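
For readers unfamiliar with the measure, here is a minimal sketch of how an F-score over dependency relations can be computed. It assumes the balanced F1 variant and represents each relation as a (derived document, source document) pair; both choices, and the file names in the example, are assumptions for illustration rather than details of our evaluation.

```python
def f_score(predicted, gold, beta=1.0):
    """F-score of predicted relations against a gold standard.

    Both arguments are iterables of hashable relations, here assumed
    to be (derived_document, source_document) pairs.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Made-up relations, purely for illustration.
    gold = {("draft_v2", "draft_v1"), ("paper", "draft_v2"), ("slides", "paper")}
    predicted = {("draft_v2", "draft_v1"), ("paper", "draft_v2"), ("slides", "draft_v1")}
    print(f_score(predicted, gold))
```

In this toy example two of the three predicted relations are correct and two of the three gold relations are found, so precision and recall are both 2/3.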

Sara Magliacane has a paper accepted at the ISWC 2012 Doctoral Consortium, where she explains her approach and methodology.



Update: Linked Data Replication

Posted by data2semantics in collaboration | computer science | large scale | semantic web | vu university amsterdam

Source: Data2Semantics

What if Dolly was only a partial replica?

Work package 3 of Data2Semantics centers around the problem of ranking Linked Data.

Over the past year, we have identified partial replication as a core use case for ranking.

The amount of linked data is rapidly growing, especially in the life sciences and medical domain. At the same time, this data is increasingly dynamic: information is added and removed at a much more granular level than before. Where until recently the Linked Data cloud grew one dataset at a time, these datasets are now ‘live’, and can change as fast as every minute.

For most Linked Data applications, the cloud, and even the individual datasets, are too big to handle. In some cases, such as the clinical decision support use case of Data2Semantics, one may need to support offline access from a tablet computer. Another use case is where semantic web development requires a testing environment: running your experiment on a full dataset will just take too much time.

The question then is: how can we make a proper selection of the original dataset that is sufficiently representative for the application purpose, while at the same time ensuring timely synchronisation with the original dataset whenever updates take place?

A first version of the partial replication implementation was integrated in the latest version of the Hubble CDS Prototype.

Laurens Rietveld has a paper accepted at the ISWC 2012 Doctoral Consortium, where he explains his research approach and methodology. The paper is available online.


Source: Data2Semantics

For over a year now, Data2Semantics has organized biweekly lunches for all COMMIT projects running at VU Amsterdam under the header ‘COMMIT@VU’. These projects are SEALINCMedia, EWIDS, METIS, e-Infrastructure Virtualization for e-Science Applications, Data2Semantics, and eFoodLab.

On October 29th, we had a lively discussion about opportunities to collaborate across projects (see photo).


  • SEALINCMedia and Data2Semantics will collaborate on trust and provenance. Trust and provenance analysis technology developed in SEALINCMedia can benefit from the extensive provenance graphs constructed in Data2Semantics, and vice versa.
  • There is potential for integrating the Amalgame vocabulary alignment toolkit (SEALINCMedia) in the Linkitup service of Data2Semantics. Also, Amalgame can play a role in the census use case of Data2Semantics (aligning vocabularies of occupations through time).
  • Both the SEALINCMedia and Data2Semantics projects are working on annotation methodologies and tools. Both adopt the Open Annotation model, and require multi-layered annotations (i.e. annotations on annotations).
  • eFoodLab and Data2Semantics are both working on annotating and interpreting spreadsheet data. We already had a joint meeting last year on this topic, but it makes sense to revive this collaboration.
  • We decided to gather vocabularies and datasets used by the various projects to make it clearer where expertise in using these vocabularies lies. Knowing at a glance who else is using a particular dataset or vocabulary can be very helpful: you know on whose door to knock if you have questions or want to share experiences.
A first version of the COMMIT Data and Vocabulary catalog is already online.

Source: Think Links

From November 1 – 3, 2012, I attended the PLOS Article Level Metrics Workshop in San Francisco.

PLOS is a major open-access online publisher and the publisher of the leading megajournal PLOS One. A megajournal is one that accepts any scientifically sound manuscript. This means there is no decision on novelty, just a decision on whether the work was done in a scientifically sound way. The consequence is that much more science gets published, with a corresponding need for even better filters and search systems for science.
As an online publisher, PLOS tracks many of what are termed article level metrics. These metrics go beyond traditional scientific citations and include things like page views, PDF downloads, mentions on Twitter, etc. Article level metrics are, to my mind, altmetrics aggregated at the article level.
PLOS provides a comprehensive API to obtain these metrics and wants to encourage their broader adoption and usage. Thus, they organized this workshop. A variety of people attended, from publishers (including open access ones and the traditional big ones), funders and librarians to technologists. I was a bit disappointed not to see more social scientists there but I think the push here has been primarily from the representative communities. The goal was to outline key challenges for altmetrics and then corresponding concrete actions that could take place in the next 6 months to help address these challenges. It was an unconference, so no presentations and lots of discussion. I found it to be quite intense as we often broke up into small groups where one had to be fully engaged. The organizers are putting together a report that digests the work that was done. I’m excited to see the results.

Me actively contributing :-) Thanks Ian Mulvany!


  • Launch of the PLOS Altmetrics Collection. This was really exciting for me as I was one of the organizers of getting this collection produced. Our editorial is available online. This collection provides a nice home for future articles on altmetrics.
  • I was impressed by the availability of APIs. Several aggregators and good sources of altmetrics have appeared in a short time, among them ImpactStory, the PLOS ALM APIs, Mendeley and Microsoft Academic Search.
  • rOpenSci is a cool project that provides R APIs to many of these altmetric and other sources for analyzing data.
  • There’s quite a bit of interest in services to do these metrics. For example, Plum Analytics has a test being done at the University of Pittsburgh. I also talked to other people who were seeing interest in using these alternative impact measures, and heard that a number of companies are now providing this sort of analytics service.
  • I talked a lot to Mark Hahnel from Figshare about the Data2Semantics Linkitup service. He is super excited about it and loved the demo. I’m really excited about this collaboration.
  • Microsoft Academic Search is getting better. They are really turning it into a production product with better and more comprehensive data. I’m expecting a really solid service in the next couple of months.
  • I learned from Ian Mulvany of eLife that graph theory is mathematically “the same as” statistical mechanics in physics.
  • Context, Context, Context: there was a ton of discussion about the importance of context to the numbers one gets from altmetrics. For example, being able to quickly compare to some baseline, or knowing the population to which the number applies.

    White board thoughts on context! thanks Ian Mulvany

  • Related to context was the need for simple semantics: there was a notion that, for example, we need to know if a retweet on Twitter was positive or negative and what kind of person retweeted the paper (i.e. a scientist, a member of the public, a journalist, etc.). This is because, unlike with citations, the population behind altmetrics is not clearly defined, as it exists in a communication medium that doesn’t just contain scholarly communication.
  • I had a nice discussion with Elizabeth Iorns, who founded a company doing cool stuff around building markets for performing and replicating experiments.
  • Independent of the conference, I met up with some people I know from the natural language processing community and one of the things that they were excited about is computational semantics but using statistical approaches. It seems like this is very hot in that community and something we in the knowledge representation & reasoning community should pay attention to.


Associated with the workshop was a hackathon held at the PLOS offices. I worked in a group that built a quick demo: a bookmarklet that highlights papers in PubMed search results based on their online impact according to ImpactStory. So you would get differently color-coded results based on altmetric scores. This only took a day’s worth of work and really showed me how far these APIs have come in allowing applications to be built. It was a fun environment and I was really impressed with the other work that came out.

Random thoughts on San Francisco

  • Four Barrel coffee serves really really nice coffee – but get there early before the influx of ultra cool locals
  • The guys at Goody Cafe are really nice and also serve good coffee
  • If you’re in the touristy Fisherman’s Wharf area, walk to Fort Mason for fantastic views of the Golden Gate Bridge. The hostel there also looks cool.

Filed under: altmetrics, interdisciplinary research Tagged: #altmetrics, plos, trip report

Source: Data2Semantics

A quick heads up on the progress of Data2Semantics over the course of the third quarter of 2012.

Management summary: we have made headway in developing data enrichment and analysis tools that will have use in practice.

First of all, we developed a first version of a tool for enriching and interlinking data stored in the popular Figshare data repository. This tool, called Linkitup, takes metadata provided by the data publisher (usually the author) and can link it to DBPedia/Wikipedia, DBLP, ORCID, ScopusID, and the Elsevier Linked Data Repository. These links are fed back to the publication in Figshare, but can also be published separately as RDF. This way, we use Web 2.0 mechanisms that are popular amongst researchers to allow them to enrich their data, and reap immediate benefit. Our plans are to integrate increasingly elaborate enrichment and analysis tools in this dashboard (e.g. annotation, complexity analysis, provenance reconstruction, etc.).

Linkitup is available online. We are aiming for a first release soon!

Furthermore, we have made a good start at reconstructing provenance information from documents stored in a Dropbox folder. This technology can be used to trace back research ideas through chains of publications in a way that augments and extends the citation network. The provenance graph does not necessarily span just published papers, but can be constructed for a variety of document types (posters, presentations, notes, documents, spreadsheets, etc.).

Finally, we completed a first implementation of (partial) linked data replication, which will make dealing with large volumes of linked data much more manageable. The crux of partial replication lies in the selection (i.e. ranking) of a suitable subset of data to replicate. We will use data information measures, graph analysis, and statistical methods to perform this selection. The use case of clinical decision support will be the primary testing ground for this technology.
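
To give a feel for selection by ranking, here is a minimal sketch that keeps only the highest-ranked triples of a dataset. The ranking used here, a simple node-degree heuristic, is a stand-in chosen for this example and not the measure used in the project; the function name, the `budget` parameter, and the sample triples are likewise hypothetical.

```python
from collections import Counter


def select_subset(triples, budget):
    """Keep the `budget` highest-ranked triples.

    Ranking is a simple degree heuristic (an assumption for this sketch):
    a triple scores higher when its subject and object occur in many triples.
    Ties keep their original order because Python's sort is stable.
    """
    degree = Counter()
    for s, _p, o in triples:
        degree[s] += 1
        degree[o] += 1
    ranked = sorted(triples, key=lambda t: degree[t[0]] + degree[t[2]], reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    # Made-up triples, purely for illustration.
    data = [
        ("drugA", "treats", "disease1"),
        ("drugA", "interactsWith", "drugB"),
        ("drugB", "treats", "disease2"),
        ("drugC", "treats", "disease3"),
    ]
    for triple in select_subset(data, 2):
        print(triple)
```

A real implementation would also have to track updates in the source dataset so that the replica stays in sync; this sketch covers the selection step only.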


Source: Think Links

For International Open Access Week the VU University Amsterdam is working with other Dutch universities to provide information to academics on open access. Here’s a video they took of me talking about the relationship between open access and social media. The two go hand-in-hand!

Filed under: academia, altmetrics Tagged: oa, oaweek, open access, social media

Source: Think Links

For the last 1.5 years, I’ve been working on the Open PHACTS project, a public-private partnership for pharmacology data integration. Below is a good video from Lee Harland, the CTO, giving an overview of the project from the tech/pharma perspective.

Filed under: academia

Source: Think Links

I just received 50 copies of this in the mail today:

Literature is so much better in its electronic form but it’s still fun to get a physical copy. Most importantly, these proceedings represent scientific content and a scientific community that I’m proud to be part of. You can obviously access the full proceedings online. Preprints are also available from most of the authors’ sites. You can also read my summary of the 4th International Provenance and Annotation Workshop (IPAW 2012).

Filed under: academia, provenance Tagged: ipaw, lecture notes in computer science, lncs, provenance

Source: Think Links

If you read this blog a bit, you’ll know I’m a fairly big fan of RDF as a data format. It’s really great for easily mashing different data sources together. The common syntax gets rid of a lot of headaches before you can start querying the data and you get nice things like simple reasoning to boot.

One thing that I’ve been looking for is a nice way to store a ton of RDF data, which is

  1. easy to deploy;
  2. easy to query;
  3. works well with scalable analysis & reasoning techniques (e.g. stuff built using MapReduce/Hadoop);
  4. oh and obviously scalable.

This past spring I was at a Dagstuhl workshop where I had the chance to briefly talk to Chris Ré about the data storage environment used by Hazy, one of the leading projects in the world on large scale statistical inference. At the time, he was fairly enthusiastic about using HBase as a storage layer.

Based on that suggestion, I played around with deploying HBase myself on Amazon. Using Whirr it was pretty straightforward to deploy a pretty nice environment in a matter of hours. In addition, HBase has the nice side effect that it uses the same file system as Hadoop (HDFS), so you can run Hadoop jobs over the data that’s stored in the database.

With that I wanted to see a) what was a good way to store a bunch of RDF in HBase and b) if the retrieval of RDF was performant. Sever Fundatureanu worked on this as his master’s thesis.

One of the novel things he looked at was using coprocessors (built-in user-defined functions in HBase) to try and improve the building of indexes for RDF within the database. That is, instead of running multiple Hadoop load jobs, you run roughly one and then let the coprocessors in each worker node build the rest of the indexes you want in order to improve retrieval. While it didn’t improve performance, I thought the idea was cool. I’m still interested in how much user side processing you can shove into the worker nodes within HBase. Below you’ll find an abstract and link to his full thesis.

I’m still keen on using HBase as the basis for the analysis and reasoning over RDF data. We’re continuing to look into this area. If you have some cool ideas, let us know.

A Scalable RDF Store Based on HBase

Sever Fundatureanu

The exponential growth of the Semantic Web leads to a need for a scalable storage solution for RDF data. In this project, we design a quad store based on HBase, a NoSQL database which has proven to scale out to thousands of nodes. We adopt an Id-based schema and argue why it enables a good trade-off between loading and retrieval performance. We devise a novel bulk loading technique based on HBase coprocessors and we compare it to a traditional Map-Reduce technique. The evaluation shows that our technique does not scale as well as the traditional approach. Instead, with Map-Reduce, we achieve a loading throughput of 32152 quads/second on a cluster of 13 nodes. For retrieval, we obtain a peak throughput of 56447 quads/second.
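
To make the Id-based idea a bit more concrete, here is a toy in-memory sketch: terms are replaced by integer Ids, and each quad is written to several indexes that differ only in component order, so different query patterns become prefix lookups. The class, the three index orders, and the `ex:` terms are assumptions for illustration; this is not the schema from the thesis and it does not use HBase.

```python
class QuadStore:
    """Toy in-memory stand-in for an Id-based quad store (not HBase itself)."""

    def __init__(self, orders=("spoc", "posc", "ospc")):
        self.term_to_id = {}   # dictionary encoding: term -> integer Id
        self.id_to_term = []   # reverse mapping: Id -> term
        # One "table" per index order; here simply a set of Id tuples.
        self.indexes = {order: set() for order in orders}

    def _encode(self, term):
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def add(self, s, p, o, c):
        """Store a quad (subject, predicate, object, context) in every index."""
        ids = dict(zip("spoc", (self._encode(t) for t in (s, p, o, c))))
        for order, table in self.indexes.items():
            table.add(tuple(ids[pos] for pos in order))

    def match(self, order, *prefix_terms):
        """Return quads whose leading components (in `order`) equal prefix_terms."""
        prefix = tuple(self.term_to_id.get(t) for t in prefix_terms)
        results = []
        for key in self.indexes[order]:
            if key[:len(prefix)] == prefix:
                by_pos = dict(zip(order, key))
                results.append(tuple(self.id_to_term[by_pos[pos]] for pos in "spoc"))
        return sorted(results)


if __name__ == "__main__":
    store = QuadStore()
    store.add("ex:a", "ex:knows", "ex:b", "ex:g1")
    store.add("ex:c", "ex:knows", "ex:b", "ex:g1")
    store.add("ex:a", "ex:likes", "ex:d", "ex:g2")
    # Who knows ex:b? Answered from the predicate-object-first index.
    print(store.match("posc", "ex:knows", "ex:b"))
```

The linear scan in `match` is only for brevity; in a store with sorted row keys the same lookup would be a range scan over the matching prefix. The trade-off hinted at in the abstract shows up even here: every extra index makes `add` more expensive and some query patterns cheaper.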


Filed under: linked data Tagged: coprocessors, hbase, rdf