News and Updates on the KRR Group

Source: Data2Semantics

A quick heads up on the progress of Data2Semantics over the course of the third quarter of 2012.

Management summary: we have made headway in developing data enrichment and analysis tools that will be useful in practice.

First of all, we developed a first version of a tool for enriching and interlinking data stored in the popular Figshare data repository. This tool, called Linkitup, takes metadata provided by the data publisher (usually the author) and links it to DBpedia/Wikipedia, DBLP, ORCID, ScopusID, and the Elsevier Linked Data Repository. These links are fed back to the publication in Figshare, but can also be published separately as RDF. This way, we use Web 2.0 mechanisms that are popular amongst researchers to let them enrich their data and reap immediate benefit. Our plan is to integrate increasingly elaborate enrichment and analysis tools into this dashboard (e.g. annotation, complexity analysis, provenance reconstruction, etc.).

Linkitup is available from http://github.com/Data2Semantics/linkitup. We are aiming for a first release soon!
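To give a flavour of the kind of enrichment step Linkitup performs, here is a minimal sketch (not Linkitup's actual code) that looks up candidate DBpedia resources for a metadata keyword via DBpedia's public SPARQL endpoint. The function name and the exact-label matching strategy are illustrative assumptions.

    import requests

    DBPEDIA_SPARQL = "https://dbpedia.org/sparql"

    def dbpedia_candidates(keyword, limit=5):
        # Find DBpedia resources whose English label matches a metadata keyword,
        # e.g. a tag attached to a Figshare publication.
        query = (
            'SELECT DISTINCT ?resource WHERE { ?resource rdfs:label "%s"@en . } LIMIT %d'
            % (keyword, limit)
        )
        resp = requests.get(
            DBPEDIA_SPARQL,
            params={"query": query, "format": "application/sparql-results+json"},
        )
        resp.raise_for_status()
        bindings = resp.json()["results"]["bindings"]
        return [b["resource"]["value"] for b in bindings]

    # Candidate links for a tag such as "Semantic Web":
    print(dbpedia_candidates("Semantic Web"))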

Furthermore, we have made a good start at reconstructing provenance information from documents stored in a Dropbox folder. This technology can be used to trace back research ideas through chains of publications in a way that augments and extends the citation network. The provenance graph does not necessarily span just published papers, but can be constructed for a variety of document types (posters, presentations, notes, documents, spreadsheets, etc.).
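As a rough illustration of what reconstructing provenance at the file level can mean, the sketch below guesses derivation links between documents in a folder by combining modification time with filename similarity. This heuristic is a simplification for illustration only; the actual reconstruction uses richer signals than file names and timestamps.

    import difflib
    from pathlib import Path

    def guess_derivations(folder, threshold=0.5):
        # Order documents by modification time, then link each document to the
        # most similarly named earlier document as a candidate "derived from" edge.
        docs = sorted(Path(folder).expanduser().iterdir(),
                      key=lambda p: p.stat().st_mtime)
        edges = []
        for i, doc in enumerate(docs):
            if i == 0:
                continue
            best = max(docs[:i],
                       key=lambda p: difflib.SequenceMatcher(None, p.stem, doc.stem).ratio())
            score = difflib.SequenceMatcher(None, best.stem, doc.stem).ratio()
            if score >= threshold:
                edges.append((doc.name, "wasDerivedFrom", best.name))
        return edges

    # e.g. guess_derivations("~/Dropbox/paper-drafts")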

Finally, we built a first implementation of (partial) linked data replication that will make dealing with large volumes of linked data much more manageable. The crux of partial replication lies in selecting (i.e. ranking) a suitable subset of the data to replicate. We will use information measures, graph analysis, and statistical methods to perform this selection. The clinical decision support use case will be the primary testing ground for this technology.
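A toy version of that selection step might look like the sketch below, which ranks triples by how rare their predicate is and keeps a fixed budget. The inverse-predicate-frequency heuristic is just a stand-in; the real selection will combine information measures, graph analysis, and statistics.

    from collections import Counter

    def select_replica(triples, budget):
        # Rank triples so that those with rare predicates come first (a crude
        # information measure), then keep only as many as the replica can hold.
        predicate_freq = Counter(p for _, p, _ in triples)
        ranked = sorted(triples, key=lambda t: predicate_freq[t[1]])
        return ranked[:budget]

    # e.g. replicate the 10,000 highest-ranked triples to a mobile client:
    # subset = select_replica(all_triples, budget=10_000)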


Source: Think Links

For International Open Access Week, the VU University Amsterdam is working with other Dutch universities to provide information to academics on open access. Here’s a video they took of me talking about the relationship between open access and social media. The two go hand-in-hand!



Filed under: academia, altmetrics Tagged: oa, oaweek, open access, social media

Source: Think Links

For the last 1.5 years, I’ve been working on the Open PHACTS project – a public-private partnership for pharmacology data integration. Below is a good video from Lee Harland, the CTO, giving an overview of the project from the tech/pharma perspective.



Filed under: academia

Source: Think Links

I just received 50 copies of this in the mail today:

Literature is so much better in its electronic form, but it’s still fun to get a physical copy. Most importantly, this proceedings represents scientific content and a scientific community that I’m proud to be part of. You can obviously access the full proceedings online. Preprints are also available from most of the authors’ sites. You can also read my summary of the 4th International Provenance and Annotation Workshop (IPAW 2012).

Filed under: academia, provenance Tagged: ipaw, lecture notes in computer science, lncs, provenance

Source: Think Links

If you read this blog a bit, you’ll know I’m a fairly big fan of RDF as a data format. It’s really great for easily mashing different data sources together. The common syntax gets rid of a lot of headaches before you can start querying the data and you get nice things like simple reasoning to boot.
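For instance, mashing two sources together is just parsing them into the same graph and querying across them. A small rdflib sketch (the file names and the query are made up for illustration):

    from rdflib import Graph

    g = Graph()
    g.parse("source_a.ttl", format="turtle")   # hypothetical first dataset
    g.parse("source_b.ttl", format="turtle")   # hypothetical second dataset

    # One SPARQL query over the merged graph, no format wrangling needed.
    for row in g.query("""
        SELECT ?person ?label WHERE {
            ?person a <http://xmlns.com/foaf/0.1/Person> ;
                    <http://www.w3.org/2000/01/rdf-schema#label> ?label .
        }
    """):
        print(row.person, row.label)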

One thing that I’ve been looking for is a nice way to store a ton of RDF data, which is

  1. easy to deploy;
  2. easy to query;
  3. works well with scalable analysis & reasoning techniques (e.g. stuff built using MapReduce/Hadoop);
  4. oh and obviously scalable.

This past spring I was at a Dagstuhl workshop where I had the chance to briefly talk to Chris Ré about the data storage environment used by the Hazy project – one of the leading projects in the world on large-scale statistical inference. At the time, he was fairly enthusiastic about using HBase as a storage layer.

Based on that suggestion, I played around with deploying HBase myself on Amazon. Using Whirr, it was pretty straightforward to deploy a pretty nice environment in a matter of hours. In addition, HBase has the nice side effect that it uses the same file system as Hadoop (HDFS), so you can run Hadoop jobs over the data that’s stored in the database.

With that I wanted to see a) what was a good way to store a bunch of RDF in HBase and b) if the retrieval of RDF was performant. Sever Fundatureanu worked on this as his master’s thesis.

One of the novel things he looked at was using coprocessors (built-in, user-defined functions in HBase) to try to improve the building of indexes for RDF within the database. That is, instead of running multiple Hadoop load jobs, you run roughly one and then let the coprocessors in each worker node build the rest of the indexes needed for fast retrieval. While it didn’t improve performance, I thought the idea was cool. I’m still interested in how much user-side processing you can shove into the worker nodes within HBase. Below you’ll find an abstract and link to his full thesis.
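To make the index-building idea concrete, here is a client-side sketch using the happybase Thrift client. The table names, the "f" column family, and the fixed-width term IDs are my own assumptions, not the thesis’s exact schema. It writes every quad into several permutation tables from the client; the coprocessor approach instead bulk-loads a single table and lets the region servers derive the remaining permutations server-side.

    import happybase

    PERMUTATIONS = ["spoc", "pocs", "ocsp", "cspo"]   # hypothetical index tables
    POSITION = {"s": 0, "p": 1, "o": 2, "c": 3}

    def row_key(term_ids, order):
        # Concatenate fixed-width term IDs in the order named by the index,
        # e.g. "pocs" -> predicate, object, context, subject.
        return b"".join(term_ids[POSITION[ch]] for ch in order)

    def load_quads(host, quads):
        # Client-side variant: write each quad into every permutation table
        # (the tables are assumed to exist already). The coprocessor variant
        # would load only "spoc" and have the region servers materialise the
        # other permutations themselves.
        conn = happybase.Connection(host)
        tables = {name: conn.table(name) for name in PERMUTATIONS}
        for quad in quads:            # quad = (s_id, p_id, o_id, c_id), byte-string IDs
            for name, table in tables.items():
                table.put(row_key(quad, name), {b"f:q": b""})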

I’m still keen on using HBase as the basis for the analysis and reasoning over RDF data. We’re continuing to look into this area. If you have some cool ideas, let us know.

A Scalable RDF Store Based on HBase –

Sever Fundatureanu

The exponential growth of the Semantic Web leads to a need for a scalable storage solution for RDF data. In this project, we design a quad store based on HBase, a NoSQL database which has proven to scale out to thousands of nodes. We adopt an Id-based schema and argue why it enables a good trade-off between loading and retrieval performance. We devise a novel bulk loading technique based on HBase coprocessors and we compare it to a traditional Map-Reduce technique. The evaluation shows that our technique does not scale as well as the traditional approach. Instead, with Map-Reduce, we achieve a loading throughput of 32152 quads/second on a cluster of 13 nodes. For retrieval, we obtain a peak throughput of 56447 quads/second.


Filed under: linked data Tagged: coprocessors, hbase, rdf

Source: Semantic Web world for you
Here it is: the first fully featured release of SemanticXO! Use it in your activities to store and share any kind of structured information with other XOs. The installation procedure is easy and only requires an XO-1 running operating system version 12.1.0. Go to the Git repository and download the files “setup.sh” and “semanticxo.tar.gz” […]
