News and Updates on the KRR Group

Source: Semantic Web world for you

Google recently announced a new project, named the Google Art Project, which gives access to paintings from around the world in very high definition. It also provides some information related to these paintings. This is a very cool service, but the data is not provided in a machine-friendly way. So we thought it would be nice to have a wrapper exporting it in RDF so that this data could be more easily consumed by any semantic-aware application.

The GoogleArt2RDF wrapper offers such a wrapping service for any painting made available through GoogleArt. In order to use it, just copy the name of the artwork and paste it after “”. For instance, change “” into ““.

The data is expressed using essentially the FOAF and Dublin Core ontologies. When possible, the resources are linked to DBpedia for the author of the painting and the medium used (oil on canvas, etc.). This is a first version of the system which does not yet export all the data from Google; comments and suggestions on how to improve it are most welcome!
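To give an idea of the shape of the output, here is a small sketch of the kind of FOAF / Dublin Core triples such a wrapper could emit. The painting, URIs and values below are made up for illustration; they are not the wrapper's actual output.

```python
# Illustrative FOAF / Dublin Core description of a painting in N-Triples form.
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/elements/1.1/"

def ntriple(s, p, o):
    """Serialise one triple in N-Triples syntax (o is a URI or a plain literal)."""
    obj = f"<{o}>" if o.startswith("http") else f'"{o}"'
    return f"<{s}> <{p}> {obj} ."

painting = "http://example.org/painting/the-starry-night"  # hypothetical URI
painter = "http://dbpedia.org/resource/Vincent_van_Gogh"

triples = [
    ntriple(painting, DC + "title", "The Starry Night"),
    ntriple(painting, DC + "creator", painter),   # link to DBpedia for the author
    ntriple(painter, FOAF + "name", "Vincent van Gogh"),
]
print("\n".join(triples))
```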


Source: Think Links

In preparation for Science Online 2011, I was asked by Mark Hahnel from over at Science 3.0 if I could do some analysis of the blogs that they’ve been aggregating since October (25 thousand posts from 1506 authors). Mark, along with Dave Munger, will be talking more about the role/importance of aggregators in a session Saturday morning at 9am (Developing an aggregator for all science blogs). These analyses provide a high-level overview of the content of science blogs. Here are the results.

The first analysis tried to find the topics of blogs and their relationships. We used title words as a proxy for topics and co-occurrence of those words as representative of the relationships between those topics. Here’s the map (click the image to see a larger size):

The words cluster together according to their co-occurrence. The hotter the color, the more occurrences of those words. You’ll notice, for example, that Science and Blog are close to one another. Darwin and days, as well as fumbling and tenure, are close as well. The visualization was done with the VOSviewer software.
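The title-word co-occurrence counting behind such a map can be sketched in a few lines. This is a minimal stdlib sketch with made-up titles, not the actual pipeline feeding VOSviewer:

```python
from collections import Counter
from itertools import combinations

# Toy stand-ins for blog post titles.
titles = [
    "science blog aggregation",
    "science online 2011",
    "blog carnival science",
]

# Count how often each pair of title words appears together in one title;
# pair counts like these are what a map tool clusters and colours.
pair_counts = Counter()
for title in titles:
    words = sorted(set(title.lower().split()))
    pair_counts.update(combinations(words, 2))

print(pair_counts.most_common(3))
```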

I also looked at how blogs are citing research papers. We looked for the occurrence of DOIs as well as research blogging style citations within all the blog posts. We found that there were 964 posts with these sorts of citations. In this case, I thought there would be more but maybe this is down to how I implemented it.
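The post doesn’t say how the DOI detection was implemented; a common approximation is a regular expression over the post text, sketched here (real-world DOI matching has more edge cases):

```python
import re

# Approximate DOI pattern: "10." followed by a registrant code, a slash,
# and a suffix. Good enough for a rough tally, not a full validator.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+\b")

def find_dois(text):
    """Return all DOI-looking strings found in a blog post."""
    return DOI_RE.findall(text)

post = "We discuss doi:10.1038/nature09410 and compare it with earlier work."
print(find_dois(post))
```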

Finally, I looked at what URLs were most commonly used in all the blog posts. Here are the top 20:

URL Occurrences 4476 3920 1002 930 789 648 533 485 482 376 350 336 295 271 269 266 265 232 232 195

I was quite happy with this list because they are pretty much all science links. I thought there would be a lot more links to non-science places.
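A link tally like the one above can be computed by extracting URLs from the posts and counting them, for instance per host. A minimal sketch with made-up posts (not the real corpus):

```python
import re
from collections import Counter
from urllib.parse import urlparse

posts = [
    "see http://dx.doi.org/10.1000/x and http://www.plos.org/article",
    "background at http://www.plos.org/other",
]

# Extract links, reduce them to their host, and rank hosts by frequency.
url_re = re.compile(r"https?://\S+")
hosts = Counter(urlparse(u).netloc for post in posts for u in url_re.findall(post))
print(hosts.most_common(2))
```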

I hope the results can provide a useful discussion piece. Obviously, this is just the start and we can do a lot more interesting analyses. In particular, I think such statistics can be the basis for alt-metrics style measures. If you’re interested in talking to me about these analyses, come find me at Science Online.

Filed under: academia Tagged: #altmetrics, #scio11, analysis, science blogging

Source: Semantic Web world for you

Although it is commonly depicted as one giant graph, the Web of Data is not a single entity that can be queried. Instead, it’s a distributed architecture made of different datasets, each providing some triples (see the LOD Cloud picture). Each of these data sources can be queried separately, most often through an end point understanding the SPARQL query language. Looking for answers making use of information spanning different data sets is a more challenging task, as the mechanisms used internally to query one data set (database-like joins, query planning, …) do not scale easily over several data sources.

When you want to combine information from, say, DBpedia and the Semantic Web dog food site, the easiest and quickest workaround is to download the content of the two datasets, possibly filtering out triples you don’t need, and load the retrieved content into a single data store. This approach has some limitations: you must have a store running somewhere (which may require a significantly powerful machine to host it), the downloaded data must be updated from time to time, and the data you need may not be available for download in the first place.
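The merge-then-filter workaround can be caricatured with in-memory sets standing in for the downloaded dumps. In practice you would parse millions of triples into a real triple store; the names and triples below are made up:

```python
# Stand-ins for two downloaded dumps (normally parsed from RDF files).
dbpedia = {
    ("dbpedia:Amsterdam", "rdf:type", "db:Place"),
    ("dbpedia:Rembrandt", "rdf:type", "foaf:Person"),
}
dogfood = {
    ("dogfood:paul", "rdf:type", "foaf:Person"),
    ("dogfood:paper42", "rdf:type", "swrc:InProceedings"),
}

# Merge both datasets into one local store, keeping only what we need
# (here: statements about people), then query the single merged set.
merged = {t for t in dbpedia | dogfood if t[2] == "foaf:Person"}
people = sorted(s for s, p, o in merged)
print(people)
```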

When used along with a SPARQL data layer, eRDF offers you a solution when one of these limitations prevents you from executing your SPARQL query over several datasets. The application runs on a low-end laptop and can query, and combine the results from, several SPARQL end points. eRDF is a novel RDF query engine making use of evolutionary computing to search for the solution. Instead of the traditional resolution mechanism, an iterative trial-and-error process is used to progressively find some answers to the query (more information can be found in the published papers and in this technical report). It’s a versatile optimisation tool that can run on different kinds of data layers, and the SPARQL data layer offers an abstraction over a set of SPARQL end points.

Let’s suppose you want to find some persons and the capital of the country they live in:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX db: <http://dbpedia.org/ontology/>

SELECT DISTINCT ?person ?first ?last ?home ?capital WHERE {
	?person  rdf:type         foaf:Person.
	?person  foaf:firstName   ?first.
	?person  foaf:family_name ?last.
	?person  foaf:homepage    ?home.
	?person  foaf:based_near  ?country.
	?country rdf:type         db:Country.
	?country db:capital       ?capital.
	?capital rdf:type         db:Place.
}
ORDER BY ?first

Such a query can be answered by combining data from the dog food server and DBpedia. Other data sets may also contain lists of people, but let’s focus on researchers as a start. We have to indicate to eRDF which end points to query; this is done with a simple CSV listing:

Semantic Web Dog Food;

Assuming the query is saved into a “people.sparql” file and the end points list goes into an “endpoints.csv” file, the query engine is called like this:

java -cp nl.erdf.datalayer-sparql-0.1-SNAPSHOT.jar nl.erdf.main.SPARQLEngine -q people.sparql -s endpoints.csv -t 5

The query is first scanned for its basic graph patterns, all of which are grouped and sent to the eRDF optimiser as a set of constraints to solve. Then, eRDF looks for solutions matching as many of these constraints as possible and pushes all the relevant triples found back into an RDF model. After some time (set with the parameter “t”), eRDF is stopped and Jena is used to issue the query over the model that was just populated. The answers are then displayed, along with a list of the data sources that contributed to finding them.
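The trial-and-error idea can be caricatured as follows. This is a deliberately simplified sketch (random bindings scored against the basic graph patterns, best one kept), not the actual eRDF algorithm or its evolutionary operators, and the tiny dataset is invented:

```python
import random

# Tiny in-memory "data source" and two basic graph patterns sharing ?x.
triples = {
    ("alice", "type", "Person"),
    ("alice", "based_near", "nl"),
    ("bob", "type", "Person"),
}
patterns = [("?x", "type", "Person"), ("?x", "based_near", "nl")]
candidates = ["alice", "bob", "nl"]

def score(binding):
    """Number of patterns satisfied when ?x is replaced by the binding."""
    fill = lambda t: tuple(binding if part == "?x" else part for part in t)
    return sum(fill(p) in triples for p in patterns)

# Iterative trial and error: propose random bindings, keep the best one.
random.seed(0)
best = max((random.choice(candidates) for _ in range(20)), key=score)
print(best, score(best))
```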

If you don’t know which end points are likely to contribute to the answers, you can just query all of the WoD and see what happens… ;-)
The package comes with a tool to fetch a list of SPARQL end points from CKAN, test them and create a configuration file. It is called like this:

java -cp nl.erdf.datalayer-sparql-0.1-SNAPSHOT.jar nl.erdf.main.GetEndPointsFromCKAN

After a few minutes, you will get a “ckan-endpoints.csv” allowing you to query the WoD from your laptop.

The source code, along with a package including all the dependencies, is available on GitHub. Please note that this is a first public release of the tool, still in snapshot state, so bugs are expected to show up. If you spot some, report them and help us improve the software. Comments and suggestions are also much welcome :)

The work on eRDF is supported by the LOD Around-The-Clock (LATC) Support Action funded under the European Commission FP7 ICT Work Programme, within the Intelligent Information Management objective (ICT-2009.4.3).

Source: Think Links

The university where I work asks us to register all our publications for the year in a central database [1].  Doing this obviously made me think of doing an ego search on my academic papers. Plus, it’s the beginning of the year, which always seems like a good time to look at these things.

The handy tool Publish or Perish calculates all sorts of citation metrics based on a search of Google Scholar. The tool lets you pick the set of publications to consider. (For example, I left out all the publications from another Paul Groth who’s a professor of architecture at Berkeley.) I did a cursory run-through to remove publications that weren’t mine, but I didn’t spend much time on it, so all the standard disclaimers apply. There may be duplicates, it includes technical reports, etc. For transparency, you can find the set of publications considered in the Excel file here. Also, it’s worth noting that the Google Scholar corpus has its own problems; in particular, it makes you look better. With all that in mind, let’s get to the fun stuff.

My stats as of Jan. 4, 2011 are:

  • Papers: 93
  • Citations: 1318
  • Years: 12
  • Cites/year: 109.83
  • Cites/paper: 14.17/4.0/0
  • Cites/author: 416.35
  • Papers/author: 43.27
  • Authors/paper: 3.04/3.0/2
  • h-index: 21
  • g-index: 34
  • hc-index: 16
  • hI-index: 5.58
  • hI-norm: 11
  • AWCR: 224.17
  • AW-index: 14.97
  • AWCRpA: 70.96
  • e-index: 24.98
  • hm-index: 9.07

You can find the definitions for these metrics here.
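As an illustration of what two of the most common metrics mean, they can be computed from per-paper citation counts like this (the citation numbers below are invented, not my actual publication list):

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    cites = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(cites, start=1) if c >= rank)

def g_index(citations):
    """Largest g such that the top g papers together have >= g^2 citations."""
    cites = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(cites, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

# Illustrative citation counts for five papers.
example = [10, 8, 5, 4, 3]
print(h_index(example), g_index(example))  # 4 papers with >= 4 cites; top 5 have 30 >= 25
```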

What does it all mean? I don’t know :-) I think it’s not half bad.

For comparison, here’s a list of the h-indexes for top computer scientists computed using Google Scholar. All have an h-index of 40 or greater. A quick scan through that list shows that there’s a pretty strong correlation between being a top computer scientist and having a high h-index. Thus, I conclude that I should continue concentrating on being a good computer scientist and the statistics will follow.

[1] I don’t know why my university doesn’t support importing publication information from BibTeX or RIS. Everything has to be added by hand, which takes a bit of time.

    Filed under: academia, meta Tagged: citation metrics, computer science, h-index

    Source: Think Links

    One of the nice things about using cloud services is that sometimes you get a feature that you didn’t expect. Below is a nice set of stats about how well Think Links did in 2010. I was actually quite happy with 12 posts – one post a month. I will be trying to increase the rate of posts this year. If you’ve been reading this blog, thanks, and have a great 2011! The stats are below:

    Here’s a high-level summary of this blog’s overall health:

    Healthy blog!

    The Blog-Health-o-Meter™ reads Fresher than ever.

    Crunchy numbers


    A Boeing 747-400 passenger jet can hold 416 passengers. This blog was viewed about 4,500 times in 2010. That’s about 11 full 747s.


    In 2010, there were 12 new posts, growing the total archive of this blog to 46 posts. There were 12 pictures uploaded, taking up a total of 5 MB. That’s about a picture per month.

    The busiest day of the year was October 13th with 176 views. The most popular post that day was Data DJ realized….well at least version 0.1.

    Where did they come from?

    The top referring sites in 2010 were,,,, and

    Some visitors came searching, mostly for provenance open gov, think links, ready made food, 4store, and thinklinks.

    Attractions in 2010

    These are the posts and pages that got the most views in 2010.


    Data DJ realized….well at least version 0.1 October 2010


    4store Amazon Machine Image and Billion Triple Challenge Data Set October 2009


    Linking Slideshare Data June 2010


    A First EU Proposal April 2010


    Two Themes from WWW 2010 May 2010

    Filed under: meta

    Source: Semantic Web world for you

    A few days ago, I posted about SemanticXO; here you will see how to install a triple store on your XO. These are the steps to follow to compile and install RedStore on the XO, put some triples in it and issue some queries. The following has been tested with an XO-1 running the software 10.1.3 and a MacBook Pro running Arch Linux x64 (it’s not so easy to compile directly on the XO, which is why you will need a secondary machine). All the scripts are available here.

    Installation of RedStore

    RedStore depends on some external libraries that are not yet packaged for Fedora 11, which is used as a base for the operating system of the XO. The script will download and compile all the necessary components. You may, however, need to install external dependencies on your system, such as libxml. That script only takes care of the things RedStore directly depends on, namely raptor2, rasqal and redland (all available here). Here is the full list of commands to issue:

    mkdir /tmp/xo
    cd /tmp/xo
    wget --no-check-certificate

    Once done, you will get four files to copy onto the XO; if you don’t, you can also download this pre-compiled package. These files should be put together somewhere, for instance in “/opt/redstore”. Note that all the data RedStore needs will be put into that same directory. In addition to these 4 files, you’ll need a wrapper script and an init script. Both are available in the source code repository. So, here is what to do on the XO, as root (replacing “cgueret@″ by the login/IP appropriate for you):

    mkdir /opt/redstore
    scp cgueret@ .
    scp cgueret@ .
    scp cgueret@ .
    scp cgueret@ .
    wget --no-check-certificate
    chmod +x
    cd /etc/init.d
    wget --no-check-certificate
    chmod +x redstoredaemon
    chkconfig --add redstoredaemon

    Then you can reboot your XO and enjoy the triple store through its HTTP frontend, available on port 8080 :)

    Loading some triples

    Now that the triple store is running, it’s time to add some triples. The SP2Bench benchmark comes with a tool (sp2b_gen) to generate any number of triples. To begin with, you can generate 50000 triples. That should be about the maximum number of triples an XO will have to deal with later on, when the activities store data in it. Here is what to do, with “″ being the IP of the XO:

    sp2b_gen -t 50000
    rapper -i guess -o rdfxml sp2b.n3 > sp2b.rdf
    curl -T sp2b.rdf ''

    It takes about 43 minutes to upload these 50k triples, which gives an average of 53 milliseconds per triple, or 19 triples per second. That’s not fast, but it should be enough for an API storing a bunch of triples with an acceptable response time. The data takes 4 MB of disk space on the XO for an initial RDF file of about 9.8 MB.

    Issue some queries

    The SP2Bench benchmark comes with a generator for the triples and a set of 17 SPARQL queries expressed over this data. The queries are of varying complexity in order to benchmark different triple stores. Unfortunately, 9 of them were too complex for RedStore on the XO with these 50k triples. These queries were not solved, even after running over a full night! The 8 remaining queries are solved without much trouble, as long as you have enough time to wait for the answer:

    Query file Execution time
    q1.sparql 14229.4 ms
    q2.sparql 44189.2 ms
    q3a.sparql 21506.8 ms
    q3b.sparql 19498.4 ms
    q3c.sparql 19663.9 ms
    q10.sparql 3940.6 ms
    q11.sparql 4685.2 ms
    q12c.sparql 3539.6 ms

    The queries have been executed using the “sparql-query” command line client that way:

    cat q2.sparql | sparql-query -t -p -n

    The long delays may sound like bad news, but it must be noted that this was with 50k triples and with queries designed to be tricky in order to test triple store capabilities. Considering a normal usage with fewer triples and more standard queries, we can expect things to go better.
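For completeness, here is a sketch of how a client forms the kind of SPARQL-over-HTTP request the command line tool sends. The IP address and the “/sparql/” endpoint path are assumptions about a typical RedStore setup, and no request is actually sent here:

```python
from urllib.parse import urlencode

def sparql_url(endpoint, query):
    """Build the GET URL for a SPARQL protocol request against an endpoint."""
    return endpoint + "?" + urlencode({"query": query})

# Hypothetical XO address and endpoint path; adjust to your own setup.
url = sparql_url("http://192.168.0.10:8080/sparql/",
                 "SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(url)
```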

    Source: Semantic Web world for you

    The three XOs received for the project

    The One Laptop Per Child (OLPC) project has provided millions of kids worldwide with a low-cost connected laptop, helping them to enhance their knowledge and develop learning skills. Learning a foreign language, getting an introduction to reading/writing, or preserving/reviving an endangered/extinct language are among the possible usages of these XOs. Such activities could benefit significantly from a storage layer optimised for multilingual and loosely structured data.

    One of the building blocks of the Semantic Web, the “triple store”, is such a data storage service. A triple store is like a database engine optimised to store and provide access to triples: atomic statements binding together a subject, a predicate and an object, for instance <Amsterdam,isLocatedIn,Netherlands>. These two triples would define two different names in two different languages: <Amsterdam,isLocatedIn,”Netherlands”@nl>, <Amsterdam,isLocatedIn,”Pays-Bas”@fr>.
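Language-tagged literals like those can be serialised in N-Triples form as below. This is a minimal sketch mirroring the example above; the example.org namespace is made up for illustration:

```python
def literal_triple(subject, predicate, value, lang):
    """Serialise a triple with a language-tagged literal in N-Triples style."""
    return f'<{subject}> <{predicate}> "{value}"@{lang} .'

# Mirrors the Amsterdam example; the ex: namespace is hypothetical.
EX = "http://example.org/"
lines = [
    literal_triple(EX + "Amsterdam", EX + "isLocatedIn", "Netherlands", "nl"),
    literal_triple(EX + "Amsterdam", EX + "isLocatedIn", "Pays-Bas", "fr"),
]
print("\n".join(lines))
```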

    SemanticXO is a new project from the contributor program aimed at adding a triple store and a front-end API to the XOs’ operating system. This triple store will extend the functionality of Sugar with the possibility for all activities to store loosely structured/multilingual data and to easily connect information across activities. In addition, the SPARQL protocol will allow easy access to the data stored on any device.

    A first goal is to set up RedStore on the XOs allocated to this project. RedStore is a lightweight triple store that should be able to run on low-end hardware and still provide good performance. Stay tuned for the results! ;-)

    Source: Semantic Web world for you

    This is the first post on this blog, aimed at giving and pointing to information about the Semantic Web. The Semantic Web (or Web 3.0) is a new technology and research topic aimed at putting more semantics into the Web as we know it. The changes are happening in a not so visible but very concrete way for you, the user of the Web. On this blog you will learn more about it and how you can benefit from it, whoever you are.

    Source: Think Links

    This has been a great week if you think it’s important to know the origins of content on the web. First, Google announced support for explicit metadata describing the origins of news article content that will be used by Google News. Publishers can now identify, using two tags, whether they are the original source of a piece of news or are syndicating it from some other provider. Second, the New York Times now has the ability to do paragraph-level permalinks. (So this is the link to the third paragraph of an article on Starbucks recycling.) So one can link to the exact paragraph when quoting a piece. This was supported by some other sites as well, and there’s a WordPress plug-in for it, but having the Times support it is big news. Essentially, with a couple of tweaks these techniques could make the quote pattern that you see in blogs (shown below) machine readable.

    In the W3C Provenance Incubator Group, which is just wrapping up, one of the main scenarios was how to support a news aggregator that can make use of provenance to help determine the quality of the articles it automatically creates. With these developments, we are moving one step closer to making this scenario possible.

    To me, this is more evidence that with simple markup, and simple link structures, we can achieve the goal of having machines know where content on the web originates. However, like with a lot of the web, we need to agree on those simple structures so that everyone knows how to properly give credit on the web.

    Filed under: provenance markup Tagged: google news syndication tags, new york times, permalinks, provenance

    Source: Think Links

    Current ways of measuring scientific impact are rather coarse-grained; they often don’t capture the many different ways that science and scientists might have impact. As science is increasingly done online and in the open, new metrics are being created to help measure this impact. Jason Priem, Dario Taraborelli, myself, and Cameron Neylon have recently put out a manifesto outlining a research direction for these new metrics, termed alt-metrics.

    You can read the manifesto here:


    Filed under: academia Tagged: alt-metrics, science impact