News and Updates on the KRR Group

The Botari application from the LarKC project has won the Open Track of the Semantic Web Challenge.

Botari is a LarKC workflow running on servers in Seoul, plus a user frontend that runs on a Galaxy Tab.

The workflow combines open data from the city of Seoul (OpenStreetMap, POIs) with Twitter traffic. It uses stream processing, machine learning and querying over RDF datasets and streams to give personalised restaurant information and recommendations, presented in an augmented reality interface on the Galaxy Tab.

For more info on Botari, see the website, the demo movie, the slide deck or the paper.


Source: Semantic Web world for you

Over the last couple of years, we have engineered a fantastic data sharing technology based on open standards from the W3C: Linked Data. Using Linked Data, it is possible to express knowledge as a set of facts and connect those facts together to build a network. Having such networked data openly accessible is a source of economic and societal benefits. It enables sharing data in an unambiguous, open and standard way, just as the Web enabled document sharing. Yet the way we designed it deprives the majority of the world’s population of the ability to use it.

Doing “Web-less” Linked Data?

The problem may lie in the fact that Linked Data is based on Web technologies, or in the fact that Linked Data has been designed and engineered by individuals with easy access to the Web, or maybe in a combination of both. Nowadays, Linked Data is synonymous with a cloud-hosted data storage service, a set of (web-based) applications to interact with this service, and the infrastructure of the Web. As a result, if you don’t have access to this Web infrastructure, you cannot use Linked Data. That is a pity, because an estimated 4.5 billion people don’t have access to it for various reasons (lack of infrastructure, cost of access, literacy issues, …). Wouldn’t it be possible to adjust our design choices to ensure they could also benefit from Linked Data, even if they don’t have the Web? The answer is yes, and the best news is that it wouldn’t be that hard either. But for it to happen, we need to adapt both our mindset and our technologies.

Changing our mindset

We have a tendency to think that any data sharing platform is a combination of a cloud-based data store, some client applications to access the data, and forms to feed new data into the system. This is not always applicable, as central hosting of data may not be possible, or access to it from client applications may not be guaranteed. We should also think of the part of the world that is illiterate and for which Linked Data, and the Web, are not accessible. In short, we need to think de-centralised, small and vocal in order to widen access to Linked Data.

Think de-centralised

Star-shaped networks can be hard to deploy. They imply setting up a central producer of resources somewhere and connecting all the clients to it. Electricity networks have already found a better alternative: microgrids. Microgrids are small networks of producers/consumers (the “prosumers”) of electricity that manage electricity needs locally. We could, and should, copy this approach to manage local data production and consumption. For example, think of a decentralised DBpedia whose content would be the aggregation of several data sources, each producing part of the content: most likely, the content that is locally relevant to them.

Think small

Big servers require more energy and more cooling. They usually end up racked into big cabinets that in turn are packed into cooled data centers. These data centers need to be big in order to cope with scale. Thinking decentralised allows us to think small, and we need to think small to provide alternatives to data centers where these are not available. As content production and creation become decentralised, several small servers can be used. To continue the analogy with microgrids, we can call these small servers, which take care of locally relevant content, “micro-servers”.

Think vocal

Unfortunately, not everyone can read and type. In some African areas, knowledge is shared through vocal channels (mobile phones, meetings, …) because there is no alternative. Knowledge exchanged that way cannot be captured using form-based data acquisition systems. We need to think of exploiting vocal conversation through Text To Speech (TTS) and Automatic Speech Recognition (ASR) rather than staying focused on forms.

Changing our technologies

Changing mindsets is not enough. If we aim at stripping the Web out of Linked Data, we also need to pay attention to our technologies and adapt them. In particular, there are 5 upcoming challenges that can be phrased as research questions:

  1. Dereferenceability: How do you get a route to the data if you want to avoid using the routing system provided by the Web? For instance, how do you dereference a host-name based URI if you don’t have access to the DNS network?
  2. Consistency: In a decentralised setting where several publishers each produce part of a common data set, how do you ensure URIs are re-used and non-colliding? There is a chance that two different producers would use the same URI to describe different things.
  3. Reliability: Unlike centrally hosted data servers, micro-servers cannot be expected to provide 99% availability. They may go on and off unexpectedly. The first thing to find out is whether that is an issue or not. The second is, if we should ensure their data remains available, how do we achieve this?
  4. Security: This is also related to having a swarm of micro-servers serving a particular dataset. If any micro-server can produce a chunk of that dataset, how do you prevent a spammer from getting in and producing falsified content? If we want to avoid centralised networks, authority-based solutions such as Public Key Infrastructure (PKI) are not an option. We need to find decentralised authentication mechanisms.
  5. Accessibility: How do we make Linked Data accessible to those who are illiterate? As highlighted earlier, not everyone can read and write, but illiterate persons can still talk. We need to take more of the vocal technologies into account in order to make Linked Data accessible to them. We can also investigate graphics-based data acquisition techniques with visual representations of information.
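
To make the consistency question (challenge 2) a bit more concrete, here is a minimal Python sketch of one way publishers could mint URIs that do not collide without asking a central authority. It is an illustration only, not something taken from the paper or from SemanticXO: the "urn:x-microserver:" prefix, the publisher identifiers and both function names are hypothetical. It also only covers the "non-colliding" half of the question; agreeing on URIs to re-use remains open.

```python
import hashlib
import uuid


def mint_uri(publisher_id: str, local_name: str) -> str:
    """Mint a URI inside a namespace derived from the publisher's own identifier.

    Hypothetical scheme: the namespace is a hash of the publisher identifier,
    so two publishers using the same local name still get different URIs.
    """
    digest = hashlib.sha256(publisher_id.encode("utf-8")).hexdigest()[:16]
    return f"urn:x-microserver:{digest}:{local_name}"


def mint_random_uri() -> str:
    """Alternative: a random UUID URN, which needs no publisher identifier at all."""
    return uuid.uuid4().urn


# Two publishers describing "pupil/1" end up with distinct URIs,
# while each publisher always gets the same URI back for the same name.
uri_a = mint_uri("school-a", "pupil/1")
uri_b = mint_uri("school-b", "pupil/1")
```

Neither variant relies on DNS or on a registry, which is the point of the exercise; the trade-off is that such URIs say nothing about where the data can be fetched, which is the dereferenceability question again.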

More about this

Stefan Schlobach gave a presentation on this topic at ISWC 2011.

You are also invited to read the associated paper “Is data sharing the privilege of a few ? Bringing Linked Data to those without the Web” and check out two projects working on the mentioned challenges: SemanticXO and Voices.

The LarKC project’s development team would like to announce a new release (v.3.0) of the LarKC platform, which is available for downloading here. The new release is a considerable improvement of the previous release (v.2.5), with the following distinctive features: PLATFORM New (plain) plug-in registry light-weight plug-in loading and thus very low platform’s start-up time […]

If you liked WebPIE, you’ll also like QueryPIE

WebPIE performed forward inference over up to 100 billion triples (yes, that’s 10^11). Our about-to-be-published QueryPIE can do on the fly backward-chaining inference at query-time, over a billion triples, in milliseconds, on just 8 parallel machines.

Last year, Jacopo Urbani and co-authors from the LarKC team broke the speed record for forward chaining inference over OWL-Horst, computing the complete closure over 100 billion triples in a number of hours using a MapReduce/Hadoop implementation on a medium-sized cluster. The performance of WebPIE [see conference and journal paper] is:

  • 1 billion FactForge triples in 1.5 hours on 32 compute nodes
  • 24 billion Bio2RDF triples in 10 hours on 32 compute nodes
  • 100 billion LUBM triples in 15 hours on 64 compute nodes
  • deriving anywhere between 150K-650K triples per second, depending on the dataset
  • runtime growing linearly with number of triples
  • speedup growing linearly with the number of compute nodes

Now, a year later, we’re breaking another speed record, but this time for “backward chaining”: not doing all inferencing up front, but doing the inferencing “on the fly”, at query time, as and when the inferences are needed.

Until now, backward-chaining was considered infeasible on very large realistic data, since it would slow down the query response time too much. Our paper at ISWC this year shows it’s not all that impossible: on different real-life datasets of up to 1 billion triples, QueryPIE can do on the fly backward-chaining inference at query-time, implementing the full OWL Horst fragment with response times in milliseconds on just 8 machines.
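
For readers who have not met the two strategies before, the difference can be sketched in a few lines of Python. This is a toy, single-machine illustration that assumes just two RDFS-style rules (subclass transitivity and type inheritance). It is not the WebPIE or QueryPIE code and covers nowhere near the full OWL Horst fragment; the function names and the example triples are made up for the illustration.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def forward_closure(triples):
    """Forward chaining: derive every consequence up front, until nothing new appears."""
    closure = set(triples)
    while True:
        derived = set()
        for (a, p1, b) in closure:
            for (c, p2, d) in closure:
                if b != c or p2 != SUBCLASS:
                    continue
                if p1 == SUBCLASS:
                    derived.add((a, SUBCLASS, d))  # subclass transitivity
                elif p1 == TYPE:
                    derived.add((a, TYPE, d))      # type inheritance
        if derived <= closure:
            return closure
        closure |= derived


def holds_type(triples, individual, cls, _seen=None):
    """Backward chaining: start from the goal and look only at triples that could prove it."""
    seen = set() if _seen is None else _seen
    if cls in seen:  # guard against cyclic subclass hierarchies
        return False
    seen.add(cls)
    if (individual, TYPE, cls) in triples:
        return True
    # Goal "individual is a cls" also holds if it holds for some direct subclass of cls.
    return any(
        holds_type(triples, individual, sub, seen)
        for (sub, p, sup) in triples
        if p == SUBCLASS and sup == cls
    )


data = {
    ("pizzeria1", TYPE, "Pizzeria"),
    ("Pizzeria", SUBCLASS, "Restaurant"),
    ("Restaurant", SUBCLASS, "Place"),
}
closure = forward_closure(data)                    # materialises all derived triples
answer = holds_type(data, "pizzeria1", "Place")    # derives only what this query needs
```

Forward chaining pays the whole cost once and makes every later query a plain lookup; backward chaining stores nothing extra but has to do the reasoning inside the query, which is why response time is the number to watch.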

All code available at


Source: Semantic Web world for you

With the last post about SemanticXO dating back to April, it’s time for an update, isn’t it? ;-)

A lot of things happened since April. First, a paper about the project was accepted for presentation at the First International Conference on e-Technologies and Networks for Development (ICeND2011). Then, I spoke about the project during the symposium of the Network Institute as well as during the SugarCamp #2. Lastly, a first release of a triple-store powered Journal is now available for testing.


The paper entitled “SemanticXO : connecting the XO with the World’s largest information network ” is available from Mendeley. It explains what the goal of the project is and then reports on some performance assessment and a first test activity. Most of the information it contains has actually been blogged before on this blog (cf. there and there) but if you want a global overview of the project, this paper is still worth a read. The conference itself was very nice and I did some networking. I came back with a lot of business cards and the hope of keeping in touch with the people I met there. The slides from the presentation are available from SlideShare.


On May 10, the Network Institute of Amsterdam organised a one-day symposium to strengthen the ties between its members and to stimulate further collaboration. This institute is a long-term collaboration between groups from the Department of Computer Science, the Department of Mathematics, the Faculty of Social Sciences and the Faculty of Economics and Business Administration. I presented a poster about SemanticXO and an abstract went into the proceedings of the event.

More recently, I spent the 10th and 11th of September in Paris for the Sugar Camp #2 organised by OLPC France. Bastien found me a bit of time on Sunday afternoon to re-do the presentation from ICeND2011 (thanks again for that!) and get some feedback. This was a very well organised event held at a cool location (“La cité des sciences”). It was also the first time I met so many other people working on Sugar, and I could finally put some faces to the names I had seen so many times on the mailing lists and in the Git logs :)

First SemanticXO prototype

The project development effort is split into 3 parts: a common layer hiding the complexity of SPARQL, a new implementation of the journal datastore, and the coding of diverse activities making use of the new semantic capabilities. All three are progressing more or less in parallel, at different speeds, since, for instance, the work on activities directs what the common layer will contain. I’ve focused my efforts on the journal datastore to get something ready to test. It’s a very first prototype, coded by starting from the stock datastore 0.92 and replacing the part in charge of the metadata. The code taking care of the files remains the same. This new datastore is available from Gitorious, but because installing the triple store and replacing the journal is a tricky manual process, I bundled all of that ;-)


The installation bundle consists of two files, a “semanticxo.tgz” and a script ““. To install SemanticXO, you need to download the two and put them in the same location somewhere on your machine and then type (as root):

sh ./ setup

This will install a triple store, add it to the daemons to start at boot time and replace the default journal by one using the triple store. Be careful to have backups if needed as this will remove all the content previously stored in the journal! Once the script has been executed, reboot the machine to start using the new software.

The bundle has been tested on an XO-1 running the software release 11.2.0 but it should work on any software release on both the XO-1 and XO-1.5. This bundle won’t work on the XO-1.75 as it contains a binary (the triple store) not compiled for ARM.

What now?

Now that you have the thing installed, open the browser and go to “″. You will see the web interface of the triple store, which allows you to run SPARQL queries and see which named graphs are stored. If you are not fluent in SPARQL, the named graph interface is the most interesting part to play with. Every entry in the journal gets its own named graph, so after populating the journal with some entries you will see the list of named graphs grow. Click on one of them and the content of the journal entry will be displayed. Note that this web interface is also accessible from any other machine on the same network as the XO. This yields new opportunities in terms of backup and information gathering: a teacher can query the journal of any XO directly from a school server, or from another XO.
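
If you would rather script this than click around, the following Python sketch shows what listing the named graphs from another machine could look like. It assumes the triple store exposes a standard SPARQL endpoint over HTTP and can return results in the SPARQL JSON results format, which I have not verified for this bundle; the endpoint address used below is a placeholder, not the real address of the XO, and the helper names are made up.

```python
import json
from urllib.parse import urlencode

# Standard SPARQL query returning the name of every non-empty named graph.
LIST_GRAPHS_QUERY = "SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } }"


def build_query_url(endpoint, query=LIST_GRAPHS_QUERY):
    """Build a SPARQL-protocol GET URL: the query goes in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def graph_names(results_json):
    """Extract the graph names from a SPARQL JSON results document."""
    data = json.loads(results_json)
    return [binding["g"]["value"] for binding in data["results"]["bindings"]]


# Placeholder endpoint; replace it with the address of the triple store on your XO.
url = build_query_url("http://xo.example/sparql")

# Fetching would then look roughly like this (not executed here):
#   from urllib.request import Request, urlopen
#   request = Request(url, headers={"Accept": "application/sparql-results+json"})
#   with urlopen(request) as response:
#       print(graph_names(response.read().decode("utf-8")))
```

Since every journal entry is one named graph, the list returned by graph_names is effectively the list of journal entries on that XO.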


The patch script also comes with a remove function in case you want to revert the XO to its original setup. To use it, simply type (as root):

sh ./ remove

and then reboot the machine.

Data2Semantics aims to provide essential semantic infrastructure for bringing e-Science to the next level.

A core task for scientific publishers is to speed up scientific progress by improving the availability of scientific knowledge. This holds both for dissemination of results through traditional publications, as well as through the publication of scientific data. The Data2Semantics project focuses on a key problem for data management in e-Science:

How to share, publish, access, analyse, interpret and reuse data?

Data2Semantics is a collaboration between the VU University Amsterdam, the University of Amsterdam, Data Archiving and Networked Services (DANS) of the KNAW, Elsevier Publishing and Philips, and is funded under the COMMIT programme of the NL Agency of the Dutch Ministry of Economic Affairs, Agriculture and Innovation.


 by Zhisheng Huang

The China Higher Education Press will publish a LarKC book in Chinese. This book will appear in the book series of Web Intelligence and Web Science.

This Chinese LarKC book consists of two parts: a technology part and an application part. The technology part covers the LarKC platform, the development guide, and various plugins and workflows. The application part covers Linked Life Data, semantic information retrieval, urban computing, and cancer study. The main contributors to the book are six Chinese researchers in the LarKC Consortium, who are from Amsterdam, WICI, and Siemens. See the outline below for details.
The book is expected to be published by the end of this year.

Here is the outline of the book content and the main contributors.

Chapter 1 Introduction to LarKC
by Zhisheng Huang (VUA) and Ning Zhong (WICI)

Chapter 2 LarKC Platform
by Jun Fang (VUA)

Chapter 3 Identification and Selection
by Yi Zeng (WICI)

Chapter 4 Abstraction and Transformation
by Yi Huang (SIEMENS)

Chapter 5 Reasoning and Deciding
by Jun Fang (VUA) and Zhisheng Huang (VUA)

Chapter 6 LarKC Development Guide
by Zhisheng Huang (VUA) and Jun Fang (VUA)

Chapter 7 Linked Life Data
by Yi Huang (SIEMENS) and Zhisheng Huang (VUA)

Chapter 8 Semantic information retrieval for biomedical applications
by Ru He (SIEMENS) and Zhisheng Huang (VUA)

Chapter 9 Semantic Technology and Gene Study
by Zhisheng Huang (VUA)

Chapter 10 Urban Computing
by Yi Huang (SIEMENS) and Zhisheng Huang (VUA)

Chapter 11 Conclusions
by Zhisheng Huang (VUA), Ru He (SIEMENS), and Ning Zhong (WICI)

The LarKC folk at the German High Performance Computing Centre in Stuttgart did a rather nice write-up on LarKC from a high-performance computing perspective, intended for their own community. Find the relevant pages here.


A LarKC workflow for traffic-aware route-planning has won the 1st prize in the AI Mashup Challenge at the ESWC 2011 conference, held this week on Crete.

The detail of “Traffic_LarKC” can be found at, but in brief:

Four different datasets are used:

  • the traffic sensors data, obtained from Milano Municipality
  • the Milano street topology
  • historical weather data from the Italian website
  • calendar information (week days and week-end days, holidays, etc.) from Milano Municipality and from the Mozilla Calendar project.

These are used in a batch-time workflow to predict the traffic situation over the next two hours and in a runtime workflow to respond to route-planning queries from users.
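
The split between the two workflows can be pictured with a small Python sketch. It is a rough illustration of the idea only, not the Traffic_LarKC implementation: the street graph, the congestion factors and both function names are invented, and the real prediction step works from sensor, weather and calendar data rather than from a hand-written factor.

```python
import heapq


def predict_travel_times(free_flow_minutes, congestion_factor):
    """Batch-time step: scale each street segment's free-flow time by its predicted congestion."""
    return {
        edge: minutes * congestion_factor.get(edge, 1.0)
        for edge, minutes in free_flow_minutes.items()
    }


def fastest_route(travel_times, start, goal):
    """Runtime step: Dijkstra's algorithm over the precomputed, predicted travel times."""
    graph = {}
    for (u, v), minutes in travel_times.items():
        graph.setdefault(u, []).append((v, minutes))
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, minutes in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(queue, (cost + minutes, nxt, path + [nxt]))
    return None  # goal not reachable from start


# Made-up one-way street segments with free-flow times in minutes.
streets = {("A", "B"): 5, ("B", "C"): 5, ("A", "C"): 8}
# Made-up prediction: the direct segment A -> C will be twice as slow.
predicted = predict_travel_times(streets, {("A", "C"): 2.0})
route = fastest_route(predicted, "A", "C")
```

With the made-up prediction, the detour through B becomes the faster route. The expensive prediction is done ahead of time, so answering a user query is just a shortest-path search over numbers that are already there.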

This LarKC workflow shows that Linked Open Data and the corresponding technologies are now getting good enough to compete with what’s possible in closed commercial systems.

Congratulations to the entire team that has made this possible!

LarKC traffic demo

The LarKC development team is proud to announce the new release V2.5 of the LarKC platform. The new release is a considerable improvement over the previous V2.0 edition, with the following distinctive features:

  • V2.5 is fully compliant with the LarKC final architecture. You can now develop your workflows and plugins, and be assured that future updates won’t change the main APIs.
  • The Management Interface, which makes it possible to run LarKC from your browser, has an updated RESTful implementation. Besides RDF/XML, workflows can now be described in very readable N3 notation.
  • The endpoint for submitting queries to LarKC is now user-definable, and multiple endpoints are supported.
  • The Plug-in Registry has been improved, and is now coupled with the browser-based Management Interface
  • LarKC now uses a Maven-based build system, giving improved version and dependency management, and a simplified procedure for new plug-in creation
  • A number of extra tools have been introduced to make life for LarKC users a lot easier. Besides the Management Interface to run LarKC from your browser, V2.5 also contains:
    • A WYSIWYG Workflow Designer tool that allows you to construct workflows by drag-and-drop, right from your browser: click on some plugins, drag them to the workspace, click to connect them, and press run! (see screenshot below).
    • An updated plug-in wizard for Eclipse.
  • We have thoroughly updated the distributed execution framework. Besides deploying LarKC plugins through Apache (simply by dropping them in your Apache folder), it is now also possible to deploy plugins through JEE (for webservers) or GAT (for clusters).
  • The WYSIWYG Workflow Designer allows you to specify remote execution of a plugin simply by connecting a plugin to a remote host. Templates are provided for such remote host declaration.
  • LarKC now takes care of advanced data caching for plug-ins
  • V2.5 comes with extended and improved JUnit tests
  • Last but not least, we have considerably improved documentation and user manuals, including a quick-start guide, tutorial materials and example workflows.

The release can be downloaded from
The platform’s manual is available at

Bugs can be submitted using the bug tracker at

As usual, you are encouraged to use the discussion forums and mailing lists served by the LarKC@SourceForge development environment.
please see at

LarKC Workflow Editor