News and Updates on the KRR Group

Source: Think Links

Last week, I was at a seminar on Semantic Data Management at Dagstuhl. A month ago I was at Dagstuhl discussing the principles of provenance. You can read more about the atmosphere and style of a Dagstuhl event in my post on the provenance event. From my perspective, it’s pretty cool to get invited to multiple Dagstuhl events in short succession… I think it just happens that two of my main research areas overlap and were scheduled in the same time period.


Obligatory Dagstuhl Group Photo

Indeed, one of the topics for discussion at the seminar was provenance. The others were scalability, dynamicity, and search. The organizers (Elena Simperl, Karl Aberer, Grigoris Antoniou, Oscar Corcho and Rudi Studer) will put together a report summarizing all the outcomes. What I want to do here is focus on the key points that I took away from the seminar.

Scaling semantic data management = scaling graph databases

There was some discussion around what it means to scale in terms of semantic data management. For the most part this boiled down to: what does it mean to scale RDF databases? The organizers did a good job of bringing in members of industry who have actual experience building scalable RDF systems. The first day contained some great discussion about the guts of databases and what makes scaling hard – issues such as the latency of storage infrastructure and what the right join algorithms are. Steve Harris brought out the difficulty of backup and restore in real-world systems and the lack of research in that area. But my primary feeling was that the challenges of scalability are ones of how we deal with large graphs. In my own work on Open PHACTS, I’ve seen how using graphs has increased our flexibility but challenged us in terms of scalability.

Dealing with large graphs is hard, but I think the Semantic Web community can lead the way here because we have a nice substrate, namely, an exchange model for graphs (RDF) and a common query language (SPARQL).
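To make that substrate concrete, here is a minimal sketch (the graph and namespace are made up for illustration) of exchanging a tiny graph in Turtle and querying it with standard SPARQL, using Python’s rdflib:

```python
from rdflib import Graph

# A tiny, made-up RDF graph in Turtle -- the exchange model.
TURTLE_DATA = """
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob .
ex:bob   ex:knows ex:carol .
"""

g = Graph()
g.parse(data=TURTLE_DATA, format="turtle")

# The common query language: a standard SPARQL query over the graph.
QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?a ?b WHERE { ?a ex:knows ?b . }
"""

for row in g.query(QUERY):
    print(row.a, "knows", row.b)
```

Any store that speaks these two standards can swap in underneath. This leads to the next point: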

Benchmarks! Benchmarks! Benchmarks!

Throughout the week there was discussion of the need for all types of benchmarks. LUBM and BSBM have served us well, but we need better benchmarks: more and different types of queries, more realistic datasets, configurable benchmarks, etc. There were also discussions of other types of benchmarks, for example, a provenance corpus or a corpus that combines structured and unstructured data for ranking. One question I heard regarding benchmarks is: where should you publish them? Unlike the IR community, we don’t have something like TREC, although I think USEWOD is a good example of bootstrapping this sort of activity.
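To sketch what “configurable benchmarks” might mean in practice, here is a small hypothetical example (the query template and parameters are invented, not taken from any existing benchmark) that generates a parameterised SPARQL query mix:

```python
import random

# A hypothetical parameterised query template. Benchmarks like LUBM and
# BSBM ship fixed query mixes; a configurable benchmark would let you
# vary query shape, selectivity, and result size like this.
TEMPLATE = """
PREFIX ex: <http://example.org/>
SELECT ?s WHERE {{
    ?s ex:{predicate} ?o .
    FILTER(?o > {threshold})
}}
LIMIT {limit}
"""

def generate_query_mix(n, predicates, seed=42):
    """Generate n queries with varying predicates, selectivity, and size."""
    rng = random.Random(seed)  # fixed seed keeps benchmark runs reproducible
    return [
        TEMPLATE.format(
            predicate=rng.choice(predicates),
            threshold=rng.randint(0, 100),
            limit=rng.choice([10, 100, 1000]),
        )
        for _ in range(n)
    ]

for q in generate_query_mix(3, ["price", "rating", "stock"]):
    print(q)
```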

Let’s just be uncertain

One of the cross-cutting themes of the symposium was the need to deal with uncertainty. From crawled data, to information extraction systems, to even data created by classic knowledge capture, there is a need to express and use uncertainty. In the area of provenance, I was impressed by Martin Theobald’s URDF system, which deals with both uncertain data and uncertain rules.

One major handicap RDF systems have is that reification lets you associate confidence values with statements, but it is extremely verbose. At the symposium, Bryan Thompson and Orri Erling led the way in constructing a proposal to expose statement-level identifiers that are compatible with reification. Olaf Hartig even worked out an approach that makes this compatible with SPARQL semantics. I’m looking forward to seeing their final proposal. This should make it much easier to associate uncertainty and other evidence-related information with triples.
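To see why standard reification is so verbose, here is a sketch (the namespaces and the confidence property are made up) of attaching a single confidence value to one triple with rdflib – it takes four extra bookkeeping triples just to give the statement an identity:

```python
from rdflib import Graph, Namespace, Literal, BNode
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/")
g = Graph()

# The statement we want to annotate.
g.add((EX.drugA, EX.interactsWith, EX.drugB))

# Standard RDF reification: four triples just to name the statement,
# plus one more for the actual confidence value.
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.drugA))
g.add((stmt, RDF.predicate, EX.interactsWith))
g.add((stmt, RDF.object, EX.drugB))
g.add((stmt, EX.confidence, Literal(0.85, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```

A statement-level identifier, by contrast, would let the triple itself be the subject of the confidence assertion, collapsing those four bookkeeping triples away entirely.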

One final thing to say is that these discussions made me glad that attributes are included in the PROV model. They provide an important hook for this kind of uncertainty information.
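As a hedged sketch of what that hook could look like (the ex:confidence property and the resource names are my own invention, not part of PROV), attributes let you hang uncertainty metadata directly on a provenance record:

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("prov", PROV)

# A PROV entity describing an extracted fact; ex:confidence is a
# made-up application-specific attribute -- PROV's attribute mechanism
# is exactly the hook for carrying this kind of extra metadata.
g.add((EX.extractedFact, RDF.type, PROV.Entity))
g.add((EX.extractedFact, PROV.wasGeneratedBy, EX.extractionRun42))
g.add((EX.extractedFact, PROV.wasAttributedTo, EX.ieSystem))
g.add((EX.extractedFact, EX.confidence, Literal(0.7, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```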

Crowdsourcing is a component

There was quite a lot of talk about integrating crowdsourcing into the data management stack (see Wolf-Tilo Balke’s work). It’s clear that crowdsourcing is an important component when we design semantic data management systems. Just as ontology engineers are boxes in many of our architectures, maybe the crowd should be there by default as well.
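As a hedged sketch of what “the crowd as a default box in the architecture” could mean (this design is my own illustration, not a system discussed at the seminar), a pipeline might route statements the automated components are unsure about to a crowdsourcing queue rather than silently accepting or dropping them:

```python
def route_statement(statement, confidence, crowd_queue, threshold=0.8):
    """Accept high-confidence statements automatically; send the rest
    to human verification via a crowdsourcing task queue."""
    if confidence >= threshold:
        return "accept"
    crowd_queue.append({"task": "verify", "statement": statement})
    return "pending-crowd"

queue = []
print(route_statement(("ex:drugA", "ex:interactsWith", "ex:drugB"), 0.55, queue))
print(queue)
```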

Provenance – get it out – we’re ready

Beyond being a discussant in the conversation, I also gave an intro to provenance research based on the categorization of content, management and use produced by the Provenance Incubator. Luc Moreau, Olaf Hartig and Paolo Missier gave a walkthrough of the PROV spec coming from the W3C. We had some interesting technical feedback, but the general impression I got was: it looks pretty good, get it out there, this is something we need and can use – now.

For example, I had discussions with Manuel Salvadores about using PROV as the ontology for describing provenance in BioPortal. Satya S. Sahoo (a working group member) is extending PROV for capturing provenance in sleep studies. There was discussion of connecting PROV with the Semantic Sensor Network ontology. As with other Semantic Web standards, PROV will provide the basis for both applications and future research. It’s now up to us as a working group to get these documents out.
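As a hedged illustration of the kind of description PROV enables (the ontology-repository resource names here are hypothetical), a few PROV-O triples can already say who derived what from what:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("prov", PROV)

# Hypothetical example: a new ontology version, the curation activity
# that produced it, and the agent responsible for that activity.
g.add((EX.ontologyV2, RDF.type, PROV.Entity))
g.add((EX.ontologyV2, PROV.wasDerivedFrom, EX.ontologyV1))
g.add((EX.ontologyV2, PROV.wasGeneratedBy, EX.curationActivity))
g.add((EX.curationActivity, RDF.type, PROV.Activity))
g.add((EX.curationActivity, PROV.wasAssociatedWith, EX.curator))
g.add((EX.curator, RDF.type, PROV.Agent))

print(g.serialize(format="turtle"))
```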

Embracing other communities

I think the community as a whole has been doing a good job of embracing other communities. This has been shown by those working on RDF stores, who have embraced the database community. Also, in semantic search there is a good conversation bridging the IR community and the database field; interestingly, semantic search is really a driver of that conversation. I learned about a good survey paper by Thanh Tran and Peter Mika at Dagstuhl – highly recommended.

Federation is a spectrum

There was lots of talk about federation at the symposium. My general impression is that federation is not something we can simply say yes or no to. Instead, different applications will require different kinds of federation. I think there is lots of room to research how we can systematically place systems on the federation spectrum: if I come with a series of requirements, where and how should I include federation in my data management scheme? For example, I may want to trade off computational overhead for space, as suggested by Olaf Hartig in his Link Traversal Based Query Execution approach (i.e. follow your nose). This caused some of the most entertaining discussions at the symposium. Should you need a data center to query the Web of Data? Let’s find out.
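As a hedged sketch of the follow-your-nose idea (a naive outline of the general technique, not Hartig’s actual algorithm; the seed URI is hypothetical), a link-traversal engine grows its dataset by dereferencing the URIs it encounters, then evaluates the query over whatever it has gathered:

```python
from rdflib import Graph, URIRef

def link_traversal_query(seed_uris, sparql, max_hops=2):
    """Naive follow-your-nose: dereference URIs reachable from the
    seeds, merge the RDF that comes back, then query the result."""
    g = Graph()
    seen = set()
    frontier = list(seed_uris)
    for _ in range(max_hops):
        next_frontier = []
        for uri in frontier:
            if uri in seen:
                continue
            seen.add(uri)
            try:
                g.parse(uri)  # HTTP dereference + parse the RDF served there
            except Exception:
                continue  # dead links and non-RDF responses are common
            # Any URI in object position is a candidate to follow next.
            next_frontier.extend(
                o for o in g.objects() if isinstance(o, URIRef)
            )
        frontier = next_frontier
    return g.query(sparql)

# Hypothetical usage over a made-up seed:
# results = link_traversal_query(
#     [URIRef("http://example.org/alice")],
#     "SELECT ?p ?o WHERE { <http://example.org/alice> ?p ?o }")
```

Whether you pre-load a data center or traverse at query time is exactly the space-versus-overhead trade-off the discussion was about.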

Conclusion

I think the report coming from this symposium will provide a good document sketching out the research challenges in semantic data management for the next several years. I’m looking forward to it. I’ll end with a quote from a slide in José Manuel Gomez-Perez’s talk: according to the IDC 2011 Digital Universe study, metadata is the fastest growing data category.

There’s demand for the work we are doing and there are many challenges remaining – this promises to be a fun couple of years.

Filed under: academia, linked data Tagged: dagstuhl, research challenges, semantic data management

Source: Semantic Web world for you
The Institute of Development Studies (IDS) is a UK-based institute specialised in development research, teaching and communications. As part of their activities, they provide an API to query their knowledge services data set comprising more than 32k abstracts or summaries of development research documents related to 8k development organisations, almost 30 themes, and 225 countries and territories.

A month ago, Victor de Boer and I got a grant from IDS to investigate exposing their data as RDF and building some client applications making use of the enriched data. We aimed to use the API as-is and create 5-star Linked Data by linking the created resources to other resources on the Web. The outcome is the IDSWrapper, which is now freely accessible both as HTML and as RDF. Although this is still work in progress, the wrapper already shows some of the advantages of publishing the data as Linked Data.
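As a hedged sketch of the wrapper pattern (the endpoint URL, JSON field names, and namespaces below are hypothetical, not the actual IDS API), the idea is to call the existing API unchanged and re-expose each record as RDF, with links out to external datasets for the fifth star:

```python
import requests
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDFS, OWL

EX = Namespace("http://example.org/ids/")  # hypothetical wrapper namespace

def wrap_document(doc_id):
    """Fetch one record from a hypothetical JSON API and re-expose it
    as RDF, including a link to an external resource."""
    resp = requests.get(f"http://api.example.org/documents/{doc_id}")
    record = resp.json()  # assumed fields: title, country_code, geonames_uri

    g = Graph()
    doc = EX[f"document/{doc_id}"]
    g.add((doc, RDFS.label, Literal(record["title"])))
    country = EX[f"country/{record['country_code']}"]
    g.add((doc, EX.coverage, country))
    # Linking our country resource to an external dataset (e.g. Geonames)
    # is what lifts 4-star RDF to 5-star Linked Data.
    g.add((country, OWL.sameAs, URIRef(record["geonames_uri"])))
    return g

# print(wrap_document("A12345").serialize(format="turtle"))
```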

Source: Data2Semantics
The Data2Semantics project is part of the COMMIT/ research community. We attended the COMMIT/ kick-off meeting, where we presented our project, networked with the rest of the 15 projects, and learned about presenting our work to the broader community. Paul wrote up his thoughts on the kick-off, which you can find here. The whole team [...]

On 27.03.2012, 04.00pm CET, the LOD2 project (http://lod2.eu) will offer the next free one-hour webinar, on LIMES. LIMES is a tool providing time-efficient and lossless discovery of links across knowledge bases. It is an extensible declarative framework that encapsulates manifold algorithms dedicated to the processing of structured data of any sort. Built with extensibility and easy integration in mind, LIMES allows implementing applications that integrate, consume and/or generate Linked Data. Within LOD2, it will be used for discovering links between knowledge bases.
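To illustrate the underlying task (a generic sketch of threshold-based link discovery, not LIMES’s actual API or algorithms), link discovery compares resources across two knowledge bases and emits link candidates for sufficiently similar pairs:

```python
from difflib import SequenceMatcher

def discover_links(source, target, threshold=0.9):
    """Generic link discovery sketch: compare labels across two
    knowledge bases and emit (s, owl:sameAs, t) candidates above a
    similarity threshold. Real frameworks avoid this naive O(n*m)
    comparison with pruning -- that is where the 'time-efficient and
    lossless' part comes in."""
    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            sim = SequenceMatcher(None, s_label.lower(), t_label.lower()).ratio()
            if sim >= threshold:
                links.append((s_uri, "owl:sameAs", t_uri))
    return links

# Made-up toy knowledge bases:
kb_a = {"http://example.org/a/Berlin": "Berlin"}
kb_b = {"http://example.org/b/city/3": "berlin"}
print(discover_links(kb_a, kb_b))
```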

This webinar will be presented by the LOD2 Partner: University of Leipzig (ULEI), Germany

The LOD2 webinar series is powered by the LOD2 project and is organised and produced by the Semantic Web Company (Austria). If you are interested in Linked (Open) Data principles and mechanisms, LOD tools & services, and concrete use cases that can be realised using LOD, then join us in the LOD2 webinar series!

When : 27.03. 2012, 04.00pm – 05.00pm CET
Information & Registration: https://www2.gotomeeting.com/register/369667514

The LOD2 team is looking forward to meeting you at the webinar! All the best and have a nice day!


Source: Semantic Web world for you
On March 16, 2012, the European Public Sector Information Platform organised the ePSIplatform Conference 2012 on the theme “Taking re-use to the next level!”. It was a very well organised and interesting event, and also a good opportunity to meet new people and put faces to the names seen in emails and during teleconferences :-)

The program was intense: 3 plenary sessions, 12 break-out sessions, and project presentations during the lunch break. That was a lot to talk about and a lot to listen to. I left Rotterdam with a number of take-away messages and food for thought. What follows is a mix of my own opinions and things said by some of the many participants/speakers at the event.

Source: Think Links

For the last two days, I was at the kickoff for the COMMIT/ program – a major computer science (i.e. ICT) research initiative in the Netherlands (110 million euros in funding). The entire program has 15 projects covering most of the major hot topics in computer science, everything from sensor networks to large-scale data management. It involves 76 partners, both academic and industrial. The event itself was attended by ~220 people.

I’m involved in COMMIT/ as part of the Data2Semantics project where we’re developing approaches that enable scientists to more easily publish, share and reuse data.

The projects of COMMIT/

The aim of the kickoff was (from my perspective) twofold:

  1. to connect these different projects;
  2. to encourage participants to think beyond traditional academic output.

With respect to 1, the leaders of COMMIT/ are trying to create a cohesive program and not just a bunch of separate projects that happened to be funded by the same source. This is going to be a hard task, but the kickoff was a good start. The event focused extensively on networking exercises, for example, developing demonstrator ideas with mixed groups of partners. In addition, they gave out t-shirts, which is always a good way to create cohesion :-) Indeed, the theme of the event was to try to position COMMIT/ as an unfolding story.

During one session of lightning talks, I drew the above picture trying to capture a quick visual summary of each project and the common themes across them: health, scale, storytelling.

With respect to 2, it’s clear that one of the main goals of the program is to have impact in the world beyond academia. This was shown by the emphasis on communicating research to the outside world. I attended a fantastic workshop on visual storytelling given by Onno van der Venn from Zeeno. In addition, there was emphasis placed on creating companies or helping develop products within the existing companies in the various projects. A number of support opportunities were discussed.

The event was well organized and it was great to be able to network under the COMMIT/ banner. The program is just beginning, so it will be interesting to see how the story progresses and whether these various already-big projects can indeed be brought together.

Personally, I’m hoping to see if we can come up with something that can use the kind of tech transfer instruments the program is encouraging…so back to work :-)

Filed under: academia Tagged: COMMIT-NL, data2semantics

Source: Think Links

Note: this post has been cross-posted at theicecream.org, the International Collaboration of Early Career Researchers.

This past Thursday, I had the opportunity to participate in a mini-symposium held by the VU University Amsterdam (where I work) around open data for science titled Open Data for Science: Will it hurt or help?

The symposium consisted of three 15-minute talks and then some lively discussion with an audience of, I think, ~60 people from the university. We were lucky to have Jos Engelen, the chairman of the NWO (the Dutch NSF), discuss the perspective of research policy makers. The main take-away I got from his presentation and the subsequent discussion is that open data (despite all reservations) is a worthy endeavor to pursue and something that research funders should (and will) encourage. Furthermore, just his presence means that policy makers are reaching out to see what the academic community thinks and that the community will have a say in how (open) data management policies will be rolled out in the Netherlands.

The most difficult talk to give was Eco de Geus’s, who was asked to reflect on the more negative aspects of open data. He raised important points about incentive structures (will I be scooped?), privacy, and the tendency towards one-size-fits-all open data policies. I think what made the reservations more poignant is that Prof. de Geus is not anti open data; indeed, he is deeply involved in a large open data project in his domain.

I talked about the view from a scientist starting out in their career. I told two stories:

  1. how open data really benefited a collaborator of mine in her study of interdisciplinary work practices. As a consumer of data, open data really removes a number of barriers.
  2. in an analogy to open code, I discussed how open source code I produced during my PhD led to more citations, a new collaboration, and others comparing their work to mine. However, these benefits were contrasted with the need to provide support and having to be comfortable exposing my work practices.

I ended by making the following points about open data:

  1. Open data is a boon to young scientists when they are acting as consumers of data.
  2. It’s a more difficult position for producers of data. There are trade-offs including concerns about credit, time for support, and time to prepare data.
  3. Given 2, if we want to help scientists as consumers of data, we need to give support to producers.
  4. Clear simple guidelines for data publication are critical. Scientists shouldn’t need to be lawyers to either produce or consume data sets.
  5. Credit where credit is due. For open data to succeed, we need data citation on par with traditional citation.

You’ll find the slides to my talk below, although they are mostly images so they may not make much sense on their own.

Overall, I thought the talks and discussion were excellent. It’s great to see this sort of discussion happening where I work. I hope it’s happening in many other institutions as well.

Filed under: academia Tagged: open data, science, symposium, vu university amsterdam
