News and Updates on the KRR Group

Source: Think Links

This year I had the opportunity to be program co-chair and help organize the 4th International Provenance and Annotation Workshop (IPAW). The event went great, really better than I imagined. First, I was fortunate to be organizing it with James Frew from the Bren School of Environmental Science and Management at the University of California, Santa Barbara. He not only helped coordinate the program but, along with his team, took care of all the local organization. It’s hard to beat sunny Santa Barbara as a location, but they also made sure that everything ran smoothly: great wifi, shuttles to and from the venue, tasty outdoor lunches looking over the ocean, an open-air poster session with wine and cheese, and a BBQ on the beach for the workshop dinner:

The IPAW workshop dinner. Photo from Andreas Schrieber.

So big kudos to Frew and his team. Beyond being well run, we obviously covered a lot in the two days of the main workshop. The workshop had 47 attendees and you can find the Twitter log here.

Research Highlights

I think the program really highlighted where provenance research is today and the directions forward. I won’t go through every paper but will just pick out three interesting trends.

1) Using Provenance to Address the Messiness of the Web

The Web provides us with a fantastic source of knowledge. But the problem is that this knowledge is completely unclean and unintegrated. Even efforts such as Linked Data, while giving us better data, are still messy and still under-integrated. Both researchers and firms have been trying to produce clean, integrated knowledge, but then they are faced with what Timothy Lebo in his paper termed the Integrator’s Dilemma. An integrator may produce a clean, well-structured data set, but in the process the resulting data set loses authority and its connection to domain expertise. To rectify this problem, provenance can be used to identify the authoritative source and connect back to domain expertise. Indeed, Jim McCusker and colleagues argued that provenance is the third step to Data Sanity.
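To make the Integrator’s Dilemma concrete, here is a minimal sketch (in Python with rdflib, using made-up URIs; this is not code from Lebo’s or McCusker’s papers) of how an integrated record can carry PROV statements that point back to the authoritative source and the domain expert behind it:

```python
# A minimal sketch of using PROV to keep an integrated record connected
# to its authoritative source. URIs and names are made up for illustration.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

integrated = EX["integrated-record-42"]   # the cleaned, integrated record
source = EX["authoritative-dataset"]      # the messy but authoritative original
expert = EX["domain-expert"]              # the person with the domain expertise

g.add((integrated, RDF.type, PROV.Entity))
g.add((source, RDF.type, PROV.Entity))
g.add((expert, RDF.type, PROV.Agent))

# The integration step: the clean record was derived from the source,
# and the source (not the integrator) carries the domain attribution.
g.add((integrated, PROV.wasDerivedFrom, source))
g.add((source, PROV.wasAttributedTo, expert))

print(g.serialize(format="turtle"))
```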

However, then we run into Tim’s 1st law of producing provenance:

For any provenance record, there exists a provenance consumer that will need it, but will not like how it is provided.

Tim suggests a service-based solution for providing provenance at the correct granularity. While I don’t know if that is the right solution, it’s clear that providing provenance at the right level of granularity is one foundation for building confidence in integrated web data sources.

Another example of trying different ways of using provenance to address messiness is the use of network analysis to understand provenance captured from crowdsourced applications.

2) Provenance and Credit are Intertwined

Science has always been a driver of research in provenance, and we saw a number of good pieces of work addressing domains ranging from climate analysis to archeology. However, as our keynote speaker Phil Bourne pointed out in his talk, scientists are not using provenance technologies in their work. He argued that this is for two reasons: 1) they are not given credit for all parts of the scientific process and 2) provenance infrastructure is still not easy enough to use. Phil argued that it is fundamental that more artifacts of the research lifecycle be given credit, to facilitate sharing and thus increase the pace of innovation, particularly in the life sciences. Thus, for scientists to capture their information in a sharable fashion, they need to be given credit for doing so. (Yes, he connected altmetrics to provenance – very cool from my point of view.) To do this, he argued, we need better support for provenance throughout the research lifecycle. However, while tools exist, they are far from being usable and integrated enough into everyday science practice. This is a real challenge to the provenance community. We need to do better at getting our approaches into scientists’ hands.

3) The Problem of Post-hoc

Much work in the provenance literature has asked how one captures provenance effectively in computational systems. But many times this is just not possible. The user may not have thought about installing a provenance capture system in the first place, or may not have chosen to write down their rationale for taking some action. This is an area that I’m actively researching, so it was great to see others starting to address the problem. Tom De Neiss attacked the problem of reconstructing provenance for a collection of newspaper articles using semantic similarity. An even more far-out idea presented at the workshop was to try to reconstruct the provenance of a decision made by a human using simulation. Both works highlight the need for dealing with incomplete or even non-existent provenance.
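As a rough illustration of what reconstruction can look like (a naive sketch of the general idea, not De Neiss’s actual method), one could compare documents with a simple similarity measure and propose a derivation link when two texts overlap strongly:

```python
# A naive sketch of provenance reconstruction: propose prov:wasDerivedFrom
# links between articles whose word overlap (Jaccard similarity) is high,
# pointing from the later article to the earlier one. Data is made up.
from itertools import combinations

articles = {  # id -> (publication order, text)
    "a1": (1, "Mayor announces new harbour development plan for the city"),
    "a2": (2, "The city mayor announces a new development plan for the harbour area"),
    "a3": (3, "Local football club wins the regional cup final"),
}

def jaccard(t1: str, t2: str) -> float:
    w1, w2 = set(t1.lower().split()), set(t2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

THRESHOLD = 0.4
for x, y in combinations(articles, 2):
    (ox, tx), (oy, ty) = articles[x], articles[y]
    if jaccard(tx, ty) >= THRESHOLD:
        later, earlier = (x, y) if ox > oy else (y, x)
        print(f"ex:{later} prov:wasDerivedFrom ex:{earlier} .")
```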

These were just some of the themes that I saw. Overall, the presentations were good and the audience was engaged. We had lots of hall time and I heard many intense discussions so I’m hoping that the event spurred more research. I know personally we will try to pursue a collaboration to build a provenance corpus to study this reconstruction problem.

A Provenance Week

IPAW has a tradition of being hosted as an independent event, which allows us not only to have the two-day workshop but also to organize collocated events. This IPAW was the same. The Data Observation Network for Earth organized a meeting on provenance and scientific workflows collocated with IPAW. Additionally, the W3C Provenance Working Group both gave a tutorial before the workshop and held their two-day face-to-face meeting afterwards. Here’s me presenting the core of the provenance data model to the 28 tutorial participants.

The Provenance Data Model. It’s easy! Photo prov:wasAttributedTo Andreas Schrieber
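For those who missed the tutorial, the core of the PROV data model boils down to entities, activities, and agents plus a handful of relations between them. Here is a minimal sketch in Python with rdflib; the URIs are placeholders, not material from the tutorial:

```python
# A tiny example of the core PROV data model: an entity generated by an
# activity that used another entity, attributed to an agent.
# All URIs are placeholders for illustration.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("prov", PROV)

chart, dataset, analysis, analyst = EX.chart, EX.dataset, EX.analysis, EX.analyst

g.add((chart, RDF.type, PROV.Entity))
g.add((dataset, RDF.type, PROV.Entity))
g.add((analysis, RDF.type, PROV.Activity))
g.add((analyst, RDF.type, PROV.Agent))

g.add((chart, PROV.wasGeneratedBy, analysis))    # the chart came out of the analysis
g.add((analysis, PROV.used, dataset))            # the analysis consumed the dataset
g.add((chart, PROV.wasAttributedTo, analyst))    # the chart is attributed to the analyst
g.add((analysis, PROV.wasAssociatedWith, analyst))

print(g.serialize(format="turtle"))
```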

Conclusion

IPAW 2012 was a lot of effort but it was worth it – fun discussion, beautiful weather and research insight. Again, the community voted to have another IPAW in 2014. The community is continuing to play to its strengths in workflows, databases and science applications while exploring novel areas. In the CFP for IPAW, we wrote that “2012 will be a watershed year for provenance/annotation research.” For me, IPAW confirmed that statement.

Filed under: academia, events Tagged: #ipaw2010, international provenance and annotation workshop, ipaw

New version of Hubble CDS prototype

Posted by data2semantics in collaboration | computer science | large scale | semantic web | vu university amsterdam

Source: Data2Semantics
Last week we’ve been working very hard to reshape the CDS prototype system for the ESWC demo this evening. This required quite a lot of work on the UI and underlying data (e.g. linking BioPortal annotations to LLD and AERS was not as straightforward as one would hope). New stuff: Publications cited in our example […]

Source: Semantic Web world for you
Reblogged from The World Wide Semantic Web: Poster of Downscale 2012. The schedule for the first workshop on DownScaling the Semantic Web is now ready and we also prepared a poster. Jump to the program page to see what we will be talking about on May 28. If you are attending ESWC, don’t forget to […]

Source: Think Links

Last week, I was at a seminar on Semantic Data Management at Dagstuhl. A month ago I was at Dagstuhl discussing the principles of provenance. You can read more about the atmosphere and style of a Dagstuhl event in the post on the provenance event. From my perspective, it’s pretty cool to get invited to multiple Dagstuhl events in short succession… I think it just happens that two of my main research areas overlap and were scheduled in the same time period.

Semantic Data Management Group photo

Obligatory Dagstuhl Group Photo

Indeed, one of the topics for discussion at the seminar was provenance. The others were scalability, dynamicity, and search. The organizers (Elena Simperl, Karl Aberer, Grigoris Antoniou, Oscar Corcho and Rudi Studer) will put together a report summarizing all the outcomes. What I want to do here is focus on the key points that I took away from the seminar.

Scaling semantic data management = scaling graph databases

There was some discussion around what it means to scale in terms of semantic data management. For the most part this boiled down to: what does it mean to scale RDF databases? The organizers did a good job of bringing in members of industry who have actual experience in building scalable RDF systems. The first day contained some great discussion about the guts of databases and what makes scaling hard – issues such as the latency of storage infrastructure and what the right join algorithm is. Steve Harris brought out the difficulty of backup and restore in real-world systems and the lack of research in that area. But my primary feeling was that the challenges of scalability are ones of how we deal with large graphs. In my own work on Open PHACTS, I’ve seen how using graphs has increased our flexibility but challenged us in terms of scalability.

Dealing with large graphs is hard but I think the Semantic Web community can lead the way here because we have a nice substrate, namely, an exchange model for graphs and a common query language.  This leads to the next point:

Benchmarks! Benchmarks! Benchmarks!

Throughout the week there was discussion of the need for all types of benchmarks. LUBM and BSBM have served us well, but we need better benchmarks: more and different types of queries, more realistic datasets, configurable benchmarks, etc. There were also discussions of other types of benchmarks, for example, a provenance corpus or a corpus that combines structured and unstructured data for ranking. One comment that I heard in terms of benchmarks is: where should you publish them? Unlike the IR community, we don’t have something like TREC. Although, I think USEWOD is a good example of bootstrapping this sort of activity.

Let’s just be uncertain

One of the cross-cutting themes of the symposium was the need to deal with uncertainty. From dealing with crawled data, to information extraction systems, to even data created by classic knowledge capture, there is a need to express and use uncertainty. In the area of provenance, I was impressed by Martin Theobald’s URDF system, which deals with both uncertain data and uncertain rules.

One major handicap RDF systems have is that reification lets you associate confidence values with statements, but it is just extremely verbose. At the symposium, Bryan Thompson and Orri Erling led the way in constructing a proposal to expose statement-level identifiers that are compatible with reification. Olaf Hartig even worked out an approach that makes this compatible with SPARQL semantics. I’m looking forward to seeing their final proposal. This will make it much easier to associate uncertainty and other evidence-related information with triples.
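To see why this is a handicap, here is a small sketch of what attaching a confidence value to a single triple looks like with standard reification (plain Turtle loaded with rdflib; the ex:confidence property is made up): one data triple ends up needing five bookkeeping triples.

```python
# Attaching a confidence value to one triple via standard RDF reification.
# The ex:confidence property is invented for illustration.
from rdflib import Graph

turtle = """
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/> .

# The triple we actually care about:
ex:aspirin ex:treats ex:headache .

# Five more triples just to say we are 80% confident in it:
ex:stmt1 a rdf:Statement ;
    rdf:subject   ex:aspirin ;
    rdf:predicate ex:treats ;
    rdf:object    ex:headache ;
    ex:confidence 0.8 .
"""

g = Graph()
g.parse(data=turtle, format="turtle")
print(len(g))  # 6 triples in total to express one uncertain statement
```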

One final thing to say is that these discussions made me glad that attributes are included in the PROV model. This provides an important hook for this kind of uncertainty information.

Crowdsourcing is a component

There was quite a lot of talk about integrating crowdsourcing into the data management stack (see Wolf-Tilo Balke’s work). It’s clear that crowdsourcing is an important component when we design semantic data management systems. Just as ontology engineers are boxes in many of our architectures, maybe the crowd should be there by default as well.

Provenance – get it out – we’re ready

Beyond being a discussant in the conversation, I also gave an intro to provenance research based on the categorization of content, management and use produced in the Provenance Incubator. Luc Moreau, Olaf Hartig and Paolo Missier gave a walkthrough of the PROV spec coming from the W3C. We had some interesting technical feedback, but the general impression I got was: it looks pretty good, get it out there, this is something we need and can use – now.

For example, I had discussions with Manuel Salvadores about using PROV as the ontology for describing provenance in BioPortal. Satya S. Sahoo (a working group member) is extending PROV for capturing provenance in sleep studies. There was discussion of connecting PROV with the Semantic Sensor Network ontology. As with other Semantic Web standards, PROV will provide the basis for both applications and future research. It’s now up to us as a working group to get these documents out.

Embracing other communities

I think the community as a whole has been doing a good job of embracing other communities. This has been shown by those working on RDF stores who have embraced the database community. Also, in semantic search there is a good conversation bridging the IR community and the database field. Interestingly, semantic search is really a driver of that conversation. I learned about a good survey paper by Thanh Tran and Peter Mika at Dagstuhl – highly recommended.

Federation is a spectrum

There was lots of talk about federation at the symposium. My general impression is that federation is not something we can simply say yes or no to. Instead, different applications will require different kinds of federation. I think there is lots of room to research how we can systematically place systems on the federation spectrum: given a set of requirements, where and how should I include federation in my data management scheme? For example, I may want to trade off computational overhead for space, as suggested by Olaf Hartig in his Link Traversal Based Query Execution approach (i.e. follow your nose). This caused some of the most entertaining discussions at the symposium. Should you need a data center to query the Web of Data? Let’s find out.
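For readers unfamiliar with the idea, here is a deliberately simplified sketch of the follow-your-nose approach: dereference URIs reachable from a seed, accumulate the retrieved triples locally, and query the result. It only illustrates the intuition; it is not Hartig’s actual engine, which interleaves traversal with query evaluation.

```python
# A very simplified "follow your nose" sketch using rdflib: start from a
# seed URI, dereference the URIs we encounter up to a fixed depth, and
# query the accumulated graph locally.
from rdflib import Graph, URIRef

def traverse(seed: str, depth: int = 1) -> Graph:
    g = Graph()
    frontier, seen = {seed}, set()
    for _ in range(depth + 1):
        next_frontier = set()
        for uri in frontier - seen:
            seen.add(uri)
            try:
                g.parse(uri)  # content negotiation fetches RDF if available
            except Exception:
                continue      # not every URI dereferences to RDF
            # follow URIs that appear as objects of triples about this resource
            for _, _, o in g.triples((URIRef(uri), None, None)):
                if isinstance(o, URIRef):
                    next_frontier.add(str(o))
        frontier = next_frontier
    return g

# Usage (assumes the seed URI dereferences to RDF):
# g = traverse("http://dbpedia.org/resource/Santa_Barbara,_California")
# for row in g.query("SELECT ?p ?o WHERE { ?s ?p ?o } LIMIT 10"):
#     print(row)
```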

Conclusion

I think the report coming from this symposium will provide a good document sketching out the research challenges in semantic data management for the next several years. I’m looking forward to it. I’ll end with a quote from a slide in José Manuel Gomez-Perez’s talk: according to the IDC 2011 Digital Universe study, metadata is the fastest-growing data category.

There’s demand for the work we are doing and there are many challenges remaining – this promises to be a fun couple of years.

Filed under: academia, linked data Tagged: dagstuhl, research challenges, semantic data managment

Source: Semantic Web world for you
The Institute of Development Studies (IDS) is a UK-based institute specialised in development research, teaching and communications. As part of their activities, they provide an API to query their knowledge services data set, comprising more than 32k abstracts or summaries of development research documents related to 8k development organisations, almost 30 themes and 225 countries and territories.

A month ago, Victor de Boer and I got a grant from IDS to investigate exposing their data as RDF and building some client applications making use of the enriched data. We aimed at using the API as it is and creating 5-star Linked Data by linking the created resources to other resources on the Web. The outcome is the IDSWrapper, which is now freely accessible, both as HTML and as RDF. Although this is still work in progress, the wrapper already shows some of the advantages of publishing the data as Linked Data.
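To give a flavour of what such a wrapper does, here is a minimal sketch of the general pattern: take a record as it might come back from a JSON API, mint URIs for it, and add an outgoing link to an external resource to reach the fifth star. The record, field names and URIs below are hypothetical; this is not the actual IDS API or IDSWrapper code.

```python
# A minimal sketch of wrapping an API record as Linked Data with rdflib.
# The record, field names and URIs are hypothetical.
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, DCTERMS, OWL

EX = Namespace("http://example.org/ids/")

record = {   # what one API result might look like
    "id": "A12345",
    "title": "Water governance in East Africa",
    "country": "Kenya",
}

g = Graph()
doc = EX["document/" + record["id"]]
country = EX["country/" + record["country"]]

g.add((doc, RDF.type, EX.Document))
g.add((doc, DCTERMS.title, Literal(record["title"])))
g.add((doc, DCTERMS.coverage, country))

# The linking step: connect our country resource to an external one.
g.add((country, OWL.sameAs, URIRef("http://dbpedia.org/resource/Kenya")))

print(g.serialize(format="turtle"))
```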

Source: Data2Semantics
Data2Semantics is part of the COMMIT/ research community. We attended the COMMIT kick-off meeting, where we presented our project, networked with the rest of the 15 projects, and learned about presenting our work to the broader community. Paul wrote up his thoughts on the kick-off, which you can find here. The whole team […]

On 27.03.2012, 04.00pm CET the LOD2 project (http://lod2.eu) will offer the next free one-hour webinar, on LIMES. LIMES is a tool providing time-efficient and lossless discovery of links across knowledge bases. It is an extensible declarative framework that encapsulates manifold algorithms dedicated to the processing of structured data of any sort. Built with extensibility and easy integration in mind, LIMES allows implementing applications that integrate, consume and/or generate Linked Data. Within LOD2, it will be used for discovering links between knowledge bases.
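To illustrate the task that LIMES addresses, link discovery boils down to comparing resources across two knowledge bases with a similarity measure and emitting links above a threshold. The naive sketch below is not LIMES itself (which avoids this brute-force pairwise comparison through its time-efficient algorithms), and the data is made up:

```python
# A naive illustration of link discovery: compare labels from two small
# knowledge bases and print owl:sameAs candidates above a threshold.
from difflib import SequenceMatcher

source = {"http://example.org/kb1/Berlin": "Berlin",
          "http://example.org/kb1/Paris": "Paris"}
target = {"http://example.org/kb2/City_of_Berlin": "City of Berlin",
          "http://example.org/kb2/Lyon": "Lyon"}

THRESHOLD = 0.6

for s_uri, s_label in source.items():
    for t_uri, t_label in target.items():
        score = SequenceMatcher(None, s_label.lower(), t_label.lower()).ratio()
        if score >= THRESHOLD:
            print(f"<{s_uri}> owl:sameAs <{t_uri}> .  # score={score:.2f}")
```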

This webinar will be presented by the LOD2 Partner: University of Leipzig (ULEI), Germany

The LOD2 webinar series is powered by the LOD2 project and is organised and produced by the Semantic Web Company (Austria). If you are interested in Linked (Open) Data principles and mechanisms, LOD tools & services, and concrete use cases that can be realised using LOD, then join us in the LOD2 webinar series!

When : 27.03. 2012, 04.00pm – 05.00pm CET
Information & Registration: https://www2.gotomeeting.com/register/369667514

The LOD2 team is looking forward to meeting you at the webinar!! All the best and have a nice day!


Source: Semantic Web world for you
On March 16, 2012 the European Public Sector Information Platform organised the ePSIplatform Conference 2012 on the theme “Taking re-use to the next level!”. It was a very well organised and interesting event, and also a good opportunity to meet new people and put faces to the names seen in emails and during teleconferences :-)

The program was intense: 3 plenary sessions, 12 break-out sessions and project presentations during the lunch break. That was a lot to talk about and a lot to listen to. I left Rotterdam with a number of take-away messages and food for thought. What follows is a mix of my own opinions and things said by some of the many participants/speakers at the event.