News and Updates on the KRR Group

Source: Think Links

A really great way to teach computer science without a computer – Computer Science Unplugged. Thanks to Lois Delcambre for the pointer.

Filed under: academia Tagged: computer science, csunplugged, teaching

Source: Think Links

Last week, I attended a workshop at Dagstuhl on the Principles of Provenance. Before talking about the content of the workshop itself, it’s worth describing the experience of Dagstuhl. The venue itself is a manor house located pretty much in the middle of nowhere in southwest Germany. The nature around the venue is lovely and it really is away from it all. All the attendees stay in the manor house, so you spend not only the scheduled workshop times with your colleagues but also breakfast, lunch, dinner and evenings. They also have small tricks to ensure that everyone mingles, for example by pseudo-randomly seating people at different tables for meals. Additionally, Dagstuhl is specifically for computer science – they have a good internet connection and one of the best computer science libraries I’ve seen. All these things together make Dagstuhl a unique, intellectually intense environment. It’s one of the nicest traditions in computer science.

Me at the Principles of Provenance workshop

With that context in mind, the organizers of the Principles of Provenance workshop (James Cheney, Wang-Chiew Tan, Bertram Ludaescher, Stijn Vansummeren) brought together computer scientists studying provenance from the perspectives of databases, the semantic web, scientific workflows, programming languages and software engineering. While I knew most of the people in this broad community (at least from paper titles), I met some new people and got to know others better. The organizers started the workshop with overviews of provenance work from four areas:

  1. Provenance in Database Systems
  2. Provenance in Workflows and Scientific Computation
  3. Provenance in Software Engineering, programming languages and security
  4. Provenance interchange on the web (i.e. the W3C standardization effort)

These tutorials were a great idea because they provided a common basis for communication throughout the week. The rest of the week combined quite a few talks with plenty of discussion. The organizers are putting together a report containing abstracts and presentations, so I won’t go into that more here. What I do want to do is pull out three take-aways I had from the week.

1) Connecting consensus models to formal foundations

Because provenance often spans multiple systems (my data is often sourced from somewhere else), there is a need for provenance systems to interoperate. There have been a number of efforts to enable this interoperability, including the creation of the Open Provenance Model as well as the current standardization effort at the W3C. Because these efforts try to bridge across multiple implementations, they are driven by community consensus: what models can we agree upon, what is minimally necessary for interchange, and what is easy to understand and implement.

Separately, there is quite a lot of work on the formal foundations of provenance, especially within the database community. This work is grounded in applications but also in formal theory that ensures that provenance information has nice properties. Concretely, one can show that certain types of provenance within a database context can be expressed as polynomials, algebraically manipulated, and related to one another (semirings!). Plus, “provenance polynomials” sounds nice. Check out T.J. Green’s thesis for starters:

Todd J. Green. Collaborative Data Sharing with Mappings and Provenance. PhD thesis, University of Pennsylvania, 2009.
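
To make the semiring idea concrete, here is a small sketch of my own (not taken from Green’s thesis or any particular system) that computes how-provenance polynomials for a simple join-and-project query in Python, using sympy symbols as the abstract tuple annotations:

```python
# A minimal sketch of how-provenance polynomials: each source tuple gets an
# abstract annotation; joins multiply annotations and alternative
# derivations add them (the two semiring operations).
from sympy import symbols, expand

# Hypothetical source relations R(a, b) and S(b, c), one variable per tuple.
R = {("a1", "b1"): symbols("r1"), ("a2", "b1"): symbols("r2")}
S = {("b1", "c1"): symbols("s1")}

# Provenance of Q(a, c) :- R(a, b), S(b, c)  (join on b, project out b).
Q = {}
for (a, b1), r_ann in R.items():
    for (b2, c), s_ann in S.items():
        if b1 == b2:
            key = (a, c)
            # join -> multiply; multiple derivations of the same tuple -> add
            Q[key] = Q.get(key, 0) + r_ann * s_ann

for tup, poly in Q.items():
    print(tup, expand(poly))  # e.g. ('a1', 'c1') r1*s1
```

Evaluating the same polynomial under different semirings (booleans for “is it derivable?”, counts for bag semantics, probabilities for trust) then answers different provenance questions about the same query result.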

During the workshop, it became clear to me that the consensus-based models (which are often graphical in nature) can not only be formalized but also be directly connected to these database-focused formalizations; I just needed to get over the differences in syntax. This suggests we could have a nice way to trace provenance across systems and through databases, and be able to understand the mathematical properties of this interconnection.

2) Social implications of producing provenance

For a couple of years now, people have asked me, and I have asked myself: so what do you do with provenance? I think there are a lot of good answers to that (e.g. requirements for provenance in e-science). However, the community has spent a lot of time thinking about how to capture provenance from a technical point of view, asking questions like: how do we instrument systems? How do we store provenance efficiently? Can we leverage execution environments for tracing?

At Dagstuhl, Carole Goble asked another question: why would people record and share provenance in the first place? There are big social implications that we need to grapple with: producing provenance may expose information that we are not ready to share, it may require us to change work practices (demanding effort that we may not want to give), or it may be in a form that is too raw to be useful. Developing techniques to address these issues is, from my point of view, a new and important area of work.

From my perspective, we are starting to work on ideas for reconstructing provenance from data, which will hopefully reduce the effort required of provenance producers.

3) Provenance is important for messy data integration

A key use case for provenance is tracking back to original data sources after data has been integrated. This is particularly important when the data integration requires complex processing (e.g. natural language processing). Christopher Ré gave a fantastic example of this with a demonstration of the WiscI system, part of the Hazy project. This application enriches Wikipedia pages with facts collected from a (~40 TB) web crawl and provides links back to a supporting source for those facts. It was a great example of how provenance is foundational to providing confidence in these systems.

Beyond these points, there was a lot more discussed, which will be summarized in the forthcoming report. This was a great workshop for me. I want to thank the organizers for putting it together; it’s a lot of effort. Additionally, thanks to all of the participants for really great conversations.

Filed under: academia Tagged: dagstuhl, provenance

Source: Semantic Web world for you
The VU is making short, one-minute videos to highlight some of the research being done within its walls. This is the video for SemanticXO, realised by Pepijn Borgwat and presented by Laurens Rietveld. The script is in Dutch; in English it begins: “I am Laurens Rietveld and I do research at the Vrije [...]”

TabLinker is experimental software for converting manually annotated Microsoft Excel workbooks to the RDF Data Cube vocabulary. It is used in the context of the Data2Semantics project to investigate the use of Linked Data for humanities research (Dutch census data produced by DANS).

TabLinker was designed to convert Excel or CSV files to RDF (triplification, RDF-izing) when the files have a complex layout and cannot be handled by fully automatic csv2rdf scripts.

A presentation about Linked Census Data, including TabLinker is available from SlideShare.

Please consult the Github page for the latest release information.

Using TabLinker

TabLinker takes annotated Excel files (found using the srcMask option in the config.ini file) and converts them to RDF. This RDF is serialized to the target folder specified using the targetFolder option in config.ini.
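
For illustration only (this is not TabLinker’s own code), the following sketch shows how the two config.ini options described above are typically wired together in Python: glob the srcMask pattern for annotated workbooks and write one RDF file per workbook into targetFolder. The section name “paths” and the default values are assumptions of this sketch.

```python
# Illustrative sketch -- not TabLinker's actual implementation.
import configparser
import glob
import os

config = configparser.ConfigParser()
config.read("config.ini")

# srcMask and targetFolder are the options mentioned in the text;
# the "paths" section name and fallbacks are assumptions.
src_mask = config.get("paths", "srcMask", fallback="input/*.xls")
target_folder = config.get("paths", "targetFolder", fallback="output/")

os.makedirs(target_folder, exist_ok=True)
for workbook in glob.glob(src_mask):
    base = os.path.splitext(os.path.basename(workbook))[0]
    out_file = os.path.join(target_folder, base + ".ttl")
    print(f"would convert {workbook} -> {out_file}")
```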

Annotations in the Excel file should be done using the built-in style functionality of Excel (you can specify these by hand). TabLinker currently recognises seven styles:

  • TabLink Title - The cell containing the title of a sheet
  • TabLink Data - A cell that contains data, e.g. a number for the population size
  • TabLink ColHeader - Used for the headers of columns
  • TabLink RowHeader - Used for row headers
  • TabLink HierarchicalRowHeader - Used for multi-column row headers with subsumption/taxonomic relations between the values of the columns
  • TabLink Property - Typically used for the header cells directly above RowHeader or HierarchicalRowHeader cells; the cell values are the properties that relate Data cells to RowHeader and HierarchicalRowHeader cells (see the sketch below this list).
  • TabLink Label - Used for cells that contain a label for one of the HierarchicalRowHeader cells.

An eighth style, TabLink Metadata, is currently ignored (see #3).
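
As a rough, purely illustrative sketch (not TabLinker’s actual output), a Data cell might end up as an RDF Data Cube observation whose predicate comes from a Property cell and whose dimension values come from the row and column headers. All the names and namespaces below are hypothetical; only the RDF Data Cube vocabulary URI is real.

```python
# Illustrative only: a hand-built observation showing the general shape
# that the style roles above suggest, using rdflib.
from rdflib import Graph, Literal, Namespace, RDF

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.org/census/")  # hypothetical namespace

g = Graph()
g.bind("qb", QB)
g.bind("ex", EX)

obs = EX["obs/sheet1_B4"]                      # one Data cell
g.add((obs, RDF.type, QB.Observation))
g.add((obs, EX.population, Literal(1234)))     # Property cell -> predicate
g.add((obs, EX.municipality, EX.Amsterdam))    # RowHeader -> dimension value
g.add((obs, EX.year, Literal("1889")))         # ColHeader -> dimension value

print(g.serialize(format="turtle"))
```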

An example of such an annotated Excel file is provided in the input directory. There are ways to import the styles defined in that file into your own Excel files.

Tip: If your table contains totals for HierarchicalRowHeader cell values, use a non-TabLink style to mark the cells between the level to which the total belongs, and the cell that contains the name of the total. Have a look at the example annotated Excel file to see how this is done (up to row 428).

Once you’re all set, start TabLinker by cd-ing to the src folder and running:

python tablinker.py

Requirements

TabLinker was developed under the following environment:

Source: Think Links

Update: A version of this post appeared in SURF magazine (on the back page) in their trendwatching column.

Technology at its best lets us do what we want to do without being held back by time-consuming or complex processes. We see this in great consumer technology: your phone giving you directions to the nearest cafe, your calendar reminding you of a friend’s birthday, or a website telling you what films are on. Good technology removes friction.

While attending the SURF Research Day, I was reminded that this idea of removing friction through technology shouldn’t be limited to consumer or business environments but should also be applied in academic research settings. The day showcased a variety of developments in information technology that help researchers do better research. Because SURF is a Dutch organization, there was a particular focus on developments here in the Netherlands.

The day began with a fantastic keynote from Cameron Neylon outlining how networks qualitatively change how research can be communicated. A key point was that to create the best networks we need to make research communication as frictionless as possible. You can find his longer argument here. After Cameron’s talk, Jos Engelen, the chairman of the NWO (the Dutch NSF), gave some remarks. For me, the key take-away was that in every one of the Dutch Government’s 9 Priority Sectors, technology has a central role in smoothing both the research process and its transition to practice.

After the opening session, there were four parallel sessions on text analysis, dealing with data, profiling research, and technology for research education. I managed to attend parts of three of the sessions. In the profiling session, the recently released SURF report on tracking the impact of scholarly publications in the 21st century sparked my interest. Finding new, faster and broader ways of measuring impact (i.e. altmetrics) is a way of reducing friction in science communication. The ESCAPE project showed how enriched publications can make it easy to collate and browse related content around traditional articles; the project won SURF’s enriched publication of the year award. Again, the key was simplifying the research process. Beyond these presentations, there were talks ranging from making it easier to do novel chemistry to helping religious scholars understand groups through online forms. In each case, the technology was successful because it eliminated friction in the research process.

The SURF research day presented not just technology but how, when it’s done right, technology can make research just a bit smoother.

Filed under: academia, altmetrics Tagged: events, ozdag, surffounation

Source: Think Links

I had a nice opportunity to start out this year with a visit to the Information Sciences Institute (ISI) in Southern California’s beautiful Marina del Rey. I did my postdoc with Yolanda Gil at ISI and we have continued an active collaboration, most recently working on using workflows for exposing networks from linked data.

I always get a jolt of information visiting ISI. Here are five pointers to things I learned this time:

1. The Karma system [github] is really leading the way on bringing data integration techniques to linked data. I’ll definitely be looking at Karma with respect to our development of the Open PHACTS platform.

2. I’m excited about change detection algorithms, in particular edit-distance-related measures, for figuring out how to generate rich provenance information in the Data2Semantics project. These are pretty well-studied algorithms but I think we should be able to apply them differently (a minimal edit-distance sketch follows this list). A good place to start is the paper:

3. Also with respect to provenance, after talking with Greg Ver Steeg, I think Granger Causality and some of the other associated statistical models are worth looking at. Some pointers:

4. Tran Thanh gave a nice overview of his work on Semantic Search. I liked how he combined and extended the information retrieval and database communities’ work using Semantic Web techniques. Keyword: Steiner Trees

5. MadSciNetwork is a site where scientists answer questions from the public. This has been around since 1995. They have collected over 40,000 answered science questions. This corpus of questions is available at MadSci Network Research. Very cool.
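
Coming back to point 2, here is a minimal sketch of the basic edit-distance computation (plain Levenshtein distance; my own illustration, not Data2Semantics code) that such change-detection measures build on: given two versions of a value, the distance and the underlying alignment hint at what operations produced the change.

```python
# A minimal Levenshtein distance via dynamic programming.
def edit_distance(old: str, new: str) -> int:
    m, n = len(old), len(new)
    # dp[i][j] = cost of turning old[:i] into new[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete everything
    for j in range(n + 1):
        dp[0][j] = j          # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if old[i - 1] == new[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

# e.g. a small change between two versions of a data value
print(edit_distance("Amsterdam 2011", "Amsterdam 2012"))  # -> 1
```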

Finally… it’s nice to visit Southern California in January when you live in cold Amsterdam :-)

Filed under: academia, linked data, provenance markup Tagged: pointers
