News and Updates on the KRR Group

Author Archives: paulgroth

Source: Think Links

The WordPress.com stats helper monkeys prepared a 2012 annual report for this blog.

Here’s an excerpt:

600 people reached the top of Mt. Everest in 2012. This blog got about 4,900 views in 2012. If every person who reached the top of Mt. Everest viewed this blog, it would have taken 8 years to get that many views.

Click here to see the complete report.

Filed under: Uncategorized

Source: Think Links

From November 1–3, 2012, I attended the PLOS Article Level Metrics Workshop in San Francisco.

PLOS is a major open-access online publisher, and the publisher of the leading megajournal PLOS ONE. A megajournal is one that accepts any scientifically sound manuscript: there is no decision on novelty, just a decision on whether the work was done in a scientifically sound way. The consequence is that much more science gets published, which creates a corresponding need for even better filters and search systems for science.
As an online publisher, PLOS tracks what are termed article-level metrics. These metrics go beyond traditional scientific citations and include things like page views, PDF downloads, mentions on Twitter, etc. Article-level metrics are, to my mind, altmetrics aggregated at the article level.
PLOS provides a comprehensive API to obtain these metrics and wants to encourage their broader adoption and usage. Thus, they organized this workshop. There was a variety of people attending (https://sites.google.com/site/altmetricsworkshop/attendees/attendee-bios), from publishers (including open-access ones and the traditional big ones) to funders, librarians, and technologists. I was a bit disappointed not to see more social scientists there, but I think the push here has been primarily from the represented communities. The goal was to outline key challenges for altmetrics and then corresponding concrete actions that could take place in the next 6 months to help address these challenges. It was an unconference, so no presentations and lots of discussion. I found it quite intense, as we often broke up into small groups where one had to be fully engaged. The organizers are putting together a report that digests the work that was done. I'm excited to see the results.
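
For those who want to play with the numbers themselves, here's a minimal sketch of calling the ALM API from Python. The endpoint version, parameter names, and response handling are my assumptions from memory, so check the PLOS ALM documentation before relying on them:

```python
# Minimal sketch of pulling article-level metrics from the PLOS ALM API.
# The endpoint, API version, and parameter names are assumptions; verify
# against the PLOS ALM documentation.
import requests

ALM_ENDPOINT = "http://alm.plos.org/api/v3/articles"  # assumed endpoint
API_KEY = "YOUR_API_KEY"  # PLOS hands these out on request

def article_metrics(doi):
    """Fetch the metrics summary for one article, identified by DOI."""
    params = {"ids": doi, "api_key": API_KEY, "info": "summary"}
    response = requests.get(ALM_ENDPOINT, params=params)
    response.raise_for_status()
    return response.json()

# Placeholder DOI; substitute any PLOS article.
print(article_metrics("10.1371/journal.pone.0000000"))
```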

Me actively contributing :-) Thanks Ian Mulvany!

Highlights

  • Launch of the PLOS Altmetrics Collection. This was really exciting for me, as I was one of the organizers of getting this collection produced. Our editorial is here. This collection provides a nice home for future articles on altmetrics.
  • I was impressed by the availability of APIs. There are now several aggregators and good sources of altmetrics after just a bit of time: ImpactStory, altmetric.com, the PLOS ALM APIs, Mendeley, figshare.com, Microsoft Academic Search.
  • rOpenSci (http://ropensci.org) is a cool project that provides R APIs to many of these altmetrics and other sources for analyzing data.
  • There’s quite a bit of interest in services around these metrics. For example, Plum Analytics (http://www.plumanalytics.com) has a pilot underway at the University of Pittsburgh. I also talked to other people who were seeing interest in using these alternative impact measures, and heard that a number of companies are now providing this sort of analytics service.
  • I talked a lot to Mark Hahnel from Figshare.com about the Data2Semantics LinkItUp service. He is super excited about it and loved the demo. I’m really excited about this collaboration.
  • Microsoft Academic Search is getting better, they are really turning it into a production product with better and more comprehensive data. I’m expecting a really solid service in the next couple of months.
  • I learned from Ian Mulvany of eLife that Graph theory is mathematically “the same as” statistical mechanics in physics.
  • Context, Context, Context – there was a ton of discussion about the importance of context to the numbers one gets from altmetrics: for example, being able to quickly compare to some baseline, or knowing the population to which the number applies (see the sketch after this list).

    Whiteboard thoughts on context! Thanks Ian Mulvany

  • Related to context was the need for simple semantics. There was a notion that, for example, we need to know whether a retweet on Twitter was positive or negative, and what kind of person retweeted the paper (i.e. a scientist, a member of the public, a journalist, etc.). This is because, unlike citations, the population that altmetrics draws on is not as clearly defined: it lives in a communication medium that doesn’t just contain scholarly communication.
  • I had a nice discussion with Elizabeth Iorns, the founder of https://www.scienceexchange.com. They’re doing cool stuff around building markets for performing and replicating experiments.
  • Independent of the conference, I met up with some people I know from the natural language processing community, and one of the things they were excited about is computational semantics using statistical approaches. It seems like this is very hot in that community and something we in the knowledge representation & reasoning community should pay attention to.
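
Returning to the context bullet above: the simplest form of context is a percentile against some comparison population, e.g. articles from the same journal and year. A toy sketch with made-up numbers:

```python
# Toy sketch of giving an altmetrics number some context: compare an
# article's tweet count against a baseline population. The data is made up.
def percentile_rank(value, population):
    """Fraction (as a percentage) of the population at or below the value."""
    if not population:
        raise ValueError("population must be non-empty")
    below = sum(1 for v in population if v <= value)
    return 100.0 * below / len(population)

# Hypothetical tweet counts for a cohort of comparable articles.
cohort_tweets = [0, 0, 1, 2, 2, 3, 5, 8, 13, 40]
print(percentile_rank(5, cohort_tweets))  # -> 70.0
```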

Hackathon

Associated with the workshop was a hackathon held at the PLOS offices. I worked in a group that built a quick demo called rerank.it. This was a bookmarklet that would highlight papers in PubMed search results based on their online impact according to ImpactStory. So you would get color-coded results based on altmetrics scores. This only took a day’s worth of work and really showed me how far these APIs have come in allowing applications to be built. It was a fun environment, and I was really impressed with the other work that came out.
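
The bookmarklet itself was JavaScript, but the core idea fits in a few lines. Here's a Python sketch of the lookup-and-color step; the endpoint and response fields below are hypothetical placeholders, not the real ImpactStory API:

```python
# Sketch of the idea behind rerank.it: color-code PubMed results by
# altmetrics score. The endpoint and response shape are hypothetical
# placeholders, not the real ImpactStory API.
import requests

METRICS_URL = "https://api.example.org/items"  # hypothetical endpoint

def score_to_color(score):
    """Bucket a 0-100 impact score into a highlight color."""
    if score >= 75:
        return "green"
    if score >= 40:
        return "yellow"
    return "gray"

def highlight_colors(pubmed_ids):
    """Map each PubMed ID to a highlight color based on its score."""
    colors = {}
    for pmid in pubmed_ids:
        resp = requests.get(METRICS_URL, params={"pmid": pmid})
        score = resp.json().get("score", 0)  # hypothetical field
        colors[pmid] = score_to_color(score)
    return colors
```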

Random thoughts on San Francisco

  • Four Barrel Coffee serves really, really nice coffee – but get there early before the influx of ultra-cool locals
  • The guys at Goody Cafe are really nice and also serve good coffee
  • If you’re in the touristy Fisherman’s Wharf area, walk to Fort Mason for fantastic views of the Golden Gate Bridge. The hostel there also looks cool.

Filed under: altmetrics, interdisciplinary research Tagged: #altmetrics, plos, trip report

Source: Think Links

For International Open Access Week, the VU University Amsterdam is working with other Dutch universities to provide information to academics on open access. Here’s a video they took of me talking about the relationship between open access and social media. The two go hand-in-hand!



Filed under: academia, altmetrics Tagged: oa, oaweek, open access, social media

Source: Think Links

For the last 1.5 years, I’ve been working on the Open PHACTS project – a public-private partnership for pharmacology data integration. Below is a good video from Lee Harland, the CTO, giving an overview of the project from the tech/pharma perspective.



Filed under: academia

Source: Think Links

I just received 50 copies of this in the mail today:

Literature is so much better in its electronic form, but it’s still fun to get a physical copy. Most importantly, this proceedings represents scientific content and a scientific community that I’m proud to be part of. You can obviously access the full proceedings online. Preprints are also available from most of the authors’ sites. You can also read my summary of the 4th International Provenance and Annotation Workshop (IPAW 2012).

Filed under: academia, provenance Tagged: ipaw, lecture notes in computer science, lncs, provenance

Source: Think Links

If you read this blog a bit, you’ll know I’m a fairly big fan of RDF as a data format. It’s really great for easily mashing different data sources together. The common syntax gets rid of a lot of headaches before you can start querying the data, and you get nice things like simple reasoning to boot.
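
To illustrate the kind of mashing I mean, here's a minimal sketch using rdflib; the URLs are placeholders for any two RDF documents that share identifiers:

```python
# A small sketch of why RDF makes combining data sources easy: parse two
# sources into one graph and query across them with SPARQL. The source
# URLs are placeholders; any RDF documents would do.
from rdflib import Graph

g = Graph()
g.parse("http://example.org/people.ttl", format="turtle")  # placeholder source
g.parse("http://example.org/projects.rdf", format="xml")   # placeholder source

# One query spans both sources because they share a common data model.
results = g.query("""
    SELECT ?name ?project WHERE {
        ?person <http://xmlns.com/foaf/0.1/name> ?name .
        ?person <http://example.org/worksOn> ?project .
    }
""")
for name, project in results:
    print(name, project)
```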

One thing that I’ve been looking for is a nice way to store a ton of RDF data, which is:

  1. easy to deploy;
  2. easy to query;
  3. works well with scalable analysis & reasoning techniques (e.g. stuff built using MapReduce/Hadoop);
  4. oh and obviously scalable.

This past spring I was at a Dagstuhl workshop where I had the chance to briefly talk to Chris Ré about the data storage environment used by Hazy – one of the leading projects in the world on large-scale statistical inference. At the time, he was fairly enthusiastic about using HBase as a storage layer.

Based on that suggestion, I played around with deploying HBase myself on Amazon. Using Whirr, it was pretty straightforward to deploy a pretty nice environment in a matter of hours. In addition, HBase has the nice side effect that it uses the same file system as Hadoop (HDFS), so you can run Hadoop jobs over the data that’s stored in the database.
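
For the curious, a Whirr recipe for such a cluster looks roughly like the following; I'm reconstructing the role names and properties from memory of the Whirr docs, so verify them before use:

```properties
# hbase.properties - assumed Whirr recipe, verify against the Whirr docs
whirr.cluster-name=hbase-test
whirr.instance-templates=1 zookeeper+hadoop-namenode+hadoop-jobtracker+hbase-master,5 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
```

Then whirr launch-cluster --config hbase.properties brings the cluster up.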

With that, I wanted to see (a) what was a good way to store a bunch of RDF in HBase and (b) whether the retrieval of RDF was performant. Sever Fundatureanu worked on this as his master’s thesis.

One of the novel things he looked at was using coprocessors (built-in user-defined functions in HBase) to try to improve the building of indexes for RDF within the database. That is, instead of running multiple Hadoop load jobs, you run roughly one and then let the coprocessors in each worker node build the remaining indexes you want in order to improve retrieval. While it didn’t improve performance, I thought the idea was cool. I’m still interested in how much user-side processing you can shove into the worker nodes within HBase. Below you’ll find the abstract and a link to his full thesis.

I’m still keen on using HBase as the basis for the analysis and reasoning over RDF data. We’re continuing to look into this area. If you have some cool ideas, let us know.

A Scalable RDF Store Based on HBase –

Sever Fundatureanu

The exponential growth of the Semantic Web leads to a need for a scalable storage solution for RDF data. In this project, we design a quad store based on HBase, a NoSQL database which has proven to scale out to thousands of nodes. We adopt an Id-based schema and argue why it enables a good trade-off between loading and retrieval performance. We devise a novel bulk loading technique based on HBase coprocessors and we compare it to a traditional Map-Reduce technique. The evaluation shows that our technique does not scale as well as the traditional approach. Instead, with Map-Reduce, we achieve a loading throughput of 32152 quads/second on a cluster of 13 nodes. For retrieval, we obtain a peak throughput of 56447 quads/second.
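
As a toy illustration of the multi-index idea (my simplification, not the schema from the thesis): write each quad under several orderings so that different lookups become row-key prefix scans. This sketch uses the happybase Thrift client and assumes the index tables already exist:

```python
# Toy sketch of a multi-index quad layout in HBase: each quad is written
# under several orderings so different access patterns hit a row-key
# prefix. Assumes tables rdf_spoc, rdf_pocs, rdf_ocsp were created
# beforehand (e.g. via connection.create_table), and an HBase Thrift
# server is running locally.
import happybase

INDEX_ORDERS = ["spoc", "pocs", "ocsp"]  # s=subject, p=predicate, o=object, c=context

def put_quad(connection, s, p, o, c):
    """Write one quad into every index table under its reordered key."""
    term = {"s": s, "p": p, "o": o, "c": c}
    for order in INDEX_ORDERS:
        row_key = "|".join(term[t] for t in order)
        connection.table("rdf_" + order).put(row_key.encode(), {b"q:v": b""})

connection = happybase.Connection("localhost")
put_quad(connection, "ex:paul", "ex:worksAt", "ex:VU", "ex:g1")

# A subject-bound lookup is now a prefix scan on the spoc index:
for key, _ in connection.table("rdf_spoc").scan(row_prefix=b"ex:paul|"):
    print(key)
```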


Filed under: linked data Tagged: coprocessors, hbase, rdf

Source: Think Links

Today, I was teaching the second class of our Semantic Web course here at the VU University Amsterdam, on RDF and RDFS. After the first half of the class in a very warm lecture room, the students were fading. After a quick poll, we decided to take the course outside. So I had the fun challenge of teaching RDF Schema off the cuff, without a chalkboard or slides… and I think it actually worked. The students did a great job of participating, and we managed to demonstrate a bit of rule-based reasoning using a combination of coloured paper, students, and moving about. Here’s a photo of the class as we ended:
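
For a taste of the kind of rule-based reasoning we acted out, here is a minimal rdflib sketch of the RDFS subclass rule (rdfs9), with made-up data; the in-class exercise wasn't tied to this exact rule:

```python
# A minimal sketch of one RDFS inference rule (rdfs9, subclass
# inference) applied by hand with rdflib. The example data is made up.
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Student, RDFS.subClassOf, EX.Person))
g.add((EX.anna, RDF.type, EX.Student))

# rdfs9: if ?x rdf:type ?C and ?C rdfs:subClassOf ?D, then ?x rdf:type ?D.
# One pass suffices here; a full closure would iterate to a fixpoint.
inferred = [
    (x, RDF.type, d)
    for x, _, c in g.triples((None, RDF.type, None))
    for _, _, d in g.triples((c, RDFS.subClassOf, None))
]
for triple in inferred:
    g.add(triple)

print((EX.anna, RDF.type, EX.Person) in g)  # True
```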

Filed under: academia Tagged: lecture, outdoors, vrije universiteit amsterdam, vu university amsterdam

Source: Think Links

This year I had the opportunity to be program co-chair and help organize the 4th International Provenance and Annotation Workshop (IPAW). The event went great, really, better than I imagined. First, I was fortunate to be organizing it with James Frew from the Bren School of Environmental Science and Management at the University of California, Santa Barbara. He not only helped coordinate the program but, along with his team, took care of all the local organization. It’s hard to beat sunny Santa Barbara as a location, but they also made sure that everything ran smoothly: great wifi, shuttles to and from the location, tasty outdoor lunches looking over the ocean, an open-air poster session with wine and cheese, and a BBQ on the beach for the workshop dinner:

The IPAW workshop dinner. Photo from Andreas Schreiber.

So big kudos to Frew and his team. Of course, beyond being well run, we also covered a lot in the two days of the main workshop. The workshop had 47 attendees, and you can find the Twitter log here.

Research Highlights

I think the program really highlighted where we are in provenance research today and the directions forward. I won’t go through every paper but will just try to pick out three interesting trends.

1) Using Provenance to Address the Messiness of the Web

The Web provides us a fantastic source of knowledge. But the problem is that knowledge is completely unclean and unintegrated. Even efforts such as Linked Data, while giving us better data, are still messy and still under-integrated. Both researchers and firms have been trying to make clean, integrated knowledge, but then they are faced with what Timothy Lebo in his paper termed the Integrator’s Dilemma. An integrator may produce a clean, well-structured data set, but in the process the resulting data set loses authority and a connection to domain expertise. To rectify this problem, provenance can be used to identify the authoritative source and connect back to domain expertise. Indeed, Jim McCusker and colleagues argued that provenance is the third step to Data Sanity.

However, then we run into Tim’s 1st law of producing provenance:

For any provenance record, there exists a provenance consumer that will need it, but will not like how it is provided.

Tim suggests a service-based solution to provide provenance at the correct granularity. While I don’t know if that is the right solution, it’s clear that providing provenance at the right level of granularity is one foundation for building confidence in integrated web data sources.

Another example of using provenance to address messiness is the use of network analysis to understand provenance captured from crowd-sourced applications.

2) Provenance and Credit are Intertwined

Science has always been a driver of research in provenance, and we saw a number of good pieces of work addressing domains ranging from climate analysis to archeology. However, as our keynote speaker Phil Bourne pointed out in his talk, scientists are not using provenance technologies in their work. He argued that this is for two reasons: 1) they are not given credit for all parts of the scientific process, and 2) provenance infrastructure is still not easy enough to use. Phil argued that it is fundamental that more artifacts of the research lifecycle be given credit, to facilitate sharing and thus increase the pace of innovation, particularly in the life sciences. Thus, for scientists to capture their information in a sharable fashion, they need to be given credit for doing so. (Yes, he connected altmetrics to provenance – very cool from my point of view.) To do this, he argued, we need better support for provenance throughout the research lifecycle. However, while tools exist, they are far from being usable and integrated enough into everyday science practice. This is a real challenge to the provenance community. We need to do better at getting our approaches into scientists’ hands.

3) The Problem of Post-hoc

Much work in the provenance literature has asked how one captures provenance effectively in computational systems. But many times this is just not possible. The user may not have thought to install a capture system in the first place, or may not have chosen to write down their rationale for taking some action. This is an area I’m actively researching, so it was great to see others starting to address the problem. Tom De Nies attacked the problem of reconstructing provenance for a collection of newspaper articles using semantic similarity. An even more far-out idea presented at the workshop was to try to reconstruct the provenance of decisions made by a human using simulation. Both works highlight the need to deal with incomplete or even non-existent provenance.

These were just some of the themes that I saw. Overall, the presentations were good and the audience was engaged. We had lots of hall time, and I heard many intense discussions, so I’m hoping the event spurred more research. I know that personally we will try to pursue a collaboration to build a provenance corpus to study this reconstruction problem.

A Provenance Week

IPAW has a tradition of being hosted as an independent event, which allows us not only to have the two-day workshop but also to organize collocated events. This IPAW was the same. The Data Observation Network for Earth organized a meeting on provenance and scientific workflows collocated with IPAW. Additionally, the W3C Provenance Working Group both gave a tutorial before the workshop and held their two-day face-to-face meeting afterwards. Here’s me presenting the core of the provenance data model to the 28 tutorial participants.

The Provenance Data Model. It’s easy! Photo prov:wasAttributedTo Andreas Schreiber

Conclusion

IPAW 2012 was a lot of effort, but it was worth it – fun discussion, beautiful weather, and research insight. Again, the community voted to have another IPAW in 2014. The community is continuing to play to its strengths in workflows, databases, and science applications while exploring novel areas. In the CFP for IPAW, we wrote that “2012 will be a watershed year for provenance/annotation research.” For me, IPAW confirmed that statement.

Filed under: academia, events Tagged: #ipaw2010, international provenance and annotation workshop, ipaw

Source: Think Links

Last week, I was at a seminar on Semantic Data Management at Dagstuhl. A month ago, I was at Dagstuhl discussing the principles of provenance. You can read more about the atmosphere and style of a Dagstuhl event in my post on the provenance event. From my perspective, it’s pretty cool to get invited to multiple Dagstuhl events in short succession… I think it just happens that two of my main research areas overlap and were scheduled in the same time period.


Obligatory Dagstuhl Group Photo

Indeed, one of the topics for discussion at the seminar was provenance. The others were scalability, dynamicity, and search. The organizers (Elena Simperl, Karl Aberer, Grigoris Antoniou, Oscar Corcho, and Rudi Studer) will put together a report summarizing all the outcomes. What I want to do here is focus on the key points that I took away from the seminar.

Scaling semantic data management = scaling graph databases

There was some discussion around what it means to scale in terms of semantic data management. For the most part, this boiled down to: what does it mean to scale RDF databases? The organizers did a good job of bringing in members of industry who have actual experience building scalable RDF systems. The first day contained some great discussion about the guts of databases and what makes scaling hard – issues such as the latency of storage infrastructure and what the right join algorithm is. Steve Harris brought up the difficulty of backup and restore in real-world systems and the lack of research in that area. But my primary feeling was that the challenges of scalability are ones of how we deal with large graphs. In my own work in Open PHACTS, I’ve seen how using graphs has increased our flexibility but challenged us in terms of scalability.

Dealing with large graphs is hard, but I think the Semantic Web community can lead the way here because we have a nice substrate: namely, an exchange model for graphs and a common query language. This leads to the next point:

Benchmarks! Benchmarks! Benchmarks!

Throughout the week there was discussion of the need for all types of benchmarks. LUBM and BSBM have served us well, but we need better benchmarks: more and different types of queries, more realistic datasets, configurable benchmarks, etc. There were also discussions of other types of benchmarks, for example a provenance corpus or a corpus that combines structured and unstructured data for ranking. One comment I heard about benchmarks: where should you publish them? Unlike the IR community, we don’t have something like TREC. Although, I think USEWOD is a good example of bootstrapping this sort of activity.

Let’s just be uncertain

One of the cross-cutting themes of the seminar was the need to deal with uncertainty. From crawled data, to information extraction systems, to even data created by classic knowledge capture, there is a need to express and use uncertainty. In the area of provenance, I was impressed by Martin Theobald’s URDF system, which deals with both uncertain data and uncertain rules.

One major handicap RDF systems have is that reification lets you associate confidence values with statements, but it is extremely verbose. At the symposium, Bryan Thompson and Orri Erling led the way in constructing a proposal to expose statement-level identifiers that are compatible with reification. Olaf Hartig even worked out an approach that makes this compatible with SPARQL semantics. I’m looking forward to seeing their final proposal. This will make it much easier to associate uncertainty and other evidence-related information with triples.
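
To make the verbosity concrete, here is a small rdflib sketch (with made-up data) of attaching a confidence value to a single triple via standard reification:

```python
# Sketch of why reification is verbose: attaching a confidence value to
# one statement costs four extra bookkeeping triples. Data is made up.
from rdflib import Graph, Literal, Namespace, BNode, RDF

EX = Namespace("http://example.org/")
g = Graph()

# The statement itself: one triple.
g.add((EX.aspirin, EX.treats, EX.headache))

# Standard reification: four more triples just to name the statement,
# plus one for the confidence annotation.
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.aspirin))
g.add((stmt, RDF.predicate, EX.treats))
g.add((stmt, RDF.object, EX.headache))
g.add((stmt, EX.confidence, Literal(0.9)))

print(len(g))  # 6 triples to annotate 1 statement
```

A statement-level identifier would let the confidence annotation attach directly to the triple, collapsing the four bookkeeping triples.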

One final thing to say is that these discussions made me glad that attributes are included in the PROV model. This provides an important hook for this kind of uncertainty information.

Crowdsourcing is a component

There was quite a lot of talk about integrating crowdsourcing into the data management stack (see Wolf-Tilo Balke’s work). It’s clear that when we design semantic data management systems, crowdsourcing is an important component. Just as ontology engineers are boxes in many of our architectures, maybe the crowd should be there by default as well.

Provenance – get it out – we’re ready

Beyond being a discussant in the conversation, I also gave an intro to provenance research based on the categorization of content, management, and use produced by the Provenance Incubator. Luc Moreau, Olaf Hartig, and Paolo Missier gave a walkthrough of the PROV spec coming from the W3C. We had some interesting technical feedback, but the general impression I got was: it looks pretty good, get it out there, this is something we need and can use – now.

For example, I had discussions with Manuel Salvadores about using PROV as the ontology for describing provenance in BioPortal. Satya S. Sahoo (a working group member) is extending PROV for capturing provenance in sleep studies. There was discussion of connecting PROV with the Semantic Sensor Network ontology. As with other Semantic Web standards, PROV will provide the basis for both applications and future research. It’s now up to us as a working group to get these documents out.

Embracing other communities

I think the community as a whole has been doing a good job of embracing other communities. This has been shown by those working on RDF stores who have embraced the database community. Also, in semantic search there is a good conversation bridging the IR community and the database field. Interestingly, semantic search is really a driver of that conversation. I learned about a good survey paper by Thanh Tran and Peter Mika at Dagstuhl – highly recommended.

Federation is a spectrum

There was lots of talk about federation at the symposium. My general impression is that federation is not something we can simply say yes or no to. Instead, different applications will require different kinds of federation. I think there is lots of room to research how we can systematically place systems on the federation spectrum: given a series of requirements, where and how should I include federation in my data management scheme? For example, I may want to trade off computational overhead for space, as suggested by Olaf Hartig in his Link Traversal Based Query Execution approach (i.e. follow your nose). This caused some of the most entertaining discussions at the symposium. Should you need a data center to query the Web of Data? Let’s find out.
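
To give a flavor of follow-your-nose evaluation, here's a toy rdflib sketch that dereferences a seed URI and keeps following the URIs it encounters; real link traversal query execution is considerably smarter about what to follow, and the seed URI here is a placeholder:

```python
# Toy sketch of the "follow your nose" idea: dereference a seed URI,
# merge whatever RDF comes back, and follow URIs it mentions, up to a
# small budget. The seed URI is a placeholder.
from rdflib import Graph, URIRef

def traverse(seed, max_uris=10):
    """Breadth-first dereferencing: fetch URIs, merge returned RDF."""
    g, queue, seen = Graph(), [URIRef(seed)], set()
    while queue and len(seen) < max_uris:
        uri = queue.pop(0)
        if uri in seen:
            continue
        seen.add(uri)
        try:
            g.parse(uri)  # HTTP content negotiation; may fail for non-RDF
        except Exception:
            continue
        for s, _, o in g:
            for term in (s, o):
                if isinstance(term, URIRef) and term not in seen:
                    queue.append(term)
    return g

g = traverse("http://example.org/resource/Amsterdam")  # placeholder seed
print(len(g), "triples gathered")
```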

Conclusion

I think the report coming from this symposium will provide a good document sketching out the research challenges in semantic data management for the next several years. I’m looking forward to it. I’ll end with a quote from a slide in José Manuel Gómez-Pérez’s talk: according to the IDC 2011 Digital Universe study, metadata is the fastest-growing data category.

There’s demand for the work we are doing and there are many challenges remaining – this promises to be a fun couple of years.

Filed under: academia, linked data Tagged: dagstuhl, research challenges, semantic data management