News and Updates on the KRR Group
Header image

Source: Data2Semantics


During the COMMIT/ Community event, April 2-3 in Lunteren, the Data2Semantics project won one of three COMMIT/ Valorization awards. The award is a €10,000 subsidy to encourage the project to bring one of its products closer to use outside academia.

At the event, we presented and emphasized the philosophy of Data2Semantics: to embed new enrichment tools in the current workflow of individual researchers. We are working closely with both Figshare.com (with our Linkitup tool) and Elsevier Labs to bring semantics to the fingertips of the researcher.

Source: Think Links

The rise of Fair Trade food and other products has been amazing over the past four years. Indeed, it’s great to see how certification for the origins (and production processes) of products is becoming both prevalent and expected. For me, it’s nice to know where my morning coffee was grown, and knowing that lets me figure out the quality of the coffee (is it single origin or a blend?).

I now think it’s time we do the same for data. As we work in environments where our data is aggregated from multiple sources and processed along complex digital supply chains, we need the same sort of “fair trade” style certificate for our data. I want to know that my data was grown, nurtured and treated with care, and it would be great to have a stamp that lets me understand that at a glance, without having to do a lot of complex digging.

In a just-published commentary in IEEE Internet Computing, I go into a bit more detail about how provenance and linked data technologies are laying the groundwork for fair trade data. Take a look and let me know what you think.

 

 

Filed under: provenance, supply chains Tagged: data, fair trade, provenance, supply chain

Source: Think Links

In the context of the Open PHACTS and Linked Data Benchmark Council projects, Antonis Loizou and I have been looking at how to write better SPARQL queries. In the Open PHACTS project, we’ve been writing super complicated queries to integrate multiple data sources, and from experience we realized that different combinations of factors can dramatically impact performance. With this experience, we decided to do something more systematic and test how the different techniques we came up with mapped to database theory and worked in practice. We just submitted a paper for review on the outcome. You can find a preprint (On the Formulation of Performant SPARQL Queries) on arxiv.org at http://arxiv.org/abs/1304.0567. The abstract is below. The fancy graphs are in the paper.

But if you’re just looking for ways to write better queries, here are the main rules-of-thumb that we found.

  1. Minimise optional triple patterns: Reduce the number of optional triple patterns by identifying those triple patterns for a given query that will always be bound, using dataset statistics.
  2. Localise SPARQL subpatterns: Use named graphs to specify the subset of triples in a dataset that portions of a query should be evaluated against.
  3. Replace connected triple patterns: Use property paths to replace connected triple patterns where the object of one triple pattern is the subject of another.
  4. Reduce the effects of cartesian products: Use aggregates to reduce the size of solution sequences.
  5. Specify alternative URIs: Consider different ways of specifying alternative URIs beyond UNION.
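To make heuristics 2 and 3 concrete, here is a small sketch; the `ex:` predicates and the graph URI are invented for illustration and are not from the paper’s benchmark queries:

```sparql
PREFIX ex: <http://example.org/ns#>

# Before: two connected triple patterns (the object of the first is the
# subject of the second), evaluated against the whole dataset.
SELECT ?name WHERE {
  ?assay    ex:tests ?compound .
  ?compound ex:name  ?name .
}

# After: heuristic 3 replaces the chain with a property path, and
# heuristic 2 uses a named graph to localise evaluation to the subset
# of triples that actually contains these predicates.
SELECT ?name WHERE {
  GRAPH <http://example.org/assays> {
    ?assay ex:tests/ex:name ?name .
  }
}
```

How much either rewrite helps depends on the store, which is exactly why the rule of thumb below is to test on your own data.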

Finally, one thing we did learn was: test, test, test. The performance of the same query can vary dramatically across triple stores.

Title: On the Formulation of Performant SPARQL Queries
Authors: Antonis Loizou and Paul Groth

The combination of the flexibility of RDF and the expressiveness of SPARQL
provides a powerful mechanism to model, integrate and query data. However,
these properties also mean that it is nontrivial to write performant SPARQL
queries. Indeed, it is quite easy to create queries that tax even the most optimised triple stores. Currently, application developers have little concrete guidance on how to write “good” queries. The goal of this paper is to begin to bridge this gap. It describes 5 heuristics that can be applied to create optimised queries. The heuristics are informed by formal results in the literature on the semantics and complexity of evaluating SPARQL queries, which ensures that queries following these rules can be optimised effectively by an underlying RDF store. Moreover, we empirically verify the efficacy of the heuristics using a set of openly available datasets and corresponding SPARQL queries developed by a large pharmacology data integration project. The experimental results show improvements in performance across 6 state-of-the-art RDF stores.

Filed under: linked data Tagged: heuristics, performance, sparql, triple store

Source: Think Links

You should go read Jason Priem‘s excellent commentary in Nature, Scholarship: Beyond the Paper, but I wanted to call out a bit that I’ve talked about with a number of people and that I think is important. We should be looking at how we build the best teams of scientists, not just looking for the single best individual:

Tenure and hiring committees will adapt, too, with growing urgency. Ultimately, science evaluation will become something that is done scientifically, exchanging arbitrary, biased, personal opinions for meaningful distillations of entire communities’ assessments. We can start to imagine the academic department as a sports team, full of complementary positions (theorists, methodologists, educators, public communicators, grant writers and so on). Coming years will see evaluators playing an academic version of Moneyball (the statistical approach to US baseball): instead of trying to field teams of identical superstars, we will leverage nuanced impact data to build teams of specialists who add up to more than the sum of their parts.

Science is a big team sport, especially with today’s need for interdisciplinary and large-scale experiments. We need to encourage the building of teams in the sciences.

Filed under: academia, altmetrics Tagged: moneyball, science, team

Source: Think Links

Beyond the PDF - drawn notes day 1

Wow! The last three days have been crazy, hectic, awesome and inspiring. We just finished putting on the Future of Research Communication and e-Scholarship (FORCE11)’s Beyond the PDF 2 conference here in Amsterdam. (I was chair of the organizing committee and in charge of local arrangements.) The idea behind Beyond the PDF was to bring together a diverse set of people (scholars, technologists, policy experts, librarians, start-ups, publishers, …) all interested in making scholarly and research communication better. In that respect, I think we achieved our goal. We had 210 attendees from across the spectrum. Below are two charts: one of the types of organizations the attendees came from and one of the domains they work in.


The program of the conference was varied. We covered new tools, business models, the context of the approach, research evaluation, visions for the future and how to move forward. I won’t go over the entire conference here. We’ll have complete video online soon (thanks Elsevier). I just wanted to call out some personal highlights.

Keynotes

We had two great keynotes: one from Kathleen Fitzpatrick of the Modern Language Association and the other from Carol Tenopir (Chancellor’s Professor at the School of Information Sciences at the University of Tennessee, Knoxville). Kathleen discussed how it is essential for the humanities to embrace new forms of scholarly communication, as they allow for faster dissemination of their work. Carol discussed the practice of reading for academics. She’s done in-depth tracking of how scientists read. Some interesting tidbits: successful scientists read more, and so far social media use has not decreased the amount of reading that scientists do. The keynotes were really a sign of how much more the humanities were present at this conference than at Beyond the PDF 1.


Kathleen Fitzpatrick (@kfitz), Director of Scholarly Communication, Modern Language Association


The tools are there

Jason Priem compares online journals to horses

Just two years ago at the first Beyond the PDF, there were mainly initial ideas and drafts of next-generation research communication tools. At this year’s conference, there was a huge number of tools ready to be used: Figshare, PDFX, Authorea, Mendeley, IsaTools, StemBook, Commons in a Box, IPython, ImpactStory and so on…

Furthermore, there are different ways of publishing, from PeerJ to Hypothes.is and even just posting to a blog. Probably the most interesting idea of the conference was the use of GitHub to essentially publish.

For me, this made it clear that it’s time to revisit my own scientific workflow and figure out how to update it to make better use of these tools in practice.

People made connections

At the end of the conference, I asked if people had made a new connection. Almost every hand went up. It was great to see publishers, technologists and librarians all talking together. The Twitter backchannel at the conference was great. We saw a lot of conversations that kept going on #btpdf2, as well as people commenting while watching the live stream. Check out a great Storify of the social media stream of the conference done by Graham Steel.


Creative Commons License
Beyond the PDF 2 photographs by Maurice Vanderfeesten is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at http://sdrv.ms/YI4Z4k.

Making it happen

We gave a challenge to the community: “what would you do with 1k today that would change scholarly communication for the better?” The challenge was well received, and we had a bunch of different ideas, from sponsoring viewing parties to encouraging the adoption of DOIs in the developing world and by small publishers.

The Challenge of Evaluation

We had a great discussion around the role of evaluation. The format Carole Goble used for the evaluation session, with role playing representing key players in the evaluation of research and researchers, really highlighted the fact that we have a first-mover problem. None of the roles felt that they should go first. It was unclear how to push past that challenge.

Various Roles in Science Evaluation

Summary

Personally, I had a great time. FORCE11 is a unique community, and I think it brings together the people who need to talk to change the way we communicate scholarship. These were my quick thoughts on the event; there’s a lot more to come. We will have video of the event up soon, and we will post the drawn notes provided by Jongens van de Tekeningen. We will also award a series of 1k grants to support ongoing work. Finally, I hope to see many more blog posts documenting the different views of attendees.

Thanks

We had many great sponsors who helped make this a great event. Things like live streaming, student scholarships, a professional set-up, demos and dinner ensure that an event like this works.

Filed under: academia, altmetrics, events, interdisciplinary research Tagged: #btpdf2, force11, scholarly communication

Source: Data2Semantics

BeyondThePDF2 – TimeLapse from Jongens van de Tekeningen on Vimeo.

Data2Semantics was the local organizer for the Beyond the PDF 2 conference. This conference brought together over 200 scholars, technologists, librarians and publishers to discuss the future of research communication. The conference had a huge social media presence with 3,500 tweets sent by 625 participants over the course of the two days. There were also lots of other outcomes.

This is another example of how Data2Semantics is reaching out to the scientific and research communities to push new ways of doing research.

 

Benchmarks

Posted by admin in ldbc

This page will contain reference information for vendors and users about the RDF and graph database benchmarks developed by LDBC, once they are completed. One can track the development of the benchmarks at: http://www.ldbc.eu:8090/display/TUC/Benchmark+Task+Forces

Source: Think Links

One of the ideas in the altmetrics manifesto was that altmetrics allow a diversity of metrics. With colleagues in the VU University Amsterdam’s Network Institute, we’ve been investigating the use of online data (in this case, Google Scholar) to help create new metrics to measure the independence of researchers. Here, we need fresh data to establish whether an emerging scholar is becoming independent from their supervisor. We just had the results of one of our approaches accepted into the Web Science 2013 conference. The abstract is below, and here’s a link to the preprint.

Identifying Research Talent Using Web-Centric Databases 

Anca Dumitrache, Paul Groth and Peter van den Besselaar

Metrics play a key part in the assessment of scholars. These metrics are primarily computed using data collected in offline procedures. In this work, we compare the usage of a publication database based on a Web crawl and a traditional publication database for computing scholarly metrics. We focus on metrics that determine the independence of researchers from their supervisor, which are used to assess the growth of young researchers. We describe two types of graphs that can be constructed from online data: the co-author network of the young researcher, and the combined topic network of the young researcher and their supervisor, together with a series of network properties that describe these graphs. Finally, we show that, for the purpose of discovering emerging talent, dynamic online resources for publications provide better coverage than more traditional datasets.

This is fairly preliminary work; it mainly establishes that we want to use the freshest possible data for this purpose. We are expanding the work to do a large-scale study of independence as well as to use different sources of data. But to me, this shows how the freshness of web data allows us to begin looking at and measuring research in new ways.
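As a sketch of the kind of signal such a co-author network can carry, the snippet below builds co-author sets from publication records and measures how much of a junior researcher’s network is shared with their supervisor. The data, names and overlap metric are all invented for illustration; the actual network properties used are defined in the preprint.

```python
# Illustrative sketch: compare co-author sets of a junior researcher and
# their supervisor, extracted from (hypothetical) crawled publication
# records. The metric here is a made-up example, not the paper's.

def coauthors(publications, author):
    """Collect everyone who co-authored at least one paper with `author`."""
    people = set()
    for authors in publications:
        if author in authors:
            people.update(a for a in authors if a != author)
    return people

def shared_coauthor_fraction(publications, junior, supervisor):
    """Fraction of the junior researcher's co-authors who are the supervisor
    or the supervisor's co-authors (lower hints at more independence)."""
    junior_net = coauthors(publications, junior)
    senior_net = coauthors(publications, supervisor) | {supervisor}
    if not junior_net:
        return 0.0
    return len(junior_net & senior_net) / len(junior_net)

# Hypothetical publication records: each entry is one paper's author list.
pubs = [
    ["junior", "supervisor", "alice"],
    ["junior", "supervisor"],
    ["junior", "bob"],  # a paper written without the supervisor
]

print(shared_coauthor_fraction(pubs, "junior", "supervisor"))
# prints 0.6666666666666666 (2 of the junior's 3 co-authors are shared)
```

Tracking how such a fraction changes over time, on continuously refreshed web data, is the sort of analysis that offline databases make hard.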

Filed under: altmetrics Tagged: #altmetrics, independence indicator, web science, websci13

Source: Think Links

I’ve been reviewing papers lately and I’m beginning to develop a new heuristic: if I follow a link mentioned in the paper and there’s something reasonable on the other end, there’s a good chance the paper is good. Not all the time, of course, but it’s a surprisingly good predictor. In particular, I review computer science papers, many of which describe frameworks, architectures or systems. The potential reusability of these artifacts is partly premised on the availability of their code. Unfortunately, in some cases there’s nothing on the other end of the link, or the link doesn’t make sense.

The moral of the story – include links in your papers and make sure they work.

 

Filed under: academia

Source: Data2Semantics


VU University Amsterdam (Photo credit: Wikipedia)

Learn to build better code in less time.

Software Carpentry (http://www.software-carpentry.org) is a two-day bootcamp for researchers to learn how to be more productive with code and software creation. VU University Amsterdam brings Software Carpentry to the Netherlands for the first time. PhD students, postdocs and researchers in physics are cordially invited to this free two-day workshop, May 2–3, 2013, in Amsterdam.

Data2Semantics is sponsoring the event to help learn about the issues scientists face in managing their data.

Go to http://www.data2semantics.org/bootcamp for more information and registration (max. 40 participants).
