News and Updates on the KRR Group
Header image

Source: Think Links

It’s been about two weeks since we had the altmetrics11 workshop at Web Science 2011, but I was swamped with the ISWC conference deadline, so I only just got around to posting about it now.

The aim of the workshop was to gather together the group of people working on next-generation measures of science based on the Web. Importantly, as organizers, Jason, Dario and I wanted to encourage the growth of the scientific side of altmetrics.

The workshop turned out far better than I expected. We had roughly 36 attendees, well beyond our expectations. You can see some of the attendees here:

There was nice representation from my institution (VU University Amsterdam), including talks by my collaborators Peter van den Besselaar and Julie Birkholz. But we had attendees from Israel, the UK, the US and all over Europe. People were generally excited about the event and the discussions went well (although the room was really warm). I think we all had a good time at the restaurant, the Alt-Coblenz (highly recommended, by the way, and an appropriate name). Thanks to the WebSci organizing team for putting this together.

We had a nice mix of social scientists and computer scientists (~16 and 20, respectively). In particular, we had representation from the bibliometrics community, social studies of science, and computer science.

Importantly, for an emerging community, there was real honesty about the research. Good results were shown, but almost every author also discussed the gaps in their own research.

Two discussions came to the fore for me. One was on how we evaluate altmetrics. Mike Thelwall, who gave the keynote (great job, by the way), suggests using correlations with the journal impact factor to help demonstrate that there is something scientifically valid in what you’re measuring. What you want is not perfect correlation but correlation with a gap, and that gap is what your new alternative metric is then measuring. There was also the notion from Peter van den Besselaar that we should look more closely at how our metrics match what scientists do in practice (i.e. qualitative studies). For example, do our metrics correlate with promotions or hiring?

The second discussion was around where to go next with altmetrics. In particular, there was a discussion on how to position altmetrics as a research field, and it seemed to position itself within and across the fields of science studies (i.e. scientometrics, webometrics, virtual ethnography). Importantly, it was felt that we needed a good common corpus of information in order to do comparative studies of metrics. Altmetrics has the problem of data acquisition: while some people are interested in that, others want to focus on metric generation and evaluation. A corpus of traces of science online was felt to be a good way to interconnect data acquisition and metric generation and to allow for such comparative studies. But how to build the corpus… Suggestions welcome.

The attendees wanted to have an altmetrics12 so I’m pretty sure we will do that. Additionally, we will have some exciting news soon about a journal special issue on altmetrics.

Some more links:

Abstracts of all talks

Community Notes

Also, could someone leave a link to the twitter archive in the comments? That would be great.

Filed under: academia, altmetrics, interdisciplinary research Tagged: #altmetrics, science studies, web science, websci11

Source: Think Links

While exploring the London Science Museum, I saw this great exhibit for the Toaster Project. The idea was to try to build a modern-day toaster from scratch. There’s a video describing the project below and more info about the project on the site linked above. What was interesting was that, to get information about how things are produced, Thomas Thwaites had to go look in some pretty old books. I think it would be cool to make it easy to link every product in my house to how to produce it (or how it was created) without going through a nine-month process to figure it out.

Filed under: supply chains Tagged: cool project, real world provenance, toaster project

Source: Think Links

I’m in London for a number of meetings. Last week I had a great time talking with chemists and IT people about how to deal with chemistry data in the new project I’m working on, OpenPhacts. You’ll probably hear more about this from me as the project gets up and running. This week I’m at a workshop discussing and hacking on some next-generation ways of measuring impact in science.

Anyway, over the weekend I got to visit some of London’s fantastic museums. I spend a lot of my time thinking about ways of describing the provenance of things, particularly data. This tends to get rather complicated. But visiting these museums, you see how even very simple provenance can add a lot to understanding something. Here are some examples:

A very cool looking map of Britain from the Natural History Museum:

Checking out the bit of text that goes with it:

We now know that it was produced by William Smith, working on his own, in 1815, and that this version is a facsimile. Furthermore, we find out that it was the first geological map of Britain. That little bit of information about the map’s origins makes it even cooler to look at.

Another example this time from the Victoria and Albert Museum. An action packed sculpture:

And we look at the text associated with it:

and find some interesting provenance information. We have a rough idea of when it was produced (between 1622 and 1623) and who made it (Bernini). Interestingly, we also find out how it passed through a series of owners: from Cardinal Montalto to Joshua Reynolds, then into the Yarborough Collection, and finally purchased by the museum. This chain of ownership is classic provenance. Actually, Wikipedia has an even more complete provenance of the sculpture.

These examples illustrate how a bit of provenance can add so much richness and meaning to objects. I’m going to be on the lookout for provenance in the wild.

If you spot some cool examples of provenance, let me know.

Filed under: communicating provenance

Source: Think Links

I’m pretty excited about the Beyond Impact workshop next week in London. It’s a workshop/hackathon to look at next generation ways of measuring impact in science. This has to do with the altmetrics initiative I’m involved with and our Semantically Mapping Science project.

Here’s me giving a video introduction of myself for the workshop….

Filed under: academia Tagged: #altmetrics, beyondimpact

Source: Semantic Web world for you

The LOD cloud as rendered by Gephi

One year ago, we posted on the LarkC blog a first network model of the LOD cloud. Network analysis software can highlight aspects of the cloud that are not directly visible otherwise: in particular, the presence of dense sub-groups and of several hubs, whereas in the classical picture DBpedia is easily perceived as being the only hub.

Computing network measures such as centralities, the clustering coefficient or the average path length can reveal much more about the content of a graph and the interplay of its nodes. As shown since that blog post, this information can be used to track the evolution of the Web of Data and devise actions to improve it (see the WoD analysis page for more information about our research on this topic). Unfortunately, the picture provided by Richard and Anja cannot be fed directly into network analysis software, which expects .net or CSV files instead. Fortunately, thanks to a very nice API it is easy to write a script generating such files. We made such a script and thought it would be a good idea to share it :-)

The script is hosted on GitHub. It produces a “.net” file following the format of Pajek and two CSV files, one for the nodes and one for the edges. These CSVs can then easily be imported into Gephi, for instance, or any other software of your choice. We also made a dump of the cloud as of today and packaged the resulting files.
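Once the edge CSV is loaded, even a basic degree count exposes the hub structure mentioned above. A minimal sketch in Python (stdlib only, no Gephi needed); the two-column source,target layout of the edges file is an assumption here, so adjust the column handling to match the script’s actual output:

```python
import csv
import io
from collections import Counter

def degree_counts(edge_csv):
    """Count, for each data set, how many links it participates in.

    Assumes a two-column CSV of (source, target) edges; a data set
    with a very high count is a hub in the LOD cloud graph.
    """
    degrees = Counter()
    for source, target in csv.reader(edge_csv):
        degrees[source] += 1
        degrees[target] += 1
    return degrees

# Toy example: a DBpedia-like hub linked to three smaller data sets.
sample = io.StringIO(
    "DBpedia,Dataset1\nDBpedia,Dataset2\nDBpedia,Dataset3\nDataset1,Dataset2\n"
)
hubs = degree_counts(sample).most_common(2)
print(hubs)  # DBpedia comes out with the highest degree
```

The same loop extends naturally to weighted degrees or to feeding an adjacency structure into whatever clustering-coefficient or path-length routine you prefer.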

Have fun analysing the graph and let us know if you find something interesting ;-)


Source: Semantic Web world for you

Wayan recently blogged about the SemanticXO project, asking about its current status. Unfortunately, I couldn’t comment on his blog, so I’d like to answer his question here. Daniel also expressed some doubts about the Semantic Web, so I’ll try to clarify what this is all about.

To be honest, I’m not sure what that really means. Is this a database project? Is it to help translation of the Sugar User Interface? Or are children somehow to use SemanticXO in their language acquisition?

Semantic technologies are knowledge representation tools used to model factual information, for instance “Amsterdam, isIn, Netherlands”. These facts are stored in optimised databases called triple stores. So, yes, it is kind of a database project, one which aims at installing such a triple store and providing an API for using it. The technologies developed for the Semantic Web are particularly suited to storing and querying multilingual data, so activities that need to store text in different languages would directly benefit from this feature. The triple store could indeed eventually be used instead of the .po files to store multilingual data for Sugar.
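To make the idea concrete, here is a toy in-memory version of the triple pattern matching that a real triple store provides. This is only an illustration of the principle, not the SemanticXO API or any actual store:

```python
# A toy "triple store": a set of (subject, predicate, object) facts.
triples = {
    ("Amsterdam", "isIn", "Netherlands"),
    ("Paris", "isIn", "France"),
    ("Amsterdam", "label@nl", "Amsterdam"),
}

def match(s=None, p=None, o=None):
    """Return every triple compatible with the given pattern.

    None acts as a wildcard, playing the role of a variable in a query.
    """
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

print(match(p="isIn"))  # every "X isIn Y" fact in the store
```

A real triple store adds indexing, persistence and a SPARQL front end on top of exactly this kind of pattern matching.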

The goal of SemanticXO is not only to provide an API for using a triple store on the XO but also to provide access to data published using Semantic Web technologies. Many data sets have been published on the Web, forming a network of more than 27 billion facts that can be queried and combined. Although not exhaustive, the Linked Open Data (LOD) cloud gives a good idea of the amount of data out there. With SemanticXO, an activity developer will be able to simply get the population of Amsterdam, or the exact location of Paris, or the population of London, or whatever. The LOD cloud can be queried just like a database, and it contains a lot of information about many topics. And because the XO will itself be able to use the same publication system, the kids using Sugar will be able to publish their data on the cloud directly from an activity.

Currently, it is hard, if not impossible, to get such atomic information and just insert it somewhere into an activity with a few lines of code…
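For comparison, here is roughly what those “few lines of code” look like once the data is on the LOD cloud. The sketch below asks DBpedia’s public SPARQL endpoint for the population of Amsterdam; dbo:populationTotal is the property DBpedia uses for this, but the surrounding code is illustrative and the actual call needs network access, so it is left commented out:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# A SPARQL query for one atomic fact: Amsterdam's population.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?population WHERE {
  <http://dbpedia.org/resource/Amsterdam> dbo:populationTotal ?population .
}
"""

def fetch_population(endpoint="http://dbpedia.org/sparql"):
    """Send the query to a SPARQL endpoint and return the first binding."""
    params = urlencode({"query": QUERY,
                        "format": "application/sparql-results+json"})
    with urlopen(f"{endpoint}?{params}") as response:
        results = json.load(response)
    return results["results"]["bindings"][0]["population"]["value"]

# fetch_population()  # requires network access to dbpedia.org
```

The point is the shape of the code, not this particular endpoint: an activity on the XO would issue the same kind of query against its local store or the cloud.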

Regardless of its purpose, it seems that SemanticXO development has come to a halt. The only other post from Christophe Guéret detailed getting RedStore running on the XO, where he noted the challenges of installing a triple store on an XO using RedStore, namely that RedStore depends on some external libraries that are not yet packaged for Fedora 11, and since it’s not so easy to compile directly on the XO, a second computer is required.

That post was published on the 11th of April 2011. To date, there have been three posts about SemanticXO: the introduction (posted on December 15, 2010), the installation of a triple store (posted on December 20, 2010) and a first activity using the triple store (posted on April 5, 2011). So there has been one other post since the installation of the triple store. But that first step of installing a triple store was indeed important for what I want to do with SemanticXO, and it was not easy to find one that would fit the low specs of an XO-1. Then, the installation was a bit challenging because of the dependencies, but nothing really exceptional there. Ideally, the triple store will come installed by default on the OLPC OS releases some day :-)

Once installed, the XO didn’t return query results quickly. The XO failed a number of benchmark queries for different triple stores, even when they were run over a full night.

I was pleased, surprised and relieved to see that the triple store worked in the first place! From what I know, it was the first time a triple store had run on such low-spec hardware, and I wanted to see how far I could push it. So I loaded a significant amount of triples (50k) and ran one of the testing suites we typically use to benchmark triple store performance. As expected, the response time was long and the most complex queries just failed. But these evaluation systems are aimed at testing big triple stores on big hardware, and the queries are designed to see how the triple store deals with extreme cases. Considering that, on the oldest generation of XO, the triple store managed to answer queries way more complex than the ones it is expected to deal with, I found the results acceptable and decided to move on to the next steps.

So Christophe, what does this mean? Is a Semantic Web for children using the XO possible?

Yes, it is possible and I’m still actively working on it! The development is going slower than I would like, as, like many contributors, I work on this project in my spare time, but it is going on. The last post on this blog shows an activity using the store for its internal data and contains a pointer to a technical report that, I hope, will shed more light on the project’s goals and status. Right now, I’m working on extending this activity and implementing a drop-in replacement for the data store that would use the triple store to keep metadata about the different entries. The clustering activity only shows how activities in Sugar can store data using the triple store, so I’m also working on an activity that will show the other aspect: how the same concepts can be used to get data from the LOD cloud and display it.

I have been able to detect no clear correlation between use of the term “Semantic Web” and knowledge of what it means. I think everybody just read it in Wired in 1999 and filed it away as a really good thing to put on a square of your Buzzword Bingo card.

Since 1999, and until a few years ago, the Semantic Web was searching for its own identity and meaning. It started out as a vision of having data published on the Web just as the Web as we know it allows for the publication of documents. Translating a vision into concrete technologies is a lengthy process, subject to debate and trial-and-error phases, before you get to something everyone can see and play with. Now we are getting on track, with data sets being published on the Web using Semantic Web technologies (the LOD cloud, Linked Open Commerce), dedicated high-end conferences (ISWC, ESWC, SemTech, …) and journals (JWS, SWJ, …). Outside of academia, there is also an increasing number of Semantic Web applications, but most of them are invisible to the end user. Have you noticed that Facebook is using Semantic Web technologies to mark up pages for its famous “Like” button? Or that the NYTimes uses the same technologies to tag its articles? And these are only two examples out of many more.

As highlighted by Tom Ilube from Garlik (another company using Semantic Web technology), the Semantic Web is a change in the infrastructure of the Web itself that you won’t even see happening.


Source: Semantic Web world for you

In the past few years, many data sets have been published and made public in what is now often called the Web of Linked Data, a step towards “Web 3.0”: a Web combining a network of documents and data, suitable for both human and machine processing. In this Web 3.0, programs are expected to give more precise answers to queries, as they will be able to associate a meaning (the semantics) with the information they process. Sugar, the graphical environment found on the XO, is currently Web 2.0 enabled (it can browse web sites) but has no dedicated tools to interact with the Web 3.0. The goal of the SemanticXO project, introduced earlier on this blog, is to make Sugar Web 3.0 proof by adding semantic software to the XO.

One cornerstone of this project is to get a triple store, the software in charge of storing the semantic data, running on the limited hardware of the machine (in our case, an XO-1). As this proved to be feasible, we can now go further and start building activities that make use of it. To begin with, a simple clustering activity: the goal is to sort items into boxes using drag&drop. The user can create as many boxes as needed, and the items may be moved around between boxes. Here is a screenshot of the application, showing Amerindian items:

Prototype of the clustering activity

The most interesting aspect of this activity is actually under its hood and is not visible in the screenshot. Here are some of the triples generated by the application (note that the URIs have been shortened for readability):

subject predicate object
olpc:resource/a05864b4 rdf:type olpc:Item
olpc:resource/a05864b4 olpc:name "image114"
olpc:resource/a05864b4 olpc:hasDepiction "image114.jpg"
olpc:resource/a82045c2 rdf:type olpc:Box
olpc:resource/a82045c2 olpc:hasItem olpc:resource/a05864b4
olpc:resource/78cbb1f0 rdf:type olpc:Box

It is worth noting here the flexibility of this data model: the assignment of an item to a box is stated by a triple using the predicate “hasItem”; one of the boxes is empty simply because there is no such statement linking it to an item. Any number of similar triples can be used, without constraint, and the same goes for all the other triples in the system. There is no required set of predicates that all the items must have. Let’s see what can be done with this data through three different SPARQL queries, from the simplest to the most sophisticated:

  • List the URIs of all the boxes and the items they contain:
    SELECT ?box ?item WHERE {
      ?box rdf:type olpc:Box.
      ?box olpc:hasItem ?item.
    }
  • List the items and their attributes:
    SELECT ?item ?property ?val WHERE {
      ?item rdf:type olpc:Item.
      ?item ?property ?val.
    }
  • List the items that are not in a box:
    SELECT ?item WHERE {
      ?item rdf:type olpc:Item.
      OPTIONAL {
        ?box rdf:type olpc:Box.
        ?box olpc:hasItem ?item.
      }
      FILTER (!bound(?box))
    }

These three queries are just examples; the really nice thing about this query mechanism is that (almost) anything can be asked through SPARQL. There is no need to define a set of API calls to cover a list of anticipated needs: as soon as the SPARQL endpoint is made available, every activity may ask whatever it wants! :)
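To see why the third query works, note that the OPTIONAL plus FILTER(!bound(?box)) pattern amounts to a set difference: all items, minus the items some box points to. A toy re-implementation in Python over triples like those listed above (URIs shortened to bare identifiers, and one hypothetical unboxed item added for the sake of the example):

```python
# Toy triples mirroring the listing above; "deadbeef" is a made-up
# extra item that no box contains.
triples = [
    ("a05864b4", "rdf:type", "olpc:Item"),
    ("a05864b4", "olpc:name", "image114"),
    ("a82045c2", "rdf:type", "olpc:Box"),
    ("a82045c2", "olpc:hasItem", "a05864b4"),
    ("78cbb1f0", "rdf:type", "olpc:Box"),
    ("deadbeef", "rdf:type", "olpc:Item"),
]

# All items, and all items referenced by a hasItem statement.
items = {s for s, p, o in triples if p == "rdf:type" and o == "olpc:Item"}
boxed = {o for s, p, o in triples if p == "olpc:hasItem"}

# Items that are not in any box: the set difference.
print(sorted(items - boxed))  # → ['deadbeef']
```

Of course, the whole point of SPARQL is that you never need to hand-code this logic per query; the pattern language expresses it declaratively.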

We are not done yet, as there is still a lot of development needed to finish the application (game mechanics, sharing of items, …). If you are interested in knowing more about the clustering prototype, feel free to drop a comment on this post and/or follow the activity on GitHub. You can also find more information in this technical report about the current achievements of SemanticXO and the ongoing work.

Source: Semantic Web world for you

This post is a re-blog of this post published on

Some weeks ago, a first version of a wrapper for the GoogleArt project from Google was put online (see also this blog post). This wrapper, which at first offered data only for individual paintings, has now been extended to museums. The front page of GoogleArt is also available as RDF, providing a machine-readable list of museums. This index page makes it possible, and easy, to download an entire snapshot of the data set, so let’s see how to do that ;-)

Downloading the data set from a wrapper

Wrappers around web services offer an RDF representation of content available at the original source. For instance, the SlideShare wrapper provides an RDF representation of a presentation page from the SlideShare web site. The GoogleArt wrapper takes the same approach for paintings and museums listed on the GoogleArt site. Typically, these wrappers work by mimicking the URI scheme of the site they are wrapping: changing the hostname, and part of the path, of the URL of the original resource to that of the wrapper gives you access to the sought data.
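The URL rewriting itself is mechanical. A small sketch in Python, where the wrapper hostname is a made-up placeholder (a real wrapper may also rewrite part of the path, not just the host):

```python
from urllib.parse import urlsplit, urlunsplit

def to_wrapper_uri(original, wrapper_host="wrapper.example.org"):
    """Point a resource URL at the wrapper instead of the original site,
    keeping the path, query and fragment intact.

    The default wrapper hostname is a hypothetical placeholder."""
    parts = urlsplit(original)
    return urlunsplit((parts.scheme, wrapper_host,
                       parts.path, parts.query, parts.fragment))

print(to_wrapper_uri("http://www.googleartproject.com/museums/moma"))
# → http://wrapper.example.org/museums/moma
```

De-referencing the rewritten URI then returns RDF describing the same resource the original HTML page described.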

From a linked data perspective, wrappers do a valid job of providing de-referencable URIs for the entities they describe. However, this “de-referencing only” scheme makes them harder to query. Wrappers don’t offer SPARQL endpoints because they don’t store the data they serve; that data is computed on the fly when the URIs are accessed. To query a wrapper, one has to rely on an indexing service that harvests the different documents and indexes them, much like the way Web documents are found, and for which the semantic web index Sindice is the state-of-the-art solution.

But such an external indexing service may not provide you with the entire set of triples, or may not allow you to download big chunks of its harvested data. In that case, the best way to get the entire dataset locally is to use a spider to download the content published under the different URIs.

LDSpider, an application developed by Andreas Harth (AIFB), Juergen Umbrich (DERI), Aidan Hogan and Robert Isele, is the perfect tool for doing that. LDSpider crawls linked data resources, storing the triples it finds in an N-Quads file. N-Quads are triples to which a named graph has been added; by using them, LDSpider keeps track of the sources of the triples in the final result.

Using a few simple commands, it is possible to harvest all the triples published by the GoogleArt wrapper. As of the time of writing, there seems to be a bug in the latest release of LDSpider (1.1d) that prevented us from downloading the data. However, everything works fine with the trunk version, which can be downloaded and compiled this way:

svn checkout ldspider-read-only
cd ldspider-read-only
ant build

Once we have LDSpider ready to go, point it to the index page (“-u”), ask for a load-balanced crawl (“-c”) and request that it stay within the same domain name (“-y”) as the starting resource. This last option is very important! Since the resources published by the wrapper are connected to DBpedia resources, omitting “-y” would allow the crawler to download the content of the resources pointed to in DBpedia, and then the content of the resources DBpedia points to, and so on… The last parameter to set is the name of the output file (“-o data.nq”) and you are ready:

java -jar dist/ldspider-trunk.jar -u -y -c -o data.nq

After some time (24 minutes in our case), you get a file with all the data, plus some header quads with extra information about each downloaded resource:

<> <> _:header1087646481301043174989 <> .
_:header1087646481301043174989 <> "200"^^<> <> .
_:header1087646481301043174989 <> "Fri, 25 Mar 2011 08:51:04 GMT" <> .
_:header1087646481301043174989 <> "TornadoServer/1.0" <> .
_:header1087646481301043174989 <> "5230" <> .
_:header1087646481301043174989 <> "application/rdf+xml" <> .
_:header1087646481301043174989 <> "Keep-Alive" <> .

To filter these out and keep only the data contained in the documents, simply use grep:

grep -v "_:header" data.nq > gartwrapper.nq

The final document “gartwrapper.nq” contains around 37k triples, of which 1.6k are links to DBpedia URIs. More information about the data set is available through its CKAN package description. That description also contains a link to a pre-made dump.

Concluding remarks

This download technique is applicable to the content provided by any wrapper or, in general, any data set for which only de-referencable URIs are provided. However, we should stress that, to ensure completeness, a seed URI listing all (or most of) the published resources is needed: the spider works by following links, so be sure to have access to well-connected resources. If several seeds are needed to cover the entire data set, iterate the same process starting at every one of them, or use the dedicated option of LDSpider (“-d”).


Source: Think Links

I’ve posted a couple of times on this blog about events organized at the VU University Amsterdam to encourage interdisciplinary collaboration. One of the major issues to come out of these prior events was that data sharing is a critical mechanism for enabling interdisciplinary research. However, it’s often difficult for scientists to know:

  1. who has what data, and
  2. whether that data is interesting to them.

This second point is important. Because different disciplines use different vocabularies, it is often hard to tell whether a data set is truly useful or interesting in the context of a new domain. What counts as data in one domain may or may not count as data in another.

To help bridge this gap, Iina Hellsten (Organizational Science), Leonie Houtman (Business Research) and I (Computer Science) organized a Network Institute workshop this past Wednesday (March 23, 2011) titled What is Data?

The goal of the workshop was to bring people together from these different domains to discuss the data they use in their everyday practice and to describe what makes data useful to them.

Our goal wasn’t to come up with a philosophical answer to the question but instead to build a map of what researchers from these disciplines consider to be useful data. More important, however, was to bring these various researchers together to talk to one another.

I was very impressed with the turnout. Around 25 people showed up from social science, business/management research and computer science. Critically, the attendees were fully engaged and together produced a fantastic result.

The attendees

The Process

To build a map of data, we used a variant of a classic knowledge acquisition technique called card sorting. The attendees were divided into groups (shown above), making sure that each group had a mix of researchers from each discipline. Within each group, every researcher was asked to give examples of the data they worked with on a daily basis and to explain to the others a bit about what they did with that data. This was a chance for people to get to know each other and have discussions in smaller groups. At the end of this, each group had a pile of index cards with examples of data sets.

Writing down example data sets

The groups were then asked to sort these examples into collections and then give those collections labels. This was probably the most difficult part of the process and led to lots of interesting discussions:

Discussion about grouping

Here’s an example result from one of the groups (the green post-it notes are the collection labels):

Sorted cards

The next step was that everyone in the room got to walk around and label the example data sets from all groups with attributes that they thought were important. For example, a social networking data set is interesting to me if I can access it programmatically. Each discipline got its own color: pink = computer science, orange = social science, yellow = management science.

This resulted in very colorful tables:

After labelling

Once this process was complete, we merged the various tables’ groupings together by data set and category (i.e. collection label), leading to a map of data sets:

The Results

A Map of Data

Above is the map created by the group. You can find a (more or less faithful) transcription of the map here. Here are some highlights.

There were 10 categories of data:

  1. Elicited data (e.g. surveys)
  2. Data based on measurement (e.g. logfiles)
  3. Data with a particular format (e.g. XML)
  4. Structured-only data (e.g. databases)
  5. Machine data (e.g. results of a simulation)
  6. Textual data (e.g. interview transcripts)
  7. Social data (e.g. email)
  8. Indexed data (e.g. Web of Science)
  9. Data useful for both quantitative and qualitative analysis (e.g. newspapers)
  10. Data about the researchers themselves (e.g. how did they do an analysis)

After transcribing the data, I would say that computer scientists are interested in having strong structure in the data, whereas social scientists and business scientists are deeply concerned with having high-quality data that is representative, credible, and collected with care. Across all disciplines, temporality (having things on a timeline) seemed to be a critical attribute of useful data.

What’s next?

At the end of the workshop, we discussed where to go from here. The plan is to have a follow-up workshop where each discipline can present their own datasets using these categorizations. To help focus the workshop, we are looking for two interdisciplinary teams within the VU that are willing to try data sharing and present the results of that trial at the workshop. If you have a data set you would like to share, please post it to the Network Institute LinkedIn group. Once you have a team, let me, Leonie, or Iina know.




Filed under: academia, interdisciplinary research

Source: Think Links

Sunbelt is the annual meeting for Social Network Analysis researchers. It’s been going on since 1981 (a couple of years before analyzing Twitter graphs became hip) and this year it’s being held in Tampa. Two of my colleagues, Julie Birkholz and Shenghui Wang, are attending and presenting some joint work. The abstracts are below. If you’re at Sunbelt, be sure to check out their presentations and have a chat.

At a higher level, I think both pieces of work emphasize the importance of combining rich representations of the data underlying networks with dynamic network analysis. Networks provide a powerful abstraction mechanism, but it’s important to be able to situate that abstraction in a rich context. The techniques we are developing and applying are steps along the way towards enabling these more “situated” networks.

Dynamics Of Scientific Collaboration Networks

Groenewegen, Peter; Birkholz, Julie M.; van der Bunt, Gerhard; Groth, Paul

The evolution of scientific research can be considered as a dynamic network of collaborative relations between researchers. Collaboration in science leads to social networks in which authors can gain prominence through research (knowledge production), access to highly regarded field members, or positions in the collaborative network. While a central position in network terms can be considered a measure of prominence, the same holds for citation scores. Causal evidence on whether a central position in the network corresponds to prominence in other dimensions, such as the number of citations, remains open. In this paper, the collaborative patterns, research interests and citation counts of co‐authoring scientists are analyzed using SIENA to establish whether network processes, community or interest strategies lead to status in a scientific field, or, vice versa, whether status leads to collaboration. Results from an analysis of a subfield of computer science will be presented.

Multilevel Longitudinal Analysis For Studying Influence Between Co‐evolving Social And Content Networks

Wang, Shenghui; Groth, Paul; Kleinnijenhuis, Jan; Oegema, Dirk A

The Social Semantic Web has begun to provide connections between users within social networks and the content they produce across the whole of the Social Web. Thus, the Social Semantic Web provides a basis to analyze both the communication behavior of users together with the content of their communication. However, there is little research combining the tools to study communication behaviour and communication content, namely, social network analysis and content analysis. Furthermore, there is even less work addressing the longitudinal characteristics of such a combination. This paper proposes to take into account both the social networks and the communication content networks. We present a general framework for measuring the dynamic bi‐directional influence between co‐evolving social and content networks. We focus on the twofold research question: how previous communication content and previous network structure affect (1) the current communication content and (2) the current network structure. Multilevel time‐series regression models are used to model the influence between variables derived from social networks and content networks. The effects are studied at the group level as well as the level of individual actors. We apply this framework in two use‐cases: online forum discussions and conference publications. By analysing the dynamics involving both social networks and content networks, we obtain a new perspective towards the connection of social behaviour in the social web and the traditional content analysis.




Filed under: academia Tagged: semantic web, social network analysis, sunbelt