News and Updates on the KRR Group

Source: Think Links

One of the nice things about using cloud services is that sometimes you get a feature you didn’t expect. Below is a nice set of stats from WordPress.com about how well Think Links did in 2010. I was actually quite happy with 12 posts – one post a month. I will be trying to increase the rate of posts this year. If you’ve been reading this blog, thanks, and have a great 2011! The stats are below:

Here’s a high-level summary of this blog’s overall health:

Healthy blog!

The Blog-Health-o-Meter™ reads Fresher than ever.

Crunchy numbers


A Boeing 747-400 passenger jet can hold 416 passengers. This blog was viewed about 4,500 times in 2010. That’s about 11 full 747s.

 

In 2010, there were 12 new posts, growing the total archive of this blog to 46 posts. There were 12 pictures uploaded, taking up a total of 5 MB. That’s about a picture per month.

The busiest day of the year was October 13th with 176 views. The most popular post that day was Data DJ realized….well at least version 0.1.

Where did they come from?

The top referring sites in 2010 were twitter.com, few.vu.nl, litfass.km.opendfki.de, 4store.org, and facebook.com.

Some visitors came searching, mostly for provenance open gov, think links, ready made food, 4store, and thinklinks.

Attractions in 2010

These are the posts and pages that got the most views in 2010.

  1. Data DJ realized….well at least version 0.1 (October 2010)
  2. 4store Amazon Machine Image and Billion Triple Challenge Data Set (October 2009, 2 comments)
  3. Linking Slideshare Data (June 2010, 4 comments)
  4. A First EU Proposal (April 2010, 3 comments)
  5. Two Themes from WWW 2010 (May 2010)

Filed under: meta

Source: Semantic Web world for you

A few days ago, I posted about SemanticXO and said that you would soon see how to install a triple store on your XO. Here are the steps to follow to compile and install RedStore on the XO, put some triples in it and issue some queries. The following has been tested with an XO-1 running software release 10.1.3 and a MacBook Pro running Arch Linux x64 (it is not easy to compile directly on the XO, which is why you need a secondary machine). All the scripts are available here.

Installation of RedStore

RedStore depends on some external libraries that are not yet packaged for Fedora 11, which is used as the base for the XO’s operating system. The script build_redstore.sh will download and compile all the necessary components. You may, however, need to install additional dependencies on your system, such as libxml. The script only takes care of the things RedStore directly depends on, namely raptor2, rasqal and redland (all available here). Here is the full list of commands to issue:

mkdir /tmp/xo
cd /tmp/xo
wget --no-check-certificate https://github.com/cgueret/SemanticXO/raw/master/build_redstore.sh
sh build_redstore.sh

Once done, you will get four files to copy onto the XO; if you would rather skip the compilation, you can also download this pre-compiled package. These files should be put together somewhere, for instance in “/opt/redstore”. Note that all the data RedStore needs will be stored in that same directory. In addition to these four files, you will need a wrapper script and an init script, both available in the source code repository. So, here is what to do on the XO, as root (replace “cgueret@192.168.1.105” with the login/IP appropriate for your setup):

mkdir /opt/redstore
cd /opt/redstore
scp cgueret@192.168.1.105:/tmp/xo/libraptor2.so.0 .
scp cgueret@192.168.1.105:/tmp/xo/librasqal.so.2 .
scp cgueret@192.168.1.105:/tmp/xo/librdf.so.0 .
scp cgueret@192.168.1.105:/tmp/xo/redstore .
wget --no-check-certificate https://github.com/cgueret/SemanticXO/raw/master/wrapper.sh
chmod +x wrapper.sh
cd /etc/init.d
wget --no-check-certificate https://github.com/cgueret/SemanticXO/raw/master/redstoredaemon
chmod +x redstoredaemon
chkconfig --add redstoredaemon

Then you can reboot your XO and enjoy the triple store through its HTTP frontend, available on port 8080 :)
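As a quick sanity check (this is not part of the original post), you can already send a simple SPARQL query to that frontend from any machine on the network. A minimal sketch, assuming the XO answers on 192.168.1.104 (the address used later in this post) and that the /sparql endpoint accepts the standard query parameter over HTTP GET:

# list a few of the triples currently in the store
curl -G 'http://192.168.1.104:8080/sparql' \
     --data-urlencode 'query=SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10'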

Loading some triples

Now that the triple store is running, it’s time to add some triples. The SP2Bench benchmark comes with a tool (sp2b_gen) to generate any number of triples. To begin with, you can generate 50000 triples; that should be roughly the maximum number of triples an XO will have to deal with later on, once activities start storing data in it. Here is what to do, with “192.168.1.104” being the IP of the XO:

sp2b_gen -t 50000
rapper -i guess -o rdfxml sp2b.n3 > sp2b.rdf
curl -T sp2b.rdf 'http://192.168.1.104:8080/data/http://example.com/data'

It takes about 43 minutes to upload these 50k triples, which gives an average of about 53 milliseconds per triple, or 19 triples per second. That’s not fast, but it should be enough for an API that stores a bunch of triples with an acceptable response time. The data takes 4 MB of disk space on the XO for an initial RDF file of about 9.8 MB.

Issuing some queries

The SP2Bench benchmark comes with a generator for the triples and a set of 17 SPARQL queries expressed over this data. The queries are of varying complexity in order to benchmark different triple stores. Unfortunately, 9 of them were too complex for RedStore on the XO with these 50k triples: these queries were not answered even after running for a full night! The 8 remaining queries are answered without much trouble, as long as you have enough time to wait for the result:

Query file Execution time
q1.sparql 14229.4 ms
q2.sparql 44189.2 ms
q3a.sparql 21506.8 ms
q3b.sparql 19498.4 ms
q3c.sparql 19663.9 ms
q10.sparql 3940.6 ms
q11.sparql 4685.2 ms
q12c.sparql 3539.6 ms

The queries have been executed using the “sparql-query” command line client as follows:

cat q2.sparql | sparql-query http://192.168.1.104:8080/sparql -t -p -n
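If you want to reproduce one of these measurements on your own XO, a simple approach (not part of the original post) is to wrap the same command with time:

# measure the wall-clock time of a single query
time sh -c 'cat q2.sparql | sparql-query http://192.168.1.104:8080/sparql -t -p -n > /dev/null'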

The long delays may sound like bad news, but keep in mind that this was with 50k triples and with queries designed to be tricky in order to stress triple store capabilities. With a more normal usage pattern, fewer triples and more standard queries, we can expect things to go better.

Source: Semantic Web world for you

The three XOs received for the project

The One Laptop Per Child (OLPC) project has provided millions of kids worldwide with a low-cost connected laptop, helping them to enhance their knowledge and develop learning skills. Learning a foreign language, getting an introduction to reading and writing, or preserving and reviving an endangered or extinct language are among the possible usages of these XOs. Such activities could benefit significantly from a storage layer optimised for multi-lingual and loosely structured data.

One of the building blocks of the Semantic Web, the “triple store”, is such a data storage service. A triple store is like a database engine optimised to store and provide access to triples, atomic statements binding together a subject, a predicate and an object. For instance, <Amsterdam, isLocatedIn, Netherlands>. And these two triples would give the same value in two different languages: <Amsterdam, isLocatedIn, "Netherlands"@nl>, <Amsterdam, isLocatedIn, "Pays-Bas"@fr>.
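To make this concrete, here is a hypothetical sketch (the URIs and graph name are made up for illustration, and the triples simply mirror the example above) of how such language-tagged triples could be written down in Turtle and pushed into the RedStore instance described earlier, reusing the rapper + curl workflow from the SP2Bench loading example:

# write two language-tagged triples to a small Turtle file
cat > amsterdam.ttl <<'EOF'
@prefix ex: <http://example.org/> .
ex:Amsterdam ex:isLocatedIn "Netherlands"@nl .
ex:Amsterdam ex:isLocatedIn "Pays-Bas"@fr .
EOF
# convert to RDF/XML and upload, as in the loading example above
rapper -i turtle -o rdfxml amsterdam.ttl > amsterdam.rdf
curl -T amsterdam.rdf 'http://192.168.1.104:8080/data/http://example.org/amsterdam'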

SemanticXO is a new project from the contributor program aimed at adding a triple store and a front-end API to the XOs’ operating system. This triple store will extend the functionality of Sugar by letting all activities store loosely structured, multilingual data and easily connect information across activities. In addition, the SPARQL protocol will allow easy access to the data stored on any device.

A first goal is to set up RedStore on the XOs allocated to this project. RedStore is a lightweight triple store that should be able to run on modest hardware and still provide decent performance. Stay tuned for the results! ;-)

Source: Semantic Web world for you

This is the first post on this blog, which aims to provide and point to information about the Semantic Web. The Semantic Web (or Web 3.0) is a new technology and research topic aimed at putting more semantics into the Web as we know it. The changes are happening in a way that is not very visible but very concrete for you, the user of the Web. On this blog you will learn more about it and how you can benefit from it, whoever you are.

Source: Think Links

This has been a great week if you think it’s important to know the origins of content on the web. First, Google announced support for explicit metadata describing the origins of news article content, which will be used by Google News. Using two tags, publishers can now identify whether they are the original source of a piece of news or are syndicating it from some other provider. Second, the New York Times now supports paragraph-level permalinks (so this is the link to the third paragraph of an article on Starbucks recycling), meaning one can link to the exact paragraph when quoting a piece. This was already supported by some other sites, and there’s a WordPress plug-in for it, but having the Times support it is big news. Essentially, with a couple of tweaks these techniques could make the quote pattern that you see in blogs (shown below) machine readable.

In the W3C Provenance Incubator Group, which is just wrapping up, one of the main scenarios was how to support a news aggregator that makes use of provenance to help determine the quality of the articles it automatically creates. With these developments, we are moving one step closer to making this scenario possible.

To me, this is more evidence that with simple markup and simple link structures, we can achieve the goal of having machines know where content on the web originates. However, as with a lot of the web, we need to agree on those simple structures so that everyone knows how to properly give credit.

Filed under: provenance markup Tagged: google news syndication tags, new york times, permalinks, provenance

Source: Think Links

Current ways of measuring scientific impact are rather coarse-grained; they often don’t capture the many different ways that science and scientists can have impact. As science is increasingly done online and in the open, new metrics are being created to help measure this impact. Jason Priem, Dario Taraborelli, myself, and Cameron Neylon have recently put out a manifesto outlining a research direction for these new metrics, termed alt-metrics.

You can read the manifesto here: http://www.altmetrics.org/manifesto/

 

Filed under: academia Tagged: alt-metrics, science impact

Source: Think Links

I wrote a post a while back around the idea of Data DJs: how do we make it as easy to mix data as it is to mix music? This notion requires advances on several fronts, from data and knowledge integration to user interfaces, along with data provenance and semantics. Most of the research I do relates to this Data DJ idea in some form or another.

However, I always thought it would be fun to push the analogy as far as I could. Last Christmas, I got a DJ deck (specifically a Numark Stealth Control; fantastic name, right?) with the idea of actually using it to mix data sets. For a host of reasons, including time but also the lack of a clear vision of what an integration interface should look like, I never got past toying around with it. However, over the past couple of weekends I found time to revisit it and develop a super-alpha version of a data integration system using the deck. Here’s a video of what I’ve done; read on for more details.

What really got me going was the notion that events (or who, what, when, where and why) are a perfect substrate for data integration. This is not my idea; it’s something I’ve been hearing from a number of sources, including people in the VU’s Web and Media Group down the hall and Raphaël Troncy, and it is probably best summed up by Mor Naaman. With this as inspiration, I developed the preliminary interface for integrating and summarizing events (well, actually tweets, but hopefully this will expand to other event sources) that you saw in the video above. The components of the interface (shown in the picture below) are as follows:

  • On the top is a list of the search terms that were used to retrieve the tweets. The tweets for each search term can be hidden and unhidden.
  • On the right is a list of the users (i.e. sources) who made the tweets. Each source can be filtered in and out, affecting the term summary graph.
  • In the middle are all the tweets on the same timeline.
  • On the right is a bar graph that summarizes the most common terms across the tweets.
  • Below the bar graph, is the time span of the tweets and the current time of the selected tweet.
  • On the far right are hashtags that are selected by the user.

As you saw in the video, it’s pretty fast to scroll through both sources and tweets. With a quick flick it’s easy to apply a filter, and it feels natural to select and deselect search terms. Furthermore, we can easily delete tweets and data sources with the push of a button. There’s still much, much more to be done to make this a viable user interface for the kind of data mixing task we want to support. But standing in front of the projector today, scrolling through tweets, eliminating sources and seeing an overview fly up, really convinced me that this type of interaction is well suited to the data integration task. That being said, any advice or comments on the interface would be greatly appreciated; in particular, suggestions for good infographics pertaining to events would be welcome.

Technical Details:

The interface was implemented entirely using HTML5. In particular, I used the nice Protovis framework along with jQuery and jQuery Tools. To get fast updates from the deck, we use WebSockets: a small Java program reads MIDI off the deck, acts as a WebSocket server, and pipes the MIDI signals (after translation to JSON) to the connected sockets. I’ve been using Google Chrome for development, so I don’t know how it works in other browsers. To get data, we use Twitter’s search interface and JSONP. In general, I was very impressed with what you can do in the browser; I felt like I wasn’t even pushing the capabilities, especially since I don’t do web programming every day.

What’s next?

Lots! This was really just a proof of concept. There are a bunch of directions to go in: improved graphics, better use of the decks, social interaction around integration (two DJs at once!), more data sources beyond Twitter, experiments on task performance, live mixing of an event… If you have any ideas, suggestions, or comments, I’d love to hear them.

How do you want to data DJ?

Filed under: data dj Tagged: data dj, decks, infographics, mixing data

Source: Think Links

As a computer scientist, I’ve always found it inspirational talking to people from other disciplines. There are always interesting problems where computational techniques could be applied and also questions about what we would have to improve in order to use technology in these disciplines. I also know from talking to a range of people (biologists, communication scientists, etc) that they often feel excited about the opportunity to work with cutting edge computer science.

But even with excitement on both sides, it is hard to engage in interdisciplinary work. We are often pulled to our own communities for a variety of reasons (incentives, social structure, vocabulary…) and even when we do engage, it is often only for the length of one project. Afterwards, the collaboration dwindles.

The VU (Vrije Universiteit Amsterdam), through the Network Institute, has been putting effort into increasing and extending interdisciplinary engagement. In June, Iina Hellsten and I organized a half-day symposium to discuss collaborations between social science and computer science. It was successful in two respects:

  1. It generated excitement.
  2. It identified a set of challenges and opportunities for collaboration.
We followed up this symposium two months later (Aug. 28, 2010) with a second meeting, this time focused on turning that excitement into concrete initiatives. We had 13 participants, again with attendees from both computer science and social science.

The meeting started by breaking into three groups, where we spent about 40 minutes generating concrete collaboration ideas in the context of the 4 challenges and 4 opportunities identified at the last meeting. We ensured that each group had members from both computer science and social science. After that session, each group presented their top 3 ideas. Groups were good at using the “technology”:

After this session, the group selected three areas of interest and then discussed how these could be concretely acted upon.

Here are the results:

1. Advertising collaborations

One issue that came up was the difficulty in knowing what the other discipline was doing and whether collaboration would be helpful.

  • Announcement of talks on a central site. Simply put, if the agent simulation group in CS is having a talk, perhaps the organization architectures group in social science would want to know about it. We thought we could use the Network Institute LinkedIn group for this.
  • Consulting. I thought this was a fun idea… Here, one could advertise their willingness to spend half a day, one day, or two days with a person from the other discipline, advising and helping them out with no expectations on either side. For example, if a social scientist wanted help running a large-scale analysis, a computer scientist could help for a day without expecting to have to continue helping. Likewise, if a computer scientist wanted a social scientist to check whether their paper on analyzing Twitter was theoretically sound, the social scientist could spend half a day with them. It was proposed that the Network Institute could offer incentives for this.

2. Interdisciplinary master’s and PhD student projects
Collaborating through students can provide a way to build longer-lasting collaborations.

  • One initiative would be to advertise co-supervised master’s projects, hopefully as soon as this November.
  • Since PhD students usually require funding, it was felt that there needs to be more collaboration between faculties on obtaining research funding. One challenge here is knowing which calls could be targeted. To attack this problem, we thought the subsidy desk at the VU could start a special email list for interdisciplinary calls.

3. Processing large-scale data
Large-scale data (from the web or otherwise) was of interest to a big chunk of the people in the room. There was a feeling that it would be nice to know what sorts of data sets people have or what data sets they are looking for.

  • As a first step, we imagine a structured event sometime in 2011 where participants would present the data sets they have or what data sets they are looking for, and what analysis they aim to do. The aim of the event would be to try and build one-to-one connections across disciplines.

I think the group as a whole felt that these ideas could be straightforwardly put into practice and would lead to deeper and lasting collaborations between social and computer science. It would be great to hear your ideas along with comments and questions below.

Filed under: academia Tagged: collaboration, computer science, network institute, social science, vu

Source: Think Links

One of the things that I think is great about the VU (Vrije Universiteit Amsterdam), where I work, is the promotion of interdisciplinary work through organizations like the Network Institute. Computer Science is often known for interacting with biology, physics, and economics, but we are now seeing the application of computing to Social Science problems. This is great for CS because new domains often introduce new fundamental CS problems.

To talk about the overlap and potential opportunities for greater Social Science and Computer Science collaboration at the VU, Iina Hellsten (from Organization Science) and I organized a half-day symposium on Tuesday, June 29, 2010. We had a great environment for the discussion in the Intertain Lab (a space for investigating new interactive environments).

We had 17 participants, about half from the Social Sciences (covering organization science, communication science, and psychology) and half from Computer Science.

We started off with scene-setting talks from me (on the CS side) and Peter Groenewegen, and then moved to a series of shorter talks giving us a glimpse of the different focuses of some of the attendees. Even during these talks, there was clearly excitement about the possibilities for collaboration, and there were several interesting conversations about the work itself.

The last part of the symposium was a session where we identified challenges and opportunities. We ran this as a post-it note session in which each participant wrote two challenges and two opportunities on post-it notes. (I got this idea from Katy Börner at her NSF Workshop on Mapping of Science and the Semantic Web. Thanks, Katy!) Amazingly, these post-it notes always cluster together. Below is an image of the results of the session:

The group identified 8 different groupings of the 60 challenges and opportunities listed by the participants. They were:

  1. How do we bridge the vocabulary gap between social science and computer science?
  2. We have the opportunity to build new applications using insights from social science.
  3. Writing new proposals and fundraising.
  4. Knowing who in the other discipline is working on a particular subject and maintaining connections between the disciplines.
  5. Being able to answer new research questions.
  6. Having an opportunity to apply research results in the “real world”.
  7. Automating parts of social science analysis (think network extraction from data sets).
  8. Overcoming the differing research styles of the two disciplines especially in terms of publication cycles.

Below we list the actual text of the post-it notes grouped into the 8 areas.

The outcome of the symposium is that now that we’ve identified clusters of challenges and opportunities, we need to focus on concrete collaborations to address these areas. We will hold another session in September to discuss concrete actions.

Overall, this event showed me that at the VU, we have not only the right structures but also the right people to engage in this sort of interdisciplinary research.

Results of Post-it Note Session:
post-it content | challenge or opportunity (c/o) | category
More user-centered/friendly systems. Not only usability, but also privacy, strong communication ties | o | no category
convince peers (e.g. reviewers) | c | no category
learn to give data (LOD) the right interpretation | o | no category
use the methodological rigor (of social science?) to scope your results | o | no category
exploring/studying area for “design” of techno-social systems | o | vocab
seduce social scientists to think technical and computer scientists to think social | c | vocab
mix technical (CS) and social theories and models to advance understanding | c | vocab
deal with some fuzziness of social science models | c | vocab
time-consuming coordination or, alternatively, miscommunication | c | vocab
different mindsets and conceptualizations | c | vocab
it is difficult to develop shared understanding of theory | c | vocab
it is difficult to find common levels of abstraction | c | vocab
integrate low-level network analysis with higher-level models from social sciences | c | vocab
different sorts of thinking in CS and social science | c | vocab
combining conceptual work to “bridge” the gap | c | vocab
very different outlook on research | c | vocab
speaking/interacting using the “same” vocabulary | c | vocab
finding common language between computer & social sciences | c | vocab
talk similar language | c | vocab
new applications of technology | o | new apps
teaching each other concepts/methods | o | new apps
developing new technology bundles together (e.g. PDA-based surveys) | o | new apps
processing huge bulks of data | o | new apps
fundraising opportunities | o | funds
socio-technical support for agile social networks in organizations | o | funds
cross-pollination & cross-fertilization for developing meaningful insights | o | funds
keeping the connections across existing projects | c | who’s who
knowing who is doing what | c | who’s who
give overview of who is doing what in this field at the VU (via webpage?) | o | who’s who
identify the true web science problems in the convergence of CS & social science | o | answering new questions
find relevant problems that are now solvable because of ICT solutions | o | answering new questions
generating new ideas | o | answering new questions
seeing research problems from new perspectives | o | answering new questions
provide overview of available methods, etc. | o | answering new questions
if we work together we can integrate our knowledge and get a better idea about the big picture | o | answering new questions
make technical & interpretive knowledge come together | o | answering new questions
designing studies that have a greater chance of producing real insights | o | real results
understand social web phenomena like Wikipedia, Facebook (motivation/quality) | o | real results
share (experience with) tools for network visualization & analysis | o | real results
linking concepts that wouldn’t have been associated earlier (underlying frames) | o | real results
applying the results of the detailed tracking of people | o | real results
ending up with a lot of manual work to compensate for technical errors | c | automated analysis
combining social networks and content networks | o | automated analysis
automating social and content analysis | o | automated analysis
losing valuable information that might be essential to understanding phenomena | c | automated analysis
automated analysis & interpretation of social phenomena | c | automated analysis
thinking that one side (your side) always does things “the right way” | c | research styles
interests are divergent | c | research styles
research timeframes are divergent | c | research styles
CS needs short-term “help” -> publication cycle | c | research styles
different scientific approaches and styles (e.g. publication) | c | research styles

Filed under: academia Tagged: computational social science, post-it notes, symposium, vu university amsterdam