News and Updates on the KRR Group

Author Archives: paulgroth

Source: Think Links

In preparation for Science Online 2011, I was asked by Mark Hahnel from over at Science 3.0 if I could do some analysis of the blogs that they’ve been aggregating since October (25 thousand posts from 1506 authors). Mark, along with Dave Munger, will be talking more about the role and importance of aggregators in a session on Saturday morning at 9am (Developing an aggregator for all science blogs). These analyses provide a high-level overview of the content of science blogs. Here are the results.

The first analysis tried to find the topics of blogs and their relationships. We used title words as a proxy for topics and co-occurrence of those words as representative of the relationships between those topics. Here’s the map (click the image to see a larger size):

The words cluster together according to their co-occurrence. The hotter the color, the more often those words occur. You’ll notice, for example, that Science and Blog are close to one another. Darwin and days, as well as fumbling and tenure, are close as well. The visualization was done with the VOSviewer software.
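To make the idea concrete, here is a minimal Python sketch of counting title-word co-occurrences. It is not the code behind the map above: the tokenization, the stopword handling, and the function name `title_cooccurrence` are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def title_cooccurrence(titles, stopwords=frozenset()):
    """Count word occurrences and word-pair co-occurrences across titles."""
    word_counts = Counter()
    pair_counts = Counter()
    for title in titles:
        # Lowercase, keep alphabetic tokens, and count each word once per title.
        words = sorted(set(re.findall(r"[a-z]+", title.lower())) - stopwords)
        word_counts.update(words)
        # Every unordered pair of words in the same title co-occurs once.
        pair_counts.update(combinations(words, 2))
    return word_counts, pair_counts


# Hypothetical titles, only to show the shape of the input and output.
titles = [
    "Darwin days",
    "Science blog roundup",
    "My science blog",
    "Fumbling towards tenure",
]
words, pairs = title_cooccurrence(titles, stopwords=frozenset({"my", "towards"}))
print(pairs.most_common(3))
```

The pair counts are the kind of input a mapping tool needs: words that often share a title end up with a high count and can be drawn close together.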

I also looked at how blogs are citing research papers. We looked for the occurrence of DOIs as well as research blogging style citations within all the blog posts. We found that there were 964 posts with these sorts of citations. In this case, I thought there would be more but maybe this is down to how I implemented it.
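As a rough illustration of the DOI part of this analysis, the sketch below scans post text with a regular expression. The pattern is an assumption (a common heuristic for DOIs, not the exact matcher I used), and it does not try to detect research blogging style citations at all.

```python
import re

# Assumed heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix
# that runs until whitespace, a quote, or an angle bracket.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)


def find_dois(post_text):
    """Return the set of DOI-like strings found in one post."""
    dois = set()
    for match in DOI_PATTERN.findall(post_text):
        # Strip punctuation that usually belongs to the sentence, not the DOI.
        dois.add(match.rstrip(".,;)").lower())
    return dois


def count_citing_posts(posts):
    """Count posts that contain at least one DOI-like string."""
    return sum(1 for post in posts if find_dois(post))
```

A heuristic like this misses citations that give no DOI, which is one way an implementation can undercount.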

Finally, I looked at what URLs were most commonly used in all the blog posts. Here are the top 20:

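Occurrences of the top 20 URLs, in rank order (the URLs themselves are missing from this copy): 4476, 3920, 1002, 930, 789, 648, 533, 485, 482, 376, 350, 336, 295, 271, 269, 266, 265, 232, 232, 195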

I was quite happy with this list because they are pretty much all science links. I thought there would be a lot more links to non-science places.
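For readers who want to try a similar count, here is a minimal sketch, assuming the posts are available as plain strings of text or HTML. The URL pattern and the function name `top_urls` are illustrative assumptions, not what produced the numbers above.

```python
import re
from collections import Counter

# Assumed heuristic: a URL runs from http(s):// up to whitespace, a quote,
# an angle bracket, or a closing parenthesis.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")


def top_urls(posts, n=20):
    """Return the n most frequently used URLs across all posts."""
    counts = Counter()
    for post in posts:
        for url in URL_PATTERN.findall(post):
            # Drop trailing sentence punctuation picked up by the pattern.
            counts[url.rstrip(".,;")] += 1
    return counts.most_common(n)
```

Counting by full URL is only one choice; grouping by host name instead would give a list of the most linked sites.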

I hope the results can provide a useful discussion piece. Obviously, this is just the start and we can do a lot more interesting analyses. In particular, I think such statistics can be the basis for alt-metrics style measures. If you’re interested in talking to me about these analyses, come find me at Science Online.

Filed under: academia Tagged: #altmetrics, #scio11, analysis, science blogging

Source: Think Links

The university where I work asks us to register all our publications for the year in a central database [1].  Doing this obviously made me think of doing an ego search on my academic papers. Plus, it’s the beginning of the year, which always seems like a good time to look at these things.

The handy tool Publish-or-Perish calculates all sorts of citation metrics based on a search of Google Scholar. The tool lets you pick the set of publications to consider. (For example, I left out all the publications from another Paul Groth who’s a professor of architecture at Berkeley.) I did a cursory run through to remove publications that weren’t mine, but I didn’t spend much time on it, so all the standard disclaimers apply: there may be duplicates, it includes technical reports, etc. For transparency, you can find the set of publications considered in the Excel file here. Also, it’s worth noting that the Google Scholar corpus has its own problems; in particular, it makes you look better. With all that in mind, let’s get to the fun stuff.

My stats as of Jan. 4, 2011 are:

  • Papers: 93
  • Citations: 1318
  • Years: 12
  • Cites/year: 109.83
  • Cites/paper: 14.17/4.0/0
  • Cites/author: 416.35
  • Papers/author: 43.27
  • Authors/paper: 3.04/3.0/2
  • h-index: 21
  • g-index: 34
  • hc-index: 16
  • hI-index: 5.58
  • hI-norm: 11
  • AWCR: 224.17
  • AW-index: 14.97
  • AWCRpA: 70.96
  • e-index: 24.98
  • hm-index: 9.07

You can find the definitions for these metrics here.
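As a quick illustration of two of these metrics, here is a small Python sketch using the standard definitions: the h-index is the largest h such that h papers have at least h citations each, and the g-index is the largest g such that the top g papers together have at least g squared citations. This is not Publish-or-Perish’s code, the g-index here is capped at the number of papers, and the example citation counts are made up.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers have at least g*g citations in total."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, cites in enumerate(ranked, start=1):
        total += cites
        if total >= rank * rank:
            g = rank
    return g


# Made-up citation counts for five papers.
example = [10, 8, 5, 4, 3]
print(h_index(example), g_index(example))

# The simple ratios in the list above follow directly from the totals.
print(round(1318 / 12, 2), round(1318 / 93, 2))
```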

What does it all mean? I don’t know :-) I think it’s not half bad.

For comparison, here’s a list of the h-indexes for top computer scientists computed using Google Scholar. All have an h-index of 40 or greater. A quick scan through that list shows that there’s a pretty strong correlation between being a top computer scientist and having a high h-index. Thus, I conclude that I should continue concentrating on being a good computer scientist and the statistics will follow.

[1] I don’t know why my university doesn’t support importing publication information from bibtex, or RIS. Everything has to be added by hand, which takes a bit.

    Filed under: academia, meta Tagged: citation metrics, computer science, h-index

    Source: Think Links

    One of the nice things about using cloud services is that sometimes you get a feature that you didn’t expect. Below is a nice set of stats about how well Think Links did in 2010. I was actually quite happy with 12 posts, one post a month. I will be trying to increase the rate of posts this year. If you’ve been reading this blog, thanks, and have a great 2011! The stats are below:

    Here’s a high-level summary of this blog’s overall health:

    Healthy blog!

    The Blog-Health-o-Meter™ reads Fresher than ever.

    Crunchy numbers


    A Boeing 747-400 passenger jet can hold 416 passengers. This blog was viewed about 4,500 times in 2010. That’s about 11 full 747s.


    In 2010, there were 12 new posts, growing the total archive of this blog to 46 posts. There were 12 pictures uploaded, taking up a total of 5 MB. That’s about a picture per month.

    The busiest day of the year was October 13th with 176 views. The most popular post that day was Data DJ realized….well at least version 0.1.

    Where did they come from?

    The top referring sites in 2010 were [site names missing from this copy].

    Some visitors came searching, mostly for provenance open gov, think links, ready made food, 4store, and thinklinks.

    Attractions in 2010

    These are the posts and pages that got the most views in 2010.


    Data DJ realized….well at least version 0.1 October 2010


    4store Amazon Machine Image and Billion Triple Challenge Data Set October 2009


    Linking Slideshare Data June 2010


    A First EU Proposal April 2010


    Two Themes from WWW 2010 May 2010

    Filed under: meta

    Source: Think Links

    This has been a great week if you think that it’s important to know the origins of content on the web. First, Google announced support for explicit metadata describing the origins of news article content, which will be used by Google News. Publishers can now use two tags to identify whether they are the original source of a piece of news or are syndicating it from some other provider. Second, the New York Times now supports paragraph-level permalinks. (So this is the link to the third paragraph of an article on Starbucks recycling.) One can therefore link to the exact paragraph when quoting a piece. This was already supported by some other sites, and there’s a WordPress plug-in for it, but having the Times support it is big news. Essentially, with a couple of tweaks these techniques could make the quote pattern that you see in blogs (shown below) machine readable.
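To show how little machinery a consumer of such tags needs, here is a minimal Python sketch that reads them from a page’s HTML. It assumes the two tags are the `syndication-source` and `original-source` meta tags, as I recall them from Google’s announcement; the class and function names and the example URLs are made up for illustration.

```python
from html.parser import HTMLParser


class SourceTagParser(HTMLParser):
    """Collect the content of provenance-style <meta> tags."""

    # Assumed tag names; check Google's documentation before relying on them.
    TAG_NAMES = {"syndication-source", "original-source"}

    def __init__(self):
        super().__init__()
        self.sources = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in self.TAG_NAMES and attributes.get("content"):
            self.sources.setdefault(name, []).append(attributes["content"])


def extract_sources(html):
    """Return a dict mapping each tag name to the URLs it declares."""
    parser = SourceTagParser()
    parser.feed(html)
    return parser.sources


page = (
    '<html><head>'
    '<meta name="original-source" content="http://example.com/story">'
    '<meta name="syndication-source" content="http://example.org/wire/story">'
    '</head></html>'
)
print(extract_sources(page))
```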

    In the W3C Provenance Incubator Group that is just wrapping up, one of the main scenarios was how to support a News Aggregator that can make use of provenance to help determine the quality of the articles it automatically creates. With these developments, we are moving one step closer to making this scenario possible.

    To me, this is more evidence that with simple markup, and simple link structures, we can achieve the goal of having machines know where content on the web originates. However, like with a lot of the web, we need to agree on those simple structures so that everyone knows how to properly give credit on the web.

    Filed under: provenance markup Tagged: google news syndication tags, new york times, permalinks, provenance

    Source: Think Links

    Current ways of measuring scientific impact are rather coarse-grained; they often don’t capture the many different ways that science and scientists might have impact. As science is increasingly done online and in the open, new metrics are being created to help measure this impact. Jason Priem, Dario Taraborelli, myself, and Cameron Neylon have recently put out a manifesto outlining a research direction for these new metrics, termed alt-metrics.

    You can read the manifesto here:


    Filed under: academia Tagged: alt-metrics, science impact

    Source: Think Links

    I wrote a post a while back around the idea of Data DJs: how do we make it as easy to mix data as it is to mix music? This notion requires advances on several fronts, from data and knowledge integration to user interfaces, along with data provenance and semantics. Most of the research I do relates to this Data DJ idea in some form or another.

    However, I always thought it would be fun to push the analogy as far as I could. Last Christmas, I got a DJ deck (specifically a Numark Stealth Control; fantastic name, right?) with the idea of actually using it to mix data sets. For a host of reasons, including time but also a lack of a clear vision of what an integration interface should look like, I never got past just toying around with it. However, over the past couple of weekends I found time to revisit it and develop a super alpha version of a data integration system using the deck. Here’s a video showing what I’ve done; read on for more details.

    What really got me going was the notion that events (or who, what, when, where and why) are a perfect substrate for data integration. This is not my idea but something I’ve been hearing from a number of sources, including a number of people in the VU’s Web and Media Group down the hall and Raphaël Troncy, and it is probably best summed up by Mor Naaman. With this as inspiration, I developed a preliminary interface around integrating and summarizing events (well, actually tweets, but hopefully this will expand to other event sources) that you saw in the video above. The components of the interface (shown in the picture below) are as follows:

    • On the top is a list of the search terms that were used to retrieve the tweets. The tweets for each search term can be hidden and unhidden.
    • On the right is a list of the users (i.e. sources) who made the tweets. Each source can be filtered in and out, which changes the term summary graph.
    • In the middle are all the tweets on the same timeline.
    • On the right, is a bar graph that summarizes the most common terms across the tweets.
    • Below the bar graph, is the time span of the tweets and the current time of the selected tweet.
    • On the far right are hashtags that are selected by the user.
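
The term summary behind the bar graph can be sketched in a few lines of Python. The real interface does this in the browser, so this is only an illustration of the logic: the tweet structure (dicts with `user` and `text` keys), the tokenization, and the function name `top_terms` are all assumptions.

```python
import re
from collections import Counter


def top_terms(tweets, hidden_sources=frozenset(), n=10):
    """Return the n most common terms across tweets from visible sources."""
    counts = Counter()
    for tweet in tweets:
        # Sources that have been filtered out do not contribute to the summary.
        if tweet["user"] in hidden_sources:
            continue
        # Keep hashtags and mentions as terms alongside plain words.
        counts.update(re.findall(r"[#@]?\w+", tweet["text"].lower()))
    return counts.most_common(n)


# Made-up tweets, only to show the effect of filtering out a source.
tweets = [
    {"user": "a", "text": "Provenance on the web"},
    {"user": "b", "text": "provenance matters"},
    {"user": "c", "text": "spam spam spam"},
]
print(top_terms(tweets, n=1))
print(top_terms(tweets, hidden_sources={"c"}, n=1))
```

Filtering out a source changes the counts immediately, which is the effect you see when the bar graph updates as sources are toggled.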

    As you saw in the video, it’s pretty fast to scroll through both sources and tweets. With a quick flick it’s easy to apply a filter, and it’s pretty natural to select and deselect search terms. Furthermore, we can easily delete tweets and data sources with the push of a button. There’s still much, much more to be done to make this a viable user interface for the kind of data mixing task we want to support. But standing in front of the projector today scrolling through tweets, eliminating sources and seeing an overview fly up really convinced me that this type of interaction is well suited to the data integration task. That being said, any advice or comments on the interface would be greatly appreciated. In particular, suggestions for good infographics pertaining to events would be appreciated.

    Technical Details:

    The interface was implemented entirely in HTML5. In particular, I used the nice Protovis framework along with jQuery and jQuery Tools. To get the fast updates from the deck, we use WebSockets. I have a small Java program that reads MIDI off the deck, acts as a socket server for WebSockets, and pipes the MIDI signals (after translation to JSON) to the connected sockets. I’ve been using Google Chrome for development so I don’t know how it works in other browsers. To get data, we use Twitter’s search interface and JSONP. In general, I was very impressed with what you can do in the browser. I felt like I wasn’t even pushing the capabilities, especially since I don’t do web programming every day.
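
To give a feel for the MIDI-to-JSON step, here is a small sketch. The actual bridge is a Java program; this Python version only illustrates the translation, and the JSON field names are hypothetical. It relies on the standard MIDI layout, where the high four bits of the status byte give the message type and the low four bits give the channel.

```python
import json

# Standard MIDI message types, keyed by the high nibble of the status byte.
MESSAGE_TYPES = {0x80: "note_off", 0x90: "note_on", 0xB0: "control_change"}


def midi_to_json(message):
    """Translate a three-byte MIDI message into a JSON string."""
    status, data1, data2 = message
    event = {
        "type": MESSAGE_TYPES.get(status & 0xF0, "other"),
        "channel": status & 0x0F,
        # For control changes these are controller number and value;
        # for notes they are note number and velocity.
        "data1": data1,
        "data2": data2,
    }
    return json.dumps(event)


# A control change on channel 0: controller 16 set to 127.
print(midi_to_json(bytes([0xB0, 0x10, 0x7F])))
```

A server would send a string like this to every connected WebSocket client, and the page would parse it and update the display.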

    What’s next?

    Lots! This was really just a proof of concept. There’s a bunch of directions to go in: improved graphics, better use of the decks, social interaction around integration (two DJs at once!), more data sources beyond Twitter, experiments on task performance, live mixing of an event…. If you have any ideas, suggestions, or comments, I’d love to hear them.

    How do you want to data DJ?

    Filed under: data dj Tagged: data dj, decks, infographics, mixing data

    Source: Think Links

    As a computer scientist, I’ve always found it inspirational talking to people from other disciplines. There are always interesting problems where computational techniques could be applied, and also questions about what we would have to improve in order to use technology in these disciplines. I also know from talking to a range of people (biologists, communication scientists, etc.) that they often feel excited about the opportunity to work with cutting-edge computer science.

    But even with excitement on both sides, it is hard to engage in interdisciplinary work. We are often pulled to our own communities for a variety of reasons (incentives, social structure, vocabulary…) and even when we do engage, it is often only for the length of one project. Afterwards, the collaboration dwindles.

    The VU (Vrije Universiteit Amsterdam) through the Network Institute has been putting effort in trying to increase and extend interdisciplinary engagement. In June, Iina Hellsten and I organized a half-day symposium for discussion about collaborations between social science and computer science. It was successful in two respects:

    1. It generated excitement.
    2. It identified a set of challenges and opportunities for collaboration.

    We followed up this symposium two months later (Aug. 28, 2010) with a second meeting, this time focused on turning this excitement into concrete initiatives. We had 13 participants this time, again with attendees from both computer science and social science.

    The meeting started by breaking into three groups where we spent about 40 minutes generating concrete collaboration ideas in the context of the 4 challenges and 4 opportunities identified at the last meeting. We ensured that each group had members from computer science and social science. After that session each group presented their top 3 ideas. Groups were good at using the “technology”:

    After this session, the group selected three areas of interest and then discussed how these could be concretely acted upon.

    Here are the results:

    1. Advertising collaborations

    One issue that came up was the difficulty in knowing what the other discipline was doing and whether collaboration would be helpful.

    • Announcement of talks on a central site. Simply, if the agent simulation group in CS is having a talk, perhaps the organization architectures social science group would want to know about it. We thought we could use the Network Institute LinkedIn group for this.
    • Consulting. I thought this was a fun idea… Here, one could advertise a willingness to spend half a day, one day, or two days with a person from the other discipline, advising and helping them out with no expectations on either side. For example, if a social scientist wanted help running a large-scale analysis, a computer scientist could help for a day without expecting to have to continue to help. Likewise, if a computer scientist wanted a social scientist to check whether their paper on analyzing Twitter was theoretically sound, the social scientist could spend a half day with them. It was proposed that the Network Institute could offer incentives for this.

    2. Interdisciplinary master and PhD student projects.
    Collaborating through students can provide a way to build longer lasting collaborations.

    • One initiative would be to advertise co-supervised master’s projects, hopefully as soon as this November.
    • Since PhD students usually require funding, it was felt there needs to be more collaboration on obtaining research funding between faculties. One challenge here is knowing what calls could be targeted. To attack this problem, we thought the subsidy desk at the VU could start a special email list for interdisciplinary calls.

    3. Processing large-scale data
    Large scale data (from the web or otherwise) was of interest to a big chunk of the people in the room. There was a feeling that it would be nice to know what sorts of data sets people have or what data sets they were looking for.

    • As a first step, we imagine a structured event sometime in 2011 where participants would present the data sets they have or what data sets they are looking for, and what analysis they aim to do. The aim of the event would be to try and build one-to-one connections across disciplines.

    I think the group as a whole felt that these ideas could be straightforwardly put into practice and would lead to deeper and lasting collaborations between social and computer science. It would be great to hear your ideas along with comments and questions below.

    Filed under: academia Tagged: collaboration, computer science, network institute, social science, vu

    Source: Think Links

    One of the things that I think is great about the VU (Vrije Universiteit Amsterdam) where I work is the promotion of interdisciplinary work through organizations like the Network Institute.  Computer Science is often known for interacting with biology, physics, and economics but we are now seeing the application of computing to Social Science problems. This is great for CS because domains often introduce new fundamental CS problems.

    To talk about the overlap and potential opportunities for greater Social Science and Computer Science collaboration at the VU, Iina Hellsten (from Organization Science) and I organized a half-day symposium on Tuesday, June 29, 2010. We had a great environment for the discussion in the Intertain Lab (a space for investigating new interactive environments).

    We had 17 participants, about half from the Social Sciences (covering organization science, communication science, and psychology) and half from Computer Science.

    We started off with talks setting the scene from myself (on the CS side) and Peter Groenewegen and then moved to a series of shorter talks giving us a glimpse of the different focuses of some of the attendees. Even during these talks, there was clearly excitement about the possibilities for collaboration and there were several interesting conversations about the work itself.

    The last part of the symposium was a session where we identified challenges and opportunities. We ran this as a post-it note session where each participant wrote two challenges and two opportunities on post-it notes. (I got this idea from Katy Börner at her NSF Workshop on Mapping of Science and the Semantic Web. Thanks Katy!) Amazingly, these post-it notes always cluster together. Below is an image of the results of the session:

    The group identified 8 different groupings of the 60 challenges and opportunities listed by the participants. They were:

    1. How do we bridge the vocabulary gap between social science and computer science?
    2. We have the opportunity to build new applications using insights from social science.
    3. Writing new proposals and fundraising.
    4. Knowing who in the other discipline is working on a particular subject and maintaining connections between the disciplines.
    5. Being able to answer new research questions.
    6. Having an opportunity to apply research results in the “real world”.
    7. Automating parts of social science analysis (think network extraction from data sets).
    8. Overcoming the differing research styles of the two disciplines especially in terms of publication cycles.

    Below we list the actual text of the post-it notes grouped into the 8 areas.

    The outcome of the symposium is that now that we’ve identified clusters of challenges and opportunities, we need to focus on concrete collaborations to address these areas. We will hold another session in September to discuss concrete actions.

    Overall, this event showed me that at the VU, we have both the right structures and the right people to engage in this sort of interdisciplinary research.

    Results of Post-it Note Session:
    post-it content | challenge or opportunity (c/o) | category
    More user centered/friendly systems. Not only usability, but also privacy strong communication ties | o | no category
    convince peers (e.g. reviewers) | c | no category
    learn to give data (LOD) the right interpretation | o | no category
    use the methodological rigor (of social science?) to scope your results | o | no category
    exploring/studying area for “design” of techno-social systems | o | vocab
    seduce social scientists to think technical and computer scientists to think social | c | vocab
    mix technical (cs) and social theories and modes to advance understanding | c | vocab
    deal with some fuzziness of social science models | c | vocab
    time consuming coordination or alternatively miscommunication | c | vocab
    different mindsets conceptualizations | c | vocab
    it is difficult to develop shared understanding of theory | c | vocab
    it is difficult to find common levels of abstraction | c | vocab
    integrate low level network analysis with higher level models from social sciences | c | vocab
    different sorts of thinking in cs and social science | c | vocab
    combining conceptual work to “bridge” the gap | c | vocab
    very different outlook on research | c | vocab
    speaking/interacting using the “same” vocabulary | c | vocab
    finding common language between computer & social sciences | c | vocab
    talk similar language | c | vocab
    new applications of technology | o | new apps
    teaching each other concepts/methods | o | new apps
    developing new technology bundles together (e.g. pda-based surveys) | o | new apps
    processing huge bulks of data | o | new apps
    fundraising opportunities | o | funds
    socio-technical support for agile social networks in organizations | o | funds
    cross-pollination & cross-fertilization for developing meaningful insights | o | funds
    keeping the connections across existing projects | c | who’s who
    knowing who is doing what | c | who’s who
    give overview of who is doing what in this field at the VU (via webpage?) | o | who’s who
    identify the true webscience problems in the convergence of cs & ss | o | answering new questions
    find relevant problems that are now solvable because of ICT solutions | o | answering new questions
    generating new ideas | o | answering new questions
    seeing research problems from new perspectives | o | answering new questions
    provide overview of available methods, etc. | o | answering new questions
    if we work together we can integrate our knowledge and get a better idea about the big picture | o | answering new questions
    make technical & interpretive knowledge come together | o | answering new questions
    designing studies that have a greater chance of producing real insights | o | real results
    understand the social web phenomena like wikipedia, facebook (motivation/quality) | o | real results
    share (experience) tools for network visualization & analysis | o | real results
    linking concepts that wouldn’t have been associated earlier (underlying frames) | o | real results
    applying the results of the detailed tracking of people | o | real results
    ending up with a lot of manual work to compensate for technical errors | c | automated analysis
    combining social networks and content networks | o | automated analysis
    automating social and content analysis | o | automated analysis
    losing valuable information that might be essential to understanding phenomena | c | automated analysis
    automated analysis & interpretations of social phenomena | c | automated analysis
    thinking that one side (your side) always does things “the right way”. | c | research styles
    interests are divergent | c | research styles
    research timeframes are divergent | c | research styles
    cs need short-term “help” -> publication cycle | c | research styles
    different scientific approaches and styles (e.g. publication) | c | research styles

    Filed under: academia Tagged: computational social science, post-it notes, symposium, vu university amsterdam