News and Updates on the KRR Group

Source: Think Links

In preparation for Science Online 2011, I was asked by Mark Hahnel from over at Science 3.0 if I could do some analysis of the blogs they’ve been aggregating since October (25 thousand posts from 1506 authors). Mark, along with Dave Munger, will be talking more about the role and importance of aggregators in a session Saturday morning at 9am (Developing an aggregator for all science blogs). These analyses provide a high-level overview of the content of science blogs. Here are the results.

The first analysis tried to find the topics of blogs and their relationships. We used title words as a proxy for topics and co-occurrence of those words as representative of the relationships between those topics. Here’s the map (click the image to see a larger size):
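
The counting behind such a map is straightforward. Here is a minimal Python sketch of pairwise co-occurrence counting, assuming the post titles are already available as a list of strings (the actual map was produced with a different pipeline):

from collections import Counter
from itertools import combinations

def cooccurrences(titles):
    """Count how often two distinct words appear in the same title."""
    pair_counts = Counter()
    for title in titles:
        # one count per distinct word pair within a single title
        words = sorted(set(title.lower().split()))
        pair_counts.update(combinations(words, 2))
    return pair_counts

titles = ["Darwin days and science blogging", "Fumbling towards tenure"]
for pair, count in cooccurrences(titles).most_common(10):
    print(pair, count)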

The words cluster together according to their co-occurrence. The hotter the color, the more frequently those words occur. You’ll notice, for example, that Science and Blog are close to one another. Darwin and days are close, as are fumbling and tenure. The visualization was done with the VOSviewer software.

I also looked at how blogs cite research papers. We looked for occurrences of DOIs as well as ResearchBlogging-style citations within all the blog posts and found 964 posts with these sorts of citations. I thought there would be more, but maybe this is down to how I implemented the detection.
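
For the curious, here is roughly what such a detection could look like in Python; the regular expressions are my own rough approximations, not necessarily the ones used for the numbers above:

import re

# DOIs start with "10.", a registrant code, a slash, and a suffix (rough pattern).
DOI_RE = re.compile(r'\b10\.\d{4,9}/\S+')
# ResearchBlogging-style posts link back to researchblogging.org (an assumption
# used here as a simple marker).
RB_RE = re.compile(r'researchblogging\.org', re.IGNORECASE)

def cites_research(post_html):
    """True if a post appears to contain a DOI or a ResearchBlogging citation."""
    return bool(DOI_RE.search(post_html) or RB_RE.search(post_html))

posts = ["See doi:10.1371/journal.pone.0012345 for details.", "No citations here."]
print(sum(cites_research(p) for p in posts))  # -> 1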

Finally, I looked at which URLs were most commonly used across all the blog posts. Here are the top 20 (a sketch of the counting approach follows the table):

URL Occurrences
http://friendfeed.com/scienceartists 4476
http://scienceblogs.com/bookclub/?utm_source=rssTextLink 3920
http://friendfeed.com/science-magazine-blogs 1002
http://friendfeed.com/science-news-feeds 930
http://www.addtoany.com/share_save 789
http://friendfeed.com/nyt-science-blogs 648
http://friendfeed.com/sciam-blogs 533
http://www.guardian.co.uk 485
http://www.guardian.co.uk/help/feeds 482
http://www.wired.com/gadgetlab/category/hacks-mods-and-diy/ 376
http://friendfeed.com/student-science-journalism-blogs 350
http://www.guardian.co.uk/profile/grrlscientist 336
http://friendfeed.com/science-blog-carnivals 295
http://blogevolved.blogspot.com 271
http://news.sciencemag.org/scienceinsider/rss/current.xml 269
http://www.researchblogging.org 266
http://www.sciam.com/ 265
http://content.usatoday.com/communities/sciencefair/index 232
http://news.sciencemag.org/sciencenow/rss/current.xml 232
mailto:grrlscientist@gmail.com 195
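
The counting itself needs little more than extracting hrefs and tallying them; a minimal sketch, assuming each post is available as an HTML string (a real pass would want a proper HTML parser):

import re
from collections import Counter

HREF_RE = re.compile(r'href="([^"]+)"')  # crude, for illustration only

def top_urls(posts, n=20):
    """Tally every linked URL across all posts and return the n most common."""
    counts = Counter()
    for html in posts:
        counts.update(HREF_RE.findall(html))
    return counts.most_common(n)

posts = ['<a href="http://www.researchblogging.org">RB</a>']
print(top_urls(posts))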

I was quite happy with this list because they are pretty much all science links. I thought there would be a lot more links to non-science places.

I hope the results can provide a useful discussion piece. Obviously, this is just the start and we can do a lot more interesting analyses. In particular, I think such statistics can be the basis for alt-metrics style measures. If you’re interested in talking to me about these analyses, come find me at Science Online.

Filed under: academia Tagged: #altmetrics, #scio11, analysis, science blogging

Source: Semantic Web world for you

Although it is commonly depicted as one giant graph, the Web of Data is not a single entity that can be queried. Instead, it’s a distributed architecture made of different datasets, each providing some triples (see the LOD Cloud picture and CKAN.net). Each of these data sources can be queried separately, most often through an endpoint that understands the SPARQL query language. Looking for answers that make use of information spanning different datasets is a more challenging task, as the mechanisms used internally to query one dataset (database-like joins, query planning, …) do not scale easily over several data sources.

When you want to combine information from, say, DBpedia and the Semantic Web dog food site, the easiest and quickest workaround is to download the content of the two datasets, possibly filtering out triples you don’t need, and load the retrieved content into a single data store. This approach has some limitations: you must have a store running somewhere (which may require a significantly powerful machine to host it), the downloaded data must be updated from time to time, and the data you need may not be available for download in the first place.
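
As an illustration of that workaround, here is a minimal Python sketch using rdflib; the dump file names are placeholders, and real extracts would of course be far larger:

from rdflib import Graph

# Load two previously downloaded dumps into one local graph.
g = Graph()
g.parse("dbpedia-extract.nt", format="nt")   # placeholder file name
g.parse("dogfood-extract.nt", format="nt")   # placeholder file name

# Once merged, a single SPARQL query can join across both sources.
query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?place WHERE { ?person foaf:based_near ?place . }
"""
for row in g.query(query):
    print(row.person, row.place)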

When used along with a SPARQL data layer, eRDF offers you a solution when one of these limitations prevents you from executing your SPARQL query over several datasets. The application runs on a low-end laptop and can query, and combine the results from, several SPARQL endpoints. eRDF is a novel RDF query engine that uses evolutionary computing to search for solutions. Instead of the traditional resolution mechanism, an iterative trial-and-error process is used to progressively find answers to the query (more information can be found in the published papers listed on erdf.nl and in this technical report). It’s a versatile optimisation tool that can run on different kinds of data layers, and the SPARQL data layer offers an abstraction over a set of SPARQL endpoints.

Let’s suppose you want to find some persons and the capital of the country they live in:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX db: <http://dbpedia.org/ontology/>

SELECT DISTINCT ?person ?first ?last ?home ?capital WHERE {
	?person  rdf:type         foaf:Person.
	?person  foaf:firstName   ?first.
	?person  foaf:family_name ?last.
	OPTIONAL {
	?person  foaf:homepage    ?home.
	}
	?person  foaf:based_near  ?country.
	?country rdf:type         db:Country.
	?country db:capital       ?capital.
	?capital rdf:type         db:Place.
}
ORDER BY ?first

Such a query can be answered by combining data from the dog food server and DBpedia. Other datasets may also contain lists of people, but let’s focus on researchers as a start. We have to tell eRDF which endpoints to query; this is done with a simple CSV listing:

DBpedia;http://dbpedia.org/sparql
Semantic Web Dog Food;http://data.semanticweb.org/sparql

Assuming the query is saved into a “people.sparql” file and the endpoint list goes into an “endpoints.csv”, the query engine is called like this:

java -cp nl.erdf.datalayer-sparql-0.1-SNAPSHOT.jar nl.erdf.main.SPARQLEngine -q people.sparql -s endpoints.csv -t 5

The query is first scanned for its basic graph patterns, all of which are grouped and sent to the eRDF optimiser as a set of constraints to solve. Then eRDF looks for solutions matching as many of these constraints as possible and pushes all the relevant triples found back into an RDF model. After some time (set with the parameter “t”), eRDF is stopped and Jena is used to issue the query over the model that was just populated. The answers are then displayed, along with a list of the data sources that contributed to finding them.
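
To give a feel for the evolutionary part, here is a toy caricature in Python; it is not eRDF’s actual code, just the general keep-the-fittest-and-retry idea applied to a made-up constraint set:

import random

# Stand-ins for graph-pattern constraints over a single variable.
constraints = [lambda x: x % 3 == 0,
               lambda x: x % 5 == 0,
               lambda x: x > 50]

def fitness(x):
    """Number of constraints a candidate satisfies."""
    return sum(c(x) for c in constraints)

DOMAIN = range(100)
population = random.sample(DOMAIN, 10)
for _ in range(50):
    # keep the fitter half, refill with fresh random candidates
    # (a real engine would also mutate and recombine survivors)
    population.sort(key=fitness, reverse=True)
    population = population[:5] + random.sample(DOMAIN, 5)

best = max(population, key=fitness)
print(best, fitness(best))  # e.g. 60 satisfies all three constraints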

If you don’t know which endpoints are likely to contribute to the answers, you can just query the entire Web of Data and see what happens… ;-)
The package comes with a tool that fetches a list of SPARQL endpoints from CKAN, tests them, and creates a configuration file. It is called like this:

java -cp nl.erdf.datalayer-sparql-0.1-SNAPSHOT.jar nl.erdf.main.GetEndPointsFromCKAN

After a few minutes, you will get a “ckan-endpoints.csv” allowing you to query the Web of Data from your laptop.

The source code, along with a package including all the dependencies, is available on GitHub. Please note that this is the first public release of the tool, still in snapshot state, so bugs are expected to show up. If you spot some, report them and help us improve the software. Comments and suggestions are also most welcome :)


The work on eRDF is supported by the LOD Around-The-Clock (LATC) Support Action funded under the European Commission FP7 ICT Work Programme, within the Intelligent Information Management objective (ICT-2009.4.3).

Source: Think Links

The university where I work asks us to register all our publications for the year in a central database [1].  Doing this obviously made me think of doing an ego search on my academic papers. Plus, it’s the beginning of the year, which always seems like a good time to look at these things.

The handy tool Publish or Perish calculates all sorts of citation metrics based on a search of Google Scholar. The tool lets you pick the set of publications to consider. (For example, I left out all the publications from another Paul Groth, who’s a professor of architecture at Berkeley.) I did a cursory run-through to remove publications that weren’t mine, but I didn’t spend much time on it, so all the standard disclaimers apply: there may be duplicates, it includes technical reports, etc. For transparency, you can find the set of publications considered in the Excel file here. Also, it’s worth noting that the Google Scholar corpus has its own problems; in particular, it makes you look better. With all that in mind, let’s get to the fun stuff.

My stats as of Jan. 4, 2011 are:

  • Papers: 93
  • Citations: 1318
  • Years: 12
  • Cites/year: 109.83
  • Cites/paper: 14.17/4.0/0
  • Cites/author: 416.35
  • Papers/author: 43.27
  • Authors/paper: 3.04/3.0/2
  • h-index: 21
  • g-index: 34
  • hc-index: 16
  • hI-index: 5.58
  • hI-norm: 11
  • AWCR: 224.17
  • AW-index: 14.97
  • AWCRpA: 70.96
  • e-index: 24.98
  • hm-index: 9.07
You can find the definitions for these metrics here.
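
Two of them are simple enough to compute by hand; here is a small Python sketch with made-up citation counts:

def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

def g_index(citations):
    """Largest g such that the top g papers together have at least g^2 citations."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(ranked, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

cites = [10, 8, 5, 4, 3]  # made-up counts
print(h_index(cites))  # -> 4
print(g_index(cites))  # -> 5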

What does it all mean? I don’t know :-) I think it’s not half bad.

For comparison, here’s a list of the h-indexes for top computer scientists computed using Google Scholar. All have an h-index of 40 or greater. A quick scan through that list shows that there’s a pretty strong correlation between being a top computer scientist and having a high h-index. Thus, I conclude that I should continue concentrating on being a good computer scientist and the statistics will follow.

[1] I don’t know why my university doesn’t support importing publication information from BibTeX or RIS. Everything has to be added by hand, which takes a bit.

Filed under: academia, meta Tagged: citation metrics, computer science, h-index

We are glad to announce that the LarKC Platform Release v2.0 is now available in our repository on SourceForge.
The redistributable package can be downloaded via the following URL:

http://sourceforge.net/projects/larkc/files/Release-2.0/larkc-release-2.0.zip/download (OS independent)

The source code belonging to this release can be checked out from SVN:

http://larkc.svn.sourceforge.net/viewvc/larkc/branches/Release_2.0_prototype/platform/

A complete manual for both users and developers can be found at:

http://sourceforge.net/projects/larkc/files/Release-2.0/LarKC_Platform_Manual_2.0.pdf

If [...]

Source: Think Links

One of the nice things about using cloud services is that sometimes you get a feature that you didn’t expect. Below is a nice set of stats from WordPress.com about how well Think Links did in 2010. I was actually quite happy with 12 posts – one post a month. I will be trying to increase the rate of posts this year. If you’ve been reading this blog, thanks! and have a great 2011. The stats are below:

Here’s a high-level summary of this blog’s overall health:

Healthy blog!

The Blog-Health-o-Meter™ reads Fresher than ever.

Crunchy numbers


A Boeing 747-400 passenger jet can hold 416 passengers. This blog was viewed about 4,500 times in 2010. That’s about 11 full 747s.

In 2010, there were 12 new posts, growing the total archive of this blog to 46 posts. There were 12 pictures uploaded, taking up a total of 5MB. That’s about a picture per month.

The busiest day of the year was October 13th with 176 views. The most popular post that day was Data DJ realized….well at least version 0.1.

Where did they come from?

The top referring sites in 2010 were twitter.com, few.vu.nl, litfass.km.opendfki.de, 4store.org, and facebook.com.

Some visitors came searching, mostly for provenance open gov, think links, ready made food, 4store, and thinklinks.

Attractions in 2010

These are the posts and pages that got the most views in 2010.

1. Data DJ realized….well at least version 0.1 (October 2010)
2. 4store Amazon Machine Image and Billion Triple Challenge Data Set (October 2009), 2 comments
3. Linking Slideshare Data (June 2010), 4 comments
4. A First EU Proposal (April 2010), 3 comments
5. Two Themes from WWW 2010 (May 2010)

Filed under: meta