News and Updates on the KRR Group

Source: Semantic Web world for you

I’m currently spending some time at Yahoo Labs in Barcelona to work with Peter Mika and his team on data analysis. Last week, I was invited to give a seminar on how we perform network-based analysis of Linked Data at the VU. The slides are embedded at the end of this post.

Essentially, we observe that focusing only on the triples (cf., for instance, a BTC snapshot) is not enough to explain some of the patterns observed in the Linked Data ecosystem. In order to understand what’s really going on, one has to take into account the data, its publishers/consumers and the machines that serve it. Time also plays an important role and shouldn’t be neglected. This brings us to studying this ecosystem as a Complex System, and that’s one of the things keeping Paul, Frank, Stefan, Shenghui and myself busy these days ;-)
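To make the network view on Linked Data a bit more concrete, here is a minimal sketch (not our actual analysis pipeline at the VU) that turns an N-Triples dump into a host-level network: nodes are the hosts serving the URIs and an edge means that data on one host refers to a resource on another. The file name and the choice of the host as aggregation level are assumptions for illustration only.

```python
# Sketch: derive a host-level network from an N-Triples dump.
# File name and aggregation level are illustrative assumptions.
from urllib.parse import urlparse

import networkx as nx
from rdflib import Graph, URIRef


def host(term):
    """Return the host part of a URI, or None for literals/blank nodes."""
    if isinstance(term, URIRef):
        return urlparse(str(term)).netloc or None
    return None


g = Graph()
g.parse("btc-sample.nt", format="nt")  # assumed local snapshot file

net = nx.DiGraph()
for s, p, o in g:
    hs, ho = host(s), host(o)
    if hs and ho and hs != ho:
        # one edge per pair of hosts whose data reference each other
        net.add_edge(hs, ho)

# Simple structural indicators of the publisher-level network
print(net.number_of_nodes(), "hosts,", net.number_of_edges(), "links")
print("top hubs:", sorted(net.degree, key=lambda kv: kv[1], reverse=True)[:5])
```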

Exploring Linked Data content through network analysis

Source: Think Links

This past Tuesday, I had the opportunity to give a webinar for Elsevier Labs giving an overview of altmetrics. It was a fun opportunity to talk to people who have a great chance to influence the next generation of academic measurement. The slides are embedded below.

At the VU, we are also working with Elsevier Labs on the Data2Semantics project, where we are trying to enrich data with additional machine-understandable metadata. How does this relate to metrics? I believe that metrics (access, usage, etc.) can be a key piece of additional semantics for datasets. I’m keen to see how metrics can make our data more useful, findable and understandable.
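As a toy illustration of what “metrics as semantics” could look like, the sketch below attaches download and citation counts to a dataset description as RDF. The dataset URI and the metrics vocabulary are made up for the example; they are not part of Data2Semantics.

```python
# Sketch: attaching usage metrics to a dataset description as RDF.
# The dataset URI and the "metrics" vocabulary are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, XSD

METRICS = Namespace("http://example.org/metrics#")  # hypothetical vocabulary
dataset = URIRef("http://example.org/datasets/clinical-trials")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("metrics", METRICS)

g.add((dataset, DCTERMS.title, Literal("Clinical trials dataset")))
g.add((dataset, METRICS.downloads, Literal(1342, datatype=XSD.integer)))
g.add((dataset, METRICS.citations, Literal(27, datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```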

 

Filed under: altmetrics Tagged: #altmetrics, data2semantics, presentation



We have opened up a Data2Semantics GitHub organisation for publishing all (open source) code produced within the Data2Semantics project. Point your browser (or Git client) to http://github.com/Data2Semantics for the latest and greatest!
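If you prefer scripting to browsing, the public repositories of the organisation can also be listed through the GitHub API; a small sketch is below (the repository names it prints will of course change over time).

```python
# Sketch: list the public repositories of the Data2Semantics organisation.
import requests

resp = requests.get("https://api.github.com/orgs/Data2Semantics/repos")
resp.raise_for_status()
for repo in resp.json():
    print(repo["name"], "-", repo["clone_url"])
```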


The COMMIT programme was officially kicked off by Maxime Verhagen, Minister of Economic Affairs, Agriculture and Innovation, at the ICTDelta 2011 event held at the World Forum in The Hague on November 16.

Throughout the day, members of the Data2Semantics project manned a very busy stand in the foyer, featuring prior and current work by the project partners such as the AIDA toolkit, OpenPHACTS, LarKC and the MetaLex Document Server.


Source: Semantic Web world for you

Scaling is often a central question for data-intensive projects, whether they use Semantic Web technologies or not, and SemanticXO is no exception. The triple store is used as a back end for the Journal of Sugar, a central component that records the usage of the different activities. This short post discusses the results obtained for two questions: “how many journal entries can the triple store sustain?” and “how much disk space is used to store the journal entries?”

Answering these questions means loading some Journal entries and measuring the read and write performance along with the disk space used. This is done by a script which randomly generates Journal entries and inserts them into the store. A text sampler and the real names of activities are used to make these entries realistic in terms of size. An example of such a generated entry, serialised in HTML, can be seen there. The following graphs show the results obtained for inserting 2000 journal entries. The figures have been averaged over 10 runs, each of them starting from a freshly created store. The triple store used is “RedStore”, running with a hash-based BerkeleyDB backend. The test machine is an XO-1 running software release 11.2.0.
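For readers curious about what such a benchmark looks like, here is a simplified sketch in the spirit of the script described above: it inserts generated entries one by one over SPARQL Update and times each write and read. The endpoint URLs and the shape of the generated entries are assumptions, not the actual SemanticXO code.

```python
# Sketch of the benchmark described above: insert generated Journal
# entries and time reads and writes. Endpoint URLs and entry layout
# are assumptions; adjust them to your RedStore configuration.
import time
import requests

UPDATE_URL = "http://localhost:8080/update"  # assumed RedStore update endpoint
QUERY_URL = "http://localhost:8080/query"    # assumed RedStore query endpoint


def insert_entry(i):
    update = f"""
    PREFIX ex: <http://example.org/journal#>
    INSERT DATA {{
        ex:entry{i} ex:title "Generated entry {i}" ;
                    ex:activity "TurtleArt" .
    }}"""
    requests.post(UPDATE_URL, data=update,
                  headers={"Content-Type": "application/sparql-update"}).raise_for_status()


def read_entry(i):
    query = f"""
    PREFIX ex: <http://example.org/journal#>
    SELECT ?p ?o WHERE {{ ex:entry{i} ?p ?o }}"""
    requests.get(QUERY_URL, params={"query": query},
                 headers={"Accept": "application/sparql-results+json"}).raise_for_status()


with open("timings.csv", "w") as out:
    out.write("entries,write_s,read_s\n")
    for i in range(1, 2001):
        t0 = time.time(); insert_entry(i); t_write = time.time() - t0
        t0 = time.time(); read_entry(i); t_read = time.time() - t0
        out.write(f"{i},{t_write:.4f},{t_read:.4f}\n")
```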

The disk space used is minimal for up to 30 entries, grows rapidly between 30 and 70 entries, and then increases linearly from that point on. The maximum space occupied is a bit less than 100MB, which is only a small fraction of the XO-1’s 1GB of storage.

 

Amount of disk space used by the triple store

The results for the read and write delays are less encouraging. Write operations take constant time, always around 0.1 s. Getting an entry from the triple store, however, becomes linearly slower as the store fills up. For up to 600 entries, the retrieval time of an entry stays below a second, which should provide a reasonable response time. With 2000 entries stored, however, the retrieval time goes as high as 7 seconds :-(

Read and write access time

The answer to the question we started with (“Does it scale?”) is then “yes, for up to 600 entries”, considering a first-generation device and the current state of the software components (SemanticXO/RedStore/…). This answer also raises new questions, among them: Are 600 entries enough for typical usage of the XO? Is it possible to improve the software to get better results? How do the results look on more recent hardware?

I would appreciate a bit of help in answering all of these, and especially the last one. I only have an XO-1 and thus cannot run my script on an XO-1.5 or XO-1.75. If you have such a device and are willing to help me get the results, please download the package containing the performance script and the triple store and follow the instructions for running it. After a day or so of execution, the script will generate three CSV files that I can then postprocess to obtain curves similar to the ones shown above.
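For reference, turning the resulting CSV files into curves like the ones above can be as simple as the sketch below; the file and column names are assumptions and should be adjusted to whatever the performance script actually writes.

```python
# Sketch: postprocess the benchmark CSV into access-time curves.
# File and column names are assumptions.
import csv
import matplotlib.pyplot as plt

entries, write_s, read_s = [], [], []
with open("timings.csv") as f:
    for row in csv.DictReader(f):
        entries.append(int(row["entries"]))
        write_s.append(float(row["write_s"]))
        read_s.append(float(row["read_s"]))

plt.plot(entries, write_s, label="write")
plt.plot(entries, read_s, label="read")
plt.xlabel("Journal entries stored")
plt.ylabel("Access time (s)")
plt.legend()
plt.savefig("access_times.png")
```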


Source: Think Links

The Journal of Web Semantics recently published a special issue on Using Provenance in the Semantic Web, edited by myself and Yolanda Gil (Vol. 9, No. 2, 2011). All articles are available on the journal’s preprint server.

The issue highlights top research at the intersection of provenance and the Semantic Web. The papers addressed a range of topics including:

  • tracking provenance of DBpedia back to the underlying Wikipedia edits [Orlandi & Passant];
  • how to enable reproducibility using Semantic techniques [Moreau];
  • how to use provenance to effectively reason over large amounts (1 billion triples) of messy data [Bonatti et al.]; and
  • how to begin to capture semantically the intent of scientists [Pignotti et al.].
Our editorial highlights a common thread between the papers and sums them up as follows:

A common thread through these papers is the use of already existing provenance ontologies. As the community comes to an increasing agreement on the commonalities of provenance representations through efforts such as the W3C Provenance Working Group, this will further enable new research on the use of provenance. This continues the fruitful interaction between standardization and research that is one of the hallmarks of the Semantic Web.

Overall, this set of papers demonstrates the latest approaches to enabling a Web that provides rich descriptions of how, when, where and why Web resources are produced, and shows the sorts of reasoning and applications that these provenance descriptions make possible.

Finally, it’s important to note that this issue wouldn’t have been possible without the quick and competent reviews done by the anonymous reviewers. This is my public thank you to them.

I hope you take a chance to take a look at this interesting work.

Filed under: academia, linked data Tagged: journal, linked data, provenance, semantic web