News and Updates on the KRR Group

Source: Data2Semantics
The Data2Semantics project is part of the COMMIT/ research community. We attended the COMMIT kick-off meeting, where we presented our project, networked with the other 15 projects, and learned about presenting our work to the broader community. Paul wrote up his thoughts on the kick-off, which you can find here. The whole team […]

Source: Semantic Web world for you
On March 16, 2012 the European Public Sector Information Platform organised the ePSIplatform Conference 2012 on the theme “Taking re-use to the next level!”. It was a very well organised and interesting event, and also a good opportunity to meet new people and put faces to the names seen in emails and during teleconferences 🙂

The program was intense: 3 plenary sessions, 12 break-out sessions and project presentations during the lunch break. That was a lot to talk about and a lot to listen to. I left Rotterdam with a number of take-home messages and food for thought. What follows is a mix of my own opinions and things said by some of the many participants and speakers at the event.

Source: Semantic Web world for you
Last week, I attended a seminar about “Understanding and Managing Complex Systems” organised by the Royal Netherlands Academy of Arts and Sciences (KNAW) together with the Netherlands Organisation for Scientific Research (NWO). The take-home messages from this seminar are that 1) Complex Systems are highly popular in Amsterdam: all 200 available seats were taken the day registration opened, and 2) Complex Systems is the science of cooperation.

Source: Semantic Web world for you
The VU is making short one-minute videos to highlight some of the research being done within its walls. This is the video for SemanticXO, realised by Pepijn Borgwat and presented by Laurens Rietveld. The script is in Dutch; translated, it begins as follows: I am Laurens Rietveld and I do research at the Vrije […]

TabLinker is experimental software for converting manually annotated Microsoft Excel workbooks to the RDF Data Cube vocabulary. It is used in the context of the Data2Semantics project to investigate the use of Linked Data for humanities research (Dutch census data produced by DANS).

TabLinker was designed for converting to RDF (triplification, RDF-izing) those Excel or CSV files that have a complex layout and cannot be handled by fully automatic csv2rdf scripts.

A presentation about Linked Census Data, including TabLinker, is available from SlideShare.

Please consult the GitHub page for the latest release information.

Using TabLinker

TabLinker takes annotated Excel files (found using the srcMask option in the config.ini file) and converts them to RDF. This RDF is serialized to the target folder specified using the targetFolder option in config.ini.
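As a rough illustration of how these two options fit together, the sketch below parses a small config and expands the file mask. The option names srcMask and targetFolder come from the description above; the section name "paths", the example values and the helper read_paths are assumptions made for this example and may not match TabLinker's actual config.ini or code.

```python
import configparser
import glob

# Hypothetical config contents: only the option names srcMask and targetFolder
# come from the text; the section name "paths" and both values are assumptions.
EXAMPLE_CONFIG = """
[paths]
srcMask = ../input/*.xls
targetFolder = ../output/
"""


def read_paths(config_text):
    """Return (sorted list of files matching srcMask, targetFolder)."""
    parser = configparser.ConfigParser()
    # Keep option names case-sensitive so "srcMask" is not lowercased.
    parser.optionxform = str
    parser.read_string(config_text)
    src_mask = parser.get("paths", "srcMask")
    target_folder = parser.get("paths", "targetFolder")
    return sorted(glob.glob(src_mask)), target_folder


if __name__ == "__main__":
    files, target = read_paths(EXAMPLE_CONFIG)
    print("Files to convert:", files)
    print("RDF will be written to:", target)
```

If no file matches the mask, the list is simply empty, which is a quick way to check that the mask points at the right folder.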

Annotations in the Excel file should be done using the built-in style functionality of Excel (you can specify these by hand). TabLinker currently recognises seven styles:

  • TabLink Title – The cell containing the title of a sheet
  • TabLink Data – A cell that contains data, e.g. a number for the population size
  • TabLink ColHeader – Used for the headers of columns
  • TabLink RowHeader – Used for row headers
  • TabLink HierarchicalRowHeader – Used for multi-column row headers with subsumption/taxonomic relations between the values of the columns
  • TabLink Property – Typically used for the header cells directly above RowHeader or HierarchicalRowHeader cells; cell values are the properties that relate Data cells to RowHeader and HierarchicalRowHeader cells
  • TabLink Label – Used for cells that contain a label for one of the HierarchicalRowHeader cells

An eighth style, TabLink Metadata, is currently ignored (see #3).
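To make the role of the style names concrete, here is a minimal sketch of a lookup from cell style to role. The style names are the ones listed above; the role labels and the function classify are purely illustrative and are not taken from TabLinker's source.

```python
# Style names are taken from the list above; the role labels are illustrative
# only and are not TabLinker's internal names.
TABLINK_STYLES = {
    "TabLink Title": "title",
    "TabLink Data": "data",
    "TabLink ColHeader": "column header",
    "TabLink RowHeader": "row header",
    "TabLink HierarchicalRowHeader": "hierarchical row header",
    "TabLink Property": "property",
    "TabLink Label": "label",
}

# The eighth style is recognised by name but currently skipped.
IGNORED_STYLES = {"TabLink Metadata"}


def classify(style_name):
    """Return the role for a cell style, or None if the cell is skipped."""
    if style_name in IGNORED_STYLES:
        return None
    # Any non-TabLink style (e.g. Excel's default) also yields None.
    return TABLINK_STYLES.get(style_name)
```

Cells with any other style, such as the non-TabLink style suggested for totals, fall through to None and would simply be left out of the conversion in this sketch.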

An example of such an annotated Excel file is provided in the input directory. There are ways to import the styles defined in that file into your own Excel files.

Tip: If your table contains totals for HierarchicalRowHeader cell values, use a non-TabLink style to mark the cells between the level to which the total belongs, and the cell that contains the name of the total. Have a look at the example annotated Excel file to see how this is done (up to row 428).

Once you’re all set, start TabLinker by changing to the src folder and running:

python tablinker.py

Requirements

TabLinker was developed under the following environment:

Source: Semantic Web world for you
The WordPress.com stats helper monkeys prepared a 2011 annual report for this blog. Here’s an excerpt: A San Francisco cable car holds 60 people. This blog was viewed about 2,800 times in 2011. If it were a cable car, it would take about 47 trips to carry that many people. Click here to see the […]

Source: Semantic Web world for you

I’m currently spending some time at Yahoo Labs in Barcelona to work with Peter Mika and his team on data analysis. Last week, I was invited to give a seminar on how we perform network-based analysis of Linked Data at the VU. The slides are embedded at the end of this post.

Essentially, we observe that focusing only on the triples (cf., for instance, a BTC snapshot) is not enough to explain some of the patterns observed in the Linked Data ecosystem. In order to understand what’s really going on, one has to take into account the data, its publishers and consumers, and the machines that serve it. Time also plays an important role and shouldn’t be neglected. This brings us to studying this ecosystem as a Complex System, and that’s one of the things keeping Paul, Frank, Stefan, Shenghui and myself busy these days ;-)

Exploring Linked Data content through network analysis


We have opened up a Data2Semantics GitHub organisation for publishing all (open source) code produced within the Data2Semantics project. Point your browser (or Git client) to http://github.com/Data2Semantics for the latest and greatest!


Data2Semantics at ICTDelta 2011

Posted by data2semantics in collaboration | computer science | large scale | semantic web | vu university amsterdam

The COMMIT programme was officially kicked off by Maxime Verhagen, Minister of Economic Affairs, Agriculture and Innovation, at the ICTDelta 2011 event held at the World Forum in The Hague on November 16.

Throughout the day, members of the Data2Semantics project manned a very busy stand in the foyer, featuring prior and current work by the project partners such as the AIDA toolkit, OpenPHACTS, LarKC and the MetaLex Document Server.


Source: Semantic Web world for you

Scaling is often a central question for data-intensive projects, whether they make use of Semantic Web technologies or not, and SemanticXO is no exception. The triple store is used as a back end for the Journal of Sugar, which is a central component recording the usage of the different activities. This short post discusses the results found for two questions: “how many journal entries can the triple store sustain?” and “how much disk space is used to store the journal entries?”

Answering these questions means loading some Journal entries and measuring the read and write performance along with the disk space used. This is done by a script which randomly generates Journal entries and inserts them into the store. A text sampler and the real names of activities are used to make these entries realistic in terms of size. An example of such a generated entry, serialised in HTML, can be seen there. The following graphs show the results obtained for inserting 2000 journal entries. These figures have been averaged over 10 runs, each of them starting with a freshly created store. The triple store used is “RedStore”, run with a hash-based Berkeley DB backend. The test machine is an XO-1 running software release 11.2.0.
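The measurement loop described above can be sketched roughly as follows. This is not the actual performance script: the class ListStore is an in-memory stand-in for the triple store, the word list replaces the text sampler, and all function names are invented for this illustration. It only shows the general shape of "insert, time the write, time a read, average over fresh runs".

```python
import random
import time


class ListStore:
    """Stand-in for the triple store: a plain list scanned linearly on read."""

    def __init__(self):
        self.entries = []

    def insert(self, key, text):
        self.entries.append((key, text))

    def get(self, key):
        for stored_key, text in self.entries:
            if stored_key == key:
                return text
        return None


def random_entry(rng, words, n_words=50):
    """Build a pseudo-random text body of n_words words."""
    return " ".join(rng.choice(words) for _ in range(n_words))


def run_once(n_entries, rng):
    """Insert n_entries into a fresh store, timing each write and one read."""
    store = ListStore()  # every run starts with a freshly created store
    words = ["journal", "activity", "sugar", "entry", "write", "read"]
    write_times, read_times = [], []
    for i in range(n_entries):
        text = random_entry(rng, words)
        start = time.perf_counter()
        store.insert(i, text)
        write_times.append(time.perf_counter() - start)
        # Time the retrieval of one randomly chosen existing entry.
        start = time.perf_counter()
        store.get(rng.randrange(i + 1))
        read_times.append(time.perf_counter() - start)
    return write_times, read_times


def average_runs(n_entries, n_runs, seed=0):
    """Average per-entry write and read times over n_runs independent runs."""
    rng = random.Random(seed)
    runs = [run_once(n_entries, rng) for _ in range(n_runs)]
    avg_write = [sum(r[0][i] for r in runs) / n_runs for i in range(n_entries)]
    avg_read = [sum(r[1][i] for r in runs) / n_runs for i in range(n_entries)]
    return avg_write, avg_read
```

Calling average_runs(2000, 10) would mirror the 2000 entries and 10 runs mentioned above, although the timings of this stand-in say nothing about RedStore on an XO-1.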

The disk space used is minimal for up to 30 entries, grows rapidly between 30 and 70 entries, and increases linearly from that number on. The maximum space occupied is a bit less than 100 MB, which is a small fraction of the 1 GB of storage of the XO-1.

 

Amount of disk space used by the triple store

The results for the read and write delays are less good news. Write operations are constant in time and always take around 0.1 s. Getting an entry from the triple store, however, gets linearly slower as the triple store fills up. For up to 600 entries, the retrieval time of an entry is below a second, which should provide a reasonable response time. However, with 2000 entries stored, the retrieval time goes as high as 7 seconds :-(

Read and write access time

The answer to the question we started with (“Does it scale?”) is then “yes, for up to 600 entries”, considering a first-generation device and the current status of the software components (SemanticXO/Redstore/…). This answer also raises new questions, among which: Are 600 entries enough for typical usage of the XO? Is it possible to improve the software to get better results? How are the results on more recent hardware?

I would appreciate a bit of help answering all of these, and especially the last one. I only have an XO-1 and thus cannot run my script on an XO-1.5 or XO-1.75. If you have such a device and are willing to help me get the results, please download the package containing the performance script and the triple store and follow the instructions for running it. After a day of execution or so, this script will generate three CSV files that I can then post-process to get curves similar to the ones shown.
