News and Updates on the KRR Group

LDBC project

Posted by admin in ldbc - (0 Comments)

The mission of the LDBC can be compared to that of the Transaction Processing Performance Council (TPC) that Jim Gray founded in the area of relational database technology (www.tpc.org). LDBC will create a body in which vendors of RDF and graph database systems agree on relevant benchmarks and benchmark practices, and which will publish official benchmark results. The objective of the project is to highlight the functional and performance characteristics of Graph and RDF systems, vis-à-vis each other and established relational data management technology. The motivation for this is to help IT practitioners understand and select Graph and RDF data management products, and thus help the emerging Graph and RDF data management industry mature. Additionally, we hope that LDBC will spur competition and thereby accelerate technical progress.

In detail:

  • “agreeing on benchmark practices” means agreeing on the exact rules and metrics with which products can be compared. Without such rules, which include having benchmark results checked by independent auditors, it is very easy to skew any benchmark result in one’s favor, e.g. by precomputing (partial) answers, by implementing benchmark-specific functionality, by not being open about hot or cold runs, or by comparing results on wholly different hardware (with wholly different price tags). There are many ways in which one can game a result.
  • “agreeing on metrics” is important because, without balanced metrics, it is easy to pick the benchmark observations or statistics that favor one algorithm/system/product (conveniently forgetting about other benchmark-relevant metrics on which its performance may be less favorable). Systems often must make trade-offs, so a win on one metric can become a loss on another; see e.g. the difference between OLTP and OLAP workloads. The LDBC metrics will include a notion of score-per-EURO (or $), taking into account hardware, software and maintenance costs.
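
The score-per-EURO idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class and field names, the two systems and all numbers are invented, and the actual LDBC metric definitions may look quite different.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    """One hypothetical benchmark result together with its cost components."""
    system: str
    score: float             # e.g. queries per second (invented numbers)
    hardware_cost: float     # EUR
    software_cost: float     # EUR
    maintenance_cost: float  # EUR over the costing period

    def total_cost(self) -> float:
        return self.hardware_cost + self.software_cost + self.maintenance_cost

    def score_per_euro(self) -> float:
        return self.score / self.total_cost()


runs = [
    BenchmarkRun("system-a", score=12000.0, hardware_cost=40000.0,
                 software_cost=15000.0, maintenance_cost=5000.0),
    BenchmarkRun("system-b", score=8000.0, hardware_cost=10000.0,
                 software_cost=8000.0, maintenance_cost=2000.0),
]

# The "winner" depends on which metric is reported.
by_raw = max(runs, key=lambda r: r.score).system
by_value = max(runs, key=lambda r: r.score_per_euro()).system
print(by_raw, by_value)
```

In this made-up example the system with the higher raw score is not the one with the better score per euro, which is the kind of trade-off a balanced set of metrics is meant to expose.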

These points underline the industrial nature of the project, since such elements are not usually present in academic benchmark work. The industry participants in LDBC include Ontotext, Openlink and Neo Technology (neo4j), which are European industrial leaders in this emerging technological space. The council itself is international, so other companies will be able to join the non-profit body of LDBC as well. More than ten such companies have approached LDBC already: effectively the great majority of RDF and Graph database companies are interested. We expect the council to start growing by March 2013, when a non-profit legal entity for it will have been formed and membership will become formally possible.

The LDBC EU project also has research participants: UPC Barcelona, VUA Amsterdam, Technical University Munich, FORTH and STI Innsbruck. The research task is to kick-start the LDBC by helping to select and define an initial set of benchmarks. Even though benchmarks already exist for RDF and graph databases, aspects like cost metrics, rules for running the benchmark, and benchmark audits are generally underdeveloped, so LDBC will extend existing benchmark components where possible and create new ones where necessary. The academic partners have been selected to include groups that have technical expertise in data management (e.g. RDF-3X in Munich; MonetDB and VectorWise in Amsterdam; Sparsity in Barcelona), so that benchmarks will stress systems in relevant areas “where it hurts” in order to maximize the potential for progress.

In order to ensure that benchmarks represent usage scenarios that matter for technology users, LDBC has a Technical User Community (TUC). The TUC had its first meeting last week, on November 19/20 in Barcelona; it was well attended and quite productive. A digital record can be found at: ldbc.eu:8090/display/TUC/First+TUC+meeting+Nov+2012

We see it as a sign of relevance for LDBC that these users spent two days talking in depth about their technical challenges with Graph and RDF software, several of them flying in from the US at their own expense. The TUC includes participants from the publishing, life sciences, security and marketing domains. The outcomes of the first TUC meeting have been used to determine the direction of the first LDBC benchmark task forces, and the TUC will remain continuously involved in providing information on relevant datasets and workloads, and feedback on benchmark specifications as they evolve.

If this description got you interested, and specifically if you are a user of RDF, graph or relational technology, we would like to invite you to take a short survey: http://goo.gl/PwGtK

More about the project, its activities and its benchmarks will be posted at www.ldbc.eu in the future. We are also on Twitter: @LDBCproject.
You can contact me via: larri “at” ac.upc.edu

Yours,
Josep Lluis Larriba Pey
LDBC coordinator


Source: Semantic Web world for you
Last week, on the afternoon of November 22, I co-organized a tutorial about Linked Data aimed at researchers from digital humanities. The objective was to give a basic introduction to the core principles and to do that in a very hands-on setting, so that everyone can get a concrete experience with publishing Linked Data. To […]

The TUC will kick off its activities on November 19/20, 2012 at the first scheduled meeting in Barcelona. An online questionnaire is available at http://goo.gl/PwGtK so that interested parties can describe their experiences and needs for consideration in the LDBC benchmarks.

This week we received notification from the EU that the LDBC project has been granted. We think this is great news. The LDBC project is a STREP (Specific Targeted Research Project) and will run until Q2 2015. LDBC stands for Linked Data Benchmark Council, and linked data here of course comprises RDF data management, but also includes the emerging class of graph database systems.

The mission of the LDBC project is to establish a long-term independent association among RDF and Graph database companies that defines benchmarks, specifies benchmarking practices and publishes officially vetted benchmark results. Beyond the project partners, many commercial vendors of RDF and Graph database systems have already expressed their interest in joining this council (once we have founded the legal entity, which will still take a few months).

The motivation behind the project is to show the strengths (and weaknesses) of RDF and Graph database technologies to the wider IT community pondering the adoption of these technologies, by enabling comparisons among the various products and also with established relational database technologies. In addition, by establishing competition on these benchmarks, LDBC aims to foment technical progress in RDF and Graph database systems.

The LDBC project partners include Ontotext and Openlink from the RDF database community and Neo Technology (of neo4j fame) from the graph database side; Sparsity is indirectly involved through academic project partner UPC (Barcelona). Other project partners are University of Innsbruck, FORTH, VU University Amsterdam and Technical University Munich (TUM). The academic partners will help to provide the council with an initial set of benchmarks.

The technical topics of interest for benchmarking are:

  • complex analytical queries for both graph and RDF
  • graph analysis algorithms and traversals
  • large-scale reasoning on RDF data
  • transaction performance
  • systems support for data integration and provenance

The use-case scenarios for these are:

  • social networking (e.g. marketing companies)
  • dynamic publishing (e.g. BBC)
  • telecommunication network analysis
  • bioinformatics data integration (e.g. OpenPhacts)

LDBC interacts with users of Graph and RDF technologies through its Technical User Community (TUC), and the TUC is having its first users workshop in Barcelona next week, on November 19 and 20 (http://www.ldbc.eu:8090/display/TUC/First+TUC+meeting+Nov+2012), on the premises of UPC. The main reason for users to engage with the TUC is to influence the benchmarking agenda of the LDBC. Talk to us, and RDF vendors might start competing on how best to solve your problems! Even if the Barcelona meeting is too short notice, please drop us a note if you want to be involved in the TUC or know people who should be.

Finally, please fill in the questionnaire (http://goo.gl/PwGtK) to tell us about your usage of (and problems with) RDF or graph database technologies. We will use the questionnaire results received by Friday, November 16 to help set the agenda for the users meeting, so contributions made this week would be highly appreciated.

Thanks for your time, also on behalf of the full LDBC consortium,

Peter Boncz (scientific director LDBC)
Paul Groth
Frank van Harmelen


“Graphs are everywhere. Organizations of all sizes, from large enterprise to new startups, are embracing graph databases as the fastest way to query and store graph data. The EU has recognized this, and has funded the Linked Data Benchmark Council to promote and further the research in graph databases. We are grateful to the EU for recognizing the leading role of Neo4j in graph database adoption worldwide and have accepted its invitation to join the research team, where we will be working closely with graph researchers to set the next generation of industry standards and benchmarks.”
Emil Eifrém, CEO of Neo Technology


On November 9, 2012, the EU confirmed the start of the new FP7 project called Linked Data Benchmark Council (LDBC). The main objective of LDBC is to develop benchmarks for the emerging field of RDF and graph data management systems, and to spur industry cooperation around such benchmarks. This new council of database software vendors and academics will establish benchmarks and publish benchmark results that provide insight into the properties of RDF and graph data management systems.
The LDBC audience includes IT professionals interested in using these emerging technologies, researchers in both the database and semantic web research communities, and data management technology vendors.
The outcomes of the LDBC project will be

  • (i) a set of benchmarks that will span four technical expertise areas: complex query execution, transactionality in graphs, RDF inference and RDF support for ETL/data integration, and
  • (ii) the creation of an industry-supported LDBC organization that will outlive the EU project, and which ultimately aims to include the entire set of RDF and graph database vendors.

The LDBC also engages users of graph and RDF data management technology in its Technical User Community (TUC), where users have the opportunity to interact with the LDBC in order to make sure their experiences and needs find their way into LDBC benchmarks. The TUC will kick off its activities on November 19/20, 2012 at the first scheduled meeting in Barcelona. Alternatively, an online questionnaire is available at http://goo.gl/PwGtK so that interested parties can describe their experiences and needs for consideration in the LDBC benchmarks.
Please visit http://ldbc.eu/tuc to engage with the LDBC technical user community.

Update: Complexity, Learning and Semantics

Posted by data2semantics in collaboration | computer science | large scale | semantic web | vu university amsterdam - (Comments Off on Update: Complexity, Learning and Semantics)

Source: Data2Semantics

Complexity metrics form the backbone of graph analysis. Centrality, betweenness, assortativity and scale freeness are just a handful of selections from a large and quickly growing literature. It seems that every purpose has its own notion of complexity. Can we find a way to tie these disparate notions together?

Algorithmic statistics provides an answer. It posits that any useful property that is induced from data can be used to compress it, that is, to store it more efficiently. If I know that my network is scale free, or that a set of points is distributed normally, that information will allow me to come up with a more efficient representation of the data. If not, the property we have learned is of no use.

This notion allows us to see data compression, learning and complexity analysis as simply three names for the same thing. The less a dataset can be compressed, the more complex it is; the more it can be compressed, the more useful our induced information is.
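
As a rough illustration of compressibility as a complexity measure, the sketch below compares a highly regular byte string with pseudo-random bytes. It uses the general-purpose `zlib` compressor as a practical stand-in: the ideal measure from algorithmic statistics is not computable, so any real compressor only gives an upper bound on description length. The function name and both test inputs are assumptions made for this example.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more regular)."""
    return len(zlib.compress(data, 9)) / len(data)


# A highly regular dataset: the same two bytes repeated.
regular = b"ab" * 5000

# A dataset with no structure that zlib can exploit (seeded for repeatability).
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(10000))

regular_ratio = compression_ratio(regular)
noisy_ratio = compression_ratio(noisy)
print(f"regular: {regular_ratio:.3f}  noisy: {noisy_ratio:.3f}")
```

The regular data shrinks to a small fraction of its size, while the pseudo-random data barely compresses at all; under the view described above, the second dataset is the more complex one.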

But we can go further than just complexity. Occam’s razor tells us that the simplest explanation is often the best. Algorithmic statistics provides us with a more precise version. If our data is the result of a computational process, and we have found a short description of it, then with high probability the model that allowed that compression is also a description of the process that generated our data. And that is ultimately what semantics is, a description of a generating process. Whether it’s the mental state that led to a linguistic expression, or the provenance trail that turned one form of data into another. When we talk about semantics, we are usually discussing computational processes generating data.

Practically, algorithmic statistics will give us a means to turn any family of network models (from frequent subgraphs to graph grammars) into a family of statistics. If the network model is powerful enough, the statistics should be able to capture any existing property of complex graphs, including scale freeness, assortativity or fractal scaling.


Update: TabLinker & UnTabLinker

Posted by data2semantics in collaboration | computer science | large scale | semantic web | vu university amsterdam - (Comments Off on Update: TabLinker & UnTabLinker)

Source: Data2Semantics

TabLinker, introduced in an earlier post, is a spreadsheet to RDF converter. It takes Excel/CSV files as input, and produces enriched RDF graphs with cell contents, properties and annotations using the DataCube and Open Annotation vocabularies.

TabLinker interprets spreadsheets based on hand-made markup using a small set of predefined styles (e.g. it needs to know what the header cells are). Work package 6 is currently investigating whether and how we can perform this step automatically.

Features:

  • Raw, model-agnostic conversion from spreadsheets to RDF
  • Interactive spreadsheet marking within Excel
  • Automatic annotation recognition and export with OA
  • Round-trip conversion: revive the original spreadsheet files from the produced RDF (UnTabLinker)
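
To give a feel for what a raw, model-agnostic conversion can look like, here is a minimal sketch that turns every non-empty CSV cell into a few N-Triples statements. It is not TabLinker's code or output format: the `http://example.org/sheet/` namespace, the `row`, `column` and `value` properties and the function names are all hypothetical, and the sketch ignores styles, annotations and the DataCube and Open Annotation vocabularies.

```python
import csv
import io
from typing import List

BASE = "http://example.org/sheet/"  # hypothetical namespace, for illustration only
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"


def escape_literal(value: str) -> str:
    """Escape characters that are special inside an N-Triples string literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def cells_to_ntriples(csv_text: str) -> List[str]:
    """Emit three triples (row, column, value) for every non-empty cell."""
    triples = []
    for row_idx, row in enumerate(csv.reader(io.StringIO(csv_text))):
        for col_idx, value in enumerate(row):
            if value == "":
                continue
            cell = f"<{BASE}cell/r{row_idx}c{col_idx}>"
            triples.append(f'{cell} <{BASE}row> "{row_idx}"^^<{XSD_INT}> .')
            triples.append(f'{cell} <{BASE}column> "{col_idx}"^^<{XSD_INT}> .')
            triples.append(f'{cell} <{BASE}value> "{escape_literal(value)}" .')
    return triples


sample = "city,count\nA,10\n"  # made-up two-row sheet
triples = cells_to_ntriples(sample)
print("\n".join(triples))
```

Because every cell keeps its row and column coordinates, a converter of this shape retains enough information to rebuild the original grid, which is the property a round-trip tool depends on.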

In Data2Semantics, we have used TabLinker to publish linked socio-historical data, converting the historical Dutch censuses (1795-1971) to RDF (see slides).

 

Social historians are actively doing research using these datasets, producing rich annotations that correct or reinterpret data; these annotations are very useful when checking dataset quality and consistency (see model). The published RDF is ready to query and visualize via SPARQL queries.

Update: Machine Learning and Linked Data

Posted by data2semantics in collaboration | computer science | large scale | semantic web | vu university amsterdam - (Comments Off on Update: Machine Learning and Linked Data)

Source: Data2Semantics

Part of work package 2 is developing machine learning techniques to automatically enrich linked data. The web of data has become so large that maintaining it by hand is no longer possible. In contrast to existing techniques for learning for the semantic web, we aim to apply the techniques directly to the linked data.

We use kernel-based machine learning techniques, which can deal well with structured data such as RDF graphs. Different graph kernels exist, typically developed in the bioinformatics domain, so which kernels are best suited to RDF is an open question. A big advantage of the graph kernel approach is that relatively little preprocessing or feature selection of the RDF graph is necessary, and that graph kernels can be applied to a wide range of tasks, such as property prediction, link prediction, node clustering and node ranking.

Currently our research focusses on:

  • which graph kernels are best suited to RDF,
  • what part of the RDF graph do we need for the graph kernel,
  • which tasks are well suited to solve using kernels.
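
For readers unfamiliar with graph kernels, the sketch below shows one well-known example, a Weisfeiler-Lehman style subtree kernel, applied to small sets of triples. It is a generic textbook-style illustration and not necessarily one of the kernels studied in the project or implemented in d2s-tools. Several simplifications are assumptions of this sketch: every node starts with the same label, only outgoing edges are followed, and the function names are made up.

```python
from collections import Counter
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def wl_features(triples: List[Triple], iterations: int = 2) -> Counter:
    """Count node labels over several rounds of neighbourhood relabelling."""
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    out_edges: Dict[str, List[Tuple[str, str]]] = {n: [] for n in nodes}
    for s, p, o in triples:
        out_edges[s].append((p, o))

    # Assumption: all nodes start with the same label, so only the predicate
    # structure around a node distinguishes it from other nodes.
    labels = {n: "*" for n in nodes}
    features = Counter(labels.values())
    for _ in range(iterations):
        # New label = old label plus the sorted (predicate, neighbour label) pairs.
        labels = {
            n: repr((labels[n], sorted((p, labels[o]) for p, o in out_edges[n])))
            for n in nodes
        }
        features.update(labels.values())
    return features


def wl_kernel(a: List[Triple], b: List[Triple], iterations: int = 2) -> int:
    """Kernel value: dot product of the two label-count vectors."""
    fa, fb = wl_features(a, iterations), wl_features(b, iterations)
    return sum(count * fb[label] for label, count in fa.items())


g1 = [("a", "knows", "b"), ("b", "knows", "c")]
g2 = [("x", "knows", "y"), ("y", "knows", "z")]  # same shape as g1
g3 = [("x", "likes", "y")]                       # different predicate and shape

print(wl_kernel(g1, g2), wl_kernel(g1, g3))
```

Graphs with the same shape and predicates share all their labels and get a high kernel value, while structurally different graphs share fewer labels. The resulting values can be fed to any kernel method, which is why little feature engineering is needed.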

A paper with the most recent results is currently under submission at SDM 2013. Code for different graph kernels and for redoing our experiments is available at: https://github.com/Data2Semantics/d2s-tools.
