Last week the RDF and graph DB benchmarking project LDBC had its third Technical User Community meeting in London, held in collaboration with the GraphConnect event. This meeting marks the official launch of the LDBC non-profit company, which is the successor to the present EU FP7 project.
The meeting was very well attended, with most of the new advisory board present: Xavier Lopez from Oracle, Luis Ceze from the University of Washington, and Abraham Bernstein of the University of Zurich. Jans Aasman of Franz, Inc. and Karl Huppler, former chairman of the TPC, were not present but have signed up as advisory board members.
We had great talks by the new board members and invited graph and RDF DB users.
Nuno Carvalho of Fujitsu Labs presented the Fujitsu RDF use cases and benchmarking requirements, based around streaming analytics on time-series data. The technology platform is diverse, with anything from RDF stores to HBase. The challenge is integration. I pointed out that with the Virtuoso column store one can now efficiently host time-series data alongside RDF. Sure, a relational format is more efficient for time series, but it can be collocated with the RDF, and queries can join between the two. This is especially so after our stellar bulk-load speed measured with the TPC-H dataset.
Luis Ceze of the University of Washington presented Grappa, a C++ graph programming framework that, in his words, would be like the Cray XMT, later YarcData, in software. The idea is to divide a graph algorithm into small executable steps, millions in number, and to have very efficient scheduling and switching between these, building latency tolerance into every step of the application. Commodity interconnects like InfiniBand deliver poor throughput with small messages, but with the endless message-combination opportunities offered by millions of mini work units, the overall throughput stays good. We know the same from all the Virtuoso scale-out work. Luis is presently working on Graphbench, a research project at the University of Washington, funded by Oracle, for graph algorithm benchmarking. The major interest for LDBC is in having a library of common graph analytics as a starting point. Given these, the data generation can further evolve so as to create challenges for the algorithms. One issue that came up is the question of validating graph algorithm results: unlike with SQL queries, there is not necessarily a single correct answer. If the algorithm to use and the number of iterations to run are not fully specified, response times will vary widely. Random walks will in any case create variation between consecutive runs.
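To make the message-combining idea concrete, here is a minimal sketch in C++. This is not Grappa's actual API; the Combiner class, the batch size, and the stubbed send callback are all illustrative assumptions. Tiny per-vertex updates are buffered per destination node and shipped only when a batch fills up, so the interconnect sees a few large messages instead of millions of small ones.

```cpp
// Sketch of message combining: buffer small messages per destination,
// send in batches. Illustrative only, not Grappa's real interface.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

struct Message { uint64_t vertex; int64_t delta; };   // one tiny unit of graph work

class Combiner {
    static constexpr size_t kBatch = 1024;            // flush threshold per destination
    std::vector<std::vector<Message>> buffers_;       // one buffer per destination node
    std::function<void(int, const std::vector<Message>&)> send_;  // "network" send (stubbed)
public:
    Combiner(int nodes, std::function<void(int, const std::vector<Message>&)> send)
        : buffers_(nodes), send_(std::move(send)) {}

    // Enqueue a small message; only a full buffer triggers an actual send.
    void enqueue(int dest, Message m) {
        auto& buf = buffers_[dest];
        buf.push_back(m);
        if (buf.size() >= kBatch) flush(dest);
    }

    void flush(int dest) {
        if (!buffers_[dest].empty()) {
            send_(dest, buffers_[dest]);
            buffers_[dest].clear();
        }
    }

    void flush_all() {                                // drain partially filled buffers
        for (int d = 0; d < (int)buffers_.size(); ++d) flush(d);
    }
};

int main() {
    // Stub "network": report how many messages went out in each batch.
    Combiner c(4, [](int dest, const std::vector<Message>& batch) {
        std::printf("to node %d: batch of %zu messages\n", dest, batch.size());
    });
    // Millions of mini work units in a real run; a few thousand here.
    for (uint64_t i = 0; i < 5000; ++i)
        c.enqueue(i % 4, Message{i, 1});
    c.flush_all();
}
```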
Abraham Bernstein presented the work on his Signal/Collect graph programming framework and its applications in fraud detection. He also talked about the EU FP7 project ViSTA-TV, which does massive stream processing around the real-time behavior of internet TV users. Again, Abraham gave very direct suggestions for what to include in the LDBC graph analytics workload.
Andreas Both of Unister presented on RDF ontology-driven applications in an e-commerce context. Unister is Germany's leading e-commerce portal operator, with a large number of properties ranging from travel to most B2C areas. The RDF use cases are many, in principle reaching down to final content delivery, but high online demand often calls for specialized solutions like bit-field intersections for combining conditions. Sufficiently advanced database technology may also cover this, but this is not a given. Selecting travel destinations based on attributes like sports opportunities, culture, etc. can be made into efficient query plans, but this requires perfect query plans also for short queries. I expect to learn more about this when visiting on site. There is clear input for LDBC in these workloads.
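As a rough illustration of the bit-field intersection technique, here is a sketch under assumptions; Unister's actual implementation is not described here, and the names are made up. Each attribute ("sports", "culture", ...) gets one bit per destination, and ANDing the attribute bit vectors yields the destinations that satisfy all conditions at once.

```cpp
// Sketch of combining attribute conditions by bit-field intersection.
// Illustrative only; not Unister's actual data structures.
#include <cstdint>
#include <cstdio>
#include <vector>

using BitVec = std::vector<uint64_t>;   // one bit per travel destination

// Intersect any number of attribute bit vectors with bitwise AND.
BitVec intersect(const std::vector<BitVec>& conditions) {
    BitVec result = conditions.at(0);
    for (size_t c = 1; c < conditions.size(); ++c)
        for (size_t w = 0; w < result.size(); ++w)
            result[w] &= conditions[c][w];
    return result;
}

int main() {
    const size_t kDestinations = 128;   // two 64-bit words
    BitVec sports(kDestinations / 64), culture(kDestinations / 64);
    auto set = [](BitVec& v, size_t i) { v[i / 64] |= 1ULL << (i % 64); };

    set(sports, 3);  set(sports, 70);  set(sports, 90);
    set(culture, 3); set(culture, 90); set(culture, 101);

    // Destinations offering both sports and culture: 3 and 90.
    BitVec both = intersect({sports, culture});
    for (size_t i = 0; i < kDestinations; ++i)
        if (both[i / 64] >> (i % 64) & 1)
            std::printf("destination %zu matches\n", i);
}
```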
There were three talks on semantic applications in cultural heritage. Robina Clayphan of Europeana talked about this pan-European digital museum and library, and the Europeana Data Model (EDM). C. E. Ore of the University of Oslo talked about the CIDOC CRM (Conceptual Reference Model) ontology (ISO standard 21127:2006) and its role in representing cultural, historic, and archaeological information. Atanas Kiryakov of Ontotext gave a talk on a possible benchmark around CIDOC CRM reasoning. In the present LDBC work RDF inference plays a minor role, but reasoning would be emphasized in this CRM workload, where the inference needed revolves around abbreviating unions of many traversal paths of different lengths between modeled objects. The data is not very large, but the ontology has a lot of detail. This is still not the elusive use case that would really require all the OWL complexities. We will first see how the semantic publishing benchmark work led by Ontotext in LDBC plays out. There is enough work there in any case.
The most concrete result was that the graph analytics part of the LDBC agenda is starting to take shape. The LDBC organization is being formed and its processes and policies are being defined. I visited Thomas Neumann's group in Munich just prior to the TUC meeting to work on this. Nowadays Peter Boncz, who was recently awarded the Humboldt Prize, goes to Munich on a weekly basis, so Munich is the favored destination for much LDBC-related work.
The first workload of the social network benchmark is taking shape, and there is good progress also on the semantic publishing benchmark. I will give more commentary on these workloads in a future post, now that the initial drafts from the respective task forces are out.
Orri Erling
OpenLink Software, Inc.