NWO has awarded 12M euro to CLARIAH, a project to build a digital infrastructure for software, data, enrichment, search, and analytics in the Humanities. Frank van Harmelen, Maarten de Rijke, and Cees Snoek are among the nine scientists who form the core team of the project. See http://clariah.nl/, http://bit.ly/TERNC0, and http://bit.ly/1mWtnje for more details.
Recently, the Linked Data Benchmark Council (LDBC) launched its portal at http://ldbcouncil.org.
LDBC is an organization for benchmarking graph and RDF data management systems that came out of an EU FP7 project of the same name (ldbc.eu).
The LDBC will outlive the EU project: it will be industry-supported and will operate with ldbcouncil.org as its web presence.
Also, LDBC announced public drafts of its first two benchmarks. Public draft means that implementations of the benchmark software and technical specification documents are available and ready for public testing and comments. The two benchmarks are:
- the Semantic Publishing Benchmark (SPB – http://ldbcouncil.org/benchmarks/spb) which is based on the BBC use case and ontologies, and
- the Social Network Benchmark (SNB – http://ldbcouncil.org/benchmarks/snb), for which an interactive workload has been released. Later, a Business Intelligence and a Graph Analytics workload will follow on this same dataset. The SNB data generator was recently used in the ACM SIGMOD programming contest, which was about graph analytics.
The ldbcouncil.org website also holds a blog with news and technical background on LDBC. The most recent post is about “Choke-Point based Benchmark Design”, by Peter Boncz.
Last week the RDF and graph DB benchmarking project LDBC had its 3rd Technical User Community meeting in London, held in collaboration with the GraphConnect event. This meeting marks the official launch of the LDBC non-profit company which is the successor of the present EU FP7 project.
The meeting was very well attended, and most of the new advisory board was present: Xavier Lopez from Oracle, Luis Ceze from the University of Washington, and Abraham Bernstein of the University of Zurich. Jans Aasman of Franz, Inc. and Karl Huppler, former chairman of the TPC, could not attend but have signed up as advisory board members.
We had great talks by the new board members and by invited graph and RDF DB users.
Nuno Carvalho of Fujitsu Labs presented on the Fujitsu RDF use cases and benchmarking requirements, based around streaming analytics on time series data. The technology platform is diverse, with anything from RDF stores to HBase. The challenge is integration. I pointed out that with the Virtuoso column store one can now efficiently host time series data alongside RDF. Sure, a relational format is more efficient for time series data, but it can be collocated with RDF, and queries can join between the two. This is especially so after our stellar bulk-load speed measured with the TPC-H dataset.
Luis Ceze of the University of Washington presented Grappa, a C++ graph programming framework that, in his words, would be like the Cray XMT (later YarcData) in software. The idea is to divide a graph algorithm into small executable steps, millions in number, and to have very efficient scheduling and switching between these, building latency tolerance into every step of the application. Commodity interconnects like InfiniBand deliver bad throughput with small messages, but with endless message-combination opportunities from millions of mini work units, the overall throughput stays good. We know the same from all the Virtuoso scale-out work. Luis is presently working on Graphbench, a research project at the University of Washington funded by Oracle for graph-algorithm benchmarking. The major interest for LDBC is in having a library of common graph analytics as a starting point. Having these, the data generation can further evolve so as to create challenges for the algorithms. One issue that came up is the question of validating graph-algorithm results: unlike with SQL queries, there is not necessarily a single correct answer. If the algorithm to use and the number of iterations to run are not fully specified, response times will vary widely. Random walks will in any case create variation between consecutive runs.
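The message-combining idea described above can be sketched in a few lines. This is my own toy illustration, not Grappa's actual API: many tiny per-vertex updates destined for the same node are pre-aggregated, so one network send replaces thousands of small ones.

```python
# Hypothetical sketch of message combining in a latency-tolerant graph
# framework (names invented; not Grappa's real interface).
from collections import defaultdict

def combine_messages(messages):
    """Pre-aggregate (destination, value) messages per destination,
    assuming an associative combine operation (here: sum)."""
    combined = defaultdict(float)
    for dst, value in messages:
        combined[dst] += value
    return dict(combined)

# Millions of tiny updates collapse into one message per destination.
msgs = [(1, 0.5), (2, 0.25), (1, 0.5), (2, 0.25), (2, 0.5)]
print(combine_messages(msgs))  # {1: 1.0, 2: 1.0}
```

Because the combine operation is associative, the aggregation can happen anywhere along the path to the destination without changing the result, which is what makes batching over a commodity interconnect pay off.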
Abraham Bernstein presented his work on the Signal/Collect graph programming framework and its applications in fraud detection. He also talked about the EU FP7 project ViSTA-TV, which does massive stream processing around the real-time behavior of internet TV users. Again, Abraham gave very direct suggestions for what to include in the LDBC graph analytics workload.
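For readers unfamiliar with the model, the Signal/Collect idea is that vertices alternately "signal" values along their edges and "collect" incoming signals to update their state. The sketch below is my own toy rendition of that two-phase loop (a PageRank-style update), not the framework's actual API.

```python
# Toy Signal/Collect-style iteration (invented API, PageRank-like update).
def signal_collect(edges, state, steps):
    """edges: dict vertex -> list of successor vertices;
    state: dict vertex -> float score; runs a fixed number of rounds."""
    for _ in range(steps):
        inbox = {v: [] for v in state}
        # Signal phase: each vertex sends its score share along out-edges.
        for v, succs in edges.items():
            for s in succs:
                inbox[s].append(state[v] / len(succs))
        # Collect phase: each vertex aggregates what it received.
        state = {v: 0.15 + 0.85 * sum(inbox[v]) for v in state}
    return state

# A two-vertex cycle converges to equal scores.
print(signal_collect({0: [1], 1: [0]}, {0: 1.0, 1: 1.0}, 10))
```

Real Signal/Collect adds asynchronous scheduling and convergence thresholds per vertex, which is what makes it interesting for streaming workloads like the fraud-detection use case mentioned above.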
Andreas Both of Unister presented on RDF ontology-driven applications in an e-commerce context. Unister is Germany’s leading e-commerce portal operator, with a large number of properties ranging from travel to most B2C segments. The RDF use cases are many, in principle reaching down to final content distribution, but high online demand often calls for specialized solutions like bit-field intersections for combining conditions. Sufficiently advanced database technology may also offer this, but this is not a guarantee. Selecting travel destinations based on attributes like sports opportunities, culture, etc. can be made into efficient query plans, but this requires perfect query plans also for short queries. I expect to learn more about this when visiting on site. There is clear input for LDBC in these workloads.
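The bit-field intersection technique mentioned above is simple to illustrate. In this hedged sketch (attribute names and data are invented), each attribute keeps a bitmap over destination ids, and combining filter conditions is a single bitwise AND per attribute:

```python
# Invented example data: bit i set => destination i has the attribute.
attribute_bitmaps = {
    "beach":   0b10110,
    "culture": 0b01110,
    "sports":  0b00111,
}

def matching_destinations(*attributes):
    """AND the per-attribute bitmaps, then list ids of the set bits."""
    bits = ~0  # all ones: start from "every destination matches"
    for a in attributes:
        bits &= attribute_bitmaps[a]
    return [i for i in range(bits.bit_length()) if bits >> i & 1]

print(matching_destinations("beach", "culture"))  # [1, 2]
```

This is why such filters answer in microseconds regardless of how many conditions are combined, and why a general-purpose query optimizer has to produce a near-perfect plan to compete even on short queries.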
There were three talks on semantic applications in cultural heritage. Robina Clayphan of Europeana talked about this pan-European digital museum and library, and the Europeana Data Model (EDM). C. E. Ore of the University of Oslo talked about the CIDOC CRM (Conceptual Reference Model) ontology (ISO 21127:2006) and its role in representing cultural, historic, and archaeological information. Atanas Kiryakov of Ontotext gave a talk on a possible benchmark around CIDOC CRM reasoning. In the present LDBC work RDF inference plays a minor role, but reasoning would be emphasized in this CRM workload, in which the inference needed revolves around abbreviating unions between many traversal paths of different lengths between modeled objects. The data is not very large, but the ontology has a lot of detail. This still is not the elusive use case that would really require all the OWL complexities. We will first see how the semantic publishing benchmark work led by Ontotext in LDBC plays out. There is anyhow work enough there.
The most concrete result was that the graph analytics part of the LDBC agenda is starting to take shape. The LDBC organization is being formed and its processes and policies are being defined. I visited Thomas Neumann’s group in Munich just prior to the TUC meeting to work on this. Nowadays Peter Boncz, who was recently awarded the Humboldt Prize, goes to Munich on a weekly basis, so Munich is the favored destination for much LDBC-related work.
The first workload of the social network benchmark is taking shape, and there is good progress also on the semantic publishing benchmark. I will comment further on these workloads in a future post, now that the initial drafts from the respective task forces are out.
OpenLink Software, Inc.
Perspectives on the BSBM benchmarking effort.
This page will include details for vendors and users containing reference information about the RDF and graph databases benchmarks developed by LDBC once they are completed. One can track the development of the benchmarks at: http://www.ldbc.eu:8090/display/TUC/Benchmark+Task+Forces
In recent days cyberspace has seen some discussion concerning the relationship of the EU FP7 project LDBC (Linked Data Benchmark Council) and sociotechnical considerations. It has been suggested that LDBC, to its own and the community’s detriment, ignores sociotechnical aspects.
LDBC, as research projects go, actually has an unusually large, and as of this early date successful and thriving, sociotechnical aspect, i.e. the involvement of users and vendors alike. I will here discuss why, as far as the technical output of the project goes, sociotechnical metrics are in fact out of scope. Then again, the degree to which the benefits potentially obtained from the use of LDBC outcomes are in fact realized does depend strongly on community building, a social process.
One criticism of big data projects we sometimes encounter is the point that data without context is not useful. Further, one cannot just assume that one can throw several data sets together and get meaning from this, as there may be different semantics for similar looking things, just think of 7 different definitions of blood pressure.
LDBC, in its initial user community meeting was, according to its charter, focusing mostly on cases where the data is already in existence and of sufficient quality for the application at hand.
Michael Brodie, Chief Scientist at Verizon, is a well-known advocate of focusing on the meaning of data, not only on processing performance. There is a piece on this matter by him, Peter Boncz, Chris Bizer, and myself in the SIGMOD Record: “The Meaningful Use of Big Data: Four Perspectives”.
I had a conversation with Michael at a DERRI meeting a couple of years ago about measuring the total cost of technology adoption, thus including sociotechnical aspects such as acceptance by users, learning curves of various stakeholders, and whether one could in fact demonstrate an overall gain in productivity arising from semantic technologies. ‘Can one measure the effectiveness of different approaches to data integration?’ I asked. ‘Of course one can,’ answered Michael, ‘this only involves carrying out the same task with two different technologies, two different teams and then doing a double blind test with users. However, this never happens. Nobody does this because doing the task even once in a large organization is enormously costly and nobody will even seriously consider doubling the expense.’ [in my words, paraphrased]
LDBC does in fact intend to address technical aspects of data integration, i.e. schema conversion, entity resolution, and the like. Addressing the sociotechnical aspects of this (whether one should integrate in the first place, whether the integration result adds value, whether it violates privacy or security concerns, whether users will understand the result, what the learning curves are, etc.) is simply too diverse and too domain-dependent for a general-purpose metric to be developed, at least not within the time and budget constraints of the project. Further, adding a large human element to the experimental setting (e.g. how skilled the developers are, how well the stakeholders can explain their needs, how often these needs change) will lead to experiments that are so expensive to carry out, and whose results will have so many unquantifiable factors, that these will constitute an insuperable barrier to adoption.
Experience demonstrates that even agreeing on the relative importance of quantifiable metrics of database performance is hard enough. Overreaching would compromise the project’s ability to deliver its core value. Let us talk about this next.
It is only a natural part of the political landscape that the EC’s research funding choices are criticized by some members of the public. Some criticism is about the emphasis on big data. Big data is a fact on the ground, and research and industry need to deal with it. Of course there have been and will be critics of technology in general on moral or philosophical grounds. Instead of opening this topic, I will refer you to an article by Michael Brodie: http://www.michaelbrodie.com/michael_brodie_statement.asp. In a world where big data is a given, lowering the entry threshold for big data applications, thus making them available not only to government agencies and the largest businesses, seems ethical to me, as per Brodie’s checklist. LDBC will contribute to this by driving greater availability, better performance, and lower cost for these technologies.
Once we accept that big data is there and is important, we arrive at the issue of deriving actionable meaning from it. A prerequisite of deriving actionable meaning from big data is the ability to flexibly process this data. LDBC is about creating metrics for this. The prerequisites for flexibly working with data are fairly independent of the specific use case whereas the criteria of meaning, let alone actionable analysis, are very domain specific. Therefore in order to provide the greatest service to the broadest constituency, LDBC focuses on measuring that which is most generic, yet will underlie any decision support or other data processing deployment that involves RDF or graph data.
I would say that LDBC is an exceptionally effective use of taxpayer money. LDBC will produce metrics that will drive technology innovation for years to come. The total money spent towards pursuing goals set forth by LDBC is likely to vastly exceed the budget of LDBC. Just think of the person-centuries or even person-millennia that have gone into optimizing for TPC-C and TPC-H. The vast majority of the money spent on these pursuits is paid by industry, not by research funding. It is spent worldwide, not in Europe alone.
Thus, if LDBC is successful, a limited amount of EC research money will influence how much greater product development budgets are spent in the future. This multiplier effect applies to highly successful research outcomes in general, of course, but it is especially clear with LDBC.
European research funding has played a significant role in creating the foundations of the RDF/linked data scene. LDBC is a continuation of this policy, however the focus has now shifted to reflect the greater maturity of the technology. LDBC is now about making the RDF and graph database sectors into mature industries whose products can predictably tackle the challenges out there.
The mission of the LDBC can be compared to that of the Transaction Processing Performance Council (TPC), which Jim Gray founded in the area of relational database technology (www.tpc.org). LDBC will create a body in which vendors of RDF and graph database systems agree on relevant benchmarks and benchmark practices, and which will publish official benchmark results. The objective of the project is to highlight the functional and performance characteristics of Graph and RDF systems, vis-à-vis each other and vis-à-vis established relational data management technology. The motivation for this is to help IT practitioners understand and select Graph and RDF data management products, and thus to help the emerging Graph and RDF data management industry mature. Additionally, we hope that LDBC will spur competition and thereby accelerate technical progress.
- “agreeing on benchmark practices” means agreeing on the exact rules and metrics by which products can be compared. Without such rules, which include having benchmark results checked by independent auditors, it is very easy to skew any benchmark result in one’s favor: e.g. by precomputing (partial) answers, by implementing benchmark-special functionality, by not being open about hot or cold runs, or by comparing results on wholly different hardware (with wholly different price tags). There are many ways in which one can game a result.
- “agreeing on metrics” is important because, without balanced metrics, it is easy to pick the benchmark observations or statistics that favor one algorithm/system/product, conveniently forgetting about other metrics relevant to the benchmark on which performance may be less favorable. Often systems must make trade-offs, so a win on one metric can become a loss on another; see e.g. the difference between OLTP and OLAP workloads. This will include a notion of score-per-EURO (or $), taking hardware, software, and maintenance cost aspects into account in the results.
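To make the score-per-EURO idea concrete, here is a toy calculation. The formula is my own simplification for illustration, not an official LDBC metric: throughput divided by a three-year total system cost.

```python
# Illustrative price/performance metric (simplified; not an LDBC formula).
def score_per_euro(throughput, hw_cost, sw_cost, maintenance_per_year, years=3):
    """Queries-per-second per euro of total system cost over `years`."""
    total_cost = hw_cost + sw_cost + maintenance_per_year * years
    return throughput / total_cost

# e.g. 5000 qps on a system costing 20k + 5k + 3 * 5k = 40k EUR
print(score_per_euro(5000, 20_000, 5_000, 5_000))  # 0.125 qps/EUR
```

Note how the metric changes the competitive picture: a system that is 20% slower but half the price wins on price/performance, which is exactly why benchmark bodies insist on reporting cost alongside raw throughput.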
These points underline the industrial nature of the project, since such elements are not usually present in academic benchmarking work. The industry participants in LDBC include Ontotext, OpenLink, and Neo Technology (Neo4j), which are European industrial leaders in this emerging technological space. The council itself is international, so other companies will be able to join the non-profit body of the LDBC as well. More than ten such companies have approached LDBC already; effectively, the great majority of RDF and Graph database companies are interested. We expect the council to start growing by March 2013, when a non-profit legal entity for it will have been formed and membership will become formally possible.
The LDBC EU project also has research participants: UPC Barcelona, VU Amsterdam, the Technical University of Munich, FORTH, and STI Innsbruck. The research task is to kick-start the LDBC by helping to select and define an initial set of benchmarks. Even though benchmarks already exist for RDF and graph databases, aspects like cost metrics, rules for running the benchmark, and benchmark audits are generally underdeveloped; LDBC will therefore extend existing benchmark components where possible and create new ones where necessary. The academic partners have been selected to include groups with technical expertise in data management (e.g. RDF-3X in Munich; MonetDB and VectorWise in Amsterdam; Sparsity in Barcelona), so the benchmarks will stress systems in relevant areas, “where it hurts”, in order to maximize the potential for progress.
In order to ensure that the benchmarks represent usage scenarios that matter to technology users, LDBC has a Technical User Community (TUC). The TUC had its first meeting last week, November 19/20, in Barcelona; it was well attended and quite productive. A digital record can be found at: ldbc.eu:8090/display/TUC/First+TUC+meeting+Nov+2012
We see it as a sign of LDBC’s relevance that these users spent two days talking in depth about their technical challenges with Graph and RDF software, several of them flying in from the US at their own cost. The TUC includes participants from the publishing, life sciences, security, and marketing domains. The outcomes of the first TUC meeting have been used to set the direction of the first LDBC benchmark task forces; and the TUC will remain continuously involved, providing information on relevant datasets and workloads, and feedback on benchmark specifications as they evolve.
If this description got you interested, and specifically if you are a user of RDF, graph, or relational technology, we would like to invite you to take a short survey: http://goo.gl/PwGtK
More about the project, its activities, and its future benchmarks can be found at www.ldbc.eu. We are also on Twitter: @LDBCproject.
You can contact me via: larri “at” ac.upc.edu
Josep Lluis Larriba Pey
The TUC will kick off its activities on November 19/20, 2012 at the first scheduled meeting in Barcelona. An online questionnaire is available at http://goo.gl/PwGtK, where interested parties can contribute their experiences and needs for consideration in the LDBC benchmarks.