News and Updates on the KRR Group

Although LarKC is based in Europe, the project of building and applying web-scale reasoning is worldwide. One of the most exciting things about living in a connected world, with abundant, location-independent computational resources, is that people anywhere can do world-class AI research and develop applications based on it. The recent, very rapid increase in internet bandwidth going into Africa means that one can now use Shazam to get impromptu karaoke lyrics for the Texas country-and-western playing in a hotel bar in Accra. It also means that previously isolated African researchers can make a full contribution to the advance of semantic technology. In February, partially supported by the FP7 Active project, we had the opportunity to present LarKC, and the potential benefits of AI and human-computer collaboration, to students and researchers at the Ghana-India Kofi Annan Centre of Excellence in ICT in Ghana. Discussion following the talks was lively, with great local ideas for applying AI to knowledge capture from small farmers and to resource allocation for rural health care. Video from some talks is being made available, there was good coverage from the local media, and we look forward to building a collaboration with our new colleagues.

Source: Think Links

I’ve posted a couple of times on this blog about events organized at the VU University Amsterdam to encourage interdisciplinary collaboration. One of the major issues to come out of these prior events was that data sharing is a critical mechanism for enabling interdisciplinary research. However, it’s often difficult for scientists to know:

  1. who has what data, and
  2. whether that data is interesting to them.

This second point is important. Because different disciplines use different vocabularies, it is often hard to tell whether a data set is truly useful or interesting in the context of a new domain. What is data for one domain may or may not be data in another.

To help bridge this gap, Iina Hellsten (Organizational Science), Leonie Houtman (Business Research) and I (Computer Science) organized a Network Institute workshop this past Wednesday (March 23, 2011) titled What is Data?

The goal of the workshop was to bring people together from these different domains to discuss the data they use in their everyday practice and to describe what makes data useful to them.

Our goal wasn’t to come up with a philosophical answer to the question but instead to build a map of what researchers from these disciplines consider useful data. More important, however, was bringing these researchers together to talk to one another.

I was very impressed with the turnout. Around 25 people showed up from social science, business/management research and computer science. Critically, the attendees were fully engaged and together produced a fantastic result.

The attendees

The Process

To build a map of data, we used a variant of a classic knowledge acquisition technique called card sorting. The attendees were divided into groups (shown above), making sure that each group had a mix of researchers from each discipline. Within each group, every researcher was asked to give examples of the data they worked with on a daily basis and to explain to the others a bit about what they did with that data. This was a chance for people to get to know each other and have discussions in smaller groups. At the end of this step, each group had a pile of index cards with examples of data sets.

Writing down example data sets

The groups were then asked to sort these examples into collections and give each collection a label. This was probably the most difficult part of the process and led to lots of interesting discussions:

Discussion about grouping

Here’s an example result from one of the groups (the green post-it notes are the collection labels):

Sorted cards

The next step was that everyone in the room got to walk around and label the example data sets from all groups with the attributes that were important to them. For example, a social networking data set is interesting to me if I can access it programmatically. Each discipline got its own color: pink = computer science, orange = social science, yellow = management science.

This resulted in very colorful tables:

After labelling

Once this process was complete, we merged the groupings from the various tables by data set and category (i.e. collection label), leading to a map of data sets:

The Results

A Map of Data

Above is the map created by the group. You can find a (more or less faithful) transcription of the map here. Here are some highlights.

There were 10 categories of data:

  1. Elicited data (e.g. surveys)
  2. Data based on measurement (e.g. logfiles)
  3. Data with a particular format (e.g. XML)
  4. Structured-only data (e.g. databases)
  5. Machine data (e.g. results of a simulation)
  6. Textual data (e.g. interview transcripts)
  7. Social data (e.g. email)
  8. Indexed data (e.g. Web of Science)
  9. Data useful for both quantitative and qualitative analysis (e.g. newspapers)
  10. Data about the researchers themselves (e.g. how did they do an analysis)

After transcribing the data, I would say that computer scientists are interested in having strong structure in the data, whereas social scientists and business scientists are deeply concerned with having high-quality data that is representative, credible, and collected with care. Across all disciplines, temporality (or having things on a timeline) seemed to be a critical attribute of useful data.
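
For readers who want to repeat the exercise, here is a minimal sketch in Python of the merge step described in the process above: combining each group's sorted cards by category and data set, then attaching the color-coded attribute labels per discipline. Everything in it is an assumption made for illustration. The function name `merge_tables`, the data structures, and the sample entries are hypothetical and are not the workshop's actual transcription.

```python
from collections import defaultdict

# Hypothetical card-sort output: one dict per group, mapping each
# collection label to the example data sets filed under it.
group_tables = [
    {"Social data": ["email"], "Elicited data": ["surveys"]},
    {"Social data": ["email"], "Textual data": ["interview transcripts"]},
]

# Hypothetical attribute labels: (data set, discipline, attribute).
attribute_labels = [
    ("email", "computer science", "programmatic access"),
    ("email", "social science", "representative"),
    ("surveys", "management science", "collected with care"),
]


def merge_tables(tables, labels):
    """Build category -> data set -> discipline -> set of attributes."""
    data_map = defaultdict(dict)
    # Merge the groups' tables; a data set named by several groups
    # under the same category appears only once.
    for table in tables:
        for category, data_sets in table.items():
            for data_set in data_sets:
                data_map[category].setdefault(data_set, defaultdict(set))
    # Attach each attribute label to every category holding that data set.
    for data_set, discipline, attribute in labels:
        for category in data_map:
            if data_set in data_map[category]:
                data_map[category][data_set][discipline].add(attribute)
    return data_map


data_map = merge_tables(group_tables, attribute_labels)
for category in sorted(data_map):
    print(category, sorted(data_map[category]))
```

Keeping the attributes keyed by discipline preserves the color coding from the tables, so it stays possible to compare what each discipline asks of the same data set.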

What’s next?

At the end of the workshop, we discussed where to go from here. The plan is to have a follow-up workshop where each discipline can present its own data sets using these categorizations. To help focus the workshop, we are looking for two interdisciplinary teams within the VU that are willing to try data sharing and present the results of that trial at the workshop. If you have a data set you would like to share, please post it to the Network Institute LinkedIn group. Once you have a team, let me, Leonie, or Iina know.




Filed under: academia, interdisciplinary research

An interesting peek into Microsoft’s kitchen (the Beijing labs, by the looks of it): Probase, and the ReadWriteWeb writeup on it. It’s a very large web-fed knowledge base, including concept hierarchies (2.7 million concepts, 4.5 million subclass relations, 16 million instances). It includes all major knowledge sources (Freebase, WordNet, Cyc, DBPedia, Yago, among others), with pretty well-researched quality measures. Unfortunately, none of the data is Linked in any way, and none of it is available, let alone in some standard format. This is interestingly different from IBM’s Watson knowledge base, which is mostly filled with knowledge extracted from linguistic sources (although structured data does play a limited role). Probase seems to rely much more on structured knowledge sources.

Web of Data Interpreter (WoDI) is a recently launched spin-off company from the LarKC project, currently located in Innsbruck, Austria. WoDI targets the implementation of intelligent tools and methods for accessing, reasoning over, and consuming linked data. The main areas of WoDI innovation are scalable reasoning with rules and streams of […]