News and Updates on the KRR Group

On the use of HTTP URIs and the archiving of Linked Data

Posted by Wouter Beek in semantic web | Semantic Web meeting

On Monday the 7th of October the Knowledge Representation and Reasoning (KRR) and Web&Media research groups of the VU University Amsterdam held their joint biweekly Semantic Web (SW) meeting. The topic was the purpose of using HTTP URIs for denoting SW resources, and the implications for archiving Linked Data (LD).

An important aspect of archiving LD is that, in the process, the data is decoupled from its native Web environment (hence the title of the talk). The two most important Web-based properties that are lost in the process are (1) authority and (2) dereferenceability. We first discussed the relevance of both properties.

Authority

In RFC 3986, Uniform Resource Identifier (URI): Generic Syntax [1], the authority of a URI/IRI is defined as follows:

Many URI schemes include a hierarchical element for a naming authority so that governance of the name space defined by the remainder of the URI is delegated to that authority [...].

Authorities & identity

In the discussion on the use of authorities, the ambiguity of the identity relation was brought up. Within-namespace identity defines an equivalence relation on the data, but between-namespace identity is not even symmetric.

This ties into one of the fundamental principles of RDF, namely that “Anyone can make statements about any resource” [2]. But even though anyone can assert a triple, the fact that a triple is hosted by the same authority that governs the namespace of the URI in the triple's subject position is indeed significant.

E.g. if a triple at http://dbpedia.org states that dbpedia:George_H._W._Bush was a US president and another triple hosted at some arbitrary Web location states that dbpedia:George_H._W._Bush was a charlatan, then the former triple is relatively easy to find (every dereference of dbpedia:George_H._W._Bush will return it), whereas the latter may be very difficult to find. This has a parallel on the traditional Web, where a crawler can easily identify outgoing links, but back-links are more difficult to find.[3]

This means that non-authoritative triples that are too opinionated are given less attention. Another way of formulating this is: alternative views regarding a topic are given less attention.
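To make this concrete, here is a minimal sketch in Python (assuming the rdflib library; my illustration, not something used at the meeting) of what this asymmetry looks like for an agent: dereferencing the URI yields exactly the triples that dbpedia.org, the authority, hosts about the resource, while the non-authoritative ‘charlatan’ triple never turns up.

```python
from rdflib import Graph, URIRef

bush = URIRef("http://dbpedia.org/resource/George_H._W._Bush")

# Dereference the URI; rdflib performs HTTP content negotiation for an
# RDF serialization and follows the 303 redirect to the RDF document.
g = Graph()
g.parse(bush)

# Every triple obtained this way is authoritative by construction;
# triples asserted about this URI elsewhere on the Web are not found.
for s, p, o in g.triples((bush, None, None)):
    print(p, o)
```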

Dereferenceability

From a show of hands at the meeting it became apparent that many SW researchers actually use dereferenceability! Some even think it is one of the nice things about the SW. However, our research groups may not be representative on this topic: most of the LOD appears not to be dereferenceable today.
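To make that claim testable, here is a rough sketch (Python with the requests library; the checked URIs are merely examples) of how one might probe whether a LOD URI dereferences to RDF:

```python
import requests

RDF_TYPES = ("text/turtle", "application/rdf+xml", "application/n-triples")

def dereferenceable(uri: str) -> bool:
    """Return True if the URI resolves to an RDF representation."""
    try:
        r = requests.get(uri, headers={"Accept": ", ".join(RDF_TYPES)},
                         timeout=10, allow_redirects=True)
    except requests.RequestException:
        return False  # timeouts and connection errors count as dead
    ctype = r.headers.get("Content-Type", "")
    return r.ok and any(t in ctype for t in RDF_TYPES)

for uri in ["http://dbpedia.org/resource/Amsterdam",
            "http://example.org/defunct-dataset/item/42"]:
    print(uri, dereferenceable(uri))
```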

What we did not talk much about is whether people outside of the LD community, i.e. normal users, use these dereferencing facilities. Surely some of the faceted browsers out there (green-background tables!, e.g. [4]) will not entice many users, but I may be unaware of some of the innovations that the Web&Media people have come up with in this area. ClioPatria [5], for instance, provides a standard way of dereferencing and loading linked data sources that are not already in the local triple store.

Authority & dereferenceability (interaction)

Authority and dereferenceability are intertwined in the following two ways:

  1. The existence of authorities on the Web makes dereferencing possible, since the name of a resource contains the location of the authority that disseminates it (see the sketch after this list).
  2. Dereferencing is a way in which authority is effectuated, because it allows the view of the authority that disseminates a specific topic to be retrieved easily and in a standard way.
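A trivial sketch of point 1, using only the Python standard library: the authority is part of the URI's syntax, so the name itself tells a client whom to contact.

```python
from urllib.parse import urlsplit

uri = "http://dbpedia.org/resource/George_H._W._Bush"
parts = urlsplit(uri)

# The 'netloc' component is the naming authority of RFC 3986:
# it tells a client which host to contact to dereference the name.
print(parts.netloc)  # -> dbpedia.org
print(parts.path)    # -> /resource/George_H._W._Bush
```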

Permanence

A third property of the use of HTTP URIs is permanence [6]: the service behind a URI should have continuous uptime and should be available for a long period of time. Frank pointed out that the traditional Web never suffered much from 404s (although this used to be a big concern for critics of the early Internet). The problem may be different on the SW, since it may be harder for machine agents to recover from link rot than it is for human agents.
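One conceivable way for a machine agent to recover from link rot is to fall back to a Web archive. A hypothetical sketch using the Internet Archive's Wayback availability endpoint (the endpoint and its response shape are an assumption on my part, not something discussed at the meeting):

```python
import requests

def fetch_with_fallback(uri: str) -> bytes:
    """GET a resource; on link rot, try the Internet Archive instead."""
    r = requests.get(uri, timeout=10)
    if r.ok:
        return r.content
    # Recover from a 404/410 by asking the Wayback Machine for a snapshot.
    avail = requests.get("http://archive.org/wayback/available",
                         params={"url": uri}, timeout=10).json()
    closest = avail.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return requests.get(closest["url"], timeout=10).content
    raise RuntimeError(f"Link rot: no live or archived copy of {uri}")
```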

Archiving LD: ‘dead’ or ‘alive’?

What I got from the discussion on archiving LD is that it is basically fine to decouple LD from the Web for archiving purposes and that:

    1. dereferenceability is not that important in the case of archived data since it is considered ‘dead’ (i.e. not actively used), and that
    2. the original URI authority should be replaced by meta-data describing the authority (since triples will be stored under a different authority when archived by DANS).

It was pointed out that a potential discrepancy exists between traditional archiving and LD archiving. E.g. in the humanities, curated archives are often considered the most ‘alive’, i.e. the most quoted, referred to, and used in active research. But storing LD decoupled from its original Web architecture may turn it into ‘dead’ snapshots (whereas active use will be made of the non-archived versions of the LOD).

The use of the words ‘dead’ and ‘alive’ was of course metaphorical. As Antoine Isaac pointed out: “The very idea of an archive is that you’re ready to accept a lower level of accessibility of stuff in exchange for longer-term preservation of it.” By the way, Antoine is working on the Prelida project [7], one of whose purposes is to inform the LD community of Digital Preservation techniques. A very relevant project for this topic!

Archiving: human or machine processing?

The proponents of archived LD dumps assume that, when the data is used at a later point in time, human agents (e.g. historians) will be needed to add context. This may conflict with the machine-processable nature of SW data, although some of the context may be stored as well using a dataset description language [8].
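For instance, part of that context could itself be recorded machine-processably. A sketch with rdflib using the VoID and Dublin Core vocabularies (whether [8] refers to VoID is my assumption; the URIs are hypothetical):

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, XSD

VOID = Namespace("http://rdfs.org/ns/void#")
archive = URIRef("http://example.org/archive/dbpedia-2013-10-07")  # hypothetical

g = Graph()
g.add((archive, RDF.type, VOID.Dataset))
# Record the original authority, which the archived copy no longer has.
g.add((archive, DCTERMS.source, URIRef("http://dbpedia.org/")))
g.add((archive, DCTERMS.issued, Literal("2013-10-07", datatype=XSD.date)))
g.add((archive, VOID.dataDump, URIRef("http://example.org/dumps/dbpedia.nt.gz")))

print(g.serialize(format="turtle"))
```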

Antoine observed that one of the reasons why the question of how much context to include arises when archiving LD is that context data is so easy to obtain for LD (e.g. using automated crawling methods) compared to e.g. offline content.

Archiving URI meaning / context

Another interesting point is the question of how much context needs to be stored in order for archived LD to retain its original meaning. This depends on the theory of meaning one adheres to w.r.t. the SW. Memento’s [9] approach is to archive the dereferenced content. Implicit in this approach is the assumption that the meaning of a resource can be given in a local description. We see this assumption formulated in the Cool URIs Interest Group Note [10]: “[A dereferencing] look-up mechanism is important to establish shared understanding of what a URI identifies.”

An alternative view is that the meaning of a URI is determined by the set of inferential roles of the triples in which the URI occurs.[11] In this case, all triples related to the archived triples must be stored as well in order for the resource’s meaning to be (fully) retained.
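Under that view, an archiver would have to harvest all triples in which the URI occurs. A rough sketch against the public DBpedia SPARQL endpoint (Python with SPARQLWrapper; this collects a single hop only, whereas a full inferential closure would have to iterate):

```python
from SPARQLWrapper import SPARQLWrapper

# Harvest every triple in which the URI occurs, as subject or object.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    CONSTRUCT {
      <http://dbpedia.org/resource/George_H._W._Bush> ?p ?o .
      ?s ?q <http://dbpedia.org/resource/George_H._W._Bush> .
    }
    WHERE {
      { <http://dbpedia.org/resource/George_H._W._Bush> ?p ?o }
      UNION
      { ?s ?q <http://dbpedia.org/resource/George_H._W._Bush> }
    }
""")

# SPARQLWrapper converts a CONSTRUCT result into an rdflib Graph:
# one hop of the URI's occurrence context.
context = sparql.queryAndConvert()
print(len(context), "context triples harvested")
```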

Archiving LD by archiving HTML pages

Ed Summers wrote a blog post [12] recently in which he explains how ‘traditional’ Web archiving may be used for LD archiving. If RDFa or Microformats are embedded in an HTML Web page, then this data can be extracted from the stored HTML.
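A sketch of that extraction step (Python, assuming the third-party extruct library; any RDFa-capable parser would do):

```python
import extruct

# An archived HTML page with an embedded RDFa annotation.
html = """
<html><body>
  <div about="http://dbpedia.org/resource/Amsterdam"
       xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <span property="rdfs:label">Amsterdam</span>
  </div>
</body></html>
"""

# Pull the RDFa (and any microformats) back out of the stored page.
data = extruct.extract(html, base_url="http://example.org/archived-page",
                       syntaxes=["rdfa", "microformat"])
print(data["rdfa"])
```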

Dataset availability

Archiving multiple versions of a dataset is obviously very difficult. Christophe Guéret pointed to a discussion on the open-government mailing list regarding the US government shutdown [13] as an example of how difficult it is to reliably disseminate even the latest version of a government-backed dataset. (On a related note, we have seen difficulties in the teaching context as well, where a remote dataset goes offline in the midst of a course in which students are querying that dataset for the applications they develop.) In cases such as the US government shutdown, important datasets may go offline overnight, causing issues both for applications that depend on them and for other datasets that link to them.


One Response

  • From this summary, it looks like this was a really interesting discussion. I wish I could have been there. I have a few comments:

    * Regarding the mention of Memento in the section “Archiving URI meaning/context”: Memento says nothing about archiving per se. It leverages existing archives and it offers a way to access archives in an interoperable way. It just happens to be the case that my team decided to deploy an archive for dbpedia.org versions as a way to illustrate the potential of the Memento protocol for Linked Data. And that happens, indeed, to be a “meaning”- rather than “context”-oriented archive: only descriptions of DBpedia resources are included. The reason is very basic: ingesting, storing, and providing access to all versions of DBpedia descriptions consumes resources, and our DBpedia archive is operated as a little side project. But anyhow, the Memento protocol has nothing to say about the meaning versus context notion, and a Memento-compliant Linked Data archive could fit under either category. Just like traditional web archives send out bots to archive a web page and its linked resources, a linked data archival bot could go out and archive the description available from a given URI as well as the descriptions (and ontologies) for URIs embedded in that description. A nice thing about Memento as a protocol is that its inherently distributed nature allows all those descriptions to be available from different archives; no need to stuff them all in one place.

    * The Memento protocol, which will achieve RFC status soon, has something to offer in the realm of authority and dereferencing:
    - Authority: A site that publishes LD can point to its preferred Memento-compliant archive. In the protocol, this is achieved by including a link in the HTTP Link header in the response to an HTTP HEAD/GET request against the URI of a LD description. The link has the “timegate” relation type and points to a “TimeGate” for that URI in the archive. This approach is operational in DBpedia: for example, the HTTP response headers for http://dbpedia.org/data/Paris include a Link header of the form <…>; rel="timegate", where <…> is the URI of the TimeGate.
    - Dereferencing: The URI of a LD description can be dereferenced subject to time with the Memento protocol. Step 1: Do an HTTP HEAD on the URI to determine what the TimeGate for the resource is, by looking at the HTTP Link header as described previously. Step 2: Follow the link to the TimeGate. Step 3: Perform datetime negotiation with the TimeGate to obtain a version of the LD description at a required datetime.
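A minimal sketch of these three steps (Python with the requests library; the target datetime is arbitrary, and the sketch assumes the TimeGate link is advertised as described above):

```python
import requests

uri = "http://dbpedia.org/data/Paris"

# Step 1: HEAD the live URI; requests parses the HTTP Link header for us.
head = requests.head(uri, allow_redirects=True, timeout=10)
timegate = head.links.get("timegate", {}).get("url")

# Steps 2 and 3: follow the link and perform datetime negotiation by
# sending Accept-Datetime; the TimeGate redirects to the closest Memento.
if timegate:
    memento = requests.get(timegate, timeout=10, headers={
        "Accept-Datetime": "Mon, 07 Oct 2013 12:00:00 GMT"})
    print(memento.url)                              # URI of the Memento
    print(memento.headers.get("Memento-Datetime"))  # its archival datetime
```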

    * There’s an interesting issue that does not seem to have been discussed: how to actually store/disseminate archived LD descriptions. Our DBpedia archive stores those descriptions “as is”, meaning URIs in the descriptions appear in the same way they did when the descriptions were live. For Memento-aware clients this works just fine: a Memento-Datetime header informs the client that the description is an archival/frozen one, and the client can dereference URIs embedded in the description subject to time: in essence, a Memento-aware client can traverse the LD cloud as it existed at a given moment in time, provided archival descriptions exist in Memento-compliant archives. But what to do with clients that are not Memento-aware? If such a client happens upon an archival/frozen description, it doesn’t understand this is the case and it receives a description that it assumes to be current. It follows embedded links and in doing so it mixes past (the archival/frozen description) and present (the embedded URIs that are dereferenced in the present). One way to deal with this problem would be to rewrite embedded URIs as DURIs: a Memento-aware client could dereference these in an appropriate manner, while a client without Memento capabilities would not be able to dereference these and hence would not risk mixing past and present data. Note that a similar issue exists in classic web archives and that some of those (for example the Internet Archive) rewrite embedded URIs to point back into the archive rather than to the current web. There are issues with this approach, but those go beyond this particular discussion.

    * I wrote up a document about reference rot. It focuses on HTML, but the essence of the argument is about providing temporal context information for resources linked from an HTML page, i.e. what the intended datetime for the link is and what the URI of an archived version of the linked resource is. It would be interesting to explore how such information could be provided for URIs listed in Linked Data descriptions, to achieve the goal of being able to interpret the URIs at the appropriate datetime.


