URIs in data: An ideology-free analytic

Henry S. Thompson
Jonathan Rees
Jeni Tennison
4 May 2013

1. Acknowledgements

I owe much of what I understand about the Web to Tim Berners-Lee, along with Dan Connolly, Harry Halpin and Larry Masinter

Brian Cantwell Smith has hugely influenced my approach to all things computational

Jonathan and Jeni, my co-authors, haven't vetted these slides, and so can't be held responsible for anything contained herein with which you disagree

2. The core argument

  1. The foundational Web standards left the question of (how we find out) the 'meaning', or 'referent' of URIs under-, if not altogether un-, specified
  2. Berners-Lee knew what he meant, and his view is reflected in AWWW
  3. But underspecification amounts to an extension point
  4. And other extensions, incompatible with Berners-Lee's, have emerged and been widely implemented
  5. The TAG's effort to settle the question after-the-fact, essentially by endorsing Berners-Lee's extension, has not succeeded
  6. Mutually incompatible extensions at the same extension point can lead to failures of interoperability
  7. Machine-actionable documentation can restore interoperability without prejudice to at least two common extensions

3. Narrowing the focus: URIs and HTTP

The URIs this discussion focusses on are limited

  • http: scheme
  • hashless
  • 'retrieval-enabled':
    • an HTTP GET request for the URI will result in an HTTP 200 response
    • (in the absence of connectivity issues, construed widely)

Restricted, but a very substantial part of the usage of URIs in data

And the only HTTP operation we address is GET
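The restrictions above can be checked mechanically. What follows is a minimal sketch in Python, not part of the original talk: the function names and the injectable `fetch_status` parameter are assumptions made for illustration, so that the syntactic tests can be exercised without network access. Note that `urllib` follows redirects by default, so the status reported is that of the final response; a stricter reading of 'retrieval-enabled' might want to disable that.

```python
from typing import Callable
from urllib.error import HTTPError
from urllib.parse import urlsplit
from urllib.request import urlopen


def is_hashless_http(uri: str) -> bool:
    """True if the URI uses the http: scheme and contains no '#' (no fragment)."""
    return urlsplit(uri).scheme == "http" and "#" not in uri


def http_get_status(uri: str) -> int:
    """Issue an HTTP GET and return the status code (requires connectivity)."""
    try:
        with urlopen(uri) as response:
            return response.status
    except HTTPError as err:
        # urllib raises on 4xx/5xx responses; report the code instead
        return err.code


def is_in_scope(uri: str,
                fetch_status: Callable[[str], int] = http_get_status) -> bool:
    """True if the URI is http:, hashless and 'retrieval-enabled' (GET gives 200)."""
    return is_hashless_http(uri) and fetch_status(uri) == 200
```

Passing a stand-in for `fetch_status` (for instance `lambda uri: 200`) lets the scope test be tried offline; connectivity failures are deliberately left to propagate as exceptions rather than being treated as out of scope.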

4. Narrowing the focus: extensions

By considering the use of URIs in data, we can exemplify two common usage patterns which reveal more-or-less covert allegiance to two distinct extensions

  • Without actually taking a stance on what any of the controversial words mean

Sometimes information in data involving a URI is evidently about what you can retrieve from that URI:

{           "@id": "http://www.w3.org/People/Berners-Lee/",
  "last modified": "2012-06-08" }

{           "@id": "http://www.websci13.org/files/2013/04/ht-317x370.jpg",
     "resolution": "72x72"}
  • [We use JSON for our examples throughout, and assume the widespread '@id' convention]

Sometimes information in data involving a URI is evidently about what is described/depicted by what you can retrieve from that URI:

{           "@id": "http://www.w3.org/People/Berners-Lee/",
       "birthday": "1955-06-08" }

{           "@id": "http://www.websci13.org/files/2013/04/ht-317x370.jpg",
        "surname": "Thompson"}

The key word here is 'evidently' -- it's evident to us, but not, at least not without help, to computers

5. Failures of interoperability 1: Aggregation

The value proposition of the Web:

  • URIs in data enable information aggregation
  • "No-one ever expects the network effect"

What happens if data based on two different extensions are aggregated?

{           "@id": "http://www.w3.org/People/Berners-Lee/",
       "birthday": "1955-06-08",
  "last modified": "2012-06-08" }

At best this is confused, at worst potentially destructive of reliable inference
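To see how easily this happens, here is a minimal sketch of a naive aggregator in Python. It is an illustration only: the `aggregate` function and the variable names are hypothetical, and merging by '@id' alone is assumed as the simplest plausible aggregation strategy, not as anything the talk prescribes.

```python
from collections import defaultdict


def aggregate(records):
    """Merge records that share an '@id' into a single record per URI."""
    merged = defaultdict(dict)
    for record in records:
        # All properties attached to the same URI end up in one record,
        # regardless of which extension their publisher had in mind.
        merged[record["@id"]].update(record)
    return dict(merged)


# One source is talking about the page, the other about the person.
page_data = {"@id": "http://www.w3.org/People/Berners-Lee/",
             "last modified": "2012-06-08"}
person_data = {"@id": "http://www.w3.org/People/Berners-Lee/",
               "birthday": "1955-06-08"}

combined = aggregate([page_data, person_data])
```

Nothing in `combined` records that the two properties were meant to be about different things; the distinction that was evident to each publisher is gone once the data are merged.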

6. Failures of interoperability 2: Landing pages

First, some terminology around the case where data containing URIs is about whatever is described/depicted by what you can retrieve from them

Proxy page
    The describing/depicting result of retrieval in such a case
Landing page
    A proxy page which describes something (commonly an image) which is itself retrievable via URI
Retrievable
    What you can retrieve from a URI

So for example http://www.w3.org/People/Berners-Lee/ is a proxy page for Berners-Lee and http://www.w3.org/2011/05/w3cteam.html is a landing page for http://www.w3.org/2011/05/team-photo.jpg

The image and the descriptions are all retrievables

  • Berners-Lee is not a retrievable

7. Landing pages, cont'd

How do we (even we humans) understand data involving the URIs of landing pages?

  • When it includes properties which apply to retrievables?

For example

{           "@id": "http://www.w3.org/2011/05/w3cteam.html",
  "last modified": "2012-06-08" }

Does the information here apply to the landing page, or to the image?

This is not a hypothetical: metadata found on e.g. Flickr landing pages is often, but not always, about the image linked from that page, not about the page itself

8. Towards a solution

Consider what we have seen in terms of relationships:

[Diagram relating URIs, retrievables etc.]
  • ERf: entity retrieved/retrievable from
  • EDb: entity described/depicted by
  • immediate property: supplies information about ERf(U), e.g. last_modified
  • shorthand property: supplies information about EDb(ERf(U)), e.g. birthday

Machine-readable documentation of properties as to whether they are immediate or shorthand would enable interoperability

  • at least with respect to the two problems identified earlier
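As a rough illustration of what such documentation could make possible, here is a minimal Python sketch. The `PROPERTY_KINDS` table, the `interpret` function and the particular classifications in it are hypothetical, chosen to match the examples used earlier; no existing vocabulary or format for this documentation is implied.

```python
IMMEDIATE = "immediate"   # property is about ERf(U)
SHORTHAND = "shorthand"   # property is about EDb(ERf(U))

# Hypothetical machine-readable documentation of properties
PROPERTY_KINDS = {
    "last modified": IMMEDIATE,
    "resolution": IMMEDIATE,
    "birthday": SHORTHAND,
    "surname": SHORTHAND,
}


def interpret(record, kinds=PROPERTY_KINDS):
    """Return (subject, property, value) triples with the subject made explicit."""
    uri = record["@id"]
    subjects = {IMMEDIATE: f"ERf({uri})", SHORTHAND: f"EDb(ERf({uri}))"}
    statements = []
    for prop, value in record.items():
        if prop == "@id":
            continue
        if prop not in kinds:
            # Without documentation we cannot tell what the property is about.
            raise KeyError(f"undocumented property: {prop!r}")
        statements.append((subjects[kinds[prop]], prop, value))
    return statements
```

Applied to a record that carries both "birthday" and "last modified" for the same '@id', `interpret` attributes the former to EDb(ERf(U)) and the latter to ERf(U), so the aggregated data no longer conflate the two. The same table settles the landing-page question for any documented property: an immediate property is about the landing page itself.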

9. Conclusion

This talk has been an attempt at a proof-of-concept

  • To move the conversation around httpRange-14 away from turf-wars with respect to the meaning etc. of URIs
  • And towards a pragmatically-grounded acknowledgement that reasonable people do differ

And an introduction to an analysis of usage which provides the basis for an approach to interoperability based on documentation of properties