An introduction to naming and reference on the Web

Henry S. Thompson
28 January 2012

1. Acknowledgements

I owe much of what I understand about the Web to Tim Berners-Lee, along with Dan Connolly, Harry Halpin, Jonathan Rees and Larry Masinter

Brian Cantwell Smith has hugely influenced my approach to all things computational

• He first brought the flight name examples to my attention over 30 years ago
• And stimulated me to think harder about the URI/link distinction just a few months ago. . .

2. The Web demands our attention

To say the Web is ubiquitous, at least in the so-called developed world, is commonplace to the point of vacuity

Ubiquity alone doesn't require philosophical enquiry

• After all, paved roads are ubiquitous (and important) too, but they don't engender a lot of philosophical interest

What I aim to show you today is that we do need to study the mechanisms of the Web

• In particular, that means the thing that makes the Web a web

3. Why focus on URIs?

URIs turn out to be something quite new

And getting agreement on what they really are has proved. . .challenging

Understanding them is a crucial part of understanding how to make the Web work better for everyone

Their very name proclaims them to be names

• But they are not quite like any kind of name we've ever seen before

4. The official story of URIs

There is a moderately clear official consensus about URIs

• Explicitly specified in IETF's RFC 3986
• Summarised and extended in W3C's WebArch

'URI' stands for

Uniform
There's a standard generic syntax and (partial) semantics
Identifier
URIs are identifiers, that is, names (not addresses)
Resource
What it is they identify -- anything at all

5. URI syntax: some terminology

We will need to refer to parts of URIs as they are written:

More examples:

• `http://www.ed.ac.uk/`
• `ftp://ftp.funet.fi/pub/mirrors/perl/`
• `mailto:ht@inf.ed.ac.uk` [that's a bit weird. . .]
• `file:///D:/Documents/HTalks/PhilWeb_2012/slides.html` [this talk]
• `http://localhost/ht/Documents/HTalks/PhilWeb_2012/slides.html` [likewise]
• `http://www.ltg.ed.ac.uk/~ht/travel.html#travel`
• `http://maps.google.co.uk/maps?q=10+Crichton+Street,+Edinburgh&layer=c&sll=55.944586,-3.187494&cbp=13,332.96,,0,1.84&cbll=55.944532,-3.18738&hl=en&sspn=0.006295,0.006295&ie=UTF8&hq=&hnear=10+Crichton+St,+Edinburgh+EH8,+United+Kingdom&ll=55.944532,-3.18738&spn=0.000012,0.006727&t=m&z=17&vpsrc=0&panoid=e8EPc2P5vtC6b-oE8SlGiw`
• `https://mail.google.com/mail/?ui=2&shva=1#inbox`
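
The parts of a URI can be pulled apart mechanically. What follows is a minimal sketch, assuming Python's standard `urllib.parse` module (not something the talk itself relies on). The component names are those of the generic syntax in RFC 3986, and the helper `parts` is a name made up for illustration.

```python
from urllib.parse import urlsplit

# Three of the example URIs from the list above.
examples = [
    "http://www.ltg.ed.ac.uk/~ht/travel.html#travel",
    "https://mail.google.com/mail/?ui=2&shva=1#inbox",
    "mailto:ht@inf.ed.ac.uk",
]

def parts(uri):
    """Return the five generic components of a URI as a dict (hypothetical helper)."""
    s = urlsplit(uri)
    return {
        "scheme": s.scheme,
        "authority": s.netloc,
        "path": s.path,
        "query": s.query,
        "fragment": s.fragment,
    }

for uri in examples:
    print(parts(uri))
```

Note that the `mailto` example has no authority at all: everything after the colon is treated as the path, which is part of why it looks 'a bit weird'.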

6. Behind the scenes

Before we dive further into the details, let's try to bring what's hidden out into view

Put ILCC people page in a separate window & view source

1. User clicks on a link in a page displayed by a Web browser
2. Browser looks at URI behind the link and identifies `http` scheme
3. Browser looks up domain name part of the URI via DNS and gets an IP address
4. Browser sends HTTP GET message to port 80 at that address with the path part of the URI
5. Browser computer (the client) operating system and Ethernet hardware
   1. Disassemble the message into packets
   2. Send them via TCP/IP to a gateway
   3. Which routes them through the Internet to their destination
   4. Where they are reassembled into a message by the destination computer (the server) hardware and operating system
   5. And delivered to the computer program (the web server) listening on port 80
6. Web server maps path onto its file system
7. Server sends header specifying mime type `text/html` plus file contents back to the client [see above wrt TCP/IP]
8. Client dispatches response to browser
9. Notice the URI appears in the address bar
10. Browser interprets message body as HTML, displays page to user

We'll call the process embodied in steps 3–8 accessing the URI
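
Steps 2 to 4 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `build_get_request` is hypothetical, only the `http` scheme is handled, and the message is built but never sent, so no DNS lookup or network traffic actually happens.

```python
from urllib.parse import urlsplit

def build_get_request(uri):
    """Return (host, port, message) for an HTTP GET on uri, without sending it."""
    s = urlsplit(uri)
    # Step 2: identify the scheme behind the link.
    if s.scheme != "http":
        raise ValueError("this sketch only handles the http scheme")
    # Step 3 would resolve this host name to an IP address via DNS
    # (for instance with socket.gethostbyname); omitted here.
    host = s.hostname
    port = s.port or 80
    # Step 4: the path part (plus any query) goes into the GET message.
    path = s.path or "/"
    if s.query:
        path += "?" + s.query
    message = f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"
    return host, port, message

host, port, message = build_get_request("http://www.ed.ac.uk/")
print(host, port)
print(message)
```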

7. The nature of resources

Back to URIs

At first 'resource' seems like a vacuous label

• If anything at all can be a resource
• Knowing something is a resource tells you nothing

But consider the word referent

• Anything can be a referent
• When/if someone/something refers to it!

So, being a resource is not about some intrinsic property of something

• It just means, being identified by a URI

Who gets to say what the resource is that a URI identifies?

• The owner of the URI
• Just like the coiner of a new word

We'll use the phrase "referent of a URI" for the resource which the minter of a URI intends for it to identify

Who is the owner (sometimes also called the 'minter')?

• For the kinds of URIs we care about
• (mostly ones that begin `http` and `https`, maybe `ftp` a bit)
• the owner of the URI is the owner of the domain name at the beginning of the URI
• Actually, domain names aren't owned, only leased
• But that's a can of worms I don't propose to open further today

In principle, all URIs are opaque

• That is, no conclusions about the shape of the space of referents should be drawn from the shape of the set of their identifiers.
• In natural languages this is called l'arbitraire du signe—the arbitrariness of the sign
• You can't tell anything about the relationship between cats, dogs and puppies just from the words 'cat', 'dog' and 'puppy'

In practice, for most `http` URIs, this is false

• We expect URIs to 'make sense'
• We expect the referents of apparently related URIs to actually be related
• For example
• `http://www.inf.ed.ac.uk/teaching/courses/inf2a/`
• `http://www.inf.ed.ac.uk/teaching/courses/inf2b/`
• It's entirely up to minters to manage this

10. Out of scope

Giving your URIs consistent names contributes to the utility of the Web as a shared information space

• Consistent with one another
• Consistent with the nature of their referent

It's part of what is referred to as the social contract

11. How do I know what a URI identifies?

Now things start to get interesting

In the beginning, there was only one (legitimate) way

• Access it
• If you get something back, turn it into a presentation
• (The illegitimate way is to read the URI itself, see previous discussion about opacity)

12. Ah, finally, something about web pages

Isn't that what URIs are really for?

• Actually, no!
• The preceding slide is wrong
• It's perfectly OK to mint URIs with no intention to ever produce a web page at all
• `http://www.ltg.ed.ac.uk/~ht/Concepts/philosophy`
• There, I just did it.
• 3 billion web page authors can't be wrong
• But (in the terms of the standards)
• what they've authored
• and what you can access
• are not the referents of the URIs
• They're representations of those resources.

13. Representations?

'Representation' names a pair: a character sequence and a media type.

• The media type specifies how the character sequence should be interpreted
• For example JPG or HTML or MP3 would be likely media types for representations of, respectively
• an image of an apple
• a news report about an orchard
• a recording of a Beatles song

Just as, in order to interpret utterances or inscriptions, we need to know the language they are expressed in

The result of a successful access of a URI consists mostly of a representation
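
The 'pair' reading of representation can be made concrete. A minimal sketch, assuming a `Representation` class and an `interpret` function that are invented here for illustration, and assuming UTF-8 for textual content:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Representation:
    """A pair: the transmitted content plus the media type that says how to read it."""
    content: bytes
    media_type: str

def interpret(rep):
    """Use the media type to decide how the content should be interpreted."""
    if rep.media_type.startswith("text/"):
        # Assumption: textual content is UTF-8 encoded.
        return rep.content.decode("utf-8")
    # Anything else is left opaque in this sketch.
    return f"<{len(rep.content)} bytes of {rep.media_type}>"

print(interpret(Representation(b"<p>Orchard news</p>", "text/html")))
print(interpret(Representation(b"\xff\xd8\xff", "image/jpeg")))
```

The same bytes with a different media type would call for a different interpretation, which is the point of carrying the type alongside the content.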

14. Representation to presentation

A browser goes beyond just retrieving a representation

• From representation to, shall we say, presentation

15. Putting it all together

We can combine our story about representations with the behind-the-scenes narrative

Notice the `HTTP/1.1 200 OK` line in the response

• That's where the famous `404 Not Found` would appear if the server couldn't find a representation for the URI
• Remember this, it will be important later

16. Resources vs. representations

Why make this distinction? Why isn't the web page itself the resource?

• That is, the URI's referent, the resource identified by the URI

The distinction was barely there, if at all, in the earliest standard (RFC 1630)

• Which talks about "accessing" "objects on the network"

Thereafter, the distinction emerged over time

• As people reflected on how URIs were being used
• Considering, for instance, that the result of accessing a URI might change over time
• And that this was a good thing
• So the URI must be identifying something more abstract than the thing retrieved

17. Resources vs. representations, cont'd

From quite early on, as well, the idea arose of URIs without associated web pages

• For example from information scientists thinking about using URIs in electronic library records
• "The search for a physical work was identical in all important respects to the search for a digital work. . . . The need was clear for an identifier system that spanned both the physical and digital worlds."
(John Kunze, personal communication)

18. How can a resource lack a representation?

There are a number of un-interesting reasons why accessing a URI may fail to produce a representation

• The net is down
• The owner of the URI has failed to make one available
• The original owner has lost the domain

But there's a more interesting reason as well

• Some resources can't have a representation

19. Having a representation isn't universal

Here's the official story about representations

• A 'representation' is a sequence of octets, along with representation metadata describing those octets, that constitutes a record of the state of the resource at the time when the representation is generated.
(from RFC 3986)
• A representation is data that encodes information about resource state.
(from WebArch)
• The distinguishing characteristic of these resources [web pages, images, etc.] is that all of their essential characteristics [are] conveyed [by their representation]. We identify this set as “information resources.”
(from WebArch)
• A response code of `200 OK` means the response contains a representation of the referent of the requested URI

20. Having a representation, cont'd

A weather report is certainly an information resource

• That is, its representation conveys all its 'essential characteristics'
• So doing a `GET` on something like `http://weather.example.com/oaxaca` makes sense

But what about a `GET` on `http://cities.example.org/oaxaca`?

• If, as we might reasonably suppose, this URI is meant to identify Oaxaca itself.
• Surely it's not possible for a representation to convey Oaxaca's essential characteristics

So whatever response you get, it shouldn't be a `200 OK`

But what should it be, if you have useful information to offer about Oaxaca?

21. Resources about vs. representations of

If what you have describes or depicts a resource, as opposed to conveying its essential characteristics

• Then you have a representation of a different resource
• In our example, not Oaxaca, but a description of Oaxaca

So the officially correct response is something like

```
HTTP/1.1 303 See other
Location: http://cities.example.org/metadata/oaxaca.html
```

• Supposing that `http://cities.example.org/metadata/oaxaca.html` identifies your web page about Oaxaca

The browser interprets the `303 See other` to mean "Do a `GET` on the 'Location' URI instead"

• So you get useful information about Oaxaca without it, as it were, claiming to be Oaxaca
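
The server-side decision behind this exchange can be sketched as a pure function. Everything here is an assumption made for illustration: the tables `PAGES` and `DESCRIBED_BY`, the function `respond`, and the placeholder page body are not taken from any real server.

```python
# Hypothetical: paths for which the server holds an actual web page.
PAGES = {"/metadata/oaxaca.html": "<html>A page about Oaxaca</html>"}

# Hypothetical: paths naming non-information resources, mapped to a description.
DESCRIBED_BY = {"/oaxaca": "http://cities.example.org/metadata/oaxaca.html"}

def respond(path):
    """Return (status_line, headers, body) for a GET on path."""
    if path in PAGES:
        # An information resource: send a representation of it.
        return "HTTP/1.1 200 OK", {"Content-Type": "text/html"}, PAGES[path]
    if path in DESCRIBED_BY:
        # Not an information resource: point at a description instead.
        return "HTTP/1.1 303 See Other", {"Location": DESCRIBED_BY[path]}, ""
    return "HTTP/1.1 404 Not Found", {}, ""

print(respond("/oaxaca"))
print(respond("/metadata/oaxaca.html")[0])
```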

22. Angels on the head of a pin?

Surely this is all a bit over the top?

Well, things and their descriptions are not the same

And when people started using URIs to make assertions (using RDF, on the Semantic Web)

• They wanted to be able to say both
• `http://cities.example.org/oaxaca` has a radish festival every year on December 23rd
• `http://cities.example.org/metadata/oaxaca.html` was written by Raphael Sabattini

23. Interim summary

URIs identify resources

Resources can be anything at all

Accessing a URI may yield a representation of (the current state of) its referent

• But some resources have no representation
• And others have more than one

Only information resources have representations

24. A glaring omission

What about the fact that information resources (Voltaire's Candide, Debussy's La Mer, today's issue of Le Monde, Minard's visualisation of Napoleon's Russian campaign) have parts?

Do we have to make up URIs for all their parts?

How can we do that?

First Brian will expand on this,

• And then at the end of today's session I'll give a very partial answer

25. What's the problem, then?

That's all pretty much OK, isn't it?

• A bit complex
• Uses words as technical terms in not completely obvious ways
• Some of the concepts are a bit soft/under-specified

No, it's not OK

In practice

• Too many people either don't know or don't care
• Virtually all users
• Most developers
• The Web has moved on
• Many background assumptions/motivations are no longer as important as they once were
• Or are simply not true at all
• Some important things have always been missing
• The status of presentations

26. A narrow view. . .

Not before time, I need to emphasise that there's a lot I've ignored so far

• The Web browser is not the only user agent/client
• `http` is not the only URI scheme
• Human beings are not the only users

I'll branch out a bit with respect to the first point in a minute

• But a lot will remain unconsidered

27. Marxist interlude

Why did things go wrong?

The Web as the standards tried to capture it has not stood still

• The drivers for change have been many
• But the economic significance of the Web has got to be near the top of the list

The Web is now more than a conduit for documents to read

• It's a means to a variety of ends
• We use it to do things
• Share photographs, music and videos
• Play games
• Participate in communities
• And above all, buy and sell goods and services

The potential the Web offers for people to make money influences behaviour in ways and at a pace that no after-the-fact standard can hope to match

28. Web practice: web pages

Historically, URIs were mostly seen as simply the way you accessed web pages

• Hand-authored, stored in files, relatively stable and simply shipped out on demand

Not any more

• Rarely hand-authored or stable
• Rather, automatically synthesised from 'deeper' data sources on demand
• Significantly influenced by aspects of the way we initiate the access

Furthermore, the relation between retrieved representation and observed presentation has changed enormously

The presentation a user experiences as a result of accessing a URI depends

• as much if not more on running JavaScript programs embedded in the retrieved HTML representation
• and on the results of other, behind-the-scenes, resource accesses
• as it does on that HTML representation itself

29. Web practice: web pages, cont'd

For example, the representation you get by accessing the `www.weather.com` home page

• Reproduced under fair use
• Is more than 50% JavaScript program
• And the presentation that eventuates from decoding it requires more than 70 subsidiary GET requests to construct

Such a representation certainly captures all the 'essential characteristics' of whatever it is that that URI identifies

• But via a much more complex route than was the case when the theory was developed
• And in a way that seems almost circular
• And thus illegitimate
• That is, it's difficult if not impossible to speak accurately about the resource identified by `www.weather.com`
• without depending almost entirely on its attendant presentations

30. Web practice: URIs/The Semantic Web/Linked Open Data

The Semantic Web names an initiative and a Web-based technology

• Aimed at automating the discovery and aggregation of information on the Web
• For automatic exploitation

You could think of it as the marriage of Knowledge Representation and the Web

• With some Automated Reasoning thrown in

At its core is the idea of a web of assertions

• In the form of two-place relations
• Usually called a triple, and described in terms of subject, object and predicate
• With subject and predicate (always) and object (often) identified using URIs
• Courtesy of Sascha Meissner & Laura Bashaw

31. Web practice: SemWeb URIs

One triple from that example:

```
Subject: http://www.example.org/index.html
Predicate: http://purl.org/dc/elements/1.1/creator
Object: http://www.example.org/members/1234
```

The subject is presumably meant to name an information resource

• But the predicate and object do not, because they are
• A concept (of being the creator (author) of a document)
• A person
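
As a data structure, a triple is about as simple as it gets. A minimal sketch, in which the `Triple` class and the `objects_of` helper are names invented here rather than any real RDF library's API:

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One assertion: a two-place relation (predicate) between subject and object."""
    subject: str
    predicate: str
    object: str

# The triple from the example, with all three positions filled by URIs.
example = Triple(
    subject="http://www.example.org/index.html",
    predicate="http://purl.org/dc/elements/1.1/creator",
    object="http://www.example.org/members/1234",
)

def objects_of(triples, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [t.object for t in triples
            if t.subject == subject and t.predicate == predicate]

print(objects_of([example], example.subject, example.predicate))
```

Nothing in the structure itself says which of the three URIs names a document, a concept or a person; that is exactly the information the URIs alone do not carry.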

The growth of the Semantic Web

• And more recently of its somewhat less ambitious offspring the Linked Open Data movement

means that huge numbers (at least billions) of such URIs are in use

32. SemWeb URIs: the bad news

Accessing one of those URIs will in practice almost always either

• fail, with a `404 Not Found`
• or succeed, with a `200 OK`
• and a representation (usually in terms of more triples) of information which is at least in part about the URI in question's referent

But per the official story accessing a URI which identifies a person should never result in a `200 OK` response!

33. Does this matter?

We've identified a divergence between Web standard/theory and Web practice

• SemWeb/LOD URI owners respond with `200 OK` when they 'should not'

Not only does this matter in principle

• Because it suggests we just don't have the right theory about the way the Web works today

It also matters in practice

• If, when users and developers look at the standards, they see things that label commonplace behaviour as incorrect
• Behaviour which they themselves think is correct and unproblematic
• They will not respect the standards

In other words, laws that everybody breaks bring the law into disrepute

34. What can be done?

We need to look hard at the things we missed before

• The status of presentations
• Or, the importance of the context of use

35. Where can we look for help?

Although URIs are a new kind of identifier, they share some properties with other identifiers

• And the Philosophy of Language has struggled with the nature of identifiers in ordinary language for many years
• With some potentially useful results

Other disciplines have something to contribute as well

• Information Science
• Philosophy of Computation

36. Taking presentation seriously

For the vast majority of web developers, it's all about what I've been calling presentation

• The user experience

Maybe we should make that more central in our theory

Here's a sketch of how we might start

• A browser implements a function from a URI and a browser state to a URI plus the substance of a request: $B_1(U, S_B) \longrightarrow (U, \sigma)$
• A server implements a function from a URI plus a substance from a request and a server state to a substance in a response: $S(U, \sigma, S_S) \longrightarrow \sigma'$
• A browser also implements a function from a substance from a response, a browser state and a web state to a presentation: $B_2(\sigma', S_B, S_W) \longrightarrow \pi$

Compose all three of these, and a browser implements a function from URIs to presentations (with dependencies on three kinds of state: browser, server and web): $B(U) \longrightarrow \pi$

People responsible for URIs know this

• They understand what they are doing in terms of this story
• Their goal is precisely to effect the connection between a particular URI and a particular presentation
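
The three functions and their composition can be sketched directly. This is a toy under stated assumptions: the states are plain dictionaries, the 'substances' are simple values, and the bodies of `B1`, `S` and `B2` are placeholders chosen only to make the composition visible.

```python
def B1(uri, browser_state):
    """Browser, outbound: URI + browser state -> (URI, request substance)."""
    request = {"method": "GET", "accept": browser_state["accept"]}
    return uri, request

def S(uri, request, server_state):
    """Server: URI + request substance + server state -> response substance."""
    return server_state.get(uri, "404 Not Found")

def B2(response, browser_state, web_state):
    """Browser, inbound: response substance + browser and web state -> presentation."""
    # A real B2 would consult web_state for stylesheets, scripts and images.
    return f"[{browser_state['name']}] {response}"

def B(uri, browser_state, server_state, web_state):
    """The composition: from a URI (and three kinds of state) to a presentation."""
    u, request = B1(uri, browser_state)
    response = S(u, request, server_state)
    return B2(response, browser_state, web_state)

browser = {"name": "toy", "accept": "text/html"}
server = {"http://weather.example.com/oaxaca": "Sunny"}
print(B("http://weather.example.com/oaxaca", browser, server, {}))
```

Holding the URI fixed and changing any of the three states can change the presentation, which is why the dependence on state has to be written into the signature.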

37. Theory of presentations, cont'd

$B_2$ (and thus $B$) depends on Web state

• because e.g. rendering an HTML page may involve appeals to $B_1$ and $S$ for stylesheets and scripts
• As well as to the whole of $B$ for frames, embedded images, etc.

In principle this recursion could fail to terminate

• In practice this is extremely unusual
• Although I have heard anecdotal evidence that crawlers do sometimes fall into unintended recursion tarpits

38. Presentations and persistence

The stability over time of B for some URI is then what people are talking about when they discuss persistence

We can outline some interesting classes of expected behaviour in these terms

1. Once a particular instance of B is established, there is no expectation that the resultant presentation will change in any way
• URIs which present as images, audio and video are often in this category
2. Once a particular instance of B is established, there is an expectation that the resultant presentation will not change
• but there is a recognition that errors may need to be corrected
• URIs which present as some form of transcription of documents created outside the Web are typically in this category
• as are W3C Recommendations
3. URIs whose presentations change only in accordance with a published versioning policy
6. A 'live' news blog

39. Resource, representations, alternatives

This space is much richer and more complex than the current theory allows for

The Information Scientists have struggled with the ontology of created works for a long time

• Just what do we need to deal with when we consider, for example, Charles Dickens's novel Martin Chuzzlewit
• I have here not one but two Martin Chuzzlewits
• Are they the same book?
• Yes and no, right?
• Same author, same story
• Same language
• Different publishers, different editions
• Different physical objects

40. The nature of the work of art

Information Scientists (IS) have developed a detailed story about all this

• Known as FRBR
• Functional Requirements for Bibliographic Records

FRBR has a four-level ontology

• Work   "a distinct intellectual or artistic creation"; "an abstract entity"
• Expression   "the intellectual or artistic realization of a work in the form of [some] notation", e.g. "variant texts"; first, second, . . . editions; English original vs. French translation
• Manifestation   "the physical embodiment of an expression of a work"; "all the physical objects that bear the same characteristics, in respect to both intellectual content and physical form"
• Item   "a single exemplar of a manifestation"; "a concrete entity"
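
The four levels nest, and the nesting is what lets 'are they the same book?' get the answer 'yes and no'. A minimal sketch, assuming invented class fields and purely hypothetical publisher and copy labels (the slides do not say which editions were shown); treating both copies as sharing one expression is also an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Work:
    title: str
    author: str

@dataclass(frozen=True)
class Expression:
    work: Work
    language: str

@dataclass(frozen=True)
class Manifestation:
    expression: Expression
    publisher: str

@dataclass(frozen=True)
class Item:
    manifestation: Manifestation
    copy_id: str

work = Work("Martin Chuzzlewit", "Charles Dickens")
english = Expression(work, "English")
# Hypothetical publishers and copy identifiers, for illustration only.
copy_a = Item(Manifestation(english, "publisher A"), "copy-1")
copy_b = Item(Manifestation(english, "publisher B"), "copy-2")

# Same work, different manifestations, different items.
print(copy_a.manifestation.expression.work == copy_b.manifestation.expression.work)
print(copy_a.manifestation == copy_b.manifestation)
print(copy_a == copy_b)
```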

41. FRBR and the Web

Not all of the FRBR architecture maps directly onto the Web situation

• But getting clear where it does
• And where it doesn't
• Would be of use to both communities

Some people in the IS community, notably Allen Renear, have made a start on this

• But it needs a lot more work

42. Time-varying resources

Clearly there's a lot of it about

• But the official story about it is hopelessly vague
• In particular, with respect to what counts as acceptable (with respect to the Social Contract) variation

We looked at this briefly under the heading of persistence

The Philosophy of Language offers another possible way in

Consider English words such as this, here and tomorrow, as well as you and I.

• Linguists call this class of words indexicals

On one well-thought-of Philosophy of Language account, an indexical such as now has

• a single meaning
• but multiple interpretations

The meaning is fixed, the interpretation varies according to the context of use

43. Indexicals, cont'd

The meaning of now is something like "the time at which the utterance containing the word is made"

• It is the same for every ordinary use of the word

But the interpretation is, well, whatever time it happens to be when the word is used

• and this of course changes all the time

More generally, the meaning of an indexical can be understood as a function from contexts (that is, contexts of utterance) to interpretations

In a formal approach to all this, indexical meaning really is just an index

• Because context of utterance is modelled as a sequence
• E.g. `<[Henry S. Thompson], {[a lot of folk]}, 2011-09-13:18:00:05,. . .>`
• So on this account the meaning of now is pretty much just $\lambda c.\, c_3$

The parallel with time-varying resources is clear

• But more work is needed to see if that observation can be cashed in
• In terms of a useful formal story
• Some aspects of the functional account of presentation given above are a tiny step in this direction
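
The 'meaning as a function from contexts to interpretations' idea translates almost word for word into code. A minimal sketch, assuming the context is a tuple ordered as in the example above; reading the first two positions as speaker and audience is an assumption, and the function names are invented here. Python indexes from 0, so the third element is `c[2]`.

```python
from datetime import datetime

# A context of utterance modelled as a sequence: (speaker, audience, time, ...).
context = (
    "Henry S. Thompson",
    frozenset({"a lot of folk"}),
    datetime(2011, 9, 13, 18, 0, 5),
)

def meaning_of_now(c):
    """The fixed meaning of 'now': pick out the third element of the context."""
    return c[2]

def meaning_of_i(c):
    """The fixed meaning of 'I', assuming the speaker is the first element."""
    return c[0]

# Same meaning, different interpretation once the context changes.
later = context[:2] + (datetime(2011, 9, 13, 18, 5, 0),)
print(meaning_of_now(context))
print(meaning_of_now(later))
print(meaning_of_i(context))
```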

44. Local context

Arguably, much of the focus of Web architecture discussions to date has been misplaced

• That is, the significance of URIs on their own has been exaggerated

Words, after all, don't mean anything

• When considered as simply a sequence of characters or sounds

They have to be used before we can talk about their meaning

• And we have to know how they're used

Consider the following character sequence as if found on a piece of paper in an otherwise featureless bottle on a desert island beach:

chat

Conversation? Cat? Something else altogether? There's no way to tell

It's words in use that have meaning

• So not just the external context
• As for indexicals, above
• But also the local, linguistic, context

45. Enter philosophy

One of the primary roles of "the philosophy of ..."

• Help the subject discipline by identifying parallels
• "Your problem is not new"
• And, good news, the solution is ...
• Alternatively, bad news, the prognosis is not good, sorry

Within the web community discussions about URIs use words such as 'identify' and 'denote', as well as 'name' itself

These are terms of art within the Philosophy of Language

• Is there a parallel here that can be helpful?
• Or is the implied connection mistaken and irrelevant?
• Or at worst even harmfully misleading?

46. Quibble from the philosophers

In natural language, names are easily discoverable

That is, we know what to call things

• This is why even the Kripkeans can't dispense with descriptions

There is nothing corresponding to this for URIs

• Except search engines!
• Do I hear you say "aha, the extended mind"?

47. Back to local context

Ordinary language names function within specific linguistic contexts

What a name means depends on the details of the surrounding utterance or sentence

Consider what at first seems like a very ordinary (imaginary) name: EZY386 (short for easyJet 386)

Here are some example uses of this 'simple' name

• EZY386 will depart from gate E17 at 2010 [announcement]
• Just arrived on EZY386 [text message]
• EZY386 flies from Stansted to Avalon
• EZY386 is easyJet's 3rd most popular flight to Avalon
• I prefer EZY386 to EZY387
• EZY386 has an 102% on-time record
• EZY386 was cancelled yesterday
• EZY386 was delayed because of a problem with one of its engines

48. Not so unique. . .

So EZY386 isn't so simple after all

• We seem to be happy to use/understand it to mean a wide range of things
• Up and down some kind of specific/generic or abstractness scale
• Based on clues from the local linguistic context

People are smart

• Computers are dumb
• And natural language understanding is an unsolved problem

None the less it might be that this kind of flexibility in the use and understanding of names would help us with our theories about URIs

• By widening the focus a bit
• from URIs as such
• to URIs in a local context

49. Local context for URIs

So, for example, maybe the resource/representation distinction is like this

• That is, in at least some cases, different local contexts for a URI
• Signal different points on an abstraction-like scale

So apparently non-canonical behaviour gets brought back into line

For example, contrast these two local contexts:

• In an HTML link: ``<a href="http://www.example.org/index.html">. . .``
• In an RDF triple:

```
Subject: http://www.example.org/index.html
Predicate: http://purl.org/dc/elements/1.1/creator
Object: http://www.example.org/members/1234
```

From this perspective, maybe `200 OK` should be more flexible

• And no longer mean "Here is a representation of your URI's referent"
• But rather "Here is a representation useful for that URI as you are using it"

50. Elaborating context

To put this in a way that connects up with the discussion of indexicals

• Stop talking about a URI's (single) referent
• But rather about a URI-in-context's referent
• Elaborating the 'context' in "Meaning as a function from context to interpretation"
• to include the local as-it-were linguistic context
• or, alternatively but equivalently, to include intent

Something like this seems preferable to just throwing up our hands

• And declaring the SemWeb and the OFW to be two different Webs
• With two different core constituents
• Which just happen to look the same
• And be called URIs

51. Envoi

A whistle-stop tour of some key components of the Web

• Both in terms of the relevant standards
• And the breadth of actual use

A handful of cases where these two don't line up

• Mostly to do with URIs

And finally some more-or-less wild-and-crazy suggestions

• Drawing on some aspects of the Philosophy of Language

Importing an outside perspective into a complex space is easily dismissed

• As at best uninformed and irrelevant
• At worst arrogant and disruptive

I hope at least some of the foregoing proves to be neither