Dirk and Nadia design a naming scheme


Web naming schemes good practices

Henry S. Thompson
Jonathan Rees
22 June 2009

1. Introduction

The first "Good practice" note in AWWW gives the following advice:

To benefit from and increase the value of the World Wide Web, agents should provide URIs as identifiers for resources.

But existing Web technologies and specifications offer a wide range of options to anyone embarking on a project which involves the creation and management of Web-enabled names. As a result, for anyone attempting to follow the above advice, a question immediately arises, namely "What kind of URIs should agents provide?"

This document is intended to supplement AWWW by addressing this question, for which AWWW does not provide much help.

Following the precedent of AWWW, we proceed in terms of a simple scenario which supplies motivating context for a subsequent more detailed exploration of requirements for Web-enabled names and best practices for addressing those requirements.

2. An exemplary scenario

Nadia and Dirk's work on film industry data (see AWWW section 4.2.2) has been successful up to a point, but their employers, a consortium of film studios called FSC, are not happy that they have used URIs from a widely-used public film database to identify films, actors, directors, etc. "There's no guarantee those URLs will always give the right information" says Robin, the project officer at FSC, "and anyway, FSC should control all of this. And also, we want names, not old-fashioned locations. Isn't there a way we can have a bunch of names that are clearly ours, and guarantee what they identify?"

At this point Dirk and Nadia both reply at once:

"Well, sounds like there are some choices to make. You two get your story straight, and bring me back a proposal with a rationale and some costings", says Robin.

So, who's right? What should the consensus proposal look like? Not surprisingly, that depends on the requirements, which need to be articulated in a lot more detail than is provided by Robin's off-the-cuff remarks. In what follows we'll explore the requirements space and the solution space, and conclude that in a large number of cases both Dirk and Nadia are on the wrong track, because http-scheme URIs can satisfy Robin's requirements while looking very attractive from a cost-benefit perspective.

3. Requirements

What Robin says FSC wants for its web-enabled naming scheme is very close to what many groups and projects have specified as requirements in similar situations: they all want names which are identifiable, reliable and stable. The technology behind the names should allow delegation of naming authority and provide uniform access to metadata.

It's important to be clear what each of these requirements really means, so let's start with a more detailed list, making some key distinctions along the way, and reflecting a common rough order of importance:

Dirk says "People must be able to tell by looking at one of our URIs that it is one of ours". Nadia adds "Yes, and browsers too: URIs which are part of our scheme should be syntactically distinguishable from all other URIs".

Hereafter we'll use the phrase [URI] in the scheme to refer to the set of URIs identifiable as part of a naming scheme, and the phrase access a URI as shorthand for the retrieval of a representation of the resource identified by a URI.
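
As a rough illustration of what "syntactically distinguishable" could mean in practice, the following sketch tests whether a URI is in the scheme by inspecting only its text. The authority names.fsc.example is an assumption made for this example (the scenario does not say what domain FSC would use), as is the decision to treat sub-domains of that authority as part of the scheme.

```python
from urllib.parse import urlsplit

# Hypothetical authority for FSC's scheme; not taken from the scenario.
SCHEME_AUTHORITY = "names.fsc.example"


def in_scheme(uri: str) -> bool:
    """Return True if the URI is, syntactically, a URI in the scheme.

    Only the scheme and host are inspected; nothing is dereferenced.
    """
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()  # hostname is None for e.g. mailto:
    is_web = parts.scheme in ("http", "https")
    # Accept the authority itself and (by assumption) any of its sub-domains.
    is_ours = host == SCHEME_AUTHORITY or host.endswith("." + SCHEME_AUTHORITY)
    return is_web and is_ours
```

A person can apply the same test by eye, which is Dirk's half of the requirement; the function is Nadia's half.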

Not sure about the name of this requirement. . .
"Our URIs should work right away for everyone", says Nadia. "Right", says Dirk, "in web pages, email, all the places people expect to be able to put URIs and then click on them". That is, it should require little or no effort on the part of ordinary users to access a URI in the scheme.
Nadia says "It's very important that our users never see a 'not found' response (as long as they're actually online)". That is, it should always be possible to get a positive response (either a representation or other definite advice about the resource) from an attempt to access a URI in the scheme.
Dirk and Nadia have read the TAG finding on opacity of URIs, and are agreed that their scheme should be explicitly transparent, that is, it should be evident what each of their URIs is about just by looking at it. They understand and accept that this means they will have to document the nature of the mapping from URIs to resources as part of their public site description.
Dirk says "We have to be able to give our members control of the URIs that concern them". Nadia says "Right, for instance, each studio has to be responsible for their own pictures". That is, the owner of a set of URIs in the scheme must be able to delegate support for the transfer of naming authority (control over the meaning of URIs) for designated parts of the scheme.
Nadia says "People have to be able to rely on our URIs". Dirk adds "Yes, the meaning of our URIs shouldn't change, no matter what happens". After further discussion, they distinguish three kinds of stability:
  • owner stability   "No-one can take our URIs away from us." That is, ownership of a URI, and the authority over a URI's meaning which follows from it, continues as long as the owner wants it to;
  • resource stability   "For at least some of our URIs, we won't ever change what they are for". That is, in AWWW terms, what resource such a URI identifies shouldn't change;
  • representation stability   "For at least some of our URIs, we want to guarantee that exactly the same page will always come up". That is, the representation retrieved from such URIs shouldn't change.
"How do we find out the status of one of our URIs?" asks Dirk, and goes on, "Who its naming authority is, who wrote its representation(s), whether it's meant to be stable or not -- oh, lots of stuff." "There should be a standard place to put metadata, and a standard way to get at it, for every one of our URIs", replies Nadia. That is, given a URI in the scheme it should be possible to retrieve metadata about the URI and the resource it identifies independently of the representation of that resource, if any.
Nadia remembers security: "If we turn out to want to hold or exchange information which is private to some of our members or users, they have to know it's safe." "Right," says Dirk, "We should be able to reliably identify our users and members, and keep our interchanges with identified parties private, without otherwise changing things."
Do we need this? Is it too obviously a floater for https: to bang out of the park?
Half of this example is in detail false: W3C is not actually committed to representation stability for dated URIs, only to using dated URIs for non-time-dependent resources. Do we have a better example in the public domain?

If the distinction between resource stability and representation stability isn't clear, consider the difference between http://www.w3.org/ and http://www.w3.org/TR/1998/REC-xml-19980210. They each identify a resource: the W3C home page and the first edition of the XML language specification, respectively. The W3C observes the maxim "Cool URIs don't change", which is to say, it is committed to resource stability across the board. The specific consequence of that for the examples at hand is that the W3C is committed to maintaining the use of those two URIs to identify those two resources, in perpetuity. But the W3C is only committed to representation stability for the second of these two URIs. Indeed a significant contribution to the value of the first URI, which identifies the W3C home page, is that the representations which can be retrieved from it are not stable: they change on a regular basis, to provide up-to-date information about W3C activities. On the other hand, representation stability is important for the date-containing-URI names of W3C specifications: the W3C is committed to always providing exactly that original representation of the XML specification.

Still not great -- what I really want is something such as a product key URI. . .
Should I say something to the effect that people are often (always?) mistaken when they think they want this?

So, let's try a rewrite: If the distinction between resource stability and representation stability isn't clear, consider the difference between http://www.w3.org/ and http://web.archive.org/web/19990503021459/http://www13.w3.org/. They each identify a resource: the W3C home page as such and the W3C home page as it was at a particular date and time, respectively. The W3C observes the maxim "Cool URIs don't change", which is to say, it is committed to resource stability across the board. The specific consequence of that for the first example is that the W3C is committed to maintaining the use of that URI to identify the resource which is its home page, in perpetuity. But the W3C is not committed to representation stability for it. Indeed a significant contribution to the value of that URI is that the representations which can be retrieved from it are not stable: they change on a regular basis, to provide up-to-date information about W3C activities. On the other hand, the value of the second URI depends on representation stability: that is, retrieving from that URI will always give you the same representation the Wayback Machine retrieved from the W3C site on May 3rd 1999.

Could use this, which I just created:

A link to my PGP public key, which includes a SHA1 digest of the representation it should retrieve. . ..

4. Using http: URIs to address the requirements

After looking into their requirements more carefully, as tabulated above, and investigating the space of possible solutions more thoroughly, Dirk and Nadia come to the conclusion that a carefully designed scheme using http: URIs can come at least as close to satisfying all their requirements as either of their original suggestions, or any other non-http: scheme.

http: URIs aren't perfect, and in some cases there are trade-offs: fully satisfying one requirement may require some compromises with respect to another. The Challenges and tradeoffs section below introduces some of these, but a complete analysis is beyond the scope of this finding.

4.1. http: URIs are identifiable

The desire for branding to be evident in URIs is both widespread and understandable. URI identifiability is a form of advertising, where the admittedly modest impact of a single use of an identifiable URI is potentially magnified greatly by widespread replication. Identifiability also is a cornerstone of trust: brand recognition and successful URI access are mutually reinforcing.

RFC 3986 [ref], the standard which governs all URIs, provides for both a registry-based authority segment and a local, typically hierarchical, path segment in URIs, and recommends both, together with the use of the IANA Domain Name system [ref IANA] for the authority segment, for any URI scheme that intends to be global in scope. http: URIs do exactly that, and so clearly follow the RFC recommendation and thus satisfy the identifiability requirement, since all the participants in a web-targeted naming scheme can be assumed to already have domain names which are readily identifiable, or can come to be readily identifiable, as theirs.
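
The split between the registry-based authority segment and the local, hierarchical path segment can be seen by taking an http: URI apart. The sketch below uses Python's standard urllib.parse module and one of the W3C URIs cited in this document; the helper name authority_and_path is introduced only for this illustration.

```python
from urllib.parse import urlsplit


def authority_and_path(uri: str) -> tuple[str, list[str]]:
    """Split a URI into its authority and its hierarchical path segments."""
    parts = urlsplit(uri)
    # Drop empty segments produced by leading or trailing slashes.
    segments = [segment for segment in parts.path.split("/") if segment]
    return parts.netloc, segments


authority, segments = authority_and_path(
    "http://www.w3.org/TR/1998/REC-xml-19980210"
)
# authority is the Domain Name that makes the URI identifiable as the W3C's;
# segments is the local hierarchy that the W3C alone controls.
```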

4.2. http: URIs are useable

Pervasive support for http: URIs is the foundation of the success of the Web today. A wide range of user agents, not only web browsers, recognize http: URIs and know how to access them using widely deployed software support for the DNS and HTTP protocols.

At the other end, again a wide range of server software is available, both free and commercial, ranging from fully-integrated website and document management systems with support for on-the-fly synthesis of documents to simple lightweight filesystem-backed servers.

"Nothing ubicts like ubiquity" :-)

With the exception of the legacy ftp: scheme and the non-Web file: scheme, no other URI scheme has anything like this degree of ubiquity.

Domain names: RFC background?
Social v. technical?
Hierarchy (social)
Sub-domains, escaped domains or private names (contract!)
Private convention, public convention forthcoming

5. Challenges and tradeoffs

Why is this hard? Dirk and Nadia's requirements all seem sensible, and we've had names for use on the Web for nearly 20 years now. What stands in the way of a naming scheme satisfying those requirements?

5.1. Challenges for Owner Stability

The stability of name ownership is at risk for at least two pretty much incorrigible reasons: the instability of human institutions, and the contingent nature of name registration.

It is in the very nature of owning anything that the first kind of risk inheres: the owner of a name may sell it, or give it away, or lease it, or indeed go out of business, or sell their business, or the relevant division. Any of these changes amounts to or results in a change of ownership.

The second kind of risk arises from the nature of names on the Web. Virtually all naming schemes used on the Web are based on a division of names into a global part, managed by a global registry, and a local part, typically involving some form of hierarchical decomposition. The syntax of URIs is designed to support this decomposition. The RFC which governs URIs [ref 3986] distinguishes between the authority segment (registry-based) and the path segment (local, hierarchical) of a URI, and recommends the use of the IANA Domain Name system [ref IANA] for the authority segment of any URI scheme that intends to be global in scope. Any naming scheme that follows this recommendation, and thus equates URI ownership to Domain Name ownership, such as the http: URI scheme, depends on the stability of ownership of Domain Names for URI owner stability. But Domain Names are not really owned, only leased, for fixed terms, with no guarantee of renewability, with the possibility of expropriation and with the in-principle risk, however unlikely in practice, that the Domain Name registration system itself may cease to function.

There is ultimately no way around this. In particular, there is no point in proposing naming schemes that use their own registries and/or lookup mechanisms (not involving IANA) solely in order to get around this, because the reasons IANA operates Domain Name registration the way they do, and the vulnerabilities that the Domain Name system has, are universal and inescapable, given the requirements it must satisfy. See the appendix below for a discussion of why this is the case.

5.2. Challenges for resource stability and reliability

The owner of a URI has the right to determine what it means, that is, what resource it identifies, and the responsibility to respond to requests to access representations thereof. It follows that any change of ownership may (but need not) mean a change here too. And even without a change in ownership, control over resource identity and/or responsibility for reliability may change if the owner delegates that control or responsibility, or changes an existing delegation.

The most common threat to both reliability and resource stability in a global plus local naming system is the single point of failure implicit in registry-based ownership. The technical aspect of this isn't the problem: multiple servers, aliasing, failover, etc. are all well-understood, widely and successfully deployed techniques. Rather it's the management aspect. Even if no real or effective change in ownership occurs, the frailty of human institutions is once again the problem: a change in business focus, or loss of interest in the relevant aspect of their business, or just misconfiguration of a DNS entry or server, may compromise reliability or resource stability. That is, the (new) owner of some names in a scheme may stop responding to requests for what they see as old and irrelevant URIs (a failure of reliability), or, worse, may decide to re-use those 'old' URIs for different resources (a failure of resource stability). Users of the URIs will then no longer be able to access such URIs in the most obvious way, or will not get what they expect when they do.

5.3. Challenges for representation stability

Somewhat perversely, the main challenge here is that it's actually rarely if ever really what is wanted -- to tie a URI to a particular character sequence to be interpreted as a particular media type is a very strong constraint.

And if it really is what is wanted, an externally verifiable guarantee is probably wanted as well, which in turn at least compromises transparency, because it means that the URI for a representationally stable resource will have to include both the intended media type and a hash or checksum of the intended character sequence, as for example has become common practice among peer-to-peer sharing of Anime [ref http://animechecker.sourceforge.net/].
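
To make the cost to transparency concrete, here is a minimal sketch of a URI that carries both the intended media type and a digest of the intended representation, together with the check a client could run after retrieval. The base URI, the path layout and the choice of SHA-256 are all assumptions made for this example; the sketch hashes the bytes of the representation rather than a character sequence.

```python
import hashlib
from urllib.parse import quote

# Hypothetical base for representationally stable names; not from the scenario.
BASE = "http://names.fsc.example/stable"


def mint_stable_uri(body: bytes, media_type: str) -> str:
    """Build a URI embedding the media type and a SHA-256 digest of the body."""
    digest = hashlib.sha256(body).hexdigest()
    # Percent-encode the media type so its "/" stays inside one path segment.
    return f"{BASE}/{quote(media_type, safe='')}/sha256/{digest}"


def matches(uri: str, body: bytes) -> bool:
    """Check a retrieved body against the digest in the URI's last segment."""
    expected = uri.rsplit("/", 1)[-1]
    return hashlib.sha256(body).hexdigest() == expected
```

The resulting URIs are externally verifiable, but the final segment is 64 hexadecimal characters that say nothing about what the resource is, which is the loss of transparency described above.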

5.4. Challenges for delegation

The ultimate in delegation is a fully decentralised system, in which anyone can mint URIs in the scheme. The minimum necessary to avoid collisions is the use of a central registry such as the Domain Name system for the authority part of the scheme. The challenge here of course is that there is no place for any structure to ensure that minters of scheme URIs respect whatever constraints the scheme owners have specified to guarantee that other requirements on the scheme are satisfied.

Furthermore, the more entities actually mint scheme URIs, the more likely it is that one of them will undergo one of the status changes mentioned above under Challenges for Owner Stability.

So the fundamental challenge is to find the right point on the continuum from fully centralised to fully decentralised which delivers on all the other requirements.

5.5. Challenges for identifiability

The desire for branding to be evident in URIs is both widespread and understandable. URI identifiability is a form of advertising, where the admittedly modest impact of a single use of an identifiable URI is potentially magnified greatly by widespread replication.

Identifiability seems to follow naturally from delegation at the highest level: if different entities are free to mint URIs in the scheme, and Domain Names have a place in the scheme, then identifiability is provided. But the previous section suggests that a fully decentralised scheme is unlikely to satisfy other requirements, so a place for identifiability in a less than fully decentralised scheme has to be found.

6. Solutions

Many of the requirements listed above are not essentially technical in nature. Rather they are social. That is, they impose conditions on the management of the names, not their essential nature. We'll start by looking at name management policy, then move on to specific mechanisms which can be deployed to assist in name management, or to some extent protect against potential breakdowns of name management policy.

6.1. Name management policy

6.1.1. Providing owner stability

As things stand all that anyone, or any group, can do is to put carefully-designed mechanisms in place to ensure that all Domain Name registrations are legitimate (that is, not vulnerable to expropriation for cause), monitored for impending expiry, and renewed in a timely fashion.
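
The monitoring part of such a mechanism can be very simple. The sketch below assumes that the expiry dates of the relevant registrations are already to hand (how they are obtained from a registrar is outside its scope); the domain names and the 90-day warning window are hypothetical.

```python
from datetime import date, timedelta


def due_for_renewal(
    expiries: dict[str, date], today: date, window_days: int = 90
) -> list[str]:
    """List the domain names whose registration expires within the window.

    Registrations that have already lapsed are included as well.
    """
    cutoff = today + timedelta(days=window_days)
    return sorted(name for name, expires in expiries.items() if expires <= cutoff)


# Hypothetical registrations held by FSC and one of its members.
registrations = {
    "fsc.example": date(2030, 1, 1),
    "studio1.example": date(2025, 3, 1),
}
warnings = due_for_renewal(registrations, today=date(2025, 1, 15))
```

This only detects impending expiry; it does nothing about the other risks to owner stability, such as expropriation.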

Providing persistence and resource stability

Assuming owner stability, good will, and continuing commitment to participation in the scheme, these requirements are entirely in the hands of the originators, operators, and participants in any naming scheme. They are nonetheless among the hardest to address well. Restricting naming authority to trusted participants whose corporate self-interest is evidently tied up with their commitment to maintain their web presence and not change what their URIs in the scheme mean is an obvious starting point, but deciding just how commitments are to be phrased and what sanctions, if any, are to be available to enforce those commitments is inevitably a difficult business.

Some degree of protection against this kind of failure of the scheme can be provided by delegation and/or replication, see below.

6.1.2. Providing representation stability

Needs to be expanded

The solution is partly management, i.e. the imposition of participation requirements, and partly technical, e.g. including the representation's checksum in the URI itself, or providing the checksum in a metadata record. Neither is a guarantee (nothing is), but each at least provides a way to check, and if a mismatch is detected the site operator can be shamed into fixing it. OBO [ref?] does this.

6.2. Relevant mechanisms and how they can help

This section started out trying to be a framework within which all existing schemes could be characterised. Perhaps that's the wrong goal. . .

In the simplest, and very common, situation with respect to URIs, resources and representations, the owner of a Domain Name is the implicit proprietor of a naming scheme, consisting of all URIs which use that Domain Name (and its sub-domains) as their authority. The owner decides what resources will be given names in that scheme, what those names will look like, and how representations will be stored and/or computed, and provides the necessary computing resources for storage, computation and servicing of access requests.

Once more than one party is involved, as is the case in the Dirk and Nadia FII scenario we are considering, choices arise with respect to each of the decisions and provisions just listed, and these choices in turn have implications for the requirements placed on the scheme. In what follows we consider what choices affect the stability and persistence requirements.

Name ownership

I'm not sure about using 'Domain Name' below—I keep going back and forth between thinking about this document as if it's meant to cover _all_ URIs, or just URIs which follow the RFC advice and use Domain Names in their 'authority' part, or just http: URIs. . .

Three options arise here:

Every participant in the scheme uses their own Domain Name for the authority segment of names in the scheme.
All names in the scheme use the same Domain Name for their authority segment.
All names in the scheme use the same Domain Name in their authority segment, but ownership of subsets of names is delegated to individual participants.
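
The three options can be compared by looking at the URIs each would produce for the same film. In this sketch the domain names, the studio identifier and the use of a per-participant path prefix for the delegated case are all assumptions; the text above does not prescribe how delegated subsets are marked.

```python
# Hypothetical shared authority for the second and third options.
SHARED_DOMAIN = "names.fsc.example"


def own_domain_uri(studio_domain: str, local_name: str) -> str:
    """Option 1: each participant uses its own Domain Name as authority."""
    return f"http://{studio_domain}/{local_name}"


def shared_domain_uri(local_name: str) -> str:
    """Option 2: every name uses the one shared Domain Name."""
    return f"http://{SHARED_DOMAIN}/{local_name}"


def delegated_uri(studio_id: str, local_name: str) -> str:
    """Option 3: shared Domain Name, with a path prefix owned by a participant."""
    return f"http://{SHARED_DOMAIN}/{studio_id}/{local_name}"
```

Only the first and third forms show which participant is responsible for a name; the second relies entirely on the central authority.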

Resource identification

The same three options arise here:

Every participant in the scheme can choose names in the scheme to identify their resources. If name ownership is centralised, there is a risk of collision.
One central authority issues names, for resource owners to use.
Subsets of names, determined centrally, are allocated to participants to assign to resources as they see fit.

7. Spare parts

This section contains bits and pieces, some from other sources, which may or may not find a place in the final document.

Once any of those decisions or provisions are placed in hands other than the owner's, we have an instance of delegation. Almost any combination of retention of some aspects of control/provision and delegation of the rest is possible in principle; in practice we observe a small number of common patterns, which we will explore below.

Some patterns of delegation can go a long way towards mitigating the negative impact of institutional frailty on naming schemes. There are two primary delegation patterns we will look at: centralisation (or delegation upwards, from the members of a group to the group itself) and replication (or delegation downwards, from a group to its members).

The framework is theoretical, which it has to be in order to catch all the objections that are going to be thrown at it. The general reader (scheme designer) won't be able to apply it without help, so it will have to be augmented (as the existing draft finding was) with examples.

7.1. True permanence

Expropriation on appeal and repossession are not necessarily the issues. One could assert that effective permanence happens sometimes in practice (e.g. scheme names, chemical element names) and could happen more often if we just figured out how, so this document has to answer:

(JAR's naming scheme: some subspace of tag: URIs resolvable through some cockamamie protocol... now convince JAR that he should use http: URIs instead.)

However, I agree that the XRI folks are not putting immortality out as a major issue for them, and my constituency may not be important enough to spend column-inches on right now.

7.2. Improving owner stability, at a cost

There are evidently stable, managed sets of names in existence: the periodic table, the names of surface features of planets and satellites. What is it about names for use on the Web that precludes true stability for them? The combination of arbitrary, dereferenceable and identifiable seems to be the source of the problem. These three together mean there is real value in owning a name, and that there can, and therefore will, be dispute about what legal entity gets to use what name. This in turn requires a dispute resolution procedure for registered names, which in turn means expropriation must be possible. Because supporting dereferencing requires resources, owning names incurs costs, which means they will be abandoned, which in turn, along with the fact that name ownership has real value, means that it makes sense to lease, rather than sell, registration.

If we look at existing systems on the Web, that is URN namespaces and URI schemes, which do not rely (entirely) on IANA Domain Names, we find broadly speaking three cases:

Some URI schemes, including doi (not registered), and URN namespaces (such as uuid) use opaque strings, typically numbers, either self-allocated (uuid) or via registries (doi). Such approaches may involve outright ownership (uuid), or may not (doi, at least from some registrars), and since they don't provide identifiability, need not provide for expropriation, but they are nonetheless heir to the other vulnerabilities of owned names.
Some schemes, including the info and xri (not registered) URI schemes, provide identifiability and operate their own registries, distinct from the IANA Domain Name registry (although their current lookup mechanisms do rely on the DNS system). The xri registry is parallel to IANA's in all the aspects relevant to lack of stability discussed above. The info registry is qualitatively quite different, as it is restricted to names for the operators of large public namespaces, and is clearly intended to operate in terms of dozens or at most hundreds of registrations. No appeal or expropriation mechanisms are defined for it, and since dereferencing is explicitly not required to be supported, the impact of a registered info name owner going out of business is not necessarily very great.
dated Domain Names
Some schemes, including the tag URI scheme and the NEWSML URN namespace, combine a Domain Name with a date, in an attempt to avoid the majority of the vulnerabilities we've identified. However, tag URIs explicitly do not support resolution, and NEWSML URN resolution is left unspecified in principle, and in practice seems not to be supported.
Still need to say something about how the meaning of e.g. http://www.w3.org/1999/xhtml might be stable even if W3C loses ownership of w3.org. . .

In summary, a number of schemes exist whose vulnerability to the challenges to ownership stability identified above is reduced, but they all achieve this at the expense of one or more of Dirk and Nadia's other requirements.

Delegation as centralisation

Copied wholesale from Persistence, Delegation and URIs, needs editing

Or, "Put all your eggs in one basket, and watch that basket!" In its simplest form, centralisation means that all the participants in a common naming scheme agree that there will be only one repository of representations, and one domain name used for names. This is technically very simple, but has several major constraints which make it less likely to be a satisfactory solution:

  1. The owner of the domain name is still a single point of failure. If it is a member of the community, complex contractual constraints will be needed to reassure the rest of the community. If it's an entity created for the purpose, ditto, and governance of that entity needs to be negotiated as well.
  2. The cost of maintaining the representation repository falls on one entity, at least in the first instance.
  3. In the administratively simplest approach, control over naming is centralized too, and this is likely to mean loss of branding. That is, the identity of the creator of a representation is not necessarily obvious from its name.

It should be clear that it is straightforward to use http: URIs to implement this variant of delegation.

Delegation as replication

Or, "Split up, one of us is bound to survive!" Technical replication at the level of DNS cannot solve the problem of domain name loss. Only methods that involve a second domain name can do that.

In the same way that a web cache or proxy server provides an alternative to a standard DNS lookup plus hierarchy, a naming system may specify an arbitrary alternative algorithm for looking up its URIs. Such an algorithm may be invoked as an alternative to other methods that might contain unreliable steps.

This seems too narrow to me (JR); I can think of a zillion variants (eg). Concrete is good, but maybe this is too concrete. Nb what's described here is effectively an alternative DNS...

Delegation here means two lookups plus hierarchy. The first lookup gets you to a naming-system-specific naming authority, whose name is known ahead of time to clients of the naming system. This authority itself implements the second lookup, which leads to the repository where the hierarchical part can be interpreted.
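
A minimal sketch of the two-lookup pattern follows. It replaces real network access with in-memory tables so that the control flow is visible: several naming authorities, known to clients ahead of time, each hold a copy of the table mapping second-level names to repositories, and a client tries them in turn. All host names and table contents are hypothetical, and a table of None stands in for an authority that cannot be reached.

```python
from typing import Optional

# First lookup: naming authorities known ahead of time to clients.
# None simulates an authority that is currently unreachable.
AUTHORITY_TABLES: dict[str, Optional[dict[str, str]]] = {
    "resolver-a.example": None,
    "resolver-b.example": {"studio1": "http://studio1.example/films"},
}


def resolve(second_level: str, local_path: str) -> Optional[str]:
    """Map a (second-level name, local path) pair to a repository URI."""
    for table in AUTHORITY_TABLES.values():
        if table is None:
            continue  # this authority is down; try the next replica
        # Second lookup: from second-level name to the repository's base URI.
        base = table.get(second_level)
        if base is not None:
            # The hierarchical part is interpreted by the repository itself.
            return f"{base}/{local_path}"
    return None
```

Because the replicas live under different domains, losing any one of them does not stop resolution, at the cost of keeping their tables in step.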

The good news is that most of the drawbacks of the centralised approach have been removed:

  1. No more single point of failure: by design, multiple entities owning multiple domains provide replicated second-level name lookup.
  2. (Second-level) name authorities are back to providing for the storage/serving of their own representations.
  3. The hierarchical part of naming is in the hands of (second-level) name authorities, and as long as Domain Names are used as second-level names, branding is retained.

There are downsides of this approach too:

The ARK naming system is a good example of a naming system along these lines using http: URIs.

Hybrid delegation: centralized naming, distributed storage

It should be clear that there is no necessary association between centralisation of naming and centralisation of storage: a middle way that centralises naming, but leaves storage in the hands of content owners, is clearly possible. Doing things this way could also accommodate preservation of branding.