The XML Meta-Architecture

Henry S. Thompson

HCRC Language Technology Group
University of Edinburgh

World Wide Web Consortium

Presented at XML DevCon, London, 2001-02-21

© 2001 Henry S. Thompson

XML has grown

XML the language


A great success

As long as you keep your expectations suitably low




XML Schema

Canonical XML/XML Signatures


XML Query/XML Protocols

What’s missing?

In the interests of time, XML 1.0 did not define its own data model

So XPath had to define it

And XLink had to define it

And the DOM had to define it

Finally, later than we’d have liked, we’re about to get

The XML Information Set

Or Infoset

(now in Last Call)

What’s the Infoset?

The XML 1.0 plus Namespaces abstract data model

What’s an ‘abstract data model’?

The thing that a sequence of start tags and attributes and character data represents

A formalization of our intuition of what it means to “be the same document”

The thing that’s common to all the uninterestingly different ways of representing it

Single or double quotes

Whitespace inside tags

General entity and character references

Alternate forms of empty content

Specified vs. defaulted attribute values
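
For instance, the following two serialisations (element and attribute names invented for illustration) differ only in quote style, whitespace inside tags, a character reference and the empty-element form, and so correspond to the same Infoset:

 <order date='2001-02-21'><item qty="1"></item></order>

 <order   date="2001-02-21" ><item qty='&#49;'/></order>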

What does it mean to be ‘abstract’?

The Infoset is a description of the information in a document

It’s a vocabulary for expressing requirements on XML applications

It’s a bit like numbers

As opposed to numerals

If you’re a type theorist

It’s just the definition of the XML Document type

What the Infoset isn’t

It’s not the DOM

Much higher level

It’s not about implementation or interfacing at all

But you can think of it as a kind of fuzzy data structure if that helps

It’s not an SGML property set/grove

But it’s close

Infoset details

Defines a modest number of information items

Element, attribute, namespace declaration, comment, processing instruction, document ...

Each one is composed of properties

Which in turn may have information items as values

Both element and attribute information items have [local name] and [namespace URI] properties

Element information items have [children] and [attributes]

Attribute information items have a [normalized value]
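
As a small, made-up illustration, this fragment

 <inv:price xmlns:inv="http://example.org/invoice" currency="GBP">42</inv:price>

yields an element information item with [local name] price, [namespace URI] http://example.org/invoice, a [children] property holding one character information item per character of "42", and an [attributes] property holding an attribute information item with [local name] currency and [normalized value] "GBP"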

For more details, see my colleague Richard Tobin’s talk on Thursday

He’s the editor of the Infoset spec.

The Infoset Revolution

We’ve sort of understood that XML is special because of its universality

Schemas and stylesheets and queries and … are all notated in XML

But now we can understand this in a deeper way

The Infoset is the common currency of all the XML specs and languages

XML applications can best be understood as Infoset pipelines

Angle brackets and equal signs are just an Infoset’s way of perpetuating itself

The Infoset Pipeline begins

An XML Parser builds an Infoset from a character stream

A streaming parser gives only a limited view of it

A validating parser builds a richer Infoset than a non-validating one

Defaulted values

Whitespace normalisation

Ignorable whitespace
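
To illustrate the first of these: if the external DTD subset declares (a made-up example)

 <!ATTLIST item currency CDATA "GBP">

then a validating parser reports a currency attribute information item with [normalized value] "GBP" on <item/> even though none was written, while a non-validating parser that skips the external subset reports no such attribute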

If a document isn’t well-formed, or isn’t Namespace-conformant

It doesn’t have an Infoset!

XML Schema comes next

Validity and well-formedness are XML 1.0 concepts

They are defined over character sequences

Namespace-conformance is a Namespaces concept

It’s defined over character sequences too

Schema-validity is the XML Schema concept

It is defined over Infosets

The Schema and the Infoset

So crucially, schemas are about infosets, not character sequences

You could schema-validate a DOM tree you built by hand!

Likewise, using a schema which itself exists only as data structures

The Infoset grows

Crucially, schemas are about much more than validation

They tell you much more than ‘yes’ or ‘no’

They assign types to every element and attribute information item they validate

This is done by adding properties to the Infoset

To produce what’s called the post schema-validation Infoset (or PSVI)

So schema-aware processing is a mapping from Infosets to Infosets
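
For example, given a declaration like this one (names invented; xs bound to the XML Schema namespace)

 <xs:element name="quantity" type="xs:positiveInteger"/>

schema-validating <quantity>3</quantity> leaves the element information item carrying, roughly, [validity] = valid and a [type definition] that points at xs:positiveInteger, on top of its ordinary Infoset properties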

The Infoset is transformed

XSLT 1.0 defined its own data model

And distinguished between source and result models

XSLT 2.0 will unify the two

And make use of the Infoset abstraction to describe them

So XSLT will properly be understood as mapping from one Infoset to another
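
The classic XSLT 1.0 identity transformation makes the point in miniature: read one Infoset, write out an equivalent one

 <xsl:stylesheet version="1.0"
                 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <!-- copy every node, attribute included, unchanged -->
   <xsl:template match="@*|node()">
     <xsl:copy>
       <xsl:apply-templates select="@*|node()"/>
     </xsl:copy>
   </xsl:template>
 </xsl:stylesheet>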

The Infoset is composed

XLink resources (the things pointed to by XPointers) can now be understood as items in Infosets

The XInclude proposal in particular fits into my story

It provides for the merger of (parts of) one Infoset into another
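
A sketch (the file name is invented, and the exact xi namespace URI depends on which draft is current):

 <report xmlns:xi="http://www.w3.org/2001/XInclude">
   <title>Annual report</title>
   <!-- the information items parsed from chapter1.xml are merged in here -->
   <xi:include href="chapter1.xml"/>
 </report>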

The Infoset is accessed

XML Query of course provides for more sophisticated access to the Infoset

It also allows structuring of the results into new Infoset items

The Infoset is transmitted

And finally XML Protocol can best be understood as parcelling up information items and shipping them out to be reconstructed elsewhere
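
Schematically, and with an envelope vocabulary invented purely for illustration (not the actual XML Protocol one): the sender serialises the relevant information items inside an envelope, and the receiver parses them back into an Infoset

 <env:Envelope xmlns:env="http://example.org/envelope">
   <env:Body>
     <po:order xmlns:po="http://example.org/po">
       <po:item qty="3">widget</po:item>
     </po:order>
   </env:Body>
 </env:Envelope>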

A big step forward

This is so much better than the alternatives


Pretending to talk about character sequences all the time


Requiring each member of the XML standards family to define its own data model

Schemas at the heart

I would say that, wouldn’t I :-)

Seriously, schema processing can be integrated into this story in a way DTDs could not

You may want to schema-process both before and after XInclude

Or between every step in a sequence of XSLT transformations

We actually are missing a piece of the XML story

How do we describe Infoset pipelines?

Types and the Infoset

The most important contribution to the PSVI

Every element and attribute information item is labelled with its type

Integer, date, boolean, …

Address, employee, purchaseOrder

XPath 2.0 and XML Query will be type-aware

Types will play a key role in the next generation of XML applications

XML is ASCII for the 21st century

ASCII (ISO 646) solved a fundamental interchange problem for flat text documents

What bits encode what characters

(For a pretty parochial definition of 'character')

Unicode/ISO 10646 extends that solution to the whole world

XML thought it was doing the same for simple tree-structured documents

The emphasis in the XML design was on simplifying SGML to move it to the Web

XML didn't touch SGML's architectural vision

flexible linearisation/transfer syntax

for tree-structured prose documents with internal links

The alternative take on XML?

It's a markup language used for transferring data

It is concerned with data models

to convert between application-appropriate and transfer-appropriate forms

It is not concerned with human beings

It's produced and consumed by programs

Application data

Structured markup, for example:

 <DATETIME qualifier="DOCUMENT">
 <OPERAMT qualifier="EXTENDED" type="T">
  . . .

What just happened!?

The whole transfer syntax story just went meta, that's what happened!

XML has been a runaway success, on a much greater scale than its designers anticipated

Not for the reason they had hoped

Because separation of form from content is right

But for a reason they barely thought about

Data must travel the web

Tree-structured documents (Infosets) are a usable transfer syntax for just about anything

So data-oriented web users think of XML as a transfer mechanism for their data

The new challenge

So how do we get back and forth between application data and the Infoset?

Old answer

Write lots of script

New answer

Exploit schemas and types

A type may be either

simple, for constraining string values

complex, for constraining elements which contain other elements
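
A minimal sketch of both kinds, with invented names (namespace URI as in the current drafts):

 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

   <!-- simple type: constrains a string value -->
   <xs:simpleType name="quantity">
     <xs:restriction base="xs:positiveInteger"/>
   </xs:simpleType>

   <!-- complex type: constrains which elements and attributes may appear -->
   <xs:complexType name="purchaseOrder">
     <xs:sequence>
       <xs:element name="item" type="xs:string" maxOccurs="unbounded"/>
     </xs:sequence>
     <xs:attribute name="orderDate" type="xs:date"/>
   </xs:complexType>

 </xs:schema>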

Mapping between layers

We can think of this in two ways

In terms of abstract data modelling languages




In concrete implementation terms

Tables and rows

Class instances and instance variables

The first is more portable

The second more immediately useful

Mapping between layers 2

Regardless of what approach we take, we need

A vocabulary of data model components

An attachment of that vocabulary to types

Sample vocabularies

entity, relationship, collection

table, row, column

instance, variable, list, dictionary

Where should attachment be specified?

In the schema


Outside it
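
One sketch of the first option, using XML Schema's annotation hook (the mapping vocabulary here is entirely invented, but xs:annotation/xs:appinfo is the standard place to put such application information):

 <xs:complexType name="purchaseOrder">
   <xs:annotation>
     <xs:appinfo>
       <!-- invented mapping vocabulary: realise this type as a table -->
       <map:table xmlns:map="http://example.org/mapping"
                  name="PURCHASE_ORDER" key="orderDate"/>
     </xs:appinfo>
   </xs:annotation>
   <xs:attribute name="orderDate" type="xs:date"/>
 </xs:complexType>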


Overall Conclusion

Think about things in terms of Infosets and Infoset pipelines




Use XML Schema and its type system to facilitate mapping

Unmarshalling is easy

Marshalling takes a little longer