Introduction to XML Schema
Copyright © 2001 Henry S. Thompson
Basic Concepts and Vocabulary
What is an XML application?
We define an XML application as having
A form: what do all the documents involved in this application share?
A vocabulary (elements and attributes)
A grammar (how they are allowed to combine)
A function: what those elements and attributes mean
You already know the basic story about defining a syntax
You can use English (or French or ...)
You have used a DTD
Now you can use an XML Schema
Components of the XML family
XML Namespaces: managing multiple vocabularies
XSLT: transforming XML
XLink/XPointer: connecting XML documents
XML Schema: defining XML document families
XML Query: database-style query language
XML Protocols: XML-based communication
Namespaces for XML
First, an example
<xh:p xmlns:xh='http://www.w3.org/1999/xhtml'>
  So the result can be expressed as <!-- (a+b)^2 -->
  <mml:apply xmlns:mml='http://www.w3.org/TR/REC-MathML'>
    <mml:power/>
    <mml:apply>
      <mml:plus/>
      <mml:ci>a</mml:ci>
      <mml:ci>b</mml:ci>
    </mml:apply>
    <mml:cn>2</mml:cn>
  </mml:apply>
</xh:p>
Namespaces for XML, cont’d
Where did those colons come from?
xh:this, mml:that, xml:the_other
Two communities pushed for namespaces
Vendors, to manage the composition of document fragments
E.g. the inclusion of mathematical formulae in a document
Working groups, to reserve names without compromising users' freedom to name things
E.g. it wouldn't do for XML-link to reserve <link> for simple links, or XSL to reserve <text>
Namespaces, cont'd
A W3C Recommendation was endorsed in January 1999
There was a lot of vendor pressure to get something in place, which caused political tension and at least one resignation from the WG
The example illustrates how namespaces are declared, scoped and used
Namespaces defined
You can use qualified names, consisting of two simple names separated by a colon (:)
The namespace prefix is an abbreviation for a URI which uniquely identifies the owner/meaning/identity of the source of the name
Using a namespace essentially cedes responsibility for the meaning of the qualified names to the owner of the URI
Declaring a namespace
The association between namespace prefixes and URIs is declared using reserved attributes
<doc xmlns:mml='http://www.w3.org/TR/REC-MathML/'>...</doc>
Anywhere inside the above doc element, mml is a legal namespace prefix, standing for the URI given
There is also a mechanism for defining the default (unprefixed) namespace
Declarations are scoped
Prefixed names can be used for
Element type names
Attribute names
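A minimal sketch of how prefixes resolve to URIs, using Python's standard ElementTree parser as a stand-in for any namespace-aware processor. The namespace URIs come from the example slide; the cut-down documents, the a: and b: prefixes, and the helper names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# URIs taken from the example slide; the document text is a cut-down variant.
XHTML = "http://www.w3.org/1999/xhtml"
MATHML = "http://www.w3.org/TR/REC-MathML"

doc = (
    f"<xh:p xmlns:xh='{XHTML}'>"
    f"<mml:apply xmlns:mml='{MATHML}'><mml:plus/></mml:apply>"
    "</xh:p>"
)
# Same names with different prefixes: only the URIs matter.
renamed = (
    f"<a:p xmlns:a='{XHTML}'>"
    f"<b:apply xmlns:b='{MATHML}'><b:plus/></b:apply>"
    "</a:p>"
)


def expanded_names(text):
    """Return every element name as '{namespace URI}local name', in document order."""
    return [element.tag for element in ET.fromstring(text).iter()]


def parses(text):
    """True if the text is well-formed and every prefix it uses is declared in scope."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


names = expanded_names(doc)
print(names)
```

Both documents yield the same list of expanded names, and a document that uses mml: outside the scope of its declaration is rejected by the parser.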
XML Schema: some details
XML Schema is a language for defining the structure of XML documents
Notated in XML itself
So there are elements defined for use in schemas to define ...
Elements :-)
Attributes
Types
Terminology
Documents have structure
Document types
Document instances
Structure can be defined
Informally (D. S. D.)
SGML DTD
XML DTD
Schema using XML
Why validate?
A D. S. D. is a contract between producers and consumers
It provides a guaranteed interface
Producers validate to ensure they are providing what they promised
Consumers validate to check up on producers and to protect their applications
Application authors validate to simplify their task
Leave error detection and analysis to the validating parser
Why validate? cont'd
Validation is fundamental to the distributed application
It guarantees a minimum level of data integrity
Validate early, validate often
Localise the source of error
Schema-based validation gives you more
Type assignment
A simple example
<!ELEMENT text (#PCDATA|emph|name)*>
<!ATTLIST text timestamp NMTOKEN #REQUIRED>
<xs:element name="text">
  <xs:complexType mixed="true">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element ref="emph"/>
      <xs:element ref="name"/>
    </xs:choice>
    <xs:attribute name="timestamp" type="xs:date" use="required"/>
  </xs:complexType>
</xs:element>
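The two declarations of timestamp differ in what they can check: NMTOKEN only constrains the characters, while xs:date assigns a type. A rough sketch of the difference follows. The regular expression is a simplified, ASCII-only approximation of NMTOKEN, and date.fromisoformat is only a stand-in for xs:date; both are assumptions of this sketch, as are the function names.

```python
import re
from datetime import date

# Simplified, ASCII-only approximation of the XML 1.0 NMTOKEN production.
NMTOKEN = re.compile(r"[A-Za-z0-9._:-]+")


def dtd_check(timestamp):
    """DTD view: the attribute value only has to look like a name token."""
    return NMTOKEN.fullmatch(timestamp) is not None


def schema_check(timestamp):
    """Schema view: the value must denote a date; return it, or None if it does not.

    date.fromisoformat stands in for xs:date and ignores time zones and
    other details of the real datatype.
    """
    try:
        return date.fromisoformat(timestamp)
    except ValueError:
        return None


print(dtd_check("next-tuesday"), schema_check("next-tuesday"))
print(dtd_check("2001-03-16"), schema_check("2001-03-16"))
```

A value such as next-tuesday passes the DTD-style check but has no date value, whereas 2001-03-16 passes both and comes back typed.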
The Schema Architecture: Static
A document or an application or a user identifies a schema
document
Document and schema document are well-formed XML
The document is schema-valid w.r.t. the schema
(The schema document is schema-valid w.r.t. the schema for schemas)
The Schema Architecture: Dynamic
An XML application (XSP) which schema-validates
And augments the information with defaults, types, etc.
The state of play
Chartered in the autumn of 1998
Requirements document out in February of 1999
Three component documents
Primer (non-normative)
Structures
Datatypes
8 public working drafts so far
May, September, November 1999
February, April, September, October 2000
March 2001: [contains pointers to previous drafts]
Proposed Recommendation
Member comments due by 16 April 2001
XML Schema: Four requirements
Reconstruct DTD functionality using XML
'Eat your own cooking'
Integrate Namespaces
Modular schemas for modular document types
Provide a usable inventory of basic datatypes
For elements as well as attributes
Support object-oriented design
Kind-of as well as part-of
Modular design
Schemas are about elements and attributes named by qualified names
A pair of namespace name and local name
A schema may include components for multiple namespaces
Schema documents are primarily about one namespace
But you can assemble multiple schema documents to build a single schema
include a schema document for the same namespace
import a schema for another namespace
Simple Type Definitions
Treats attributes and sub-elements the same
A frequently-expressed requirement for XML
We need an inventory of simple types for strings
<xs:attribute name='birthday' type='xs:date'/>
Other built-in simple types:
boolean, number, uriReference, hexBinary, dateTime, duration, ...
QName, NOTATION, ...
integer, NCName, ID, IDREFS, ...
Object-oriented design
Type definitions are distinct from attribute and element declarations
The tag-type distinction
Type definitions can be based on other definitions
restriction
extension
list
union
The XML Schema worldview
Validity and well-formedness are XML 1.0 concepts
They are defined over character sequences
Namespace-compliant is a Namespace concept
It's defined over character sequences too
Schema-validity is the XML Schema concept
It is defined over XML document Infosets
So the whole XML Schema exercise is predicated on and layered on top of XML 1.0 well-formedness plus Namespaces
Because they are constitutive of the Infoset
What's the Infoset?
The XML 1.0 plus Namespaces abstract data model
Defines a modest number of information items
Element, attribute, namespace declaration, ...
Each has required and optional properties
Name, children, …
The Schema and the Infoset
So crucially, schemas are about infosets, not character sequences
You could schema-validate a DOM tree you built by hand!
Using a schema which likewise exists only as a DOM tree you built by hand
This simplifies things tremendously but is hard to get your head around at first
Where did the Infoset come from?
In the interests of time, XML 1.0 did not define its own data model
So XPath had to define it
And XLink had to define it
And the DOM had to define it
Finally, later than we’d have liked, we’re about to get the XML Information Set, or Infoset (now in Last Call)
What’s the Infoset? Take two.
The XML 1.0 plus Namespaces abstract data model
What’s an ‘abstract data model’?
The thing that a sequence of start tags and attributes and character data represents
A formalization of our intuition of what it means to “be the same document”
The thing that’s common to all the uninterestingly different ways of representing it
Single or double quotes
Whitespace inside tags
General entity and character references
Alternate forms of empty content
Specified vs. defaulted attribute values
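To make "uninterestingly different" concrete, here is a small sketch that parses two serializations and compares what the parser reports. ElementTree's tree is only an approximation of the Infoset (it drops comments and prefixes, for instance), and the sample documents and the same_information helper are invented for this illustration.

```python
import xml.etree.ElementTree as ET


def same_information(a, b):
    """Compare two parsed elements by name, attributes, character data and children."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib  # dict comparison ignores attribute order
        and (a.text or "") == (b.text or "")
        and (a.tail or "") == (b.tail or "")
        and len(a) == len(b)
        and all(same_information(x, y) for x, y in zip(a, b))
    )


# Differences: quote style, whitespace inside the tag, attribute order,
# a character reference instead of an entity reference, empty-element form.
first = ET.fromstring("""<doc kind='memo' lang="en"><p>A &amp; B</p><br/></doc>""")
second = ET.fromstring("""<doc  lang='en'
     kind="memo" ><p>A &#38; B</p><br></br></doc>""")
# A genuinely different document: the character data differs.
different = ET.fromstring("<doc kind='memo' lang='en'><p>A and B</p><br/></doc>")

print(same_information(first, second), same_information(first, different))
```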
What does it mean to be ‘abstract’?
The Infoset is a description of the information in a document
It’s a vocabulary for expressing requirements on XML applications
It’s a bit like numbers
As opposed to numerals
If you’re a type theorist
It’s just the definition of the XML Document type
What the Infoset isn’t
It’s not the DOM
Much higher level
It’s not about implementation or interfacing at all
But you can think of it as a kind of fuzzy data structure if that helps
It’s not an SGML property set/grove
But it’s close
Infoset details
Defines a modest number of information items
Element, attribute, namespace declaration, comment, processing instruction, document ...
Each one is composed of properties
Which in turn may have information items as values
Both element and attribute information items have [local name] and [namespace name] properties
Element information items have [children] and [attributes]
Attribute information items have a [normalized value]
The Infoset Revolution
We’ve sort of understood that XML is special because of its universality
Schemas and stylesheets and queries and ... are all notated in XML
But now we can understand this in a deeper way
The Infoset is the common currency of all the XML specs and languages
XML applications can best be understood as Infoset pipelines
Angle brackets and equal signs are just an Infoset’s way of perpetuating itself
The Infoset Pipeline begins
An XML Parser builds an Infoset from a character stream
A streaming parser gives only a limited view of it
A validating parser builds a richer Infoset than a non-validating one
Defaulted values
Whitespace normalisation
Ignorable whitespace
If a document isn’t well-formed or isn’t Namespace-conformant
It doesn’t have an Infoset!
The XML Schema comes next
Validity and well-formedness are XML 1.0 concepts
They are defined over character sequences
Namespace-compliant is a Namespace concept
It’s defined over character sequences too
Schema-validity is the XML Schema concept
It is defined over Infosets
The Infoset grows
Crucially, schemas are about much more than validation
They tell you much more than ‘yes’ or ‘no’
They assign types to every element and attribute information item they validate
This is done by adding properties to the Infoset
To produce what’s called the post schema-validation Infoset (or PSVI)
So schema-aware processing is a mapping from Infosets to Infosets
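A toy sketch of the "Infosets to Infosets" idea: the input is a parsed element, the output is the same information plus a type name and typed value for each declared attribute. This is not how a real schema processor is organised. The DECLARED table, the patient element, the temp attribute and the augment function are all hypothetical, and Python's date and Decimal merely stand in for the slides' xs:date and xs:number.

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal

# Hypothetical declarations: attribute name -> (type name, parser for that type).
DECLARED = {
    "birthday": ("xs:date", date.fromisoformat),
    "temp": ("xs:number", Decimal),
}


def augment(element):
    """Return {attribute name: (type name, typed value)} for declared attributes.

    Raises ValueError when a declared attribute does not fit its type, which
    is the plain yes/no part of validation; the returned mapping is the
    extra information a schema-aware processor would add.
    """
    added = {}
    for name, text in element.attrib.items():
        if name in DECLARED:
            type_name, parse = DECLARED[name]
            try:
                added[name] = (type_name, parse(text))
            except Exception as exc:
                raise ValueError(f"{name}={text!r} is not a valid {type_name}") from exc
    return added


patient = ET.fromstring("<patient birthday='1960-02-29' temp='98.6'/>")
psvi = augment(patient)
print(psvi)
```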
The XML Schema Type System
DTD-based validation is based entirely on element types
XML Schema adds attribute types, simple and complex types to this
Simple types consist of strings
Complex types consist of sets of AIIs (attribute information items) plus sequences of characters and EIIs (element information items)
More terminology
Types are (usually infinite) sets
Type definitions (and element and attribute declarations) are the characteristic functions for such sets
Expressed as necessary and sufficient conditions on membership
Attribute Declarations
The simple case
An association between a qualified name (local name plus optional namespace URI) and a simple type definition
Determines a set of AIIs
[local name] and [namespace name] must match
[normalized value] must satisfy the simple type definition
May be scoped by a particular complex type definition
I.e. two AIIs with the same name may have different types if they occur within different EIIs
May include default/fixed value
Element Declarations
An association between a qualified name and
A type definition (simple or complex)
A set of identity constraints
A substitution group head (optional)
Determines a set of EIIs
[local name] and [namespace name] must match*
[children] must satisfy the type definition
[attributes] must satisfy the type definition
May be scoped by a particular complex type definition
Element Declaration, cont’d
Subtree of IIs rooted at the EII must satisfy the identity constraints, if any
Three kinds of identity constraints, over (sequences of) values identified by XPath expressions:
Unique: no duplicates allowed
Key: no duplicates, must exist
Keyref: must match some value of a named key
*EIIs which satisfy element declarations which name this one as their substitution group head (transitively) are also allowed
May include default/fixed value
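The three identity constraints can be sketched as checks over the list of values that the XPath expressions select. The sketch below skips the selection step and represents a missing value as None; the function names and sample values are illustrative only.

```python
def unique_ok(values):
    """Unique: values that are present must not repeat (None marks a missing value)."""
    present = [value for value in values if value is not None]
    return len(present) == len(set(present))


def key_ok(values):
    """Key: every value must exist, and none may repeat."""
    return None not in values and len(values) == len(set(values))


def keyref_ok(references, key_values):
    """Keyref: every reference that is present must equal some value of the named key."""
    return all(ref in key_values for ref in references if ref is not None)


part_numbers = ["p1", "p2", "p3"]
print(key_ok(part_numbers), keyref_ok(["p2", "p9"], part_numbers))
```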
Simple Type Definitions
Based on ISO 11404
Distinguishes between lexical and value spaces
Identifies fundamental and constraining facets
For example, the number type has
Lexical space: ([+-]?[0-9]*)?(\.[0-9]*)?
Value space: the decimal numbers (those reals with a finite decimal expansion)
Fundamental facets: Ordered: yes; Cardinality: countably infinite; Bounded: no; etc.
Constraining facets: min, max, enumeration, …
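The lexical/value distinction is easy to see with any decimal implementation. In this sketch Python's Decimal plays the role of the number type, which is an assumption of the sketch rather than anything the slides specify.

```python
from decimal import Decimal

# Four different strings in the lexical space ...
lexical_forms = ["98.6", "98.60", "+98.6", "098.6"]

# ... that all denote one and the same point in the value space.
values = {Decimal(text) for text in lexical_forms}

print(len(set(lexical_forms)), len(values))
```

Ordering and bounds are facets of the value space, which is why a comparison such as Decimal("97.0") < Decimal("98.60") does not depend on how the numbers were written.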
Simple type definition example
<xs:simpleType name='bodytemp'>
  <xs:restriction base='xs:number'>
    <xs:totalDigits value='4'/>
    <xs:fractionDigits value='1'/>
    <xs:minInclusive value='97.0'/>
    <xs:maxInclusive value='105.0'/>
  </xs:restriction>
</xs:simpleType>
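Read as a membership test, the bodytemp definition might be sketched as follows. Decimal again stands in for the base type, the facets are applied to the value rather than to the string, and Decimal's constructor accepts a few forms (such as exponent notation) that a real schema processor would reject. The function names are invented for the sketch.

```python
from decimal import Decimal, InvalidOperation


def digit_counts(value):
    """Return (total digits, fraction digits) needed to write a finite Decimal."""
    _sign, digits, exponent = value.normalize().as_tuple()
    fraction = max(0, -exponent)
    total = max(len(digits) + max(exponent, 0), fraction)
    return total, fraction


def is_bodytemp(text):
    """True if text denotes a value allowed by the four facets on the slide."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False
    if not value.is_finite():
        return False
    total, fraction = digit_counts(value)
    return (
        total <= 4                                     # totalDigits
        and fraction <= 1                              # fractionDigits
        and Decimal("97.0") <= value <= Decimal("105.0")  # min/maxInclusive
    )


for sample in ["98.6", "98.65", "106.0", "hot"]:
    print(sample, is_bodytemp(sample))
```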
Complex Type Definitions
Constrains [attributes]
Required/optional
Local or global declarations
Constrains [children]
Finite-state grammar for EII sequence
Interpolated characters allowed or not
Simple type for text-only case
Local or global declarations
Complex Type Definition, cont’d
Membership assessment is two-part
Locally valid
All required attributes present
No non-declared attributes present
Sequence of names of EII children, if any, satisfies content model
Recursively valid
All attributes/children not exempted have known types
All attributes with known types satisfy them
All EII children with known types satisfy them
Wildcards
The <any/> content model particle, in all of its forms, allows EIIs regardless of local name
A true ‘any’, i.e. any well-formed XML
<any/> allows a single well-formed element information item
The namespace attribute allows finer control
##any
##other
##targetNamespace
##local
<anyAttribute/> has similar semantics for attributes
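A sketch of how the four namespace tokens might be interpreted when deciding whether a wildcard admits an element. Only the tokens listed above are handled. The reading of ##other as excluding unqualified elements is an assumption of this sketch, and the example.org URIs are placeholders.

```python
def wildcard_allows(namespace, element_ns, target_ns):
    """Does a wildcard with this namespace attribute admit the element?

    element_ns is the element's namespace name, or None if it has none;
    target_ns is the target namespace of the schema holding the wildcard.
    """
    if namespace == "##any":
        return True
    if namespace == "##other":
        # Assumption of this sketch: unqualified elements are excluded too.
        return element_ns is not None and element_ns != target_ns
    if namespace == "##targetNamespace":
        return element_ns == target_ns
    if namespace == "##local":
        return element_ns is None
    raise ValueError(f"unsupported namespace constraint: {namespace}")


TARGET = "http://example.org/target"   # placeholder URI
FOREIGN = "http://example.org/other"   # placeholder URI
print(wildcard_allows("##other", FOREIGN, TARGET))
```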
Type definition by derivation
XML Schema makes it easy to construct type definitions which
restrict or extend other type definitions, by specifying only the
method of derivation and the differences between the base and
derived type definitions.
Derived type definition
<xs:simpleType name='healthyBodytemp'>
  <xs:restriction base='bodytemp'>
    <xs:maxInclusive value='99.5'/>
  </xs:restriction>
</xs:simpleType>
The healthyBodytemp type definition is defined by closing
down the permitted range of bodytemp. We say it 'inherits' the
other facets of bodytemp, so the 'effective type definition'
of healthyBodytemp is
Effective type definition
<xs:simpleType name='healthyBodytemp'>
  <xs:restriction base='xs:number'>
    <xs:maxInclusive value='99.5'/>
    <xs:totalDigits value='4'/>
    <xs:fractionDigits value='1'/>
    <xs:minInclusive value='97.0'/>
  </xs:restriction>
</xs:simpleType>
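The inheritance of facets can be pictured as a dictionary merge: the derived definition states only what changes, and everything else comes from the base. This is a simplified sketch. A real processor checks that every changed facet narrows the base, whereas the restrict helper here (a made-up name) checks only maxInclusive.

```python
from decimal import Decimal

# Facets of the base type, copied from the bodytemp slide.
BODYTEMP = {
    "totalDigits": 4,
    "fractionDigits": 1,
    "minInclusive": Decimal("97.0"),
    "maxInclusive": Decimal("105.0"),
}


def restrict(base, **changed):
    """Effective facets of a restriction: base facets, overridden where restated."""
    if "maxInclusive" in changed and changed["maxInclusive"] > base["maxInclusive"]:
        raise ValueError("a restriction may not widen the permitted range")
    return {**base, **changed}


HEALTHY = restrict(BODYTEMP, maxInclusive=Decimal("99.5"))
print(HEALTHY)
```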
Extension for complex types
The next simplest case is extension for complex types
Start with this base type
<xs:complexType name='name'>
  <xs:sequence>
    <xs:element name='title' minOccurs='0'/>
    <xs:element name='forename' minOccurs='0' maxOccurs='unbounded'/>
    <xs:element name='surname'/>
  </xs:sequence>
</xs:complexType>
Derived type definition
<xs:complexType name='fullName'>
  <xs:complexContent>
    <xs:extension base='name'>
      <xs:sequence>
        <xs:element name='genMark' minOccurs='0'/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>
The effective type definition
<xs:complexType name='fullName'>
  <xs:sequence>
    <xs:element name='title' minOccurs='0'/>
    <xs:element name='forename' minOccurs='0' maxOccurs='unbounded'/>
    <xs:element name='surname'/>
    <xs:element name='genMark' minOccurs='0'/>
  </xs:sequence>
</xs:complexType>
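In terms of content models, extension appends the new particles to the end of the base sequence. The sketch below checks a list of child element names against such a sequence. The greedy matcher works only because neighbouring particles here never share a name; it is an illustration, not a general content-model validator.

```python
UNBOUNDED = float("inf")

# (element name, minOccurs, maxOccurs), following the declarations on the slides.
NAME = [("title", 0, 1), ("forename", 0, UNBOUNDED), ("surname", 1, 1)]
FULL_NAME = NAME + [("genMark", 0, 1)]  # extension appends to the base sequence


def matches(children, particles):
    """Greedily match child element names against a sequence of particles."""
    position = 0
    for name, min_occurs, max_occurs in particles:
        count = 0
        while (
            position < len(children)
            and children[position] == name
            and count < max_occurs
        ):
            position += 1
            count += 1
        if count < min_occurs:
            return False
    # Every child must have been consumed by some particle.
    return position == len(children)


print(matches(["title", "forename", "surname", "genMark"], FULL_NAME))
```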
Restriction for complex types
Restriction for complex types is harder to handle syntactically,
because of the significance of linear order in content models, but
the semantics are completely parallel to the simple type case:
Restriction example
<xs:complexType name='simpleName'>
  <xs:complexContent>
    <xs:restriction base='name'>
      <xs:sequence>
        <xs:element name='forename' minOccurs='1'/>
        <xs:element name='surname'/>
      </xs:sequence>
    </xs:restriction>
  </xs:complexContent>
</xs:complexType>
Restriction and Inheritance
There must be a one-to-one line-up between the particles in the
restriction and the particles in the base
Unlike the <simpleType> case, what you see is what you get, so the effective type definition of simpleName is just the same
But for attributes, it works like the <simpleType> case, with unmentioned attributes being inherited unchanged
Element Substitution Groups
An element declaration can identify another declaration as something it wants to be equivalent to
<xs:element name='cat' substitutionGroup='pet'/>
Two things follow from this:
The type of cat must be derived from the type of pet
Wherever a pet is allowed, so is a cat:
<element ref='pet'/>
is equivalent to
<choice><element ref='pet'/><element ref='cat'/></choice>
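The transitive reading of substitution groups can be sketched as a small closure computation. Only the cat/pet pair comes from the slide. The dog and kitten entries, and the allowed_for function, are hypothetical additions to show transitivity.

```python
# Element name -> its substitution group head. 'cat' -> 'pet' is from the
# slide; 'dog' and 'kitten' are hypothetical additions for this sketch.
SUBSTITUTION_GROUP = {"cat": "pet", "dog": "pet", "kitten": "cat"}


def allowed_for(head):
    """All element names acceptable where a reference to head appears."""
    allowed = {head}
    changed = True
    while changed:
        changed = False
        for member, its_head in SUBSTITUTION_GROUP.items():
            if its_head in allowed and member not in allowed:
                allowed.add(member)
                changed = True
    return allowed


print(sorted(allowed_for("pet")))
```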
Union types
<simpleType name="maxType">
  <union memberTypes="nonNegativeInteger">
    <simpleType>
      <restriction base="token">
        <enumeration value="unbounded"/>
      </restriction>
    </simpleType>
  </union>
</simpleType>
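A value belongs to a union if it belongs to any member type. The maxType definition above could be sketched as a parser that tries the members in the order they are listed. The regular expression is a simplified lexical form for nonNegativeInteger, and parse_max_type is a name invented for this sketch.

```python
import re

# Simplified lexical form: optional plus sign, then digits.
NON_NEGATIVE_INTEGER = re.compile(r"\+?[0-9]+")


def parse_max_type(text):
    """Return an int, or the string 'unbounded', or raise ValueError."""
    token = " ".join(text.split())  # collapse whitespace, as for token
    if NON_NEGATIVE_INTEGER.fullmatch(token):
        return int(token)           # first member: nonNegativeInteger
    if token == "unbounded":
        return token                # second member: the one-value enumeration
    raise ValueError(f"{text!r} is not a valid maxType")


print(parse_max_type("3"), parse_max_type(" unbounded "))
```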
Open Enumerations
<simpleType name="color">
  <union>
    <simpleType>
      <restriction base="token">
        <enumeration value="red"/>
        <enumeration value="green"/>
        <enumeration value="blue"/>
      </restriction>
    </simpleType>
    <simpleType>
      <restriction base="token"/>
    </simpleType>
  </union>
</simpleType>
Conclusions
XML Schema has a substantial inventory of mechanisms for defining
the structure of documents
Its type system is the basis for the interface between application
semantics and transfer syntax
The Infoset is the abstraction which application developers should
think in terms of