Towards a logical foundation for XML Schema

Keywords: XML Schema

Henry S. Thompson
Markup Technology
University of Edinburgh
World Wide Web Consortium
Edinburgh
Scotland
ht@markuptechnology.com
http://www.markuptechnology.com/

Biography

Henry S. Thompson is Managing Director of Markup Technology Ltd. He is also Reader in Artificial Intelligence and Cognitive Science in the Division of Informatics at the University of Edinburgh, based in the Language Technology Group of the Human Communication Research Centre. He received his Ph.D. in Linguistics from the University of California at Berkeley in 1980. His university education was divided between Linguistics and Computer Science, in which he holds an M.Sc. His research interests have ranged widely, including natural language parsing, speech recognition, machine translation evaluation, modelling human lexical access mechanisms, the fine structure of human-human dialogue, language resource creation and architectures for linguistic annotation. His current research is focussed on articulating and extending the architectures of XML. Markup Technology is focussed on delivering the power of XML pipelines for application development. Henry was a member of the SGML Working Group of the World Wide Web Consortium which designed XML, is the author of XED, the first free XML instance editor, and co-author of the LT XML toolkit, and is currently a member of the XSL and XML Schema Working Groups of the W3C. He is currently a member of the W3C Team, and is lead editor of the Structures part of the XML Schema W3C Recommendation, for which he co-wrote the first publicly available implementation, XSV. He has presented many papers and tutorials on SGML, DSSSL, XML, XSL and XML Schemas in both industrial and public settings over the last five years.


Abstract


This paper defines a logic in which to express constraints on W3C XML Schema components and relationships between components and XML infoset items, for use in a formal rewriting of the W3C XML Schema Recommendations. The logic is essentially a constraint language over path expressions, interpreted equally with respect to an Infoset graph or a schema component graph. By 'logic' I mean the traditional three-part story comprised of a sentential form, a model theory and an interpretation.


Table of Contents


Introduction
Illustrative rewrite
A preliminary sketch of the logic
     Sentential form
     The Model
     Signatures
     Interpretation and satisfaction
Open questions and next steps
     Component-valued properties identified by name
     Focussing on Infosets
Bibliography

Introduction

In the period after the publication of [XML Schema], the W3C's XML Schema Working Group received frequent requests to tighten up the normative definitions of schema components, both in terms of the constraints on their well-formedness and in terms of their validation semantics. And as our experience has grown, the Working Group has wished to clarify the fundamental nature of schemas as such, for instance when discussing a possible move to subsumption-based restriction checking (see [FSA for Schema Restriction]).

I have thought for some time that the basis for a response to these requests lies in the work of Rounds, Moshier, Johnson, Carpenter and others on a logic of feature structures. This paper is a preliminary sketch of what such a logic might look like, and how it might be used, intended to provide sufficient detail to assess the likely value of the idea, without being anywhere near complete.

The basic idea is to define a logic in which to express constraints on components and relationships between components and infoset items. By 'a logic' I mean a pretty traditional three-part story:

  1. a sentential form, defining the well-formed formulae (WFFs);
  2. a model theory, defining the structures with respect to which WFFs are interpreted;
  3. an interpretation, mapping WFFs onto truth conditions on models.

A brief trivial example may help explain what these three things are. First, a BNF to define our WFFs:
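Something like the following would do (a conjectural reconstruction on my part, consistent with the examples and worked interpretation below; the bird and colour vocabularies are merely illustrative):

WFF ::= atom | WFF 'and' WFF | WFF 'or' WFF
atom ::= birds 'are' colour | birds 'are not' colour
birds ::= 'doves' | 'robins' | 'swans'
colour ::= 'black' | 'white'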

Some examples of WFFs are then
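(these are conjectural, following the grammar sketched above; the last is the WFF whose interpretation is worked out below)

doves are black
doves are white and robins are black
doves are black or doves are not black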

Our models are very simple -- a model consists of a set of individuals (the domain) and a set of named disjoint subsets of that set (the colourings), e.g.

domain
{a, b, c, d, e, f, g, h}
colourings
{black:{a, b, c}, white:{d, e, f, g}}

Finally the constructive definition of validity. This is expressed in terms of a mapping from WFFs to conditions on the model. We use double square-brackets 〚 〛 to denote this mapping. We treat our birds and stuff as variables, that is, they have an arbitrary association with individuals in a model. We denote the variable assignment function which establishes this assignment using the Greek letter iota (ι). The mapping rules then are as follows, using X and Y as meta-variables over parts of WFFs:
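Plausible clauses (again a conjectural reconstruction, consistent with the worked example which follows) would be:

〚X and Y〛 = 〚X〛 ∧ 〚Y〛
〚X or Y〛 = 〚X〛 ∨ 〚Y〛
〚X are C〛 = ι(X) ∈ C, where C is the colouring of that name
〚X are not C〛 = ι(X) ∉ C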

So, given the specific model above and a variable assignment which maps 'doves' to 'f', we get, for the last example WFF above, the interpretation:

( f ∈ {a, b, c} ) ∨ ( f ∉ {a, b, c} )

which is evidently true. Furthermore, for models in which every individual has a colour, it should be clear that it's a tautology, that is, it's true in all such models for any variable assignment.

This may serve to explain why Frege was not being childish when he, famously, said

〚Snow is white〛 is true just in case 〚snow〛 is 〚white〛 [punctuation added :-]

So the idea for XML Schema is to define a sentential form which can express all component constraints and validation rules, a model which formalises components and the PSVI, and an interpretation which maps WFFs from the sentential form onto set-theoretic constraints on the model. We won't do anywhere near all the work ourselves -- the sentential form is a modest extension of the standard sentential form for first-order predicate calculus with equality, and the model and interpretation bootstrap from both first-order logic and graph theory.

Before beginning the formal definitions required, the next section provides an extended worked example.

Illustrative rewrite

There follows a rewrite in the proposed new approach of the Schema Component Constraint: Complex Type Definition Properties Correct from [XML Schema]. The original looks like this:

  1. The values of the properties of a complex type definition must be as described in the property tableau in The Complex Type Definition Schema Component (§3.4.1), modulo the impact of Missing Sub-components (§5.3).
  2. If the {base type definition} is a simple type definition, the {derivation method} must be extension.
  3. Circular definitions are disallowed, except for the ·ur-type definition·. That is, it must be possible to reach the ·ur-type definition· by repeatedly following the {base type definition}.
  4. Two distinct attribute declarations in the {attribute uses} must not have identical {name}s and {target namespace}s.
  5. Two distinct attribute declarations in the {attribute uses} must not have {type definition}s which are or are derived from ID.

The rewrite would look something like this, with informal glosses on formal statements of the constraints. We use a variable C to stand for the type definition node itself, and double vertical bars to signal dereferencing of a QName, with the sort of component indicated by a subscript, e.g. t for a type definition. We omit the first constraint, which will be handled elsewhere via signatures for the sorts; a small executable sketch of constraint 3 follows the list:

  1. [skipped]
  2. C∙base type definition ISA Simple Type Definition
       ⇒
    C∙derivation method = extension
    Derivation from simple type definitions must be by extension only.
  3. ‖xs:anyType‖t ∈ C∙base type definition*
    Circular derivations are not allowed -- all derivations originate with the ·ur-type definition·.
  4. ∀ a1, a2 ∈ C∙attribute uses ,
      (a1∙name = a2∙name ∧ a1∙target namespace = a2∙target namespace)
        ⇒
      a1 = a2
    Only one attribute declaration per qualified name.
  5. ∀ a1, a2 ∈ C∙attribute uses ,
      (‖xs:ID‖t ∈ a1∙type definition∙base type definition* ∧
      ‖xs:ID‖t ∈ a2∙type definition∙base type definition*)
        ⇒
      a1 = a2
    Only one declaration of type ID allowed.
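To make the flavour of these constraints concrete, here is a minimal Python sketch of constraint 3. The representation is hypothetical: components as objects with a base_type_definition attribute (a name invented here for illustration), and ur_type standing for the ·ur-type definition·, whose {base type definition} is itself:

def reaches_ur_type(c, ur_type):
    # Follow {base type definition} links from c; succeed on reaching
    # the ur-type definition, fail if we revisit a type first.
    seen = set()
    t = c
    while t is not ur_type:
        if id(t) in seen:
            return False  # a circular derivation that bypasses the ur-type
        seen.add(id(t))
        t = t.base_type_definition
    return True

On a well-formed component graph this either terminates at the ur-type (which is its own base type) or detects a cycle.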

A preliminary sketch of the logic

Sentential form

We use the same sentential form as standard first order predicate calculus with equality, replacing the standard atoms with paths. A path is a specification of a route through the component/PSVI graph, denoting a (possibly empty) set. It starts from a graph node, and takes any number of property-name-labelled steps (∙ is used to join steps, * indicates transitive closure):
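A grammar for paths might read as follows (my reconstruction; sort-key is the subscript convention used with ‖ ‖ below):

path ::= start step*
start ::= node-variable | '‖' QName '‖' sort-key
step ::= '∙' property-name '*'?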

For example, here's a familiar path, assuming S is a node variable bound to a simple type definition:

S∙base type definition∙variety

This denotes the {variety} (atomic, list or union) of the {base type definition} of S.

The optional star indicates transitive closure, so for example

S∙base type definition*∙name

would denote a set consisting of the {name}s of every type in the {base type definition} chain up from S.

The double vertical bar indicates QName lookup, with the sort key identifying the sort of the named component. The question of where namespace bindings come from for this remains to be addressed. For example, here's a path from a top-level element named my:doc to its {type definition}:

‖my:doc‖e∙type definition

The Model

The model starts from the standard formulation of a labelled directed graph, and adds some bells and whistles we'll need. A model is a model for a component language. A component language is a four-tuple < S, P, A, G >:

S
a set of sorts, e.g. Complex Type Definition, Attribute Information Item
P
a set of property names e.g. base type definition, validation attempted
A
a union of disjoint sets of atomic values A1 ∪ A2 ∪ … ∪ An, e.g. strings, numbers, dates
G
a sort signature function G: S → 2^X, assigning each sort a set of property signatures (five-tuples), where
   X = P ×
   {required, optional} ×
   {component, infoitem, micro-component, literal} ×
   {singleton, set, sequence} ×
   ({A1, …, An} ∪ S)

Given a component language, a model for that language is a triple < N, E, L >:

N
a set of nodes
E
a set of directed labelled edges, a subset of N × ( P ∪ ( P × ℕ ) ) × N, where ℕ is the set of positive integers used to index sequence-valued properties
L
a node labelling function L: N → S ∪ A

This model does not represent sets as such; rather, it allows multiple edges with the same label. That is, a property such as [children] will appear in our model as zero or more child edges. Indexed edge labels provide for sequences, so that for instance a model group with two particles would be represented in the model by a node with edges labelled <particle,1> and <particle,2>.

We constrain our models to use 1-origin indexing and to make coherent use of indexed labels as follows:

∀ <n, <p, i>, m> ∈ E , (i > 0) ∧ (∀ j , 0 < j < i ⇒ ∃ m′ ∈ N , <n, <p, j>, m′> ∈ E)

We constrain node labels so that atomic values label leaf vertices and sorts label the rest:

∀ n ∈ N , (L(n) ∈ A) ≡ ¬ (∃ l, m , <n, l, m> ∈ E)

This approach (instead of just partitioning N into sorted internal nodes and atomic values) is perhaps overkill -- it allows us to represent bags of atomic values, which I don't believe is required by anything in the PSVI or component inventory.
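As a concrete, purely illustrative rendering of the two well-formedness constraints above, here is a minimal Python sketch. The encoding of models is my own assumption: edges as (source, label, target) triples, with an indexed label encoded as a (property, index) pair:

def well_formed(nodes, edges, labelling, atoms):
    # First constraint: indexed labels are 1-origin and contiguous
    # for each (source node, property) pair.
    for (n, lab, _) in edges:
        if isinstance(lab, tuple):
            p, i = lab
            indices = {l[1] for (m, l, _) in edges
                       if m == n and isinstance(l, tuple) and l[0] == p}
            if i < 1 or indices != set(range(1, max(indices) + 1)):
                return False
    # Second constraint: atomic values label exactly the leaf nodes;
    # sorts label the rest.
    for n in nodes:
        has_outgoing = any(src == n for (src, _, _) in edges)
        if (labelling[n] in atoms) == has_outgoing:
            return False
    return True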

Signatures

Signatures are the way we specify what properties go with what sorts, and what we can expect about their values.

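For example, a plausible signature for the Attribute Use sort (the tuples here are my conjecture, built from its three properties {required}, {attribute declaration} and {value constraint}) is

G(Attribute Use) =
   { <required, required, literal, singleton, booleans>,
     <attribute declaration, required, component, singleton, Attribute Declaration>,
     <value constraint, optional, micro-component, singleton, Value Constraint> }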
corresponding to the definition of the Attribute Use component in [XML Schema].

Interpretation and satisfaction

We define the relationship between the sentential form and the model in terms of the normal satisfaction relation, written with the double-turnstile character (⊨). Here are enough examples of what we add to the rules we inherit from first-order predicate logic with equality to make the project clear, I hope, without pretending to be complete.
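For instance, clauses along the following lines would be needed (this is my sketch, not a complete definition; Π ranges over paths, v over node variables, and 〚 〛 is taken relative to a model < N, E, L > and assignment ι):

〚v〛 = { ι(v) }
〚‖q‖s〛 = the singleton containing the node for the sort-s component named q, if there is one, else {}
〚Π∙p〛 = { m | ∃ n ∈ 〚Π〛 , ( <n, p, m> ∈ E ∨ ∃ i , <n, <p, i>, m> ∈ E ) }
〚Π∙p*〛 = the closure of 〚Π〛 under ∙p steps
M, ι ⊨ Π1 = Π2 iff 〚Π1〛 = 〚Π2〛
M, ι ⊨ Π1 ∈ Π2 iff 〚Π1〛 ⊆ 〚Π2〛 and 〚Π1〛 is a singleton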

Open questions and next steps

This section surveys the areas in which work remains to be done before the logic proposed here might be ready for use in a revised version of [XML Schema].

Component-valued properties identified by name

As presented above, the logic makes no provision for anything less than complete schemas, with no dangling references -- indeed, the whole issue of schema construction has been ignored. If we want to use this approach to address schema construction and composition, we need to look more closely at this area.

Ignoring annotation-related properties and the schema component itself, [XML Schema] has the following component-valued properties today (* means that when a schema is constructed from a schema document, the value is always established by dereferencing a QName, ? means it may be established by dereferencing):

Attribute_Declaration
a-simple_type_definition?
singleton Simple_Type_Definition
Attribute_Group_Definition
ag-attribute_use
set Attribute_Use
ag-attribute_wildcard
singleton Wildcard
Attribute_Use
attribute_declaration?
singleton Attribute_Declaration
Complex_Type_Definition
ct-attribute_use
set Attribute_Use
ct-attribute_wildcard
singleton Wildcard
ct-base_type_definition?
singleton Type_Definition
Element_Declaration
class_exemplar*
singleton Element_Declaration
identity-constraint_definition
set Identity-Constraint_Definition
type_definition?
singleton Type_Definition
Identity-Constraint_Definition
referenced_key*
singleton Identity-Constraint_Definition
Model_Group
particle
sequence Particle
Model_Group_Definition
model_group
singleton Model_Group
Particle
term?
singleton Term
Simple_Type_Definition
facet
set Facet
fundamental_facet
set Fundamental_Facet
st-base_type_definition?
singleton Simple_Type_Definition
st-item_type_definition?
singleton Simple_Type_Definition
st-member_type_definition?
sequence Simple_Type_Definition
st-primitive_type_definition?
singleton Simple_Type_Definition

There are actually only a small number of cases here:

  1. Declarations in their context of use (Attribute Declarations in Attribute Uses, Element Declarations in Particles), which may be either referenced by name or locally declared;
  2. Type definitions of Declarations, which may be either referenced by name or locally declared;
  3. Base type definitions, almost always referenced by name, with a few corner cases involving the ur-type;
  4. Collateral uses: substitution groups, referenced keys, item and member type definitions, almost always referenced by QName;
  5. Named attribute groups, which are a major embarrassment, because they disappear when referenced. I'm inclined to say we actually need to change that, i.e. change the definition of Attribute Use to allow it to contain a group as well as single attributes. This would also require a different treatment of attribute wildcards.

There are a number of ways we could deal with QName resolution and laziness -- I've picked one to illustrate how it would be formalised.

Suppose the signatures of all * and ? properties have in fifth (value sort) place not only every 'real' value sort but also an expanded-name micro-component placeholder sort with a related name (e.g. Element Declaration Name goes with Element Declaration). Now we can actually define what it means for a schema to be complete:

We say a model < N, E, L > is complete if and only if

∀ o ∈ N , L(o) ∈ S ⇒ ¬ ∃ p, x, a, s , ( <p, x, component, a, s> ∈ G(L(o)) ∧ s ∈ S_N )

where by S_N I mean the set of 'placeholder' sorts defined above. This formalises the notion of completeness as the state in which no component-valued property of any component has as its value a placeholder component.

Focussing on Infosets

The design presented here intentionally covers both schema components and infoset information items. Some of its complexity derives from needing to cover both -- it would be worth exploring a logic for the infoset alone, comparable to [The Essence of XML].

Bibliography

[XML Schema]
Thompson, Henry S., Mendelsohn, N., Maloney, M. and D. Beech, eds. XML Schema Part 1: Structures, W3C, Boston, 2001. Available online as http://www.w3.org/TR/xmlschema-1/. The particular constraint referred to is at http://www.w3.org/XML/Group/2002/09/xmlschema-1/#ct-props-correct.
[FSA for Schema Restriction]
Thompson, Henry S. and Richard Tobin, Using Finite State Automata to Implement W3C XML Schema Content Model Validation and Restriction Checking, W3C, London, 2003. Available online as http://www.idealliance.org/papers/dx_xmle03/papers/02-02-05/02-02-05.html.
[The Essence of XML]
Siméon, Jérôme and Philip Wadler, The Essence of XML, POPL 2003. Available online as http://homepages.inf.ed.ac.uk/wadler/papers/xml-essence/xml-essence.pdf.
