Using Finite State Automata to Implement W3C XML Schema Content Model Validation and Restriction Checking

Keywords: Content Model, Schema, W3C XML Schema, XML

Henry S. Thompson
University of Edinburgh
Division of Informatics
Scotland
ht@inf.ed.ac.uk

Biography

Henry S. Thompson is Reader in Artificial Intelligence and Cognitive Science in the Division of Informatics at the University of Edinburgh, based in the Language Technology Group of the Human Communication Research Centre, and Managing Director of Markup Technology Ltd. He received his Ph.D. in Linguistics from the University of California at Berkeley in 1980. His university education was divided between Linguistics and Computer Science, in which he holds an M.Sc. His current research is focussed on articulating and extending the architectures of XML. He was a member of the SGML Working Group of the World Wide Web Consortium which designed XML, is the author of the XED, the first free XML instance editor and co-author of the LT XML toolkit and is currently a member of the XSL and XML Schema Working Groups of the W3C. He currently splits his time between the World Wide Web Consortium, the University of Edinburgh and Markup Technology, and is lead editor of the Structures part of the XML Schema W3C Recommendation, for which he co-wrote the first publicly available implementation, XSV. He has presented many papers and tutorials on SGML, DSSSL, XML, XSL and XML Schemas in both industrial and public settings over the last five years.

Richard Tobin
University of Edinburgh
Division of Informatics
Scotland
richard@inf.ed.ac.uk

Biography

Richard Tobin is a researcher in the Language Technology Group (LTG) in the Human Communication Research Centre, University of Edinburgh. He received his BA in Mathematics from Cambridge in 1983. His current research is in the area of XML, which is used in LTG for representing and processing linguistic data, including large corpora of natural language. He is the author of RXP, a widely-used validating XML parser, and is co-author of XSV, an XML Schema validator. He is a member of the W3C's XML Core Working Group, and was co-editor of the XML Infoset specification. He is also editor of the proposed 1.1 revision of the XML Namespaces specification.


Abstract


Implementing validation and restriction checking for W3C XML Schema content models is harder than for DTDs. This paper gives complete details on how to convert W3C XML Schema content models to Finite State Automata, including handling of numeric exponents and wildcards. Enforcing the Unique Particle Attribution constraint and implementing restriction checking in polynomial time using these FSAs is also described.


Table of Contents


The Problems and an Overview of our Solutions
Finite State Automata for Content Model computations
     Algorithm 1: Conversion to FSA
         XML Schema abstract data model
         Algorithm details
              Algorithm Tp(S)
              Algorithm Tt(S)
         Comments on the algorithm
     Algorithm 2: UPA
         Algorithm details
         Comments on the algorithm
     Algorithm 3: Subsumption
         Algorithm details
              Matches
         A problem and a solution
Additional issues
     Further properties of element declarations
     Another problem with wildcards
Notes on complexity
Acknowledgements
Bibliography

The Problems and an Overview of our Solutions

W3C XML Schema content models [XML Schema] are richer and more powerful than their XML DTD [XML] equivalent in several ways:

  1. Numeric occurrence ranges;
  2. Substitution groups;
  3. Wildcards.

Content models are also required to obey the Unique Particle Attribution constraint, a weak form of determinism requirement, and, if derived by restriction, to stand in a certain relationship to the content model of the base from which they are derived.

In this paper we present a complete approach to all these issues, based on the translation of content models into augmented Finite State Automata (hereafter FSAs).

Three algorithms are presented:

Algorithm 1:
A modified version of Aho and Ullman's algorithm [Aho & Ullman] for converting regular expressions to FSAs, in which edges are labelled with term schema components. The modifications treat substitution group as disjunctions and handle occurrence indicators other than 0, 1 or unbounded by unrolling the sub-expression governed by the indicator;
Algorithm 2:
Enforcement of the Unique Particle Attribution constraint by relabelling the FSA constructed by Algorithm 1: Conversion to FSA with expanded element names, and then checking for ambiguity in the element-element, element-wildcard and wildcard-wildcard cases.
Algorithm 3:
The current W3C XML Schema Recommendation [XML Schema] defines valid restriction by cases, in a way which not only is difficult to understand and implement, but which also fails to implement the stated design goal for restriction, namely that a restriction is valid if and only if it accepts a subset of the information items accepted by its base. The W3C XML Schema Working Group has expressed an interest in replacing the current definition with an extensional one, which simply uses the subset formulation. But doing this without specifying how it can be checked would be unhelpful to implementors to say the least. The third algorithm achieves this:
A subsumption check algorithm for two augmented FSAs, which essentially works by attempting to parse every path through the derived FSA using the base FSA, with special attention to certain corner cases to do with wildcards. If the parse succeeds, the base FSA subsumes the derived FSA, which in turn therefore accepts a subset of what the base accepts.

Finite State Automata for Content Model computations

XML DTD content models are regular expressions over an atomic term vocabulary of element names. Alternation and concatenation (indicated with vertical bar (|) and comma (,) respectively) are the basic term constructors for composing complex terms from other terms. In addition the three Kleene operators ?, + and *, indicating optionality, one-or-more and zero-or-more respectively, can be added to any term.

XML Schema goes beyond this in three ways:

Wildcards
In addition to element names, XML Schema adds wildcards, which accept any element, possibly constrained with respect to namespace, as atomic terms;
Substitution groups
Substitution groups replace one common use of parameter entities; their impact on content models is that a single atomic element term may accept a number of alternative elements;
Numeric occurrence ranges
Replacing and extending the Kleene operators on terms, these allow any range of occurrence, with lower bound any non-negative integer, and upper bound any positive integer or infinity.

The impact of these three additions is that the schema-validation process cannot use existing approaches to regular-expression matching to check that a content model is satisfied. The first algorithm we present addresses this problem, by augmenting the normal algorithm for converting a regular expression to an FSA in ways which accommodate the three differences outlined above. The result can not only serve to check local schema-validity (that is, check that the sequence of element children of a parent in an instance is accepted by the parent's content model), but also serve as the basis for solving our other two content-model related problems.

The algorithm that follows is a modification of the one beginning on page 95 of [Aho & Ullman] .

Algorithm 1: Conversion to FSA

We need to explain the abstract data model which XML Schema uses for content models in a bit more detail before setting out the algorithm itself.

XML Schema abstract data model

The {content model} of a complex type definition is a single particle.

particle
A particle has three properties:
{min occurs}
A non-negative integer, the minimum number of occurrences of the {term} that will be accepted. If 0, the {term} is optional.
{max occurs}
A positive integer or ∞, the maximum number of occurrences of the {term} that will be accepted. If ∞, there is no maximum.
{term}
Either an element declaration, a wildcard, a sequence or a choice
Element declaration
An element declaration has three properties of relevance:
{local name}
A string, the local (unprefixed) name of the element being declared
{namespace name}
A string, the name of the namespace of the element being declared (or absent if the element does not belong to any namespace)
substitution group
A (possibly empty) set of all the element declarations which accept elements when this declaration is used. See http://www.w3.org/TR/xmlschema-1/#cos-equiv-class in [XML Schema] for a definition and discussion.
Wildcard
A wildcard has two properties:
{namespace constraint}
One of any; a pair of not and a namespace name or absent; or a set whose members are either namespace names or absent.
{process contents}
One of skip, lax or strict, to determine how much, if any, additional validation is required.
Sequence
A sequence has one property:
particles
A (possibly empty) sequence of particles.
Choice
A choice has one property:
particles
A set of particles.

XML Schema also defines an all term, but its use is heavily constrained -- in particular it cannot occur within other terms, but only as the {term} of the particle which is a content model. Given this, and the fact that it does not have any simple translation into an ordinary FSA, it is appropriate that it should be handled specially. Accordingly we will have nothing further to say about such particles in this paper.
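
To make the abstract data model concrete, here is a minimal sketch of it in Python, which is also the language used for the later sketches in this paper. The class and field names are ours, chosen to mirror the property names above; they are illustrative only, not the Recommendation's.

from dataclasses import dataclass, field
from typing import Optional, Union

UNBOUNDED = float('inf')    # stands in for the Recommendation's 'unbounded'

@dataclass
class ElementDecl:
    local_name: str                          # {local name}
    namespace_name: Optional[str] = None     # {namespace name}; None if absent
    # all declarations which accept elements where this one is used,
    # including this declaration itself unless it is abstract
    substitution_group: list = field(default_factory=list)

@dataclass
class Wildcard:
    # 'any', a ('not', namespace-or-None) pair, or a set of namespaces/None
    namespace_constraint: object
    process_contents: str = 'strict'         # 'skip', 'lax' or 'strict'

@dataclass
class Sequence:
    particles: list                          # ordered, possibly empty

@dataclass
class Choice:
    particles: list                          # treated as a set

Term = Union[ElementDecl, Wildcard, Sequence, Choice]

@dataclass
class Particle:
    min_occurs: int                          # {min occurs}
    max_occurs: object                       # {max occurs}: an int or UNBOUNDED
    term: Term                               # {term}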

Algorithm details

The algorithm for converting a content model to an FSA has two parts, one for particles and one for terms.

Algorithm Tp(S)

To translate a particle to an FSA ending at a state S

  1. Set n to S
  2. If the particle's {max occurs} is ∞
    1. Set t to a new state; Set b to the result of translating {term} to an FSA ending at t using Tt(t); Add lambda (also known as epsilon, or empty) edges from t to b and from b to n; Set n to b.
    2. This builds a fragment as follows:
        <------lambda------<
      /                    \
      n>--[term machine]-->t   S
      \                       /
       >---------lambda------> 
  3. otherwise ({max occurs} is numeric)
    1. Build a chain of {max occurs}-{min occurs} copies of the translation of {term} backwards from S, with lambda transitions from the beginning of each step to S, and set n to the beginning of the chain.
    2. This builds e.g. a fragment as follows, for min=2 max=4:
       n-->[term machine]-->x>--[term machine]-->S
       \                    \                  /
        \                    >-----lambda----->
         \                                    /
          >--------------lambda--------------> 
  4. Now build a chain of {min occurs} copies of the translation of {term} back from n, and return (the start state of) the resulting machine.

Algorithm Tt(S)

To translate a term to an FSA ending at a state S

  1. If the term is a wildcard, connect a new state b to S with an edge labelled with the term itself and return b;
  2. Otherwise if the term is an element declaration, create a new state b, then for each element declaration in its substitution group create an edge from b to S labelled with that element declaration, and return b
  3. Otherwise if the term is a choice, create a new state b, translate each particle in its particles into an FSA ending at S using Tp(S), connect b to the start state of all the results with a lambda edge and return b;
  4. Otherwise (the term is a sequence), chain translations of each particle in its particles, in reverse order, back from S and return the first state in the chain.
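
The following sketch renders Tp and Tt in Python, building on the dataclasses in the data-model sketch above. States are integers, and the automaton is a dictionary mapping each state to a list of (label, target) edges, a label of None standing for a lambda edge; this representation is ours, for illustration only.

from itertools import count

class Builder:
    """Builds one FSA; fsa maps each state (an int) to a list of
    (label, target) edges, where label is None for a lambda edge and
    otherwise an ElementDecl or Wildcard term."""

    def __init__(self):
        self.fsa = {}
        self._ids = count()

    def new_state(self):
        s = next(self._ids)
        self.fsa[s] = []
        return s

    def edge(self, frm, to, label=None):
        self.fsa[frm].append((label, to))

    def tp(self, particle, S):
        """Algorithm Tp: translate a particle to an FSA ending at state S,
        returning its start state."""
        n = S
        if particle.max_occurs == UNBOUNDED:
            t = self.new_state()
            b = self.tt(particle.term, t)
            self.edge(t, b)       # lambda: loop back for a further occurrence
            self.edge(b, n)       # lambda: skip the optional repetitions
            n = b
        else:
            # chain max-min optional copies back from S, each skippable
            for _ in range(particle.max_occurs - particle.min_occurs):
                b = self.tt(particle.term, n)
                self.edge(b, S)   # lambda: stop here
                n = b
        # chain min obligatory copies back from n
        for _ in range(particle.min_occurs):
            n = self.tt(particle.term, n)
        return n

    def tt(self, term, S):
        """Algorithm Tt: translate a term to an FSA ending at state S,
        returning its start state."""
        if isinstance(term, Wildcard):
            b = self.new_state()
            self.edge(b, S, term)
            return b
        if isinstance(term, ElementDecl):
            b = self.new_state()
            for decl in term.substitution_group:
                self.edge(b, S, decl)
            return b
        if isinstance(term, Choice):
            b = self.new_state()
            for p in term.particles:
                self.edge(b, self.tp(p, S))     # lambda to each alternative
            return b
        # Sequence: chain the particles, in reverse order, back from S
        for p in reversed(term.particles):
            S = self.tp(p, S)
        return S

# Usage: builder = Builder(); final = builder.new_state()
#        start = builder.tp(content_model, final)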

Comments on the algorithm

It should be noted that the non-lambda edges of the resulting automata are all labelled either with wildcard or element declaration terms.

There are two circumstances in which the algorithms above produce a state with no edges leaving it, which may (although it need not) render the whole FSA unsatisfiable:

  1. An empty choice, i.e. <xs:choice/>. This results from an empty particles property.
  2. An abstract element declaration which no other element declaration names as its substitutionGroup. This results from an empty substitution group at the component level.

It should be obvious that FSAs produced by the above algorithm can be used either as-is or after determinisation and/or minimisation to do local schema-validity assessment.
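
By way of illustration, here is a sketch of local schema-validity assessment over the as-is (non-deterministic) FSA, by straightforward simulation with lambda closures, using the edge representation and classes from the sketches above. The allows() helper, our shorthand for interpreting a wildcard's {namespace constraint}, reappears in later sketches.

def allows(wildcard, ns):
    """Does the wildcard's {namespace constraint} allow the namespace name ns
    (None standing for 'absent')?"""
    nc = wildcard.namespace_constraint
    if nc == 'any':
        return True
    if isinstance(nc, tuple):          # ('not', value)
        return ns != nc[1]
    return ns in nc                    # explicit set of namespace names

def lambda_closure(fsa, states):
    todo, seen = list(states), set(states)
    while todo:
        s = todo.pop()
        for label, target in fsa[s]:
            if label is None and target not in seen:
                seen.add(target)
                todo.append(target)
    return seen

def label_accepts(label, name, ns):
    if isinstance(label, ElementDecl):
        return label.local_name == name and label.namespace_name == ns
    return allows(label, ns)           # a wildcard edge

def accepts(fsa, start, final, children):
    """children is the sequence of (local name, namespace name) pairs of a
    parent's element children."""
    current = lambda_closure(fsa, {start})
    for name, ns in children:
        nxt = {target for s in current
                      for label, target in fsa[s]
                      if label is not None and label_accepts(label, name, ns)}
        current = lambda_closure(fsa, nxt)
        if not current:
            return False
    return final in current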

Algorithm 2: UPA

XML Schema's UPA (Unique Particle Attribution) constraint on content models (http://www.w3.org/TR/xmlschema-1/#cos-nonambig in [XML Schema] ) reconstructs a constraint inherited by XML DTDs from SGML. Given an FSA constructed from a content model using Algorithm 1: Conversion to FSA , it is easy to enforce this constraint by simple manipulations and checks.

Algorithm details

  1. Call M1 the translation of a content model into a (non-deterministic) FSA ending at a final state F using Algorithm 1: Conversion to FSA;
  2. As noted above, the non-lambda edges in M1 are labelled with simple terms, either wildcards or element declarations, not element names. Determinise M1 to yield M2, using the algorithm given on page 93 of [Aho & Ullman] , collapsing non-lambda edges only if they are labelled with the same simple term.
  3. M2 violates the UPA if it is non-deterministic ignoring term identity, that is, if there is any state in M2 which has two outgoing edges such that any of the following hold:
    1. Their labels are both element declarations with the same {local name} and {namespace name}.
    2. Their labels are both wildcards whose ranges overlap.
    3. Their labels are a wildcard and an element declaration and the {namespace name} of the element declaration is in the range of the wildcard.
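
A sketch of step 3, over the same edge representation as before, assuming its input is M2, i.e. already determinised with respect to terms; allows() is as in the previous sketch.

from itertools import combinations

def wildcards_overlap(w1, w2):
    """Do two wildcards' {namespace constraint}s allow some common namespace?"""
    n1, n2 = w1.namespace_constraint, w2.namespace_constraint
    if n1 == 'any' or n2 == 'any':
        return True
    if isinstance(n1, tuple) and isinstance(n2, tuple):
        return True                    # two 'not's always share some namespace
    if isinstance(n1, tuple):
        n1, n2 = n2, n1                # make n1 the explicit set
    if isinstance(n2, tuple):
        return any(ns != n2[1] for ns in n1)
    return bool(set(n1) & set(n2))

def labels_overlap(l1, l2):
    if isinstance(l1, ElementDecl) and isinstance(l2, ElementDecl):
        return (l1.local_name == l2.local_name and
                l1.namespace_name == l2.namespace_name)
    if isinstance(l1, Wildcard) and isinstance(l2, Wildcard):
        return wildcards_overlap(l1, l2)
    elt, wc = (l1, l2) if isinstance(l1, ElementDecl) else (l2, l1)
    return allows(wc, elt.namespace_name)

def violates_upa(m2):
    """m2 is the term-determinised FSA, as a state -> edge-list dictionary."""
    for edges in m2.values():
        labelled = [label for label, _ in edges if label is not None]
        if any(labels_overlap(a, b) for a, b in combinations(labelled, 2)):
            return True
    return False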

Comments on the algorithm

Since step 3.1 above is by construction operating on an FSA which has been determinised with respect to terms, any name collision is necessarily between two distinct terms from the original content model, thus violating UPA. Steps 3.2 and 3.3 are the obvious additions to handle wildcard-wildcard and wildcard-element declaration overlaps respectively.

Note there's nothing special in any of the above for numeric ranges. That is all handled by Algorithm 1: Conversion to FSA .

Algorithm 3: Subsumption

The XML Schema REC [XML Schema] defines two ways by which new complex type definitions can be derived from old, extension and restriction. Restriction, as its name suggests, is intended to define a new type, call it D for 'derived', which accepts a subset of what its base type definition (call this B) accepts. The REC as it stands does not, however, base the formal definition of restriction on an appeal to subsetting, but rather on a set of detailed constraints on the allowed relationship between the content models of B and D (see http://www.w3.org/TR/xmlschema-1/#cos-particle-restrict in [XML Schema]). Not only is the current approach complex and difficult to understand (and implement), but it is also inaccurate, in that it both allows some definitions as restrictions which accept more than their base, and bars some definitions which accept less. There is therefore strong interest in moving to a definition which correctly implements the subset story.

Two broad classes of approach to this can be imagined: One which deals explicitly with types, defined as (infinite) sets of infosets, and one which deals only with type definitions. We adopt the latter approach, as it fits neatly with the use of FSAs set out above, which we have already adopted in our XML Schema implementation [XSV] .

Since we are working in the domain of type definitions, it's appropriate to describe our goal as checking whether the content model of a type definition B subsumes that of a derived type definition D. Here's a detailed sketch of an algorithm which attempts to check subsumption of two FSAs B and D, constructed with Algorithm 1: Conversion to FSA and determinised and checked with Algorithm 2: UPA:

Algorithm details

  1. Initialise SS to be a set containing the unprocessed pair <B.initialState,D.initialState>
  2. Access such pairs with .base for the first member and .derived for the second
  3. Choose an unprocessed pair P from SS. For each edge DE leaving P.derived
    1. If there is an* edge BE leaving P.base which Matches DE, then add a pair <BE.end,DE.end> to SS (unless it's already there).
    2. Otherwise fail
  4. Mark P as processed, and repeat from step (3) until no unprocessed pairs are left in SS
  5. Check that for every end-state in D, there is a processed pair in SS which pairs it with an end-state of B, otherwise fail.

*There can be only one such edge because B satisfies Unique Particle Attribution .
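
Here is the same algorithm as a Python sketch. The FSA wrapper (edge table, initial state, set of final states) is our own packaging of the automata built earlier, and matches() is the relation defined in the next subsection.

from collections import namedtuple

# edges: state -> [(label, target)], as before; finals: a set of end-states
FSA = namedtuple('FSA', 'edges initial finals')

def subsumes(base, derived):
    """Does base's content model subsume derived's?  Both FSAs are assumed to
    have been determinised (so contain no lambda edges) and to satisfy UPA."""
    seen = {(base.initial, derived.initial)}
    todo = [(base.initial, derived.initial)]
    while todo:
        b_state, d_state = todo.pop()          # an unprocessed pair
        for d_label, d_target in derived.edges[d_state]:
            for b_label, b_target in base.edges[b_state]:
                if matches(b_label, d_label):
                    pair = (b_target, d_target)
                    if pair not in seen:
                        seen.add(pair)
                        todo.append(pair)
                    break                      # at most one match, by UPA
            else:
                return False                   # no base edge matches d_label
    # every end-state of D must be paired with an end-state of B
    return all(any((b, d) in seen for b in base.finals)
               for d in derived.finals)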

Matches

An edge BE matches another edge DE iff one of the following three cases holds:

  1. They are both labelled with element declarations with the same {local name}s and {namespace name}s;
  2. BE is labelled with a wildcard, DE is labelled with an element declaration and DE's {namespace name} is allowed by BE;
  3. BE and DE are both labelled with wildcards and DE's label is an intensional subset (defined in [XML Schema] ) of BE's.
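
A sketch of Matches over the edge labels used above. The intensional_subset() helper follows our reading of the Recommendation's Wildcard Subset constraint and is included only to make the sketch self-contained; allows() is as before.

def intensional_subset(sub, sup):
    """Is wildcard sub's {namespace constraint} an intensional subset of
    sup's (the Recommendation's Wildcard Subset constraint)?"""
    s, p = sub.namespace_constraint, sup.namespace_constraint
    if p == 'any':
        return True
    if s == 'any':
        return False
    if isinstance(s, tuple) and isinstance(p, tuple):
        return s[1] == p[1]            # both not(x), for the same x
    if not isinstance(s, tuple) and not isinstance(p, tuple):
        return set(s) <= set(p)        # both explicit sets
    return False

def matches(b_label, d_label):
    """The Matches relation between a base edge label and a derived one."""
    if isinstance(b_label, ElementDecl) and isinstance(d_label, ElementDecl):
        return (b_label.local_name == d_label.local_name and
                b_label.namespace_name == d_label.namespace_name)
    if isinstance(b_label, Wildcard) and isinstance(d_label, ElementDecl):
        return allows(b_label, d_label.namespace_name)
    if isinstance(b_label, Wildcard) and isinstance(d_label, Wildcard):
        return intensional_subset(d_label, b_label)
    return False       # an element declaration in B never matches a wildcard in D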

A problem and a solution

The above algorithm is too strict in one area, involving wildcards. A single wildcard in D may require two wildcards in B to cover it, as it were. Consider the following:

B: <xs:choice>
    <xs:any ns='a'/>
    <xs:any ns='b c d'/>
   </xs:choice>

vs.

D: <xs:choice>
    <xs:any ns='a b'/>
    <xs:any ns='c'/>
   </xs:choice>

The second clearly accepts a subset of what the first accepts, but Algorithm 3: Subsumption will reject it, because no single edge in B's choice subsumes the 'a b' edge in D's.

Here's the fix:

  1. This problem only arises in cases where there is more than one wild edge leaving a given state in B and at least one wild edge leaving its paired derived state which is either negated or has more than one allowed namespace;
  2. If the wild edges in B share a destination, then simply compute their union and replace them with it, problem solved;
  3. Otherwise, unpack (see below) the wild edges from the corresponding state in D and use them instead of the actual wild edges therefrom.

In either case, then proceed with Algorithm 3: Subsumption as given above.

  1. To unpack a set of wild edges with no negated wild edge, simply replace each edge therein which has more than one allowed namespace with a set of wild edges, one for each allowed namespace;
  2. To unpack a set of wild edges which includes one negated wild edge, replace it with a non-negated edge which allows any namespace mentioned in the corresponding base wildcards as well as a newly created namespace, and proceed as in (1) immediately above.

Note that to avoid violating the UPA there cannot be more than one negated wild edge leaving any state, and when there is one all the non-negated wild edges can only allow the namespace which the negated one excludes.

The simple example above is covered by case (2) in the fixup above. Here's an example that is covered by (3) and so requires unpacking:

B: <xs:choice>
    <xs:seq>
     <xs:any ns='a'/>
     <xs:elt ref='z'/>
    </xs:seq>
    <xs:seq>
     <xs:any ns='not(a)'/>
     <xs:elt ref='z'/>
    </xs:seq>
   </xs:choice>

versus

D: <xs:choice>
    <xs:seq>
     <xs:any ns='b'/>
     <xs:elt ref='z'/>
    </xs:seq>
    <xs:seq>
     <xs:any ns='not(b)'/>
     <xs:elt ref='z'/>
    </xs:seq>
   </xs:choice>

The rules for unpacking give us a transformation in two steps:

First D becomes

 <xs:choice>
  <xs:seq>
   <xs:any ns='b'/>
   <xs:elt ref='z'/>
  </xs:seq>
  <xs:seq>
   <xs:any ns='q a'/>
   <xs:elt ref='z'/>
  </xs:seq>
 </xs:choice>

then

 <xs:choice>
  <xs:seq>
   <xs:any ns='b'/>
   <xs:elt ref='z'/>
  </xs:seq>
  <xs:seq>
   <xs:any ns='q'/>
   <xs:elt ref='z'/>
  </xs:seq>
  <xs:seq>
   <xs:any ns='a'/>
   <xs:elt ref='z'/>
  </xs:seq>
 </xs:choice>

and it should be clear that the subsumption check will now succeed, as it should.
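
The fixup might be sketched as follows, operating on the (wildcard, target) edge pairs leaving a base state and its paired derived state. NEW_NS is a placeholder for the newly created namespace; note that we also exclude the negated namespace itself from the replacement set, which the prose above leaves implicit.

NEW_NS = 'urn:x-newly-created-namespace'     # placeholder, illustration only

def namespaces_mentioned(base_wild_edges):
    """Every namespace name mentioned by the wildcards leaving the base state."""
    names = set()
    for w, _ in base_wild_edges:
        nc = w.namespace_constraint
        if isinstance(nc, tuple):
            names.add(nc[1])
        elif nc != 'any':
            names.update(nc)
    return names

def unpack(derived_wild_edges, base_wild_edges):
    """Replace negated and multi-namespace wild edges leaving a derived state
    with one single-namespace wild edge per allowed namespace."""
    result = []
    for w, target in derived_wild_edges:
        nc = w.namespace_constraint
        if nc == 'any':
            result.append((w, target))       # not covered by the rules above
            continue
        if isinstance(nc, tuple):
            # rule (2): the one negated edge becomes an explicit set of every
            # namespace mentioned in the base wildcards, minus the negated
            # namespace itself, plus the newly created namespace
            names = (namespaces_mentioned(base_wild_edges) | {NEW_NS}) - {nc[1]}
        else:
            names = set(nc)
        for ns in names:                     # rule (1): one edge per namespace
            result.append((Wildcard(frozenset([ns]), w.process_contents), target))
    return result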

Additional issues

Further properties of element declarations

So far, we've been looking at very local properties of content models, considering the elements accepted only in terms of their names. To complete the picture, we need to check a few other properties of the element declarations involved, and place a recursive constraint on their type definitions. For this we need to look at some other properties of element declarations:

{type definition}
A simple or complex type definition, constraining the content and attributes of the element being declared.
{value constraint}
Optional provision for a default and/or fixed value.
{identity constraints}
A set of key, keyref and unique constraints (see http://www.w3.org/TR/xmlschema-1/#cIdentity-constraint_Definitions in [XML Schema] ).

Each of these properties needs to be checked to ensure valid restriction. The following should be considered as additions to clause (1) in the definition of Matches above:

  1. DE's {type definition} must be either the same as BE's {type definition}, or declared to be derived from it by (a chain of) restriction.
  2. If BE has a fixed {value constraint}, DE must also have a fixed {value constraint} with the same value.
  3. DE's {identity constraints} must be either the same as BE's {identity constraints} or a superset thereof.

Clause (1) above is the point at which the necessary recursion into the children of an element and their types is invoked: each step of the derivation chain will itself be checked by Algorithm 3: Subsumption.
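
These additional checks might be sketched as follows. The extra fields (type_definition, fixed_value, identity_constraints) and the restricts() helper are our own names, standing in for the corresponding component properties and for walking a type's {base type definition} chain; they are assumptions of this sketch, not part of the FSA machinery above.

def restricts(derived_type, base_type):
    """Is derived_type the same as base_type, or derived from it by a chain of
    restriction steps?  Assumes each type definition carries derivation_method
    and base_type_definition fields, with the chain ending in None."""
    t = derived_type
    while t is not None:
        if t is base_type:
            return True
        if getattr(t, 'derivation_method', None) != 'restriction':
            return False
        t = getattr(t, 'base_type_definition', None)
    return False

def element_matches(b_decl, d_decl):
    """Clause (1) of Matches, extended with the three additional checks."""
    if (b_decl.local_name, b_decl.namespace_name) != \
            (d_decl.local_name, d_decl.namespace_name):
        return False
    # (1) {type definition}: same, or derived by (a chain of) restriction
    if not restricts(d_decl.type_definition, b_decl.type_definition):
        return False
    # (2) a fixed {value constraint} in B must be fixed to the same value in D
    if b_decl.fixed_value is not None and d_decl.fixed_value != b_decl.fixed_value:
        return False
    # (3) {identity constraints}: D's must include all of B's
    return b_decl.identity_constraints <= d_decl.identity_constraints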

Another problem with wildcards

There is one place where the move beyond local issues discussed immediately above interacts with our earlier treatment of wildcards, and the fact that that treatment ignores the {process contents} property. This leads to problems in certain corner cases which arise when B has a wildcard and D has one or more explicit element declarations:

B: <any ns='xyzzy' processContents='strict'/>
D: <elt name='{xyzzy}foo' type='xs:decimal'/>

when there is a global

<elt name='{xyzzy}foo' type='xs:integer'/>

element declaration available.

As defined above, Algorithm 3: Subsumption allows this, via clause (2) of the definition of Matches . But this is wrong, since D accepts things not accepted by B, for example <foo>3.1415</foo>. To fix this we need to both take account of {process contents}, and extend the consideration of type definitions from clause (1) to clause (2) of Matches .

In clause (2) of Matches , we have a wildcard in B which corresponds to an element declaration in D. Let EEN be the expanded name (that is, the {local name} and {namespace name}) of the explicit element declaration, and ET be its {type definition}.

If the wildcard doesn't allow EEN's namespace name, then clause (2) fails right away.

If it does allow it, then we need to consider sub-cases based on the wildcard's {process contents}:

  1. skip: OK, since this means anything is allowed by the wildcard in B;
  2. strict: if there exists a top-level element declaration whose expanded name is EEN and from whose {type definition} ET is derived by (a chain of) restriction, then OK, otherwise fail.
  3. lax: if there exists a top-level element declaration whose expanded name is EEN but ET is not derived from its {type definition} by (a chain of) restriction, then fail; otherwise (there is no top-level element declaration with the right name, or there is and its {type definition} is OK) OK.

Basically this addition to (2) takes account of the fact that strict and lax wildcards may function as if they were element declarations.
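
The sub-cases might be sketched as follows, where global_decls is assumed to map expanded names to the schema's top-level element declarations, and allows() and restricts() are as in the earlier sketches.

def wildcard_matches_element(b_wildcard, d_decl, global_decls):
    """Clause (2) of Matches, refined by the wildcard's {process contents}."""
    if not allows(b_wildcard, d_decl.namespace_name):
        return False                              # clause (2) fails right away
    pc = b_wildcard.process_contents
    if pc == 'skip':
        return True                               # anything is allowed by B
    een = (d_decl.namespace_name, d_decl.local_name)
    top = global_decls.get(een)                   # top-level declaration, if any
    if pc == 'strict':
        return top is not None and restricts(d_decl.type_definition,
                                             top.type_definition)
    # lax: fail only if a top-level declaration exists and ET is not derived
    # from its {type definition} by (a chain of) restriction
    return top is None or restricts(d_decl.type_definition, top.type_definition)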

Notes on complexity

Algorithm 1: Conversion to FSA is polynomial in time and exponential in space with respect to the size of the original content model considered as a regular expression.

The time complexity of determinisation is quadratic in the number of states. The time complexity of the UPA checking part of Algorithm 2: has a linear term with respect to the size of the input FSA and an exponential term with respect to the maximum fanout of the input FSA.

The time complexity of Algorithm 3: Subsumption is linear with respect to the size of the FSA for D and the product of the maximum fanouts of the two FSAs. It's not exponential because both FSAs are known to respect the Unique Particle Attribution constraint, which is sufficient to render the pseudo-parsing deterministic in all cases. The exponential space requirement arises because of the unfolding approach to numeric exponents of Algorithm 1: Conversion to FSA .

Acknowledgements

The work reported here was supported in part by a W3C Fellowship for the first author and a research contract from Microsoft.

The algorithms presented here have been discussed by the W3C XML Schema Interest Group [XML Schema IG] , and a number of improvements derive from comments made there. Particular thanks to Sandy Gao of IBM for identifying several flaws in an early version.

Bibliography

[Aho & Ullman]
Aho, A. and J. Ullman, Principles of Compiler Design, Addison-Wesley, Reading, MA, 1977.
[XML]
Bray, T., Paoli, J., Sperberg-McQueen, C. M. and E. Maler, eds. Extensible Markup Language (XML) 1.0 (Second Edition), W3C, Boston, 2000. Available online as http://www.w3.org/TR/REC-xml.
[XML Schema]
Thompson, Henry S., Mendelsohn, N., Maloney, M. and D. Beech, eds. XML Schema Part 1: Structures, W3C, Boston, 2001. Available online as http://www.w3.org/TR/xmlschema-1/.
[XML Schema IG]
The W3C XML Schema Interest Group: mailing list archives at http://lists.w3.org/Archives/Member/w3c-xml-schema-ig, W3C members only.
[XSV]
Thompson, Henry S. and R. Tobin, XML Schema Validator, W3C and University of Edinburgh, Boston and Edinburgh, 2003. See http://www.ltg.ed.ac.uk/~ht/xsv-status.html for description and download information.