Computational Linguistics and XML

Henry S. Thompson

  HCRC Language Technology Group
  University of Edinburgh

  Markup Technology Ltd

  World Wide Web Consortium

Copyright © 2003 Henry S. Thompson

Introduction and Motivation

The Web is ours, not theirs

What is XML, anyway?

XML micro-syntax (i.e., morpho-grammar)

<book>The Mill on the Floss</book>
<book author="Eliot">. . .</book>

XML macro-syntax (i.e. grammar)

XML applications (i.e. semantics)

XML and Formal Language Theory

XML DTDs are equivalent to infinite CF-PSGs

XML DTD validation is like the tree well-formedness interpretation of CF-PSGs

XML documents are more than trees

Is the language specified by a DTD ever empty?

What is XML Schema?

XML Schema and FSAs

Converting Content Models to FSAs

Extending the published (Aho&Ullman) algorithm

Converting particles

Converting particles, cont'd

Term translation

Two states joined by a transition labelled with the wildcard
Two states joined by one transition for each element in the substitution group of the element, labelled with that element
A chain of conversions of each particle in the sequence, in order
Two states joined by the conversions of each of the particles in the choice

Unique Particle Attribution

UPA cont'd: The FSA-based algorithm

  1. Convert it to a (non-deterministic) FSA as above. Note that each transition in the result has one of three types of labels:
    1. An element declaration component
    2. A wildcard component
    3. A lambda
  2. Determinise it, using component identity for equality
  3. Re-label all transitions which have element declaration labels with the expanded name of that declaration
  4. If the FSA is still deterministic, using name matching as for equality, it satisfies UPA, otherwise it violates it.

UPA cont'd: Name matching

Gentzen calculi and XML

First example

We write d in g if forest d matches group g.


in e


d1  in g1     d2  in g2

d1 , d2  in g1 , g2

Choice 1:

in g1

in g1 | g2

Choice 2:

in g2

in g1 | g2


d |?inter d'1;d"2    d1 in g1    d2 in g2

d in g1 & g2

Second example

Putting the above notations together, here is an example of an inference rule that occurs later in this document:

statEnv |- Expr1 : Type1      statEnv |- Expr2 : Type2

statEnv |- Expr1 , Expr2 : Type1, Type2

This rule is read as follows: if two expressions Expr1 and Expr2 are known to have the static types types Type1 and Type2 (the two premises above the line), then it is the case that the expression below the line "Expr1 , Expr2" must have the static type "Type1, Type2", which is the sequence of types Type1 and Type2.

Model theory and XML Schema

  • Two distinct attribute declarations in the {attribute uses} must not have identical {name}s and {target namespace}s.
  • ∀ a1, a2 ∈ C∙attribute uses ,
      (a1∙name = a2∙name ∧ a1∙target namespace = a2∙target namespace) ⇒ a1 = a2
    Only one attribute declaration per qualified name.

Logic and the Semantic Web