Element Type Hierarchies for Transparent Document Structure Definition

Author: Henry S. Thompson

University of Edinburgh, HCRC Language Technology Group, 2 Buccleuch Place
Edinburgh, EH8 9LW, Scotland
ht@cogsci.ed.ac.uk
http://www.ltg.ed.ac.uk/~ht/

Bio:

Henry S. Thompson is a Reader in the Department of Artificial Intelligence and the Centre for Cognitive Science at the University of Edinburgh, where he is also a member of the Human Communication Research Centre. Since coming to Edinburgh in 1980, he has become a leading member of the British and European speech and language processing research community. His research interests are in the area of Natural Language and Speech processing, from both the applications and Cognitive Science perspectives, with a particular focus on tools and architectures for representing textual and transcribed spoken data. He was a charter member of the XML working group as was, and has been involved in SGML, XML and DSSSL implementation work. He is co-author of both the XML-Data and the XSL proposals.

Abstract:

Two recent proposals for meta-applications of XML (XML-Data and MCF) have included DTD fragments for describing document structure, sometimes called 'schemata'. In this paper I describe the XML-Data schemata proposal, concentrating on the motivation for and nature of the provision of an element-type hierarchy, in which element types can inherit attribute declarations and positions in content models from ancestors in the hierarchy. I argue that this represents a major improvement over the use of parameter entities to structure and maintain DTDs.

Introduction

Complex document types require rich and complex structural markup. SGML provides powerful mechanisms for defining the grammar of such markup, with element type and attribute declarations in the document type definition (DTD). The structure of the DTD itself, however, finds no explicit expression in SGML. The fact that element types are related in a structured fashion can only be represented implicitly, e.g. through the use of parameter entities. There is a real need, for ease of understanding and ease of maintenance, to address this issue.

There is an obvious solution, prefigured by the following, which appeared recently in a public XML-related newsgroup:

"We really need to build an object-oriented hierarchy, with classes that are extended by subclasses and so on...For example, a <restaurant> is a subclass of <location> and inherits the properties of <location> such as <address> and <street number>, but adds other properties, such as <menu>."

In this paper I outline a proposed XML application which provides exactly this facility.

Taking control of the D. S. D.

The watchword of SGML used to be "Taking control of your data". SGML gave you the means to express the grammar of your markup yourself, rather than being bound by whatever word processor and document compiler manufacturers provided.

A side-effect of the XML initiative has been to open up the possibility of a similar move one level up, as it were. Just as SGML allowed us to experiment with different markup for document instances, so I think XML invites us to experiment with different markup for document structure definitions.

In a recent proposal ^{[XML-Data 97]} my co-authors and I used the word 'schema' (plural 'schemata') for an XML document instance which itself describes the structure of a document type.

In our approach, we envisage

a) the schema DTD, a definition of an XML representation of document structure, that is, an old-style DTD for schemata;

b) a master XML application, the equivalent of the XML parser, which is capable of processing pairs of XML documents. The first of the pair, a schema, is valid in terms of the schema DTD; the second, an instance, has no old-style DTD, but is both well-formed in the XML sense and meta-valid in terms of the schema expressed by the first.

Meta-validity is, of course, validity with respect to the document structure constraints contained in the associated schema, which conforms to the schema DTD.

This "takes control of the D.S.D." in that experimenting with the grammar of schemata now involves changing the schema DTD (and the master application), not changing XML itself.

Document type hierarchies

The first move we make after introducing schemata which reproduce the expressive capabilities of existing XML DTDs is to add an explicit element type hierarchy.

Consider the following example, taken from the XML-Data proposal:

  <schema>
    <elementType id="animalFriends">
      <elt href="#pet" occurs="PLUS"/>
    </elementType>

    <elementType id="pet">
      <any/>
      <attribute id='name'/>
      <attribute id='owner'/>
    </elementType>

    <elementType id="cat" extends="#pet">
      <elt href='#kittens'/>
      <attribute id='lives' type='NMTOKEN'/>
    </elementType>

    <elementType id="dog" extends="#pet">
      <elt href='#puppies'/>
      <attribute id='breed'/>
    </elementType>
  </schema>

This schema says that the animalFriends element type can contain one or more pet elements. Because cat and dog are subtypes of pet (declared by the extends="#pet"), they can occur as well. So the following instance fragment is now meta-valid under this schema:

  <animalFriends>
    <cat name="Fluffy" lives='9'/>
    <pet name="Diego"/>
    <dog name="Gromit" owner='Wallace' breed='mutt'/>
  </animalFriends>

Not only can dog elements occur within animalFriends, but also the name and owner attributes are valid, being inherited from pet.
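The two checks just described can be sketched in a few lines of Python. This is only an illustration of the idea, not the master application of the proposal: the function names are hypothetical, the schema is restated inside the block in well-formed form, and the sketch looks only at subtype substitution and attribute inheritance. It does not check occurrence counts or the content of the child elements themselves.

```python
import xml.etree.ElementTree as ET

SCHEMA = """
<schema>
  <elementType id="animalFriends">
    <elt href="#pet" occurs="PLUS"/>
  </elementType>
  <elementType id="pet">
    <any/>
    <attribute id="name"/>
    <attribute id="owner"/>
  </elementType>
  <elementType id="cat" extends="#pet">
    <elt href="#kittens"/>
    <attribute id="lives" type="NMTOKEN"/>
  </elementType>
  <elementType id="dog" extends="#pet">
    <elt href="#puppies"/>
    <attribute id="breed"/>
  </elementType>
</schema>
"""

INSTANCE = """
<animalFriends>
  <cat name="Fluffy" lives="9"/>
  <pet name="Diego"/>
  <dog name="Gromit" owner="Wallace" breed="mutt"/>
</animalFriends>
"""


def load_types(schema_text):
    """Map each element type id to its parent, own attributes and elt references."""
    types = {}
    for et in ET.fromstring(schema_text).iter("elementType"):
        parent = et.get("extends")
        types[et.get("id")] = {
            "parent": parent.lstrip("#") if parent else None,
            "attrs": {a.get("id") for a in et.findall("attribute")},
            "refs": [e.get("href").lstrip("#") for e in et.findall("elt")],
        }
    return types


def ancestors(types, name):
    """Yield the type itself, then each type it extends, walking upwards."""
    while name is not None:
        yield name
        name = types[name]["parent"]


def allowed_attrs(types, name):
    """Own attributes plus those inherited from every ancestor."""
    result = set()
    for t in ancestors(types, name):
        result |= types[t]["attrs"]
    return result


def meta_validity_errors(schema_text, instance_text):
    """Check the root's children for subtype substitution and inherited attributes."""
    types = load_types(schema_text)
    root = ET.fromstring(instance_text)
    if root.tag not in types:
        return [f"undeclared element type: {root.tag}"]
    refs = set(types[root.tag]["refs"])
    errors = []
    for child in root:
        if child.tag not in types:
            errors.append(f"undeclared element type: {child.tag}")
            continue
        # A child is admitted if it, or one of its ancestors, is referenced.
        if refs.isdisjoint(ancestors(types, child.tag)):
            errors.append(f"{child.tag} not allowed in {root.tag}")
        for attr in child.attrib:
            if attr not in allowed_attrs(types, child.tag):
                errors.append(f"{child.tag} has undeclared attribute {attr}")
    return errors


print(meta_validity_errors(SCHEMA, INSTANCE))  # → []
```

Under these assumptions the instance above produces no errors, whereas a dog carrying a lives attribute would be reported, since lives is declared on cat and not on pet.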

A more realistic example comes from the TEI^{[TEI P3]}. Consider

  <!ENTITY % paraContent '(#PCDATA | %m.phrase | %m.inter)*'>
  <!ENTITY % m.phrase    '%x.phrase %m.data; . . .'>
  <!ENTITY % a.global    'id ID #IMPLIED
                          . . .'>
  <!ELEMENT p - O (%paraContent;)>
  <!ATTLIST p %a.global;
              TEIform CDATA 'p'>

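There are two hierarchies implied here: one of content-model classes, expressed through parameter entities such as %paraContent; and %m.phrase;, and one of attribute sets, expressed through %a.global;.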

Now compare the XML-Data version:

  <elementType id='p' extends='#global'>
    <mixed>
     <elt href='#phrase'/>
     <elt href='#inter'/>
    </mixed>
    <attribute id='TEIform' presence='fixed' default='p'/>
  </elementType>

  <elementType id='phrase'>
   . . .
  </elementType>

  <elementType id='global'>
   <attribute name='id' type='id'/>
   . . .
  </elementType>

All that is required to plug in to the paraContent content model is to use extends='#phrase' or extends='#data' on the relevant element type declarations. Note that this is true not only for the DTD designer but also for the user, who can add their own elements into the hierarchy in the same non-intrusive way. This does away with the need for parameter entities such as %x.phrase in the example above, which are provided solely in order to allow such post-hoc augmentation.
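The effect of such post-hoc augmentation can be illustrated with a small Python sketch. The hierarchy is represented as a plain mapping from each type to the type it extends, which is an assumption made for brevity. The members shown are an illustrative slice, not the real TEI class system, and chemFormula is a hypothetical user-defined type.

```python
def subtypes(extends, base):
    """Return base plus every type that extends it, directly or indirectly."""
    found = {base}
    changed = True
    while changed:
        changed = False
        for child, parent in extends.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def admitted(extends, content):
    """All element types allowed where the content model names a class."""
    allowed = set()
    for ref in content:
        allowed |= subtypes(extends, ref)
    return allowed


# Illustrative slice of a hierarchy: child type -> type it extends.
base_schema = {"emph": "phrase", "foreign": "phrase", "note": "inter"}

# The mixed content of p refers only to the two class names.
p_content = ["phrase", "inter"]

# A user adds a (hypothetical) element type without editing p itself.
user_schema = dict(base_schema, chemFormula="phrase")

print("chemFormula" in admitted(base_schema, p_content))  # → False
print("chemFormula" in admitted(user_schema, p_content))  # → True
```

The declaration of p is identical in both cases. Only the new element type's own declaration mentions the hierarchy, which is the sense in which the extension is non-intrusive.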

Lexical Types

The XML-Data document contained a proposal regarding lexical typing, that is, the expression of constraints on the values of attributes and the contents of PCDATA element types. Without going into this in detail, I note that the incorporation in the XSL proposal ^{[XSL 97]} of a version of ECMAScript opens up the possibility of providing an operational definition of type-checking, e.g.

 . . .
 <attribute name="birthday" lextype="DATE.ISO8601" validate="#dateScript"/>
 . . .
 <define-script id="dateScript">
  . . .
  return true;
 </define-script>

where define-script is borrowed from XSL.
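To make the idea of an operational definition concrete, here is a minimal sketch of what such a validation script might do. It is written in Python purely as a stand-in for the ECMAScript envisaged above. The registry, the function names and the policy of accepting values whose lexical type has no registered validator are all assumptions of the sketch, not part of either proposal.

```python
from datetime import date


def is_iso_date(text):
    """Return True if text parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


# Hypothetical registry: lexical type name -> validation function.
VALIDATORS = {"DATE.ISO8601": is_iso_date}


def check_attribute(lextype, value):
    """Apply the validator registered for lextype; unknown types pass unchecked."""
    validator = VALIDATORS.get(lextype)
    return True if validator is None else validator(value)


print(check_attribute("DATE.ISO8601", "1997-12-25"))  # → True
print(check_attribute("DATE.ISO8601", "1997-13-25"))  # → False
```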

Maintaining Order

The only coherent development policy, in my view, is to introduce into the schema DTD only those things which we know how to translate into vanilla XML. Not only does this guarantee inter-operability in the limit, but the translation serves to define the semantics of each part of the schema DTD in a concrete and unequivocal way.
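As an illustration of what such a translation might look like for the element type hierarchy, the following Python sketch flattens inheritance into ordinary declarations. It is one possible translation under stated assumptions, not the definition given in the proposal: the schema is assumed to be already parsed into a dictionary (a hypothetical representation), untyped attributes are assumed to be CDATA, every attribute is assumed to be #IMPLIED, and only the occurs value PLUS is handled.

```python
# Hypothetical parsed form of the pet schema: type -> parent, attributes, content.
TYPES = {
    "animalFriends": {"extends": None, "attrs": {}, "content": [("pet", "PLUS")]},
    "pet": {"extends": None, "attrs": {"name": "CDATA", "owner": "CDATA"},
            "content": "ANY"},
    "cat": {"extends": "pet", "attrs": {"lives": "NMTOKEN"},
            "content": [("kittens", None)]},
    "dog": {"extends": "pet", "attrs": {"breed": "CDATA"},
            "content": [("puppies", None)]},
}

SUFFIX = {None: "", "PLUS": "+"}  # other occurs values are not covered here


def descends_from(types, name, base):
    """True if name is base or extends it, directly or indirectly."""
    while name is not None:
        if name == base:
            return True
        name = types.get(name, {}).get("extends")
    return False


def family(types, base):
    """base followed by every declared type that extends it, in declaration order."""
    return [base] + [n for n in types if n != base and descends_from(types, n, base)]


def all_attrs(types, name):
    """Attributes of name with inherited ones first; a subtype's own declaration wins."""
    chain = []
    while name is not None:
        chain.append(name)
        name = types[name]["extends"]
    attrs = {}
    for t in reversed(chain):
        attrs.update(types[t]["attrs"])
    return attrs


def content_model(types, content):
    """Expand each reference to a type into a choice over that type's family."""
    if content == "ANY":
        return "ANY"
    if not content:
        return "EMPTY"
    particles = []
    for ref, occurs in content:
        members = family(types, ref)
        group = members[0] if len(members) == 1 else "(" + " | ".join(members) + ")"
        particles.append(group + SUFFIX[occurs])
    return "(" + ", ".join(particles) + ")"


def to_dtd(types):
    """Emit one ELEMENT declaration per type, plus an ATTLIST where needed."""
    lines = []
    for name, decl in types.items():
        lines.append(f"<!ELEMENT {name} {content_model(types, decl['content'])}>")
        attrs = all_attrs(types, name)
        if attrs:
            body = " ".join(f"{a} {t} #IMPLIED" for a, t in attrs.items())
            lines.append(f"<!ATTLIST {name} {body}>")
    return "\n".join(lines)


print(to_dtd(TYPES))
```

In this sketch the reference to pet inside animalFriends becomes the choice group (pet | cat | dog)+, and the ATTLIST for dog repeats name and owner alongside breed. The flattened output is exactly the kind of redundancy that the explicit hierarchy lets the schema author avoid writing by hand.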

Acknowledgements

This work was carried out at the Human Communication Research Centre, whose baseline funding comes from the UK Economic and Social Research Council, with additional support from Microsoft.

Bibliography

^{[XML-Data 97]}: "Specification for XML-Data", Andrew Layman, Jean Paoli, Steve De Rose, Henry S. Thompson, http://www.microsoft.com/standards/xml/xmldata.htm, 1997
^{[TEI P3]}: "Guidelines for Electronic Text Encoding and Interchange", C. M. Sperberg-McQueen and Lou Burnard, eds., Text Encoding Initiative, 1994
^{[XSL 97]}: "A Proposal for XSL", Sharon Adler, Anders Berglund, James Clark, Istvan Cseri, Paul Grosso, Jonathan Marsh, Gavin Nicol, Jean Paoli, David Schach, Henry S. Thompson, Chris Wilson, http://www.w3.org/TR/NOTE-XSL.html, 1997