Normal Form Conventions for XML Representations of Structured Data

Henry S. Thompson

October 1 2001

As submitted to and accepted by XML 2001: For unclear reasons not published in their proceedings.

1. Introduction

Starting with the work of Andrew Layman [Lay98], the issue of establishing conventions for the XML representation of structured data has generated considerable interest. In this paper we present an explicit definition not only of what might be called Layman normal form, but also of three other normal forms.

2. Motivation

The initial motivation for the work presented here is to serve as the starting point for a declarative approach to XML data binding, i.e. to the use of XML as a transfer medium for structured data (see [ThomSwick99] for an introduction to this viewpoint). Such an approach must accommodate both marshalling application data into XML document, and unmarshalling XML documents into application data.

However with hindsight it has become apparent that it is also a useful point of departure for consideration of the general question of the semantics of markup vocabularies in the wild, so to speak. That is, when we look at the DTDs and other schemas which have been written for markup applications, what generalisations can we make about how markup is used to convey or record domain properties? It is my contention that in practice existing DTDs often can be understood as employing one or another of the normal forms set out below as their encoding strategies. Understanding these normal forms can thus contribute to the analysis of DTD patterns of markup use.

3. Terminology

The kind of data we will explore encodings for will not be specified in detail. For the time being, we'll assume a very simple ontology of individuals, atoms, unary types and binary relations. Individuals have identifiers and atoms have notations. Both types and relations have names. Identity is an issue for individuals, but not for atoms. That is, it may make sense to assert that two distinct identifiers identify the same individual, but atoms are assumed to be uniquely determined by their type and notation. A type is a named set of individuals or atoms. Relations are named pairs of individuals and (individuals or atoms). Relations are not restricted to being functional, that is, they may be one-to-many.

This ontology is intended to be a reasonable intermediary between XML and any number of different actual modelling languages or actual application data models: Entity-Relation, UML, RDF, C or Pascal structures, OO class instances and variables, first-order logic.

4. Example dataset

All the following sections will use the following dataset to illustrate their approach.

Types

PurchaseOrder: P1
Address: A1, A2
Item: I1, I2
string, integer, decimal, date: The types of most atoms are shown explicitly inline below. The one exception is string, where we use italics instead of the explicit typename. For example, we use "Alice Smith" instead of string("Alice Smith").

Relations

orderDate: [P1,date("1999-10-20")]
shipTo: [P1,A1]
billTo: [P1,A2]
comment: [P1,"Hurry, my lawn is going wild!"], [I1,"Confirm this is electric"]
item: [P1,I1], [P1,I2]
country: [A1,"US"], [A2,"US"]
name: [A1,"Alice Smith"], [A2,"Robert Smith"]
street: [A1,"123 Maple Street"], [A2,"8 Oak Avenue"]
city: [A1,"Mill Valley"], [A2,"Old Town"]
state: [A1,"CA"], [A2,"PA"]
zip: [A1,integer("90952")], [A2,integer("19144")]
partNum: [I1,"872-AA"], [I2,"926-AA"]
productName: [I1,"Lawnmower"], [I2,"Baby monitor"]
quantity: [I1,integer("1")], [I2,integer("1")]
USPrice: [I1,decimal("148.95")], [I2,decimal("39.98")]
shipDate: [I2,date("1999-05-21")]

5. What is a 'Normal Form'?

I use the phrase 'normal form' in this paper in two related ways, one concrete and one abstract. By the concrete use, when talking about a normal form for a particular dataset, that is a particular collection of individuals, atoms, types and relations, I mean a representation or encoding in XML of that dataset. By the general use, e.g. when giving the definition below of Alternating Normal Form, I mean a set of principles for constructing and/or interpreting concrete normal forms. The first use will be notated in lower case (normal form), the second capitalised (Normal Form)

In other words, once you know the principles for a Normal Form, you know how, given a particular dataset, to construct a normal form for it, that is an XML document which represents it in that Normal Form, and given an XML document known to be a normal form, that is constructed according to the principles of a Normal Form, you know how to (re)construct the dataset it represents or encodes.

It follows that in defining a Normal Form, both the mapping from datasets to XML and the mapping from XML to datasets must be specified. In practice specifying this from the perspective of the XML, in a way which is manifestly invertable, seems to work well, and it is this approach which is followed below.

The following four sections set out the principles of four distinct Normal Forms, together with the corresponding normal form of the example dataset.

6. Alternating Normal Form

Attributes are not used.
The document element encodes a typed individual.
The children of the document element encode relations.
Their children encode atoms or typed individuals.
And so on -- that is, the children of any element encoding an individual encode relations, the children of any element encoding a relation encode either atoms or typed individuals.
The types of atoms are not encoded in the document -- they must come from elsewhere, e.g. from an XML Schema.

The alternating normal form for our example is as follows:

          <PurchaseOrder>
      <orderDate>1999-10-20</orderDate>
      <shipTo>
       <Address>
        <country>US</country>
        <name>Alice Smith</name>
        <street>123 Maple Street</street>
        <city>Mill Valley</city>
        <state>CA</state>
        <zip>90952</zip>
       </Address>
      </shipTo>
      <billTo>
       <Address>
        <country>US</country>
        <name>Robert Smith</name>
        <street>8 Oak Avenue</street>
        <city>Old Town</city>
        <state>PA</state>
        <zip>19144</zip>
       </Address>
      </billTo>
      <comment>Hurry, my lawn is going wild!</comment>
      <item>
       <Item>
        <partNum>872-AA</partNum>
        <productName>Lawnmower</productName>
        <quantity>1</quantity>
        <USPrice>148.95</USPrice>
        <comment>Confirm this is electric</comment>
       </Item>
      </item>
      <item>
       <Item>
        <partNum>926-AA</partNum>
        <productName>Baby Monitor</productName>
        <quantity>1</quantity>
        <USPrice>39.98</USPrice>
        <shipDate>1999-05-21</shipDate>
       </Item>
      </item>
     </PurchaseOrder>

This Normal Form is the simplest to specify, but leads to very verbose documents. It is the basis for the XML reflection of Infosets and PSVIs proposed by [ThomTobin01]. Of the four Normal Forms presented here it requires the least additional information to complete the XML-to-dataset mapping.

One could also have a fully explicit Alternating Normal Form, in which there was one more level, so that the types of atoms were encoded by elements as well, and then no out-of-band information would be required, e.g.

     . . .
     <Item>
      <partNum><string>926-AA</string></partNum>
      <productName><string>Baby Monitor</string></productName>
      <quantity><integer>1</integer></quantity>
      <USPrice><decimal>39.98</decimal></USPrice>
      <shipDate><date>1999-05-21</date></shipDate>
     </Item>

Note that here as elsewhere the approach to atomic types is short on detail. Namespace-qualified type atomic type names would almost certainly be needed in practice.

7. Relation Normal Form

The basic idea of this Normal Form was first proposed to me by Istvan Cseri [Cseri98].

Elements encode relations (except for the document element, which is just a placeholder).
Attributes (optionally) encode relations to atoms.
(Schema) types encode typed individuals or the types of atoms.

A Relation normal form for our example is as follows:

     <PurchaseOrder orderDate="1999-10-20">
      <shipTo>
       <country>US</country>
       <name>Alice Smith</name>
       <street>123 Maple Street</street>
       <city>Mill Valley</city>
       <state>CA</state>
       <zip>90952</zip>       
      </shipTo>
      <billTo>       
       <country>US</country>
       <name>Robert Smith</name>
       <street>8 Oak Avenue</street>
       <city>Old Town</city>
       <state>PA</state>
       <zip>19144</zip>
      </billTo>
      <comment>Hurry, my lawn is going wild!</comment>
      <item>
       <partNum>872-AA</partNum>
       <productName>Lawnmower</productName>
       <quantity>1</quantity>
       <USPrice>148.95</USPrice>
       <comment>Confirm this is electric</comment>
      </item>
      <item><partNum>926-AA</partNum>
       <productName>Baby Monitor</productName>
       <quantity>1</quantity>
       <USPrice>39.98</USPrice>
       <shipDate>1999-05-21</shipDate>
      </item>
     </PurchaseOrder>

It is noticeable that this form is very close to the example as originally encoded, in the XML Schema Primer [Fallside01]. The only substantive difference is the lack above of an <items> wrapper for the separate <item> elements.

This Normal Form requires out-of-band information, e.g. from an XML Schema, to supply the types of individuals and atoms. When unmarshalling from XML to data, schema-validation with an appropriate schema will supply the necessary type information in the post schema-validation Infoset (PSVI).

8. Layman Normal Form

Elements encode typed individuals.
Attributes encode relations.
Attributes of type IDREF or IDREFS encode relations to individuals, namely those whose ID(s) is/are their value.
Attributes of any other type encode relations to typed atoms, encoded by their value.
(Schema) types encode the types of atoms.

Out-of-band information is required to identify the attributes of type ID, IDREF and IDREFS. This could be provided by a minimal set of XML 1.0 attribute declarations in the internal subset, or by reference to an external DTD, or by an XML Schema.

A Layman normal form for our example is as follows:

          <PurchaseOrder orderDate="1999-10-20" shipTo="A1" billTo="A2"
                    comment="Hurry, my lawn is going wild!" item="I1 I2">
       <Address id="A1" country="US" name="Alice Smith" street="123 Maple Street"
                city="Mill Valley" state="CA" zip="90952"/>
       <Address id="A2" country="US" name="Robert Smith" street="8 Oak Avenue"
                city="Old Town" state="PA" zip="19144"/>
       <Item id="I1" partNum="872-AA" productName="Lawnmower" quantity="1"
             USPrice="148.95" comment="Confirm this is electric"/>
       <Item id="I2" partNum="926-AA" productName="Baby Monitor" quantity="1"
             USPrice="39.98" shipDate="1999-05-21"/>
</PurchaseOrder>

This is the most removed from any kind of `traditional' approach of the Normal Forms presented here. It does not require the use of nesting to convey any information at all, although it does not preclude it: Although in this example the Address and Item elements are only nested within the PurchaseOrder to ensure the whole is well-formed, in examples with richer structure nesting indicative of dependency could perfectly well be used.

9. Individual Normal Form

The basic idea of this Normal Form was first proposed to me by Adam Bosworth [Bosworth98].

Attributes are not used.
Elements encode typed individuals or typed atoms.
Relations are encoded by position or by schema annotation.

This Normal Form is quite counter-intuitive from the perspective of the document markup tradition, but is quite similar to some approaches to archive or transfer syntaxes for relational databases. It requires out-of-band information, e.g. in appInfo elements in an XML Schema, to determine relations. Alternatively a further convention for ordering relations is required.

The Bosworth normal form for our example is as follows:

<PurchaseOrder>
      <date>1999-10-20</date>
      <Address>
       <string>US</string>
       <string>Alice Smith</string>
       <string>123 Maple Street</string>
       <string>Mill Valley</string>
       <string>CA</string>
       <integer>90952</integer>
      </Address>
      <Address>
       <string>US</string>
       <string>Robert Smith</string>
       <string>8 Oak Avenue</string>
       <string>Old Town</string>
       <string>PA</string>
       <integer>19144</integer>
      </Address>
      <string>Hurry, my lawn is going wild!</string>
      <Item>
       <string>872-AA</string>
       <string>Lawnmower</string>
       <integer>1</integer>
       <decimal>148.95</decimal>
       <string>Confirm this is electric</string>
      </Item>
      <Item>
       <string>926-AA</string>
       <string>Baby Monitor</string>
       <integer>1</integer>
       <decimal>39.98</decimal>
       <date>1999-05-21</date>
      </Item>      
     </PurchaseOrder>

10. Open issues

Re-entrancy: Only one of the four Normal Forms above addresses re-entrancy (shared structure) and/or circularity directly, namely Layman Normal Form. As presented here, the others are constrained to encode only proper trees, and cannot handle directed graphs. It is obviously possible to add the use of IDREF/ID connections to at least Alternating and Property Normal Form to handle these cases. An alternative which works in those cases and also for Individual normal form is to introduce a distinguished (i.e. in a separated namespace) pointer element with an IDREF attribute, along with a distinguished normalID attribute. None of these approaches are quite as clean as Layman Normal Form in handling re-entrancy however, in that given any item which is the value of multiple relations, nesting will be used to encode one of the relations, and some form of indirection for the others, where in principle there is not distinction between them.
Collections: The absence of collections from the simple ontology used here is obviously a weakness. There is after all a difference between a relation between, say, an individual and a set, on the one hand, and multiple instances of a relation between an individual and various others. This difference cannot be encoded as the proposals above stand. The best way forward to fill this gap is not yet clear to me.

11. Conclusions

Our understanding of how we have used markup in the past, and how we may use it in the future, is surprisingly limited -- mostly people have just used markup, without thinking too hard about what they were doing in the abstract (although see [HuitfeldMcQueen00] for a notable exception). I hope the enumeration of Normal Forms presented here may help others in redressing this deficit: we already have some evidence [KrupnikovThompson01] that it yields practical benefits in supporting an XML-Schema-based approach to the declarative specification of the relation between XML document types and the application data they encode.

12. References

Bosworth98: Bosworth, Adam, personal communication. Microsoft Corporation, Redmond, WA.
Cseri98: Cseri, Istvan, personal communication. Microsoft Corporation, Redmond, WA.
HuitfeldMcQueen00: Huitfeld, Klaus and C. Michael Sperberg-McQueen, [to be filled in].
KrupnikovThompson01: Krupnikov, K. Ari and Henry S. Thompson, Data Binding Using W3C XML Schema Annotations, in this volume.
Lay98: Layman, Andrew, "XML Syntax Recommendation for Serializing Graphs of Data", 1998. In QL'98 - The Query Languages Workshop Position Papers, World Wide Web Consortium, Cambridge, MA. Available online at http://www.w3.org/TandS/QL/QL98/pp/microsoft-serializing.html
ThomSwick99: Thompson, Henry S. and Ralph Swick, eds., The Cambridge Communiqué, 1999. World Wide Web Consortium, Cambridge, MA. Available online at http://www.w3.org/TR/schema-arch
ThomTobin01: Thompson, Henry S. and Richard Tobin, A Schema for Serialized Infosets, 2001. World Wide Web Consortium, Cambridge, MA. Available online at http://www.w3.org/2001/05/serialized-infoset-schema.html