Starting with the work of Andrew Layman [Lay98], the issue of establishing conventions for the XML representation of structured data has generated considerable interest. In this paper we present an explicit definition not only of what might be called Layman normal form, but also of three other normal forms.
The work presented here was initially motivated by the need for a starting point for a declarative approach to XML data binding, i.e. to the use of XML as a transfer medium for structured data (see [ThomSwick99] for an introduction to this viewpoint). Such an approach must accommodate both marshalling application data into XML documents and unmarshalling XML documents into application data.
However, with hindsight it has become apparent that it is also a useful point of departure for considering the general question of the semantics of markup vocabularies in the wild, so to speak. That is, when we look at the DTDs and other schemas which have been written for markup applications, what generalisations can we make about how markup is used to convey or record domain properties? It is my contention that in practice existing DTDs can often be understood as employing one or another of the normal forms set out below as their encoding strategy. Understanding these normal forms can thus contribute to the analysis of patterns of markup use in DTDs.
The kind of data we will explore encodings for will not be specified in detail. For the time being, we'll assume a very simple ontology of individuals, atoms, unary types and binary relations. Individuals have identifiers and atoms have notations. Both types and relations have names. Identity is an issue for individuals, but not for atoms. That is, it may make sense to assert that two distinct identifiers identify the same individual, but atoms are assumed to be uniquely determined by their type and notation. A type is a named set of individuals or atoms. A relation is a named set of pairs, each pairing an individual with an individual or an atom. Relations are not restricted to being functional, that is, they may be one-to-many.
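The ontology just described can be sketched as a small data structure. The following is only an illustration under stated assumptions: the class names `Atom`, `Individual` and `Dataset`, and the choice of Python dictionaries to hold types and relations, are hypothetical and are not part of the ontology itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Atom:
    """An atom is uniquely determined by its type and notation."""
    type_name: str   # e.g. "integer"
    notation: str    # e.g. "90952"


@dataclass(frozen=True)
class Individual:
    """An individual is known only through an identifier."""
    identifier: str  # e.g. "A1"


@dataclass
class Dataset:
    # type name -> set of members (individuals or atoms)
    types: dict = field(default_factory=dict)
    # relation name -> list of (individual, individual-or-atom) pairs
    relations: dict = field(default_factory=dict)

    def add_type(self, name, member):
        self.types.setdefault(name, set()).add(member)

    def relate(self, name, subject, value):
        # No functional restriction: one subject may have many values.
        self.relations.setdefault(name, []).append((subject, value))


ds = Dataset()
a1, a2 = Individual("A1"), Individual("A2")
ds.add_type("Address", a1)
ds.add_type("Address", a2)
ds.relate("zip", a1, Atom("integer", "90952"))
ds.relate("zip", a2, Atom("integer", "19144"))

# Two atoms with the same type and notation are the same atom.
same_atom = Atom("integer", "1") == Atom("integer", "1")
```

Note that this sketch treats distinct identifiers as distinct individuals; asserting that two identifiers denote the same individual would need a separate mechanism, which is omitted here.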
This ontology is intended to be a reasonable intermediary between XML and any number of different actual modelling languages or actual application data models: Entity-Relation, UML, RDF, C or Pascal structures, OO class instances and variables, first-order logic.
All the sections below use the following dataset to illustrate their approach.
Atoms of type string are written in italics instead of with the explicit typename. For example, we use "Alice Smith" instead of string("Alice Smith").
... date("1999-10-20")]
... integer("90952")], [A2, integer("19144")]
... integer("1")], [I2, integer("1")]
... decimal("148.95")], [I2, decimal("39.98")]
... date("1999-05-21")]
I use the phrase 'normal form' in this paper in two related ways, one concrete and one abstract. By the concrete use, when talking about a normal form for a particular dataset, that is, a particular collection of individuals, atoms, types and relations, I mean a representation or encoding in XML of that dataset. By the abstract use, e.g. when giving the definition below of Alternating Normal Form, I mean a set of principles for constructing and/or interpreting concrete normal forms. The first use will be notated in lower case (normal form), the second capitalised (Normal Form).
In other words, once you know the principles of a Normal Form, you know two things. Given a particular dataset, you know how to construct a normal form for it, that is, an XML document which represents it in that Normal Form. And given an XML document known to be a normal form, that is, one constructed according to the principles of a Normal Form, you know how to (re)construct the dataset it represents or encodes.
It follows that in defining a Normal Form, both the mapping from datasets to XML and the mapping from XML to datasets must be specified. In practice, specifying this from the perspective of the XML, in a way which is manifestly invertible, seems to work well, and it is this approach which is followed below.
The following four sections set out the principles of four distinct Normal Forms, together with the corresponding normal form of the example dataset.
The alternating normal form for our example is as follows:
<PurchaseOrder>
  <orderDate>1999-10-20</orderDate>
  <shipTo>
    <Address>
      <country>US</country>
      <name>Alice Smith</name>
      <street>123 Maple Street</street>
      <city>Mill Valley</city>
      <state>CA</state>
      <zip>90952</zip>
    </Address>
  </shipTo>
  <billTo>
    <Address>
      <country>US</country>
      <name>Robert Smith</name>
      <street>8 Oak Avenue</street>
      <city>Old Town</city>
      <state>PA</state>
      <zip>19144</zip>
    </Address>
  </billTo>
  <comment>Hurry, my lawn is going wild!</comment>
  <item>
    <Item>
      <partNum>872-AA</partNum>
      <productName>Lawnmower</productName>
      <quantity>1</quantity>
      <USPrice>148.95</USPrice>
      <comment>Confirm this is electric</comment>
    </Item>
  </item>
  <item>
    <Item>
      <partNum>926-AA</partNum>
      <productName>Baby Monitor</productName>
      <quantity>1</quantity>
      <USPrice>39.98</USPrice>
      <shipDate>1999-05-21</shipDate>
    </Item>
  </item>
</PurchaseOrder>
This Normal Form is the simplest to specify, but leads to very verbose documents. It is the basis for the XML reflection of Infosets and PSVIs proposed by [ThomTobin01]. Of the four Normal Forms presented here it requires the least additional information to complete the XML-to-dataset mapping.
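To make the XML-to-dataset direction concrete, here is a minimal sketch of a decoder for documents shaped like the alternating example. It assumes, reading from that example rather than from any formal definition, that element names alternate between type names and relation names, and that a relation element holds either a single type element or the notation of an atom. The function name `decode_alternating` and the generated identifiers `n1`, `n2`, ... are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import count


def decode_alternating(xml_text):
    """Return (types, relations) read from an alternating-form document."""
    ids = count(1)
    types, relations = [], []

    def individual(elem):
        ident = f"n{next(ids)}"        # generated identifier (assumed scheme)
        types.append((elem.tag, ident))
        for rel in elem:               # children of a type element name relations
            if len(rel):               # relation element wraps a type element
                value = individual(rel[0])
            else:                      # relation element holds an atom's notation
                value = (rel.text or "").strip()
            relations.append((rel.tag, ident, value))
        return ident

    individual(ET.fromstring(xml_text))
    return types, relations


doc = """<PurchaseOrder>
  <orderDate>1999-10-20</orderDate>
  <shipTo><Address><zip>90952</zip></Address></shipTo>
  <item><Item><partNum>872-AA</partNum></Item></item>
  <item><Item><partNum>926-AA</partNum></Item></item>
</PurchaseOrder>"""

types, relations = decode_alternating(doc)
```

The types of individuals and the names of relations fall out of the element names alone. The only thing this sketch cannot recover is the type of each atom, which is the additional information that still has to come from elsewhere.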
One could also have a fully explicit Alternating Normal Form, in which there was one more level, so that the types of atoms were encoded by elements as well, and then no out-of-band information would be required, e.g.
. . .
<Item>
  <partNum><string>926-AA</string></partNum>
  <productName><string>Baby Monitor</string></productName>
  <quantity><integer>1</integer></quantity>
  <USPrice><decimal>39.98</decimal></USPrice>
  <shipDate><date>1999-05-21</date></shipDate>
</Item>
Note that here as elsewhere the approach to atomic types is short on detail. Namespace-qualified atomic type names would almost certainly be needed in practice.
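The dataset-to-XML direction for the fully explicit variant can be sketched in a few lines. This is an illustration only: the function name `encode_explicit` and the nested-tuple input shape are assumptions made for the example, and unqualified type names are used even though namespace-qualified ones would be needed in practice.

```python
import xml.etree.ElementTree as ET


def encode_explicit(type_name, relations):
    """Build a type element from (relation name, (value type, value)) pairs.

    A value is either a notation string (an atom) or a nested relation
    list (an individual of type `value type`).
    """
    elem = ET.Element(type_name)
    for rel_name, (value_type, value) in relations:
        rel = ET.SubElement(elem, rel_name)
        if isinstance(value, str):
            # Atom: wrap the notation in an element naming its type.
            ET.SubElement(rel, value_type).text = value
        else:
            # Individual: recurse, keeping the type/relation alternation.
            rel.append(encode_explicit(value_type, value))
    return elem


item = encode_explicit("Item", [
    ("partNum", ("string", "926-AA")),
    ("quantity", ("integer", "1")),
    ("shipDate", ("date", "1999-05-21")),
])
xml_text = ET.tostring(item, encoding="unicode")
```

Because every atom carries its type in the markup, a decoder for this variant would need no out-of-band information.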
The basic idea of this Normal Form was first proposed to me by Istvan Cseri [Cseri98].
A Relation normal form for our example is as follows:
<PurchaseOrder orderDate="1999-10-20">
  <shipTo>
    <country>US</country>
    <name>Alice Smith</name>
    <street>123 Maple Street</street>
    <city>Mill Valley</city>
    <state>CA</state>
    <zip>90952</zip>
  </shipTo>
  <billTo>
    <country>US</country>
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>19144</zip>
  </billTo>
  <comment>Hurry, my lawn is going wild!</comment>
  <item>
    <partNum>872-AA</partNum>
    <productName>Lawnmower</productName>
    <quantity>1</quantity>
    <USPrice>148.95</USPrice>
    <comment>Confirm this is electric</comment>
  </item>
  <item>
    <partNum>926-AA</partNum>
    <productName>Baby Monitor</productName>
    <quantity>1</quantity>
    <USPrice>39.98</USPrice>
    <shipDate>1999-05-21</shipDate>
  </item>
</PurchaseOrder>
It is noticeable that this form is very close to the example as originally encoded in the XML Schema Primer [Fallside01]. The only substantive difference is the lack above of an <items> wrapper for the separate <item> elements.
This Normal Form requires out-of-band information, e.g. from an XML Schema, to supply the types of individuals and atoms. When unmarshalling from XML to data, schema-validation with an appropriate schema will supply the necessary type information in the post schema-validation Infoset (PSVI).
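The role of the out-of-band type information can be sketched as follows. Here a plain dictionary, `RANGE_TYPE`, stands in for what schema-validation would supply in the PSVI; that table, the set `ATOMIC_TYPES`, the function name `decode_relation_form` and the generated identifiers are all hypothetical, and no real schema processor is involved.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Assumed stand-in for schema-supplied types, keyed by relation name.
RANGE_TYPE = {
    "orderDate": "date", "shipTo": "Address", "name": "string",
    "zip": "integer", "item": "Item", "partNum": "string",
    "quantity": "integer",
}
ATOMIC_TYPES = {"string", "integer", "decimal", "date"}


def decode_relation_form(xml_text, root_type):
    ids = count(1)
    types, relations = [], []

    def individual(elem, type_name):
        ident = f"n{next(ids)}"            # generated identifier (assumed scheme)
        types.append((type_name, ident))
        # Attributes carry atom-valued relations.
        for name, text in elem.attrib.items():
            relations.append((name, ident, (RANGE_TYPE[name], text)))
        # Child elements are named by relation; the table supplies the type.
        for child in elem:
            range_type = RANGE_TYPE[child.tag]
            if range_type in ATOMIC_TYPES:
                value = (range_type, (child.text or "").strip())
            else:
                value = individual(child, range_type)
            relations.append((child.tag, ident, value))
        return ident

    individual(ET.fromstring(xml_text), root_type)
    return types, relations


doc = """<PurchaseOrder orderDate="1999-10-20">
  <shipTo><name>Alice Smith</name><zip>90952</zip></shipTo>
  <item><partNum>872-AA</partNum><quantity>1</quantity></item>
</PurchaseOrder>"""

types, relations = decode_relation_form(doc, "PurchaseOrder")
```

Without the table, the decoder could still recover the relation names and the nesting, but it could not say that the value of shipTo is an Address or that 90952 is an integer.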
Out-of-band information is required to identify the attributes of type ID, IDREF and IDREFS. This could be provided by a minimal set of XML 1.0 attribute declarations in the internal subset, or by reference to an external DTD, or by an XML Schema.
A Layman normal form for our example is as follows:
<PurchaseOrder orderDate="1999-10-20" shipTo="A1" billTo="A2" comment="Hurry, my lawn is going wild!" item="I1 I2">
  <Address id="A1" country="US" name="Alice Smith" street="123 Maple Street" city="Mill Valley" state="CA" zip="90952"/>
  <Address id="A2" country="US" name="Robert Smith" street="8 Oak Avenue" city="Old Town" state="PA" zip="19144"/>
  <Item id="I1" partNum="872-AA" productName="Lawnmower" quantity="1" USPrice="148.95" comment="Confirm this is electric"/>
  <Item id="I2" partNum="926-AA" productName="Baby Monitor" quantity="1" USPrice="39.98" shipDate="1999-05-21"/>
</PurchaseOrder>
This is the most removed from any kind of 'traditional' approach of the Normal Forms presented here. It does not require the use of nesting to convey any information at all, although it does not preclude it: although in this example the Address and Item elements are only nested within the PurchaseOrder to ensure the whole is well-formed, in examples with richer structure nesting indicative of dependency could perfectly well be used.
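Since nesting carries no information in the Layman example, a decoder can simply visit every element and read its attributes. The sketch below assumes that the names of the ID attribute and of the IDREF/IDREFS attributes are known out of band; the constants `ID_ATTR` and `REF_ATTRS`, the function name `decode_layman`, and the fallback identifier for elements without an ID are all hypothetical.

```python
import xml.etree.ElementTree as ET

ID_ATTR = "id"                              # assumed to be declared as type ID
REF_ATTRS = {"shipTo", "billTo", "item"}    # assumed IDREF / IDREFS attributes


def decode_layman(xml_text):
    types, relations = [], []
    # iter() visits every element in document order, ignoring nesting depth.
    for elem in ET.fromstring(xml_text).iter():
        # Elements without an ID get a made-up identifier from their tag.
        ident = elem.get(ID_ATTR) or f"_{elem.tag}"
        types.append((elem.tag, ident))
        for name, text in elem.attrib.items():
            if name == ID_ATTR:
                continue
            if name in REF_ATTRS:
                # IDREFS: one relation pair per whitespace-separated target.
                for target in text.split():
                    relations.append((name, ident, ("ref", target)))
            else:
                relations.append((name, ident, ("atom", text)))
    return types, relations


doc = """<PurchaseOrder orderDate="1999-10-20" shipTo="A1" item="I1 I2">
  <Address id="A1" zip="90952"/>
  <Item id="I1" partNum="872-AA"/>
  <Item id="I2" partNum="926-AA"/>
</PurchaseOrder>"""

types, relations = decode_layman(doc)
```

An individual referred to from several relations needs no special treatment here, since every reference is made the same way, by identifier.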
The basic idea of this Normal Form was first proposed to me by Adam Bosworth [Bosworth98].
This Normal Form is quite counter-intuitive from the perspective of the document markup tradition, but is quite similar to some approaches to archive or transfer syntaxes for relational databases. It requires out-of-band information, e.g. in appInfo elements in an XML Schema, to determine relations. Alternatively a further convention for ordering relations is required.
The Bosworth normal form for our example is as follows:
<PurchaseOrder>
  <date>1999-10-20</date>
  <Address>
    <string>US</string>
    <string>Alice Smith</string>
    <string>123 Maple Street</string>
    <string>Mill Valley</string>
    <string>CA</string>
    <integer>90952</integer>
  </Address>
  <Address>
    <string>US</string>
    <string>Robert Smith</string>
    <string>8 Oak Avenue</string>
    <string>Old Town</string>
    <string>PA</string>
    <integer>19144</integer>
  </Address>
  <string>Hurry, my lawn is going wild!</string>
  <Item>
    <string>872-AA</string>
    <string>Lawnmower</string>
    <integer>1</integer>
    <decimal>148.95</decimal>
    <string>Confirm this is electric</string>
  </Item>
  <Item>
    <string>926-AA</string>
    <string>Baby Monitor</string>
    <integer>1</integer>
    <decimal>39.98</decimal>
    <date>1999-05-21</date>
  </Item>
</PurchaseOrder>
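Because only types appear in this form, relation names have to be recovered from position. The sketch below shows one possible ordering convention, which is an assumption for illustration and not a convention defined here: for each type, a hypothetical table `ORDER` lists its relations in document order with their range types, and children are matched greedily against it. The function name `name_relations` is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed out-of-band table: per type, relations in document order as
# (relation name, range type, repeatable).
ORDER = {
    "PurchaseOrder": [
        ("orderDate", "date", False), ("shipTo", "Address", False),
        ("billTo", "Address", False), ("comment", "string", False),
        ("item", "Item", True),
    ],
    "Item": [
        ("partNum", "string", False), ("productName", "string", False),
        ("quantity", "integer", False), ("USPrice", "decimal", False),
        ("comment", "string", False), ("shipDate", "date", False),
    ],
}


def name_relations(elem):
    """Pair each child of a type-named element with the relation it encodes."""
    slots = ORDER[elem.tag]
    pos, named = 0, []
    for child in elem:
        # Skip slots for absent relations until one accepts this child's type.
        while pos < len(slots) and slots[pos][1] != child.tag:
            pos += 1
        if pos == len(slots):
            raise ValueError(f"no relation accepts <{child.tag}> here")
        relation, _, repeatable = slots[pos]
        named.append((relation, child))
        if not repeatable:
            pos += 1
    return named


doc = """<PurchaseOrder>
  <date>1999-10-20</date>
  <Address><string>US</string></Address>
  <Address><string>US</string></Address>
  <string>Hurry, my lawn is going wild!</string>
  <Item><string>872-AA</string><string>Lawnmower</string><integer>1</integer>
        <decimal>148.95</decimal><string>Confirm this is electric</string></Item>
  <Item><string>926-AA</string><string>Baby Monitor</string><integer>1</integer>
        <decimal>39.98</decimal><date>1999-05-21</date></Item>
</PurchaseOrder>"""

root = ET.fromstring(doc)
po_relations = [rel for rel, _ in name_relations(root)]
item_relations = [[rel for rel, _ in name_relations(item)]
                  for item in root.findall("Item")]
```

Greedy matching is enough for this example, where the second Item omits comment and supplies shipDate, but it is not a general solution: if an optional relation were followed by another relation with the same range type, position and type together would no longer identify the relation.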
... pointer element with an IDREF attribute, along with a distinguished normalID attribute. None of these approaches is quite as clean as Layman Normal Form in handling re-entrancy, however, in that given any item which is the value of multiple relations, nesting will be used to encode one of the relations and some form of indirection the others, where in principle there is no distinction between them.
Our understanding of how we have used markup in the past, and how we may use it in the future, is surprisingly limited: mostly, people have just used markup, without thinking too hard about what they were doing in the abstract (although see [HuitfeldMcQueen00] for a notable exception). I hope the enumeration of Normal Forms presented here may help others in redressing this deficit: we already have some evidence [KrupnikovThompson01] that it yields practical benefits in supporting an XML-Schema-based approach to the declarative specification of the relation between XML document types and the application data they encode.
http://www.w3.org/TandS/QL/QL98/pp/microsoft-serializing.html
http://www.w3.org/TR/schema-arch
http://www.w3.org/2001/05/serialized-infoset-schema.html