Schema Languages Comparison

Analysts' Report

Norman Walsh

Staff Engineer
Sun Microsystems, XML Technology Center

John Cowan

Senior Internet Systems Developer
Reuters Health Information

19 Nov 2001


Introduction

[FIXME: introduction]

And the Winner Is …

There is no winner. The purpose of this panel is not to decide which schema language is universally best or right. There is no more a best schema language than there is a best programming language or a best make and model of automobile. Different tools are suited to different tasks.

The purpose of this panel is to survey the popular schema languages, identify some of the distinguishing features that one might use to decide which language is best suited to a particular task, and give users and schema authors a chance to ask schema language designers and experts questions about their favorite schema technologies.

About the Panel

This panel ...

About the Analysts

Norman Walsh participates actively in a number of standards efforts worldwide, including the XML Core, XSL and XML Schema Working Groups of the World Wide Web Consortium, the OASIS XSLT Conformance and RELAX NG Committees, the OASIS Entity Resolution Committee, for which he is the editor, and the OASIS DocBook Technical Committee, which he chairs.

John Cowan ...

Method of Analysis

Schema languages can be evaluated on many criteria, including:

  • Expressiveness.

  • Compactness.

  • Syntax.

  • Tools support.

  • Validation efficiency.

  • Datatype library richness.

  • And many others …

To obtain a representative selection of schemas for analysis, we asked each team to provide two sets of schemas, one of their own design and another to satisfy a specific set of user requirements. Our goal was to allow the schema authors the freedom to demonstrate the unique strengths of their language on the one hand and to obtain some basis for an “apples-to-apples” comparison on the other; see Appendix A.

The Schema Languages

The analysts examined four schema languages in detail: [XML 1.0] DTDs, [W3C XML Schema], [RELAX NG], and [Schematron]. These are by no means the only schema languages available, but we felt that these languages cover the vast majority of the current landscape. Your favorite schema language, even if it isn't represented in this sample, is no more wrong or right than any of these.

DTDs

Birthplace:

ISO

Age:

SGML: >10 years [ISO 8879:1986]; XML: almost 4 years [XML 1.0]

Parents:

SGML

Type:

Grammar

Formalism:

ad hoc

Syntax:

XML declaration

Team:

Steven J. DeRose, Deborah Lapeyre and B. Tommie Usdin.

FIXME: general observations

FIXME: discuss the team's schemas

W3C XML Schemas

Birthplace:

W3C

Age:

7 months [W3C XML Schema]

Parents:

SOX, XDR, DDML, …

Type:

Grammar

Formalism:

ad hoc. A formalism is being developed after the fact.

Syntax:

XML instance

Team:

Martin Gudgin, Henry S. Thompson, and Priscilla Walmsley.

FIXME: general observations

FIXME: discuss the team's schemas

RELAX NG

Birthplace:

OASIS

Age:

4 months [RELAX NG]

Parents:

RELAX and TREX

Type:

Grammar

Formalism:

Hedge Automata

Syntax:

XML instance

Team:

James Clark and Murata Makoto.

FIXME: general observations

FIXME: discuss the team's schemas

Schematron

Birthplace:

Academia Sinica

Age:

6 months

Parents:

XPath and XSLT

Type:

Tree Pattern

Formalism:

ad hoc

Syntax:

XML instance

Team:

Rick Jelliffe, Francis Norton, and Eddie Robertsson.

FIXME: general observations

FIXME: discuss the team's schemas

Scorecard

The common schemas were designed to probe the following language features:

  • Namespace support. Vocabularies with and without namespaces.

  • Multiple top-level elements. Vocabularies which allow instances that may be rooted at more than one element.

  • Unordered content. Required content without a required order (approximately the SGML "&" connector).

  • Simple datatypes. Dates, integers, prices, etc.

  • Schema modularity. Schemas that can be composed of distinct schema modules.

  • Content model extensibility. The ability for one schema module to extend or modify the content of an element declared in another module.

  • Subtype/equivalence relationships. The ability to express that two elements are in some sense the same kind of object.

  • Context-dependency. Context-dependent content models (local element declarations).

  • Enumerations. Enumerated attribute and element content.

  • Default values. Default attribute and element values.

  • Pattern matching. Attribute and element content validity based on patterns (e.g., regular expressions).

  • Exclusions.

  • Mixed namespaces:

    • Explicit namespaces. Elements from an explicit namespace.

    • Any namespace. Elements from any namespace without restriction.

    • Any namespace (XSL). Elements from any namespace with namespace URI restrictions (like XSL top-level elements).

    • Any namespace (nesting). Elements from any namespace with nesting restrictions (any namespace but not containing elements or attributes from some explicit namespace at any depth).

  • Links:

    • Simple links. Internal (ID/IDREF-style) links.

    • Typed links. Links whose link target must be a specific element type.

    • ID-type element content. Allowing element content to be identified as IDs.

    • Simple XLinks. Another test of namespace support and default attribute content, really.

  • Various co-constraints:

    • Sibling content. The content of sibling elements must be the same.

    • Sibling attribute values. Sibling elements must have different values for a given attribute.

    • Mutual exclusion. An attribute or a child element must be present, but not both.

    • Element type from attribute presence. Element data type dependent on presence or absence of an attribute.

    • Element type from attribute content. Element content dependent on attribute value.

    • Attribute type from element content. Attribute values dependent on element content.

    • Attribute value exclusion. Attributes that are required to be different if both are specified.

The following table provides an “at a glance” summary of how the languages compared, as demonstrated by the teams' submissions of the common schemas.

Language Feature                      DTDs     W3C XML Schemas  RELAX NG  Schematron
Namespace support                     Some[a]  Yes              Yes       Yes
Multiple top-level elements           Yes      Yes              Yes       Yes
Unordered content                     No[b]    Yes[c]           Yes       Yes
Simple datatypes                      No[d]    Yes[e]           Yes[f]    No[g]
Schema modularity                     Some[h]  Yes              Yes       ?
Content model extensibility           Some[h]  Yes              Yes       ?
Subtype/equivalence relationships     No       Yes              No        Some[i]
Context-dependency                    No       No               Yes       Yes
Enumerations                          Some[j]  Yes              Yes       Yes
Default values                        Some[j]  Yes              Some[k]   No
Pattern matching                      No       Yes              Yes[l]    No
Exclusions                            No[m]    Some[n]          Some[n]   Yes
Explicit namespaces                   Some[a]  Yes              Yes       Yes
Any namespace                         No       Yes              Yes       Yes
Any namespace (XSL)                   No       No               Yes       ?
Any namespace (nesting)               No       Yes              Yes       Yes
Simple links                          Yes      Yes              Some[o]   Yes
Typed links                           No       Some[p]          No        Yes
ID-type element content               No       Yes              No        Yes
Sibling content                       No       No               Yes       ?
Sibling attribute values              No       No               No        ?
Mutual exclusion                      No       No               Yes       Yes
Element type from attribute presence  No       No               ?         Some[i]
Element type from attribute content   No       No               ?         Some[i]
Attribute type from element content   No       No               ?         Some[i]
Attribute value exclusion             No       No               No        Yes

[a] Namespace prefixes are fixed on a per-document basis at best.

[b] SGML had the "&" connector, but it's not in XML.

[c] At the top-level of an element declaration

[d] There is a tiny set of data types.

[e] W3C XML Schema Part 2 datatypes

[f] Datatype library identified by URI

[g] Some simple types could be checked by assertion.

[h] Dependent on the DTD author using parameter entities appropriately.

[i] By writing appropriate assertions.

[j] On attribute values only.

[k] Only on attributes and only with [RNG DTD].

[l] Using W3C XML Schema Part 2 datatypes

[m] SGML had exclusions, but XML does not.

[n] By content model manipulation

[o] Only with [RNG DTD].

[p] Requires explicit hierarchy

A. Common Schemas

This appendix summarizes the three schemas that each team was required to implement (the full text of the original requirements is also available). All of these schemas are contrived. Schema authors were encouraged to balance readability with absolute adherence to the requirements.

Cross-referencing mechanisms are intentionally vague. Some schema languages, like DTDs, will have to use ID/IDREF. Other teams may choose to use different language facilities for this purpose. All teams were free to add attributes as necessary to accomplish the required linking.

Technical Memorandum

The first schema is for a technical memorandum. It does not have a namespace. It begins with either a memo or a techmemo element; these are synonymous. It consists of a head and a body.

The head contains exactly one each of the following: date, author, and title. It may also include any number of meta elements from the XHTML namespace. These elements may appear in any order.

The date must be a valid date; the other head fields contain only text. What constitutes a valid date is intentionally vague. If you're using a datatype library that supports something that might reasonably be called a “date type”, you can use that. If you prefer to use a regular expression, that's fine too. As long as you describe how you interpreted “valid date” and how you achieved validation of that date, it's up to you.

The body contains a mixture of zero or more para and list elements. The emph, footnote, footnoteref, and link (a simple XLink) elements may appear inside para, along with text.

The footnote element contains one or more para elements. However, footnotes may not nest; a footnote may not contain a footnote as a descendant. The footnoteref element is empty; it has a required ref attribute which must point to a footnote. The emph and link elements contain only text.
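As an illustration only, the two footnote rules above can be sketched as a procedural check in Python using the standard xml.etree.ElementTree module. The function names are hypothetical, and the use of an id attribute on footnote is an assumption, since the requirements leave the cross-referencing mechanism open. This is not how any of the teams' schemas express the constraint.

```python
import xml.etree.ElementTree as ET


def has_nested_footnote(root):
    """True if any footnote element has another footnote as a descendant."""
    return any(
        inner is not outer  # iter() yields the starting element first
        for outer in root.iter("footnote")
        for inner in outer.iter("footnote")
    )


def dangling_footnoterefs(root):
    """Return the ref values (None if missing) that match no footnote id."""
    # Assumption: footnotes carry an "id" attribute; the requirements
    # leave the cross-referencing mechanism open.
    ids = {fn.get("id") for fn in root.iter("footnote") if fn.get("id")}
    return [r.get("ref") for r in root.iter("footnoteref") if r.get("ref") not in ids]
```

A grammar-based language typically handles the nesting rule by defining a second paragraph content model without footnote; the check here simply walks the tree after parsing.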

A list consists of an optional title followed by two or more item elements. An item may contain text and inlines (emph, footnote, footnoteref, and link), or one or more para elements, but not both. All of the items in a list must have the same kind of content (all text and inlines or all paragraphs).
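The “all items alike” rule is a co-constraint across siblings, which is what makes it awkward for some grammar-based languages. A minimal Python sketch of the check follows; the helper names are hypothetical, the optional title is not examined, and an empty item is treated as text content by assumption.

```python
import xml.etree.ElementTree as ET

INLINES = {"emph", "footnote", "footnoteref", "link"}


def item_kind(item):
    """Classify an item as 'paras', 'inline', or 'invalid'."""
    children = [child.tag for child in item]
    if children and all(tag == "para" for tag in children):
        # Paragraph-only items may not have text between the paragraphs.
        texts = [item.text] + [child.tail for child in item]
        if any(t and t.strip() for t in texts):
            return "invalid"
        return "paras"
    if all(tag in INLINES for tag in children):
        return "inline"  # text and inlines (an empty item lands here)
    return "invalid"


def list_is_valid(lst):
    """Two or more items, all of the same valid kind."""
    items = lst.findall("item")
    kinds = {item_kind(item) for item in items}
    return len(items) >= 2 and len(kinds) == 1 and "invalid" not in kinds
```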

A Whitepaper

The second schema is for a white paper; it also has no namespace. It intentionally shares many of the same structures as the technical memorandum schema. Teams are invited to factor the common bits, write one schema as a customization of the other, or otherwise take advantage of as much reuse as is practical.

A whitepaper consists of a required head followed by a mixture of para and list elements (which may be absent), followed by zero or more section elements and an optional glossary.

The head must contain exactly one date and one title. It must contain at least one author. It may contain at most one titleabbrev element. It may contain zero or more copyright, keywords and legalnotice elements. It may also contain any number of meta elements from the XHTML namespace. The order of elements in the head is irrelevant.

The keywords element has an optional vocabulary attribute. If multiple keywords are provided, they must come from different vocabularies.
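This is the “sibling attribute values” probe from the scorecard. A short Python sketch of the check is given below; the function name is hypothetical, and treating a missing vocabulary attribute as a single unnamed vocabulary is our assumption, since the requirement does not address that case.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def repeated_vocabularies(head):
    """Return vocabulary values used by more than one keywords element.

    Assumption: a keywords element without the attribute counts as one
    unnamed vocabulary, reported as None.
    """
    counts = Counter(k.get("vocabulary") for k in head.findall("keywords"))
    return [vocab for vocab, n in counts.items() if n > 1]
```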

The content of date, author and title elements is as before. The titleabbrev element contains text. The keywords element contains a whitespace-delimited list of one or more tokens (there are no restrictions on the characters in the tokens). The legalnotice contains an optional title followed by one or more para elements. Finally, copyright contains one or more year elements and one or more holder elements, in that order.

Copyright years should be valid years; holders are simply text.

Sections must have a head, but only the title element is required in section heads. The body of a section consists only of paragraphs, lists, and optionally trailing sections.

The whitepaper schema adds a new inline to the content of para: glossterm. A glossterm must point to a glossdef. If the glossterm has a ref attribute, that attribute points to the definition; otherwise the body of the glossterm is to be used for the cross reference.

A glossary consists of an optional head (of the same form as section) followed by one or more glossdef elements. Each glossdef consists of a term followed by one or more paragraphs or lists. The terms contain only text.

Order Form

The order form schema uses addresses for both billing and shipping information. For our purposes, there are two kinds of addresses in the world: US addresses and international addresses. A US address consists of the following fields: one or more street elements followed by city, state (which must be one of the 50 US state postal abbreviations), zip (which must be either a five digit zip code or a nine digit “zip+4” code), and an optional country. If country is specified, it must be “US”.
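The US address rules combine an enumeration (state), a pattern (zip), and a fixed value (country). A minimal Python sketch follows. The function name is hypothetical, and the hyphenated form of the “zip+4” code is an assumption; the requirement says only “nine digit”.

```python
import re

# The 50 US state postal abbreviations.
US_STATES = frozenset(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY".split()
)

# Five digits, optionally followed by a hyphen and four more (assumed form).
ZIP = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def us_address_errors(state, zip_code, country=None):
    """Return the names of the fields that break the US address rules."""
    errors = []
    if state not in US_STATES:
        errors.append("state")
    if not ZIP.fullmatch(zip_code):
        errors.append("zip")
    if country is not None and country != "US":
        errors.append("country")
    return errors
```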

An international address consists of: one or more street elements followed by city, an optional stateOrProvince, an optional postalcode, and a country.

Either of these forms may be used for the address fields of the order form schema.

The namespace name for elements in the order form schema is “urn:x-xmlns:example:orderForm”. An orderForm contains exactly one of each of the following elements, in this order: billToAddress, order, shippingInfo, and paymentMethod. If the billToAddress is a US address and the state is not one of the following: AK, DE, HI, MT, NO, OR, or WY, then the orderForm must also include a salesTax element immediately after the order.
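The salesTax rule makes the content model of orderForm depend on a value nested inside billToAddress. The Python sketch below shows one way to check it after parsing. The function names are hypothetical, a US address is recognized by the presence of a state child (an assumption), and “immediately after” is read as ignoring any elements from other namespaces. The state list is copied verbatim from the requirement.

```python
import xml.etree.ElementTree as ET

NS = "{urn:x-xmlns:example:orderForm}"
# Copied verbatim from the requirement text.
EXEMPT_STATES = {"AK", "DE", "HI", "MT", "NO", "OR", "WY"}


def sales_tax_required(order_form):
    """True if billToAddress is a US address in a non-exempt state."""
    # Assumption: a US address is recognized by its state child
    # (international addresses use stateOrProvince instead).
    state = order_form.findtext(f"{NS}billToAddress/{NS}state")
    return state is not None and state.strip() not in EXEMPT_STATES


def sales_tax_ok(order_form):
    """Check that salesTax follows order whenever it is required."""
    if not sales_tax_required(order_form):
        return True  # the requirement does not cover this case
    own = [child.tag for child in order_form if child.tag.startswith(NS)]
    if NS + "order" not in own:
        return False
    after_order = own.index(NS + "order") + 1
    return own[after_order:after_order + 1] == [NS + "salesTax"]
```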

The orderForm may additionally contain any element not from the order form namespace, provided that the expanded-name of the element has a non-null namespace URI. Elements not from the order form namespace may not contain elements or attributes from the order form namespace.
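This corresponds to the “any namespace (nesting)” probe. As a rough Python sketch, relying on the way xml.etree.ElementTree writes qualified names as {uri}local, the check might look like the following. The function name is hypothetical, and the rule is applied to descendants at any depth.

```python
import xml.etree.ElementTree as ET

NS = "{urn:x-xmlns:example:orderForm}"


def foreign_child_errors(order_form):
    """Return one message per violation among the foreign children."""
    errors = []
    for child in order_form:
        if child.tag.startswith(NS):
            continue  # the order form's own content is checked elsewhere
        if not child.tag.startswith("{"):
            errors.append(f"{child.tag}: element has no namespace")
            continue
        # iter() visits the foreign child itself and all its descendants.
        for node in child.iter():
            if node.tag.startswith(NS):
                errors.append(f"{node.tag}: order form element inside foreign content")
            elif any(name.startswith(NS) for name in node.attrib):
                errors.append(f"{node.tag}: order form attribute inside foreign content")
    return errors
```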

An order consists of one or more item elements.

Each item begins with an itemNumber. If the item number has the form “CL-” followed by a four digit number, it is a clothing item. If it has the form “NC-” followed by a four digit number, it is a non-clothing item. If it matches neither pattern, it is invalid.
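The item number decides which content model applies to the rest of the item, so it is both a pattern-matching and a co-constraint test. The classification itself can be sketched in a few lines of Python; the function name is hypothetical.

```python
import re

CLOTHING = re.compile(r"CL-[0-9]{4}")
NON_CLOTHING = re.compile(r"NC-[0-9]{4}")


def classify_item_number(value):
    """Return 'clothing', 'non-clothing', or 'invalid' for an item number."""
    if CLOTHING.fullmatch(value):
        return "clothing"
    if NON_CLOTHING.fullmatch(value):
        return "non-clothing"
    return "invalid"
```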

Non-clothing items have the following additional fields: description, quantity, and unitPrice, in that order. The description may contain text and elements from any namespace other than the order form namespace (including elements whose expanded-name has a null namespace URI and without any restriction on their content). The quantity must be a positive integer. The unitPrice must be a positive decimal number with two digits after the decimal point. The quantity element is optional; if it is not specified, it must default to “1”. The description and unitPrice elements are required. The description must not be empty and may not contain only whitespace.
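The quantity and unitPrice rules exercise simple datatypes and default values. The Python sketch below covers those two fields only. The function names are hypothetical, and disallowing a leading sign or surrounding non-whitespace characters is our reading of “positive integer” and “positive decimal number”.

```python
import re
from decimal import Decimal


def quantity_value(text):
    """Return the effective quantity, or None if the text is invalid."""
    if text is None:  # element absent: apply the default of 1
        return 1
    token = text.strip()
    if not re.fullmatch(r"[0-9]+", token) or int(token) == 0:
        return None
    return int(token)


def is_valid_unit_price(text):
    """True for a positive decimal with exactly two fractional digits."""
    token = text.strip()
    return bool(re.fullmatch(r"[0-9]+\.[0-9]{2}", token)) and Decimal(token) > 0
```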

Clothing items must have all of the fields of a non-clothing item, plus the following additional fields (in this order): size (“S”, “M”, “L”, “XL”, “LT”, or “XLT”), color, alternateColor (color and alternate color may not be the same), and an optional monogram which must consist of 1-3 upper-case letters (“A”-“Z”).
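The clothing fields add an enumeration, a value-inequality co-constraint, and a short pattern. A Python sketch of those three checks follows; the function name is hypothetical, and colors are compared as exact strings by assumption.

```python
import re

SIZES = {"S", "M", "L", "XL", "LT", "XLT"}
MONOGRAM = re.compile(r"[A-Z]{1,3}")


def clothing_errors(size, color, alternate_color, monogram=None):
    """Return the names of the clothing fields that break the rules."""
    errors = []
    if size not in SIZES:
        errors.append("size")
    if color == alternate_color:  # assumption: exact string comparison
        errors.append("alternateColor")
    if monogram is not None and not MONOGRAM.fullmatch(monogram):
        errors.append("monogram")
    return errors
```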

The shippingInfo contains a shipToAddress and a shipBy element in that order.

The shipBy is either “USPS”, “FedEx”, “UPS”, or “DHL”. (Tokenized element content, like attribute values, may have leading and trailing whitespace.) The shipBy must have a shippingCost attribute and may optionally have a rush attribute containing “none”, “3day”, “2day”, or “overnight”. If unspecified, rush defaults to “none”. Overnight shipping is not available to international addresses.

The paymentMethod consists of either a creditCard or a checkOrMoneyOrder. The amount of the payment is recorded in the amount attribute on the paymentMethod.

The creditCard element must have either a type attribute or a type child (it is an error to have neither or both). In either case, the content must be one of the following: “Amex”, “Visa”, or “Mastercard”. The creditCard must also have a number and an expiration. For “Amex” payments, the number must be 15 digits long; for “Mastercard” it must be 16; for “Visa” it must be either 13 or 16 digits long. The expiration must match the pattern “99/99”.
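The creditCard rules bundle the “mutual exclusion” probe with an element type that depends on attribute or element content. One possible Python sketch is shown below. The function name is hypothetical, and we assume that number and expiration are child elements in the order form namespace, which the requirement does not state.

```python
import re
import xml.etree.ElementTree as ET

NS = "{urn:x-xmlns:example:orderForm}"
NUMBER_LENGTHS = {"Amex": {15}, "Mastercard": {16}, "Visa": {13, 16}}
EXPIRATION = re.compile(r"[0-9]{2}/[0-9]{2}")


def credit_card_errors(card):
    """Return a list of problems found on a creditCard element."""
    type_attr = card.get("type")
    type_child = card.find(NS + "type")
    # Mutual exclusion: exactly one of the two may be present.
    if (type_attr is None) == (type_child is None):
        return ["need exactly one of: type attribute, type child"]
    kind = type_attr if type_attr is not None else (type_child.text or "")
    kind = kind.strip()
    if kind not in NUMBER_LENGTHS:
        return [f"unknown card type {kind!r}"]
    errors = []
    # Assumption: number and expiration are child elements.
    number = (card.findtext(NS + "number") or "").strip()
    if not re.fullmatch(r"[0-9]+", number) or len(number) not in NUMBER_LENGTHS[kind]:
        errors.append("number")
    expiration = (card.findtext(NS + "expiration") or "").strip()
    if not EXPIRATION.fullmatch(expiration):
        errors.append("expiration")
    return errors
```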

The checkOrMoneyOrder element is empty.

Finally, salesTax must be a positive decimal number with two digits after the decimal point.

References

[ISO 8879:1986] JTC 1, SC 34. ISO 8879:1986 Information processing -- Text and office systems -- Standard Generalized Markup Language (SGML). 1986.

[XML 1.0] Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, and Eve Maler, editors. Extensible Markup Language (XML) 1.0 Second Edition. World Wide Web Consortium, 2000.

[W3C XML Schema] Henry S. Thompson, David Beech, Murray Maloney, et al., editors. XML Schema Part 1: Structures. World Wide Web Consortium, 2000.

[W3C XML Datatypes] Paul V. Biron and Ashok Malhotra, editors. XML Schema Part 2: Datatypes. World Wide Web Consortium, 2000.

[RELAX NG] James Clark, editor. RELAX NG Specification (Committee Specification). OASIS. 2001.

[RNG DTD] James Clark, editor. RELAX NG DTD Compatibility (Committee Specification). OASIS. 2001.

[Schematron] Rick Jelliffe, editor. The Schematron Assertion Language 1.5. Rick Jelliffe and Academia Sinica Computing Centre. 2001.