Extreme Markup Languages
Mapping XML Schemas to class definitions and relational schemas allows for seamless marshaling and unmarshaling of an object's state to and from XML documents. W3C XML Schemas provide a rich set of datatypes and validity rules, both built-in and user-definable. It is possible to map data structures defined in a Schema document to data structures in object-oriented languages and relational databases. This paper presents approaches to automating such mapping.
In simple cases, a Schema document can be generated from a class definition, or a set of table definitions from a Schema document, with generic heuristics based on XML normal forms governing the generation process. In more interesting cases, a mapping between existing classes and Schemas is needed, such as when an existing class is required to marshal its state to an XML document that is valid with respect to a standard Schema.
A mechanism built into the Schema Recommendation is schema annotations. Annotations can be used to modify generic heuristics of mapping schemas to classes, or replace them completely with custom mappings. Schema annotations are fragments of XML text embedded in a Schema document that can be interpreted by the mapping software.
There are numerous schema languages being developed and proposed for XML. In this paper, we discuss W3C XML Schema; for brevity, we refer to it simply as "schema".
It is possible to represent an object's state as an XML document to serialize it. Such serialization would map an object's data members to nodes in a document. The structure of mapped objects and documents need not be identical. Indeed, some constructs in programming languages, such as the distinction between arrays and linked lists, do not map trivially to XML; the converse is also true.
In this paper, we use a somewhat artificial example, taken from Schema Part 0: Primer, to demonstrate how minimal human input may produce automated, meaningful mappings.
The document is reproduced here unchanged:
The schema we use is based on the one in the Primer. We don't include it here because of space constraints, but include fragments that are augmented with mapping information.
The Schema Recommendation is divided into two parts: structures and datatypes. Defaults for mapping can likewise be divided into two: normal forms for structures and mappings for built-in datatypes.
In an accompanying paper [Thompson (2001)], Henry Thompson introduces the notion of "XML normal forms". Inspired in part by relational normal forms [Codd (1970)], these conventions for XML representation of structured data can be a basis for meaningful defaults in mapping XML schemas to object-oriented classes, relational tables, or other data structures. Unlike relational normal forms, XML normal forms do not represent progressive abstractions, but different, unrelated ones.
A number of normal forms have been proposed; some are discussed in the paper mentioned above. In this paper, we use a simple form in which both elements and attributes in a document are assumed to represent properties, and their names are assumed to represent the properties' names; simple types are assumed to represent atoms and complex types, relations. Other normal forms lend themselves equally well to mapping techniques described in this paper, through application of slightly different heuristics.
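The heuristic of this simple normal form can be illustrated with a minimal Python sketch. It is not the authors' implementation: it works on a bare instance document without a schema, so it approximates "simple type" by "leaf element without attributes" and "complex type" by everything else. The function name `to_properties` and the abbreviated sample document (loosely modelled on the Primer's purchase order) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def to_properties(elem):
    """Treat attributes and child elements alike as named properties."""
    props = dict(elem.attrib)  # each attribute becomes a property
    for child in elem:
        if len(child) == 0 and not child.attrib:
            # Stand-in for a simple type: the value is an atom.
            value = (child.text or "").strip()
        else:
            # Stand-in for a complex type: the value is a nested relation.
            value = to_properties(child)
        if child.tag in props:
            # A repeated name collects its values into a list.
            if not isinstance(props[child.tag], list):
                props[child.tag] = [props[child.tag]]
            props[child.tag].append(value)
        else:
            props[child.tag] = value
    return props

# Illustrative sample only, not the full document used in the paper.
SAMPLE = """<purchaseOrder orderDate="1999-10-20">
  <comment>Hurry, my lawn is going wild</comment>
  <items>
    <item partNum="872-AA"><productName>Lawnmower</productName></item>
    <item partNum="926-AA"><productName>Baby Monitor</productName></item>
  </items>
</purchaseOrder>"""

order = to_properties(ET.fromstring(SAMPLE))
```

Note that the attribute `orderDate` and the element `comment` end up side by side as properties of the same record, which is exactly the equivalence this normal form assumes.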
A normal form may be formally defined in an XML document. Here is the definition for the normal form described above:
Schema Part 2: Datatypes describes a wealth of built-in datatypes. We use a simple XML format to encode a mapping from these built-in types to types available in host languages. Here is a fragment from type-defaults.java.xml, the mapping for Java:
These defaults may be overridden for specific components in a schema as described below, or globally, by creating an alternative mapping document.
We use the map namespace to describe mappings; its URI is http://www.cogsci.ed.ac.uk/~kari/schema-mapping. It is used to describe both defaults, as in this case, and to override defaults as we describe in the next section.
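The lookup behaviour described here (a table of defaults plus an alternative mapping that overrides it globally) might be sketched as follows. The sketch uses a plain Python dictionary rather than the XML format of type-defaults.java.xml, which is not reproduced. Only the `xsd:decimal` entry is taken from this paper; the other entries and the names `JAVA_DEFAULTS` and `host_type` are assumptions.

```python
# Default mapping from built-in schema types to Java types.
JAVA_DEFAULTS = {
    "xsd:decimal": "java.math.BigDecimal",  # the mapping cited in this paper
    "xsd:string": "java.lang.String",       # assumed for illustration
    "xsd:int": "int",                       # assumed for illustration
}

def host_type(schema_type, defaults=JAVA_DEFAULTS, overrides=None):
    """Return the host-language type for a built-in schema type.

    `overrides` plays the role of an alternative mapping document:
    entries found there win, everything else falls back to the defaults.
    """
    if overrides and schema_type in overrides:
        return overrides[schema_type]
    return defaults[schema_type]

# Hypothetical alternative mapping that trades precision for speed.
FAST_NUMBERS = {"xsd:decimal": "double"}
```

With this shape, `host_type("xsd:decimal")` yields the default class, while `host_type("xsd:decimal", overrides=FAST_NUMBERS)` yields the overriding type and leaves all other entries untouched.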
In the previous section, we described defaults of a particular normal form and host language. In many cases it may be desirable to deviate from these defaults. There may be inconsistencies in naming conventions between XML and the host language; certain syntactic constructs in the host language may require extra levels of enclosing elements for clarity.
Note that in the extreme case, every default may be overridden and the normal form ignored.
There are two ways to include information in a schema document: appinfo children of annotation elements, and attributes in namespaces other than the schema namespace. For our purposes, the difference is negligible; we use the attributes.
The two examples we show here are overriding the default type name, and removing an element while promoting its children to the next level of the hierarchy. Here is a fragment of the schema document:
As an example we show how an element can be mapped to a class with a different name. The XML document's top-level element is purchaseOrder. In Java, class names start with a capital letter by convention. In the schema document, the xs:complexType element has a map:name="PurchaseOrder" attribute.
In the sample XML document, item elements are grouped within an items element. This grouping is redundant for an in-memory representation as it can be better expressed by an array. To remove the items element and promote its children to the next higher level of hierarchy, the xs:element element has a map:to="promote" attribute.
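As a rough illustration of how mapping software could pick up these two overrides, the following Python sketch scans a schema for attributes in the map namespace. The schema fragment is a hypothetical reconstruction based only on the description above (the type names `PurchaseOrderType` and `Items` are assumptions), and `collect_overrides` is an illustrative helper, not part of any published tool.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
MAP = "http://www.cogsci.ed.ac.uk/~kari/schema-mapping"

# Hypothetical fragment carrying the two annotations discussed in the text.
SCHEMA = f"""<xs:schema xmlns:xs="{XS}" xmlns:map="{MAP}">
  <xs:element name="purchaseOrder" type="PurchaseOrderType"/>
  <xs:complexType name="PurchaseOrderType" map:name="PurchaseOrder">
    <xs:sequence>
      <xs:element name="items" type="Items" map:to="promote"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

def collect_overrides(schema_root):
    """Return {(component kind, component name): {map attribute: value}}."""
    prefix = "{" + MAP + "}"
    overrides = {}
    for comp in schema_root.iter():
        # Keep only attributes that live in the map namespace.
        found = {key[len(prefix):]: value
                 for key, value in comp.attrib.items()
                 if key.startswith(prefix)}
        if found:
            kind = comp.tag.split("}", 1)[1]  # local name, e.g. "complexType"
            overrides[(kind, comp.get("name"))] = found
    return overrides

overrides = collect_overrides(ET.fromstring(SCHEMA))
```

Components without map attributes are simply absent from the result, so the defaults of the normal form continue to apply to them.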
XML documents that are valid with respect to a schema contain implicit information that is found in the schema rather than in the document itself. Such a document's InfoSet, known as the PSVI [Post Schema Validation InfoSet], combines information from a document instance and the schema associated with it. In the PSVI, every datum from the instance document is associated with type information from the schema definition. Based on defaults dictated by an XML normal form, this information can be used to construct an in-memory representation of the information contained in an instance document.
PSVI may be available as an object in memory that may be queried and based on which an in-memory representation of the document data may be constructed; it may be available as a stream of events in response to which the in-memory representation may be built incrementally. PSVI may also be serialized as an XML document, known as reflected PSVI. In this paper, we concentrate on the reflected PSVI. Naturally, techniques described here are as applicable to other PSVI representations as they are to reflected PSVI.
We use XSLT stylesheets to gather information from the PSVI. The result of these transformations is not a binary data structure but source code in the host language that can build such structures. (An intriguing alternative would be to use XSLT's extension function mechanism to construct native objects in the host environment.)
The form in which the PSVI is available to an application is non-standardized and non-portable; the generated text or binary representation is naturally specific to the target environment. The process of building a run-time representation can therefore be logically separated into two phases: gathering relevant information from the PSVI and building an intermediate representation that is generic enough to be useful in a variety of environments, and then building the final representation in whatever format is required by the host environment.
In this step we build what we call a decorated pseudo-instance. All information pertinent to mapping that was gathered from the PSVI is added to the instance document in the form of attributes in the map namespace added to every element. Otherwise, the document is unmodified; elements are not removed, renamed or moved.
For example, the items element now looks like this:
The decorated instance contains, for every node, a complete set of instructions needed to create a run-time in-memory representation: the name of the host-language type (map:item-name), whether this is a primitive or reference type and whether it should be present in the in-memory representation (map:item-to), and whether this is a scalar or array member (map:maxOccurs).
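The decoration step can be approximated in a few lines of Python. This is only a sketch: the `INFO` table stands in for what the stylesheets gather from the PSVI, and every attribute value in it (including the vocabulary "reference" and "promote" for map:item-to) is an assumption, since the paper's own decorated fragment is not reproduced here.

```python
import xml.etree.ElementTree as ET

MAP = "http://www.cogsci.ed.ac.uk/~kari/schema-mapping"
ET.register_namespace("map", MAP)  # serialize with the "map" prefix

# Hypothetical per-element mapping information (a stand-in for the PSVI).
INFO = {
    "purchaseOrder": {"item-name": "PurchaseOrder", "item-to": "reference", "maxOccurs": "1"},
    "items": {"item-name": "Items", "item-to": "promote", "maxOccurs": "1"},
    "item": {"item-name": "Item", "item-to": "reference", "maxOccurs": "unbounded"},
    "price": {"item-name": "java.math.BigDecimal", "item-to": "reference", "maxOccurs": "1"},
}

def decorate(root, info):
    """Add map-namespace attributes to every element; change nothing else."""
    for elem in root.iter():
        for key, value in info.get(elem.tag, {}).items():
            elem.set("{%s}%s" % (MAP, key), value)
    return root

INSTANCE = "<purchaseOrder><items><item><price>148.95</price></item></items></purchaseOrder>"
decorated = decorate(ET.fromstring(INSTANCE), INFO)
```

As in the text, the element structure is left untouched: the items element is still present after decoration, and it is only marked for promotion.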
Based on the intermediate representation described in the previous section, we can build the final representation in any specific environment. Here, we present two examples: code that initializes a Java object and FOPL [First-Order Predicate Logic] assertions. Both examples use gensyms [generated symbols] to denote individual objects. These gensyms have no special significance other than their uniqueness within the scope of a document or object; in this implementation they are simply the hexadecimal sequential numbers of the corresponding nodes in the source document.
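A gensym scheme of this kind might look like the sketch below. It makes several assumptions that the text does not settle: only element nodes are counted, numbering starts at 1 in document order, and a "g" prefix is added so that the symbols are valid identifiers in most host languages.

```python
import xml.etree.ElementTree as ET

def gensyms(root, prefix="g"):
    """Map each element to a symbol made from its hexadecimal sequence number."""
    return {elem: "%s%x" % (prefix, number)
            for number, elem in enumerate(root.iter(), start=1)}

# Eleven elements in total, so the last one is number 11, i.e. hexadecimal b.
doc = ET.fromstring("<a>" + "<b/>" * 10 + "</a>")
symbols = gensyms(doc)
```

Because the symbols are derived from node positions, re-running the process on the same document always yields the same names, which is convenient when comparing generated output.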
Here, we present a list of assertions that can be deduced from the source document. Individuals are printed in bold; named binary relations between individuals and other individuals or atoms are represented by square brackets. For each atom, its type is indicated.
Note that the type of the top-level individual is not purchaseOrder as the name of the element in the document is, but PurchaseOrder as is mandated by the mapping. Also, note that the items element is gone, its children promoted to the next higher level, as specified by the mapping.
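The way such assertions could be derived from a decorated instance is sketched below in Python. The output notation (plain tuples, with an "isa" relation for type membership) is an assumption and does not reproduce the paper's typography; the decorated fragment, its attribute values, and the rule that a leaf element is an atom are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

MAP = "http://www.cogsci.ed.ac.uk/~kari/schema-mapping"

def m(name):
    """Qualified name of an attribute in the map namespace."""
    return "{%s}%s" % (MAP, name)

# Hypothetical decorated pseudo-instance, heavily abbreviated.
DECORATED = f"""<purchaseOrder xmlns:map="{MAP}" map:item-name="PurchaseOrder">
  <items map:item-name="Items" map:item-to="promote">
    <item map:item-name="Item">
      <price map:item-name="java.math.BigDecimal">148.95</price>
    </item>
  </items>
</purchaseOrder>"""

def assertions(root):
    # Gensyms: hexadecimal sequence numbers of elements in document order.
    number = {elem: n for n, elem in enumerate(root.iter(), start=1)}
    facts = []

    def visit(elem, owner):
        if elem.get(m("item-to")) == "promote":
            for child in elem:          # drop this level, keep its children
                visit(child, owner)
            return
        type_name = elem.get(m("item-name"))
        if len(elem) == 0:              # leaf element: an atom with its type
            facts.append((owner, elem.tag, (elem.text or "").strip(), type_name))
            return
        sym = "g%x" % number[elem]
        facts.append((sym, "isa", type_name))
        if owner is not None:
            facts.append((owner, elem.tag, sym))
        for child in elem:
            visit(child, sym)

    visit(root, None)
    return facts

facts = assertions(ET.fromstring(DECORATED))
```

In the result, the item individual is related directly to the top-level individual of type PurchaseOrder, and no fact mentions items, mirroring the two observations made above.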
Here, we present code that can initialize a Java object based on an XML document. We present the source code, but a run-time system can as easily execute the commands instead of generating them.
This code assumes that all user-defined classes have zero-argument constructors and public access to members. Note that atoms' types are mapped to specific Java classes per the Java mapping discussed in the section on defaults, e.g. price's xsd:decimal is mapped to java.math.BigDecimal. Also note that our example has no types that, by default, map to primitive Java types. If there were any, those atoms would naturally have been initialized as primitive variables in Java, through literal assignment as opposed to new object construction syntax.
For interpreted languages, the difference is immaterial.
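To make the generation step concrete, here is a small Python sketch that emits Java initialization statements from a decorated instance. It relies on the same assumptions stated above (zero-argument constructors and public members) and adds several of its own: leaf elements are atoms whose Java class has a constructor taking a string, arrays and promotion are not handled, and the decorated fragment and its type names are hypothetical.

```python
import xml.etree.ElementTree as ET

MAP = "http://www.cogsci.ed.ac.uk/~kari/schema-mapping"

def m(name):
    """Qualified name of an attribute in the map namespace."""
    return "{%s}%s" % (MAP, name)

# Hypothetical decorated pseudo-instance, heavily abbreviated.
DECORATED = f"""<purchaseOrder xmlns:map="{MAP}" map:item-name="PurchaseOrder">
  <comment map:item-name="java.lang.String">Hurry, my lawn is going wild</comment>
  <billTo map:item-name="USAddress">
    <zip map:item-name="java.math.BigDecimal">95819</zip>
  </billTo>
</purchaseOrder>"""

def java_statements(root):
    # Gensyms double as Java variable names.
    number = {elem: n for n, elem in enumerate(root.iter(), start=1)}
    lines = []

    def visit(elem, owner):
        java_type = elem.get(m("item-name"))
        var = "g%x" % number[elem]
        if len(elem) == 0:
            # Atom: construct from the element text, escaped as a Java string literal.
            text = (elem.text or "").strip().replace("\\", "\\\\").replace('"', '\\"')
            lines.append('%s %s = new %s("%s");' % (java_type, var, java_type, text))
        else:
            # Individual: rely on the zero-argument constructor.
            lines.append("%s %s = new %s();" % (java_type, var, java_type))
        if owner is not None:
            # Public member access, named after the element.
            lines.append("%s.%s = %s;" % (owner, elem.tag, var))
        for child in elem:
            visit(child, var)

    visit(root, None)
    return lines

lines = java_statements(ET.fromstring(DECORATED))
```

A run-time system could replace the `lines.append` calls with reflective object construction and field assignment, which is the "execute instead of generate" option mentioned above.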
[Codd (1970)] Codd, E. F. 1970. "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM 13(6), June 1970.
[Thompson (2001)] Thompson, H. 2001. "Normal Form Conventions for XML Representations of Structured Data", to appear.