<?xml version='1.0'?>
<?xml-stylesheet type='text/xsl' href='http://www.ltg.ed.ac.uk/~ht/doc.xsl'?>
<!DOCTYPE doc SYSTEM "http://www.ltg.ed.ac.uk/~ht/doc.dtd" >
<doc>
 <head>
  <title>Adding simple type definitions by union to XML Schema</title>
  <author>Paul V. Biron</author>
  <author>Allen Brown</author>
  <author>Martin Gudgin</author>
  <author>Ashok Malhotra</author>
  <author>Henry S. Thompson</author>
  <date>26 June 2000</date>
 </head>
 <body>
  <div>
   <title>Introduction and Background</title>
   <p>The issue of union types, often discussed in the past by this group, was
raised again as a Last Call issue (LC-2: <link href="http://www.w3.org/XML/Group/xmlschema-current/lcissues.html#conjunction-types">Conjunction types?</link> and others).  In its simplest form, the requirement at the core of this and other comments is for a simple type definition which disjunctively combines two or more other simple type definitions, with a result whose lexical and value spaces are the unions of those of its input types.  A simple and indeed prototypical example is the <code>maxOccurs</code> attribute on the <code>element</code> element in XML Schema itself:  it should be constrained as a union of <code>non-negative-integer</code> and an enumeration with the single member <name>unbounded</name>, but the current mechanisms for defining simple types do not allow for this.</p>
   <p>In response to this issue, the above-named authors examined a number of
approaches and present the following design for consideration by the group. 
We believe it is simple both to explain and to implement, and will address the
stated requirements well.  The opportunity to reconsider certain aspects of the
existing design has allowed for some overall simplification and cleanup, as
well, with an increase in parallelism at the level of XML representation with
the definition of complex types.</p>
  </div>
  <div>
   <title>Changes to the simple type definition schema component</title>
   <p>From the current unitary form of simple type definitions, with six
properties: <name>name</name>, <name>base type definition</name>,
<name>facets</name>, <name>fundamental facets</name>, <name>variety</name> and
<name>target namespace</name>, we move to a core+variants structure, as follows:</p>
   <list type="defn">
    <item term="core">Properties <name>name</name> (optional), <name>target
namespace</name>, <name>base type definition</name> and
<name>variety</name>, with possible values for the latter of
<code>atomic</code>, <code>list</code> or <code>union</code>, which in turn
determine which of the subsequent variants are filled in:
     <list type="defn">
      <item term="atomic"><name>primitive type definition</name>, <name>facets</name> and <name>fundamental facets</name></item>
      <item term="list"><name>item type definition</name> (an
<emph>atomic</emph> or <emph>union</emph> simple type definition) and
<name>facets</name> (limited to <code>enumeration</code>, <code>length</code>, <code>min/maxLength</code> and <code>pattern</code>)</item>
      <item term="union"><name>constituent type definitions</name>
(<emph>atomic</emph> or <emph>list</emph> simple type definitions) and
<name>facets</name> (limited to <code>enumeration</code> and <code>pattern</code>)</item>
     </list>
    </item>
   </list>
   <p>The semantics are straightforward:  a string is schema-valid per a union
type definition iff it satisfies the specified facets and is valid per at least
one of the constituent type definitions.  Constituents are tried in the order
the schema author gave them, and processors must not check beyond the
first constituent which the string satisfies.  The type definition
outcome property in the PSV infoset reports both the union type definition and
the constituent type definition which matched.</p>
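   <p>As a hedged illustration of these semantics, written in the XML representation proposed in the next section, consider a union whose constituents have overlapping lexical spaces.  The type name <code>countOrWord</code> is hypothetical, and the sketch assumes that the document order of the embedded definitions is the constituent order:</p>
   <display><code><![CDATA[<simpleType name='countOrWord'>
 <union>
  <!-- first constituent: positive integers up to 10 -->
  <simpleType>
   <restriction base='positive-integer'>
    <maxInclusive value='10'/>
   </restriction>
  </simpleType>
  <!-- second constituent: two literal strings -->
  <simpleType>
   <restriction base='string'>
    <enumeration value='1'/>
    <enumeration value='one'/>
   </restriction>
  </simpleType>
 </union>
</simpleType>
]]></code></display>
   <p>Under that assumption the string <code>1</code> satisfies the first constituent, so the enumeration is never consulted and the first constituent is the one reported alongside the union.  The string <code>one</code> fails the first constituent and is valid per the second.</p>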
   <note>We considered a range of other strategies and constraints,
particularly on whether to allow overlapping lexical spaces or not, and in the
end concluded that attempting to rule out overlapping was a bad idea, as it
would rule out e.g. float and double as members of a union.  We also considered
allowing nested spaces to be disambiguated in favour of the most specific, but
in the end concluded that given that user-order would <emph>have</emph> to play
a role in the case of non-nested overlap, it was better to use it for
everything.  A note will be needed that having e.g. <code>double</code> before
<code>float</code> will be pointless given our choice.</note>
   <note>The constraint on the constituents of union types to be atomic or list does
not rule out unions of unions at the XML representation level; it simply
requires them to be unfolded at schema construction time.</note>
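   <p>A minimal sketch of such unfolding, using the XML representation proposed in the next section.  The names <code>inner</code> and <code>outer</code> are hypothetical, and the assumption that unfolding substitutes the nested constituents in place, preserving order, is ours rather than something fixed above:</p>
   <display><code><![CDATA[<!-- As written: 'outer' refers to another union -->
<simpleType name='inner'>
 <union types='integer boolean'/>
</simpleType>

<simpleType name='outer'>
 <union types='inner NMTOKEN'/>
</simpleType>

<!-- Assumed constituent type definitions of 'outer' after unfolding:
     integer, boolean, NMTOKEN -->
]]></code></display>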
   <note>We also considered making <code>binary</code> a fourth possible value
of <name>variety</name>, to reflect both its parameterisation by an encoding type
and its (very) limited inventory of allowed facets, but didn't reach consensus
on this point.</note>
  </div>
  <div>
   <title>Changes to the XML representation of simple type definitions</title>
   <p>We thought we would take this opportunity to move
<code>&lt;simpleType></code> more in line with the (itself newly simplified)
form of <code>&lt;complexType></code>.  Accordingly the three basic ways of
producing new simple type definitions from old all have a common shape:  a
<code>simpleType</code> element with an optional name and a choice between
<code>restriction</code>, <code>list</code> or <code>union</code> as the
single required child (after optional <code>annotation</code>).</p>
   <p>The <code>restriction</code> option has either a <code>base</code>
attribute (a QName) or a <code>simpleType</code> child, and allows the facets
appropriate to that base type as children.  Also, if the base is a list, then a
<code>list</code> child whose <code>type</code> restricts the base's <code>type</code>
is alternatively allowed.  Similarly, if the base is a <code>union</code>, a <code>union</code> child whose <code>types</code>
restrict the base's <code>types</code> is a possibility.</p>
   <p>The <code>list</code> option has either a <code>type</code> attribute (a
QName) or a <code>simpleType</code> child.</p>
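   <p>As a sketch of the <code>list</code> option with a <code>type</code> attribute, followed by a restriction of the resulting named list, something like the following should be expressible.  The names <code>integerList</code> and <code>threeIntegers</code> are hypothetical:</p>
   <display><code><![CDATA[<simpleType name='integerList'>
 <list type='integer'/>
</simpleType>

<simpleType name='threeIntegers'>
 <restriction base='integerList'>
  <length value='3'/>
 </restriction>
</simpleType>
]]></code></display>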
   <p>The <code>union</code> option has a <code>types</code> attribute (a list of
QNames) and any number of <code>simpleType</code> children.</p>
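   <p>The <code>maxOccurs</code> case from the introduction could then plausibly be written by combining a QName reference with one embedded definition.  The type name <code>occurrenceLimit</code> is hypothetical, as is the choice of <code>NMTOKEN</code> as the base for the enumeration:</p>
   <display><code><![CDATA[<simpleType name='occurrenceLimit'>
 <union types='non-negative-integer'>
  <simpleType>
   <restriction base='NMTOKEN'>
    <enumeration value='unbounded'/>
   </restriction>
  </simpleType>
 </union>
</simpleType>
]]></code></display>
   <p>The lexical spaces of the two constituents do not overlap here, so the order in which they are tried makes no difference to the outcome.</p>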
   <p>This design tightens the content models and matches them better to their
use, without completely eliminating semantic dependencies.  So
although we can now do much better at allowing only the appropriate facets for lists, the
allowed facets for the restriction case are still a function of the base type, which cannot be expressed
in the schema for schema documents.</p>
   <p>An example taken from XHTML (I gather) of an attribute definition would be as follows on this account:</p>
   <display><code><![CDATA[<xs:attribute name="size">
 <xs:simpleType>
  <xs:union>
   <xs:simpleType>
    <xs:restriction base="xs:positive-integer">
     <xs:maxInclusive value="10"/>
    </xs:restriction>
   </xs:simpleType>
   <xs:simpleType>
    <xs:restriction base="xs:NMTOKEN">
     <xs:enumeration value="small"/>
     <xs:enumeration value="medium"/>
     <xs:enumeration value="large"/>
    </xs:restriction>
   </xs:simpleType>
  </xs:union>
 </xs:simpleType>
</xs:attribute>
]]></code></display>
   <p>This example uses only embedded anonymous simple types, but
a list of QName-references can be used for named constituents, or the two
combined as required.  The parallelism of the cases means that you are never
<emph>forced</emph> to name a type definition just in order to use it as part
of another definition, so for instance to constrain the length of a list of
constrained strings
without exposing the string type itself, the following will work:</p>
   <display><code><![CDATA[<simpleType name='fourTuple'>
 <restriction>
  <simpleType>
   <list>
    <simpleType>
     <restriction base='string'>
      <enumeration value='1'/>
      <enumeration value='one'/>
     </restriction>
    </simpleType>
   </list>
  </simpleType>
  <length value='4'/>
 </restriction>
</simpleType>
]]></code></display>
  </div>
  <div>
   <title>Changes to XML representation of complex type definitions</title>
   <p>We strongly recommend that a further change to the content model of
<code>&lt;complexType></code> should be made to bring the two completely into
line:  eliminate the <code>derivedBy</code> attribute here as well, in favour
of a required child, either <code>&lt;restriction></code> or
<code>&lt;extension></code>, with a choice between <code>base</code> attribute
or nested type definition, as above.  Only <code>&lt;restriction></code> would
be allowed to have neither <code>base</code> nor a nested type definition,
in which case the actual base would default to the appropriate flavour of
urType, as it does now.</p>
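   <p>A rough sketch of what this might look like for an extension.  The type names <code>address</code> and <code>internationalAddress</code> are hypothetical, and the placement of the content model inside the new child element is an assumption, not something settled above:</p>
   <display><code><![CDATA[<complexType name='internationalAddress'>
 <!-- replaces derivedBy='extension' on the complexType element -->
 <extension base='address'>
  <sequence>
   <element name='country' type='string'/>
  </sequence>
 </extension>
</complexType>
]]></code></display>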
  </div>
  <div>
   <title>A note on 'cost'</title>
   <p>This does represent a backward-incompatible change:  existing schema
documents will become invalid.  To reduce the practical impact of this, Martin
Gudgin has produced XSLT stylesheets which do the necessary forward
conversions, and we'll make these available if these changes are agreed.</p>
  </div>
 </body>
</doc>

