NITE XML Toolkit - Metadata Detail

This page describes in detail each component that goes into a NITE metadata file. See this page for a more general overview and example metadata files. It will be useful to have an example metadata file as well as a copy of the metadata DTD file handy while reading this guide.

Preliminaries

Attribute definitions

A number of sections of the metadata (<code>, <ontology>, <object-set>) rely on the same basic mechanism for defining attributes. Attributes can have one of three types: string, meaning free text; number, where any kind of numeric value is permitted; or enumerated, where only values listed in the enclosed value elements are permitted. They are defined using an <attribute> tag whose name attribute gives the name of the attribute and whose value-type attribute gives the type. For enumerated attributes, the declaration must also list the permitted values within <value> tags. For instance,

           <attribute name="gender" value-type="enumerated">
               <value>male</value>
               <value>female</value>
           </attribute>

defines an attribute named gender that can take two possible values, male and female.
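As a rough illustration of the three attribute types, the following Python sketch checks a candidate value against an <attribute> declaration. It is not part of NXT: the function name is_valid_value is hypothetical, and treating "number" as anything Python's float() accepts is an assumption.

```python
import xml.etree.ElementTree as ET

# The declaration from the example above.
ATTRIBUTE_XML = """
<attribute name="gender" value-type="enumerated">
    <value>male</value>
    <value>female</value>
</attribute>
"""


def is_valid_value(attribute_elem, value):
    """Return True if `value` is allowed by an <attribute> declaration."""
    value_type = attribute_elem.get("value-type")
    if value_type == "string":
        return True  # free text: any value is accepted
    if value_type == "number":
        # Assumption: "any kind of numeric value" means float() can parse it.
        try:
            float(value)
        except ValueError:
            return False
        return True
    if value_type == "enumerated":
        allowed = {v.text.strip() for v in attribute_elem.findall("value") if v.text}
        return value in allowed
    raise ValueError(f"unknown value-type: {value_type!r}")


gender = ET.fromstring(ATTRIBUTE_XML)
print(is_valid_value(gender, "female"))   # True
print(is_valid_value(gender, "unknown"))  # False
```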

The NITE namespace

NXT's design includes the assumption that it will be useful to namespace parts of the XML data, and some of the default element and attribute names use a specific NITE namespace (e.g., <nite:root>, <nite:start>, <nite:end>, <nite:id>). This has the usual advantages described for namespacing.

Metadata contents

Metadata files consist of a description of:

Top-level corpus description (optional)
Reserved attributes (optional)
Reserved elements (optional)
CVS Details (optional)
Independent variables taken for each observation (optional)
Agents (optional)
Signals (optional)
Corpus Resources (optional)
Ontologies (optional)
Object Sets (optional)
Codings
Styles (optional)
Views (optional)
Callable Programs (optional)
Observations (optional)

Top-level corpus description

The root element of a metadata file is corpus and here's an example of what it looks like:

 <corpus description="Map Task Corpus" id="maptask" 
     links="ltxml1" type="standoff">
   ...
 </corpus>

The important attributes of the corpus element are links and type. The type attribute should have the value standoff. The previous use of simple corpora is deprecated. The links attribute defines the syntax of the standoff links between the files. It can be one of: ltxml1 or xpointer. The former looks like this:

   <nite:child href="q4nc4.g.timed-units.xml#id('q4nc4g.1')"/>

The latter looks like this:

   <nite:child xlink:href="o1.words.xml#xpointer(id('w_1'))" 
   xlink:type="simple"/>
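To make the difference between the two link syntaxes concrete, here is a small Python sketch that splits an href of either kind into a file name and an element id. The function name split_link and the regular expressions are illustrative assumptions that cover only the two shapes shown above; they are not NXT's own parser.

```python
import re

# Patterns for the two href shapes shown above; quotes around the id are optional.
LTXML1 = re.compile(r"^(?P<file>[^#]*)#id\('?(?P<id>[^')]+)'?\)$")
XPOINTER = re.compile(r"^(?P<file>[^#]*)#xpointer\(id\('?(?P<id>[^')]+)'?\)\)$")


def split_link(href, links):
    """Split a standoff link into (file name, element id).

    `links` is the value of the corpus element's links attribute.
    """
    pattern = {"ltxml1": LTXML1, "xpointer": XPOINTER}[links]
    match = pattern.match(href)
    if match is None:
        raise ValueError(f"not a {links} link: {href!r}")
    return match.group("file"), match.group("id")


print(split_link("q4nc4.g.timed-units.xml#id('q4nc4g.1')", "ltxml1"))
print(split_link("o1.words.xml#xpointer(id('w_1'))", "xpointer"))
```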

Reserved Attributes

The reserved attributes section of the metadata file describes the names of those attributes in the NITE corpus that we consider to be privileged in some manner. Example of setting reserved attributes:

    <reserved-attributes>
        <stream name="stream"/>
        <identifier name="identifier"/>
        <starttime name="starttime"/>
        <endtime name="endtime"/>
        <agentname name="who"/>
        <observationname name="obs"/>
        <commentname name="mycomment"/>
        <keystroke name="mykey"/>
    </reserved-attributes>

If no reserved-attributes entry appears in the metadata file, or the specific attribute is not overridden, the values will default (see table). The name values are expected to be namespace-qualified.

Reserved attribute	Metadata tag name	Default value
Root / stream element name	stream	nite:root
Element identifier	identifier	nite:id
Element start time	starttime	nite:start
Element end time	endtime	nite:end
Agent	agentname	agent
Observation	observationname	-
Comment	commentname	comment
Key Stroke	keystroke	keystroke
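The override-or-default behaviour in the table can be sketched in Python as follows. This is only an illustration: the function name reserved_attribute_names is hypothetical, and reading the "-" for the observation attribute as "no default" is an assumption.

```python
import xml.etree.ElementTree as ET

# Defaults taken from the table above; None stands for the "-" entry (assumed: no default).
DEFAULTS = {
    "stream": "nite:root",
    "identifier": "nite:id",
    "starttime": "nite:start",
    "endtime": "nite:end",
    "agentname": "agent",
    "observationname": None,
    "commentname": "comment",
    "keystroke": "keystroke",
}


def reserved_attribute_names(corpus_elem):
    """Return the effective reserved attribute names for a <corpus> element."""
    names = dict(DEFAULTS)
    section = corpus_elem.find("reserved-attributes")
    if section is not None:
        for child in section:
            # Only known tags with a non-empty name override the default.
            if child.tag in names and child.get("name"):
                names[child.tag] = child.get("name")
    return names


corpus = ET.fromstring("""
<corpus id="maptask" links="ltxml1" type="standoff">
    <reserved-attributes>
        <identifier name="identifier"/>
        <agentname name="who"/>
    </reserved-attributes>
</corpus>
""")
names = reserved_attribute_names(corpus)
print(names["identifier"], names["agentname"], names["starttime"])
```

Here identifier and agentname are overridden, while starttime falls back to nite:start.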

The name of the root (or stream) element is the expected name of the element at the root of every XML file in the corpus other than ontology files. These elements will essentially be invisible through the API, and are only required for serialization.

Identifiers are required on all elements in a NITE corpus. Start and end times may appear on time-aligned elements. A time aligned element in the corpus with the above description might look like this:

 <word identifier="word_1" starttime="1.3" endtime="1.5">the</word>

The attributes describing an agent or an observation are a special case. Normally, the agent and observation associated with an element won't be named in the data explicitly as an attribute value (although they might be, at least for some elements) but can be derived from the metadata and filename from which the element came. Think of the number of words in the usual corpus - representing agent and observation for each one increases the data set size considerably. The agentname and observationname declarations expose the agent and observation as attributes on every element that doesn't already have an attribute of the given name, making them accessible from the query language. Explicit agent/observation names therefore aren't very useful unless the overhead of having the extra attributes on every element is acceptable.

All elements and pointers can have an associated comment whose default attribute name is 'comment'. You can change the reserved attribute name using commentname.

Builds after 09/03/05 - Any element can have an associated keystroke. This is normally used to represent keyboard shortcuts for elements in an ontology, though it can be used for other purposes. The value is simply a string, and what application programs do with the string (if anything) is up to them.

Reserved Elements

The reserved elements section of the metadata file describes the names of those elements in the NITE corpus that we consider to be privileged in some manner. Example of setting reserved element names:

    <reserved-elements>
        <pointername name="mypointer"/>
        <child name="mynamespace:child"/>
        <stream name="stream"/>
    </reserved-elements>

If no reserved-elements entry appears in the metadata file, or the specific element is not overridden, the values will default (see table). The name values are expected to be namespace-qualified.

Reserved element	Metadata tag name	Default value
Pointer	pointername	nite:pointer
Child (pointing to remote child)	child	nite:child
Stream element	stream	nite:root

A stream of word elements may look like this with the above example:

  <stream>
    <word nite:id="word_1">
      <mypointer role="antecedent" href="doc2.xml#ante_2"/>
      <mynamespace:child href="doc2#syllable_1"/>
    </word>
  </stream>

Pointers and children will have an unqualified href attribute that specifies the pointed-to element unless XLink links are being used in which case an xlink:href attribute will be assumed to be used. This attribute name is not changeable. More information on pointers and children below.

CVS Details

You may associate your corpus with a particular CVS repository. If the tools you use are CVS aware, this data can be used to check out and check in data directly to CVS. We are currently (Apr 05) adding functionality to allow NXT GUIs to work directly from a CVS repository.

    <cvsinfo protocol="pserver" server="cvs.inf.ed.ac.uk" module="/disk/cvs/ami"/>

The CVS repository is described by three required attributes: protocol (one of pserver, ext, local, sspi); server (the machine name on which the CVS server is hosted); module (the top level directory where the corpus is found). More information on CVS here.

Independent Variables on Observations

A corpus is a set of observations all of which conform to the same basic format. NXT has allowed a corpus to declare independent variables which can be used to divide the corpus into subsets. Use of this facility is now deprecated, but it is described here, along with how to do the same thing now.

Agents

A corpus is a set of observations all of which conform to the same basic format and have the same number of agents being observed, with the same basic roles. An agent is one role in this structure; for each observation, the agent role will be filled by some individual, someone human or artificial. The following table shows how to fit some well-known corpus types into this agent categorization:

Corpus	Agents
Map task corpus:	giver,follower
Smartkom:	system,user
Wall Street Journal articles:	writer
Five person discussion:	1,2,3,4,5

For group discussion corpora of mixed size, the user must define agents for the maximum group size and leave some of them unused in the observations with fewer people.

Here's a sample agent description (as used for the MapTask corpus):

    <agents>
        <agent name="g" description="giver"/>
        <agent name="f" description="follower"/>
    </agents>

The name attribute must be a string with no spaces as it is used to derive filenames.

At present, the metadata does not give a way of specifying personal information about the individuals that fill the agent roles within individual observations.

Signals

Each observation in a corpus will have been recorded separately using some signal or set of signals. Signals can either be for a single agent (like a video trained exclusively on the route giver), or of the interaction as a whole (like an overhead video that captures the whole group, or at least part of it). All signals for the same observation are assumed to start at the same time. This can be achieved through pre-editing. Note that because there could be several video signals associated with the same corpus, any GVM (video overlay markup) needs to know which signal it applies to.

Signal specification in the metadata file will tell NITE what signals are present, and where they reside on disk. Here's an example of some signal definitions:

    <signals path="../signals/">
        <agent-signals>
            <signal extension="au" format="mono au" name="audio" type="audio"/>
        </agent-signals>
        <interaction-signals>
            <signal extension="avi" format="stereo avi"
                name="interaction-video" type="video"/>
        </interaction-signals>
    </signals>

The path attribute on the signals element specifies where the media files can be located on disk. If the path is a relative pathname, it is relative to the metadata file. Signals are divided into agent-signals and interaction-signals as discussed above. The name attribute of the signal is used in filenames so must not include any spaces.

In this example, imagining there is an observation named o1 and agents g and f, we would expect to find the media files:

../signals/o1.g.audio.au
../signals/o1.f.audio.au
../signals/o1.interaction-video.avi

From 25th May 2006 - the signals element can now take a pathmodifier attribute which is a string where any instance of the string observation will be replaced with the observation name. For the example above, if the signals element had attribute pathmodifier="observation", we'd look for signals in subdirectories named by observation e.g.:

../signals/o1/o1.interaction-video.avi

Note that the file names we look for are still the same.

Similarly, there can be pathmodifier attributes on any of the individual signal elements (with replacement of the string observation as above) which can further customize the location of signals even within a single observation. For example, if the second signal had attribute pathmodifier="video", and assuming the above pathmodifier is still present, that signal's path would then be:

../signals/o1/video/o1.interaction-video.avi

This builds up to provide fairly flexible placement of signal files. File naming is strict however.
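The naming and placement rules above can be summarised in a short Python sketch. The function signal_path and its parameter names are hypothetical, and the way the two pathmodifier values are joined is inferred from the examples above rather than taken from the NXT source.

```python
import posixpath


def signal_path(base, observation, name, extension, agent=None,
                signals_modifier="", signal_modifier=""):
    """Build the expected location of a signal file.

    base             -- path attribute of the <signals> element
    signals_modifier -- pathmodifier on <signals> (may be empty)
    signal_modifier  -- pathmodifier on the individual <signal> (may be empty)
    """
    # File name: observation[.agent].name.extension (strict, never modified).
    parts = [observation]
    if agent is not None:
        parts.append(agent)
    parts += [name, extension]
    filename = ".".join(parts)

    # Each non-empty modifier adds a directory level; the literal string
    # "observation" is replaced by the observation name.
    dirs = [m.replace("observation", observation)
            for m in (signals_modifier, signal_modifier) if m]
    return posixpath.join(base, *dirs, filename)


print(signal_path("../signals/", "o1", "audio", "au", agent="g"))
print(signal_path("../signals/", "o1", "interaction-video", "avi",
                  signals_modifier="observation", signal_modifier="video"))
```

The two calls reproduce ../signals/o1.g.audio.au and ../signals/o1/video/o1.interaction-video.avi from the examples above.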

Corpus Resources

A corpus resource is a set of elements that are globally relevant in some way to an entire corpus. They are not as strictly specified as ontologies or object sets (below). They will probably eventually replace the use of those things. Typically these will be files that come from the original application and can be used almost without alteration. You may specify the exact hierarchical breakdown of such a file, but typically there will just be one recursive layer (pointing to itself) that specifies all the codes permissible. Here is an example where the resource describes participants in a meeting corpus:

    <corpus-resources path=".">
        <corpus-resource-file name="speakers" description="anonymised details of meeting contributors">
        <structural-layer name="speaker-layer" recursive-draws-children-from="speaker-layer">
         <code name="speaker">
           <attribute name="id" value-type="string"/>
           <attribute name="gender" value-type="enumerated">
               <value>male</value>
               <value>female</value>
           </attribute>
         </code>
         <code name="language">
           <attribute name="name" value-type="string"/>
           <attribute name="region" value-type="string"/>
         </code>
         <code name="age" text-content="true"/>
        </structural-layer>

       </corpus-resource-file>
    </corpus-resources>

The path attribute on the corpus-resources element tells NITE where to look for resources for this corpus. Each corpus resource has a name attribute which is unique in the metadata file. The path combined with the name attribute of an individual resource gives the filename. The name attribute can also be used to refer to this corpus resource from a coding layer (see below).

The contents of an individual corpus resource are defined in exactly the same manner as layers within codings (below).

Ontologies

An ontology is a tree of elements that makes use of the parent/child structure to specify specializations of a data type. In the tree, the root is an element naming some simple data type that is used by some annotations. In an ontology, if one type is a child of another, that means that the former is a specialization of the latter. We have defined ontologies to make it simpler to assign a basic type to an annotation in the first instance, later refining the type. Here's an example of an ontology definition:

    <ontologies path="../xml/MockCorpus">
        <ontology description="gesture ontology" name="gtypes"  
	  element-name="gtype" attribute-name="type"/>
    </ontologies>

The path attribute on the ontologies element tells NITE where to look for ontologies for this corpus. An ontology has a name attribute which is unique in the metadata file and is used so that the ontology can be pointed into (e.g. by a coding layer - see below). It also has an attribute element-name: an ontology is a hierarchy of elements that share a single element name, and this attribute defines that name. Thirdly, there is an attribute attribute-name. This names the privileged attribute on the elements in the ontology: the attribute that holds the type names.

The above definition in the metadata could lead to these contents of the file gtypes.xml - a simple gesture-type hierarchy.

<gtype nite:id="g_1" type="gesture" xmlns:nite="http://nite.sourceforge.net/">
   <gtype nite:id="g_2" type="discursive">
      <gtype nite:id="g_3" type="baton-like"/>
      <gtype nite:id="g_4" type="ideographic"/>
   </gtype>
   <gtype nite:id="g_5" type="topographic">
      <gtype nite:id="g_6" type="deictic"/>
      <gtype nite:id="g_7" type="physiographic">
         <gtype nite:id="g_8" type="iconographic"/>
         <gtype nite:id="g_9" type="kinetographic"/>
      </gtype>
   </gtype>
</gtype>
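Because a child type is a specialization of its parent, the chain of ancestors of any type can be read directly off the tree. The Python sketch below does this for the gesture ontology above; the helper names type_paths and is_specialization are hypothetical and are not part of the NXT API.

```python
import xml.etree.ElementTree as ET

# The contents of gtypes.xml from the example above.
ONTOLOGY_XML = """
<gtype nite:id="g_1" type="gesture" xmlns:nite="http://nite.sourceforge.net/">
   <gtype nite:id="g_2" type="discursive">
      <gtype nite:id="g_3" type="baton-like"/>
      <gtype nite:id="g_4" type="ideographic"/>
   </gtype>
   <gtype nite:id="g_5" type="topographic">
      <gtype nite:id="g_6" type="deictic"/>
      <gtype nite:id="g_7" type="physiographic">
         <gtype nite:id="g_8" type="iconographic"/>
         <gtype nite:id="g_9" type="kinetographic"/>
      </gtype>
   </gtype>
</gtype>
"""


def type_paths(root, attribute_name="type"):
    """Map each type name to its chain of types, from the root down to itself."""
    paths = {}

    def walk(elem, prefix):
        chain = prefix + [elem.get(attribute_name)]
        paths[chain[-1]] = chain
        for child in elem:
            walk(child, chain)

    walk(root, [])
    return paths


def is_specialization(paths, narrower, broader):
    """True if `narrower` lies strictly below `broader` in the ontology."""
    return broader in paths[narrower][:-1]


paths = type_paths(ET.fromstring(ONTOLOGY_XML))
print(paths["iconographic"])
print(is_specialization(paths, "deictic", "topographic"))
```

The attribute_name argument corresponds to the attribute-name setting in the ontology declaration.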

An ontology can use any number of additional, un-privileged attributes, as long as they are declared in the metadata for the ontology using an <attribute> tag. For example, to extend the ontology above with a new attribute, foo, with possible values bar and baz, the declaration would be as follows:

   <ontology description="gesture ontology" name="gtypes"  
        element-name="gtype" attribute-name="type">
      <attribute name="foo" value-type="enumerated">
            <value>bar</value>
            <value>baz</value>
      </attribute>
   </ontology>

Object Sets

An object is an element that represents something in the universe to which an annotation might wish to point. An object might be used, for instance, to represent the referent of a referring expression or the lexical entry corresponding to a word token spoken by one of the agents. When an element is used to represent an object, it will have a data type and may have features, but no timing or children. An object set is a set of objects of the same or related data types. Object sets have no inherent order. Here is a possible definition of an object set - imagine we want to collect a set of things that are referred to in a corpus like telephone numbers and town names:

  <object-sets path="/home/jonathan/objects/">
   <object-set-file name="real-world-entities" description="">
    <code name="telephone-number">
      <attribute name="number" value-type="string"/>
    </code>
    <code name="town">
      <attribute name="name" value-type="string"/>
    </code>
   </object-set-file>
  </object-sets>

The path attribute on the object-sets element tells NITE where to look for object sets on disk for this corpus. Combined with the name attribute of an individual object set we get the filename. The name attribute is also used to refer to this object set from a coding layer (see below).

The code elements describe the element names that can appear in the object set, and each of these can have an arbitrary number of attributes. The above spec describes an object set in file /home/jonathan/objects/real-world-entities.xml which could contain:

  <nite:root nite:id="root_1">
    <town nite:id="town3" name="Durham"/>
    <telephone-number nite:id="num1" number="0141 651 71023"/>
    <town nite:id="town4" name="Edinburgh"/>
    <town nite:id="town1" name="Oslo"/>
  </nite:root>

where the contents are unordered and can occur any number of times.
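One way to see how the code declarations constrain an object set file is to check a file against them. The Python sketch below is a hedged illustration only: the function undeclared is hypothetical, the <street> element is deliberately invented to trigger a report, and the xmlns:nite declaration is added (using the namespace URI from the ontology example) so that a standard XML parser accepts the data.

```python
import xml.etree.ElementTree as ET

NITE_NS = "http://nite.sourceforge.net/"

# The declaration from the metadata example above.
METADATA = """
<object-set-file name="real-world-entities" description="">
  <code name="telephone-number">
    <attribute name="number" value-type="string"/>
  </code>
  <code name="town">
    <attribute name="name" value-type="string"/>
  </code>
</object-set-file>
"""

# Sample data; <street> is an invented, undeclared element.
DATA = f"""
<nite:root nite:id="root_1" xmlns:nite="{NITE_NS}">
  <town nite:id="town3" name="Durham"/>
  <telephone-number nite:id="num1" number="0141 651 71023"/>
  <street nite:id="st1" name="Princes Street"/>
</nite:root>
"""


def undeclared(metadata_elem, data_root):
    """List elements and attributes in the data that the metadata does not declare."""
    declared = {
        code.get("name"): {a.get("name") for a in code.findall("attribute")}
        for code in metadata_elem.findall("code")
    }
    problems = []
    for obj in data_root:
        if obj.tag not in declared:
            problems.append(f"undeclared element <{obj.tag}>")
            continue
        for attr in obj.attrib:
            if attr.startswith("{"):  # namespaced attribute such as nite:id
                continue
            if attr not in declared[obj.tag]:
                problems.append(f"undeclared attribute {attr!r} on <{obj.tag}>")
    return problems


problems = undeclared(ET.fromstring(METADATA), ET.fromstring(DATA))
print(problems)  # ['undeclared element <street>']
```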

Codings and Layers

Here we define the annotations we can make on the data in the corpus. Annotations are specified using codings and layers, and we start with an example.

    <codings path="/home/jonathan/MockCorpus">
        <interaction-codings>
            <coding-file name="prosody" path="/home/jonathan/MockCorpus/prosody">
                <structural-layer name="prosody-layer" draws-children-from="words-layer">
                    <code name="accent">
                        <attribute name="tobi" value-type="string"/>
                    </code>
                </structural-layer>
            </coding-file>
            <coding-file name="words">
                <time-aligned-layer name="words-layer">
                    <code name="word" text-content="true">
                        <attribute name="orth" value-type="string"/>
                        <attribute name="pos" value-type="enumerated">
                            <value>CC</value>
                            <value>CD</value>
                            <value>DT</value>
                        </attribute>
                        <pointer number="1" role="ANTECEDENT" target="phrase-layer"/>
                    </code>
                </time-aligned-layer>
            </coding-file>
        </interaction-codings>
    </codings>

First of all, the codings element has a path attribute which (as usual) specifies the directory in which codings will be loaded from and saved to by default. Note that any coding-file can override this default by specifying its own path attribute (from release 1.3.0 on). Codings are divided into agent-codings and interaction-codings in exactly the way that signals are (we show only interaction codings here). Each coding file will represent one entity on disk per observation (and per agent in the case of agent codings).
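The following Python sketch shows how a coding file location might be derived from these settings. It is an assumption-laden illustration: the function coding_file is hypothetical, the observation[.agent].coding.xml naming pattern is inferred from file names that appear in this document (for example q4nc4.g.timed-units.xml), and a coding-file path is treated as simply replacing the default directory.

```python
import posixpath


def coding_file(codings_path, observation, coding_name, agent=None, coding_path=None):
    """Build the expected location of a coding file.

    codings_path -- path attribute of the <codings> element (the default)
    coding_path  -- path attribute of the <coding-file>, if it overrides the default
    agent        -- agent name for agent codings, None for interaction codings
    """
    directory = coding_path if coding_path is not None else codings_path
    parts = [observation]
    if agent is not None:
        parts.append(agent)
    parts += [coding_name, "xml"]
    return posixpath.join(directory, ".".join(parts))


# Interaction coding using the default path.
print(coding_file("/home/jonathan/MockCorpus", "o1", "words"))
# Agent coding: one file per observation and agent.
print(coding_file("/home/jonathan/MockCorpus", "q4nc4", "timed-units", agent="g"))
# Hypothetical override directory, for illustration only.
print(coding_file("/home/jonathan/MockCorpus", "o1", "words", coding_path="/data/words"))
```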

The second observation is that codings are divided into layers. Layers contain code elements which define the valid elements in a layer. The syntax and semantics of these code elements is exactly as described for object sets.

From 25/04/2006 Layers can point to each other using the draws-children-from attribute and the name of another layer. If your build is older, use the now-deprecated points-to attribute. For recursive layers like syntax, use the attribute recursive="true" on the layer to mean that elements in the layer can point to themselves. The attribute recursive-draws-children-from="layer-name" means that elements in the layer can recurse but they must "bottom out" by pointing to an element in the named layer. With builds pre 25/04/2006, use the now-deprecated recursive-points-to attribute.

Layers are further described by their three types which are all described in detail in this paper.
NOTE: in builds after 16/9/05 there is now a fourth layer type: external-reference-layer that is described below. It was introduced to link NXT data with data in formats from other programs.

Time-aligned layer - elements are directly time-stamped to signal.
Structural layer - elements can inherit times from any time-aligned layer they dominate. Times are not serialized with these elements by default.

NOTE: As of any build after 9/10/04, structural layers can be prevented from inheriting times from their children. This is important as it is now permitted that parents can have temporally overlapping children so long as the times are not inherited. In order to make use of this, use the attribute inherits-time="false" on the structural-layer element. Allowing parents to inherit time when their children can overlap temporally may produce unexpected results from the search engine, particularly where precedence operators are used.

Featural layer - elements can have no time stamps and cannot dominate any other elements - they can only use pointers.

On disk, the above metadata fragment could describe the file /home/jonathan/MockCorpus/prosody/o1.prosody.xml for observation o1 (the prosody coding-file overrides the default path):

 <nite:root nite:id="root1">
   <accent nite:id="acc1" tobi="high">
     <nite:child href="o1.words.xml#w_6"/>
     <nite:child href="o1.words.xml#w_7"/>
   </accent>   
   <accent nite:id="acc2" tobi="low">
     <nite:child href="o1.words.xml#w_19"/>
     <nite:child href="o1.words.xml#w_20"/>
   </accent>   
 </nite:root>

A note on effective content models: the DTD content model equivalent of this layer definition

   <structural-layer name="prosody-layer" draws-children-from="words-layer">
        <code name="high"/>
        <code name="low"/>
   </structural-layer>

Would be (high|low)*. However, if a code has the attribute text-content set to the value true (as for the element word above) the content model for this element is overridden and it can contain only text. This is the only way to allow textual content in your corpus. Mixed content is not allowed anywhere.

The fourth layer type: external-reference-layer

Builds after 16/9/05

An external reference layer is one which contains a set of standard NITE elements, each of which has a standard nite:pointer to an NXT object and an external pointer to some part of a data structure not represented in NXT format. The idea is that when an application program encounters such an external element, it can start up an external program with some appropriate arguments, and highlight the appropriate element in its own data structure.

A metadata fragment looks like this:

    <coding-file name="external" path="external">
        <external-reference-layer element-name="prop"
             external-pointer-role="owl_pointer" content-type="text/owl"
             layer-type="featural" name="prop-layer" program="protege">
             <pointer number="1" role="da" target="words-layer"/>
             <argument default="owl_file_1.owl" name="owl_file"/>
             <argument default="arg_value" name="further_arg"/>
        </external-reference-layer>
    </coding-file>

The corresponding data looks like this:

   <prop>
      <nite:external_pointer role="owl_pointer"  href="owlid42"/>
      <nite:child href="IS1008a.A.words.xml#id(IS1008a.A.words0)"/>
   </prop>

In the metadata fragment, you can choose to explicitly name the program that is called using the program attribute, or you can specify the content type of the external file using a content-type attribute (the fragment above shows both). Both are treated as String values and not interpreted directly by NXT.

Styles

Styles are the files that allow either NIE (NITE interface engine) or OTAB (observable track annotation board) to produce an appropriate display. In the case of NIE, these files are stylesheets and in the case of OTAB they are specification files. Styles may be grouped into views. An example of a definition of some styles:

    <styles path="/home/styles/">
        <style application="nie" description="basic syntax coder"
            extension=".xsl" name="maptask-editor" type="editor"/>
        <style application="otab" description="annotation board"
            extension=".xml" name="maptask-annotation-board" type="editor"/>
    </styles>

As with many other elements in the metadata file, the styles element has a path attribute whose value is the directory in which style files for this corpus exist. The name of an individual style, together with its extension, gives the filename, and also allows the style to be referred to from a view. So in this example, we expect to find a stylesheet in the file /home/styles/maptask-editor.xsl which is a basic syntax coder. The type attribute describes whether the style is an editor or just a display.

Views

Views are combinations of displays that combine to produce an editing or display environment for a particular purpose. Views can comprise zero or one NIE displays, zero or one OTAB displays, and any number of video and audio windows. Here's an example combining a styled display and an audio window:

    <views>
        <view description="basic transcription" type="editor">
            <styled-window nameref="maptask-editor"/>
            <audio-window nameref="audio" sound="yes"/>
        </view>
    </views>

Callable Programs

To help with housekeeping, it's useful to know what programs have been written for the corpus and how to call them. This also allows NXT's top-level interface to list the programs and run them. Each callable-program contains a list of required arguments. For example, a program described thus:

    <callable-programs>
      <callable-program name="SwitchboardAnimacy" description="animacy checker">
        <required-argument name="corpus" type="corpus"/>
        <required-argument name="prefix" default=""/>
        <required-argument name="observation" type="observation"/>
      </callable-program>
    </callable-programs>

Would be called java SwitchboardAnimacy -corpus <metadata-path> -prefix -observation <obs-name>. The type attribute can take one of two values: corpus meaning that the expected argument is the metadata filename and observation meaning the argument is an observation name. Arguments can also have default values. Note also that the argument name or the default values can be empty strings.
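The mapping from a callable-program declaration to a command line can be sketched in Python as below. This is an illustration under assumptions, not NXT's launcher: the function command_line is hypothetical, and omitting the value when an argument's default is the empty string is inferred from the bare -prefix in the command shown above.

```python
import xml.etree.ElementTree as ET

# The declaration from the example above.
PROGRAMS_XML = """
<callable-programs>
  <callable-program name="SwitchboardAnimacy" description="animacy checker">
    <required-argument name="corpus" type="corpus"/>
    <required-argument name="prefix" default=""/>
    <required-argument name="observation" type="observation"/>
  </callable-program>
</callable-programs>
"""


def command_line(program_elem, metadata_path, observation):
    """Build the argument list for a <callable-program> declaration."""
    argv = ["java", program_elem.get("name")]
    for arg in program_elem.findall("required-argument"):
        argv.append("-" + arg.get("name"))
        kind = arg.get("type")
        if kind == "corpus":
            value = metadata_path          # the metadata filename
        elif kind == "observation":
            value = observation            # an observation name
        else:
            value = arg.get("default", "")  # untyped: fall back to the default
        if value:                           # empty values add no extra token
            argv.append(value)
    return argv


program = ET.fromstring(PROGRAMS_XML).find("callable-program")
print(" ".join(command_line(program, "maptask.xml", "q4nc4")))
```

The file name maptask.xml is a placeholder for the metadata path; q4nc4 is one of the observation names used elsewhere in this document.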

Observations

Each observation in a corpus must have a unique name which is used in filenames. This is declared in a list of observations using the name attribute, for instance, like this:

    <observations>
        <observation name="q4nc4"/>
        <observation name="q3nc8"/>
    </observations>

NXT currently includes the option of declaring two additional types of information for each observation: its categorization according to the observation variables that divide the corpus into subsets, and some very limited data management information about the state of coding for the observation. We expect in future to rethink our approach to data management, which will probably mean removing this facility from the metadata. The facility is described here.

It has been pointed out that one might expect observations to have information mapping from agent (roles) to personal information about the individuals filling them in that observation (age, dialect, etc.). We don't propose a specific set of kinds of information one might wish to retain, because in our experience different projects have different needs (but see, for instance, the ISLE/IMDI metadata initiative). We also don't provide a specific way of storing it. This is partly because some of the information that projects retain falls under data protection and some of it doesn't, so there are issues about how it should be designed. At the moment, the best one can do is define a set of variables that together give the information one is looking for. We intend further improvements that will allow the corpus designer to specify a structure for the information and will allow private information to be kept in a separate file that is linked to from the metadata. Currently, the query language doesn't give access to the metadata about an observation, which means that it is only useful for deciding programmatically which observations to load as a filter on the entire corpus set, not for any finer-grained filtering. This also is something we hope to look at. Meanwhile, given these shortcomings, sometimes the best option is to store any detailed information in a separate file of one's own design and build variables that link agent roles to individuals in the separate file by idref.

Last modified 01/11/07