We have been asked about the relationship between NXT and the
Text Encoding Initiative, and
in particular, whether it is possible to produce an annotation for
spoken dialogue compliant with the TEI standards using NXT
GUIs. (Although NXT does get used on text, we have not considered the
relationship between NXT and the TEI on textual materials yet, but we
expect there to be fewer issues that arise for them.) These are our
thoughts on the issue so far. We have made some reference to the P5
documentation in writing them, although we are also relying partly on
memory and have not thoroughly checked our work, so it is not
definitive. Corrections are welcome.
Note also that the TEI states that their guidelines are under
revision in this area.
Summary of Answer
If one has TEI-compliance in mind from the start, then it should be
possible to design the NXT storage format for the data set so that it
only requires a simple transform to be TEI-compliant, and for some
data sets it may be possible to make it TEI-compliant as is. However,
designing the NXT data representation for maximum TEI-compliance loses
the main benefits of using NXT. If the data has crossing hierachies
of annotation, using a TEI-compliant representation means losing the
search facility that handles these nicely. If the data represents
temporal relationships, using a TEI-compliant representation means
losing the ability of NXT browsers to highlight the current annotations
as a signal plays. In addition, the configurable interfaces for
dialogue acts and named entities currently constrain the NXT data
representation in ways that violate TEI recommendations, which means
that data sets which aim for TEI-compliance would either need to
write their own tailored GUIs for everything or contribute (fairly
modest) changes to them. If one wants to make use of NXT's best
properties, then it would be better to develop a data path for getting
between the NXT and TEI-compliant data formats than to build TEI-compliance
into the NXT format. If one doesn't need NXT's facilities for crossing
hierachies or timing, then there may be a simpler framework upon which
annotation tools can be built.
Data without crossing hierarchies or timing
The TEI recommends particular tag names for orthographic transcription
element. These are not a problem for NXT, which has no constraints
on tag naming - it just requires the tags to be formally defined in
the NXT "metadata" using the TEI's set.
The TEI recommends the use of markup within one XML tree as the
orthography for the representation of dialogue acts, named entities,
turns, and the like. For instance, dialogue acts are represented in
the TEI as <seg>'s and named entities as <rs>'s (or similar
non-segmenting spans of transcription elements, such as <persName>).
One hierarchy of <seg>'s over the transcription can be represented in
NXT, again by authoring the metadata to match, but the metadata will
not be particularly useful for data validation because it will simply
have the semantics that all <seg>'s draw from the transcription
elements as children; if there is internal structure among the
segments, NXT will not by itself enforce or check that. Similarly,
<rs> and similar tags can be used, but technically they violate NXT's
data model unless hey are either defined within the orthographic
transcription tag set (with recursive descent through that set of
tags). This is because strictly speaking, NXT requires "layers" of
annotation to span the layers beneath them (in this case, the layer of
transcription elements). However, this is a only a weak data model
violation, and NXT copes with it by allowing tags to contain either
the element types declared as their children or skip directly to the
ones declared as their children's children.
If one's data does not have crossing hierarchies or a relationship
to signal, this suggests that TEI-compliance is either possible
or very close.
There may be a problem with the representation of links. The TEI
practice for relating data elements uses IDREF or IDREFS or in-file
links. Some NXT data sets use string matching on attribute values
which is similar to using IDREFs, but there is nothing in the
attribute declarations which lets NXT validate that relationship. NXT
currently writes in-file links using a syntax that (redundantly)
contains the filename, although this could be changed without much
difficulty.
There may also be differences in what's expected at file roots.
NXT doesn't require a particular tag name at the root (although it
does currently warn if an unexpected one is used), but it
doesn't expect headers and bodies in the same file, and the
metadata declaration won't allow different content models for
two tags at the same depth from the root in the same file, weakening
the data validation where they are stored together (since then the
content model must specify a disjunction of the possible types at
that depth).
Every NXT element must have an id, which may be a burden for some
data sets.
Crossing hierarchies
The main difference between NXT's representation and that of the
TEI is whether or not overlapping (crossing) hierarchies pointing
down to the same elements are expected. NXT is designed specifically
for cases where they are; the TEI contains mechanisms for dealing
with crossing hierarchies, but because this is not their primary
concern, the mechanisms are more cumbersome.
NXT's data representation is based on the idea of multi-rooted trees;
in the data model, individual nodes can have one set of children, but
multiple parents from different upward trees. A typical use of for
this representation in the annotation of spoken dialogue (which makes
up NXT's largest user group) is to have time-aligned orthographic
transcription at the bottom, and then separate hierarchies for, say,
named entities, dialogue acts, prosodic phrases, turns, or whatever
that use the words as children. The data is serialized into XML by
divided the multi-rooted tree into convenient trees where the XML
structure mirrors the data structure and representing the remaining
connections between nodes using stand-off links in XLink format.
NXT also allows arbitrary additional links to be represented on top of
the multi-rooted tree, again using XLinks, but ones that have a
different semantics within NXT.
The TEI representation for a data set with crossing hierarchies would
choose one hierarchy as the primary one, mirror that in the XML
structure, and use milestone tags for the other hierarchies. This
keeps everything in one file. For extreme cases, one could use the
TEI's recommended form for representing graphs, which gives a list of
nodes and links where the XML structure does not mirror any part of
the graph.
Either of these styles of representation can be defined in NXT's
"metadata" describing the set of tags, and as long as everything
fits into one XML tree they can be kept in one file, but the NXT
data validation won't be particularly useful then, and there are no
existing GUIs or search facilities that will help in creating or
using this data, which means building new ones using the GUI library.
Timing data
The other main difference between NXT and the TEI is in the
representation of timing relationships. The TEI gives a choice of
mechanisms, ranging from the coarse statement that an element is
overlapped via 'trans="overlap"', through the use of <anchor> tags
that link to overlapping events, to the representation of complete
timelines that give time points which then can be used to indicate the
start and end times for an element. Any of these representations can
be defined in NXT's data storage format, but none of them will get
the timing data recognized as time in NXT,
which disables one of the most useful
features of NXT browsers (the ability to play signals and show which
annotations are current as they play). NXT's format for timing
information is closest to the last one, but is not TEI-compliant;
where annotations of a
particular type for different speakers ("agents") can overlap
temporally, NXT requires them to be stored in separate files. This is
in aid of the temporal semantics inherent in NXT's data model which
allows timings to percolate up trees. This requirement can only be
circumventing by failing to declare the attributes as times.
Standardized GUIs
NXT comes with some configurable tools for annotating dialogue
acts and named entities. These currently rely on an NXT data
representation in which the dialogue act and named entity tags
point into an external ontology of act or entity types, rather
than allowing the type to be expressed as an attribute value.
That means that if a data set is represented to be as TEI-compliant
as possible in the NXT format itself, these tools cannot be used.
We are considering making it possible to configure the tools
to use an enumerated attribute, but we don't have an immediate need
for the result so the work hasn't been scheduled yet. If there
is more than one type of <seg> in the data, this will cause problems
for setting up the tool because the NXT metadata will have no way
of specifying which types go together into one set to be annotated
together (so, for instance, making dialogue act annotation different
from some other segmentation and classification task).
Other frameworks
The difficulties in mapping between the TEI and NXT arise from
the fact that NXT is designed for data that is rather esoteric
for the TEI. If one doesn't need crossing hierachies or relationships
to signal, there may be other annotation frameworks that are closer
to TEI-compliance in their native data formats. We have never considered
other frameworks in this light. MMAX2 uses multiple file stand-off,
so probably isn't any closer. Other key words to search on are
AGTK, Callisto, Atlas, and Wordfreak.
Author: Jean Carletta
Date created: 19 Apr 2006
Last modified 04/19/06