NITE XML Toolkit - Relationship to TEI

We have been asked about the relationship between NXT and the Text Encoding Initiative (TEI), and in particular whether it is possible to produce a TEI-compliant annotation of spoken dialogue using NXT GUIs. (Although NXT does get used on text, we have not yet considered the relationship between NXT and the TEI for textual materials, but we expect fewer issues to arise there.) These are our thoughts on the issue so far. We referred to the P5 documentation in writing them, but we also relied partly on memory and have not thoroughly checked our work, so this note is not definitive. Corrections are welcome. Note also that the TEI states that its guidelines are under revision in this area.

Summary of Answer

If one has TEI-compliance in mind from the start, then it should be possible to design the NXT storage format for the data set so that it only requires a simple transform to be TEI-compliant, and for some data sets it may be possible to make it TEI-compliant as is. However, designing the NXT data representation for maximum TEI-compliance loses the main benefits of using NXT. If the data has crossing hierarchies of annotation, using a TEI-compliant representation means losing the search facility that handles these nicely. If the data represents temporal relationships, using a TEI-compliant representation means losing the ability of NXT browsers to highlight the current annotations as a signal plays. In addition, the configurable interfaces for dialogue acts and named entities currently constrain the NXT data representation in ways that violate TEI recommendations, which means that projects aiming for TEI-compliance would need either to write their own tailored GUIs for everything or to contribute (fairly modest) changes to the existing ones. If one wants to make use of NXT's best properties, then it would be better to develop a data path for getting between the NXT and TEI-compliant data formats than to build TEI-compliance into the NXT format. If one doesn't need NXT's facilities for crossing hierarchies or timing, then there may be a simpler framework upon which annotation tools can be built.

Data without crossing hierarchies or timing

The TEI recommends particular tag names for orthographic transcription elements. These are not a problem for NXT, which has no constraints on tag naming - it just requires the tags to be formally defined in the NXT "metadata", which can use the TEI's set. The TEI recommends representing dialogue acts, named entities, turns, and the like as markup within the same XML tree as the orthography. For instance, dialogue acts are represented in the TEI as <seg> elements and named entities as <rs> elements (or similar non-segmenting spans of transcription elements, such as <persName>). One hierarchy of <seg> elements over the transcription can be represented in NXT, again by authoring the metadata to match, but the metadata will not be particularly useful for data validation because it will simply have the semantics that all <seg> elements draw from the transcription elements as children; if there is internal structure among the segments, NXT will not by itself enforce or check that. Similarly, <rs> and similar tags can be used, but technically they violate NXT's data model unless they are defined within the orthographic transcription tag set (with recursive descent through that set of tags). This is because, strictly speaking, NXT requires "layers" of annotation to span the layers beneath them (in this case, the layer of transcription elements). However, this is only a weak data model violation, and NXT copes with it by allowing tags either to contain the element types declared as their children or to skip directly to the ones declared as their children's children.

If one's data does not have crossing hierarchies or a relationship to signal, this suggests that TEI-compliance is either possible or very close. There may be a problem with the representation of links. The TEI practice for relating data elements uses IDREF or IDREFS or in-file links. Some NXT data sets use string matching on attribute values, which is similar to using IDREFs, but there is nothing in the attribute declarations that lets NXT validate that relationship. NXT currently writes in-file links using a syntax that (redundantly) contains the filename, although this could be changed without much difficulty.

There may also be differences in what is expected at file roots. NXT doesn't require a particular tag name at the root (although it does currently warn if an unexpected one is used), but it doesn't expect headers and bodies in the same file, and the metadata declaration won't allow different content models for two tags at the same depth from the root in the same file. This weakens the data validation where they are stored together, since the content model must then specify a disjunction of the possible types at that depth. Every NXT element must have an id, which may be a burden for some data sets.
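
Because string-matched attribute values are not validated by NXT itself, a project relying on them might check them separately. The following is a minimal sketch of such a check, not part of NXT: the function name, the "id" and "target" attribute names, and the sample tags are all assumptions made for illustration, and it only looks within a single file.

```python
import xml.etree.ElementTree as ET

def dangling_refs(xml_text, ref_attr, id_attr="id"):
    """Return the values of ref_attr that match no id_attr value in the file."""
    root = ET.fromstring(xml_text)
    ids = {el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None}
    missing = []
    for el in root.iter():
        value = el.get(ref_attr)
        if value is None:
            continue
        # Treat the value as a whitespace-separated list, as with IDREFS.
        for token in value.split():
            if token not in ids:
                missing.append(token)
    return missing

# Hypothetical sample: the second <seg> refers to an id that does not exist.
SAMPLE = """
<u id="u1">
  <w id="w1">hello</w>
  <w id="w2">there</w>
  <seg id="s1" target="w1 w2"/>
  <seg id="s2" target="w3"/>
</u>
"""

print(dangling_refs(SAMPLE, "target"))
```

A real data set would need the attribute names it actually uses, and links that cross files would need the ids from every file collected first.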

Crossing hierarchies

The main difference between NXT's representation and that of the TEI is whether or not overlapping (crossing) hierarchies pointing down to the same elements are expected. NXT is designed specifically for cases where they are; the TEI contains mechanisms for dealing with crossing hierarchies, but because this is not its primary concern, the mechanisms are more cumbersome. NXT's data representation is based on the idea of multi-rooted trees; in the data model, individual nodes can have one set of children, but multiple parents from different upward trees. A typical use of this representation in the annotation of spoken dialogue (which makes up NXT's largest user group) is to have time-aligned orthographic transcription at the bottom, and then separate hierarchies for, say, named entities, dialogue acts, prosodic phrases, turns, or whatever that use the words as children. The data is serialized into XML by dividing the multi-rooted tree into convenient trees where the XML structure mirrors the data structure and representing the remaining connections between nodes using stand-off links in XLink format. NXT also allows arbitrary additional links to be represented on top of the multi-rooted tree, again using XLinks, but ones that have a different semantics within NXT.

The TEI representation for a data set with crossing hierarchies would choose one hierarchy as the primary one, mirror that in the XML structure, and use milestone tags for the other hierarchies. This keeps everything in one file. For extreme cases, one could use the TEI's recommended form for representing graphs, which gives a list of nodes and links where the XML structure does not mirror any part of the graph. Either of these styles of representation can be defined in NXT's "metadata" describing the set of tags, and as long as everything fits into one XML tree they can be kept in one file, but the NXT data validation won't be particularly useful then, and there are no existing GUIs or search facilities that will help in creating or using this data, which means building new ones using the GUI library.
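
To make the milestone style concrete, here is a rough sketch of flattening two crossing hierarchies into one XML tree: one hierarchy becomes real elements and the other becomes pairs of empty boundary tags. Everything here is assumed for illustration: the function, the <boundary> tag and its "kind" and "edge" attributes are placeholders, the output has not been checked against the TEI schema, and NXT's own stand-off format is not involved.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def serialize_with_milestones(words, primary, secondary):
    """Serialize words into one XML tree.

    primary: non-overlapping (tag, start, end) spans, written as elements.
    secondary: (kind, start, end) spans that may cross primary boundaries,
    written as empty boundary tags. Ends are exclusive word indices.
    """
    parts = ["<text>"]
    for i in range(len(words) + 1):
        # Close whatever ends just before word i.
        parts += [f'<boundary kind="{k}" edge="end"/>'
                  for k, _, e in secondary if e == i]
        parts += [f"</{t}>" for t, _, e in primary if e == i]
        if i == len(words):
            break
        # Open whatever starts at word i, then emit the word itself.
        parts += [f"<{t}>" for t, s, _ in primary if s == i]
        parts += [f'<boundary kind="{k}" edge="start"/>'
                  for k, s, _ in secondary if s == i]
        parts.append(f"<w>{escape(words[i])}</w>")
    parts.append("</text>")
    return "".join(parts)

words = ["so", "I", "went", "home", "yesterday"]
acts = [("seg", 0, 3), ("seg", 3, 5)]   # primary hierarchy
phrases = [("phrase", 1, 4)]            # crosses the boundary between the acts
xml_text = serialize_with_milestones(words, acts, phrases)
ET.fromstring(xml_text)  # raises ParseError if the result is not well formed
print(xml_text)
```

The phrase starts inside the first <seg> and ends inside the second, which could not be written as a nested element; the paired empty tags keep the file well formed at the cost of the phrase no longer being a node that contains its words.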

Timing data

The other main difference between NXT and the TEI is in the representation of timing relationships. The TEI gives a choice of mechanisms, ranging from the coarse statement that an element is overlapped via 'trans="overlap"', through the use of <anchor> tags that link to overlapping events, to the representation of complete timelines that give time points which then can be used to indicate the start and end times for an element. Any of these representations can be defined in NXT's data storage format, but none of them will get the timing data recognized as time in NXT, which disables one of the most useful features of NXT browsers (the ability to play signals and show which annotations are current as they play). NXT's format for timing information is closest to the last one, but is not TEI-compliant; where annotations of a particular type for different speakers ("agents") can overlap temporally, NXT requires them to be stored in separate files. This requirement supports the temporal semantics inherent in NXT's data model, which allows timings to percolate up trees. It can only be circumvented by failing to declare the attributes as times.
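
As an illustration of what timings percolating up a tree could mean, the sketch below gives an untimed node the span from its earliest timed descendant's start to its latest timed descendant's end. This is one plausible reading, stated as an assumption; the Node class and span function are invented for the example and are not NXT's implementation or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    name: str
    start: Optional[float] = None
    end: Optional[float] = None
    children: List["Node"] = field(default_factory=list)

def span(node: Node) -> Optional[Tuple[float, float]]:
    """Return (start, end) for a node, inheriting from children if untimed."""
    if node.start is not None and node.end is not None:
        return (node.start, node.end)
    child_spans = [s for s in (span(c) for c in node.children) if s is not None]
    if not child_spans:
        return None  # nothing timed beneath this node
    return (min(s[0] for s in child_spans), max(s[1] for s in child_spans))

# Hypothetical data: two timed words under an untimed dialogue act.
w1 = Node("w", start=0.0, end=0.4)
w2 = Node("w", start=0.4, end=0.9)
act = Node("act", children=[w1, w2])
turn = Node("turn", children=[act])

print(span(turn))
```

Under this reading, mixing two speakers' overlapping words beneath one parent would give that parent a span covering both, which suggests why a model with this semantics would want each agent's annotations kept apart.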

Standardized GUIs

NXT comes with some configurable tools for annotating dialogue acts and named entities. These currently rely on an NXT data representation in which the dialogue act and named entity tags point into an external ontology of act or entity types, rather than allowing the type to be expressed as an attribute value. That means that if a data set is represented to be as TEI-compliant as possible in the NXT format itself, these tools cannot be used. We are considering making it possible to configure the tools to use an enumerated attribute, but we don't have an immediate need for the result so the work hasn't been scheduled yet. If there is more than one type of <seg> in the data, this will cause problems for setting up the tool because the NXT metadata will have no way of specifying which types go together into one set to be annotated together (so, for instance, making dialogue act annotation different from some other segmentation and classification task).
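
The gap between the two representations is small enough that a transform can bridge it. The sketch below rewrites an annotation whose type is given by a pointer into an external ontology so that the type becomes an attribute value instead. The tag names (<act>, <pointer>, <type>), the href syntax, and the function are all hypothetical and simplified; they do not reproduce NXT's actual file format.

```python
import xml.etree.ElementTree as ET

def inline_types(annotation_xml, ontology_xml):
    """Replace each act's ontology pointer with a 'type' attribute."""
    ontology = ET.fromstring(ontology_xml)
    names = {t.get("id"): t.get("name") for t in ontology.iter("type")}
    root = ET.fromstring(annotation_xml)
    # Copy the list first, since the tree is modified inside the loop.
    for act in list(root.iter("act")):
        pointer = act.find("pointer")
        if pointer is None:
            continue
        # Assumed link form: "file#id"; keep only the part after '#'.
        target = pointer.get("href").split("#", 1)[1]
        act.set("type", names[target])
        act.remove(pointer)
    return ET.tostring(root, encoding="unicode")

ONTOLOGY = '<types><type id="t1" name="question"/><type id="t2" name="inform"/></types>'
ANNOTATION = '<acts><act id="a1"><pointer href="types.xml#t2"/></act></acts>'

print(inline_types(ANNOTATION, ONTOLOGY))
```

Going the other way, from attribute value back to ontology pointer, would need the same lookup table inverted, which is only possible if type names are unique.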

Other frameworks

The difficulties in mapping between the TEI and NXT arise from the fact that NXT is designed for data that is rather esoteric for the TEI. If one doesn't need crossing hierarchies or relationships to signal, there may be other annotation frameworks that are closer to TEI-compliance in their native data formats. We have never considered other frameworks in this light. MMAX2 uses multiple file stand-off, so probably isn't any closer. Other key words to search on are AGTK, Callisto, Atlas, and Wordfreak.

Author: Jean Carletta

Date created: 19 Apr 2006