NITE XML Toolkit - Specifying a data build

This page explains how to use the Build utility of NXT to produce packaged-up versions of your corpus in a way that other users can unpack and use. You can specify which annotations and observations are included, and an appropriate metadata file will be produced to go along with the data you select.

To specify a data build for an NXT corpus, follow these steps:

1. Write a build specification file conforming to this simple DTD file.
2. With NXT and its various lib jar files on your CLASSPATH, run
   java net.sourceforge.nite.util.Build <<yourfile>>
   This creates an ant file to actually do the build. You will also be told what command to issue...
3. Run ant -f <<your_antfile>> to produce a data file (you'll be told what the data file is called).

Examples and explanation of format

First, here's a valid build specification. The resulting ant file extracts a set of words and abstractive summaries from all observations matching the regular expression Bed00*. We take the standard words from the corpus, but for the abstractive summary, we decide we want to use the files from the annotator sashby.

<build metadata="Data/ICSI/NXT-format/Main/ICSI-metadata.xml" 
    description="ICSI extract" name="jonICSI" 
    type="gold" corpus_resources="off" ontologies="on" object_sets="off">
 <extras dir="/home/jonathan/configuration" includes="*.html" dir="config"/>
 <coding-file name="words"/>
 <coding-file name="abssumm" annotator="sashby"/>
 <observation name="Bed00*"/>
</build>

There are two types of build: gold and multi-coder. The first of these is for builds where we want only one set of XML files for each coding and for that set to be treated as the gold-standard. Note that in the example above we actually chose a specific annotator's abstractive summary: in the resultant build, that annotator's abstractive summaries will replace any existing 'gold-standard' abstractive summaries.

Output of any of the corpus-wide information can be toggled on or off using attributes of the same name on the build element: corpus_resources, ontologies and object_sets. They are all output by default.

Arbitrary extras can also be included in the build using extras elements. These are essentially specs that are passed straight through to ant.

Multi-coder builds result in corpora which may have gold-standard codings, but can also have all the different annotators' data included. Here's an example:

<build metadata="Data/ICSI/NXT-format/Main/ICSI-metadata.xml" 
    description="ICSI multi-coder extract" name="jonICSImulti" 
    type="multi_coder">
 <coding-file name="words"/>
 <coding-file name="abssumm"/>
 <observation name="Bmr*"/>
</build>

This requests the same two codings as before but for a different set of observations. This time we'll end up with the gold-standard words (since that's all there is in our corpus), but the entire tree of abstractive summaries including any 'gold-standard' files plus subdirectories of all the annotators' abstractive summaries. Note that if an annotator is named in multi-coder mode, only that annotator's data is included but it is not raised to the gold-standard location.
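
To illustrate that last point, here is a sketch of the multi-coder specification above with an annotator named on the abssumm coding. It only recombines elements and attributes already shown on this page; the description and name values are made up for illustration.

```xml
<build metadata="Data/ICSI/NXT-format/Main/ICSI-metadata.xml"
    description="ICSI multi-coder extract, one annotator" name="jonICSIsashby"
    type="multi_coder">
 <!-- words: no annotator named, so the whole tree for this coding is taken -->
 <coding-file name="words"/>
 <!-- abssumm: only sashby's files are requested -->
 <coding-file name="abssumm" annotator="sashby"/>
 <observation name="Bmr*"/>
</build>
```

Going by the behaviour described above, this build should contain only sashby's abstractive summaries, and they should stay in the annotator's own subdirectory rather than being moved to the gold-standard location.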

One extra element allowed is a default-annotator element, placed before any of the coding-file elements: its name attribute gives the name of the annotator that is used by default (where not overridden by an annotator attribute on a coding-file element).
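
As a minimal sketch of how this might look, the specification below sets sashby as the default annotator and overrides it for one coding. The annotator name otherannotator is hypothetical, as are the description and name values; the element and attribute names are the ones described on this page.

```xml
<build metadata="Data/ICSI/NXT-format/Main/ICSI-metadata.xml"
    description="ICSI extract with a default annotator" name="jonICSIdefault"
    type="gold">
 <!-- must come before any coding-file element -->
 <default-annotator name="sashby"/>
 <!-- no annotator attribute: the default annotator (sashby) applies -->
 <coding-file name="words"/>
 <!-- annotator attribute overrides the default; "otherannotator" is a made-up name -->
 <coding-file name="abssumm" annotator="otherannotator"/>
 <observation name="Bed00*"/>
</build>
```

In a corpus that has only gold-standard words, the words coding here has no sashby data, so the gold-standard words would be expected to be used for it.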

Note: In any circumstance where a specific annotator's data has been requested, but there is none present, the 'gold-standard' data (if present) will be used instead.

Last modified 05/11/06