NITE XML Toolkit

End user and developer questions for NXT still tend to be dealt with by private email, although we do realize that we should move over to using public forums for this. When we receive a question more than once, we try to make time to change the web pages to make the answer clear in the correct location. This page is for frequently asked questions that haven't yet found a proper home, plus their answers.

FAQ

Namespacing

Q: Exactly what does xmlns:nite="http://nite.sourceforge.net/" do in the xml files? Is it necessary?

A: It declares the nite namespace. If you use it in your data, then you have to include this attribute on the root element of the data files that include elements and attributes from this namespace. In NXT format data, users typically namespace the reserved attributes and element names to avoid naming conflicts (e.g., attributes for ids and for start and end times, and elements for document roots, out-of-file children, and pointers).
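As a rough illustration of why the declaration matters, here is a minimal Python sketch that uses only the standard library, not NXT itself. The sample file content, the word element, and the id and timing values are invented for the example. It shows how an XML parser resolves the nite: prefix to the declared URI, and that the same prefix without a declaration is rejected as malformed XML.

```python
import xml.etree.ElementTree as ET

NITE_NS = "http://nite.sourceforge.net/"

# Hypothetical data file: element names and attribute values are invented.
SAMPLE = """\
<nite:root nite:id="words_1" xmlns:nite="http://nite.sourceforge.net/">
   <word nite:id="w1" nite:start="0.00" nite:end="0.35">hello</word>
   <word nite:id="w2" nite:start="0.35" nite:end="0.80">world</word>
</nite:root>
"""

root = ET.fromstring(SAMPLE)

# The parser replaces the "nite:" prefix with the declared namespace URI.
root_tag = root.tag
word_ids = [w.get(f"{{{NITE_NS}}}id") for w in root.findall("word")]
print(root_tag)
print(word_ids)

# Without the xmlns:nite declaration the prefix is unbound and parsing fails.
try:
    ET.fromstring('<word nite:id="w1">hello</word>')
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)
```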

Q: Can I use namespacing in my data set?

A: In theory namespacing is a good idea, but there is a bug in NXT's query language parser that means it can't handle namespaced element names and attributes. For this reason, you should avoid namespacing, with the possible exception of XML document roots (which aren't available to query anyway) and the reserved attributes that have their own special meaning to NXT and dedicated query language syntax (the id, available as ID($x), the start time, available as START($x), and the end time, available as END($x)).

Fonts and Font Sizes

Q: How do I change the font in an NXT GUI?

A: You can do whatever you want in a customized tool. The standard and configurable NXT GUIs don't specify a font, so what you get depends on your Java installation. Getting different fonts for different parts of the displayed data requires you to write customized tools, or to contribute code to the project that allows the user to specify in the configuration file what font to use for a particular element, attribute, or element's textual content.

Q: How do I change the font size in an NXT GUI?

A: You can do whatever you want in a customized tool. The standard and configurable NXT GUIs have a font size (usually 12 point) wired in, with the exception (as of September 2006) of the GenericDisplay, which allows a font size to be passed in at the command line. The simplest change would be to recompile other GUIs with the font size you want, although it would be better to contribute code that allows users to specify the font size in the configuration file. Some previous customized tools have allowed the end user to change the font size for a display from a menu. If you wish to revive this code for general use, contact us.

The main NXT GUI (net.sourceforge.nite.nxt.GUI) that allows the user to choose among the registered programs for a data set (those mentioned in the metadata under <callable-programs/>) automatically adds a GenericDisplay to the list. This automatic addition uses the default font size (12 point). If you want a menu entry for a different font size, you need to register the generic display with the font size you require. The declaration to do this is, e.g.:


<callable-programs>
   <callable-program description="20 point GenericDisplay" name="net.sourceforge.nite.gui.util.GenericDisplay">
       <required-argument name="corpus" type="corpus"/>
       <required-argument name="observation" type="observation"/>
       <required-argument name="fontsize" default="20"/>
   </callable-program>
</callable-programs>

To pop up a window asking the user to enter the fontsize they require, use:


<callable-programs>
   <callable-program description="20 point GenericDisplay" name="net.sourceforge.nite.gui.util.GenericDisplay">
       <required-argument name="corpus" type="corpus"/>
       <required-argument name="observation" type="observation"/>
       <required-argument name="fontsize" default="20"/>
   </callable-program>
</callable-programs>
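If you want menu entries for several font sizes, the declarations can be generated rather than typed by hand. The following is only a sketch using Python's standard library: the helper name generic_display_entry is hypothetical, and the element and attribute names are copied from the declaration above rather than from any NXT schema.

```python
import xml.etree.ElementTree as ET

GENERIC_DISPLAY = "net.sourceforge.nite.gui.util.GenericDisplay"

def generic_display_entry(font_size: int) -> ET.Element:
    """Build a <callable-program> element registering GenericDisplay at font_size."""
    program = ET.Element("callable-program", {
        "description": f"{font_size} point GenericDisplay",
        "name": GENERIC_DISPLAY,
    })
    # Same three arguments as in the hand-written declaration.
    ET.SubElement(program, "required-argument", {"name": "corpus", "type": "corpus"})
    ET.SubElement(program, "required-argument", {"name": "observation", "type": "observation"})
    ET.SubElement(program, "required-argument", {"name": "fontsize", "default": str(font_size)})
    return program

programs = ET.Element("callable-programs")
for size in (16, 20):
    programs.append(generic_display_entry(size))

# Paste the printed fragment into the metadata file.
print(ET.tostring(programs, encoding="unicode"))
```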

GUIs

Q: Why is the GenericDisplay unusable? / Why does the GenericDisplay run out of memory?

A: The GenericDisplay is designed to throw up windows corresponding to every XML tree in the data set for the observation chosen. If your data set has many different annotations, this will be too many windows for the user to handle, and if it's really big, you may not even be able to load them all at once. You can cut it down using the query argument to specify the kinds of things you actually want to see in the display. The GenericDisplay is designed to be something that will work, badly, for any NXT format data set - for actual work you will almost certainly want to set up one of the configurable interfaces or write your own customized display.

Data Model

Q: Are filenames case sensitive?

A: Yes.

Q: Can I use the same element name in two different layers?

A: No. NXT needs each element to belong to exactly one layer because otherwise it doesn't know how to serialize the data set, or what files to load when it requires elements of a specific type.
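A quick way to check a design against this rule before writing the metadata is to list your layers and look for element names that appear more than once. This is a minimal sketch in plain Python. The layer and element names are invented, and the dictionary is a stand-in for the real metadata file, which this code does not read.

```python
from collections import defaultdict

def elements_in_multiple_layers(layers):
    """Return {element name: [layers]} for names declared in more than one layer."""
    owners = defaultdict(list)
    for layer, names in layers.items():
        for name in names:
            owners[name].append(layer)
    return {name: found for name, found in owners.items() if len(found) > 1}

# Hypothetical design: "segment" is (wrongly) declared in two layers.
design = {
    "word-layer": ["word", "sil"],
    "phrase-layer": ["phrase", "segment"],
    "turn-layer": ["turn", "segment"],
}
clashes = elements_in_multiple_layers(design)
print(clashes)
```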

Q: Can I use the same attribute name for two different elements?

A: Yes.

Q: What kinds of properties can elements inherit from their children?

A: Only timing information using the reserved start and end time attributes, and this only if time inheritance is enabled for the element type involved.
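To make the idea concrete, here is a small sketch of time inheritance in plain Python. It is not NXT code. It assumes that an element without its own times takes the earliest start and the latest end among its children, and the Node class and its inherits_time flag are hypothetical stand-ins for the element and the per-element-type setting.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    name: str
    start: Optional[float] = None
    end: Optional[float] = None
    inherits_time: bool = True  # stand-in for the per-element-type setting
    children: List["Node"] = field(default_factory=list)

def span(node: Node) -> Optional[Tuple[float, float]]:
    """Return (start, end): own times if present, else inherited from children."""
    if node.start is not None and node.end is not None:
        return node.start, node.end
    if not node.inherits_time:
        return None
    child_spans = [s for s in map(span, node.children) if s is not None]
    if not child_spans:
        return None
    # Assumption: earliest child start and latest child end.
    return min(s for s, _ in child_spans), max(e for _, e in child_spans)

words = [Node("word", 0.0, 0.4), Node("word", 0.4, 0.9)]
phrase = Node("phrase", children=words)
untimed = Node("phrase", inherits_time=False, children=words)
print(span(phrase))
print(span(untimed))
```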

Q: What are ids for, and what constraints are there on the values for ids?

A: An id can be any string that's globally unique. If you are importing data and don't have ids on it yet, you can get NXT to generate ids for you by loading the data and then saving it. Ids are used to manage the relationship between display elements in a GUI and the underlying data, and for specifying out-of-file child and pointer links.
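Before importing data that already carries ids, it can be worth checking the uniqueness constraint yourself. This sketch uses Python's standard library only and assumes the ids are stored in a nite:id attribute. The file names and contents are invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: ids live in the namespaced nite:id attribute.
NITE_ID = "{http://nite.sourceforge.net/}id"

def duplicate_ids(documents):
    """Return the sorted ids that occur more than once across all documents."""
    counts = Counter()
    for text in documents.values():
        for element in ET.fromstring(text).iter():
            value = element.get(NITE_ID)
            if value is not None:
                counts[value] += 1
    return sorted(i for i, n in counts.items() if n > 1)

NS = 'xmlns:nite="http://nite.sourceforge.net/"'
# Hypothetical files: "w1" is reused in the second file.
files = {
    "o1.words.xml": f'<nite:root {NS} nite:id="r1"><word nite:id="w1"/><word nite:id="w2"/></nite:root>',
    "o1.phrases.xml": f'<nite:root {NS} nite:id="r2"><phrase nite:id="w1"/></nite:root>',
}
dupes = duplicate_ids(files)
print(dupes)
```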

Q: Can elements in two structural layers point to each other?

A: Yes. In general, any element can point to any other element, as long as all the elements from a given layer point to elements from the same layer, and this relationship is declared in the metadata. Pointers do not have to be in featural layers; the featural layer is just useful conceptually for the kind of layer that only relates to the rest of the data set via pointers.

Data Set Design

Q: What if I want elements from one layer to be able to draw children from either some layer or the layer that layer draws children from, skipping straight to what is usually a grandchild?

A: This violates the NXT data model. Suppose the phrase-layer contains the element phrase, which draws children from the subphrase-layer, which contains the element subphrase, which draws children from the word-layer, which contains the element word. There are three standard ways to encode the relationship you want:

Wrap non-subphrase runs of word elements in some new tag, say, nosubphrase, and use these as the children for phrases, so that you get strict decomposition in the layers. Then the data conforms completely, but users who are used to distance-limited operators like ^1 will need to know that the intermediate nosubphrase tag is there in the structure.
Serialize phrase and subphrase elements in the same file, and declare them as two tags within the same recursive layer. Then either can contain words, but also either can contain the other. This has the disadvantage that the data model design is declared to be less restrictive than it should be for the data set, so data validation wouldn't catch subphrase elements that contain phrase elements, for instance.
Declare phrase-layer to draw children from subphrase-layer, have phrase elements point to words directly whenever you want, and either store all three layers in the same file or never use code that lazy loads.

The first one is what was designed in as the preferred solution; the others are what data sets usually do. The third one may not be robust against future NXT development.
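The first option, wrapping runs of words, is easy to apply mechanically. Here is a minimal sketch in Python's standard library. It assumes, for simplicity, that the phrase, subphrase, and word elements are nested inline in a single tree, which is not how a multi-file NXT data set is normally serialized. The function name wrap_word_runs is hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

def wrap_word_runs(phrase):
    """Wrap each run of consecutive direct <word> children in <nosubphrase>."""
    children = list(phrase)
    for child in children:
        phrase.remove(child)
    # Group consecutive children by whether they are words, keeping order.
    for is_word, run in groupby(children, key=lambda c: c.tag == "word"):
        run = list(run)
        if is_word:
            wrapper = ET.SubElement(phrase, "nosubphrase")
            wrapper.extend(run)
        else:
            phrase.extend(run)

# Hypothetical inline data with words both inside and outside a subphrase.
phrase = ET.fromstring(
    "<phrase><word>the</word>"
    "<subphrase><word>big</word><word>dog</word></subphrase>"
    "<word>barked</word><word>loudly</word></phrase>"
)
wrap_word_runs(phrase)
print(ET.tostring(phrase, encoding="unicode"))
```

After the rewrite every child of phrase is either a subphrase or a nosubphrase, so a query that previously reached a word in one step from the phrase now needs two.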

Q: When should I use pointers and when should I use children?

A: Use children whenever this is acceptable in the data model (i.e., when it doesn't create loops or require an element to have multiple, conflicting sets of children), turning off the temporal inheritance if you need to - it's much easier to query elements related by hierarchy than by pointer.

Q: How much data should I put in one XML file?

A: Divide your data into files by thinking about typical uses of the data. If one layer draws children from another, and the two layers always get used together (both within NXT and in external processing), then you can save some loading overhead by putting them in the same file. If, however, users may want one without the other, separate them into two files so that lazy loading can minimize the data set size in working memory. If you have an element with many attributes, most of which are rarely used, consider putting the information conveyed by the attributes in one or more files containing elements that use the old, reduced elements as children, or that point to them. This makes querying the rarely used information more cumbersome, but saves overhead in the more common uses.

Q: Should I represent my orthography in textual content, or use an attribute?

A: The original NXT developers were split between some who wanted to preserve the TEI-ish notion that the textual content is the base text and some who didn't want any privileged textual content at all. Both designs have strengths for different kinds of data sets, so it depends. Most current data sets seem to use textual content.

For NXT, textual content has the following special properties:

In query, you can get at it using e.g. TEXT($w). Some users find this more intuitive than having to remember a specific attribute name.
Some of the libraries for building GUIs based on text or transcription expect textual content, so coding tools and transcription-based displays, for example, can require less setup if the data is laid out this way. Adding a delegate function that displays based on an attribute isn't hard, though.
Some command line utilities, like SortedOutput, treat an element as having textual content equal to the whitespace-delimited concatenation of its children in order. This can make it easier to extract some kinds of tables out of an NXT data set (for instance, a list of phrases by syntactic type). If the text is instead in attributes on words lower down in the hierarchy, it is possible to get it into such tables using FunctionQuery with the extract function, but this is cumbersome.
In future, it's possible that the query language will always treat an element as having textual content equal to the whitespace-delimited concatenation of its children in order. This was part of our original design and we have recently had someone complain that NXT doesn't do this, but we haven't made a decision about whether to make this extension or committed resource to it yet. If we do this work we could consider adding a reserved attribute for orthography so that we can treat it equivalently to textual content and suit both choices.

There are cases where using textual content is less elegant, as, for instance, in parallel corpora, where there are two rival versions of the orthography of equal importance.
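The "whitespace-delimited concatenation of its children in order" can be pictured with a short sketch. This is plain Python over an inline XML tree, not the code that SortedOutput or the query language uses. The element names and the type attribute are invented.

```python
import xml.etree.ElementTree as ET

def derived_text(element):
    """Text of a leaf, or the space-joined derived text of the children in order."""
    if len(element) == 0:
        return (element.text or "").strip()
    parts = [derived_text(child) for child in element]
    return " ".join(p for p in parts if p)

# Hypothetical phrase whose orthography is the textual content of its words.
np = ET.fromstring(
    '<phrase type="NP"><word>the</word>'
    "<phrase><word>big</word><word>dog</word></phrase></phrase>"
)
row = (np.get("type"), derived_text(np))
print(row)
```

With orthography stored in an attribute instead, the same table would need a lookup of that attribute on every word, which is the extra effort described above.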

Q: What's special about ontologies? Can I search for the "top-level" code and get all the child codes? How is it reflected in the underlying data structure?

A: Ontologies are a way of providing type or attribute value information that isn't just a string, but where the types or values fit into a hierarchical structure in their own right. Suppose your ontology contains

 [ontol.xml]
 <foo id="id0" name="animal">
    <foo id="id1" name="bird">
       <foo id="id2" name="sparrow"/>
       <foo id="id3" name="chickadee"/>
    </foo>
    <foo id="id4" name="dog">
       <foo id="id5" name="mutt"/>
    </foo>
 </foo>

Your elements can point into the ontology:

 <el>
     <nite:pointer href="ontol.xml#id3"/>
 </el>

to get type information. You can test for chickadees:

($a el)($b foo):($a > $b) && ($b@name="chickadee")

but you can also test for birds in general:

($a el)($b foo):($a > $b)::($c foo):($c@name="bird") && ($c ^ $b)

Elements in ontologies have searchable relationships just like everything else.
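For readers who want to see what the "birds in general" test amounts to outside the query language, here is a sketch in Python's standard library. It is not how NXT evaluates queries. It follows a pointer's href into the example ontology and walks up the hierarchy looking for a name. The helper names are hypothetical, and this sketch counts the pointed-to element itself as a match.

```python
import xml.etree.ElementTree as ET

ONTOLOGY = """\
<foo id="id0" name="animal">
   <foo id="id1" name="bird">
      <foo id="id2" name="sparrow"/>
      <foo id="id3" name="chickadee"/>
   </foo>
   <foo id="id4" name="dog">
      <foo id="id5" name="mutt"/>
   </foo>
</foo>
"""

def build_index(root):
    """Return (id -> element, child -> parent) lookup tables for the ontology."""
    by_id = {e.get("id"): e for e in root.iter()}
    parents = {child: parent for parent in root.iter() for child in parent}
    return by_id, parents

def has_type(href, type_name, by_id, parents):
    """True if the element at href, or any of its ancestors, has this name."""
    node = by_id[href.split("#", 1)[1]]
    while node is not None:
        if node.get("name") == type_name:
            return True
        node = parents.get(node)
    return False

by_id, parents = build_index(ET.fromstring(ONTOLOGY))
print(has_type("ontol.xml#id3", "bird", by_id, parents))
print(has_type("ontol.xml#id5", "bird", by_id, parents))
```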

In another sense, ontologies aren't at all special, because you could encode the same information as a corpus-resource and still be able to access the information from the query language. Using an ontology is more restrictive because it assumes one tag name throughout the hierarchy.

Query Language

Q: Is there a "not dominates" operator, like !^?

A: Use e.g. !($a ^ $b).

Performance

Q: What are the memory limits to NXT in loading data?

A: The in-memory data representation uses around 7 times the disk storage space for the same data, or a bit less. If lazy loading is on, only the files that are actually needed are loaded.
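As a back-of-the-envelope aid, the rule of thumb can be turned into a small estimate. This is plain Python. The factor of 7 is the upper-end figure quoted above, and the file names and sizes are invented. Treat the result as a rough ceiling, not a measurement.

```python
MEMORY_FACTOR = 7  # upper-end ratio of in-memory size to disk size, from the answer above

def estimated_memory_bytes(file_sizes, needed=None):
    """Estimate memory use; 'needed' limits the sum to lazily loaded files."""
    names = file_sizes.keys() if needed is None else needed
    return MEMORY_FACTOR * sum(file_sizes[name] for name in names)

# Hypothetical observation: sizes on disk in bytes.
sizes = {"o1.words.xml": 10_000_000, "o1.syntax.xml": 5_000_000}
everything = estimated_memory_bytes(sizes)
words_only = estimated_memory_bytes(sizes, needed={"o1.words.xml"})
print(everything, words_only)
```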

Last modified 10/10/06
