NITE XML Toolkit - Processing NXT Format Data

Suppose that you have data in NXT format, and you need to make some other format for part or all of it - a tailored HTML display, say, or input to some external process such as a machine learning algorithm or a statistical package. There are an endless number of ways in which such tasks can be done, and it isn't always clear what the best mechanism is for any particular application (not least because it can depend on personal preference). Here we walk you through some of the ones we use.

The hardest case for data processing is where the external process isn't the end of the matter, but creates some data that must then be re-imported into NXT. (Think, for instance, of the task of part-of-speech tagging or chunking an existing corpus of transcribed speech.) In the discussion below, we include comments about this last step of re-importation, but it isn't required for most data processing applications.

Option 1: Write an NXT-based application

Often the best option is to write a Java program that loads the data into a NOM and uses the NOM API to navigate it, writing output as you go. For this, the iterators in the NOM API are useful; there are ones, for instance, that run over all elements with a given name or over individual codings. It's also possible from within an application to evaluate a query on the loaded NOM and iterate over the results within the full NOM, not just the tree that saving XML from the query language exposes. (Many of the applications in the sample directory both load and iterate over query results, so it can be useful to borrow code from them.) For re-importation, we don't have much experience of making Java communicate with programs written in other languages (such as the streaming of data back and forth that might be required to add, say, part-of-speech tags), but we know this is possible: users have, for instance, made NXT-based applications communicate with processes running in C (though for other purposes).

This option is most attractive:

for those who write applications anyway (since they know the NOM API)
for applications where drawing the data required into one tree (the first step for the other processing mechanisms) means writing a query that happens to be slow or difficult to write, but NOM navigation can be done easily with a simpler query or no query at all
for applications where the output requires something which is hard to express in the query language (like immediate precedence) or not supported in query (like arithmetic)

Option 2: Make a tree, process it, and (for re-importation) put it back

Since XML processing is oriented around trees, constructing a tree that contains the data to be processed, in XML format, opens up the data set to all of the usual XML processing possibilities.

First step: make a tree

Individual NXT codings and corpus resources are, of course, tree structures that conveniently already come in XML files. Often these files are exactly what you need for processing anyway, since they gather together like information into one file. Additionally, there are the following ways of making trees from the data.

Knitting and knit-like tree construction

By "knitting", we mean the process of creating a larger tree than that in an individual coding or corpus resource by traversing over child or pointer links and including what is found. Knitting an XML document from an NXT data set performs a depth-first left-to-right traversal of the nodes in a virtual document made up by including not just the XML children of a node but also the out-of-document children and links (usually pointed to using nite:child and nite:pointer, respectively, although after 05 May 04 this is configurable). In the data model, tracing children is guaranteed not to introduce cycles, so the traversal recurses on them; however, following links could introduce cycles, so the traversal is truncated after the immediate node pointed to has been included in the result tree. For pointers, we also insert a node in the tree between the source and target of the link that indicates that the subtree derives from a link and shows the role. The result is one tree that starts at one of the XML documents from the data set, cutting across the other documents in the same way as the "^" operator of the query language, and including residual information about the pointer traces. As of May 2004, we are considering separating the child and pointer tracing into two different steps that can be pipelined together, for better flexibility, and changing the syntax of the element between sources and targets of links.
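The traversal described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in knit.xsl: the namespace URI, the href syntax (DOCUMENT#id(ID), with no ranges), and the pointer-trace element name are all assumptions made for the example, and the documents are held in memory rather than read from files.

```python
import re
import xml.etree.ElementTree as ET

# Assumed namespace URI and link syntax; adjust both to match your corpus.
NITE = "http://nite.sourceforge.net/"
ET.register_namespace("nite", NITE)
CHILD, POINTER, ID = (f"{{{NITE}}}{name}" for name in ("child", "pointer", "id"))
HREF = re.compile(r"^([^#]+)#id\(([^)]+)\)$")  # e.g. words.xml#id(w1); no ranges


def resolve(href, docs):
    """Return the element that an href such as 'words.xml#id(w1)' points to."""
    match = HREF.match(href)
    if match is None:
        raise ValueError(f"unsupported link: {href!r}")
    doc_name, target_id = match.groups()
    for elem in docs[doc_name].iter():
        if elem.get(ID) == target_id:
            return elem
    raise KeyError(href)


def knit(elem, docs):
    """Copy elem depth-first, left to right, expanding out-of-document links."""
    out = ET.Element(elem.tag, dict(elem.attrib))
    out.text = elem.text
    for kid in elem:
        if kid.tag == CHILD:
            # Child links cannot form cycles, so recurse into the target.
            out.append(knit(resolve(kid.get("href"), docs), docs))
        elif kid.tag == POINTER:
            # Pointer links may form cycles: record the role and a shallow
            # copy of the immediate target, then stop.
            # "pointer-trace" is a made-up name for the inserted node.
            trace = ET.SubElement(out, "pointer-trace", role=kid.get("role", ""))
            target = resolve(kid.get("href"), docs)
            trace.append(ET.Element(target.tag, dict(target.attrib)))
        else:
            out.append(knit(kid, docs))
    return out


# Two tiny hypothetical codings; the pointer deliberately points back at np1.
WORDS = (
    f'<words xmlns:nite="{NITE}">'
    '<w nite:id="w1">hello</w><w nite:id="w2">world</w></words>'
)
SYNTAX = (
    f'<syntax xmlns:nite="{NITE}"><np nite:id="np1">'
    '<nite:child href="words.xml#id(w1)"/><nite:child href="words.xml#id(w2)"/>'
    '<nite:pointer role="antecedent" href="syntax.xml#id(np1)"/>'
    "</np></syntax>"
)
DOCS = {"words.xml": ET.fromstring(WORDS), "syntax.xml": ET.fromstring(SYNTAX)}
knitted = knit(DOCS["syntax.xml"], DOCS)
print(ET.tostring(knitted, encoding="unicode"))
```

Because the pointer is never followed past its immediate target, the self-referencing link in the example terminates instead of looping.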

Using a stylesheet

Knit.xsl, from NXT's lib directory, is a stylesheet that can be used to knit NXT format data. It recurses down child links, incorporating trace summaries of pointer links as it encounters them. Stylesheet processor installations vary locally. Some people use Xalan, which happens to be redistributed with NXT. It can be used to run a stylesheet on an XML file as follows. With the stylesheet knit.xsl (distributed in lib) copied into the same directory as the data:

java org.apache.xalan.xslt.Process -in INFILE -xsl STYLESHEET

The default linkstyle is LTXML, the default id attribute is nite:id, the default indication of an out-of-file child is nite:child, and the default indication of an out-of-file pointer is nite:pointer. These can be overridden using the parameters linkstyle, idatt, childel, and pointerel, respectively, and so for example if the corpus is not namespaced and uses xpointer links,

java org.apache.xalan.xslt.Process -in INFILE -xsl STYLESHEET -param linkstyle xpointer -param idatt id -param childel child -param pointerel pointer
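If knit has to be run over many files, it can help to build the command line programmatically. The following Python sketch only assembles the argument list from the flags shown above; the function name is made up for the example, and it assumes that Xalan is already on the Java classpath.

```python
def xalan_knit_command(infile, stylesheet="knit.xsl", **params):
    """Build the Xalan command line for knitting one file.

    Keyword arguments become -param NAME VALUE pairs (for instance
    linkstyle, idatt, childel, pointerel).
    """
    command = ["java", "org.apache.xalan.xslt.Process",
               "-in", infile, "-xsl", stylesheet]
    for name, value in params.items():
        command += ["-param", name, value]
    return command


# A non-namespaced corpus that uses xpointer links:
command = xalan_knit_command("INFILE", linkstyle="xpointer", idatt="id",
                             childel="child", pointerel="pointer")
print(" ".join(command))
```

The resulting list can be handed to subprocess.run, redirecting standard output to wherever the knitted file should go.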

There's a known problem between some versions of Xalan and some installations of Java 1.4 that means sometimes this doesn't work; the fix is documented elsewhere (although some people just back off to Java 1.3, if they have it). There are lots of other stylesheet processors around, like Saxon and jd.xslt.

In NXT-1.2.6 and before (pre 05 May 04), the use of nite:id, nite:child, and nite:pointer is hardwired, ranges don't work, and there are separate stylesheets for the two link styles: knit.xsl for xpointer and knit.ltxml.xsl for LTXML.

A minor variant of this approach is to edit knit.xsl so that it constructs a tree that is drawn from a path that could be knitted, and/or adds document calls to pull in off-tree items. The less the desired output matches a knitted tree, and especially the more outside material it pulls in, the harder this is. Also, if a subset of the knitted tree is what's required, it's often easier to obtain it by post-processing the output of knit.

Using lxinclude/lxnitepointer (pre-release)

Knit.xsl is painfully slow. It follows both child links and pointer links, but conceptually, these operations could be separate. One can always knit the child links first and pipe through something that knits the pointer links, giving more flexibility. We have implemented separate "knits" for child and pointer links as command line utilities with a fast implementation based on LT XML2 (an upcoming upgrade to LT XML). These are currently (Nov 04) available on the Edinburgh DICE system in /group/ltg/projects/lcontrib/bin as lxinclude (for children) and lxnitepointer (for pointers).

lxinclude -t nite FILENAME reads from the named file (which is really a URL) or from standard input, writes to standard output, and knits child links. (The "-t nite" is required because this is a fuller XInclude implementation; this parameterizes for NXT links). If you haven't used the default nite:child links, you can pass the name of the tag you used with -l, using -xmlns to declare any required namespacing for the link name:

lxinclude -xmlns:n=http://example.org -t nite -l n:mychild

This can be useful for recursive tracing of pointer links if you happen to know that they do not loop. Technically, the -l argument is a query to allow for constructions such as -l '*[@ischild="true"]'.

Similarly,

lxnitepointer FILENAME

will trace pointer links, inserting summary traces of the linked elements.

When LT XML2 is released, we will consider what the best option is for making them available to NXT users outside Edinburgh.

Using stylesheet extension functions

As a footnote, LT XML2 contains a stylesheet processor, and we're experimenting with implementing extension functions that resolve child and pointer links with less pain than the mechanism given in knit.xsl; this is very much simpler syntactically and also faster, although not as fast as the LT XML2 based implementation of knit. This approach could be useful for building tailored trees and is certainly simpler than writing stylesheets without the extension functions. Edinburgh users can try it as

/group/ltg/projects/lcontrib/bin/lxtn -s STYLESHEET XMLINPUTFILE

where a stylesheet to knit children would look like this (and do look, it's impressively simple compared to without the extension function).

We're not quite sure what to do with this. We do not currently intend to try the natural next step, an extension function that finds the (multiple) parents of a given node, because this is much harder to implement efficiently. For this reason, even if we release it, this approach will not have the same flexibility as using Java and navigation in the NOM. As it is, though, if one needs something a bit like knit but can't just knit, this tidies up the stylesheet considerably and speeds up the processing. The problem is that we aren't sure enough people would use this to make it worth the effort to release it, especially since the implementation depends on the stylesheet processor. If you have an opinion about the utility of a release or which stylesheet processor most NXT users prefer, please tell us.

Evaluating a query and saving the result

If you evaluate a query and save the query results as XML, you will get a tree structure of matchlists and matches with nite:pointers at the leaves that point to data elements. Sometimes this is the best way to get the tree-structured cut of the data you want, since it makes many data arrangements possible that don't match the corpus design and therefore cannot be obtained by knitting.
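A saved result tree of this kind is easy to read with general XML tools. The Python sketch below assumes a layout in which each match element holds one nite:pointer per query variable, with the variable name in a role attribute and the target in href; check these names, and the namespace URI, against a file saved from your own installation before relying on them.

```python
import xml.etree.ElementTree as ET

NITE = "http://nite.sourceforge.net/"  # assumed namespace URI
POINTER = f"{{{NITE}}}pointer"

# Hypothetical saved result for a query over two variables, $w and $s.
SAVED = f"""
<matchlist xmlns:nite="{NITE}" size="2">
  <match n="1">
    <nite:pointer role="w" href="words.xml#id(w1)"/>
    <nite:pointer role="s" href="syntax.xml#id(np1)"/>
  </match>
  <match n="2">
    <nite:pointer role="w" href="words.xml#id(w2)"/>
    <nite:pointer role="s" href="syntax.xml#id(np1)"/>
  </match>
</matchlist>
"""


def read_matches(root):
    """Return one {variable: href} dictionary per match element."""
    rows = []
    for match in root.iter("match"):
        # findall with a plain tag only looks at direct children, so the
        # pointers of any nested matchlist stay with their own matches.
        rows.append({p.get("role"): p.get("href") for p in match.findall(POINTER)})
    return rows


matches = read_matches(ET.fromstring(SAVED))
for row in matches:
    print(row)
```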

The query engine API includes (and the search GUI exposes) an option for exporting query results not just to XML but to Excel format. We recommend caution in exercising this option, especially where further processing is required. For simple queries with one variable, the Excel data is straightforward to interpret, with one line per variable match. For simple queries with n variables, each match takes up n spreadsheet rows, and there is no way of finding the boundaries between n-tuples except by keeping track (for instance, using modular arithmetic). This isn't so much of a problem for human readability, but it does make machine parsing more difficult. For complex queries, in which the results from one query are passed through another, the leaves of the result tree are presented in left-to-right depth-first order of traversal, and even human readability can be difficult. Again, it is possible to keep track whilst parsing, but between that and the difficulty of working with Excel data in the first place, it's often best to stick to XML.
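For the simple n-variable case, the bookkeeping amounts to cutting the row list into consecutive groups of n. A minimal Python sketch, assuming the rows have already been read out of the spreadsheet into a list (the row contents here are invented):

```python
def group_matches(rows, n):
    """Split a flat list of exported rows into n-tuples, one per match."""
    if n <= 0 or len(rows) % n != 0:
        raise ValueError("row count is not a multiple of the number of variables")
    return [tuple(rows[i:i + n]) for i in range(0, len(rows), n)]


# Hypothetical export of a two-variable query: rows alternate $w, $s.
rows = ["w1", "np1", "w2", "np1", "w3", "np2"]
print(group_matches(rows, 2))
```

This does not help with complex queries, where the number of rows per result is not fixed.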

Second step: process the tree

Stylesheets

This is the most standard XML transduction mechanism. There are some stylesheets in the lib directory that could be useful as is, or as models: knit.xsl itself, and attribute-extractor.xsl, which can be used in conjunction with SaveQueryResults and knit to extract a flat list of attribute values for some matched query variable (available from Sourceforge CVS from 2 July 04, will be included in NXT-1.2.10).

This option is most attractive:

for those who write stylesheets anyway (since they know XSLT)
for operations that can primarily be carried out on one coding at a time, or on knitted trees, or on query language result trees, limiting the number and complexity of the document calls required
for applications where the output requires something which is not supported in query but is supported in XSLT (like arithmetic)

Xmlperl

Xmlperl gives a way of writing pattern-matching rules on XML input but with access to general perl processing in the action part of the rule templates.

This option is most attractive:

for those who write xmlperl or at least perl anyway
for operations that can be carried out on one coding at a time, or on knitted trees, or on query language result trees
for applications where the output requires something which is not supported in query (like arithmetic)
for applications where XSLT's variables provide insufficient state information
for applications where bi-directional communication with an external process is needed (for instance, to add part-of-speech tags to the XML file), since this is easiest to set up in xmlperl
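For readers who do not use xmlperl, the round trip in the last item can be approximated in other languages. The Python sketch below is a batch version, not true streaming: it sends every word to an external command in one go and reads back one tag per word. The element name w, the attribute name pos, and the stand-in tagger are all invented for the example.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET

# Stand-in for a real tagger (hypothetical): reads one word per line and
# writes one tag per line.
STUB_TAGGER = [
    sys.executable,
    "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print('NNP' if line[:1].isupper() else 'NN')\n",
]


def add_tags(root, command, word_tag="w", attr="pos"):
    """Send each word's text to an external command and store its answers."""
    words = list(root.iter(word_tag))
    text = "".join(f"{(w.text or '').strip()}\n" for w in words)
    done = subprocess.run(command, input=text, capture_output=True,
                          text=True, check=True)
    tags = done.stdout.split()
    if len(tags) != len(words):
        raise ValueError("tagger returned a different number of tags than words")
    for word, tag in zip(words, tags):
        word.set(attr, tag)
    return root


tagged = add_tags(ET.fromstring("<words><w>Edinburgh</w><w>rain</w></words>"),
                  STUB_TAGGER)
print(ET.tostring(tagged, encoding="unicode"))
```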

Xmlperl is quite old now. There are many XML modules for perl that could be useful but we have little experience of them.

After the LT XML2 release, see also lxviewport, which will be another mechanism for communication with external processes.

ApplyXPath/Sggrep

There are some simple utilities that apply a query to XML data and return the matches, like ApplyXPath (an Apache sample) and sggrep (part of LT XML). Where the output required is very simple, these will often suffice.
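The same kind of one-shot matching is available from most scripting languages. As an illustration, the Python standard library supports a limited subset of XPath; the sketch below uses it on an invented fragment and is not a substitute for either utility's full query syntax.

```python
import xml.etree.ElementTree as ET


def grep_xml(xml_text, path):
    """Return the serialized elements that match a (limited) XPath expression."""
    root = ET.fromstring(xml_text)
    return [ET.tostring(e, encoding="unicode").strip() for e in root.findall(path)]


# Hypothetical coding fragment.
SAMPLE = '<words><w pos="NN">dog</w><w pos="VB">barks</w><w pos="NN">cat</w></words>'
hits = grep_xml(SAMPLE, ".//w[@pos='NN']")
for hit in hits:
    print(hit)
```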

Using lxreplace

This is another transduction utility available to Edinburgh DICE users that is likely to be distributed more widely with LT XML2. It is implemented over LT XML2's stylesheet processor, but the same functionality could be implemented over some other processor.

lxreplace -q query -t template

"template" is an XSLT template body, which is instantiated to replace the nodes that match "query". The stylesheet has some pre-defined entities to make the common cases easy:

&this; expands to a copy of the matching element (including its attributes and children)
&attrs; expands to a copy of the attributes of the matching element
&children; expands to a copy of the children of the matching element

Examples:

To wrap all elements "foo" whose attribute "bar" is "unknown" in an element called "bogus":

lxreplace -q 'foo[@bar="unknown"]' -t '<bogus>&this;</bogus>'

(that is, replace each matching foo element with a bogus element containing a copy of the original foo element).

To rename all "foo" elements to "bar" while retaining their attributes:

lxreplace -q 'foo' -t '<bar>&attrs;&children;</bar>'

(that is, replace each foo element with a bar element, copying the attributes and children of the original foo element).

To move the (text) content of all "foo" elements into an attribute called "value" (assuming that the foos don't have any other attributes):

lxreplace -q 'foo' -t '<foo value="{.}"/>'

(that is, replace each foo element with a foo element whose value attribute is the text value of the original foo element).
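For those without access to lxreplace, the first example (wrapping every foo whose bar attribute is "unknown" in a bogus element) can be imitated with a short script. The Python sketch below is a stand-in written for illustration, not a description of how lxreplace works; it handles only a tag-plus-attribute test rather than a general query, and it never wraps the root element.

```python
import xml.etree.ElementTree as ET


def wrap_matching(root, tag, attr, value, wrapper):
    """Wrap each descendant <tag attr="value"> of root in a new <wrapper>."""
    # Snapshot the elements first so that the tree can be edited safely.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == tag and child.get(attr) == value:
                box = ET.Element(wrapper)
                # Text following the element stays after the wrapper.
                box.tail, child.tail = child.tail, None
                parent.remove(child)
                box.append(child)
                parent.insert(index, box)
    return root


doc = ET.fromstring('<doc><foo bar="unknown">a</foo><foo bar="known">b</foo></doc>')
wrapped = ET.tostring(wrap_matching(doc, "foo", "bar", "unknown", "bogus"),
                      encoding="unicode")
print(wrapped)
```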

Third step: add the changed tree back in

Again based on LT XML2, and still to be released, we have developed a command line utility that can "unknit" a knitted file back into the original component parts. On DICE,

/group/ltg/projects/lcontrib/bin/lxniteunknit -m METADATA FILE

Lxniteunknit does not include a command line option for identifying the tags used for child and pointer links because it reads this information from the metadata file. With lxniteunknit, one possible strategy for adding information to a corpus is to knit a view with the needed data, add information straight in the knitted file as new attributes or a new layer of tags, change the metadata to match the new structure, and then unknit.

Another popular option is to keep track of the data edits by id of the affected element and splice them into the original coding file using a simple perl script.
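The splicing step is simple in any scripting language. Here is a minimal sketch in Python rather than perl, assuming that the edits have been collected as a mapping from element id to new attribute values and that ids are stored in nite:id under the namespace URI shown (both are assumptions; pass a different id_attr if your corpus differs).

```python
import xml.etree.ElementTree as ET

NITE = "http://nite.sourceforge.net/"  # assumed namespace URI
ET.register_namespace("nite", NITE)


def splice(coding_root, edits, id_attr=f"{{{NITE}}}id"):
    """Add or overwrite attributes on the elements named in edits.

    edits maps an element id to a dictionary of attribute values.
    """
    pending = dict(edits)
    for elem in coding_root.iter():
        new_attrs = pending.pop(elem.get(id_attr), None)
        if new_attrs:
            elem.attrib.update(new_attrs)
    if pending:
        raise KeyError(f"ids not found in coding: {sorted(pending)}")
    return coding_root


# Hypothetical coding file and edits produced by an external tagger.
coding = ET.fromstring(
    f'<words xmlns:nite="{NITE}">'
    '<w nite:id="w1">dogs</w><w nite:id="w2">bark</w></words>'
)
splice(coding, {"w1": {"pos": "NNS"}, "w2": {"pos": "VBP"}})
print(ET.tostring(coding, encoding="unicode"))
```

Raising an error for ids that were not found makes it harder to splice edits into the wrong coding file by accident.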

Option 3: Process using other XML-aware software

NXT files can be processed with any XML-aware software, though the semantics of the standoff links between files will not be respected. Most languages have their own XML libraries; under the hood, NXT uses the Apache XML Java libraries. We sometimes use the XML::XPath module for perl, particularly in our import scripts where XSLT would be inefficient or difficult to write.

Last modified 04/13/06