Suppose that you have data in NXT format, and you need to produce some other format from part or all of it: a tailored HTML display, say, or input to some external process such as a machine learning algorithm or a statistical package. There are endless ways of doing such tasks, and it isn't always clear which mechanism is best for a particular application (not least because it can depend on personal preference). Here we walk you through some of the ones we use.
The hardest case for data processing is where the external process isn't the end of the matter, but creates some data that must then be re-imported into NXT. (Think, for instance, of the task of part-of-speech tagging or chunking an existing corpus of transcribed speech.) In the discussion below, we include comments about this last step of re-importation, but it isn't required for most data processing applications.
Often the best option is to write a Java program that loads the data into a NOM and uses the NOM API to navigate it, writing output as you go. For this, the iterators in the NOM API are useful; there are ones, for instance, that run over all elements with a given name or over individual codings. It's also possible from within an application to evaluate a query on the loaded NOM and iterate over the results within the full NOM, not just the tree that saving XML from the query language exposes. (Many of the applications in the sample directory both load and iterate over query results, so it can be useful to borrow code from them.) For re-importation, we don't have much experience of making Java communicate with programs written in other languages (such as the streaming of data back and forth that might be required to add, say, part-of-speech tags), but we know this is possible and that users have, for instance, made NXT-based applications communicate with processes running in C (but for other purposes).
This option is most attractive:
Since XML processing is oriented around trees, constructing a tree that contains the data to be processed, in XML format, opens up the data set to all of the usual XML processing possibilities.
Individual NXT codings and corpus resources are, of course, tree structures that conveniently already come in XML files. Often these files are exactly what you need for processing anyway, since they gather together like information into one file. Additionally, there are the following ways of making trees from the data.
By "knitting", we mean the process of creating a larger tree than that in an individual coding or corpus resource by traversing over child or pointer links and including what is found. Knitting an XML document from an NXT data set performs a depth-first left-to-right traversal of the nodes in a virtual document made up by including not just the XML children of a node but also the out-of-document children links (usually pointed to using nite:child and nite:pointer, respectively, although after 05 May 04 this is configurable). In the data model, tracing children is guaranteed not to introduce cycles, so the traversal recurses on them; however, following links could introduce cycles, so the traversal is truncated after the immediate node pointed to has been included in the result tree. For pointers, we also insert a node in the tree between the source and target of the link that indicates that the subtree derives from a link and shows the role. The result is one tree that starts at one of the XML documents from the data set, cutting across the other documents in the same way as the "^" operator of the query language, and including residual information about the pointer traces. At May 2004, we are considering separating the child and pointer tracing into two different steps that can be pipelined together, for better flexibility, and changing the syntax of the element between sources and targets of links.
Knit.xsl, from NXT's lib directory, is a stylesheet that can be used to knit NXT format data. It recurses down child links, incorporating trace summaries of pointer links as it encounters them. Stylesheet processor installations vary locally. Some people use Xalan, which happens to be redistributed with NXT. It can be used to run a stylesheet on an XML file as follows. With the stylesheet knit.xsl (distributed in lib) copied into the same directory as the data:
java org.apache.xalan.xslt.Process -in INFILE -xsl STYLESHEET
The default linkstyle is LTXML, the default id attribute is nite:id, the default indication of an out-of-file child is nite:child, and the default indication of an out-of-file pointer is nite:pointer. These can be overridden using the parameters linkstyle, idatt, childel, and pointerel, respectively, and so for example if the corpus is not namespaced and uses xpointer links,
java org.apache.xalan.xslt.Process -in INFILE -xsl STYLESHEET -param linkstyle xpointer -param idatt id -param childel child -param pointerel pointer
There's a known problem between some versions of Xalan and some installations of java 1.4 that means sometimes this doesn't work; the fix is documented elsewhere (although some people just back off to java 1.3, if they have it). There are lots of other stylesheet processors around, like Saxon and jd.xslt.
In NXT-1.2.6 and before (pre 05 May 04), the use of nite:id, nite:child, and nite:pointer is hardwired, ranges don't work, and there are separate stylesheets for the two link styles: knit.xsl for xpointer and knit.ltxml.xsl for LTXML.
A minor variant of this approach is to edit knit.xsl so that it constructs a tree that is drawn from a path that could be knitted, and/or so that it uses document() calls to pull in off-tree items. The less the desired output matches a knitted tree, and especially the more outside material it pulls in, the harder this is. Also, if a subset of the knitted tree is what's required, it's often easier to obtain it by post-processing the output of knit.
Knit.xsl is painfully slow. It follows both child links and pointer links, but conceptually, these operations could be separate. One can always knit the child links first and pipe through something that knits the pointer links, giving more flexibility. We have implemented separate "knits" for child and pointer links as command line utilities with a fast implementation based on LT XML2 (an upcoming upgrade to LT XML). These are currently (Nov 04) available on the Edinburgh DICE system in /group/ltg/projects/lcontrib/bin as lxinclude (for children) and lxnitepointer (for pointers).
lxinclude -t nite FILENAME reads from the named file (which is really a URL) or from standard input, writes to standard output, and knits child links. (The "-t nite" is required because this is a fuller XInclude implementation; this parameterizes for NXT links). If you haven't used the default nite:child links, you can pass the name of the tag you used with -l, using -xmlns to declare any required namespacing for the link name:
lxinclude -xmlns:n=http://example.org -t nite -l n:mychild
This can be useful for recursive tracing of pointer links if you happen to know that they do not loop. Technically, the -l argument is a query to allow for constructions such as -l '*[@ischild="true"]'.
Similarly,
lxnitepointer FILENAME
will trace pointer links, inserting summary traces of the linked elements.
When LT XML2 is released, we will consider what the best option is for making them available to NXT users outside Edinburgh.
/group/ltg/projects/lcontrib/bin/lxtn -s STYLESHEET XMLINPUTFILE
where a stylesheet to knit children would look like this (and do look, it's impressively simple compared to one written without the extension function).
We're not quite sure what to do with this. We do not currently intend to try the natural next step, an extension function that finds the (multiple) parents of a given node, because this is much harder to implement efficiently. For this reason, even if we release it, this approach will not have the same flexibility as using Java and navigation in the NOM. As it is, though, if one needs something a bit like knit but can't just knit, this tidies up the stylesheet considerably and speeds up the processing. The problem is that we aren't sure enough people would use this to make it worth the effort to release it, especially since the implementation depends on the stylesheet processor. If you have an opinion about the utility of a release or which stylesheet processor most NXT users prefer, please tell us.
If you evaluate a query and save the query results as XML, you will get a tree structure of matchlists and matches with nite:pointers at the leaves that point to data elements. Sometimes this is the best way to get the tree-structured cut of the data you want, since it makes many data arrangements possible that don't match the corpus design and therefore cannot be obtained by knitting.
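As an illustration of working with such a saved result, the following Python sketch flattens a result tree into one row per match. The exact layout of the saved XML is assumed here, not taken from the NXT documentation: the matchlist and match element names follow the description above, while the namespace URI, the role attribute, and the link syntax are assumptions to check against a real result file. Only simple (non-nested) results are handled.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the nite: prefix; check it against your own files.
NITE_NS = "http://nite.sourceforge.net/"

# Hypothetical saved result for a query with two variables, w and np.
RESULT = f"""<matchlist xmlns:nite="{NITE_NS}">
  <match>
    <nite:pointer role="w" href="words.xml#id(w1)"/>
    <nite:pointer role="np" href="syntax.xml#id(np1)"/>
  </match>
  <match>
    <nite:pointer role="w" href="words.xml#id(w3)"/>
    <nite:pointer role="np" href="syntax.xml#id(np2)"/>
  </match>
</matchlist>"""


def match_rows(result_xml):
    """Return one dict per match, mapping each pointer's role to its href."""
    root = ET.fromstring(result_xml)
    pointer_tag = f"{{{NITE_NS}}}pointer"
    rows = []
    for match in root.iter("match"):
        rows.append({p.get("role"): p.get("href") for p in match.findall(pointer_tag)})
    return rows


rows = match_rows(RESULT)
for row in rows:
    print(row)
```

Each row can then be written out in whatever form the external process expects, with the hrefs resolved against the coding files if the element content is needed.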
The query engine API includes (and the search GUI exposes) an option for exporting query results not just to XML but to Excel format. We recommend caution in exercising this option, especially where further processing is required. For simple queries with one variable, the Excel data is straightforward to interpret, with one line per variable match. For simple queries with n variables, each match takes up n spreadsheet rows, and there is no way of finding the boundaries between n-tuples except by keeping track (for instance, using modular arithmetic). This isn't so much of a problem for human readability, but it does make machine parsing more difficult. For complex queries, in which the results from one query are passed through another, the leaves of the result tree are presented in left-to-right depth-first order of traversal, and even human readability can be difficult. Again, it is possible to keep track whilst parsing, but between that and the difficulty of working with Excel data in the first place, it's often best to stick to XML.
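The bookkeeping for simple n-variable queries can be sketched as follows. This Python fragment assumes the spreadsheet rows have already been read into a list by some other means, and that n is known from the query; the function name and the sample values are hypothetical.

```python
def group_matches(rows, n):
    """Split a flat list of rows into consecutive n-tuples, one per match.

    Row i belongs to match i // n and holds the value of variable i % n.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(rows) % n != 0:
        raise ValueError("row count is not a multiple of n; rows may be missing")
    return [tuple(rows[i:i + n]) for i in range(0, len(rows), n)]


# Hypothetical export of a two-variable query: rows alternate between the variables.
flat = ["w1", "np1", "w3", "np2"]
matches = group_matches(flat, 2)
print(matches)
```

The length check is the only safeguard available, since nothing in the rows themselves marks where one match ends and the next begins.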
This option is most attractive:
Xmlperl gives a way of writing pattern-matching rules on XML input but with access to general perl processing in the action part of the rule templates.
This option is most attractive:
Xmlperl is quite old now. There are many XML modules for perl that could be useful but we have little experience of them.
After the LT XML2 release, see also lxviewport, which will be another mechanism for communication with external processes.
lxreplace -q query -t template
"template" is an XSLT template body, which is instantiated to replace the nodes that match "query". The stylesheet has some pre-defined entities to make the common cases easy:Examples:
To wrap all elements "foo" whose attribute "bar" is "unknown" in an element called "bogus":
lxreplace -q 'foo[@bar="unknown"]' -t '
To rename all "foo" elements to "bar" while retaining their attributes:
lxreplace -q 'foo' -t '
To move the (text) content of all "foo" elements into an attribute called "value" (assuming that the foos don't have any other attributes):
lxreplace -q 'foo' -t '
(that is, replace each foo element with a foo element whose value attribute is the text value of the original foo element).
Again based on LT XML2, and still to be released, we have developed a command line utility that can "unknit" a knitted file back into the original component parts. On DICE,
/group/ltg/projects/lcontrib/bin/lxniteunknit -m METADATA FILE
Lxniteunknit does not include a command line option for identifying the tags used for child and pointer links because it reads this information from the metadata file. With lxniteunknit, one possible strategy for adding information to a corpus is to knit a view with the needed data, add information straight in the knitted file as new attributes or a new layer of tags, change the metadata to match the new structure, and then unknit.
Another popular option is to keep track of the data edits by id of the affected element and splice them into the original coding file using a simple perl script.
NXT files can be processed with any XML aware software, though the semantics of the standoff links between files will not be respected. Most languages have their own XML libraries: under the hood, NXT uses the Apache XML Java libraries. We sometimes use the XML::XPath module for perl, particularly on our import scripts where XSLT would be inefficient or difficult to write.
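The id-keyed splicing mentioned above can be sketched in a few lines. The example below uses Python and its standard XML library in place of perl, and assumes a non-namespaced id attribute called id and edits that only add or overwrite attributes; the function name, the pos attribute, and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def splice(coding_xml, edits, id_attr="id"):
    """Add the attributes in edits[ID] to the element whose id attribute is ID."""
    root = ET.fromstring(coding_xml)
    pending = dict(edits)
    for el in root.iter():
        new_attrs = pending.pop(el.get(id_attr), None)
        if new_attrs:
            el.attrib.update(new_attrs)
    if pending:
        # Every edit should have found its element; report the ones that did not.
        raise KeyError("no element found for ids: " + ", ".join(sorted(pending)))
    return ET.tostring(root, encoding="unicode")


# Hypothetical coding file and part-of-speech tags produced by an external tagger.
coding = '<words><w id="w1">the</w><w id="w2">cat</w></words>'
tags = {"w1": {"pos": "DT"}, "w2": {"pos": "NN"}}
updated = splice(coding, tags)
print(updated)
```

For a namespaced id attribute, id_attr would need to be given in the qualified form that the XML library expects, and the metadata would still have to be changed by hand to declare any new attribute.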
Last modified 04/13/06