NITE XML Toolkit - Using the query language for analysis

This page describes the various utilities for searching a corpus using the NXT's dedicated query language.

Preliminaries

There are two basic methods for using the search utilities; .bat/.sh files (.bat for Windows, .sh for Unix/Linux), and the command line. Command line examples below are given for bash and are known to work under cygwin (which comes with a bash that runs under Windows). There will be DOS equivalents but it hasn't been a priority for us to figure out what they are, and in particular (we think) redirection is less flexible there.

Setting the classpath

The .bat/.sh files are set up so they just need to be run to work. The command line utilities require the classpath environment variable to be set up so that the shell can find the software. Assuming you use them from the top level directory in which the software is installed, this can be done as follows:

if [ $OSTYPE = 'cygwin' ]; then
	export CLASSPATH=".;lib;lib/nxt.jar;lib/jdom.jar;lib/JMF/lib/jmf.jar;lib/pnuts.jar;lib/resolver.jar;lib/xalan.jar;lib/xercesImpl.jar;lib/xml-apis.jar;lib/jmanual.jar;lib/jh.jar;lib/helpset.jar;lib/poi.jar"
else
	export CLASSPATH=".:lib:lib/nxt.jar:lib/jdom.jar:lib/JMF/lib/jmf.jar:lib/pnuts.jar:lib/resolver.jar:lib/xalan.jar:lib/xercesImpl.jar:lib/xml-apis.jar:lib/jmanual.jar:lib/jh.jar:lib/helpset.jar:lib/poi.jar"
fi

This syntax assumes the bash shell; other shells use a different syntax.

Then, for example,

java CountQueryResults -corpus Data/meta/swbd-metadata.xml -query '($n nt):'

Shell interactions

You'll need to be careful to use single quotes at shell level and double quotes within queries - although we've found one shell environment that requires the quotes the other way around. Getting the quoting to work correctly in a shell script is difficult even for long-time Unix users. This example shell script worked for one user's specific needs running under Cygwin.

Don't forget you can use redirection to get rid of all the pesky warning and log messages:

java CountQueryResults -corpus Data/meta/swbd-metadata.xml -query '($n nt):' 2> logfile

java CountQueryResults -corpus Data/meta/swbd-metadata.xml -query '($n nt):' 2> /dev/null

Memory usage

It is possible to increase the amount of memory available to java for processing, and depending on the machine set up, this may speed things up. This can be done by using flags to java, e.g.

java -Xincgc -Xms127m -Xmx512m -Xfuture CountQueryResults ...

but also as an edit to the java calls in any of the existing scripts. This is what they mean:

-Xincgc incremental garbage collection (get back unused memory) -Xms127m use an initial memory heap size of 127 MB -Xmx512m use a maximum memory heap size of 512 MB

It's possible to use other numbers for -Xms and -Xmx, and perhaps the values given here aren't always appropriate, or are lower than some machine defaults.

Mac users

Enough of NXT has been tested or used successfully under Mac OSX that we just believe it to work. Mac OSX didn't present any particular problems (for instance, the .sh scripts work) except in making it possible to run them by double-clicking on them. This required us to write specialist Mac scripts that call the .sh scripts using AppleScript. This is an example of one but don't try to look at it in a text editor; it opens in the Applescript editor. Keep in mind with scripts like these that you have to tailor them to your installation by specifying correct paths. If you can contribute a better paragraph explaining what Mac users need to do, we'd be grateful for it - it's hard for us to find Macs with normal setups to experiment on. If you need more help than this hint, please ask and if we can, we'll reconstruct the process.

Messages at the shell

At May 2004, NXT, and the data loading routines in particular, produce a great many progress and warning messages on System.err - enough so that it can be difficult to tell what's a real problem. We expect to consider strategies for improving this.

Graphical interfaces

Any corpus in NITE data format is immediately amenable to two different graphical interfaces that allow the corpus to be searched, even without writing tailored programs. The first is a simple search GUI, and the second is a generic data display program that works in tandem with the search GUI to highlight search results.

The Search GUI

The search GUI can be reached either by using search.bat/search.sh and specifying which corpus to load or by using the .bat/.sh for the specific corpus (if it exists) and choosing the "search" option. It has two tabbed winodws. The query tab allows the user to type in a query. Cut and paste from other applications works with this window. The query can also be saved on the bookmark menu, but at May 2004 this doesn't work well for long queries. There is a button to press to do the search, which automatically takes the user either to a pop-up window with an error message explaining where the syntax of the query is incorrect, or, for a valid query, to the result tab. This window shows the results as an XML tree structure, with more information about the element the user has selected (with the mouse) displayed below the main tree.

The GUI includes an option to save the XML result tree to a file. This can be very handy in conjunction with "knit" for performing data analysis. It also includes an option to save the results in a rudimentary Excel spreadsheet. This is less handy, especially in the case of complex queries, because the return value is hierarchically structured but the spreadsheet just contains information about each matched element dumped into a flat list by performing a depth-first, left-to-right traversal of the results. However, for relatively simple queries and people who are used to data filtering and pivot tables in Excel, it can be the easiest first step for analysis.

The search GUI works on an entire corpus at once. This can make it slow to respond if the corpus is very large or if the query is very complicated (although of course it's possible to comment out observations in the metadata to reduce the amount of information it loads). Sometimes a query is slow because it's doing something more complicated than what the user intended. A query can be interrupted mid-processing and will still return a partial result list, which can be useful for checking it.

At May 2004, when the user chooses to open a corpus from the File menu, the search GUI expects the metadata file to be called something.corpus, although many users are likely to have it called something.xml (so that it behaves properly in other applications like web browsers). Choose the "all files" option (towards the bottom of the open dialogue box) in order to see .xml files as well as .corpus ones.

The Generic Display

Unlike the search GUI, the generic display utility works on one observation at a time. It can be reached either by using the .bat/.sh for the specific corpus (if it exists) or by invoking it at the command line:

java net.sourceforge.nite.gui.util.GenericDisplay -c CORPUS -o OBS

Where CORPUS gives a path to a metadata file and OBS names an observation that is listed in that metadata file.

It simply puts up an audio/video window for each signal associated with an observation, plus one window per coding that shows the elements in an NTextArea, one element per line, with indenting corresponding to the tree structure and a rendering of the attribute values, the PCDATA the element contains, and enough information about pointers to be able to find their targets visually on the other windows. It doesn't try to do anything clever about window placement. There is a search menu on the interface that will pull up the search GUI and when the user selects an element on the search result tab, will highlight the corresponding part of the data display. This facility can be extremely useful in formulating queries!

It would be possible to improve this interface to make it much more useful (the Javadoc comments at the beginning of the code for it give our ideas about that, and it would make a nice student project). It's nice to have something that works for NITE format data out of the box, but a generic display will never be as good as writing one that is tailored for the data set.

Tailored displays

Some of the NITE data samples come with program samples that build tailored data displays or graphical user interfaces that allow the data to be coded. These programs can easily be modified to have the search highlighting capabilities of the generic display, and some of them already do. Use the implementation of the generic display utility, or of an existing sample, as a guide. At May 2004, programs that use the search menu will only show highlighting on NTextAreas because they are the only display object to implement the required interface. We expect this situation to improve.

In the tailored displays, if there isn't a direct representation of some element on the display, then there's nothing to highlight. For instance, in many data sets timestamped orthographic transcription consists of and elements but the elements are not rendered in the display, so the query "($s sil):" won't cause any highlighting to occur. This can be confusing but it is the correct behaviour. Good interface design will have a screen rendering for any elements of theoretical importance.

Command line utilities

We've written quite a few command line search utilities as our own projects have needed them; none of them are very complicated. We probably won't continue to ship with all of them in the release because there are so many and they are quite similar. The ones that aren't in the release can be found in the samples directory of the CVS; place them in the samples directory from the release and then from the top level directory, compile in the usual way, e.g. javac -d lib samples/CountQueryResults.java

SaveQueryResults

This is a command line interface for saving the query results as an XML tree (the same tree that's displayed in the search GUI). It takes the following arguments:

-corpus CORPUS -observation OBS -query QUERY -filename OUTFILENAME -directory DIRNAME -independent or -allatonce

where CORPUS is the location of the metadata file, OBS is an observation name (if not given, loads all observations listed in the metadata file), and QUERY is a query expressed in NQL. If -independent is indicated, then it saves one result file per observation with the query evaluated independently on each one; if -allatonce is indicated, then the entire corpus is loaded at once, with one output file saved. -independent is faster but -allatonce is necessary if queries draw context from outside single observations. In distributions before 05 May 2004 (1.2.6 or earlier), the default is -allatonce, but after that, it was changed to -independent to be like the other command line utilities. If no filename is indicated, the output goes to System.out. (Note that this isn't very sensible to do in conjunction with -independent because the output will just concatenate separate XML documents.) Everything else that could potentially be on System.out is redirected to System.err.

If a filename is indicated, the output ends up in the directory named by DIRNAME. It ends up in OUTFILENAME unless -independent is indicated, in which case that filename is prefixed with the name of the observation and a full stop (.). -independent is ignored if -observation OBS is indicated (i.e., the output is saved without prefixing the filename).

Under cygwin, -d takes Windows-style directory naming; e.g., -d "C:", not -d "/cygdrive/c". Using the latter will create the unexpected C:/cygdrive/c. This may be configurable at the system level. At May 2004, directory naming has not been tested on other platforms.

CountQueryResults

This is a command line interface for counting query results for an entire corpus that doesn't show a result tree but just outputs the number of matches (which is good for embedding in scripts). It takes the following arguments:

-corpus CORPUS -observation OBS -query QUERY -allatonce

which are handled as in SaveQueryResults. In the case of complex queries, the counts reflect the number of top level matches (i.e., matches to the first query that survive the filtering performed by the subsequent queries - matches to a subquery drop out if there are no matches for the next query).

When -allatonce or -observation, the result is a bare count; otherwise, it is a table, one line per observation, with observation name, whitespace, and then the count.

In our experience, queries can always be rewritten so that this is the number one wants. That's a whole other story, but here are some quick hints:

To save confusion, complex queries where the first subquery has one variable are safer than those that match n-tuples, since in the latter case the same variable binding could be represented in several n-tuples.
We are not aware of processing costs involved in splitting a query into a complex query that yields the same results (apart from tree transformation) as long as the splitting doesn't require more variables to be bound (especially using forall/exists), and the complex versions are always easier to understand.
Combine CountQueryResults with command line scripting, for instance, to fill in possible attribute values from an enumerated list.

Note that saved query results can be knit with the corpus to useful effect. In theory, saved query results can also be introduced as data itself by adding an appropriate declaration to the metadata but we tend to use the Index utility for this purpose.

In versions before NXT-1.2.6, CountQueryResults means -allatonce and a separate utility, CountOneByOne, handles the independent case.

MatchInContext

MatchInContext is a command line interface that shows the orthography for some query results, optionally with some surrounding context. We developed it specifically at the request of users who are already familiar with e.g. tgrep. It takes the following arguments:

-corpus CORPUS -observation OBS -query QUERY -context CONTEXT -textatt TEXTATT -allatonce

where the arguments are as before with the following additions:

CONTEXT is an optional query expressed in NQL that expresses the surrounding context one wishes to show the matches in but does not express a relationship to the main query - we add that the first variable in the context query dominates the first variable in the main query. If CONTEXT is not given, no context is shown. The context query must not share variable names with the main query.
TEXTATT is optional and indicates the name of the attribute where the orthography is stored (for when the orthography isn't stored as PCDATA as in the TEI convention - but try this using -textatt pos to get part of speech sequences in corpora that have them, and note that some utility like this might be useful for constructing data for statistical training).

The text is shown on STDOUT. If no context is specified, the text show is that relating to the first named variable in the query. If context is specified, then that same text is shown upcased within the text relating to the context. This can produce puzzling results (like matches with no text or with context but no match text) that in our experience prove to be correct. There is no clean way of knowing where to insert line breaks, speaker attributions, etc. in a general utility such as this one; for better displays write a tailored interface. Canonical usage would use one variable queries for the main and context queries, with the latter expressing simply a type, possibly with some constraints on attribute values. There may be less or more than one context match for a query result; in these cases we comment and show the first match we find (if any).

In versions before NXT-1.2.6, MatchInContext means -allatonce and a separate utility, MatchInContextOneByOne, handles the independent case.

NGramCalc: Calculating N-Gram Sequences

An n-gram is a sequence of n states in a row drawn from an enumerated list of types. For instance, consider Parker's floor state model (Journal of Personality and Social Psychology 1988). It marks spoken turns in a group discussion according to their participation in pairwise conversations. The floor states are newfloor (first to establish a new pairwise conversation), floor (in a pairwise conversation), broken (breaks a pairwise conversation), regain (re-establishes a pairwise conversation after a broken), and nonfloor (not in a pairwise conversation). The possible tri-grams of floor states are newfloor/floor/broken, newfloor/floor/floor, regain/broken/ nonfloor, and so on. We usually think of n-grams as including all ways of choosing a sequence of n types, but in some models, not all of them are possible; for instance, in Parker's model, the bi-gram newfloor/newfloor can't happen. N-grams are frequently used in engineering-oriented disciplines as background information for statistical modelling, but they are sometimes used in linguistics and psychology as well. Computationalists can easily calculate n-grams by extracting data from NXT into the format for another tool, but sometimes this is inconvenient or the user who requires the n-grams may not have the correct skills to do it.

NGramCalc is a utility for calculating n-grams from NXT format data. The command line options for it are a bit complicated because it's very flexible. It can take all elements of a particular type or all those resulting from the first matched variable of a query, as long as they are timed, and put them in order of start time. Then, using as the states either the value of a named attribute, or the names of elements at the end of a named role, or the value of a named attribute on those elements, it will report n-gram frequencies - but it must be possible from the metadata to get the complete list of states for the utility to work. It looks for the list as the enumerated attribute values of the appropriate tag or as the set of codes allowed in the layer a role is declared to point to. Some NXT users effectively have enumerated attribute values but declare them as strings - modifying the declaration to make it explicit will enable the utility.

To call the utility, which is in builds after NXT 1.3.1 but otherwise available from CVS, set the classpath in the usual way, and then use

java NGramCalc -corpus metadata_file_name -observation observation_name -tag tagname -query query -att attname -role rolename -n positive_integer

-corpus is required; it is the path to the metadata file.

-observation is optional; if it is used, the n-grams are calculated over one observation only, and if it is omitted, over all observations listed in the metadata. Although only one set of numbers is reported, NXT loads only one observation at a time when calculating them.

-tag is required; it names the tag to use in finding the state names.

-query is optional; if given, then the program uses matches to the first named variable as the elements from which to derive states. If it is not given, then the query is assumed to match all tags of the type named using -tag. Note that if a query is used, it is possible to have the first named variable use a disjunctive type, but only if the method for deriving states from the elements works for both types and results in the same enumerated list. In this case, either of them can be named in -tag.

-role is optional; if given, rather than looking for the states on the query matches (or named tag if no query was given), the program looks for them on the element found by tracing the named role from there. This level of indirection is useful if the data was produced using one of NXT's configurable end user tools, which tend to point to external corpus resources to get possible annotation values.

-att; if given, uses the value of the named attribute both for finding the possible state names and for finding the actual states. -att is required if -role is omitted, but optional if it is included. If -role is included and -att is omitted, then instead of using attribute values, the states are derived from the element names in the layer pointed to by the named role.

-n is optional; it gives the size of the n-grams. It defaults to 1.

For instance,

java NGramCalc -c METADATA -t turn -a fs -n 3

will calculate trigrams of fs attributes of turns and output a tab-delimited table like

500	newfloor	floor	broken
0	newfloor	newfloor	newfloor

Suppose that the way that the data is set up includes an additional attribute value that we wish to skip over when calculating the tri-grams, called "continued".

java NGramCalc -c METADATA -t turn -a fs -n 3 -q '($t turn):($t@fs != "continued")'

will do this. Entries for "continued" will still occur in the output table because it is a declared value, but will have zero in the entries.

java NGramCalc -c METADATA -t gesture-target -a name -n 3 -q '($g gest):' -r gest-target

will produce trigrams where the states are found by tracing the gest-target role from gest elements, which finds gesture-target elements (canonically, part of some corpus resource), and further looking at the values of their name attributes. Note that in this case, the type given in -t is what results from tracing the role from the query results, not the type returned in the query.

java NGramCalc -c METADATA -t gest -q '($g gest):' -r gest-target

will produce unigrams where the states are named in the elements reached by tracing the gest-target role from gest elements. Again, canonically these would be part of some corpus resource, but in this case the element names themselves are used. Note that in this case, type given in -t and in the query results are the same.

At 21 Feb 05, use of -role without -att is not yet implemented.

We can think of the following changes that could be useful:

allow order to be derived from end time rather than start time
allow order to be derived from a structural order, not just a temporal one, so that the utility could be used on untimed data

We are interested in what other requirements users can find.

FunctionQuery: Time ordered, tab-delimited output, with aggregate functions

Note: FunctionQuery is only available with NXT versions compiled after May 1st 2006 (NXT 1.3.5 upwards). It has been designed to subsume the functionality of SortedOutput.

FunctionQuery is a utility for outputting tab-delimited data. It takes all elements resulting from the result of a query, as long as they are timed, and put them in order of start time. Then it outputs one line per element the values of a the named attributes or aggregates with a tab character between each attribute. The attributes can be qualified with a variable specifier (e.g. $v@attribute). If the variable specifier is omitted, the attribute belonging to the first variable in the query (the "primary variable") is returned.

To call the utility, set the classpath in the usual way, and then use

java FunctionQuery -c metadata_file_name -ob observation_name -q query -atts [attname+]

-corpus is required; it is the path to the metadata file.

-observation is optional: it's the name of the observation. If not present the program cycles through all observations.

-text is an optional flag. If present text of the result element (or any subelements) is output after any attributes. This is kept for backwards compatability and the results are identical to adding the primary variable name (including leading $) to the end of the atts list.

-query is required; the first matched variable from every result forms the basis of the output.

-atts is required; input is expected as a space separated list of attributes or aggregates. Note that if the attribute does not exist for some matched elements, a blank tab-stop will be the output. Attributes are generally of the form: $v@attribute. If $v is omitted, the attribute is taken from the primary variable. If the attribute is omitted, the textual content of $v and its children are returned.

Aggregate functions are idenified by a leading '@' character. There are currenly 4 aggregate functions included in FunctionQuery, described below. The first argument to a function is always a subquery to be evaluated in the context of the current result.

Aggregate Functions:

For the following functions, optional arguments are denoted by an equals sign followed by the default value of that argument. The context query that is the first argument of each of these functions should be of the form:

($v var):$v
test $q && other tests

So that the results of this query are filtered with respect to the main query (represented here by the $q variable). For example $v # $q will return all results that overlap the main query result, whilst $q ^ $v will return all descendents of the current query result.

@count(conquery): returns the number of results from evaluating conquery in the context of the current result of query.
@sum(conquery, attr): returns the sum of the values of attr for all results of conquery evaluated in the context of query. attr should be a numerical attribute.
@extract(conquery, attr, n=0, last=n+1): returns the attr attribute of the nth result of conquery evaluated in the context of query. If n is less than 0, extract returns the attr attribute of the nth last result. If last is provided, the attr value of all results whose index is at least n and less than last is returned. If last is less than 0, it will count back from the final result. If last equals zero, all items between n and the end of the result list will be returned.
@overlapduration(conquery): returns the length of time that the results of conquery overlap with the results of the main query. For some conquery results, this number may exceed the duration of the main query result. For example, the duration of speech for all participants over a period of time may exceed the duration of the time segment where there a multiple simultaneous speakers. This can be avoided, for example, by restricting the conquery to a specific agent.

Example:

java FunctionQuery -c METADATA -ob OBS -q '($m move)' -atts type nite:start nite:end '@count(($w w):$w#$m)' '$m'

will output a sorted list of moves for the observation consisting of type attribute, start and end times, the count of w (words) that overlap each move, and any text included in the move, or any children.

FunctionQuery is designed to be backwards-compatable with SortedOutput.
(note: can use complex query in FunctionQuery to test)

SortedOutput: Time ordered, tab-delimited output

Note: As of May 2006, FunctionQuery is a more general tool that subsumes the functionality of SortedOutput

SortedOutput is a utility for outputting tab-delimited data. It takes all elements resulting from the first matched variable of a query, as long as they are timed, and put them in order of start time. Then it outputs one line per element the values of a the named attributes with a tab character between each attribute.

To call the utility, which is in builds after NXT 1.3.2 but otherwise available from CVS, set the classpath in the usual way, and then use

java SortedOutput -c metadata_file_name -ob observation_name -q query -atts [attname+]

-corpus is required; it is the path to the metadata file.

-observation is optional: it's the name of the observation. If not present the program cycles through all observations.

-text is an optional flag. If present text of the result element (or any subelements) is output after any attributes.

-query is required; the first matched variable from every result forms the basis of the output.

-atts is required; input is expected as a space separated list of attributes. Note that if the attribute does not exist for some matched elements, a blank tab-stop will be the output.

Example:

java SortedOutput -c METADATA -ob OBS -q '($m move)' -atts type nite:start nite:end

will output a sorted list of moves for the observation consisting of type attribute, start and end times.

SearchAndFilter (beta): More flexible time ordered, tab-delimited output

Note: As of May 2006, FunctionQuery is a more general tool that subsumes the functionality of SortedOutput

At Feb 06, there is a beta-release extension of SortedOutput called SearchAndFilter. You can find it in the Sourceforge CVS repository under "samples" but not in the builds. It allows one to specify output fields from more than the first variable match using a syntax like the following:

java SearchAndFilter -c /path/to/swbd-metadata-genitive.xml -o sw2012 -q '($g genitive-phrase)($a nt):$g>$a' -filter '$g@nite:id' '$a@nite:id' '$a'

At 22 Feb 06, there are two known bugs with the beta version - variables declared in second and subsequent queries of a complex pipe are not exposed for filtering, and there is a memory leak causing it to go slower and slower when it iterates over the observations in a corpus. After bug fixing, we will decide whether SearchAndFilter should stay as a separate utility or become the new version of SortedOutput.

Knit

Knit isn't really a search utility like the others, but it's a general utility that will be useful to anyone performing searches. It's related to the general utility that will be familiar to some users from LT XML, modified for the NXT data model and data format. Knit and its uses are described more fully on the data processing page.

Indexing

NXT includes a facility for adding annotation to a corpus based on query matches. To use it, set the classpath as usual. Then use

java Index -c CORPUS -o OBS -q QUERY -t TAG -r ROLE1 ROLE2 ... ROLEM

where CORPUS is the location of a metadata file

OBS is optional, and is the name of an observation. If it is omitted, all observations named in the metadata file are indexed in turn.

QUERY is a query. Let n be the number of unquantified variables in the first subquery (unquantified meaning, without a forall or exists).

TAG is the name of a tag (code). It is optional and defaults to "markable".

ROLE1 to ROLEM are optional, and are the names of roles to use in the indexing.

The -r flag, if present, must be given last.

For each query match, a new tag of type TAG is added to the corpus. If -r is omitted, the new tag is made a parent of the first unquantified variable of the query. If -r is included, then the new tag will contain m pointers which point to the first m unquantified variables in the first subquery, using the given role names in order. m must be less than or equal to n.

Note that this does not restrict the same element to at most one match, even though that's a property we often want our mappings to have. When creating indices of one variable, it is often best to use only one unquantified variable in the first subquery of the query, so that we don't index the same thing more than once (i.e., using further subqueries as a filter, not to retain further variable matches). Also note that Index does not remove existing tags of the same type before operation - it just adds new tags. Making the same call twice will create two indices to the same data, which is usually undesirable, but this enables an index to be built up gradually using several different queries.

The program assumes that a definition for the new tag have already been added into the metadata file. It is usual to put it in a new coding, and it would be a bad idea to put in a layer that anything points to, since no work is done to attach the indices to prospective parents or anything else besides what they index. If the indexing adds parents, then the type of the coding file (interaction or agent) must match the type of the coding file that contains the matches to the first variable. If an observation name is passed, it creates a index only for the one observation; if none is, it indexes each observation in the metadata file by loading one at a time (so this won't work for queries that need comparisons across observations).

The canonical metadata form for an index file, assuming roles are used, is an interaction coding as follows:

            <coding-file name="FOO">
              <featural-layer name="BAZ">
                  <code name="TAGNAME">
                  	<pointer number="1" role="ROLENAME" target="LAYER_CONTAINING_MATCHES"/>
                  </code>
              </featural-layer>
         </coding-file>

The name of the coding file determines the filenames where the indices get stored. The name of the featural-layer is unimportant but must be unique. The tags for the indices must not already be used in some other part of the corpus, including other indices.

An earlier version of this program was called AddMarkables. It only worked for one observation at a time and didn't take the arguments as flags but in a particular order. We've changed the name only because this isn't backward compatible.

Note that if you want one pointer for every named variable in a simple query, or you want tree-structured indices corresponding to the results for complex queries, you can use SaveQueryResults and load the results as a coding. For cases where you could use either, the main difference is that SaveQueryResults doesn't give control over the tag name and roles.

Example of Indexing

To add indices that point to active sentences in the Switchboard data, add the following tag to the metadata as an interaction-coding (i.e., as a sister to the other tags).

<coding-file name="active"> <featural-layer name="active-layer"> <code name="active"> <pointer number="1" role="at"/> </code> </featural-layer> </coding-file>

This specifies that the indices for sw2005 (for example) should go in sw2005.active.xml. Then, for example,

java Index -corpus Data/meta/swbd-metadata.xml -tag active -query '($sent nt):($sent@cat=="S")::($subj nt):($sent^1$subj)&&($subj@cat="NP")&&($subj@subcat=="SBJ"):: ($obj nt):($sent^$obj)&&($obj@cat=="NP")&&(!($obj@subcat))::(forall $obj_par nt):(($sent^$obj_par)&&($obj_par^$obj)&&($obj_par!=$sent)&&($obj_par!=$obj))->($obj_par@cat="VP")::($vp nt):($vp^1$obj) && ($vp@cat=="VP")::(forall $obj2 nt):(($vp^1$obj2) && ($obj2!=$obj) && ($obj2@cat=="NP")&&(!($obj2@subcat))) -> ($obj<>$obj2)::($mSbj markable): ($mSbj >"at" $subj)::($mObj markable): ($mObj >"at" $obj):: (forall $t1 trace):!($subj^1$t1)::(forall $t2 trace):!($obj^1$t2)'

keeping in mind that on your operating system, you may need to swap around the single and double quotes. (We've used this query in the example to show the sort of thing which external users have successfully written in the language solely from reading our documentation - it's not as difficult to write as it looks if you use the hints below.) After indexing,

($s nt)($a active):($a >"at" $s)

gets the active sentences.

Multiple invocations of Index will add more indices to the same index file, not overwrite the existing indices. This can be useful if there are two separate queries you want to run to create the index, or if you want to store multiple indices in the same file (for instance, for active and passive sentences). In the latter case, make sure to add <code> tags for both to the metadata at some time before you run the indexing for passives, e.g.:

<coding-file name="active"> <featural-layer name="active-layer"> <code name="active"> <pointer number="1" role="at"/> </code> <code name="passive"> <pointer number="1" role="at"/> </code> </featural-layer> </coding-file>

Reasons for Indexing

Indexing is useful during data analysis. It is also sometimes useful to create indices for queries that are slow to run. Of course, the faster the initial query, and the less frequently the results are required, the less sense indexing makes. However, there's no harm in creating indexes where you think you might want them, because they are kept in completely different files from the main data; it's easy to throw them away or not load them.

Known Problems

At Feb 05, there are a number of known problems with the current query language implementation.

Multiple observations and timings

There is a bug when querying over multiple observations - the implementation considers times in different observations to be comparable, so that it's possible to get the result that an element in one observation is before some element in another. This is easy to get around: query on one observation at a time, or declare the reserved attribute for observation names for your corpus and add a test for the same observation as an extra query term - e.g. ($f@obs = $g@obs), if the attribute declared is "obs".

Search GUI and forall

The search GUI (whether called stand-alone or from a search menu) can't display results if some subquery in a complex query only has query matches that are bound with "forall" - e.g.

($f foo):($f@att="val")::(forall $g bar):!($g ^ $f)

Immediate Precedence

The immediate precedence operator is missing. Immediate precedence is equivalent to

($f foo)($g foo)(forall $h foo): 
     ($f<<$g) && (($h=$f) || ($h=$g) || ($h<<$f) || ($g<<$h))

but of course this is cumbersome and can be too slow and memory-intensive for practical purposes, depending on the data set. Some common uses of the operator are covered by the NGramCalc utility. Another work-around is to create one XML tree from the NXT data that represents the information required and query it using XPath. Export to LPath and tgrep2 would also be reasonable and are not difficult to implement. If you need to match on regular expressions of XML elements in order to add markup, (so, for instance, saying "find syntactic constituents with one determiner followed by one or more adjectives followed by one noun and wrap a new tag around them"), but you can always use something like fsgmatch (from the LTG; new release, currently in beta, is called lxtransduce) and then modify the metadata to match. Remember, the data is just XML, amenable to all of the usual XML processing techniques.

Arithmetic

The arithmetic operators are missing.

At present, users who need them add new attributes to their data set and then carry on as normal. For instance, a researcher looking at how often bar elements start in the 10 seconds after foo elements end might add an "adjustedstart" attribute to bar elements that take 10 seconds off their official start times, and then use the query

($f foo)($b bar):(START($b) > END($f)) && ($b@adjustedstart < END($foo))

This stylesheet, run on a specific individual coding in the context of the MONITOR project, is an example of how this can be done. It just copies everything, adding new attributes to feedbackgaze codes. We used this general technique on the Switchboard data to get lengths for syntactic constituents, and on the Monitor data to get durations.

This method is inconvenient, particularly for the sort of exploratory study that wishes to consider several different time relationships. We don't think it is worth adding special loading routines that add temporary attributes for adjusted start and end times, but we could include some utilities for command line searching based on adjustments passed in on the command line. For instance,

java CountWithTimeOffset -q '($t turn)($f feedback):($t # $f)' -t feedback -d 50

could mean to count overlaps after feedback elements have been displaced 50 seconds forward. We are considering whether this would be useful enough to supply.

Inability to handle namespacing

At present (Apr 05) the query language parser fails to handle namespacing properly, so any elements and attributes that are namedspaced will be difficult to work with. For the timing and id attributes, where the default names are in the nite: namespace, this doesn't matter, since they are exposed to query via e.g. START($x), but namespacing other tags and attributes would make working with them difficult until this is fixed.

Speed and Memory Use

NXT's query engine is slow and uses a great deal of memory. For instance, some of our more complicated syntactic queries on the Switchboard corpus take 10 seconds per dialogue, or over an hour and a half for the entire corpus.

This is partly a consequence of what it does - the query language is solving a harder problem than languages that operate on trees and/or are limited in their use of left and right context. It is true that the current implementation is not fully optimized, but this is not something we intend to look at in the immediate future. Our first choice strategy for addressing this problem is to look at mapping NQL queries to XQuery for implementation, and addition of the missing operators, that way. Meanwhile, most of NXT's users are not actually engaged in real-time processing, and find that if they develop queries on a few observations using a GUI, they can then afford to run the queries at the command line in batch. The more they are interested in sparse phenomena, the less suitable this strategy is. For some query-based analyses, it is also useful to consider direct implementation using the NOM API, since the programmer can optimize for the analysis being performed.

Meanwhile, an hour and a half is OK for batch mode, but some of our queries are so common that we really want easy access to the results. We can get this by indexing. Using indices rather than the more complex syntactic queries makes querying roughly ten times faster. This will be even faster if one then selects not to load the syntax at all, which is possible if one doesn't need it for other parts of the subsequent query. You can choose not to load any part of the data by commenting out the tag for it in your local copy of the metadata file, or after NXT 1.3.0, by enabling lazy loading in your applications.

It's faster to use string equality than regular expression matching in the query language, and keep in mind the regular expressions have to match the entire string they are compared against, not just a substring of it.

The very desperate can write special purpose applications to evaluate their queries, which is faster especially for queries involving quantification. For instance, one user has adapted CountQueryResults to run part of the query he wants, but instead of returning the results, then checks the equivalent of his forall conditions using navigation in the NOM.

Helpful hints

We recommend refining queries using display.bat/.sh on a single dialogue (probably spot-checking on a couple more, since observations vary), and running actual counts using the command line utilities. Build up queries term by term - the syntax error messages aren't always very easy to understand. Missing dollar signs, quotation marks, and parentheses are the worst culprits. Get around the bookmark problems and the lack of parenthesis and quote matching in the search GUI by typing the query into something else that's handier (such as emacs) and pasting what you've written into the query window. You can and should include comments in queries if they are at all complicated. Queries have to be expressed on one line to run them at the command line, but you shouldn't try to author them this way - instead, postprocess a query developed in this more verbose style by taking out

Analysis of query results can be expedited by thinking carefully about the battery of tools that are available: knit, LT-XML, stylesheets, xmlperl, shell script loops, and so on. One interesting possibility is importing the query results into the data set, which would be a fancier, hierarchically structured form of indexing. At May 2004, the metadata <coding-file> declaration required to do this would be a little different for every query result, but we intend minor syntactic changes in both the query result XML and what knit produces to make this declaration static.