This page describes the various utilities for searching a corpus using the NXT's dedicated query language.
There are two basic methods for using the search utilities: .bat/.sh files (.bat for Windows, .sh for Unix/Linux) and the command line. Command line examples below are given for bash and are known to work under Cygwin (which comes with a bash that runs under Windows). There will be DOS equivalents, but it hasn't been a priority for us to figure out what they are, and in particular (we think) redirection is less flexible there.
The .bat/.sh files are set up so they just need to be run to work. The command line utilities require the classpath environment variable to be set up so that the shell can find the software. Assuming you use them from the top level directory in which the software is installed, this can be done as follows:
if [ $OSTYPE = 'cygwin' ]; then
export CLASSPATH=".;lib;lib/nxt.jar;lib/jdom.jar;lib/JMF/lib/jmf.jar;lib/pnuts.jar;lib/resolver.jar;lib/xalan.jar;lib/xercesImpl.jar;lib/xml-apis.jar;lib/jmanual.jar;lib/jh.jar;lib/helpset.jar;lib/poi.jar"
else
export CLASSPATH=".:lib:lib/nxt.jar:lib/jdom.jar:lib/JMF/lib/jmf.jar:lib/pnuts.jar:lib/resolver.jar:lib/xalan.jar:lib/xercesImpl.jar:lib/xml-apis.jar:lib/jmanual.jar:lib/jh.jar:lib/helpset.jar:lib/poi.jar"
fi
This syntax assumes the bash shell; other shells use a different syntax.
Then, for example,
java CountQueryResults -corpus Data/meta/swbd-metadata.xml -query '($n nt):'
You'll need to be careful to use single quotes at shell level and double quotes within queries - although we've found one shell environment that requires the quotes the other way around. Getting the quoting to work correctly in a shell script is difficult even for long-time Unix users. This example shell script worked for one user's specific needs running under Cygwin.
Don't forget you can use redirection to get rid of all the pesky warning and log messages:
java CountQueryResults -corpus Data/meta/swbd-metadata.xml -query '($n nt):' 2> logfile
or
java CountQueryResults -corpus Data/meta/swbd-metadata.xml -query '($n nt):' 2> /dev/null
It is possible to increase the amount of memory available to java for processing, and depending on the machine setup, this may speed things up. This can be done by using flags to java, e.g.
java -Xincgc -Xms127m -Xmx512m -Xfuture CountQueryResults ...
and also by editing the java calls in any of the existing scripts. This is what the flags mean: -Xincgc requests incremental garbage collection (getting back unused memory), -Xms sets the initial heap size, -Xmx sets the maximum heap size, and -Xfuture turns on the strictest class-file format checks. It's possible to use other numbers for -Xms and -Xmx, and perhaps the values given here aren't always appropriate, or are lower than some machine defaults.
Enough of NXT has been tested or used successfully under Mac OS X that we just believe it to work. Mac OS X didn't present any particular problems (for instance, the .sh scripts work) except in making it possible to run them by double-clicking on them. This required us to write specialist Mac scripts that call the .sh scripts using AppleScript. This is an example of one, but don't try to look at it in a text editor; it opens in the AppleScript editor. Keep in mind with scripts like these that you have to tailor them to your installation by specifying correct paths. If you can contribute a better paragraph explaining what Mac users need to do, we'd be grateful for it - it's hard for us to find Macs with normal setups to experiment on. If you need more help than this hint, please ask and if we can, we'll reconstruct the process.
At May 2004, NXT, and the data loading routines in particular, produce a great many progress and warning messages on System.err - enough so that it can be difficult to tell what's a real problem. We expect to consider strategies for improving this.
Any corpus in NITE data format is immediately amenable to two different graphical interfaces that allow the corpus to be searched, even without writing tailored programs. The first is a simple search GUI, and the second is a generic data display program that works in tandem with the search GUI to highlight search results.
The search GUI can be reached either by using search.bat/search.sh and specifying which corpus to load, or by using the .bat/.sh for the specific corpus (if it exists) and choosing the "search" option. It has two tabbed windows. The query tab allows the user to type in a query. Cut and paste from other applications works with this window. The query can also be saved on the bookmark menu, but at May 2004 this doesn't work well for long queries. There is a button to press to do the search, which automatically takes the user either to a pop-up window with an error message explaining where the syntax of the query is incorrect, or, for a valid query, to the result tab. This window shows the results as an XML tree structure, with more information about the element the user has selected (with the mouse) displayed below the main tree.
The GUI includes an option to save the XML result tree to a file. This can be very handy in conjunction with "knit" for performing data analysis. It also includes an option to save the results in a rudimentary Excel spreadsheet. This is less handy, especially in the case of complex queries, because the return value is hierarchically structured but the spreadsheet just contains information about each matched element dumped into a flat list by performing a depth-first, left-to-right traversal of the results. However, for relatively simple queries and people who are used to data filtering and pivot tables in Excel, it can be the easiest first step for analysis.
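For readers who want a picture of what the spreadsheet export does, here is a minimal sketch in Python of a depth-first, left-to-right flattening of a hierarchical result. The node structure is invented for illustration; NXT's actual result tree is XML.

```python
def flatten(node, rows):
    """Emit this node's attributes as one flat row, then recurse into
    its children, depth-first and left-to-right."""
    rows.append({"name": node["name"], **node.get("atts", {})})
    for child in node.get("children", []):
        flatten(child, rows)
    return rows

# A toy two-level result: a sentence match containing two word matches.
tree = {"name": "nt", "atts": {"cat": "S"},
        "children": [{"name": "word", "atts": {"orth": "hello"}},
                     {"name": "word", "atts": {"orth": "world"}}]}

rows = flatten(tree, [])
# All three elements end up in one flat list; the hierarchy is lost,
# which is why the export is less handy for complex queries.
```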
The search GUI works on an entire corpus at once. This can make it slow to respond if the corpus is very large or if the query is very complicated (although of course it's possible to comment out observations in the metadata to reduce the amount of information it loads). Sometimes a query is slow because it's doing something more complicated than what the user intended. A query can be interrupted mid-processing and will still return a partial result list, which can be useful for checking it.
At May 2004, when the user chooses to open a corpus from the File menu, the search GUI expects the metadata file to be called something.corpus, although many users are likely to have it called something.xml (so that it behaves properly in other applications like web browsers). Choose the "all files" option (towards the bottom of the open dialogue box) in order to see .xml files as well as .corpus ones.
Unlike the search GUI, the generic display utility works on one observation at a time. It can be reached either by using the .bat/.sh for the specific corpus (if it exists) or by invoking it at the command line:
java net.sourceforge.nite.gui.util.GenericDisplay -c CORPUS -o OBS
Where CORPUS gives a path to a metadata file and OBS names an observation that is listed in that metadata file.
It simply puts up an audio/video window for each signal associated with an observation, plus one window per coding that shows the elements in an NTextArea, one element per line, with indenting corresponding to the tree structure and a rendering of the attribute values, the PCDATA the element contains, and enough information about pointers to be able to find their targets visually on the other windows. It doesn't try to do anything clever about window placement. There is a search menu on the interface that will pull up the search GUI and when the user selects an element on the search result tab, will highlight the corresponding part of the data display. This facility can be extremely useful in formulating queries!
It would be possible to improve this interface to make it much more useful (the Javadoc comments at the beginning of the code for it give our ideas about that, and it would make a nice student project). It's nice to have something that works for NITE format data out of the box, but a generic display will never be as good as writing one that is tailored for the data set.
Some of the NITE data samples come with program samples that build tailored data displays or graphical user interfaces that allow the data to be coded. These programs can easily be modified to have the search highlighting capabilities of the generic display, and some of them already do. Use the implementation of the generic display utility, or of an existing sample, as a guide. At May 2004, programs that use the search menu will only show highlighting on NTextAreas because they are the only display object to implement the required interface. We expect this situation to improve.
In the tailored
displays, if there isn't a direct representation of some element on
the display, then there's nothing to highlight. For instance, in many
data sets timestamped orthographic transcription consists of
We've written quite a few command line search utilities as our own projects have needed them; none of them are very complicated. We probably won't continue to ship with all of them in the release because there are so many and they are quite similar. The ones that aren't in the release can be found in the samples directory of the CVS; place them in the samples directory from the release and then from the top level directory, compile in the usual way, e.g.
javac -d lib samples/CountQueryResults.java
This is a command line interface for saving the query results as an XML tree (the same tree that's displayed in the search GUI). It takes the following arguments:
-corpus CORPUS -observation OBS -query QUERY -filename OUTFILENAME -directory DIRNAME -independent or -allatonce
where CORPUS is the location of the metadata file, OBS is an observation name (if not given, all observations listed in the metadata file are loaded), and QUERY is a query expressed in NQL. If -independent is indicated, then one result file is saved per observation, with the query evaluated independently on each one; if -allatonce is indicated, then the entire corpus is loaded at once and one output file is saved. -independent is faster, but -allatonce is necessary if queries draw context from outside single observations. In distributions before 05 May 2004 (1.2.6 or earlier), the default was -allatonce, but after that it was changed to -independent to match the other command line utilities. If no filename is indicated, the output goes to System.out. (Note that this isn't very sensible in conjunction with -independent because the output will just concatenate separate XML documents.) Everything else that could potentially be on System.out is redirected to System.err.
If a filename is indicated, the output ends up in the directory named by DIRNAME. It ends up in OUTFILENAME unless -independent is indicated, in which case that filename is prefixed with the name of the observation and a full stop (.). -independent is ignored if -observation OBS is indicated (i.e., the output is saved without prefixing the filename).
Under cygwin, -d takes Windows-style directory naming; e.g., -d "C:", not -d "/cygdrive/c". Using the latter will create the unexpected C:/cygdrive/c. This may be configurable at the system level. At May 2004, directory naming has not been tested on other platforms.
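The filename logic described above can be sketched as follows. The function name and argument layout are illustrative, not NXT's actual code.

```python
def output_path(dirname, outfilename, obs_name, independent, obs_flag_given):
    """-independent prefixes each output filename with the observation
    name and a full stop, but is ignored when -observation was given
    explicitly on the command line."""
    if independent and not obs_flag_given:
        outfilename = obs_name + "." + outfilename
    return dirname + "/" + outfilename

# One file per observation under -independent:
path = output_path("out", "results.xml", "sw2005", True, False)
# -> "out/sw2005.results.xml"
```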
This is a command line interface for counting query results for an entire corpus that doesn't show a result tree but just outputs the number of matches (which is good for embedding in scripts). It takes the following arguments:
-corpus CORPUS -observation OBS -query QUERY -allatonce
which are handled as in SaveQueryResults. In the case of complex queries, the counts reflect the number of top level matches (i.e., matches to the first subquery that survive the filtering performed by the subsequent subqueries - matches to a subquery drop out if there are no matches for the next subquery).
When -allatonce or -observation is given, the result is a bare count; otherwise, it is a table with one line per observation, giving the observation name, whitespace, and then the count.
In our experience, queries can always be rewritten so that this is the number one wants. That's a whole other story, but here are some quick hints:
Note that saved query results can be knit with the corpus to useful
effect. In theory, saved query results can also be introduced as data
itself by adding an appropriate
In versions before NXT-1.2.6, CountQueryResults means -allatonce and
a separate utility, CountOneByOne, handles the independent case.
MatchInContext
MatchInContext is a command line interface that shows the orthography for
some query results, optionally with some surrounding context. We
developed it specifically at the request of users who are already
familiar with e.g. tgrep.
It takes the following arguments:
where the arguments are as before with the following additions:
The text is shown on STDOUT. If no context is specified, the text shown is that relating to the first named variable in the query. If context is specified, then that same text is shown upcased within the text relating to the context. This can produce puzzling results (like matches with no text, or with context but no match text) that in our experience prove to be correct. There is no clean way of knowing where to insert line breaks, speaker attributions, etc. in a general utility such as this one; for better displays, write a tailored interface. Canonical usage would use one-variable queries for the main and context queries, with the latter expressing simply a type, possibly with some constraints on attribute values. There may be fewer or more than one context match for a query result; in these cases we comment on this and show the first match we find (if any).
In versions before NXT-1.2.6, MatchInContext means -allatonce and a separate utility, MatchInContextOneByOne, handles the independent case.
An n-gram is a sequence of n states in a row drawn from an enumerated list of types. For instance, consider Parker's floor state model (Journal of Personality and Social Psychology 1988). It marks spoken turns in a group discussion according to their participation in pairwise conversations. The floor states are newfloor (first to establish a new pairwise conversation), floor (in a pairwise conversation), broken (breaks a pairwise conversation), regain (re-establishes a pairwise conversation after a broken), and nonfloor (not in a pairwise conversation). The possible tri-grams of floor states are newfloor/floor/broken, newfloor/floor/floor, regain/broken/nonfloor, and so on. We usually think of n-grams as including all ways of choosing a sequence of n types, but in some models, not all of them are possible; for instance, in Parker's model, the bi-gram newfloor/newfloor can't happen. N-grams are frequently used in engineering-oriented disciplines as background information for statistical modelling, but they are sometimes used in linguistics and psychology as well. Computationalists can easily calculate n-grams by extracting data from NXT into the format for another tool, but sometimes this is inconvenient or the user who requires the n-grams may not have the correct skills to do it.
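To make the counting concrete, here is a minimal sketch in Python of trigram counting over Parker's floor states, including the zero counts for n-grams that never occur. The state list and the toy sequence are illustrative; NGramCalc reads both from the metadata and the corpus.

```python
from itertools import product

# The five floor states from Parker's model.
STATES = ["newfloor", "floor", "broken", "regain", "nonfloor"]

def ngram_counts(sequence, n, states):
    """Count every possible n-gram over the state set, including
    n-grams that never occur in the sequence."""
    counts = {gram: 0 for gram in product(states, repeat=n)}
    for i in range(len(sequence) - n + 1):
        counts[tuple(sequence[i:i + n])] += 1
    return counts

# A toy sequence of turn states, already in start-time order.
seq = ["newfloor", "floor", "floor", "broken", "regain", "floor"]
counts = ngram_counts(seq, 3, STATES)
# counts[("newfloor", "floor", "floor")] is 1;
# impossible trigrams like newfloor/newfloor/newfloor stay at 0.
```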
NGramCalc is a utility for calculating n-grams from NXT format data. The command line options for it are a bit complicated because it's very flexible. It can take all elements of a particular type or all those resulting from the first matched variable of a query, as long as they are timed, and put them in order of start time. Then, using as the states either the value of a named attribute, or the names of elements at the end of a named role, or the value of a named attribute on those elements, it will report n-gram frequencies - but it must be possible from the metadata to get the complete list of states for the utility to work. It looks for the list as the enumerated attribute values of the appropriate tag or as the set of codes allowed in the layer a role is declared to point to. Some NXT users effectively have enumerated attribute values but declare them as strings - modifying the declaration to make it explicit will enable the utility.
To call the utility, which is in builds after NXT 1.3.1 but otherwise
available from CVS, set the classpath in the usual way, and then use
java NGramCalc -corpus metadata_file_name -observation observation_name -tag tagname -query query -att attname -role rolename -n positive_integer
-corpus is required; it is the path to the metadata file.
-observation is optional; if it is used, the n-grams are calculated over one observation only, and if it is omitted, over all observations listed in the metadata. Although only one set of numbers is reported, NXT loads only one observation at a time when calculating them.
-tag is required; it names the tag to use in finding the state names.
-query is optional; if given, then the program uses matches to the first named variable as the elements from which to derive states. If it is not given, then the query is assumed to match all tags of the type named using -tag. Note that if a query is used, it is possible to have the first named variable use a disjunctive type, but only if the method for deriving states from the elements works for both types and results in the same enumerated list. In this case, either of them can be named in -tag.
-role is optional; if given, rather than looking for the states on the query matches (or named tag if no query was given), the program looks for them on the element found by tracing the named role from there. This level of indirection is useful if the data was produced using one of NXT's configurable end user tools, which tend to point to external corpus resources to get possible annotation values.
-att; if given, uses the value of the named attribute both for finding the possible state names and for finding the actual states. -att is required if -role is omitted, but optional if it is included. If -role is included and -att is omitted, then instead of using attribute values, the states are derived from the element names in the layer pointed to by the named role.
-n is optional; it gives the size of the n-grams. It defaults to 1.
For instance,
java NGramCalc -c METADATA -t turn -a fs -n 3
will calculate trigrams of fs attributes of turns and output a
tab-delimited table like
500 newfloor floor broken
0 newfloor newfloor newfloor
Suppose that the way that the data is set up includes an additional
attribute value that we wish to skip over when calculating the tri-grams,
called "continued".
java NGramCalc -c METADATA -t turn -a fs -n 3 -q '($t turn):($t@fs != "continued")'
will do this. Entries for "continued" will still occur in the output
table because it is a declared value, but will have zero in the
entries.
java NGramCalc -c METADATA -t gesture-target -a name -n 3 -q '($g gest):' -r gest-target
will produce trigrams where the states are found by tracing the gest-target
role from gest elements, which finds gesture-target elements (canonically,
part of some corpus resource), and further looking at the values of their
name attributes. Note that in this case, the type given in -t is what results
from tracing the role from the query results, not the type returned in the
query.
java NGramCalc -c METADATA -t gest -q '($g gest):' -r gest-target
will produce unigrams where the states are named in the elements reached by tracing the gest-target role from gest elements. Again, canonically these would be part of some corpus resource, but in this case the element names themselves are used. Note that in this case, the type given in -t and the type returned in the query results are the same.
At 21 Feb 05, use of -role without -att is not yet implemented.
We can think of the following changes that could be useful:
We are interested in what other requirements users can find.
Note: FunctionQuery is only available with NXT versions compiled after May 1st 2006 (NXT 1.3.5 upwards). It has been designed to subsume the functionality of SortedOutput.
FunctionQuery is a utility for outputting tab-delimited data. It takes all the elements resulting from a query, as long as they are timed, and puts them in order of start time. Then it outputs one line per element containing the values of the named attributes or aggregates, with a tab character between each. The attributes can be qualified with a variable specifier (e.g. $v@attribute). If the variable specifier is omitted, the attribute belonging to the first variable in the query (the "primary variable") is returned.
To call the utility, set the classpath in the usual way, and then use
java FunctionQuery -c metadata_file_name -ob observation_name -q query -atts [attname+]
-corpus is required; it is the path to the metadata file.
-observation is optional: it's the name of the observation. If not present the program cycles through all observations.
-text is an optional flag. If present, the text of the result element (or any subelements) is output after any attributes. This is kept for backwards compatibility; the results are identical to adding the primary variable name (including the leading $) to the end of the atts list.
-query is required; the first matched variable from every result forms the basis of the output.
-atts is required; input is expected as a space-separated list of attributes or aggregates. Note that if the attribute does not exist for some matched elements, a blank field is output at that tab-stop. Attributes are generally of the form $v@attribute. If $v is omitted, the attribute is taken from the primary variable. If the attribute is omitted, the textual content of $v and its children is returned.
Aggregate functions are identified by a leading '@' character. There are currently 4 aggregate functions included in FunctionQuery, described below. The first argument to a function is always a subquery to be evaluated in the context of the current result.
Aggregate Functions:
For the following functions, optional arguments are denoted by an equals sign followed by the default value of that argument. The context query that is the first argument of each of these functions should be of the form ($v var):$v test $q && other tests, where $q is bound to the current result of the main query (the primary variable). For example, $v # $q will return all results that overlap the main query result, whilst $q ^ $v will return all descendants of the current query result.
Example:
will output a sorted list of moves for the observation, consisting of the type attribute, start and end times, the count of w (words) that overlap each move, and any text included in the move or any of its children.
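The shape of that computation can be sketched in Python as follows. The element records and the overlap test are invented for illustration; FunctionQuery itself works on the loaded NXT corpus.

```python
def overlaps(a, b):
    """Temporal overlap, as the query language's # operator tests it."""
    return a["start"] < b["end"] and b["start"] < a["end"]

def function_query_lines(moves, words, atts):
    """Sort the main results by start time and emit one tab-delimited
    line each: the named attributes (blank if missing), then an
    aggregate count of overlapping words."""
    lines = []
    for m in sorted(moves, key=lambda e: e["start"]):
        fields = [str(m.get(a, "")) for a in atts]
        fields.append(str(sum(1 for w in words if overlaps(m, w))))
        lines.append("\t".join(fields))
    return lines

moves = [{"type": "inform", "start": 3.0, "end": 5.0},
         {"type": "query", "start": 0.0, "end": 2.5}]
words = [{"start": 0.5, "end": 1.0}, {"start": 3.5, "end": 4.0},
         {"start": 4.2, "end": 4.8}]
lines = function_query_lines(moves, words, ["type", "start", "end"])
```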
FunctionQuery is designed to be backwards-compatible with SortedOutput.
Note: As of May 2006, FunctionQuery is a more general tool that subsumes the functionality of SortedOutput.
SortedOutput is a utility for outputting tab-delimited data. It takes all elements resulting from the first matched variable of a query, as long as they are timed, and puts them in order of start time. Then it outputs one line per element containing the values of the named attributes, with a tab character between each.
To call the utility, which is in builds after NXT 1.3.2 but otherwise available from CVS, set the classpath in the usual way, and then use
java SortedOutput -c metadata_file_name -ob observation_name -q query -atts [attname+]
-corpus is required; it is the path to the metadata file.
-observation is optional: it's the name of the observation. If not present the program cycles through all observations.
-text is an optional flag. If present, the text of the result element (or any subelements) is output after any attributes.
-query is required; the first matched variable from every result forms the basis of the output.
-atts is required; input is expected as a space-separated list of attributes. Note that if the attribute does not exist for some matched elements, a blank field is output at that tab-stop.
Example:
will output a sorted list of moves for the observation, consisting of the type attribute and the start and end times.
At Feb 06, there is a beta-release extension of SortedOutput called SearchAndFilter. You can find it in the Sourceforge CVS repository under "samples" but not in the builds. It allows one to specify output fields from more than the first variable match using a syntax like the following:
Knit isn't really a search utility like the others, but it's a general utility that will be useful to anyone performing searches. It's related to the general utility that will be familiar to some users from LT XML, modified for the NXT data model and data format. Knit and its uses are described more fully on the data processing page.
NXT includes a facility for adding annotation to a corpus based on
query matches. To use it, set the classpath
as usual. Then use
java Index -c CORPUS -o OBS -q QUERY -t TAG -r ROLE1 ROLE2 ... ROLEM
where CORPUS is the location of a metadata file
OBS is optional, and is the name of an observation. If it is omitted, all observations named in the metadata file are indexed in turn.
QUERY is a query. Let n be the number of unquantified variables in the first subquery (unquantified meaning, without a forall or exists).
TAG is the name of a tag (code). It is optional and defaults to "markable".
ROLE1 to ROLEM are optional, and are the names of roles to use in the indexing.
The -r flag, if present, must be given last.
For each query match, a new tag of type TAG is added to the corpus. If -r is omitted, the new tag is made a parent of the first unquantified variable of the query. If -r is included, then the new tag will contain m pointers which point to the first m unquantified variables in the first subquery, using the given role names in order. m must be less than or equal to n.
Note that this does not restrict the same element to at most one match, even though that's a property we often want our mappings to have. When creating indices of one variable, it is often best to use only one unquantified variable in the first subquery of the query, so that we don't index the same thing more than once (i.e., using further subqueries as a filter, not to retain further variable matches). Also note that Index does not remove existing tags of the same type before operation - it just adds new tags. Making the same call twice will create two indices to the same data, which is usually undesirable, but this enables an index to be built up gradually using several different queries.
The program assumes that a definition for the new tag has already been added to the metadata file. It is usual to put it in a new coding, and it would be a bad idea to put it in a layer that anything points to, since no work is done to attach the indices to prospective parents or anything else besides what they index. If the indexing adds parents, then the type of the coding file (interaction or agent) must match the type of the coding file that contains the matches to the first variable. If an observation name is passed, it creates an index only for the one observation; if none is, it indexes each observation in the metadata file by loading one at a time (so this won't work for queries that need comparisons across observations).
The canonical metadata form for an index file, assuming roles are used, is
an interaction coding as follows:
<coding-file name="FOO">
<featural-layer name="BAZ">
<code name="TAGNAME">
<pointer number="1" role="ROLENAME" target="LAYER_CONTAINING_MATCHES"/>
</code>
</featural-layer>
</coding-file>
The name of the coding file determines the filenames where the indices get stored. The name of the featural-layer is unimportant but must be unique. The tags for the indices must not already be used in some other part of the corpus, including other indices.
An earlier version of this program was called AddMarkables. It only worked for one observation at a time and took its arguments in a particular order rather than as flags. We changed the name because the new version isn't backward compatible.
Note that if you want one pointer for every named variable in a simple query, or you want tree-structured indices corresponding to the results for complex queries, you can use SaveQueryResults and load the results as a coding. For cases where you could use either, the main difference is that SaveQueryResults doesn't give control over the tag name and roles.
<coding-file name="active">
<featural-layer name="active-layer">
<code name="active">
<pointer number="1" role="at"/>
</code>
</featural-layer>
</coding-file>
This specifies that the indices for sw2005 (for example) should go in sw2005.active.xml. Then, for example,
java Index -corpus Data/meta/swbd-metadata.xml -tag active -query '($sent nt):($sent@cat=="S")::($subj nt):($sent^1$subj)&&($subj@cat="NP")&&($subj@subcat=="SBJ"):: ($obj nt):($sent^$obj)&&($obj@cat=="NP")&&(!($obj@subcat))::(forall $obj_par nt):(($sent^$obj_par)&&($obj_par^$obj)&&($obj_par!=$sent)&&($obj_par!=$obj))->($obj_par@cat="VP")::($vp nt):($vp^1$obj) && ($vp@cat=="VP")::(forall $obj2 nt):(($vp^1$obj2) && ($obj2!=$obj) && ($obj2@cat=="NP")&&(!($obj2@subcat))) -> ($obj<>$obj2)::($mSbj markable): ($mSbj >"at" $subj)::($mObj markable): ($mObj >"at" $obj):: (forall $t1 trace):!($subj^1$t1)::(forall $t2 trace):!($obj^1$t2)'
keeping in mind that on your operating system, you may need to swap around the single and double quotes. (We've used this query in the example to show the sort of thing which external users have successfully written in the language solely from reading our documentation - it's not as difficult to write as it looks if you use the hints below.) After indexing,
($s nt)($a active):($a >"at" $s)
gets the active sentences.
Multiple invocations of Index will add more indices to the same index file, not overwrite the existing indices. This can be useful if there are two separate queries you want to run to create the index, or if you want to store multiple indices in the same file (for instance, for active and passive sentences). In the latter case, make sure to add <code> tags for both to the metadata at some time before you run the indexing for passives, e.g.:
<coding-file name="active">
<featural-layer name="active-layer">
<code name="active">
<pointer number="1" role="at"/>
</code>
<code name="passive">
<pointer number="1" role="at"/>
</code>
</featural-layer>
</coding-file>
At Feb 05, there are a number of known problems with the current query language implementation.
There is a bug when querying over multiple observations - the implementation considers times in different observations to be comparable, so that it's possible to get the result that an element in one observation is before some element in another. This is easy to get around: query on one observation at a time, or declare the reserved attribute for observation names for your corpus and add a test for the same observation as an extra query term - e.g. ($f@obs = $g@obs), if the attribute declared is "obs".
The search GUI (whether called stand-alone or from a search menu)
can't display results if some subquery in a complex query only has
query matches that are bound with "forall" - e.g.
($f foo):($f@att="val")::(forall $g bar):!($g ^ $f)
The immediate precedence operator is missing. Immediate precedence is
equivalent to
($f foo)($g foo)(forall $h foo):
($f<<$g) && (($h=$f) || ($h=$g) || ($h<<$f) || ($g<<$h))
but of course this is cumbersome and can be too slow and memory-intensive for practical purposes, depending on the data set. Some common uses of the operator are covered by the NGramCalc utility. Another work-around is to create one XML tree from the NXT data that represents the information required and query it using XPath. Export to LPath and tgrep2 would also be reasonable and is not difficult to implement. None of these help if you need to match on regular expressions of XML elements in order to add markup (so, for instance, saying "find syntactic constituents with one determiner followed by one or more adjectives followed by one noun and wrap a new tag around them"), but you can always use something like fsgmatch (from the LTG; the new release, currently in beta, is called lxtransduce) and then modify the metadata to match. Remember, the data is just XML, amenable to all of the usual XML processing techniques.
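The relation the missing operator would test can also be recovered as a post-processing step outside the query language. Here is a minimal sketch in Python over timed elements of one type (the records are invented for illustration; in practice they might come from a SortedOutput or SaveQueryResults export):

```python
def immediate_precedence_pairs(elements):
    """Sort elements by start time and pair each one with the element
    that immediately follows it - i.e. f << g with nothing in between,
    which is what an immediate precedence operator would match."""
    ordered = sorted(elements, key=lambda e: e["start"])
    return [(ordered[i]["id"], ordered[i + 1]["id"])
            for i in range(len(ordered) - 1)]

foos = [{"id": "f2", "start": 1.5}, {"id": "f1", "start": 0.0},
        {"id": "f3", "start": 3.0}]
pairs = immediate_precedence_pairs(foos)
# f1 is immediately followed by f2, and f2 by f3.
```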
The arithmetic operators are missing.
At present, users who need them add new attributes to their
data set and then carry on as normal. For instance, a researcher looking
at how often bar elements start in the 10 seconds after foo elements end
might add an "adjustedstart" attribute to bar elements that takes 10 seconds
off their official start times, and then use the query
($f foo)($b bar):(START($b) > END($f)) && ($b@adjustedstart < END($f))
This stylesheet, run on a specific individual coding in the context of the MONITOR project, is an example of how this can be done. It just copies everything, adding new attributes to feedbackgaze codes. We used this general technique on the Switchboard data to get lengths for syntactic constituents, and on the Monitor data to get durations.
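The stylesheet itself is not reproduced here, but the general shape is an identity transform plus one template that adds the new attribute. A minimal sketch, with hypothetical element and attribute names (real NXT timing attributes are usually in the nite namespace, omitted here for brevity):

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity transform: copy everything unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- on bar elements, add an adjustedstart 10 seconds before start -->
  <xsl:template match="bar">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="adjustedstart">
        <xsl:value-of select="@start - 10"/>
      </xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```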
This method is inconvenient, particularly for the sort of exploratory
study that wishes to consider several different time relationships.
We don't think it is worth adding special loading routines that add
temporary attributes for adjusted start and end times, but we could
include some utilities for command line searching based on adjustments
passed in on the command line. For instance,
java CountWithTimeOffset -q '($t turn)($f feedback):($t # $f)' -t feedback -d 50
could mean to count overlaps after feedback elements have been displaced 50 seconds forward. We are considering whether this would be useful enough to supply.
The query language implementation can be slow. This is partly a consequence of what it does - the query language is solving a harder problem than languages that operate on trees and/or are limited in their use of left and right context. It is true that the current implementation is not fully optimized, but this is not something we intend to look at in the immediate future. Our first-choice strategy for addressing this problem is to look at mapping NQL queries to XQuery for implementation, adding the missing operators that way. Meanwhile, most of NXT's users are not actually engaged in real-time processing, and find that if they develop queries on a few observations using a GUI, they can then afford to run the queries at the command line in batch. The more they are interested in sparse phenomena, the less suitable this strategy is. For some query-based analyses, it is also useful to consider direct implementation using the NOM API, since the programmer can optimize for the analysis being performed.
Meanwhile,
an hour and a half is OK for batch mode, but some of our queries are
so common that we really want easy access to the results. We can get
this by indexing. Using indices rather than the more complex
syntactic queries makes querying roughly ten times faster. This will
be even faster if one then selects not to load the syntax at all,
which is possible if one doesn't need it for other parts of the
subsequent query. You can choose not to load any part of the data by
commenting out the corresponding declaration in the metadata file.
It's faster to use string equality than regular expression matching
in the query language, and keep in mind that regular expressions have
to match the entire string they are compared against, not just a substring
of it.
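Since the matching is done with java.util.regex (as the manual notes), the difference is the one between Pattern.matches and a substring search. A small illustration in plain Java, with made-up strings:

```java
import java.util.regex.Pattern;

public class WholeStringMatch {
    public static void main(String[] args) {
        // The query language's regular-expression test behaves like
        // Pattern.matches: the pattern must cover the whole value.
        System.out.println(Pattern.matches("uh", "uh-huh"));    // false: substring only
        System.out.println(Pattern.matches("uh.*", "uh-huh"));  // true: covers the rest
        System.out.println("uh-huh".equals("uh-huh"));          // plain equality is cheaper
    }
}
```

So a pattern like "uh" will not match the value "uh-huh"; you need "uh.*" (and where no wildcard is involved at all, plain string equality is the faster choice).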
The very desperate can write
special-purpose applications to evaluate their queries, which is faster,
especially for queries involving quantification. For instance,
one user has adapted CountQueryResults to run part of the query he wants,
but instead of returning the results, then checks the equivalent of his
forall conditions using navigation in the NOM.
We recommend refining queries using display.bat/.sh on a single
dialogue (probably spot-checking on a couple more, since observations
vary), and running actual counts using the command line utilities.
Build up queries term by term - the syntax error messages aren't
always very easy to understand. Missing dollar signs, quotation marks,
and parentheses are the worst culprits. Get around the bookmark
problems and the lack of parenthesis and quote matching in the search
GUI by typing the query into something else that's handier (such as
emacs) and pasting what you've written into the query window. You can
and should include comments in queries if they are at all complicated.
Queries have to be expressed on one line to run them at the command
line, but you shouldn't try to author them this way - instead,
postprocess a query developed in this more verbose style by taking out
the comments and line breaks.
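Collapsing the verbose form to one line can be scripted rather than done by hand. A minimal sketch in Java (it only collapses whitespace; strip any comments separately, and the query shown is made up):

```java
public class OneLineQuery {
    public static void main(String[] args) {
        // A query authored over several lines for readability
        String query = "($f foo)($g foo):\n"
                     + "  ($f << $g) &&\n"
                     + "  ($f@type = $g@type)";
        // Collapse all runs of whitespace (including newlines) to one space
        String oneLine = query.replaceAll("\\s+", " ").trim();
        System.out.println(oneLine);
    }
}
```

This prints the query as a single line, ready to paste into a shell command.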
Analysis of query results can be expedited by thinking carefully about
the battery of tools that are available: knit, LT-XML, stylesheets,
xmlperl, shell script loops, and so on. One interesting possibility
is importing the query results into the data set, which would be a
fancier, hierarchically structured form of indexing. At May 2004, the
metadata <coding-file> declaration required to do this would be a
little different for every query result, but we intend minor syntactic
changes in both the query result XML and what knit produces to make
this declaration static.
The main documentation is the
query language reference manual.
Virtually the same information can be found on the help
menu of the search window (if you don't find it there, it's an
installation problem). An older document with more contextual
information can be found
here.
At September 2006, we plan a revised version of the manual.
The current version fails to give details about the operator for
finding out whether two elements are linked via a pointer with a role.
($a <"foo" $b) is true if $a points to $b using the "foo" role;
the role name can be omitted, but if it is specified it can only
be given as a textual string, not as a regular expression.
The current version also fails to make clear that the regular expression
examples given are only a subset of the possibilities. The exact
regular expression syntax depends on your version of Java, since it
is implemented using the java.util.regex package. Java 1.5
regular expression documentation can be found
here.
Here are some worked examples for the
Switchboard data sample
and the
Monitor data sample.
Computer scientists and people familiar
with first order predicate calculus have tended to be happy
with the reference manual plus the examples; other people, who may not
know, for instance, what implication is or what "forall" is likely to
mean, need more, and we're still thinking
about what we might be able to provide for them.
At Nov 2004, there are a few things described
in the query documentation that haven't been implemented yet (and
aren't on the workplan for immediate development).
This includes arithmetic operators and temporal fuzziness.
We thought this included versions of ^ and <>
limited by distance, but users report that these (or some of these?)
work. Also, some versions of the query documentation
show ; instead of : as the separator between bindings and match
conditions. The only major bug we've run into (at Nov 2004) is
that temporal operators will perform comparisons across observations,
even though time in different observations is meant to be independent.
After NXT-1.2.6, 05 May 04, one can in the metadata
declare a reserved attribute
to use for the observation name that will be added automatically for
every element, providing a work-around.
There's
a nifty visual demo
that runs on a toy corpus and might be useful for deciding whether
this stuff is useful in the first place.
Last modified 09/18/06