Deprecated parts of NXT Metadata

NXT's initial design included some very limited information in the metadata file for dividing a corpus into subsets and for expressing data management information relating to a coding effort. To our knowledge, even though these facilities are still available (as of April 2005), they have not been used, and there are better ways of approaching both problems. In addition, NXT has a superfluous distinction between simple and stand-off corpora. We describe what NXT has and then what the better approach is. We expect to remove these facilities from future versions of NXT.

Simple corpora

In NXT's design, the type attribute of the top-level <corpus> element could take one of two values: simple or standoff. Simple corpora have one tree of data per observation, whereas standoff corpora have multi-rooted infosets with links between files. The use of simple corpora is deprecated, since it is just as easy to treat a single XML file as one coding or corpus resource within a stand-off corpus. Allowing for simple corpora as a special case complicates the loading and saving parts of the NOM API. We hope eventually to remove the simple option in order to make the code easier to maintain.
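
As a minimal sketch of the recommended setting, the top-level element of a metadata file for a stand-off corpus might begin as follows. The id value is a hypothetical placeholder, and any other attributes and the child declarations are omitted.

    <!-- hypothetical id; other attributes and child declarations omitted -->
    <corpus id="example-corpus" type="standoff">
        <!-- coding, resource, and observation declarations go here -->
    </corpus>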

Corpus subsets

Rather than treating a corpus as a bare list of observations, it can be useful to designate ways of drawing coherent subsets from the full corpus. Especially in linguistics and psychology, corpora are often collected under a number of conditions that are being compared and contrasted (e.g., face-to-face versus telephone; familiar pairs versus strangers; large groups versus small). In computational linguistics, although corpora can be homogeneous in character, individual observations are often assigned to different subsets (e.g., development and test).

NXT's design included a basic facility for declaring these subsets in the metadata, in two parts: the declaration of a set of independent variables used to characterize the subsets, and then the association of values with these variables for each individual observation. In the design, the variables are defined in the <observation-variables> tag, which can occur as a child of the root element in the metadata.

Here is an example of the definition of some observation variables:

    <observation-variables>
        <observation-variable name="eye-contact" type="enumerated">
            <value>no eye</value>
            <value>eye</value>
        </observation-variable>
        <observation-variable name="temperature" type="number"/>
        <observation-variable name="weather" type="string"/>
    </observation-variables>

The observation-variable declarations follow the same structure as for variables in an ontology.

Once these observation variables have been declared, they can be used from the observation declarations:

    <observations>
        <observation name="q4nc4">
            <variables>
                <variable name="eye-contact" value="eye"/>
                <variable name="familiarity" value="non-familiar"/>
            </variables>
        </observation>
    </observations>

However, this facility isn't flexible enough for all users, it isn't backed up by loading routines that load only the corpus subsets defined using it, and it doesn't expose the variable values in any useful way, for instance, by making them accessible in query. A better way to do the same thing, available now but not in NXT's original design, is to develop a corpus resource that conveys the information, referencing the observations by name using string equality.
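
As a rough sketch of that alternative, the resource could be a small XML file with one element per observation. The element and attribute names below are hypothetical and are not part of any NXT-defined format; only the observation name q4nc4 and the variable values are taken from the example above.

    <!-- Hypothetical corpus resource: names are illustrative, not an NXT schema -->
    <observation-conditions>
        <!-- "observation" holds the observation name as a plain string -->
        <condition observation="q4nc4" eye-contact="eye" familiarity="non-familiar"/>
    </observation-conditions>

The idea is that, because the observation is identified by an ordinary string attribute, a subset can be selected by comparing that attribute with an observation's name.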

Data management

In the observations list, for each observation NXT has allowed a <user-codings> list with <user-coding> tags describing the current state of the coding. For instance,

    <observations>
        <observation name="q4nc4">
            <user-codings>
                <user-coding coder="cathy" date="sep98" name="games" status="final"/>
                <user-coding coder="gwyneth" date="oct01" name="move" status="final"/>
            </user-codings>
        </observation>
    </observations>

The coding being described is referenced by name in the name attribute. The coder attribute gives the name of the last person to touch the coding, and the date attribute gives when they did so. The status attribute gives the status of the file, choosing from the enumerated values unstarted, draft, final, or checked. If the status is checked, a further attribute, checker, is expected to contain the name of the checker.
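
For illustration, a coding that has been checked might be recorded along the following lines. The checker value "jean" is a made-up name; the other values are reused from the example above.

    <!-- "jean" is a hypothetical checker name -->
    <user-coding coder="cathy" date="sep98" name="games" status="checked" checker="jean"/>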

This system was designed to reflect the most important data management information, with the expectation that users would rely on CVS with symbolic tags for a complete data management history. It was intended that user interfaces for coding could, for instance, include an exit screen asking the annotator to specify this data management information. However, our feeling now is that this is not a useful halfway point: those who need data management will have to use CVS, in which case they can rely on it entirely, and those who do not need data management do not need this facility. Instead, we now allow a corpus to specify how to access its CVS repository in the metadata file and include facilities for running CVS commands in the NXT libraries, so that users can perform these functions from their annotation GUIs directly. Another possibility, in use on one large project, is having the annotators run CVS outside the NXT-based tool, for instance, checking data in and out using a CVS client directly (for those who know how to use one), or via a web form.


Created: Tue Apr 05 2005

 

Last modified 04/13/06