The main investment needed to use your own data with the NITE XML toolkit is producing a metadata file and providing your data in a conformant fashion (especially as regards file naming). Understanding the format of metadata files is important if you wish to import your data, and we provide several example metadata files to help. Once you have a metadata file that describes your data, you will be able to use all the NITE tools to validate, analyse and edit that data.

What metadata files do

Metadata files describe all aspects of a corpus including:

There's a full discussion of the elements and attributes that make up a metadata file here.

What metadata files look like

Metadata files are XML documents that conform to a DTD. There is one metadata DTD for simple (single-file) corpora and one for standoff corpora. The two have much in common, so both import the same basic DTD. The set of DTDs (zipped) can be downloaded here. If you are more familiar with XML Schema and have a schema validator installed, you may prefer this set of zipped schemas.

Metadata examples

Save these to disk and have a look at them in your favourite XML or text editor.

  1. Metadata for NITE's simple example (you may also want to see the data it describes - 5K zip)
  2. Metadata for the Maptask corpus (here is a single maptask observation - 165K zip)
  3. Metadata for the Smartkom corpus (simple corpus case) (here's a single Smartkom interaction file - 15K XML)

Using metadata to validate data

Since metadata describes the format of the data and where to find it on disk, it is used by the NITE software to validate the data as it is loaded and edited. This sort of direct validation is useful, but we also provide schema validation of data using a schema derived automatically from the metadata (via a stylesheet).

Assuming you have downloaded and installed NOM, you already have the schema-generating stylesheet (it's in the lib directory). Armed with this and a stylesheet processor (Xalan is also in the NOM distribution), you can run this command on your metadata file:

java org.apache.xalan.xslt.Process -in <your-metadata> -xsl generate-schema.xsl -out extension.xsd

This creates a schema file called extension.xsd which imports two other static schema files: typelib.xsd and xlink.xsd - also in the lib directory of your NOM distribution. Put these static schema files in the same directory as the newly generated extension.xsd.
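
If you would rather drive the same transformation from Java code than from the command line, the standard JAXP API (javax.xml.transform) can apply the stylesheet. The following is only a minimal sketch: the class name GenerateSchema and its generate method are hypothetical and not part of the NOM distribution, and it assumes the stylesheet runs under whatever XSLT processor JAXP picks up (Xalan, if it is on the classpath).

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class; it is not shipped with NOM.
class GenerateSchema {

    // Applies the given XSLT stylesheet to the metadata file and
    // writes the transformation result to the output file.
    static void generate(File metadata, File stylesheet, File output)
            throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer =
                factory.newTransformer(new StreamSource(stylesheet));
        try (OutputStream out = new FileOutputStream(output)) {
            transformer.transform(new StreamSource(metadata),
                                  new StreamResult(out));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println(
                "usage: GenerateSchema <your-metadata> <stylesheet> <output>");
            return;
        }
        generate(new File(args[0]), new File(args[1]), new File(args[2]));
    }
}
```

Run with your metadata file, generate-schema.xsl and extension.xsd as the three arguments, this should have the same effect as the command above.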

If you have a schema validator (I use xsv), you are now ready to validate some data files. Try putting these declarations in the root element of your data file:

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="extension.xsd"

and then execute:

xsv <your-file>
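
Adding the two declarations by hand becomes tedious when there are many data files. As a rough sketch only, the standard Java DOM and JAXP APIs can add them to the root element of a file. The class name AddSchemaLocation and its method are hypothetical and not part of NOM, and the sketch assumes the data file is well-formed XML with no external DTD that needs fetching.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class; it is not shipped with NOM.
class AddSchemaLocation {
    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";
    static final String XMLNS_NS = "http://www.w3.org/2000/xmlns/";

    // Reads the input file, adds the xsi namespace declaration and the
    // xsi:noNamespaceSchemaLocation attribute to the root element, and
    // writes the modified document to the output file.
    static void addDeclarations(File in, File out, String schemaLocation)
            throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(in);
        Element root = doc.getDocumentElement();
        root.setAttributeNS(XMLNS_NS, "xmlns:xsi", XSI_NS);
        root.setAttributeNS(XSI_NS, "xsi:noNamespaceSchemaLocation",
                            schemaLocation);
        // An identity transform serializes the DOM tree back to XML.
        try (OutputStream os = new FileOutputStream(out)) {
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(os));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: AddSchemaLocation <your-file> <output-file>");
            return;
        }
        addDeclarations(new File(args[0]), new File(args[1]), "extension.xsd");
    }
}
```

Writing to a separate output file keeps the original data untouched, which seems the safer choice while you are still experimenting with validation.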

One of the major reasons behind this approach to schema validation is that it can validate either a single file as serialized by NITE, or files that have been transformed so that their nite:child elements are recursively replaced by the elements they point to and their pointers are replaced by the actual target elements. The transformed form is useful for checking which types of element may be children of, or pointed to by, a specific element. In this way an entire corpus could be schema validated. The stylesheet that does this transformation is in the lib directory of your NOM distribution.

If this all seems rather involved, and your data already loads into the NOM, the program PrepareSchemaValidation.java will make a new directory for you which is fully ready for schema validation.

Validation limitations:

 

Last modified 04/17/06