The main investment in making your own data usable by the NITE XML toolkit is producing a metadata file and providing your data in a conformant fashion (especially as regards file naming). You will need to understand the format of metadata files if you wish to import your data, though we provide several example metadata files to help. Once you have a metadata file that describes your data, you will be able to use all the NITE tools to validate, analyse and edit your data.
Metadata files describe all aspects of a corpus including:
There's a full discussion of the elements and attributes that make up a metadata file here.
Metadata files are XML and conform to a DTD. There is one metadata DTD for simple (single file) corpora and one for standoff corpora. The two have much in common, so both import the same basic DTD. The set of DTDs (zipped) can be downloaded here. If you are more familiar with XML Schema and have a schema validator installed you may prefer this set of zipped schemas.
Save these to disk and have a look at them in your favourite XML or text editor.
Since metadata describes the format of the data and where to find it on disk, it is used by the NITE software to validate the data as it is loaded and edited. This sort of direct validation is useful, but we also provide schema validation of data using a schema derived automatically from the metadata (via a stylesheet).
Assuming you have already downloaded and installed NOM, you already have the schema-generating stylesheet (it's in the lib directory). Armed with this and a stylesheet processor (xalan is also in the NOM distribution), you can run this command on your metadata file:
java org.apache.xalan.xslt.Process -in <your-metadata> -xsl generate-schema.xsl -out extension.xsd
This creates a schema file called extension.xsd, which imports two other static schema files, typelib.xsd and xlink.xsd, also in the lib directory of your NOM distribution. Put these static schema files in the same directory as the newly generated extension.xsd.
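The generation step and the copying of the static schema files can also be scripted. The following is only a rough sketch in Python, not part of the NITE distribution: the function names and the directory arguments are hypothetical, and it assumes that java is on your PATH, that xalan is on your classpath, and that generate-schema.xsl sits in the lib directory alongside typelib.xsd and xlink.xsd.

```python
import shutil
import subprocess
from pathlib import Path

# The two static schema files named in the text above.
STATIC_SCHEMAS = ("typelib.xsd", "xlink.xsd")


def xalan_command(metadata, stylesheet, output):
    """Build the xalan command line quoted above as an argument list."""
    return [
        "java", "org.apache.xalan.xslt.Process",
        "-in", str(metadata),
        "-xsl", str(stylesheet),
        "-out", str(output),
    ]


def copy_static_schemas(lib_dir, target_dir):
    """Copy the static schemas into the directory holding extension.xsd."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    return [
        Path(shutil.copy(Path(lib_dir) / name, target / name))
        for name in STATIC_SCHEMAS
    ]


def generate_schema(metadata, lib_dir, target_dir):
    """Run xalan on the metadata file, then gather the static schemas.

    Assumption: generate-schema.xsl lives in lib_dir and xalan is on the
    Java classpath; adjust both for your installation.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    cmd = xalan_command(
        metadata,
        Path(lib_dir) / "generate-schema.xsl",
        target / "extension.xsd",
    )
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return copy_static_schemas(lib_dir, target)
```

Calling generate_schema("my-metadata.xml", "lib", "validation"), with placeholder paths of your own, should leave extension.xsd, typelib.xsd and xlink.xsd together in one directory, as the instructions above require.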
If you have a schema validator (I use xsv), you are now ready to validate some data files. Try putting these declarations:
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="extension.xsd"
in the root element of your data file and then execute:
xsv <your-file>
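If you have many data files, adding the two declarations by hand gets tedious. Here is one possible way to do it with Python's standard library, offered as a sketch rather than a NITE tool: the function name add_schema_declaration is made up for illustration, and the sample document in the usage lines, including its namespace URI, is invented. Note that ElementTree re-serializes the file, so whitespace and attribute quoting may change and any comments are dropped.

```python
import io
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def add_schema_declaration(xml_text, schema_file="extension.xsd"):
    """Return xml_text with the xsi schema declarations on its root element."""
    # Re-register every prefix declared in the input so ElementTree writes
    # them back unchanged instead of inventing ns0, ns1, ...
    source = io.StringIO(xml_text)
    for _, (prefix, uri) in ET.iterparse(source, events=["start-ns"]):
        ET.register_namespace(prefix, uri)
    ET.register_namespace("xsi", XSI_NS)

    root = ET.fromstring(xml_text)
    # Setting a namespaced attribute makes ElementTree emit xmlns:xsi too.
    root.set("{%s}noNamespaceSchemaLocation" % XSI_NS, schema_file)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Sample data file invented for illustration only.
    sample = (
        '<nite:root xmlns:nite="http://nite.sourceforge.net/" nite:id="r1">'
        '<word nite:id="w1">hello</word>'
        "</nite:root>"
    )
    print(add_schema_declaration(sample))
```

The result can be written back to disk and passed to xsv as shown above.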
One of the major reasons for this approach to schema validation is that it lets us validate either a single file "as-serialized" by NITE, or files that have been transformed so that each nite:child element is replaced, recursively, by the element it points to, and each pointer is replaced by the element it refers to. The transformed form is useful for validating which types of elements can be children of a specific element and which can be pointed to by that element. In this way an entire corpus could be schema validated. The stylesheet that does this transformation is in the lib directory of your NOM distribution.
If this all seems rather involved, and your data already loads into the NOM, the program PrepareSchemaValidation.java will make a new directory for you which is fully ready for schema validation.
Last modified 04/17/06