XML stands for eXtensible Markup Language [1]. It is a simple standard way
to mark up the structure of documents, and is the responsibility of the W3C
(the World Wide Web Consortium). It combines the simplicity and ease of use of
HTML, with the power and flexibility of SGML. Here's a simple example of what
it looks like (we'll call this sample.xml
further down):
XML itself determines the syntax which distinguishes markup from the text marked up: the angle brackets, equals signs and quotes in the above. The document designer determines the markup vocabulary: the element types (p, s, w and c in the above), the attribute types (id and pos) and their grammar, e.g. the fact that a p may contain a number of s elements.
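The split between fixed syntax and designer-chosen vocabulary can be sketched in Python with the standard library. This is only an illustration: the document below is hypothetical and is not the sample.xml referred to above. Only the names p, s, w, c, id and pos come from the text; the words and the pos values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only the names p, s, w, c, id and pos are taken
# from the text; the content and the pos values are made up.
SAMPLE = """<p id="p1">
  <s id="s1"><w pos="DT">The</w> <w pos="NN">cat</w> <w pos="VBD">sat</w><c>.</c></s>
  <s id="s2"><w pos="PRP">It</w> <w pos="VBD">purred</w><c>.</c></s>
</p>"""

# The parser only knows XML syntax (angle brackets, quotes, nesting).
root = ET.fromstring(SAMPLE)

# The vocabulary is whatever the document designer chose to use.
element_types = sorted({el.tag for el in root.iter()})
attribute_types = sorted({name for el in root.iter() for name in el.attrib})

print(element_types)
print(attribute_types)
```

The parser needs no knowledge of p, s, w or c in advance; it recovers them from the markup itself.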
The tremendous advantage of XML lies in its very simple low-level syntax, which makes it possible to write very fast and light-weight XML parsers (see [2] for pointers to a number of them). Since XML itself provides a mechanism for specifying the markup grammar of a document or family of documents, a single XML parser can be used for many different types of document without modification.
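As a rough illustration of one parser serving several document types, the sketch below feeds two unrelated, hypothetical documents to the same function. Each carries its own grammar as an internal DTD subset. Note the assumption: Python's ElementTree is a non-validating parser, so it reads past the declarations rather than enforcing them.

```python
import xml.etree.ElementTree as ET

# Two hypothetical document types, each carrying its own grammar.
DOC_A = """<!DOCTYPE p [
  <!ELEMENT p (s+)>
  <!ELEMENT s (#PCDATA)>
]>
<p><s>One sentence.</s><s>Another one.</s></p>"""

DOC_B = """<!DOCTYPE lexicon [
  <!ELEMENT lexicon (entry*)>
  <!ELEMENT entry (#PCDATA)>
  <!ATTLIST entry pos CDATA #REQUIRED>
]>
<lexicon><entry pos="NN">cat</entry></lexicon>"""


def outline(xml_text):
    """Return the element types of a document in document order."""
    root = ET.fromstring(xml_text)
    return [el.tag for el in root.iter()]


# The same unmodified code handles both vocabularies.
outline_a = outline(DOC_A)
outline_b = outline(DOC_B)
print(outline_a)
print(outline_b)
```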
For language resources, this is a great step forward, as it means an end to the all-too-common necessity of writing yet another parser each time you get a new resource to work with. Already major providers of language resources such as the LDC ([8]) and ELRA ([9]) are delivering resources marked up using XML.
LT XML ([3]), developed by the Language Technology Group ([4]) of the Human Communication Research Centre ([5]) at the University of Edinburgh, is an XML parser with a flexible API, together with a large collection of pre-built tools for processing XML-marked-up material. LT XML is free for non-commercial use, and is available in both source (for UN*X, WIN32 and Macintosh) and binary (for WIN32) distributions. Over 3000 licenses have been issued to a wide range of institutions in both Europe and further afield.
LT XML's pre-built command-line tools include the following:
textonly | Extracts the text content and adds separators.
sgcount | Tabulates element type usage.
sggrep | Provides powerful search and filtering.
sgrpg | Combines complex searching with reformatting. For sophisticated use a control file, itself written in XML, is required; only a restricted subset of the functionality is available from the command line.
sgsort | Sorts sub-elements by content.
sgmltrans | Production-rule-style reformatting (comes with a sample trivial XML-to-LaTeX downtranslator).
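To give a feel for the kind of processing the first two tools perform, here is a small Python sketch of text extraction and element-type tabulation. It is an analogue only, not the LT XML implementation: the function names, the choice of s as the unit to separate, and the output format are all assumptions made for illustration, and the input document is hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical input document.
DOC = "<p><s><w>Hello</w> <w>world</w><c>!</c></s><s><w>Bye</w><c>.</c></s></p>"


def text_only(root, element="s", separator="\n"):
    """Text content of each `element`, joined by `separator` (textonly-like)."""
    return separator.join("".join(el.itertext()) for el in root.iter(element))


def count_element_types(root):
    """Number of occurrences of each element type (sgcount-like)."""
    return Counter(el.tag for el in root.iter())


root = ET.fromstring(DOC)
extracted = text_only(root)
counts = count_element_types(root)
print(extracted)
print(dict(counts))
```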
Each of these tools is reasonably powerful in its own right, but a crucial property of the LT XML architecture, made possible by the fact that XML documents can carry their own structure definitions with them, is that pipelines of tools can be composed for complex tasks.
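The pipeline idea can be sketched in Python as a chain of stages that each take XML text in and hand XML text on, much as command-line tools would pass a document through a pipe. The two stages below are hypothetical and do not correspond to any LT XML tool; the attribute name cap and the id scheme are invented for the example.

```python
from functools import reduce
import xml.etree.ElementTree as ET


def mark_capitalised(xml_text):
    """Hypothetical stage: add cap="yes" to each w that starts upper-case."""
    root = ET.fromstring(xml_text)
    for w in root.iter("w"):
        if w.text and w.text[0].isupper():
            w.set("cap", "yes")
    return ET.tostring(root, encoding="unicode")


def number_sentences(xml_text):
    """Hypothetical stage: give each s element an id of the form s1, s2, ..."""
    root = ET.fromstring(xml_text)
    for i, s in enumerate(root.iter("s"), start=1):
        s.set("id", f"s{i}")
    return ET.tostring(root, encoding="unicode")


def run_pipeline(xml_text, stages):
    """Feed the output of each stage to the next one, in order."""
    return reduce(lambda text, stage: stage(text), stages, xml_text)


DOC = "<p><s><w>Edinburgh</w> <w>is</w> <w>cold</w></s></p>"
result = run_pipeline(DOC, [mark_capitalised, number_sentences])
print(result)
```

Because every stage reads and writes well-formed XML, stages can be reordered, removed or inserted without the others needing to change.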
The pre-built LT XML tools are based on the LT XML API (Application Programming Interface). Users can write C programs using this interface to tackle more complex and sophisticated tasks. The API offers both a low-level (event-orientated) and high-level (tree/sub-tree orientated) view of XML documents, and is based on RXP ([6]), a very fast XML parser.
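The difference between the event-orientated and tree-orientated views can be illustrated with Python's standard library. This is not the LT XML C API, whose function names are not given here; it is only an analogous sketch over a hypothetical document.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input document.
DOC = "<p><s><w>Hello</w> <w>world</w></s></p>"

# Event-orientated view: the parser reports start and end tags as it
# meets them, so the document never has to be held in memory as a whole.
events = [
    (event, el.tag)
    for event, el in ET.iterparse(io.BytesIO(DOC.encode("utf-8")),
                                  events=("start", "end"))
]

# Tree-orientated view: the whole document is built first, then
# sub-trees are addressed directly.
root = ET.fromstring(DOC)
words = [w.text for w in root.findall("./s/w")]

print(events)
print(words)
```

The event view suits large inputs and streaming tools; the tree view is more convenient when a task needs to look at an element together with its context.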
The power and flexibility of XML in general, and of the LT XML architecture in particular, were crucial to the LTG's ability to put together our first entry in the most recent DARPA Message Understanding Conference (MUC). The LTG's entry was for the named entity recognition task, and was composed of 17 stages in a pipeline ([7]). It came top of all the entries for that task, and generated a lot of interest in the US language technology community because of the ease of (re)configuration that the pipelined LT XML architecture provided.
The W3C is sponsoring further development of standards associated with XML, including XSL (eXtensible Stylesheet Language), XML-Link (support for inter-document linking) and XML Schema (extending XML's facilities for defining the structure of XML documents).
We're still expanding and extending LT XML. The next release will include support for validation, a Python-language interface to the API for rapid prototyping, and support for automatically generating graphical user interfaces for common annotation tasks.
LT XML grew out of work on the MULTEXT project, sponsored by the European Union through the LRE programme. More recent development has been funded by the UK Economic and Social Research Council via HCRC's core funding, by the UK Engineering and Physical Sciences Research Council via project NSCOPE, and by Sun Microsystems and Microsoft.