Current Status of XSV: Coverage, Known Bugs, etc.

Applies to XSV 3.1-1 of 2007/12/11 16:20:05

Henry S. Thompson
Richard Tobin
11 December 2007

1. What is XSV

XSV (XML Schema Validator) is an open source (GPLed) work-in-progress attempt at a conformant schema-aware processor, as defined by XML Schema Part 1: Structures, Second Edition of 28 October 2004. It has been developed at the Language Technology Group of the Human Communication Research Centre in the School of Informatics at the University of Edinburgh, with support for one of us (Thompson) from the World Wide Web Consortium.

2. How can I use XSV

2.1. Using XSV online

The online service at the W3C has been retired. I recommend using the Xerces XML Schema 1.1 validator now.

2.2. Running XSV at your own installation

2.2.1. Win32 one-click installation

I've packaged the current version up into a self-installing package for Win32 platforms: just fetch it, run it, and add the installation directory to your PATH, then

> xsv [flags] target [schemas ...]
target
The document to be processed. Must be a URL, relative or absolute; note this means forward slashes only, even on Win32 -- e.g. file:///C:/Project/xxx.xml.
schemas
Schema documents to process it with, also URLs.
-o errfile
Output error file to errfile rather than stderr.
-s stylefile
Include an XSL style PI to stylefile in the error output.
-r [alt|ind]
Reflect the augmented document infoset as an XML file to stdout. Follow with alt to force old-style (alternating normal form) reflection, or ind (the default) for new-style (individual normal form) reflection. Use -r -r to get all schema components other than those of the schema for schemas, and -r -r -r to get the complete PSVI reflection including the schema for schemas.
-w
Include warnings in error output.
-t
Show stage timings.
-k
Attempt instance validation even if the schema(s) have errors.
-i
Treat all inputs as schemas; assume they are meant to be complete and check them as such.
-D
Use DTD to pre-validate, not built-in schema-for-schemas.
-l
Scan the whole document for schema location hints, not just root and new-namespace-binding-introducers.
-E elt
Force the document element to be named elt, an expanded name (i.e. either an unqualified simple name in no known namespace, or a name of the form {namespaceName}localName).
-T type
Force the document element to be validated against the type definition named type, an expanded name as for -E.
-N
Don't dereference namespace URIs looking for schema documents.
-e
Preserve the low-level error transcript file.
-n
Output the input document with normalized values and defaults.
-u URI
Provide a base URI for target and schemas.
-d
Show a backtrace if a crash occurs.
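For example, to validate an instance document against a schema document, including warnings and sending the error report to a file (the file names here are purely illustrative):

> xsv -w -o errors.xml file:///C:/Project/doc.xml file:///C:/Project/doc.xsd

or, to check a schema document on its own for completeness:

> xsv -i file:///C:/Project/doc.xsd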

2.2.2. Source distributions for the more adventurous

You can download the (Python) sources from the W3C public CVS repository, install Python 2.4 and do:

> [set PYTHONPATH to wherever you installed XSV sources]
> python .../XSV/commandLine.py ...

No, the above instructions aren't sufficiently detailed, but you probably don't want the sources unless you can figure out how to make it work :-)

Previous versions of XSV required our fast Python/C hybrid XML parser -- this version should work with either PyLTXML or PyXML:

PyLTXML
Downloads: Be sure to use the most recent version, currently PyLTXML-1.3, release 9. Installers for a number of architectures are now available.
PyXML
Downloads. Although no longer supported, the old version 0.8.4 contains updated code that is available nowhere else and is necessary for XSV to work properly.

2.2.3. Linux RPMs and DEBs

Packages are now available for those running some versions of Linux:

These depend on PyLTXML-1.3 (and on Python itself -- they were all built with Python 2.4); see above.

2.2.4. Source tarball

A simple tar ball is also available, suitable for installation using Python's distutils:

> [cd to wherever you unpacked the tarball]
> python setup.py install

3. What is implemented

The basic framework of schema checking and instance schema-validation is implemented. Some details of both are not yet filled in.

I've implemented a new bounded-cost approach to translating content models with numeric exponents. In a change from the earlier (2.10) release, this now correctly handles even obscure corner cases involving numeric ranges nested inside numeric ranges.
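As an illustration of the kind of construct involved (the element name here is invented), consider a numeric occurrence range nested inside another numeric occurrence range:

> <xs:sequence minOccurs="2" maxOccurs="4">
>   <xs:element name="item" minOccurs="1" maxOccurs="3"/>
> </xs:sequence>

Naively expanding such a model into a plain finite-state automaton by unrolling the repetitions can blow up combinatorially as ranges nest; a bounded-cost translation keeps the size of the compiled content model under control while still enforcing both the inner and outer minimum and maximum counts.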

Here's a brief tabulation of implemented and unimplemented aspects of the REC:

3.1. Implemented at least in part

3.2. Not implemented yet

3.3. Recent Changes

3.4. Known bugs/features