Meta-data Summit
Henry S. Thompson

22 May 1998

1. Background

Stimulated by Martin Bryant, convened by Jim Mason (WG4)

2. Martin Bryant

Martin Bryant said his concerns were focussed on wanting to know how the rush to metadata was actually going to enable what we all want it for: faster, more focussed search and retrieval on the Web. C.f. Michel Biezunski's contribution on Topic Maps below.

3. Bob Schloss: RDF

Bob Schloss introduced RDF (Resource Description Framework). Same motivation as Martin, plus big-site (c.f. Lexis/Nexis) topic maps (note the overlap with Bryant and Biezunski's work).

Other W3C goals were also outlined.

Example of what we should be able to say: ``All documents on this site with URL prefix .... are copyright FMC except for these two.''
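
As a hedged sketch of how such an assertion might be tuplised and checked, here is a toy rule-plus-exceptions model in Python; the URL prefix, file names and property values are all invented, since the original example elides them:

    # Toy model of ``everything under this prefix is copyright FMC,
    # except these two''. All names here are hypothetical.
    SITE_PREFIX = "http://www.example.com/docs/"
    EXCEPTIONS = {SITE_PREFIX + "a.html", SITE_PREFIX + "b.html"}

    def copyright_holder(url):
        """Return the asserted copyright holder for a document URL."""
        if url.startswith(SITE_PREFIX) and url not in EXCEPTIONS:
            return "FMC"
        return None  # this assertion says nothing about other URLs

    assert copyright_holder(SITE_PREFIX + "intro.html") == "FMC"
    assert copyright_holder(SITE_PREFIX + "a.html") is None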

Three-way flexibility was identified as 'stretching' XML.

WebDAV has some input wrt querying (QUERYPROP).

The same issue we came up with yesterday with []: should query and retrieval work wrt short or long names?

Lots of examples which seem to me to imply inference rules: if dc:author is Mark Twain for a document, what about a collection with that document in it?
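
A hedged sketch of that inference question, with invented data and a hypothetical membership predicate: should a query for things authored by Twain also retrieve a collection one of his documents sits in?

    # Invented tuples: a document with an author, inside a collection.
    tuples = [
        ("doc1",  "dc:author", "Mark Twain"),
        ("coll1", "member",    "doc1"),
    ]

    def authored_by(who, infer_collections=False):
        """Without inference, only doc1 matches; with a (hypothetical)
        rule lifting dc:author from members to collections, coll1
        matches too."""
        direct = {s for (s, p, o) in tuples
                  if p == "dc:author" and o == who}
        if not infer_collections:
            return direct
        colls = {s for (s, p, o) in tuples
                 if p == "member" and o in direct}
        return direct | colls

    print(authored_by("Mark Twain"))        # {'doc1'}
    print(authored_by("Mark Twain", True))  # {'coll1', 'doc1'}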

All this reinforces the need to have the namespace declarations at the front!

4. Metadata and STEP/EXPRESS

Nigel Shaw gave a brief introduction to STEP and EXPRESS and identified a few places where what's there can clearly be seen as metadata.

5. Tim Bray

Because *ML documents are underlyingly [acyclic] directed graphs [he meant the entity graph, which is indeed acyclic, but I think the more interesting case is the grove, which is cyclic], SQL and traditional query technology are inappropriate/unusable. So we're not in good shape for querying. RDF should help; it is designed to fix this problem.

HT asks: but isn't RDF directed graphs too?

Response: Yes, but it's the tuplisation which is crucial. Maybe *ML documents can be tuplised too, but the RDF tuplisation has useful property names (e.g. dc:author) because the graph labels are useful, whereas the grove tuplisation does not have useful labels (e.g. attrVal), because the property-set properties (i.e. the graph labels in the grove) are targeted towards document syntax, not semantics.
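
To make the contrast concrete, here is a hedged sketch; the property names are illustrative, not drawn from any published property set:

    # RDF-style tuplisation: the labels carry domain semantics.
    rdf_tuples = [
        ("doc1", "dc:author", "Mark Twain"),
        ("doc1", "dc:title",  "Life on the Mississippi"),
    ]

    # Grove-style tuplisation: the labels describe document syntax,
    # so the domain-significant strings are buried in the values.
    grove_tuples = [
        ("elem42", "gi",      "author"),
        ("elem42", "content", "Mark Twain"),
        ("elem42", "parent",  "elem41"),
    ]

A query for documents authored by Mark Twain is a direct match on the first representation; on the second it requires knowing that the string appears as the content of an element whose generic identifier happens to be 'author'.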

But note that this raises the ugly matter of inference again: if we have (tuples for) ``[Henry believes that] Marlow is [not] the author of this document'', all the alternatives will include the tuple [doc]--dc:author-->[Marlow], but only in a subset of them should a query regarding that tuple, i.e. one looking for documents authored by Marlow, retrieve this document.
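
One way to picture the problem, sketched with invented names: reify each tuple into a quad whose first component records the authority or belief context, and restrict queries to trusted contexts.

    quads = [
        ("Henry-believes", "doc1", "dc:author", "Marlow"),
        ("asserted",       "doc2", "dc:author", "Marlow"),
    ]

    def documents_by(author, trusted_contexts):
        """Retrieve only tuples asserted in a trusted context."""
        return [s for (c, s, p, o) in quads
                if p == "dc:author" and o == author
                and c in trusted_contexts]

    # A naive query over all contexts would wrongly retrieve doc1;
    # restricting to trusted contexts does not.
    print(documents_by("Marlow", {"asserted"}))  # ['doc2']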

6. [Some ISO group rep]: SQL

They're playing catch-up wrt querying into non-tuple-friendly data sources, in particular OO data (using the Java model) and multimedia, e.g. full-text (!) and spatial data.

7. Michel Biezunski: Topic Maps

Topic Maps provide external views on existing information-object repositories: an intermediate semantic modelling layer between infoglut and structured views.

Traditional navigational devices (indices, glossaries, thesauri) can be seen as resolved queries in a link database. But the TM activity is not concerned with query internals or presentation.
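
A minimal sketch of 'an index is a resolved query', with invented data: precompute, for each term, the answer the link database would give at query time.

    # Invented link database: (term, document) pairs.
    links = [("apple", "doc1"), ("apple", "doc3"), ("banana", "doc2")]

    index = {}
    for term, doc in links:
        index.setdefault(term, []).append(doc)

    # The index entry for 'apple' is just this stored query result:
    print(index["apple"])  # ['doc1', 'doc3']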

TM uses architectures; there is already a draft standard.

Is RDF an instantiation or a subset of TM?

8. Other bits

Steve Zilles summarised the state of play wrt XSL. There was the usual observation that a lot of things that look like queries are playing a role in a lot of different initiatives, but it's not clear they can be unified.

Someone from INSO described the ISO effort regarding graphical information and metadata.

Someone from the US DOD whose remit is interfacing the standards community, and DOD's involvement therewith, with the commercial technology community. Also the Joint Electronic Commerce Program Office (JECPO). Failure of an X12-based effort (whazzat?).

Catherine Lupovici, ISO TC46, Jouve, bibliographic activity. Extracting metadata from object data, and adding more, i.e. id codes. Seeking to standardise presentation (!), e.g. that there is a title page, so that the extraction task is possible. One of the progenitors of Dublin Core, pushing it into TC46 on fast-track. C.f. also ISO 23950.

Tom Layton (RAND, Illinois, HL7): data-warehousing, data-mining, evidence-based medicine. Lack of access to and interest in the expertise of the document community is a pervasive weakness in healthcare's exploitation of technology. Ref. Dudek and other German and Swiss work where marking up existing data streams provided obvious big payoffs.

HT said something very brief about XML-Data.

9. Charles Goldfarb: Document Description Languages (DDL)

An ontology of documents, notations, markup, etc., which I didn't think was quite right.

Annex K (out for ballot? debate ensued) allows the external subset, or entities included into the internal subset, to be declared as being in a notation other than SGML DTD syntax, for specifying e.g. content-model constraints. The implication, confirmed by Charles, is that the standard does not specify how the schema interpreter pointed to by such a notation declaration shall communicate the information contained in the schema, just that that information should provide PRLGABS1-type information.

HT expressed worry about the processing implications.

SGML will grandfather in the lexical and semantic strong typing stuff from HyTime.

More articulation of architecture stuff: details not clear.

More articulation of property sets, groves.

Grove-based approach to defining querying.

10. Eliot Kimber

Phyllis: a general grove-builder and HyTime engine. It uses property sets for everything, e.g. GIFs, JPEGs and MIDI as well as *ML documents.

Query: will it scale to terabit datasets?

Answer (Eliot): Abstract first, optimise later. Others: STEP is being used for data of this scale.

Define query languages in terms of an abstraction, e.g. groves; then many syntaxes can be cashed out with the same or similar underlying semantics.

11. Thoughts

Charles says that in putting pointers to runnable code in documents for a critical step in the SGML parsing process, he's only making manifest the crunch issue for XML data. Jean says not to worry: if schema interpreters are constrained to say what they've learned in terms of PRLGABS1, then a posteriori that can be expressed for transfer to the parser using angle-bang syntax. My view is that the well-formedness/validity distinction for XML allows a simpler architecture for XML, as per my SGML/XML '97 paper: schema-valid means (a) the document is well-formed, which a vanilla XML parser can answer without reference to the schema interpreter, and (b) it respects the constraints expressed in the schema, which the schema interpreter (itself an XML application) can answer, confident in the knowledge that its input (both schema and instance) is well-formed.
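
A minimal sketch of that two-layer architecture, assuming nothing beyond a vanilla XML parser; the function names are mine, and the schema-constraint check is stubbed, since its vocabulary is exactly what is at issue:

    from xml.parsers import expat

    def is_well_formed(xml_text):
        """Layer 1: a vanilla XML parser; no schema knowledge needed."""
        parser = expat.ParserCreate()
        try:
            parser.Parse(xml_text, True)
            return True
        except expat.ExpatError:
            return False

    def satisfies_schema(xml_text, schema_text):
        """Layer 2: the schema interpreter, itself just an XML
        application. Stubbed: a real one would check the constraints
        expressed in the schema."""
        return True

    def schema_valid(xml_text, schema_text):
        # The interpreter may assume both of its inputs are well-formed.
        return (is_well_formed(xml_text)
                and is_well_formed(schema_text)
                and satisfies_schema(xml_text, schema_text))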

Three components are necessary [sufficient?] for a viable solution [to Martin's goal right at the top] (by tuplisable I mean expressible in terms of a set of fixed-arity predications with a fixed inventory of predicates); see the sketch after this list:

  1. Metadata must be tuplisable (so the well-understood big-ticket query engines can go to work, e.g. SQL engines);
  2. The tuple-predicate vocabulary/ies have to have [universally] documented semantics, e.g. Dublin Core. There was debate here about the universality requirement: this doesn't mean the vocabularies are part of the standard, just that a condition on utility is broad agreement; the market will tell.
  3. One level of meta-meta data has to be standardised as well, namely authority: who says so, with what conviction.
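
A hedged sketch of the three criteria working together, using an off-the-shelf SQL engine over an invented table layout: fixed-arity tuples (criterion 1), a documented predicate vocabulary such as Dublin Core (criterion 2), and an authority column (criterion 3).

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE metadata (
                     authority TEXT,   -- criterion 3: who says so
                     subject   TEXT,
                     predicate TEXT,   -- criterion 2: documented vocabulary
                     object    TEXT)""")
    con.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)",
        [("publisher", "doc1", "dc:author", "Mark Twain"),
         ("publisher", "doc1", "dc:title",  "Life on the Mississippi"),
         ("reviewer",  "doc2", "dc:author", "Mark Twain")])

    # Criterion 1 pays off: a stock SQL engine answers the query.
    rows = con.execute("""SELECT subject FROM metadata
                          WHERE predicate = 'dc:author'
                            AND object = 'Mark Twain'""").fetchall()
    print(rows)  # [('doc1',), ('doc2',)]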

RDF is designed to meet these criteria. Transforming the 'syntactic' grove you get using the published SGML property set into a 'semantic' grove, and then tuplising wrt the property set, is asserted to be another route to meeting the criteria. The predicates in the syntactic grove are things like attribute, child and otsrelpn (:-). The predicates in the semantic grove are (typically) the terminals from the syntactic grove, and (accordingly) are significant in the object domain, i.e. have domain semantics, e.g. author, price and presentingSymptom. I also think it would be possible to specify an informal architecture for the use of XML-Link which would work as well.
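
As a hedged sketch of the syntactic-to-semantic move, with an invented mapping rule (a real one would be driven by a semantic property set):

    # Syntactic tuples: labels from the document property set.
    syntactic = [
        ("n1", "gi",      "book"),
        ("n2", "gi",      "author"),
        ("n1", "child",   "n2"),
        ("n2", "content", "Mark Twain"),
    ]

    def semanticise(tuples):
        """Promote element content to a predication named by the
        child element's generic identifier, attached to the parent."""
        gi      = {s: o for (s, p, o) in tuples if p == "gi"}
        content = {s: o for (s, p, o) in tuples if p == "content"}
        return [(s, gi[o], content[o])
                for (s, p, o) in tuples
                if p == "child" and o in content]

    print(semanticise(syntactic))  # [('n1', 'author', 'Mark Twain')]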