Towards a Base Architecture for Spoken Language Transcript{s,tion}
Henry S. Thompson
Language Technology Group
HCRC, University of Edinburgh
26 September 1997
1. Acknowledgements
This work was carried out at the Human Communication
Research Centre, whose baseline funding comes from the UK Economic
and Social Research Council,
and at the Institute for Research on Cognitive Science, University
of Pennsylvania, funded by NSF and (D)ARPA.
2. Background
- HCRC Map Task
- Original Corpus
- Canadian Map Task Corpus
- Japanese Map Task Corpus
- New projects
- Project meetings
- Three-person Map Task
3. Background, cont'd
- Original Public Transcripts
- TEI-compliant SGML
- Semi-automatic silence-based
linking to speech
- Internal Transcripts
- Line per 'turn'
- Idiosyncratic typed coding lines
- Game/Move
- Reference
- Gaze
- Syntax
4. Evolutionary Pressure
- Practical needs
- Multiple annotators
- Cross-annotation querying
- Theoretical interests
- Details of talker contribution
timing
- Cross-linguistic study of 'back-channel'
utterances
5. Opportunity
- Funding from ESRC to time-stamp word boundaries
in the original Map Task Corpus
- Internal funding to re-engineer
the internal annotation standards
- Development of general stand-off
annotation architecture (MULTEXT into LT NSL)
6. Diversion: XML
- SGML-Lite or HTML-Heavy
- W3C charter to design 'SGML for
the Web'
- WWW demand for 'our own tags
for our Web pages'
- Three components
- Language: An easy-to-process
and -author subset of SGML
- Link: TEI/HyTime derived, superset
of HTML <A HREF=>
- Style: Superset of union of
CSS, DSSSL using XML and ECMAScript syntax
7. XML: Designed for Use
- Ease of implementation (cf. SGML)
- Separation of form and content
(cf. HTML)
- Four free implementations (in
Java and C) already available
8. LT XML: Free tools for XML applications
- Based on LT NSL, our SGML application toolkit
- One of many tributaries to the
XML stream
- Fast, non-validating parser
- Pipelined architecture
- Existing tools for search, transformation
- Event/Tree API for new applications
- Implemented in C
- Aimed at processing structured
data
9. LT XML: Basic Architecture
- Pipelines of 'fat' streams
- API provides primitives for XML-appropriate
input and output
- Two alternative views:
- micro-sequence: start-tag, comment,
char-data, end-tag, processing instruction
- tree-structure: sequence of
sub-trees, level ad lib.
- Built-in support for stand-off
annotation
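The two views can be illustrated with a minimal sketch. It uses Python's standard library (expat and ElementTree), not the LT XML API, whose C function names are not given in these slides; `DOC`, `micro_sequence` and `sub_trees` are hypothetical names chosen for illustration.

```python
import xml.parsers.expat
import xml.etree.ElementTree as ET

# Hypothetical sample document, loosely modelled on the word markup used later.
DOC = "<s><w id='w12'>Now</w><!-- note --><w id='w13'>is</w></s>"


def micro_sequence(xml_text):
    """Flat view: one (kind, value) item per markup or text event."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver each run of character data in one piece
    parser.StartElementHandler = lambda name, attrs: events.append(("start-tag", name))
    parser.EndElementHandler = lambda name: events.append(("end-tag", name))
    parser.CharacterDataHandler = lambda data: events.append(("char-data", data))
    parser.CommentHandler = lambda data: events.append(("comment", data))
    parser.ProcessingInstructionHandler = (
        lambda target, data: events.append(("proc-inst", target)))
    parser.Parse(xml_text, True)
    return events


def sub_trees(xml_text):
    """Tree view: the sequence of sub-trees directly below the root."""
    return [(child.tag, child.text) for child in ET.fromstring(xml_text)]


for kind, value in micro_sequence(DOC):
    print(kind, repr(value))
print(sub_trees(DOC))
```

The flat view suits streaming filters that never need the whole document in memory. The tree view is more convenient when a tool works one sub-tree (for example, one sentence) at a time.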
10. What's "Stand-off" Annotation?
- Separating annotation from the material being
annotated
- Three obvious reasons
- Base material may be read-only
and large
- Markup may involve multiple overlapping
hierarchies
- The base document may not be
freely distributable, or may of necessity reside somewhere else
11. Hyperlink Semantics for stand-off
- Links don't just mean "follow me"
- Browsing isn't the only application
- Links can point to more than
one 'unit' in the target document
- We need three types of link semantics
in the first instance
- Inclusion
- Replacement
- Inverse Replacement
12. An Aside about Architecture
- We assume (and supply :-) a pipelined or layered
architecture
- We allow an application to see
a document stream which is synthesised on the fly
- In other words, we combine the
markup and the base document on the fly
- The application sees the result
13. A Simple Example
- The BNC contains 2GB of British English text
- I want to add sentence markup
to it
Base:
<w id=w12>Now</w>
<w id=w13>is</w>
<w id=w14>the</w>
. . .
<w id=w28>party</w>
<c id=c4>.</c>
Standoff:
<s xml-type='link' show='include'
href='&f;#id(w12)..id(c4)'>
</s>
14. Simple example, cont'd
- What the application sees is
<s>
<w id=w12>Now</w>
<w id=w13>is</w>
<w id=w14>the</w>
. . .
<w id=w28>party</w>
<c id=c4>.</c>
</s>
<s>
. . .
</s>
15. Inclusion
- Two things happened:
- The linking attributes disappeared
- The linked-to material (the resource)
filled in the content of the linking element
- The first of these is common
to all our links
- The second is the particular
semantics of inclusion
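The two steps of inclusion can be sketched in a few lines. This is an illustration under stated assumptions, not the LT XML implementation: it uses Python's ElementTree, quotes the attribute values so that the sample is well-formed XML, drops the `&f;` entity in favour of a bare `#id(..)..id(..)` fragment, and assumes the base level is a flat sequence of elements. The names `resolve_range` and `include` are hypothetical.

```python
import copy
import re
import xml.etree.ElementTree as ET

# Assumed base document: a flat run of word and punctuation elements.
BASE = """<text>
<w id='w12'>Now</w><w id='w13'>is</w><w id='w14'>the</w>
<w id='w28'>party</w><c id='c4'>.</c>
<w id='w29'>Next</w>
</text>"""

# Assumed stand-off document: one sentence link over the range w12..c4.
STANDOFF = "<doc><s xml-type='link' show='include' href='#id(w12)..id(c4)'/></doc>"

LINK_ATTRS = ("xml-type", "show", "href")
RANGE = re.compile(r"#id\((\w+)\)(?:\.\.id\((\w+)\))?$")


def resolve_range(base_root, href):
    """Return the base elements from the first id to the last id, inclusive."""
    match = RANGE.search(href)
    if match is None:
        raise ValueError("unsupported href: " + href)
    first, last = match.group(1), match.group(2) or match.group(1)
    items = list(base_root)  # relies on the flat base level assumed above
    ids = [item.get("id") for item in items]
    return items[ids.index(first):ids.index(last) + 1]


def include(standoff_root, base_root):
    """Apply inclusion semantics to every show='include' link, in place."""
    for link in list(standoff_root.iter()):
        if link.get("show") != "include":
            continue
        targets = resolve_range(base_root, link.get("href"))
        for name in LINK_ATTRS:  # step 1: the linking attributes disappear
            link.attrib.pop(name, None)
        # step 2: the resource fills in the content of the linking element
        link.extend([copy.deepcopy(target) for target in targets])
    return standoff_root


merged = include(ET.fromstring(STANDOFF), ET.fromstring(BASE))
print(ET.tostring(merged, encoding="unicode"))
```

Copying the target elements leaves the base document untouched, which matches the read-only motivation for stand-off annotation.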
16. Simple Tools are Simple to Build
- Less than one page of C code to produce a simple
application
- Fast (and will be faster)
- Pipelines mean you can compose
simple tools for complex applications
- Shared API with LT NSL
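The idea of composing simple tools can be sketched with Python generators standing in for pipeline stages. This is only an analogy: the stage names below are hypothetical and do not reproduce the LT XML or LT NSL API, and the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample input.
DOC = "<text><w id='w12'>Now</w><c id='c1'>,</c><w id='w13'>is</w></text>"


def read_elements(xml_text):
    """Stage 1: stream every element of the document in document order."""
    yield from ET.fromstring(xml_text).iter()


def select_tag(elements, tag):
    """Stage 2: pass through only the elements with the given tag."""
    for element in elements:
        if element.tag == tag:
            yield element


def text_only(elements):
    """Stage 3: reduce each element to its text content."""
    for element in elements:
        yield element.text or ""


def count(items):
    """Terminal stage: count whatever arrives."""
    return sum(1 for _ in items)


# Two 'applications' built from the same small stages.
words = list(text_only(select_tag(read_elements(DOC), "w")))
punctuation_count = count(select_tag(read_elements(DOC), "c"))
print(words, punctuation_count)
```

Each stage knows nothing about its neighbours, so a new application is mostly a matter of choosing which stages to chain.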
17. Pre-constructed Tools
- Extract text content:
textonly
- Select fragments based on tags,
attributes and text content:
sggrep
- Count tags:
sgcount
- Production-system style transformation:
sgmltrans
- Simple pattern-based information
extraction:
sgrpg
- Indexing for fast access:
mkindex
- Composition of stand-off annotation:
knit
18. Availability
- Free to all for research use
- Executables and libraries for
Unix (Solaris, SunOS, Linux, FreeBSD) and Win32
- Sources for Unix
- Packaged executable for Mac
19. End Diversion: The Bottom Line
- Channel per talker recordings
- Document per talker transcripts
- Minimum commitment element-per-word
XML markup
- All
annotation in separate documents, linked onto base level
- Tools for GUI-based annotation,
cross-annotation intersection and tabulation, overlap display
20. The Base Level
- A flat, exhaustive cover
- Three tags
- tok: minimal (sub-word) unit (incl. clitic, sub-word)
- noi: non-lexical oral noise (incl. laugh, breath, xxx, )
- sil: silence
- Why tokens and not timeables
(single-morph syllable sequences)?
- multiple timeables: Saint
Joan
- less than a timeable: Joan's
- cross-cutting: Saint Joan's
21. Annotation
- Using stand-off annotation for everything
Base:
<tok id='t3' start='1.3'>Saint</tok>
<tok id='t4' start='1.7'>Joan</tok>
<tok id='t5' type='clitic'>+'s</tok>
Standoff:
<w id='w1' href='&f;#id(t3)..id(t4)' tag='pnn'/>
<w id='w2' href='&f;#id(t5)' tag='bez'/>
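A minimal sketch of how a tool might tabulate this word-level annotation against the base tokens follows. It is an illustration, not the project's tooling: it uses Python's ElementTree, wraps each fragment in an assumed root element (`base`, `words`), drops the `&f;` entity from the hrefs, and takes a word's start time from its first token. `token_span` and `word_table` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Base tokens from the slide, wrapped in an assumed <base> root.
BASE = """<base>
<tok id='t3' start='1.3'>Saint</tok>
<tok id='t4' start='1.7'>Joan</tok>
<tok id='t5' type='clitic'>+'s</tok>
</base>"""

# Word annotation from the slide, wrapped in an assumed <words> root.
STANDOFF = """<words>
<w id='w1' href='#id(t3)..id(t4)' tag='pnn'/>
<w id='w2' href='#id(t5)' tag='bez'/>
</words>"""

SPAN = re.compile(r"#id\((\w+)\)(?:\.\.id\((\w+)\))?$")


def token_span(tokens, href):
    """Return the tokens covered by a single-id or id..id reference."""
    match = SPAN.search(href)
    if match is None:
        raise ValueError("unsupported href: " + href)
    first, last = match.group(1), match.group(2) or match.group(1)
    ids = [token.get("id") for token in tokens]
    return tokens[ids.index(first):ids.index(last) + 1]


def word_table(base_xml, standoff_xml):
    """One row per <w>: its id, tag, covered token texts and start time."""
    tokens = list(ET.fromstring(base_xml))
    rows = []
    for word in ET.fromstring(standoff_xml).iter("w"):
        span = token_span(tokens, word.get("href"))
        start = span[0].get("start")  # may be absent, as on the clitic token
        rows.append({
            "id": word.get("id"),
            "tag": word.get("tag"),
            "tokens": [token.text for token in span],
            "start": float(start) if start is not None else None,
        })
    return rows


for row in word_table(BASE, STANDOFF):
    print(row)
```

Because the word layer only points at tokens, a second annotation layer over the same tokens can be resolved the same way and the two sets of rows compared by token id.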
22. Conclusion
- Inline representation of overlap is
- a swamp
- unscaleable
- a snare and a delusion
- a holdover from the days of subjective
exploration
- a presentation issue, not a content
issue
- a mistake!
- Do it right the first time!