NITE XML Toolkit - Collecting Data for NXT use

This web page contains specific information for potential users who need NXT's features about how to record, transcribe, and otherwise mark up their data before up-translation to NXT. NXT's earliest users have mostly been from computational linguistics projects. This is partly because of where it comes from - it arose out of a collaboration among two computational linguistics groups and an interdisciplinary research centre - and partly because for most uses, its design assumes that the projects that use it will have access to a programmer to set up tailored tools for data coding and to get out some kinds of analysis, or at the very least someone on the project will be willing to look at XML. However, NXT is useful for linguistics and psychology projects based on corpus methods as well. This web page is primarily aimed at them, to tell them problems to look out for, help them assess what degree of technical help they will need in order to carry out the work successfully, and give a sense of what sorts of things are possible with the software.

Recording

Signal Formats

For information on media formats and JMF, see How to play media in NXT.

It is a good idea to produce a sample signal and test it in NXT (and any other tools you intend to use) before starting recording proper, since changing the format of a signal can be confusing and time-consuming. There are two tests that are useful. The first is whether you can view the signal at all under any application on your machine, and the second is whether you can view the signal from NXT. The simplest way of testing the latter is to name the signal as required for one of the sample data sets in the NXT download and try the generic display or some other tool that uses the signal. For video, if the former works and not the latter, then you may have the video codec you need, but NXT can't find it - it may be possible to fix the problem by adding the video codec to the JMF Registry. If neither works, the first thing to look at is whether or not you have the video codec you need installed on your machine. Another common problem is that the video is actually OK, but the header written by the video processing tool (if you performed a conversion) isn't what JMF expects. This suggests trying to convert in a different way, although some brave souls have been known to modify the header in a text editor.

We have received a request to implement an alternative media player for NXT that uses QT Java (the QuickTime API for Java) rather than JMF. This would have advantages for Mac users and might help some PC users. We're currently considering whether we can support this request.

Capturing Multiple Signals

Quite often data sets will have multiple signals capturing the same observation (videos capturing different angles, one audio signal per participant, and so on). NXT expresses the timing of an annotation by offsets from the beginning of the audio or video signal. This means that all signals should start at the same time. This is easiest to guarantee if they are automatically synchronized with each other, which is usually done by taking the timestamp from one piece of recording equipment and using it to overwrite the locally produced timestamps on all the others. (When we find time to ask someone who is technically competent exactly how this is done, we'll insert the information here.) A distant second best to automatic synchronization is to provide some audibly and visibly distinctive event (hitting a colourful children's xylophone, for instance) that can be used to manually edit the signals so that they all start at the same time.

Using Multiple Signals

Most coding tools will allow only one signal to be played at a time. It's not clear that more than this is ever really required, because it's possible to render multiple signals onto one. For instance, individual audio signals can be mixed into one recording covering everyone in the room, for tools that require everyone to be heard on the interface. Soundless video or video with low quality audio can have higher quality audio spliced onto it. For the purposes of a particular interface, it should be possible to construct a single signal to suit, although these might be different views of the data for different interfaces (hence the requirement for synchronization - it is counter-productive to have different annotations on the same observation that use different time bases). The one sticking point is where combining multiple videos into one split-screen view results in an unacceptable loss of resolution, especially in data sets that do not have a "room view" video in addition to, say, individual videos of the participants.

It is technically possible in the current NXT (1.2.9) to show more than one signal by having the application put up more than one media player. The players have separate play buttons. When "synchronize" is chosen and a play button is pressed, the signal for that play button is played, with other signals (and the text-based display windows) taking their timings from that signal. That means that other signals aren't playing, but they are updating the best they can to keep along with the playing signal. Some people have used this to show one audio with one (soundless) video. (It's important here to play the audio and use that to drive the video, since the video can try to display by selecting frames, whereas the audio can't.) Whether this even comes close to working depends on machine performance and how much processing the video format requires. If you intend to rely on it, you should test your formats and signal configuration on your chosen platform carefully.

We know that NXT can be fixed to allow properly for multiple signals because we've tested it with one audio and multiple videos, and we intend to make that change very soon, but it may lag the NXT 1.3.0 release.

Transcription

One of the real benefits of using NXT is the fact that it puts together timing information and linguistic structure. This means that most projects transcribing data with an eye to using NXT want a transcription tool that allows timings to be recorded. For rough timings, a tool with a signal(audio or video) player will do, especially if it's possible to slow the signal down and go back and forth a bit to home in on the right location (although this greatly increases expense over the sort of "on-line" coding performed simply by hitting keys for the codes as the signal plays). For accurate timing of transcription elements - which is what most projects need - the tool must show the speech waveform and allow the start and end times of utterances (or even words) to be marked using it.

NXT does not provide any interface for transcription. It's possible to write an NXT-based transcription interface that takes times from the signal player, but no one has. Providing one that allows accurate timestamping is a major effort because NXT doesn't (yet?) contain a waveform generator. For this reason, you'll want to do transcription in some other tool and import the result into NXT.

Using special-purpose transcription tools

There are a number of special-purpose transcription tools available. For signals that effectively have one speaker at a time, most people seem to use Transcriber or perhaps TransAna. For group discussion, channelTrans which is a multi-channel version of Transcriber, seems to be the current tool of choice. iTranscribe is a ground-up rewrite of it that is currently in pre-release.

Although we have used some of these tools, we've never evaluated them from the point of view of non-computational users (especially whether or not installation is difficult or whether in practice they've required programmatic modification), so we wouldn't want to endorse any particular one, and of course, there may well be others that work better for you.

Transcriber's transcriptions are stored in an XML format that can be up-translated to NXT format fairly simply. TransAna's are stored in an SQL database, so the up-translation is a little more complicated; we've never tried it but there are NXT users who have exported data from SQL-based products into whatever XML format they support and then converted that into NXT.

Using programs not primarily intended for transcription

Some linguistics and psychology-based projects use programs they already have on their computers (like Microsoft Word and Excel) for transcription, without any modification. This is because (a) they know they want to use spreadsheets for data analysis (or to prepare data for importation into SPSS) and they know how to get there from here, (b) because they can't afford software licenses but they know they've already paid for these ones; and (c) they aren't very confident about installing other software on their machines.

Using unmodified standard programs can be successful, but it takes very careful thought about the process, and we would caution potential users not to launch blindly into it. We would also argue that since there are now programs specifically for transcription that are free and work well on Windows machines, there is much less reason for doing this than there used to be. However, whatever you do for transcription, you want to avoid the following.

hand-typing times (for instance, from a display on the front of a VCR), because the typist will get them wrong
hand-typing codes (for instance, {laugh}, because the typist will get them wrong

In short, avoid hand-typing anything but the orthography, and especially anything involving numbers or left and right bracketing. These are practices we still see regularly, mostly when people ask for advice about how to clean up the aftermath. Which is extremely boring to do, because it takes developing rules for each problem ({laughs}, {laugh, laugh), laugh, {laff}, {luagh}... including each possible way of crossing nested brackets accidentally), inspecting the set as you goes to see what the next rule should be. Few programmers will take on this sort of job voluntarily (or at least not twice), which can make it expensive. It is far better (...easier, less stressful, better for staff relations, less expensive...) to sort out your transcription practices to avoid these problems.

More as a curiosity than anything else, we will mention that it is possible to tailor Microsoft Word and Excel to contain buttons on the toolbars for inserting codes, and to disable keys for curly brackets and so on, so that the typist can't easily get them wrong. We know of a support programmer who was using these techniques in the mid-90s to support corpus projects, and managed to train a few computationally-unskilled but brave individuals to create their own transcription and coding interfaces this way. If you really must use these programs, you really should consider these techniques. (Note to the more technical reader or anyone trying to find someone who knows how it works these days: the programs use Visual Basic and manipulate Word and Excel via their APIs; they can be created by writing the program in the VB editor or from the end user interface using the "Record Macro" function, or by some combination of the two.) In the 90's, the Microsoft platform changed every few years in ways that required the tools to be continually reimplemented. We don't know whether this has improved or not.

Up-translating transcriptions prepared in these programs to NXT can be painful, depending upon exactly how the transcription was done. It's best if all of the transcription information is still available when you save as "text only". This means, for instance, avoiding the use of underlining and bold to mean things like overlap and emphasis. Otherwise, the easiest treatment is to save the document as HTML and then write scripts to convert that to NXT format, which is fiddly and can be unpalatable.

Using Forced Alignment with Speech Recognizer Output to get Word Timings

Timings at the level of the individual word can be useful for analysis, but they are extremely expensive and tedious to produce by hand, so most projects can only dream about them. It is actually becoming technically feasible to get usable timings automatically, using a speech recognizer. By "becoming", we mean that computational linguistics projects, who have access to speech specialists, know how to do it well enough that they think of it as taking a bit of effort but not requiring particular thought. This is a very quick explanation of how, partly in case you want to build this into your project and partly because we're considering whether we can facilitate this process for projects in general (for instance, by working closely with one project to do it for them and producing the tools and scripts that others would need to do forced alignment, as a side effect). Please note that the author is not a speech researcher or a linguist; she's just had lunch with a few, and not even done a proper literature review. That means that we don't guarantee everything in here is accurate, but that we are taking steps to understand this process and what we might be able to do about it. For better information, one possible source is Lei Chen, Yang Liu, Mary Harper, Eduardo Maia, and Susan McRoy, Evaluating Factors Impacting the Accuracy of Forced Alignments in a Multimodal Corpus, LREC 2004, Lisbon Portugal.

Commercial speech recognizers take an audio signal and give you their one best guess (or maybe n best guesses) of what the words are. Research speech recognizers can do this, but for each segment of speech, they can also provide a lattice of recognition hypotheses. A lattice is a special kind of graph where nodes are times and arcs (lines connecting two different times) are word hypotheses, meaning the word might have been said between the two times, with a given probability. The different complete things that might have been said can be found by tracing all the paths from the start time to the end time of the segment, putting the word hypotheses together. (The best hypothesis is then the one that has the highest overall probability, but that's not always the correct one.) If you have transcription for the speech that was produced by hand and can therefore be assumed to be correct, you can exploit the lattice to get word timings by finding the path through the lattice for which the words match what was transcribed by hand and transferring the start and end times for each word over to the transcribed data. This is what is meant by "forced alignment". HTK, one popular toolkit that researchers use to build their speech recognizers, comes with forced alignment as a standard feature, which means that if your recognizer uses it, you don't have to write a special purpose program to get the timings out of the lattice and onto your transcription. Of course, it's possible that other speech recognizers do this to and we just don't know about it.

The timings that are derived from forced alignment are not as accurate as those that can be obtained by timestamping from a waveform representation, but they are much, much cheaper. Chen et al. 2004 has some formal results about accuracy. Speech recognizers model what is said by recognizing phonemes and putting them together into words, so the inaccuracy comes from the kinds of things that happen to articulation at word boundaries. This means that, to hazard a guess, the accuracy isn't good enough for phoneticians, but it is good enough for researchers who are just trying to find out the timing relationship between words and events in other modalities (posture shifts, gestures, gaze, and so on). The timings for the onset and end of a speech segment are likely to be more accurate than the word boundaries in between.

The biggest problem in producing a forced alignment is obtaining a research speech recognizer that exposes the lattice of word hypotheses. The typical speech recognition researcher concentrates on accuracy in terms of word error rates (what percentage of words the system gets wrong in its best guess), since in the field as a whole, one can publish if and only if the word error rate is lower than in the last paper to be published. (This is why most people developing speech recognizers don't seem to have immediate answers to the question of how accurate the timings are.) Developing increasingly accurate recognizers takes effort, and once a group has put the effort in, they don't usually want to give their recognizer away. So if you want to used forced alignment, you have the following options:

Persuade a speech group to help you. Lending the speech recognizer for your purposes doesn't harm commercial prospects or their research prospects in any way, but they might never have thought about that. This does require contact with a group that is either charitable or knows the benefits of negotation. Since speech groups are always wanting more data and since hand-transcription is expensive, one reasonable deal is that if they provide you with the timings for your research, they can use your data to improve their recognizer. This only works if your recordings are of high enough quality for their purposes and speech groups may have specific technical constraints. For instance, speech recognizers work better on data that is recorded using the same kind of microphones as the data the recognizer was trained on. This means that the best time to broker a deal is before you start recording. The easiest arrangement is usually for them to bung your data through the recognizer at their site and pass you the results rather than for you to install the recognizer.
Build your own recognizer. One of the interesting things about forced alignment is that you don't actually need a good recognizer - you just need one that can get the correct words somewhere in the lattice of word hypotheses. Knowing the correct words also makes it possible to make it much more likely that the correct hypothesis will be in the lattice somewhere, since you can make sure that none of the words are outside of the speech recognizer's vocabulary. A quick poll of speech researchers results in the estimate that constructing a speech recognizer that works OK but won't win any awards using HTK takes 1-3 person-months. More time lowers the word error rate but isn't likely to affect the timing accuracy. The researchers involved found it difficult to think about how bad the recognizer could be and still work for these purposes, so they weren't sure whether spending less time was a possibility. It does take someone with a computational background to build a recognizer, although they didn't feel it took any particular skill or speech background to build a bad one.
Find a speech recognizer floating around somewhere that's free and will work. There must be a project student somewhere who has put together a recognizer using HTK that is good enough for these purposes.

Finally, here are the steps in producing a forced alignment:

Produce high quality speech recordings. You must have one microphone per participant, and they must be close-talking microphones (i.e., tabletop PZMs will not do - you need lapel or head-mounted microphones). If you are recording conversational speech (i.e., dialogue or small groups), it's essential that the signal on each participant's microphone be stronger when they're speaking than when other people are. Each participant must be recorded onto a separate channel.
Optionally, find the areas of speech on each audio signal automatically. The energy on the signal will be higher when the person is speaking; you need to figure out some threshhold above which the person is speaking and write a script to mark those. This is often done in MATLAB.
Hand-transcribe, either scanning each entire signal looking for speech or limiting yourself to the areas found to the automatic process. Turns (or utterances, depending on your terminology) don't have to be timestamped accurately, but can include extra silent space before or after them that will be corrected by the forced alignment. However, it's important that the padding not include cross-talk from another person that could confuse the recognizer.
Optionally, add to the speech recognizer's dictionary all of the words in the hand-transcription that aren't in it already. (This is so that it can make a guess at what speech matches them even though it has never encountered the words before, rather than treating them as out-of-vocabulary, which means applying some kind of more general "garbage" model.)
Run the speech recognizer in forced alignment mode and then a script to add the timings to the hand transcription.

Time-stamped coding

Although waveforms are necessary for timestamping speech events accurately, many other kinds of coding (gestures, posture, etc.) don't really require anything that isn't available in the current version of NXT, except possibly the ability to advance a video frame by frame. People are starting to use NXT to do this kind of coding, and we expect to release some sample tools of this style plus a configurable video labelling tool fairly soon. However, there are many other ways of getting time-stamped coding; some of the video tools we encounter most often are The Observer, EventEditor, Anvil, and TASX. EMU is audio-only but contains extra features (such as format and pitch tracking) that are useful for speech research.

Time-stamped codings are so simple in format (even if they allow hierarchical decomposition of codes in "layers") that it doesn't really matter how they are stored for our purposes - all of them are easy to up-translate into NXT. In our experience it takes a programmer .5-1 day to set up scripts for the translation, assuming she understands the input and output formats.

Last modified 04/13/06

NITE XML Toolkit - Collecting Data for NXT Use