Being the results of an e-mail and net survey carried out in May of 1995.
Note that I have NOT included here the bulk of the (American English for the most part) LDC backlist, q.v.
===========================================================
First hand information (i.e. from the producer of the data)
===========================================================
Map Task (HCRC/LDC)
http://www.cogsci.ed.ac.uk/elsnet/resources.html
ftp://ftp.cis.upenn.edu/pub/ldc_www/hpage.html
The HCRC Map Task Corpus is a set of 8 CD-ROMs containing linked
sampled audio and transcriptions of a total of about 18 hours of
spontaneous speech that was recorded from 128 two-person
conversations according to a detailed experimental design.
Corpus of Spoken American English (Dept of Linguistics, UC Santa Barbara):
John DuBois <dubois@humanitas.ucsb.edu>
We hope to have the first CD-ROM (with 22.05 kHz 16-bit stereo .WAV
audio and transcription in Windows for PCs) sometime this summer.
It will contain just 10 transcripts averaging 25 minutes each, or
about 5% of the eventual one million words of material.
The Groningen Speech Corpus (SPEX)
http://www.cogsci.ed.ac.uk/elsnet/resources.html
The Groningen Speech Corpus was collected by A.M. Sulter, MD and
Prof. H.K. Schutte as part of a research project funded by NWO
(Netherlands Organization for Scientific Research). The 4 CD-ROMs
contain over 20 hours of speech. It is a corpus of read speech
material in Dutch, recorded on PCM tape under fairly good
conditions. 238 speakers reading texts, sentences, words, numbers
and 3 vowels. Price: 750 ECU (academic use), 3000 ECU (industrial use).
Dutch Polyphone (SPEX):
spex@spex.nl
5000 speakers reading 50 items (digits, phonetically rich
sentences), transliterated.
Speechstyles (SPEX)
spex@spex.nl
129 speakers, spontaneous speech (monologues), semi-spontaneous
speech (picture descriptions), reading. All transliterated, and
provided with NIST Sphere Headers. Price: about 750 ECU
(academic), 3000 industrial.
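For readers unfamiliar with NIST SPHERE headers, the following is a minimal sketch of how such a header can be read. It assumes the usual SPHERE layout: an ASCII header beginning with the line "NIST_1A", followed by the header size in bytes, then "name -type value" lines terminated by "end_head". The function name and the field values shown in the usage note are illustrative only and are not taken from the Speechstyles corpus.

```python
def parse_sphere_header(data: bytes):
    """Parse the ASCII header at the start of a NIST SPHERE file.

    Returns (header_size_in_bytes, fields), where fields maps each
    header name to an int, float or str depending on its type flag.
    """
    first_lines = data.split(b"\n", 2)
    if len(first_lines) < 3 or first_lines[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")

    # The second line holds the total header size (commonly 1024).
    header_size = int(first_lines[1].decode("ascii").strip())
    text = data[:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # malformed line; ignored in this sketch
        name, type_flag, value = parts
        if type_flag == "-i":
            fields[name] = int(value)
        elif type_flag == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value  # "-sN": string of N characters
    return header_size, fields
```

The sample data begins at byte offset header_size. A header containing the line "sample_rate -i 16000" would yield fields["sample_rate"] == 16000; the actual fields present in any given corpus should be checked against its documentation.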
Dutch Read Text corpus (SPEX):
spex@spex.nl
One speaker reading 45 texts (some of them also at fast speech
rate). 6 texts are segmented and labelled at the phoneme level.
Price: 200 ECU (academic), 800 industrial.
DIRECT (Sao Paulo and Liverpool):
HELOISA COLLINS <hcollins@bra000.canal-vip.onsp.br>
Development of Research in English for Commerce and Technology, a
binational project going on in the Catholic University of Sao
Paulo, Brazil, and the University of Liverpool in England (check
ftp.liv.ac.uk for the working papers produced so far), has some
spoken data that might be of interest. As a member of the research
team, I've done some work on public presentations (non-academic)
and am now doing analysis of meetings. A PhD student working under
my supervision is working on job interviews and another one is
currently analysing conducted tours. This material is not publicly
available yet, but we could consider making part of it available on
an exchange basis. Languages are English (as native, second and
foreign language) and Brazilian Portuguese. I haven't got details
about number of words right now (we work on the basis of the texts
of complete communicative events), but this might give you a rough
idea:
- 4 presentations in English (transcribed)
- 4 or 5 presentations in Portuguese (not transcribed)
- 2 presentations in English (not transcribed)
- 3 meetings in English and 2 in Portuguese (transcribed)
- 10 conducted tours (transcribed)
- 10 job interviews (transcriptions almost done)
We have more stuff, transcribed, that has been collected by other
members of the project.
In principle, as I said before, there would be no cost involved,
since we are more interested in enlarging the corpus and would,
therefore, prefer to exchange data. We would like texts of complete
events in the area of general business. In fact, we may be
interested in anything which is not strictly academic.
MARSEC (Univ. Leeds & Reading):
http://midwich.reading.ac.uk/research/speechlab/marsec/marsec.html
The MAchine Readable Spoken English Corpus.
A small section of the corpus is available for anonymous ftp from
The Speech Laboratory at Leeds University. lethe.leeds.ac.uk:/pub/marsec
EUROM0 (University College London):
M.Huckvale@ucl.ac.uk
EUROM0 is a CD-ROM containing spoken recordings of digits,
sentences and passages by 4 speakers in each of 5 European
languages.
Japanese (ATR):
sho@ctr.atr.co.jp
[Name] ATR Speech Databases for Research
[Language] Japanese
[Description]
Magnetic Tapes and/or CD-ROM.
20kHz (partially 12kHz) sampling, 16bit digitized.
[Contents]
Set A: 8,500 Words Speech Database
20 speakers (10 males and 10 females)
Set B: Phoneme-Balanced 503 Sentences Speech Database
10 speakers (6 males and 4 females)
Set C: Large Size of Speakers Speech Database
Set D: Text Speech Database
2 speakers (1 male and 1 female)
12 stories (about 400 sentences)
Set E: English Speech Database
4 speakers (2 males and 2 females), about 5,000 words
Set F: Sentence Speech Database
6 speakers (3 males and 3 females), about 1,100 sentences
[Costs] Please contact the distribution coordinator.
[Distribution Coordinator]
Mr. Shohei TAHARA
Research Engineering Department
ATR (Advanced Telecommunications Research Institute) International
2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
Telephone: +81 774 95 1192
Facsimile: +81 774 95 1179
E-Mail: sho@ctr.atr.co.jp
German (Vienna University of Technology):
Gernot Kubin <kubin@sampo.nt.tuwien.ac.at>
We have recorded a data base of sustained, continuant speech sounds.
Language: German, 25 continuants. Number of speakers: 60+,
male/female, mixed age. Recording conditions: DAT (48 kHz, 16
bit), anechoic chamber. File format: raw 16 bit integer, mono.
Each file corresponds to an individual continuant sustained over
approx. 1 second.
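A headerless file of this kind can be read with a few lines of code. The sketch below assumes one signed 16-bit sample per frame (mono) at 48 kHz, as stated above; the byte order is not given in the entry, so little-endian is an assumption exposed as a parameter. The function names are illustrative.

```python
import array
import sys

SAMPLE_RATE_HZ = 48_000  # DAT sampling rate stated in the entry


def read_raw_pcm16(path, little_endian=True):
    """Read a headerless mono file of signed 16-bit integer samples.

    The byte order is NOT stated in the corpus description;
    little_endian=True is only an assumption.
    """
    samples = array.array("h")
    if samples.itemsize != 2:
        raise RuntimeError("platform 'short' is not 16 bits wide")
    with open(path, "rb") as f:
        data = f.read()
    if len(data) % 2:
        raise ValueError("file length is not a multiple of 2 bytes")
    samples.frombytes(data)
    # Swap bytes if the file's byte order differs from the machine's.
    if (sys.byteorder == "little") != little_endian:
        samples.byteswap()
    return samples


def duration_seconds(samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration of a mono sample sequence in seconds."""
    return len(samples) / sample_rate_hz
```

Under these assumptions a continuant sustained for about one second occupies roughly 48,000 samples, i.e. about 96,000 bytes per file.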
Spanish (Univ. Politecnica de Madrid):
luis@gaps.ssr.upm.es (Luis Hernandez Gomez)
Corpus:
- Spanish Headlines from AT&T Bell Lab.
- 650 sentences
- Orthographic and phonetic transcription
Speakers:
- 25 male speakers. 200 sentences/speaker
- 25 female speakers. 200 sentences/speaker
Recording conditions:
- Recording studio
- 16 bits, 16 kHz
Northern Ireland Transcribed Corpus of Speech (Queen's University Belfast):
J.M.Kirk@qub.ac.uk
Language: English. Transcription: orthographic.
105 interviews with people from 38 localities in Northern Ireland.
3 age groups: children, middle-aged and elderly.
Word tokens: c. 250,000. Recordings made in the late 70s and early 80s.
Contents: for elderly and middle-aged speakers, interviews about
changes in the pattern of life, with many recollections,
reminiscences and anecdotes. Lots of questions by the
fieldworker/interviewer. Good gender and ethnic balance, too.
Fits on three HD 3.5" floppy disks.
==========================
No audio, transcripts only
==========================
ECI MCI (HCRC/LDC)
http://www.cogsci.ed.ac.uk/elsnet/resources.html
ftp://ftp.cis.upenn.edu/pub/ldc_www/hpage.html
ECI has produced Multilingual Corpus I (ECI/MCI) of over 98 million
words, covering most of the major European languages, as well as
Turkish, Japanese, Russian, Chinese, Malay and more. The primary
focus in this effort is on textual material of all kinds, including
transcriptions of spoken material. The ECI/MCI is now available at
a price of GBP 23.50 (including GBP 3.50 VAT) for countries within
the European Union, and GBP 20 for countries outside the European
Union.
Japanese (ATR):
sho@ctr.atr.co.jp
[Name] ATR Dialogue Text Databases for Research
[Language] Japanese and English
[Description]
ATR corpus contains conversations between Japanese speakers through
telephone and/or keyboard communications. All conversations are
transcribed. Morphological and syntactical tags are given.
Corresponding English is given. About half a million words are
available.
[Contents]
Set 1: Telephone Conversation, Conference Registration Task
Set 2: Keyboard Conversation, Conference Registration Task
Set 3: Telephone Conversation, Travel Arrangement Task
Set 4: Keyboard Conversation, Travel Arrangement Task
[Costs] Please contact the distribution coordinator.
[Distribution Coordinator]
Mr. Shohei TAHARA, as above
=========================
From here on: forthcoming
=========================
EUROM-1 (University College London):
M.Huckvale@ucl.ac.uk
"CDs are currently in production"
Dutch Eurom1
spex@spex.nl
HIFI recordings of 64 speakers, reading passages, sentences and numbers.
Price: not yet known.
BREF (LIMSI/CNRS, Paris):
lamel@limsi.fr
"soon. we were about to make an announcement and then
had another administrative problem.
i hope that this will be resolved very soon - we thought
that all was in place..." -- Lori Lamel, 4/12/95
TRAINS (University of Rochester):
Peter Heeman <heeman@cs.rochester.edu>
The TRAINS spoken dialogue corpus should be available soon from the
LDC. It is a corpus of spoken English in a task-oriented setting.
There are about six and a half hours of dialogue, comprising about
55,000 words spoken. I am not sure about the cost.
ShATR (Univ. Sheffield & ATR):
B.Karlsen@dcs.shef.ac.uk
http://www.dcs.shef.ac.uk/research/groups/spandh/ShATR.html
At the moment we are finishing the last transcriptions of a very special
corpus here at Sheffield. The corpus is called ShATR
(Sheffield-ATR) and it contains a set of high quality recordings of
multiple speakers speaking simultaneously. There are 4 British
English speakers and 1 American English speaker. The corpus
contains 8 channels: one for each speaker (head mounted mic), an
omnidirectional mic, and the left and right channels of an acoustical
mannikin with artificial ears. The data are in NeXT/Sun sound file
format, and there are almost 37 min. of speech at 48 kHz sampling rate
(16 bit linear) for each channel. Only part of the corpus will be
made available via ftp; the entire corpus will be available for
purchase from the LDC (Linguistic Data Consortium, US) on CD-ROMs when
the corpus is finished. Price is still unknown.
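As a rough illustration of the NeXT/Sun sound file format mentioned above, the sketch below reads the fixed 24-byte header of such a file: six big-endian 32-bit fields (magic ".snd", data offset, data size, encoding, sample rate, channel count), where encoding 3 denotes 16-bit linear PCM. The function names are illustrative, and whether the ShATR files store one channel per file is an assumption, not something stated in the entry.

```python
import struct

AU_MAGIC = 0x2E736E64          # the ASCII bytes ".snd"
AU_ENCODING_LINEAR_16 = 3      # 16-bit linear PCM
AU_UNKNOWN_SIZE = 0xFFFFFFFF   # data size not recorded in the header


def parse_au_header(data: bytes) -> dict:
    """Parse the fixed 24-byte header of a NeXT/Sun sound file."""
    if len(data) < 24:
        raise ValueError("too short to hold a NeXT/Sun header")
    magic, offset, size, encoding, rate, channels = struct.unpack(
        ">6I", data[:24]
    )
    if magic != AU_MAGIC:
        raise ValueError("missing '.snd' magic number")
    return {
        "data_offset": offset,
        "data_size": size,
        "encoding": encoding,
        "sample_rate": rate,
        "channels": channels,
    }


def duration_seconds(header: dict) -> float:
    """Duration implied by the header, for 16-bit linear data only."""
    if header["encoding"] != AU_ENCODING_LINEAR_16:
        raise ValueError("only 16-bit linear PCM is handled here")
    if header["data_size"] == AU_UNKNOWN_SIZE:
        raise ValueError("header does not record the data size")
    bytes_per_frame = 2 * header["channels"]
    return header["data_size"] / (bytes_per_frame * header["sample_rate"])
```

At the stated parameters, 37 minutes of a single channel at 48 kHz and 16 bits comes to 213,120,000 bytes, so the eight channels together would need about 1.7 GB if stored uncompressed.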
BABEL (Reading Univ. and others):
http://midwich.rdg.ac.uk/
New European (Copernicus) project based in Reading, making a
SAM-style database of Bulgarian, Estonian, Hungarian, Polish and
Romanian. Some Bulgarian data already available.
=========================================================================
Second hand (i.e. someone says "I believe that [someone else] has [...]")
=========================================================================
University of Victoria Phonetic Database:
Sampled data files from 45 languages (including some Amerindian
ones I had never heard of), together with phonetic and orthographic
transcriptions and software for playing from CD-ROM using PC with
Soundblaster card. I have played with this, but don't yet own a
copy. Available for about $470 from Speech Technology Research
Ltd. in Victoria, fax. 604/477-2540
The Oxford Acoustic Database:
Produced by Brian Pickering and Burt Rosner, published by Oxford
University Press; cost somewhere around 100 pounds. I've lost the
details, but I think there are about 8 well-known languages on
it.
Fin-DSDB (Helsinki):
aiivonen@helsinki.fi
Finnish Digital Speech Database, including an editing and analysing
program QuickSig designed by Matti Karjalainen and Toomas Altosaar
(Helsinki Technical University); database designed in collaboration
with Department of Phonetics, University of Helsinki.
PHONDAT (Kiel University):
Now on sale from Klaus Kohler's Dept., text in 2 vols. of the Kiel
working papers (AIPUK 27/8). All German.
SCRIBE (various UK partners):
Now on sale at DRA Malvern. CD-ROMs and time-aligned
transcriptions. All English.
Spanish spoken material:
There is at least one corpus of spoken Spanish available at the
Universidad Autonoma de Madrid.
ftp://lola.lllf.uam.es/pub/corpus/
There are also some South American corpora there but they are
probably written texts.
I have tried on several occasions to download the description of
the oral corpus but have had nothing but problems, even though the
corpus itself is fine, so I cannot say much about the sources used.
The corpus takes up about 7 MB.
CHILDES:
brian+@andrew.cmu.edu (Brian MacWhinney)
The Child Language Data Exchange System reportedly has oral
child-adult conversational material.
ASJ (JIPDEC):
http://www.itl.atr.co.jp/cocosda/corpora/japanese
1. Corpus name: ASJ Continuous Speech Corpus for Research
2. Producer: Japan Information Processing Development Center (JIPDEC)
3. Contents: Vol. 1-3 : ATR 503 PB sentences (read speech)
64 speakers (30 males & 34 females)
9,600 sentences
Vol. 4-6 : Various guide task sentences (read speech)
36 speakers (18 males & 18 females)
12,474 sentences
Vol. 7 : Simulated dialogues with transcribed texts
37 speakers (29 males & 8 females)
37 dialogues
4. A/D condition: 16 kHz sampling rate, 16 bit quantization
5. Media: CD-ROM (ISO 9660)
6. Distribution condition: for non-commercial purposes
7. Price: Yen 3,090/vol + mailing cost
8. Note: Submission of license agreement form is required
9. Person in charge:
K. KATAOKA
AI and Fuzzy Promotion Center,
Japan Information Processing Development Center (JIPDEC)
3-5-8 Shibakoen, Minatoku, Tokyo 105, JAPAN
TEL. +81 3 3432 9390
FAX. +81 3 3431 4324
Note:
Only a few copies of volumes 1 to 3 of the ASJ corpus are available,
whereas a hundred or more copies are available for volumes 4 to 7.
Some volumes of CD-ROMs may be reproduced if enough requests are received.
JEIDA Noise Database:
http://www.itl.atr.co.jp/cocosda/corpora/japanese
2. Producer: Japan Electronic Industry Development Association
3. Reference: Mr. T. Kitamura, Sunrise Music Inc.
4. Content: Various environmental noise
5. Speakers, Repetition: 17 sorts of noise in 17 DAT cassettes
6. AD conversion condition: 48 kHz, 16 bits
7. Distribution media/way: 18 DAT cassettes, one of which is a
digest tape of 17 sorts of noise
8. Distribution condition: for non-commercial purposes
9. Others: Submission of license agreement form is required
Contact address of Mr. Kitamura:
4-7-6 Akasaka, Minato, Tokyo 107, Japan
Sunrise Music Co. Ltd.
Tel: +81 3 3585 6541
Fax: +81 3 3585 6748
Cost of dubbing for one set: Yen 72,000.- including tapes.
Contents are as follows.
1. Automobile cabin (Medium-size car)
2. Automobile cabin (Compact car)
3. Exhibition hall A (In a booth)
4. Exhibition hall B (In a passage)
5. Railway station (Near ticket vending machines / In a passage)
6. Telephone booth (Down town)
7. Factory (Machinery / Press)
8. Parcel classification works
9. Trunk road / Road crossing
10. Crowded street
11. New trunkline train
12. Ordinary train
13. Computer room A (Minicomputers)
14. Computer room B (Workstations)
15. Large air conditioner
16. Air conditioning fan coil / Ventilation duct
17. Elevator passage (Hospital / Department store)
18. Digest tape of Nos. 1 to 17
Non-native French (Univ. of London):
j.dewaele@french.bbk.ac.uk
Debates, formal and informal interviews with non-native speakers
(Dutch). Audio tape and transcriptions on diskettes.