Alex Lascarides (PI)
Claire Grover
Mirella Lapata
The DISP project aims to combine statistical and rule-based techniques to acquire a robust model of semantic ambiguity resolution from corpora. We are especially interested in semantic relationships that are linguistically implicit, such as the relationship between nouns in a complex nominal (e.g., that "liver infection" means infection that's located in the liver) and the interpretation of logical metonymy (e.g., that "enjoy the book" means enjoy reading the book).
We aim to provide a ranked set of plausible interpretations for a given construction, and to account for the differences in interpretation that result from different heads and modifiers in the construction. For example, our model should predict the different meanings of "hospital arrival" (where the relation is "at") vs. "patient arrival" (where the relation is "subject"); see Grover, Lapata and Lascarides (2002). It should also predict that "fast car" is much more likely to mean a car that drives fast than a car that is manufactured fast; see Lapata and Lascarides (2001), A Probabilistic Account of Logical Metonymy, submitted to Computational Linguistics.
Our main strategy is to identify lexical semantic information automatically by exploiting the consistent correspondences between surface syntactic cues and lexical meaning. For example, we predict the most likely meaning of "hospital arrival" by estimating, via data from a corpus, the relative likelihoods of "hospital" appearing in the various argument positions of the verb "arrive". Similarly, we estimate the meanings of "begin the book" via the relative frequencies of a verb occurring as a complement of "begin" and of that verb taking "book" as its object.
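The two estimates described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the counts are invented toy data rather than figures from any corpus, the function names `rank_relations` and `rank_metonymic_verbs` are hypothetical, and the product of two relative frequencies used for the metonymy case is a simplification, not necessarily the model the project used.

```python
from collections import Counter

# Toy (verb, argument position, noun) triples standing in for parser output.
# The counts are invented for illustration only.
TRIPLES = [
    ("arrive", "at", "hospital"),
    ("arrive", "at", "hospital"),
    ("arrive", "at", "hospital"),
    ("arrive", "subject", "hospital"),
    ("arrive", "subject", "patient"),
    ("arrive", "subject", "patient"),
    ("arrive", "at", "patient"),
]

# Toy counts of verbs seen as the complement of "begin",
# and of verbs seen with "book" as their object (also invented).
COMPLEMENT_OF_BEGIN = Counter({"read": 4, "write": 3, "eat": 3})
TAKES_BOOK_AS_OBJECT = Counter({"read": 6, "write": 3, "buy": 1})


def rank_relations(verb, noun, triples):
    """Rank the argument positions of `verb` by how often `noun` fills them."""
    counts = Counter(rel for v, rel, n in triples if v == verb and n == noun)
    total = sum(counts.values())
    # An unseen (verb, noun) pair yields an empty ranking.
    return [(rel, count / total) for rel, count in counts.most_common()]


def rank_metonymic_verbs(complement_counts, object_counts):
    """Rank candidate verbs by the product of two relative frequencies."""
    comp_total = sum(complement_counts.values())
    obj_total = sum(object_counts.values())
    scores = {
        verb: (complement_counts[verb] / comp_total)
        * (object_counts[verb] / obj_total)
        for verb in complement_counts
        if verb in object_counts  # the verb must be attested in both roles
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(rank_relations("arrive", "hospital", TRIPLES))
    print(rank_relations("arrive", "patient", TRIPLES))
    print(rank_metonymic_verbs(COMPLEMENT_OF_BEGIN, TAKES_BOOK_AS_OBJECT))
```

With these toy counts, "at" ranks first for "hospital arrival", "subject" ranks first for "patient arrival", and "read" outranks "write" for "begin the book". Verbs attested in only one of the two roles receive no score here, which is one reason sparse data calls for smoothing.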
We have explored a variety of ways of acquiring probabilistic information about these meaning correlates from a corpus. These include comparing a shallow chunk parser (Cass) with a deeper parser based on a probabilistic tag sequence grammar, and assessing the relative merits of external linguistic resources such as WordNet for smoothing over sparse data.
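As a rough illustration of smoothing over sparse data with a lexical resource, the sketch below backs off from an unseen noun to the average count of the nouns in its semantic class. It rests on stated assumptions: the `NOUN_CLASS` map is a hypothetical hand-written stand-in for a resource such as WordNet, the counts are invented, and this simple back-off is only one of many possible schemes, not necessarily the one the project evaluated.

```python
from collections import Counter

# Hypothetical noun-to-class map standing in for a lexical resource.
NOUN_CLASS = {
    "hospital": "institution",
    "clinic": "institution",
    "school": "institution",
}

# Invented (verb, argument position, noun) counts; "clinic" is unseen.
COUNTS = Counter({
    ("arrive", "at", "hospital"): 3,
    ("arrive", "at", "school"): 2,
    ("arrive", "subject", "school"): 1,
})


def smoothed_count(verb, rel, noun, counts, noun_class):
    """Return the observed count, or a class-based estimate if it is zero."""
    direct = counts[(verb, rel, noun)]
    if direct > 0:
        return float(direct)
    cls = noun_class.get(noun)
    if cls is None:
        return 0.0  # no class information available for this noun
    members = [n for n, c in noun_class.items() if c == cls]
    class_total = sum(counts[(verb, rel, n)] for n in members)
    # Spread the class's total count evenly over its members.
    return class_total / len(members)


if __name__ == "__main__":
    for rel in ("at", "subject"):
        print(rel, smoothed_count("arrive", rel, "clinic", COUNTS, NOUN_CLASS))
```

Although "clinic" never occurs with "arrive" in the toy data, the class-based estimate still favours the "at" relation over "subject", because that is the pattern shown by the other members of its class.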
We have demonstrated that these techniques can be used to acquire probabilistic models of semantic interpretation from both a balanced corpus (the British National Corpus) and a more specific corpus (namely, the Ohsumed corpus of medical journal abstracts). For the latter, we explored the utility of using domain-specific, external linguistic resources; in particular, the UMLS metathesaurus.
The Ohsumed corpus posed special challenges when preparing the text for parsing: many of the medical terms are unknown, and there were a number of idiomatic expressions, such as chemical terms and complex number expressions, which had to be packaged into units to improve performance. The software we used for data preparation is the LTG's publicly available LT TTT tools; a description of the techniques used can be found in Grover and Lascarides (2001): XML-Based Data Preparation for Robust Deep Parsing. See also Grover, Matheson, Mikheev and Moens (2000): LT TTT - A Flexible Tokenisation Tool.
The DISP project is funded by ESRC grant no. R000237772.