Informatics @ Edinburgh Language Technology Group

SUM - Flexible Summaries


Project Description

Introduction

SUM is an EPSRC research project of the Language Technology Group, based in the Institute for Communicating and Collaborative Systems of Edinburgh's School of Informatics. The area of investigation of the SUM project is automatic summarisation. The project aims to use summarisation to help address the information overload problem in the legal domain. Legal cases are an important part of legal discourse and automatic summarisation offers a route to providing important information in a format that is more accessible and understandable to a range of users, from students and other legal novices to solicitors and judges. Currently, selected judgments are manually summarised by legal experts. While an ultimate goal of legal summarisation would be to provide clear, non-technical summaries of legal judgments, an automatic system using current technology would already enable immediate access to preliminary summaries, and serve as an assisting technology in manual summarisation. Automatic summaries might also be incorporated to provide dynamic, customised content in information retrieval systems.

The project was designed to build on prior research within the Language Technology Group by Simone Teufel and Marc Moens (T&M). It has been argued that most practically oriented work on automated summarisation can be described as based on either text extraction or fact extraction. In these terms, the T&M approach can be characterised as augmented text extraction: the system creates summaries by combining extracted sentences, but the sentences in the source texts are first categorised to reflect their role in the rhetorical or argumentative structure of the document. This rhetorical role information is used to guide the creation of the summaries and to permit several summaries to be created per document, each tailored to meet the needs of a different class of user, hence the use of the term 'flexible' in the project's title.
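The idea of augmented text extraction can be illustrated with a minimal sketch. This is not the T&M or SUM implementation: the role labels, user profiles, scores, and the `summarise` function are all hypothetical names chosen for illustration. It assumes each sentence already carries a rhetorical role and an extraction score, and shows how the same document could yield different summaries for different classes of user.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    role: str     # rhetorical role label (hypothetical label set)
    score: float  # how extract-worthy the sentence is (assumed to be given)


# Hypothetical mapping from user class to the roles that user cares about.
PROFILES = {
    "student": {"FACT", "BACKGROUND"},
    "solicitor": {"FRAMING", "DISPOSAL"},
}


def summarise(sentences, profile, max_sentences=2):
    """Pick the top-scoring sentences whose role suits the given profile,
    then return them in their original document order."""
    wanted = PROFILES[profile]
    candidates = [(i, s) for i, s in enumerate(sentences) if s.role in wanted]
    top = sorted(candidates, key=lambda p: p[1].score, reverse=True)[:max_sentences]
    return [s.text for _, s in sorted(top, key=lambda p: p[0])]


document = [
    Sentence("The appellant was injured at work.", "FACT", 0.9),
    Sentence("The statute was enacted in 1974.", "BACKGROUND", 0.4),
    Sentence("The question is whether the duty was breached.", "FRAMING", 0.8),
    Sentence("I would dismiss the appeal.", "DISPOSAL", 0.95),
]

student_summary = summarise(document, "student")
solicitor_summary = summarise(document, "solicitor")
```

Keeping role assignment separate from selection is what makes the summaries flexible in this sketch: supporting a new class of user only requires a new entry in the profile table.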

Results

The work of T&M was carried out on the domain of scientific articles, and a corpus of such articles was collected and annotated for the purpose. The SUM project aimed to bring the technology to a new, more challenging domain, that of legal texts. This has involved identifying a suitable source of texts, then collecting and annotating them. The resulting corpus, the HOLJ corpus, is available for use by other researchers.

The work of Teufel & Moens provided an initial technological framework for the SUM project, and the overall structure of their system architecture has served the project well. The systems differ in detail, however. The key innovations have been the use of automatic linguistic processing to replace hand-selection of cue phrases for the classifiers, and the exploration of machine learning techniques, culminating in an adaptation of a maximum entropy classifier for the core rhetorical role and sentence selection classifiers. The evaluation results of the T&M work provide a point of comparison against which the achievements of the SUM project can be evaluated. For further detail please refer to the publications page.
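For readers unfamiliar with maximum entropy classification, the following is a minimal sketch of the general technique (a multinomial logistic model trained by stochastic gradient ascent on the log-likelihood). It is not the project's classifier: the bag-of-words features, the role labels, the toy training data, and the function names are assumptions made for illustration, whereas the SUM system derived its features from automatic linguistic processing.

```python
import math
from collections import defaultdict


def features(sentence):
    # Toy feature set (assumption): one binary indicator per lower-cased word.
    return {f"word={w}" for w in sentence.lower().split()}


def predict_proba(weights, labels, feats):
    """Softmax over the summed feature weights for each label."""
    scores = {y: sum(weights[(f, y)] for f in feats) for y in labels}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def train(data, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    labels = sorted({y for _, y in data})
    weights = defaultdict(float)
    for _ in range(epochs):
        for sentence, gold in data:
            feats = features(sentence)
            probs = predict_proba(weights, labels, feats)
            for y in labels:
                # Gradient: observed feature count minus expected count.
                grad = (1.0 if y == gold else 0.0) - probs[y]
                for f in feats:
                    weights[(f, y)] += lr * grad
    return weights, labels


# Hypothetical training sentences with hypothetical rhetorical role labels.
toy_data = [
    ("the appeal is dismissed", "DISPOSAL"),
    ("I would dismiss the appeal", "DISPOSAL"),
    ("the appellant was injured at work", "FACT"),
    ("the respondent employed the appellant", "FACT"),
]

weights, labels = train(toy_data)
probs = predict_proba(weights, labels, features("the appeal is dismissed"))
best_label = max(probs, key=probs.get)
```

Because the model outputs a probability for every label, the same kind of classifier can in principle serve both rhetorical role labelling and sentence selection, with the probabilities used to rank candidate sentences.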

Current and Future Work

We are currently designing an extrinsic user evaluation which we will carry out in collaboration with Burkhard Schafer of the Joseph Bell Centre for Forensic Statistics & Legal Reasoning within the University's School of Law. A hypothetical case will be presented to subjects, with a number of possible precedent-setting cases. The possible precedents will be presented in various formats, including: our system summaries (tailored to different types of reader and visualised in various ways); the original full text; and the gold standard summaries. Levels of agreement between subjects, and between subjects' and experts' classifications of cases, will allow us to quantify the utility of our system for a group of real users.
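The text does not say which agreement measure will be used. As one common option, the sketch below computes Cohen's kappa, which corrects raw agreement between two raters for the agreement expected by chance. The function name, the labels, and the example judgements are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical judgements."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label distribution.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgements: is each candidate case a relevant precedent?
subject = ["relevant", "relevant", "not", "not"]
expert = ["relevant", "not", "not", "not"]
kappa = cohen_kappa(subject, expert)
```

Comparing kappa values across presentation formats (system summary, full text, gold standard summary) would give one way to quantify how well each format supports the subjects' judgements.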

We plan to continue to exploit the corpus, which will be of particular use to Ben Hachey in his PhD research. We intend to use the corpus in collaborations with Simone Teufel to compare systems across two corpora and with Mirella Lapata to test her probabilistic approach to discourse structuring. We also plan to experiment with methods for automatic text alignment and to improve the named entity component through machine learning and active learning.

Acknowledgments

The project ran between 1 January 2001 and 31 September 2004 and was funded by EPSRC GR/N35311. Staff include Jon Oberlander, Claire Grover, Ben Hachey, Chris Korycinski, Ian Hughson, Beatrice Alex, and Shipra Dingare.

The corpus annotation was carried out by Vasilis Karaiskos and Hui-Mei Liao, with assistance from Jonathan Kilgour in using the NITE XML Toolkit (NXT). We are also grateful to Steve Clark for discussions on sequence modelling.

 
Benjamin Hachey
Last modified: Mon Aug 21 12:09:37 BST 2006