Introduction
SUM is an EPSRC research project of the
Language
Technology Group, based in the Institute
for Communicating and Collaborative
Systems of Edinburgh's School
of Informatics. The area of
investigation of the SUM project is
automatic summarisation. The project
aims to use summarisation to help
address the information overload problem
in the legal domain. Legal cases are an
important part of legal discourse and
automatic summarisation offers a route
to providing important information in a
format that is more accessible and
understandable to a range of users, from
students and other legal novices to
solicitors and judges. Currently,
selected judgments are manually
summarised by legal experts. While an
ultimate goal of legal summarisation
would be to provide clear, non-technical
summaries of legal judgments, an
automatic system using current
technology would already enable
immediate access to preliminary
summaries, and serve as an assisting
technology in manual
summarisation. Automatic summaries might
also be incorporated to provide dynamic,
customised content in information
retrieval systems.
The project was designed to build on
prior research within the Language
Technology Group by Simone
Teufel and Marc
Moens (T&M). It has been argued
that most practically oriented work on
automated summarisation can be described
as based on either text
extraction or fact
extraction. In these terms, the
T&M approach can be characterised as
augmented text extraction: the
system creates summaries by combining
extracted sentences, but the sentences
in the source texts are first
categorised to reflect their role in the
rhetorical or argumentative structure of
the document. This rhetorical role
information is used to guide the
creation of the summaries and to permit
several summaries to be created per
document where each is tailored to meet
the needs of a different class of
user, hence the use of the term
'flexible' in the project's title.
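The idea of augmented text extraction can be illustrated with a minimal sketch. It is not the project's implementation: the role labels, the user profiles, the relevance scores and the function name summarise are all hypothetical, chosen only to show how one set of role-tagged sentences can yield different summaries for different classes of user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    text: str
    role: str     # hypothetical rhetorical role label
    score: float  # hypothetical extract-worthiness score in [0, 1]


# Hypothetical mapping from user class to the roles that class cares about.
PROFILES = {
    "novice": {"FACT", "DISPOSAL"},
    "expert": {"PROCEEDINGS", "FRAMING", "DISPOSAL"},
}


def summarise(sentences, profile, max_sentences=2):
    """Pick the highest-scoring sentences whose role suits the profile,
    then return them in their original document order."""
    wanted = PROFILES[profile]
    candidates = [(i, s) for i, s in enumerate(sentences) if s.role in wanted]
    top = sorted(candidates, key=lambda p: p[1].score, reverse=True)[:max_sentences]
    return [s.text for _, s in sorted(top, key=lambda p: p[0])]


doc = [
    Sentence("The appellant was injured at work.", "FACT", 0.90),
    Sentence("The Court of Appeal dismissed the claim.", "PROCEEDINGS", 0.70),
    Sentence("The question is whether the duty was breached.", "FRAMING", 0.80),
    Sentence("I would allow the appeal.", "DISPOSAL", 0.95),
]

novice_summary = summarise(doc, "novice")
expert_summary = summarise(doc, "expert")
```

Here the same four sentences produce two different extracts: the novice profile keeps the facts and the outcome, while the expert profile keeps the framing of the legal question and the outcome.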
Results
The work of T&M was carried out on
the domain of scientific articles and a
corpus of such articles was collected
and annotated for the purpose. The SUM
project aimed to bring the technology to
a new, more challenging domain, that of
legal texts. This has involved the
identification of a suitable source of
texts and their collection and
annotation. The resulting corpus, the
HOLJ corpus, is available
for use by other researchers.
The work of Teufel & Moens provided
an initial technological framework for
the SUM project, and the overall
structure of their system architecture
has served the project well. In detail
the systems differ. The key innovations
have been the use of automatic
linguistic processing to replace
hand-selection of cue phrases for the
classifiers, and the exploration of
machine learning techniques, culminating
in an adaptation of a maximum entropy
classifier for the core rhetorical role
and sentence selection
classifiers. Evaluation results of the
T&M work represent a point of
comparison for the SUM project, against
which its achievements can be
evaluated. For further detail please
refer to the publications
page.
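For readers unfamiliar with maximum entropy classification, the following is a minimal sketch of the general technique (equivalent to multinomial logistic regression), trained by stochastic gradient ascent. It is not the project's classifier: the word-presence features, the role labels, the toy training sentences and all function names are assumptions made for illustration, and the real system derives its features from automatic linguistic processing.

```python
import math


def featurise(sentence):
    # Stand-in features (assumption): presence of each lowercased word, plus a bias.
    return {"w=" + w for w in sentence.lower().split()} | {"bias"}


def predict_proba(weights, feats):
    # Softmax over per-label linear scores, shifted by the max for stability.
    scores = {y: sum(w.get(f, 0.0) for f in feats) for y, w in weights.items()}
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def train_maxent(examples, epochs=200, lr=0.5):
    """examples: list of (feature_set, label). Returns per-label weight dicts."""
    labels = sorted({y for _, y in examples})
    weights = {y: {} for y in labels}
    for _ in range(epochs):
        for feats, gold in examples:
            probs = predict_proba(weights, feats)
            for y in labels:
                # Gradient of the log-likelihood: observed minus expected count.
                grad = (1.0 if y == gold else 0.0) - probs[y]
                for f in feats:
                    weights[y][f] = weights[y].get(f, 0.0) + lr * grad
    return weights


def classify(weights, sentence):
    probs = predict_proba(weights, featurise(sentence))
    return max(probs, key=probs.get)


# Toy training data with hypothetical role labels.
train = [
    ("the appellant was employed as a driver", "FACT"),
    ("the respondent was injured in the accident", "FACT"),
    ("i would dismiss the appeal", "DISPOSAL"),
    ("i would allow the appeal", "DISPOSAL"),
]
model = train_maxent([(featurise(s), y) for s, y in train])
```

Calling classify(model, "i would dismiss this appeal") assigns a role to an unseen sentence; predict_proba exposes the full distribution over roles, which is what makes this family of classifiers convenient when a downstream component needs confidence scores rather than hard labels.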
Current and Future Work
We are currently designing an extrinsic
user evaluation which we will carry out
in collaboration with Burkhard
Schafer of the Joseph
Bell Centre for Forensic Statistics
& Legal Reasoning within the
University's School
of Law. A hypothetical case will be
presented to subjects, with a number of
possible precedent-setting cases. The
possible precedents will be presented in
various formats, including: our system
summaries (tailored to different types
of reader and visualised in various
ways); the original full text; and the
gold standard summaries. Levels of
agreement between subjects, and between
subjects' and experts' classifications
of cases, will allow us to quantify the
utility of our system for a group of
real users.
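One common way to quantify agreement between two sets of classifications is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an assumption about how such a measurement could be made, not a description of the planned evaluation, which does not name a statistic; the label values and variable names are hypothetical.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Proportion of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label distribution.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgements on four candidate precedents.
subject = ["relevant", "relevant", "irrelevant", "irrelevant"]
expert = ["relevant", "irrelevant", "irrelevant", "irrelevant"]
kappa = cohen_kappa(subject, expert)
```

In this toy example the raters agree on three of four cases (0.75) against a chance agreement of 0.5, giving a kappa of 0.5. Comparing such scores across presentation formats (system summary, full text, gold standard summary) would be one way to express the utility of each format as a number.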
We are planning to continue to exploit
the corpus and it will be of particular
use to Ben Hachey in his PhD
research. We intend to use the corpus in
collaborations with Simone
Teufel to compare systems across two
corpora and with Mirella
Lapata to test her probabilistic
approach to discourse structuring. We
also plan to experiment with methods for
automatic text alignment and to improve
the named entity component through
machine learning and active
learning.
Acknowledgments
The project ran between 1 January 2001
and 31 September 2004 and was funded by
EPSRC GR/N35311. Staff include Jon
Oberlander, Claire
Grover, Ben
Hachey, Chris Korycinski, Ian
Hughson, Beatrice
Alex, and Shipra Dingare.
The corpus annotation was carried out by
Vasilis Karaiskos and Hui-Mei Liao, with
assistance from Jonathan Kilgour in
using the NITE XML Toolkit (NXT). We are
also grateful to Steve
Clark for discussions on sequence
modelling.