Plague.TXT

Plague.TXT: Extracting Epidemiological Data from Historic Outbreak Reports

The project designs and develops pathways for extracting and digitally mapping epidemiological data from historical reports. This interdisciplinary pilot study brings together historians, computational linguists and computer scientists at the University of Edinburgh to analyse over 100 outbreak reports of the Third Plague Pandemic (1894 – 1950). The goal is to develop structured accounts of the most significant concepts that were used to understand the epidemiology of plague around the globe. Using text mining and natural language processing, we seek to build a model approach to historical texts concerning epidemic outbreaks, contributing both to historical scholarship and to epidemiological research.

Report on the bubonic plague outbreak in Hong Kong, published in 1895 and available open access on the Internet Archive.

We are adapting the Edinburgh Geoparser to process these historical texts and extract information relevant for further analysis. This includes, among other things, location names linked to GeoNames, normalised dates, person names, geographic features and a lexicon of plague-related vocabulary that pairs variants containing OCR errors with their corrected forms (e.g. "bnbonie diseharge" corrected to "bubonic discharge"). The geoparser output is manually corrected, and further information about the structure and content of each report is added as an extra layer of annotation. This will enable analysis specific to individual document zones.
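As a rough illustration of how a lexicon of OCR variants and corrected forms might be applied to report text, here is a minimal Python sketch. It is not the Edinburgh Geoparser's implementation: the dictionary layout, the name PLAGUE_LEXICON and the function correct_ocr are assumptions made for this example, and the only entries shown are the two variants quoted above.

```python
import re

# Hypothetical lexicon layout (an assumption for this sketch):
# lower-cased OCR variant -> corrected form.
PLAGUE_LEXICON = {
    "bnbonie": "bubonic",
    "diseharge": "discharge",
}

# Match runs of ASCII letters; everything else is left untouched.
_TOKEN = re.compile(r"[A-Za-z]+")


def correct_ocr(text, lexicon=PLAGUE_LEXICON):
    """Replace known OCR variants with their corrected forms.

    Lookup is case-insensitive; an initial capital is preserved.
    Words not in the lexicon are returned unchanged.
    """
    def fix(match):
        word = match.group(0)
        corrected = lexicon.get(word.lower())
        if corrected is None:
            return word
        return corrected.capitalize() if word[0].isupper() else corrected

    return _TOKEN.sub(fix, text)


print(correct_ocr("bnbonie diseharge"))
```

A plain dictionary lookup only catches variants that have already been recorded; handling unseen errors would need fuzzy matching or a language model, which this sketch does not attempt.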

The pilot also focusses on improving the optical character recognition (OCR) output that is currently available on the Internet Archive. A variety of computer vision and machine learning techniques are being applied to enhance the quality of the OCR output, including automated cropping of text areas, page de-warping and an LSTM-based OCR engine. This builds on work carried out at the University of Edinburgh Library on digitising and text mining the Scottish Court of Session papers.
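To give a sense of what automated cropping of text areas involves, the following Python sketch crops a binarised page to the bounding box of its ink pixels. This is a deliberately naive heuristic chosen for illustration, not the technique used in the project; the function name crop_text_area, the list-of-rows page representation (1 = ink, 0 = background) and the margin parameter are all assumptions.

```python
def crop_text_area(page, margin=0):
    """Crop a binarised page to the bounding box of its ink pixels.

    page   -- list of equal-length rows of 0/1 values (1 = ink); hypothetical format
    margin -- number of background pixels to keep around the box
    Returns the cropped rows, or an empty list if the page has no ink.
    """
    # Indices of rows that contain at least one ink pixel.
    rows = [r for r, row in enumerate(page) if any(row)]
    if not rows:
        return []

    width = len(page[0])
    # Indices of columns that contain at least one ink pixel.
    cols = [c for c in range(width) if any(row[c] for row in page)]

    # Expand the box by the margin, clamped to the page edges.
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + margin, len(page) - 1)
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + margin, width - 1)

    return [row[left:right + 1] for row in page[top:bottom + 1]]


page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(crop_text_area(page))
```

A real scan would also contain noise, stains and marginalia, so a bounding box over all dark pixels would rarely be sufficient on its own.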

Collaborators

Project leads: Dr. Lukas Engelmann and Dr. Beatrice Alex

Contributors: Arlene Casey, Richard Tobin and Mike Bennett

Funding

This work is funded by the Challenge Investment Fund 2018/19 at the College of Arts, Humanities and Social Sciences at the University of Edinburgh.