Plague.TXT: Extracting Epidemiological Data from Historic Outbreak Reports
The project designs and develops pathways for extracting and digitally mapping epidemiological data from historical reports. This interdisciplinary pilot study brings together historians, computational linguists and computer scientists at the University of Edinburgh to analyse over 100 outbreak reports of the Third Plague Pandemic (1894 – 1950). The goal is to develop structured accounts of the most significant concepts that were used to understand the epidemiology of plague around the globe. Using text mining and natural language processing, we seek to build a model approach to historical texts concerning epidemic outbreaks, contributing both to historical scholarship and to epidemiological research.
We are adapting the Edinburgh Geoparser to process such historical text and extract information relevant for further analysis. This includes, among other things, location names linked to Geonames, normalised dates, person names, geographic features and a lexicon of plague-related vocabulary that pairs variants containing OCR errors with their corrected forms (e.g. bubonic discharge for bnbonie diseharge). The geoparser output is manually corrected, and further information about the structure and content of each report is added as an extra layer of annotation. This will enable document-zone-specific analysis.
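The lexicon of OCR variants and corrected forms described above can be pictured with a minimal Python sketch. It is an illustration only, not the Edinburgh Geoparser's implementation: the names OCR_LEXICON and correct_ocr are hypothetical, and the two lexicon entries are taken from the example in the text.

```python
import re

# Hypothetical lexicon mapping OCR variants to corrected forms.
# The two entries come from the example in the text above.
OCR_LEXICON = {
    "bnbonie": "bubonic",
    "diseharge": "discharge",
}

# Tokens are plain alphabetic runs; a real pipeline would tokenise more carefully.
TOKEN = re.compile(r"[A-Za-z]+")


def correct_ocr(text, lexicon=OCR_LEXICON):
    """Replace known OCR variants with their corrected forms.

    Lookup is case-insensitive; an initial capital in the source word
    is carried over to the corrected form.
    """
    def fix(match):
        word = match.group(0)
        corrected = lexicon.get(word.lower())
        if corrected is None:
            return word  # unknown word: leave untouched
        if word[0].isupper():
            return corrected.capitalize()
        return corrected

    return TOKEN.sub(fix, text)


if __name__ == "__main__":
    print(correct_ocr("Bnbonie diseharge was noted."))
```

A plain lookup like this only repairs variants that are already listed in the lexicon, so it complements manual correction of the output and does not replace it.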
The pilot also focuses on improving the optical character recognition (OCR) that is currently made available on the Internet Archive. A variety of computer vision and machine learning techniques are being applied to enhance the output quality of the OCR, including automated cropping of text areas, page de-warping and an LSTM-based OCR engine. This builds on work carried out at the University of Edinburgh Library on digitising and text mining the Scottish Court of Session papers.
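As a rough illustration of what automated cropping of text areas can involve, the sketch below crops a binarised page to the bounding box of its inked rows and columns. It is an assumption-laden toy and not the project's method: the page is modelled as a list of rows of 0/1 values (1 for a dark pixel), and the function names and the min_ink threshold are hypothetical.

```python
def text_bounding_box(page, min_ink=1):
    """Return (top, bottom, left, right), inclusive, of the text area.

    A row or column counts as text if it holds at least `min_ink`
    dark pixels. Returns None if no row or column qualifies.
    """
    row_ink = [sum(row) for row in page]
    col_ink = [sum(col) for col in zip(*page)]
    rows = [i for i, n in enumerate(row_ink) if n >= min_ink]
    cols = [j for j, n in enumerate(col_ink) if n >= min_ink]
    if not rows or not cols:
        return None
    return rows[0], rows[-1], cols[0], cols[-1]


def crop_to_text(page, min_ink=1):
    """Crop the page to its text bounding box (empty list if no text)."""
    box = text_bounding_box(page, min_ink)
    if box is None:
        return []
    top, bottom, left, right = box
    return [row[left:right + 1] for row in page[top:bottom + 1]]


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
    ]
    print(text_bounding_box(page))
    print(crop_to_text(page))
```

Raising min_ink makes the crop ignore rows and columns that contain only isolated specks, at the cost of possibly trimming sparse lines of real text.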
Contributors: Arlene Casey, Mike Bennett, Iona Walker, Richard Tobin and Claire Grover
This work is funded by the Challenge Investment Fund 2018/19 at the College of Arts, Humanities and Social Sciences, by the School of Literatures, Languages and Cultures and by the School of Social and Political Science, all at the University of Edinburgh.
Arlene Casey, Mike Bennett, Richard Tobin, Claire Grover, Iona Walker, Lukas Engelmann and Beatrice Alex (2020). Plague Dot Text: Text Mining and Annotation of Outbreak Reports of the Third Plague Pandemic (1894-1952), accepted for publication in the Journal of Data Mining and Digital Humanities, 2020. [arxiv]
Arlene Casey, Mike Bennett, Richard Tobin, Claire Grover, Lukas Engelmann and Beatrice Alex (2019). Plague Dot Text: Text mining and annotation of outbreak reports of the Third Plague Pandemic (1894-1952), In Proceedings of HistoInformatics 2019 at the 23rd International Conference on Theory and Practice of Digital Libraries (TPDL 2019), CEUR Vol-2461, Oslo, Norway, 2019. [pdf]
Beatrice Alex and Mike Bennett (2020). Does Digitised Historical Text have to be mediOCRe? Optical character recognition and text mining of historical documents. Seminar at the National Library of Scotland, 29th of January 2020. [slides]