Geoparsing the Gazetteers of Scotland

The Ordnance Gazetteer of Scotland: A survey of Scottish Topography, Statistical, Biographical and Historical (1803-1901) is a collection of twenty volumes of the most popular descriptive historical gazetteers of Scotland in the 19th century. They are considered to be geographical dictionaries and include an alphabetical list of the principal places in Scotland, including towns, counties, castles, glens, antiquities and parishes. Each entry also includes detailed historical information about the place and a geographical description of it. Descriptive gazetteers such as these are written complements to maps, atlases and other cartographic works.

This dataset was recently made available by the National Library of Scotland (NLS) in the form of over 13,000 page images, the corresponding optically character recognised (OCRed) text in XML format, and metadata for each item in the collection. In total, the OCRed text (which has not been post-corrected) comprises almost 14.5 million words, and collectively these gazetteers provide a comprehensive geographical encyclopaedia of Scotland in the 19th century. While this is a valuable resource, geoparsing it manually would be too time-consuming.

This project will focus on devising automatic methods to geoparse the Gazetteers of Scotland. Previous related work on geoparsing The Survey of English Placenames involved the painstaking development of rule-based geoparsing methods [1,2] by using and adapting the Edinburgh Geoparser. We will use the Edinburgh Geoparser as the baseline and starting point for this project. The Edinburgh Geoparser’s resolution component was recently integrated with the defoe text analysis tool [3,4], a Spark-based library for running text mining queries across large datasets such as historical newspapers and datasets made available by NLS, including the Gazetteers of Scotland. We will compare different NLP pipelines and formally evaluate their geoparsing performance against a manually annotated gold standard.
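One common way to score a pipeline against a gold standard is exact-match precision, recall and F1 over place-name mentions. The sketch below is illustrative only: the Mention type, the score function and the sample offsets are hypothetical and are not taken from the project, the Edinburgh Geoparser or defoe. It also assumes that a predicted mention counts as correct only when its page identifier and character offsets match a gold mention exactly.

```python
from typing import Dict, NamedTuple, Set


class Mention(NamedTuple):
    """A place-name mention, identified by page and character offsets (hypothetical schema)."""
    page_id: str
    start: int
    end: int


def score(gold: Set[Mention], predicted: Set[Mention]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 of predicted mentions against gold mentions."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example data: three gold mentions, three predictions, two of them correct.
gold = {Mention("page1", 0, 9), Mention("page1", 20, 27), Mention("page2", 5, 12)}
predicted = {Mention("page1", 0, 9), Mention("page2", 5, 12), Mention("page2", 30, 36)}

result = score(gold, predicted)
print(result)
```

Running the same function over the output of each pipeline would give directly comparable figures. A fuller evaluation would probably also score the resolved coordinates, which this sketch does not attempt.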

The main goal of the project is to create better historical gazetteer resources for Scotland and thereby improve the quality of geoparsing when mapping historical Scottish text. We are also looking to partner with humanities and digital humanities scholars who are interested in applying such methods to their particular use cases (contact: Bea Alex).

Collaborators

Rosa Filgueira Vicente, EPCC, University of Edinburgh

Claire Grover, School of Informatics, University of Edinburgh

Melissa Terras, Centre for Data, Culture and Society, Edinburgh Futures Institute, University of Edinburgh

Sarah Ames, National Library of Scotland

Chris Fleet, National Library of Scotland

Beatrice Alex, Edinburgh Futures Institute, University of Edinburgh

Publications

Rosa Filgueira, Claire Grover, Melissa Terras and Beatrice Alex (2020). Geoparsing the historical Gazetteers of Scotland: accurately computing location in mass digitised texts. To appear in Proceedings of the 8th Workshop on the Challenges in the Management of Large Corpora (CMLC-8 2020) at LREC 2020, 16th of May 2020. [workshop]

References

[1] Claire Grover and Richard Tobin (2014). A Gazetteer and Georeferencing for Historical English Documents. In Proceedings of LaTeCH 2014 at EACL 2014. Gothenburg, Sweden. 

[2] Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin (2015). Adapting the Edinburgh Geoparser for Historical Georeferencing. International Journal for Humanities and Arts Computing, 9(1), pp. 15-35. 

[3] https://github.com/alan-turing-institute/defoe

[4] Filgueira Vicente, R, Jackson, M, Roubickova, A, Krause, A, Terras, M, Hauswedell, T, Nyhan, J, Beavan, D, Hobson, T, Coll Ardanuy, M, Colavizza, G, Hetherington, J & Ahnert, R (2019). defoe: A Spark-based Toolbox for Analysing Digital Historical Textual Data. In 2019 IEEE 15th International Conference on e-Science (e-Science), San Diego, United States, 24/09/19.