Digitising Scotland: Automatic coding of occupations and causes of death
The birth, marriage, and death records digitised by the Digitising Scotland project include textual descriptions of people’s occupations and causes of death. To use these effectively for large-scale research, they must be coded in a form suitable for statistical analysis. The aim of our work in this project is to map the descriptions to standard HISCO codes for occupations and to standard ICD-10 codes for causes of death. It is impractical to have experts code all the records manually, so we treat the problem as a text classification task and apply machine learning techniques.
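To make the text classification framing concrete, here is a minimal sketch in Python. It is not the project's system: it assumes a simple multinomial Naive Bayes classifier over word tokens as a stand-in, and the training descriptions and labels (`CAUSE_TB`, `CAUSE_BRONCHITIS`) are hypothetical placeholders rather than real ICD-10 codes.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a description and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesCoder:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, descriptions, codes):
        self.code_counts = Counter(codes)        # records seen per code
        self.word_counts = defaultdict(Counter)  # token counts per code
        self.vocab = set()
        for text, code in zip(descriptions, codes):
            tokens = tokenize(text)
            self.word_counts[code].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total = sum(self.code_counts.values())
        best_code, best_score = None, -math.inf
        for code, count in self.code_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(count / total)
            denom = sum(self.word_counts[code].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[code][tok] + 1) / denom)
            if score > best_score:
                best_code, best_score = code, score
        return best_code


# Hypothetical manually coded examples; labels are placeholders, not ICD-10.
train = [
    ("phthisis pulmonalis", "CAUSE_TB"),
    ("pulmonary tuberculosis", "CAUSE_TB"),
    ("acute bronchitis", "CAUSE_BRONCHITIS"),
    ("chronic bronchitis", "CAUSE_BRONCHITIS"),
]
coder = NaiveBayesCoder().fit([t for t, _ in train], [c for _, c in train])
predicted = coder.predict("Chronic bronchitis, 3 weeks")
```

A real system would need far more training data and richer features; the point here is only the shape of the task, with free-text descriptions as input and one code from a fixed scheme as output.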
A proportion of the records will be manually coded and used to train the system. For causes of death, we have achieved coding accuracy in the low 90s (per cent). The task is somewhat easier for occupations, as there is less variety in the descriptions, and we are already achieving coding accuracy in the high 90s (per cent). However, the manually coded data we have for training and testing covers only a small part of the historical period. We are aiming to improve performance by working in the following areas:
- We have only a small quantity of manually coded historical data, and coding is an expensive, skilled task. Our main need is for more coded data from a spread of dates across the period. We are looking at using Active Learning techniques to take best advantage of the resources available.
- Many records describe multiple causes of death in a single sentence; we are developing Natural Language Processing techniques to split them.
- The text includes many indecipherable and misspelled words; we are investigating techniques for correcting these.
- We are investigating to what extent modern coded data can be used for training, since medical terminology has changed greatly over the historical period.
- We are also looking at synonym detection to improve accuracy.
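As an illustration of the Active Learning idea in the list above, the following Python sketch shows one common selection strategy, margin-based uncertainty sampling: records whose top two predicted codes are closest in probability are sent to the human coders first. This is an assumption about how such a step might look, not a description of the project's method; the record ids, placeholder codes, and probabilities are hypothetical.

```python
def margin(probabilities):
    """Gap between the two most probable codes; a small gap means uncertain."""
    top = sorted(probabilities.values(), reverse=True)
    return top[0] - top[1] if len(top) > 1 else top[0]


def select_for_coding(predictions, budget):
    """Return the `budget` record ids whose predictions are least certain."""
    ranked = sorted(predictions, key=lambda rid: margin(predictions[rid]))
    return ranked[:budget]


# Hypothetical classifier output: record id -> {placeholder code: probability}.
predictions = {
    "rec1": {"CODE_A": 0.95, "CODE_B": 0.05},
    "rec2": {"CODE_A": 0.55, "CODE_B": 0.45},
    "rec3": {"CODE_A": 0.40, "CODE_B": 0.35, "CODE_C": 0.25},
}
to_label = select_for_coding(predictions, budget=2)
```

With a fixed manual coding budget, prioritising low-margin records spends expert time on the cases the classifier finds hardest, rather than on records it already codes confidently.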
Digitising Scotland project website.
This project is funded by ESRC grant ES/K00574X/1.
Richard Tobin, Elaine Farrow, Claire Grover, Beatrice Alex (2019). Automatic coding of occupation and cause-of-death records. Presented at ADR 2019, Cardiff, UK, December 2019.