Published originally on Medium on the 12th of May 2023
Our collaboration of Gaelic language experts and computational linguists at the University of Edinburgh resulted in an Artificial Intelligence (AI) model uncovering human cheating that happened decades earlier.
We, Prof Will Lamb of Celtic and Scottish Studies and Dr Beatrice Alex, Senior Lecturer in text mining at Literatures, Languages and Cultures, are leading an interdisciplinary project to create Scottish Gaelic handwriting recognition (HWR) that converts manuscripts to electronic text. Our team has trained the first Scottish Gaelic Transkribus model for recognising Gaelic handwriting automatically. Transkribus provides a platform for AI-powered text recognition, transcription and searching of historical documents for different languages and time periods. It allows researchers to train models effectively, even for low-resource languages that have relatively little accessible data to begin with.
To train the Scottish Gaelic HWR model, our team used manuscripts from the School of Scottish Studies Archives (SSSA). The SSSA mostly comprises sound recordings relating to cultural life, folklore and traditional arts in Scotland — including songs, tales and verse collected over the last 70 years. Some of these recordings are accompanied by handwritten transcriptions and it’s this data that was used to train the model.
We ran a series of experiments, which were published in the 4th Celtic Language Technology Workshop proceedings, and the model is available for research on the Transkribus website. For the primary hand, the best model achieves a character-level accuracy of 98.3% and a word-level accuracy of 95.1%.
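For readers unfamiliar with these two figures, here is a minimal sketch of how character-level and word-level accuracy are commonly computed: one minus the edit distance between the model output and the reference transcription, divided by the reference length. This is an illustration under that assumption, not the evaluation code used in the paper or by Transkribus, and the function names are our own.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """1 - (edit distance / reference length), floored at 0 (assumed definition)."""
    if len(ref) == 0:
        return 1.0 if len(hyp) == 0 else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def char_accuracy(ref_text, hyp_text):
    # Compare the transcriptions character by character.
    return accuracy(ref_text, hyp_text)


def word_accuracy(ref_text, hyp_text):
    # Compare the transcriptions as whitespace-separated words.
    return accuracy(ref_text.split(), hyp_text.split())


if __name__ == "__main__":
    reference = "tha mi sgìth"   # illustrative Gaelic phrase, not project data
    hypothesis = "tha mi sgith"  # one missed accent
    print(f"character accuracy: {char_accuracy(reference, hypothesis):.3f}")
    print(f"word accuracy:      {word_accuracy(reference, hypothesis):.3f}")
```

A single wrong character makes the whole word count as an error, which is why word-level accuracy is typically lower than character-level accuracy for the same output.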
During the error analysis, we spotted some unusual examples that caused the model to stumble. It struggled to recognise the writing of one hand, in particular. Looking further into the data, we realised that the hand was that of one particular individual who was already known to the Gaelic language experts involved in the project. The handwriting of this person was unusually spaced out, much more so than in typical adult handwriting.
One of the reasons for the poor performance on this hand is that the model doesn’t have sufficient training data for this handwriting style. In addition, the large gaps between the words (also called inter-word spacing) present a particular challenge to handwriting recognition models and lead to incorrect line splitting, as can be seen in the image above. When each word is treated as a separate line, the model is unable to take context into account when recognising words and characters, leading to poor HWR performance.
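To make the idea of inter-word spacing concrete, the sketch below measures the gaps between word bounding boxes on a line and compares them with the typical word width. It is purely hypothetical: the box format, the function names and the threshold of 1.0 are assumptions for illustration, and this is not how Transkribus segments lines or how our team found the outlier.

```python
from statistics import median


def inter_word_gaps(word_boxes):
    """Horizontal gaps between consecutive words on one line.

    word_boxes: list of (x_left, x_right) pixel coordinates, one per word
    (an assumed input format).
    """
    boxes = sorted(word_boxes)
    return [max(0, nxt[0] - cur[1]) for cur, nxt in zip(boxes, boxes[1:])]


def spacing_ratio(word_boxes):
    """Median gap divided by median word width; 0.0 if the line has one word or none."""
    gaps = inter_word_gaps(word_boxes)
    if not gaps:
        return 0.0
    widths = [right - left for left, right in word_boxes]
    return median(gaps) / median(widths)


def is_widely_spaced(word_boxes, threshold=1.0):
    # threshold is an arbitrary illustrative value: gaps wider than a typical word
    return spacing_ratio(word_boxes) > threshold


if __name__ == "__main__":
    typical_line = [(0, 50), (60, 110), (120, 170)]     # gaps of 10 px
    stretched_line = [(0, 50), (150, 200), (300, 350)]  # gaps of 100 px
    print(is_widely_spaced(typical_line))
    print(is_widely_spaced(stretched_line))
```

A line whose gaps are wider than the words themselves is the kind of layout a segmenter may split into several short "lines", each containing a single word.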
It turns out that the fieldworker concerned was paid by the page. Presumably, this particular individual decided to make the job more lucrative by spacing out their handwriting and generating more pages of transcription. Little did they know that their work practices would be thrown up by an AI model many years later. Prof Lamb has recently discovered that the fieldworker concerned was called out in the 1950s, which led to a change in payment to a flat rate.
So AI technology can be used not only by students to generate essays and cheat in assignments, as many university and college staff now fear following the release of ChatGPT; in this case, it also helped to identify an outlier of unusual behaviour. Our team found it amusing that the AI model's failure to recognise this particular handwriting accurately resurfaced a long-forgotten incident of someone trying to make a bit of extra cash in a slightly underhand way.
Beatrice Alex, Will Lamb and Michael Bauer
 Sinclair, Mark, William Lamb, and Beatrice Alex (2022). Handwriting Recognition for Scottish Gaelic. In Proceedings of the 4th Celtic Language Technology Workshop at LREC 2022 (CLTW 4), Marseille, June 2022, pp.60–70.
 Lamb, William (2012) ‘The storyteller, the scribe and a missing man: Hidden influences from printed sources in the Gaelic tales of Duncan and Neil MacDonald’, Oral Tradition, 27/1: 109–160.