From millions of radiology reports to better brain health

Advancing Brain Imaging Research in Scotland with Natural Language Processing

Coming to HealTAC 2025: A lightning talk and poster on breakthrough NLP applications in neuroimaging

Every day, radiologists generate thousands of brain imaging reports, detailed narratives describing what they see in CT and MRI scans. These reports contain a wealth of clinical information, but until recently much of it remained locked away as unstructured text, difficult to analyse at scale. What if we could automatically extract meaningful patterns from these reports to predict strokes, identify dementia risk and recruit patients for life-saving clinical trials?

That’s exactly what our wider research team at the Universities of Edinburgh and Dundee has accomplished, and we’re excited to share our findings at the upcoming HealTAC 2025 conference through both a lightning talk and poster presentation.

Image source: FreePik

The Challenge

Medical imaging generates enormous amounts of textual data. In Scotland alone, the Scottish Medical Imaging (SMI) dataset contains reports from over 57 million radiology studies [1], including approximately 1.7 million reports related to the brain, a substantial resource for neurological research. Each brain imaging report tells a story, describing brain lesions, small vessel disease, atrophy, tumours and other abnormalities or confirming normal findings, but analysing this information manually across entire populations is simply impossible.

Traditional approaches to medical research often rely on structured data or small, carefully curated datasets. But what if we could harness the full power of all available brain imaging reports to understand disease patterns, predict outcomes and improve patient care?

Our Solution

We developed and refined EdIE-R [2], a robust natural language processing (NLP) pipeline specifically designed for brain imaging reports. This system identifies 24 distinct clinical phenotypes, including:

  • Different types of strokes (ischaemic vs. haemorrhagic, with temporal and location details)
  • Brain tumours (meningiomas, gliomas, metastases)
  • Small vessel disease and microbleeds, and
  • Other neurological abnormalities

The system processes reports through multiple stages: identifying medical entities, detecting negation (crucial for understanding what’s not present), extracting relationships between findings and assigning document-level labels (phenotypes). Performance varies depending on the data type, scan type, frequency of the phenotype and age of the patient [2-4]. For conditions like small vessel disease, EdIE-R achieves near-perfect accuracy with F1-scores between 0.97 and 1.0.
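To make these stages concrete, here is a deliberately simplified, hypothetical Python sketch of how affirmed entity mentions and sentence-level negation can be combined into document-level labels. The cue words, patterns and phenotype names are invented for illustration, relation extraction is omitted, and this is not the EdIE-R implementation, which is a far more extensive rule-based pipeline.

```python
import re

# Hypothetical illustration only -- not the EdIE-R implementation.
NEGATION_CUES = re.compile(r"\b(no|not|without|absence of)\b", re.IGNORECASE)
ENTITY_PATTERNS = {
    "small_vessel_disease": re.compile(r"\bsmall vessel disease\b", re.IGNORECASE),
    "meningioma": re.compile(r"\bmeningiomas?\b", re.IGNORECASE),
    "haemorrhagic_stroke": re.compile(r"\bhaemorrhages?\b", re.IGNORECASE),
}

def label_report(report):
    """Assign document-level phenotype labels from affirmed entity mentions."""
    labels = set()
    for sentence in re.split(r"(?<=[.;])\s+", report):
        negated = bool(NEGATION_CUES.search(sentence))       # negation detection
        for phenotype, pattern in ENTITY_PATTERNS.items():   # entity identification
            if pattern.search(sentence) and not negated:     # document-level labels
                labels.add(phenotype)
    return labels

print(label_report("Moderate small vessel disease. No acute haemorrhage."))
# -> {'small_vessel_disease'}
```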

Three Large-scale Studies

Our NLP pipeline has enabled three major population-based studies, none of which would have been feasible without automated text processing:

1. Preventing Cerebrovascular Disease (WARBLER Project)

“Covert” cerebrovascular disease (CCD), brain vessel damage that does not in itself cause obvious symptoms, is a common incidental finding on brain imaging. Using our NLP system on the entire Scottish population’s brain imaging data (2010-2018), made available for research by Public Health Scotland, we identified crucial associations between these hidden findings and future stroke risk [5]. The results show that NLP can be used for early detection of CCD from radiology reports, which could enable early intervention to prevent strokes and vascular dementia.

2. Finding Patients for Clinical Trials (DISCOVER Project)

One of the biggest challenges in medical research is finding the right patients for clinical trials. Using our NLP system in DataLoch, we identified 27,000 NHS Lothian patients with signs of covert cerebrovascular disease, including 9,000 not currently on preventive medication. These patients will be contacted to participate in trials for new stroke and dementia treatments, demonstrating how AI can accelerate the path from research to real-world impact.

3. Predicting Dementia (SCANDAN Project)

Can we predict who will develop dementia based on their brain scans? Our NLP pipeline helped to create a massive, labelled and linked dataset for dementia prediction by filtering out irrelevant scans and extracting key imaging features, enabling downstream image analysis and predictive modelling [6]. This work is laying the foundation for AI systems that can identify high-risk patients years before symptoms appear.

Why This Matters

These studies showcase how NLP can transform population health research:

Scale: We’re analysing data from hundreds of thousands of patients across an entire population, not just small research cohorts.

Speed: What would take teams of clinical coders years to code manually, our system processes in a fraction of the time.

Discovery: By analysing vast amounts of previously inaccessible data, we’re uncovering new patterns and associations that could lead to better treatments.

Clinical Impact: Our work directly enables patient recruitment for clinical trials and early identification of high-risk individuals.

Challenges and Future Directions

Of course, this work isn’t without challenges. Ensuring that NLP systems work reliably across different NHS boards, scan types and patient populations requires robust validation and refinement [2-4]. Privacy and confidentiality considerations also mean that we must work within secure research environments, adding complexity to our analyses.

Looking ahead, we’re excited to see how the current work will enable future clinical trials and accelerate medical research through the development of predictive models that combine brain imaging and linked clinical data.

Join Us at HealTAC

We’re excited to present this work at HealTAC 2025. Do join us at the conference for our:

  • Lightning Talk: 17/06/2025, 14:00-15:00, and
  • Poster Session: 17/06/2025, 15:00-17:00. Look for our poster on “Advancing Neuroimaging Research with NLP: Three Large-Scale Population-Based Studies in Scotland”

References

[1] Baxter, R., Nind, T., Sutherland, J., McAllister, G., Hardy, D., Hume, A., MacLeod, R., Caldwell, J., Krueger, S., Tramma, L. and Teviotdale, R., 2023. The Scottish Medical Imaging Archive: 57.3 million radiology studies linked to their medical records. Radiology: Artificial Intelligence, 6(1), p.e220266. 

[2] Alex, B., Grover, C., Tobin, R., Sudlow, C., Mair, G. and Whiteley, W., 2019. Text mining brain imaging reports. Journal of biomedical semantics, 10, pp.1-11. 

[3] Wheater, E., Mair, G., Sudlow, C., Alex, B., Grover, C., & Whiteley, W., 2019. A validated natural language processing algorithm for brain imaging phenotypes from radiology reports in UK electronic health records. BMC medical informatics and decision making, 19(1), 184. 

[4] Casey, A., Davidson, E., Grover, C., Tobin, R., Grivas, A., Zhang, H., Schrempf, P., O’Neil, A.Q., Lee, L., Walsh, M., Pellie, F., Ferguson, K., Cvoro, V., Wu, H., Whalley, H., Mair, G., Whiteley, W. and Alex, B., 2023. Understanding the performance and reliability of NLP tools: a comparison of four NLP tools predicting stroke phenotypes in radiology reports. Frontiers in digital health, 5, p.1184919.

[5] Iveson, M.H., Mukherjee, M., Davidson, E., Zhang, H., Sherlock, L., Ball, E.L., Mair, G., Hosking, A., Whalley, H., Poon, M.T.C., Tobin, R., Grover, C., Alex, B. & Whiteley, W.N., In preparation. Clinically-reported covert cerebrovascular disease and risk of stroke, dementia and other neurological disease: a whole-population cohort of 395,273 people using natural language processing.

[6] Camilleri, M., Gouzou, D., Al-Wasity, S., Valdes Hernandez, M., Alex, B., Tsaftaris, S., Brooks, A., MacLeod, R., Wu, H., Bauer, B., Grover, C., Krueger, S., Tobin, R., Steele, D., Mair, G., Wardlaw, J., Doney, A., Trucco, E. & Whiteley, W., In preparation. A large dataset of brain imaging linked to health systems data: a whole system national cohort. 

Funding

This work has been funded/supported by:

  • The Alan Turing Institute Project and Fellowships (CG & BA, EPSRC grant EP/N510129/1)
  • MRC Pathfinder (MRC – MCPC17209)
  • The Medical Research Council (WW, MRC Clinician Scientist Award G0902303)
  • Chief Scientist Office (WW, Scottish Senior Clinical Fellowship, CAF/17/01).
  • The Alzheimer’s Society
  • HDR-UK
  • Stroke Association Edith Murphy Foundation (GM, Senior Clinical Lectureship, SA L-SMP 18n1000)
  • Innovate UK on behalf of UKRI (iCAIRD, project number: 104690)
  • Generation Scotland (Chief Scientist Office of the Scottish Government Health Directorates (CZD/16/6), Scottish Funding Council (HR03006) and the Wellcome Trust (216767/Z/19/Z))
  • The Advanced Care Research Centre (L&G)
  • AIM-CISC project (NIHR202639). 
  • NEURii (Eisai, Gates Ventures, Health Data Research UK and LifeArc)

Perceptions of Edinburgh

Capturing neighbourhood characteristics by clustering geoparsed local news

Our paper on “Perceptions of Edinburgh: Capturing neighbourhood characteristics by clustering geoparsed local news” is published here. This work is part of the AIM-CISC project on AI and multimorbidity, where our aim was to capture information about neighbourhoods and communities in local news. 

We collected Edinburgh Evening News articles, accessed through the University of Edinburgh Library, covering a five-year period and clustered them by neighbourhood into relevant topics.

We also visualised topics across time.

We used the Edinburgh Geoparser to georeference the news articles and determine the most relevant neighbourhood for each article. We then linked the news to the neighbourhoods defined by the Scottish Index of Multiple Deprivation.
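As a rough illustration of the clustering step, the sketch below groups a handful of invented toy articles, each already tagged with the neighbourhood assigned by geoparsing, and clusters their text with TF-IDF and k-means. The published study’s actual models, features and parameters differ; this is only meant to convey the overall idea.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy articles, each already tagged with a geoparsed neighbourhood (invented data).
articles = [
    ("Leith", "new cycle path opens along the shore"),
    ("Leith", "volunteers plant trees in the community garden"),
    ("Portobello", "beach clean-up event planned for the weekend"),
    ("Portobello", "litter pickers gather on the beach promenade"),
]

texts = [text for _, text in articles]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

# Inspect which topic cluster dominates each neighbourhood.
for (area, text), cluster in zip(articles, clusters):
    print(f"{area:11s} cluster {cluster}: {text}")
```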

As part of this work, we ran a public involvement and engagement event with members of the public who read local newspapers and have knowledge of the City of Edinburgh.

This work was made possible by combining different datasets, resources and cross-disciplinary expertise, and it underpins follow-on research identifying which characteristics of neighbourhoods increase vulnerability or promote resilience to ill health (Abubakar, E., et al., under review).

Earlier this year, our team won the Centre for Data, Culture and Society award for the Best Novel Use of a Digital Method.

“This interdisciplinary project itself makes a strong contribution to understanding community health in Scotland, and the success of the methodology suggests a wider application of this framework to assessing community health via local news in other areas. I especially appreciate the researchers’ testing of their methods to confirm that the correlation they had uncovered was, indeed, reliable!”

The project was funded by NIHR’s AIM funding and is supported by the Advanced Care Research Centre.

Reference: Andreas Grivas, Claire Grover, Richard Tobin, Clare Llewellyn, Eleojo Oluwaseun Abubakar, Chunyu Zheng, Chris Dibben, Alan Marshall, Jamie Pearce, Beatrice Alex, Perceptions of Edinburgh: Capturing neighbourhood characteristics by clustering geoparsed local news, Information Processing & Management, Volume 62, Issue 1, 2025, https://doi.org/10.1016/j.ipm.2024.103910.

When the AI unveils human cheating instead of assisting it: a case from automatic handwriting recognition

Published originally on Medium on the 12th of May 2023

Our collaboration of Gaelic language experts and computational linguists at the University of Edinburgh resulted in an Artificial Intelligence (AI) model uncovering human cheating that happened decades earlier.

Source: Hannah Olinger, Unsplash

We (Prof Will Lamb, Celtic and Scottish Studies, and Dr Beatrice Alex, Senior Lecturer in text mining, Literatures, Languages and Cultures) are leading an interdisciplinary project creating Scottish Gaelic handwriting recognition (HWR) to convert manuscripts to electronic text. Our team has trained the first Scottish Gaelic Transkribus model for recognising Gaelic handwriting automatically. Transkribus provides a platform for AI-powered text recognition, transcription and searching of historical documents across different languages and time periods. It allows researchers to train models effectively, even for low-resource languages that have little accessible data to begin with.

To train the Scottish Gaelic HWR model, our team used manuscripts from the School of Scottish Studies Archives (SSSA). The SSSA mostly comprises sound recordings relating to cultural life, folklore and traditional arts in Scotland — including songs, tales and verse collected over the last 70 years. Some of these recordings are accompanied by handwritten transcriptions and it’s this data that was used to train the model.

We ran a series of experiments, which were published in the 4th Celtic Language Technology Workshop proceedings [1], and the model is available for research on the Transkribus website. For the primary hand, the best model achieves a character-level accuracy of 98.3% and a word-level accuracy of 95.1%.
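For readers unfamiliar with these metrics, character-level and word-level accuracy are commonly reported as one minus the corresponding error rate, i.e. the edit distance between the model output and the reference transcription divided by the reference length. The short Python sketch below illustrates that calculation with an invented example; it is not the evaluation script used in the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]

def accuracy(reference, hypothesis, level="char"):
    """1 - error rate, computed at character or word level."""
    ref = list(reference) if level == "char" else reference.split()
    hyp = list(hypothesis) if level == "char" else hypothesis.split()
    return 1 - edit_distance(ref, hyp) / len(ref)

reference = "the fieldworker was paid by the page"
hypothesis = "the fieldworker was paid by the gage"
print(f"character accuracy: {accuracy(reference, hypothesis, 'char'):.3f}")
print(f"word accuracy:      {accuracy(reference, hypothesis, 'word'):.3f}")
```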

During the error analysis, we spotted some unusual examples that caused the model to stumble. It struggled to recognise the writing of one hand, in particular. Looking further into the data, we realised that the hand was that of one particular individual who was already known to the Gaelic language experts involved in the project. The handwriting of this person was unusually spaced out, much more so than in typical adult handwriting.

Bad quality HWR output for a hand with large inter-word spacing.

One of the reasons for the poor performance on this hand is that the model doesn’t have sufficient training data for this handwriting style. However, the large gaps between the words (also called inter-word spacing) present a particular challenge to handwriting recognition models and lead to incorrect line splitting, as can be seen in the image above. By treating each word individually, the model is unable to take context into account when recognising words and characters, leading to poor HWR performance.

It turns out that the fieldworker concerned was paid by the page. Presumably, this particular individual decided to make the job more lucrative by spacing out their handwriting and generating more pages of transcription [2]. Little did they know that their work practices would be brought to light by an AI model many years later. Prof Lamb has recently discovered that the fieldworker concerned was called out in the 1950s, which led to a change to a flat rate of payment.

So AI technology can not only be used by students to generate essays and cheat in assignments, as many university and college staff now fear following the release of ChatGPT; in this case, it helped to identify unusual behaviour from decades past. Our team found it amusing that the AI model wasn’t able to recognise this particular handwriting accurately, resurfacing a long-forgotten incident of someone trying to make a bit of extra cash in a slightly underhand way.

Beatrice Alex, Will Lamb and Michael Bauer

References

[1] Sinclair, Mark, William Lamb, and Beatrice Alex (2022). Handwriting Recognition for Scottish Gaelic. In Proceedings of the 4th Celtic Language Technology Workshop at LREC 2022 (CLTW 4), Marseille, June 2022, pp.60–70.

[2] Lamb, William (2012) ‘The storyteller, the scribe and a missing man: Hidden influences from printed sources in the Gaelic tales of Duncan and Neil MacDonald’, Oral Tradition, 27/1: 109–160.

Happy Geoparsing: The Edinburgh Geoparser v1.3 is out

Photo credit: Markus Winkler (Unsplash)

New Release

We have released version 1.3 of the Edinburgh Geoparser and updated the accompanying lesson on the Programming Historian. The Geoparser now runs with a free OpenStreetMap visualisation by default. Anita Hawes, Publishing Assistant at Programming Historian, recently made us aware that users of the Geoparser who followed our lesson were asked to enter credit card details when creating the Mapbox key needed for the map visualisation. We want our language technology to be open and free, so we reacted quickly to fix that.

We have now changed the Geoparser’s visualisation component to use OpenStreetMap tiles by default. OpenStreetMap tiles can be used for light use free of charge (and without signing up to anything) in accordance with their Tile Usage Policy.

If you have a Mapbox account you can continue to use it with the Geoparser by setting the GEOPARSER_MAP_KEY environment variable as before, but be aware that Mapbox may charge you if you have registered a credit card and exceed their free-use limits.
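For example, if you drive the Geoparser from Python, you can pass the key through the subprocess environment. The sketch below is only illustrative: the -t/-g/-o flags follow the Programming Historian lesson, and the paths, document name and token are placeholders you would need to adapt to your own installation.

```python
import os
import subprocess

env = dict(os.environ)
env["GEOPARSER_MAP_KEY"] = "your-mapbox-access-token"  # omit to use the default OpenStreetMap tiles

# Pipe a plain-text document into the Geoparser's run script (assumed install path).
with open("article.txt", "rb") as text:
    subprocess.run(
        ["./run", "-t", "plain", "-g", "geonames", "-o", "../out", "article"],
        cwd="geoparser/scripts",
        stdin=text,
        env=env,
        check=True,
    )
```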

This is the only change we made in v1.3 compared to v1.2. If you don’t use the visualisation component, there is no need to update.

Figure 1: Examples of some geo-parsed exonyms (Vienna for Wien, Munich for München, Copenhagen for København, Venice for Venezia, Milan for Milano and Florence for Firenze).

Watch out for Exonyms

An exonym is a name used by outsiders for a place, such as Munich for München. The main disadvantage of using OpenStreetMap tiles – from the point of view of an English-language geoparser – is that it generally displays maps in the language of the area or country, rather than in English. This is a problem for exonyms, as a place name on the map might not coincide with the name in the text. Despite this mismatch, it’s actually compelling to see how place names vary across languages. For example, check out the place name for Hungary, which appears on the map as Magyarország.

To help you keep track of locations, the Geoparser visualisation centres the map on the corresponding pin when you click a recognised place name highlighted in the text, and it displays the recognised place name when you hover over the pin (see Figure 1).

Happy New Geoparsing!

Volunteer to Help Save Ukrainian Cultural Heritage Online (SUCHO)

Here is an urgent message from Prof Melissa Terras on how to help preserve Ukrainian Cultural Heritage … please spread the word.

Dear Colleagues,

Trusted friends of mine have set up SUCHO, Save Ukrainian Cultural Heritage Online (SUCHO) https://www.sucho.org

They are asking for volunteers to help identify and archive sites and content, while they are still online. You do not have to read Ukrainian or Russian to help. 

You can submit items to be saved: https://docs.google.com/forms/d/e/1FAIpQLSffa64-l6qXqEumAcf38OEOrTFeYZEmF531PNv9ZgzNFbcgxQ/viewform

And Volunteer to help put things in the internet archive, or use more advanced archiving software: https://docs.google.com/forms/d/e/1FAIpQLSc6KbhtEOI8zKsQmKT_waE1XlYEF1E6t-HzJ7Gc1EBfMvMg_A/viewform

Please do share with colleagues, and your students, and your networks. It’s one concrete thing we can do to help Ukraine, from afar.

You may also be interested in following the work of the Ukrainian Library Association, who are coordinating a National Digital Library in Ukraine: https://www.facebook.com/ula.org.ua/

Best wishes, and I hope you are doing ok at this difficult time.
Melissa 
————
Professor Melissa Terras
University of Edinburgh, College of Arts, Humanities and Social Sciences
@melissaterras