Advancing Brain Imaging Research in Scotland with Natural Language Processing
Coming to HealTAC 2025: A lightning talk and poster on breakthrough NLP applications in neuroimaging
Every day, radiologists generate thousands of brain imaging reports, detailed narratives describing what they see in CT and MRI scans. These reports contain a wealth of clinical information, but until recently much of it remained locked away as unstructured text, difficult to analyse at scale. What if we could automatically extract meaningful patterns from these reports to predict strokes, identify dementia risk and recruit patients for life-saving clinical trials?
That’s exactly what our wider research team at the Universities of Edinburgh and Dundee has accomplished, and we’re excited to share our findings at the upcoming HealTAC 2025 conference through both a lightning talk and poster presentation.

The Challenge
Medical imaging generates enormous amounts of textual data. In Scotland alone, the Scottish Medical Imaging (SMI) dataset contains reports from over 57 million radiology studies [1], including approximately 1.7 million reports related to the brain, a substantial resource for neurological research. Each brain imaging report tells a story, describing brain lesions, small vessel disease, atrophy, tumours and other abnormalities or confirming normal findings, but analysing this information manually across entire populations is simply impossible.
Traditional approaches to medical research often rely on structured data or small, carefully curated datasets. But what if we could harness the full power of all available brain imaging reports to understand disease patterns, predict outcomes and improve patient care?
Our Solution
We developed and refined EdIE-R [2], a robust natural language processing (NLP) pipeline designed specifically for brain imaging reports. This system identifies 24 distinct clinical phenotypes, including:
- Different types of strokes (ischaemic vs. haemorrhagic, with temporal and location details)
- Brain tumours (meningiomas, gliomas, metastases)
- Small vessel disease and microbleeds, and
- Other neurological abnormalities
The system processes reports through multiple stages: identifying medical entities, detecting negation (crucial for understanding what’s not present), extracting relationships between findings and assigning document-level labels (phenotypes). Performance varies depending on the data type, scan type, frequency of the phenotype and age of the patient [2-4]. For conditions like small vessel disease, EdIE-R achieves near-perfect accuracy with F1-scores between 0.97 and 1.0.
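To make the stages above concrete, here is a minimal Python sketch of the general idea: find entities, check whether each mention is negated, then roll the mentions up into document-level labels. It is an illustration only and not EdIE-R's implementation. The lexicon, the negation cues and all function names are hypothetical, and the relation-extraction stage is left out for brevity.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicon mapping surface phrases to phenotype labels.
# A real system covers far more terms and variants.
LEXICON = {
    "small vessel disease": "small_vessel_disease",
    "microbleed": "microbleed",
    "infarct": "ischaemic_stroke",
    "haemorrhage": "haemorrhagic_stroke",
    "atrophy": "atrophy",
}

# Hypothetical negation cues; a mention is treated as negated when one of
# these words appears earlier in the same sentence.
NEGATION = re.compile(r"\b(no|not|without)\b")


@dataclass
class Entity:
    label: str
    negated: bool


def split_sentences(report: str) -> list[str]:
    """Lower-case the report and split it into rough sentences."""
    parts = re.split(r"[.;\n]+", report.lower())
    return [p.strip() for p in parts if p.strip()]


def find_entities(sentence: str) -> list[Entity]:
    """Stages 1 and 2: lexicon lookup, then a simple negation check."""
    entities = []
    for phrase, label in LEXICON.items():
        idx = sentence.find(phrase)
        if idx == -1:
            continue
        negated = NEGATION.search(sentence[:idx]) is not None
        entities.append(Entity(label, negated))
    return entities


def document_labels(report: str) -> set[str]:
    """Final stage: a label applies if any mention of it is not negated."""
    labels = set()
    for sentence in split_sentences(report):
        for entity in find_entities(sentence):
            if not entity.negated:
                labels.add(entity.label)
    return labels


report = (
    "There is no evidence of acute infarct. "
    "Moderate small vessel disease. Generalised atrophy."
)
print(sorted(document_labels(report)))
```

In this example the infarct mention is dropped because it follows "no evidence of", which is why negation detection matters so much for radiology text: a large share of reported findings are stated as absent.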
Three Large-scale Studies
Automated text processing with our NLP pipeline has made three major population-based studies possible:
1. Preventing Cerebrovascular Disease (WARBLER Project)
“Covert” cerebrovascular disease (CCD), brain vessel damage that doesn’t cause obvious symptoms in itself, is a common incidental brain imaging finding. Using our NLP system on the entire Scottish population’s brain imaging data (2010-2018) made available for research by Public Health Scotland, we identified crucial associations between these hidden findings and future stroke risk [5]. The results show that NLP can be used for early detection of CCD from radiology reports, which could enable early intervention to prevent strokes and vascular dementia.
2. Finding Patients for Clinical Trials (DISCOVER Project)
One of the biggest challenges in medical research is finding the right patients for clinical trials. Using our NLP system in DataLoch, we identified 27,000 NHS Lothian patients with signs of covert cerebrovascular disease, including 9,000 not currently on preventive medication. These patients will be contacted to participate in trials for new stroke and dementia treatments, demonstrating how AI can accelerate the path from research to real-world impact.
3. Predicting Dementia (SCANDAN Project)
Can we predict who will develop dementia based on their brain scans? Our NLP pipeline helped to create a massive, labelled and linked dataset for dementia prediction by filtering out irrelevant scans and extracting key imaging features to enable downstream image analysis and predictive modelling efforts [6]. This work is laying the foundation for AI systems that can identify high-risk patients years before symptoms appear.
Why This Matters
These studies showcase how NLP can transform population health research:
Scale: We’re analysing data from hundreds of thousands of patients across an entire population, not just small research cohorts.
Speed: Work that would take teams of clinical coders years to code manually is processed by our system in a fraction of the time.
Discovery: By analysing vast amounts of previously inaccessible data, we’re uncovering new patterns and associations that could lead to better treatments.
Clinical Impact: Our work directly enables patient recruitment for clinical trials and early identification of high-risk individuals.
Challenges and Future Directions
Of course, this work isn’t without challenges. Ensuring that NLP systems work reliably across different NHS boards, scan types and patient populations requires robust validation and refinement [2-4]. Privacy and confidentiality considerations also mean that we must work within secure research environments, adding complexity to our analyses.
Looking ahead, we’re excited to see how the current work will enable future clinical trials and accelerate medical research through the development of predictive models that combine brain imaging and linked clinical data.
Join Us at HealTAC
We’re excited to present this work at HealTAC 2025. Do join us at the conference for our:
- Lightning Talk: 17/06/2025, 14:00-15:00, and
- Poster Session: 17/06/2025, 15:00-17:00. Look for our poster on “Advancing Neuroimaging Research with NLP: Three Large-Scale Population-Based Studies in Scotland”
References
[1] Baxter, R., Nind, T., Sutherland, J., McAllister, G., Hardy, D., Hume, A., MacLeod, R., Caldwell, J., Krueger, S., Tramma, L. and Teviotdale, R., 2023. The Scottish Medical Imaging Archive: 57.3 million radiology studies linked to their medical records. Radiology: Artificial Intelligence, 6(1), p.e220266.
[2] Alex, B., Grover, C., Tobin, R., Sudlow, C., Mair, G. and Whiteley, W., 2019. Text mining brain imaging reports. Journal of Biomedical Semantics, 10, pp.1-11.
[3] Wheater, E., Mair, G., Sudlow, C., Alex, B., Grover, C. and Whiteley, W., 2019. A validated natural language processing algorithm for brain imaging phenotypes from radiology reports in UK electronic health records. BMC Medical Informatics and Decision Making, 19(1), p.184.
[4] Casey, A., Davidson, E., Grover, C., Tobin, R., Grivas, A., Zhang, H., Schrempf, P., O’Neil, A.Q., Lee, L., Walsh, M., Pellie, F., Ferguson, K., Cvoro, V., Wu, H., Whalley, H., Mair, G., Whiteley, W. and Alex, B., 2023. Understanding the performance and reliability of NLP tools: a comparison of four NLP tools predicting stroke phenotypes in radiology reports. Frontiers in Digital Health, 5, p.1184919.
[5] Iveson, M.H., Mukherjee, M., Davidson, E., Zhang, H., Sherlock, L., Ball, E.L., Mair, G., Hosking, A., Whalley, H., Poon, M.T.C., Tobin, R., Grover, C., Alex, B. and Whiteley, W.N., In preparation. Clinically-reported covert cerebrovascular disease and risk of stroke, dementia and other neurological disease: a whole-population cohort of 395,273 people using natural language processing.
[6] Camilleri, M., Gouzou, D., Al-Wasity, S., Valdes Hernandez, M., Alex, B., Tsaftaris, S., Brooks, A., MacLeod, R., Wu, H., Bauer, B., Grover, C., Krueger, S., Tobin, R., Steele, D., Mair, G., Wardlaw, J., Doney, A., Trucco, E. and Whiteley, W., In preparation. A large dataset of brain imaging linked to health systems data: a whole system national cohort.
Funding
This work has been funded/supported by:
- The Alan Turing Institute Project and Fellowships (CG & BA, EPSRC grant EP/N510129/1)
- MRC Pathfinder (MRC – MCPC17209)
- The Medical Research Council (WW, MRC Clinician Scientist Award G0902303)
- Chief Scientist Office (WW, Scottish Senior Clinical Fellowship, CAF/17/01)
- The Alzheimer’s Society
- HDR-UK
- Stroke Association Edith Murphy Foundation (GM, Senior Clinical Lectureship, SA L-SMP 18n1000)
- Innovate UK on behalf of UKRI (iCAIRD, project number: 104690)
- Generation Scotland (Chief Scientist Office of the Scottish Government Health Directorates (CZD/16/6), Scottish Funding Council (HR03006) and the Wellcome Trust (216767/Z/19/Z))
- The Advanced Care Research Centre (L&G)
- AIM-CISC project (NIHR202639)
- NEURii (Eisai, Gates Ventures, Health Data Research UK and LifeArc)