Blood Is Thicker Than Water

A Hierarchical Evaluation Metric for Document Classification

This blog post serves as an introduction to the methods described in the paper CoPHE: A Count-Preserving Hierarchical Evaluation Metric in Large-Scale Multi-Label Text Classification [1] presented at EMNLP 2021. For a more detailed description, along with a comparison against previous evaluation metrics in this setting, please refer to the full publication.

Motivation

Evaluation in Large-Scale Multi-Label Text Classification, such as automated ICD-9 coding of discharge summaries in MIMIC-III [2], is treated in prior art as exact-match evaluation on the code level. Labelling is carried out on the document level (weak labelling), with each code appearing at most once per document. Hence, the prediction and gold standard for a document can be viewed as sets. The label space of MIMIC-III consists of leaf nodes within the ICD-9 tree (example substructure in Figure 1), treating both the prediction and the gold standard as flat (disregarding the ontological structure).

Figure 1: Example subgraph of the code 364 within the ICD-9 ontology. Leaf (prediction-level) nodes are represented with circular nodes, ancestor nodes are rectangular.

Within a structured label space, the concept of distance between labels naturally arises. If, for instance, we consider each edge within the ontology to be of equal weight, the code 410.01 Acute myocardial infarction of anterolateral wall, initial episode of care is closer to its sibling code 410.02 Acute myocardial infarction of anterolateral wall, subsequent episode of care than to a cousin code 410.11 Acute myocardial infarction of other anterior wall, initial episode of care, or a more distantly related code, e.g., 401.9 Unspecified essential hypertension. The standard flat evaluation does not capture this: if the code 410.01 were mispredicted as any other code, the flat exact-match approach would treat all such errors as equivalent, disregarding how close the prediction is to the gold standard.

Previous work, such as [3], has incorporated the structural nature of the label space of ICD ontologies. That study, however, concerns a different task: information extraction with strong labels, where ICD codes are assigned to specific spans of text within the document. Strong labelling allows a prediction to be associated with a gold standard label and compared exactly on a case-by-case basis. This is, unfortunately, not possible in the weakly-labelled scenario of document-level ICD coding, where if a label is mispredicted we cannot state its corresponding gold standard label with certainty.

One of the approaches to creating a metric for the structured label space in [3] is tracing the distance on the ontology graph (a tree) between the closest common parent and either the prediction or the gold standard. We are unable to reuse this method exactly, as we lack the knowledge of which gold standard codes relate to which mispredictions. Instead, we make use of common ancestors.

Hierarchical Evaluation

One way to approach hierarchical evaluation in a weakly-labelled scenario is to evaluate not only on the leaf-level predictions, but also on the codes’ ancestors. We can convert leaf-level predictions into ancestor predictions (e.g., by means of adjacency matrices) and compare those against their respective converted gold standard (Figure 2). The core idea is that codes appearing closer together within the ontology will share more ancestors, thereby mitigating the error that arises from misclassification.

Figure 2: A conversion from leaf-level to parent for both the prediction vector and the gold standard label vector. A similar conversion can be done for at least one more (grandparent) level.
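To make the conversion concrete, here is a minimal sketch (not the paper's released code) that maps a binary leaf-level prediction vector to ancestor-level values with an ancestor matrix; the toy codes, matrix, and variable names are illustrative assumptions loosely based on the 364 subtree in Figure 1.

```python
# Minimal sketch of leaf-to-ancestor conversion via an ancestor matrix.
# The subtree and code names below are illustrative, not the full ICD-9 tree.
import numpy as np

leaf_codes = ["364.00", "364.01", "364.3"]   # leaf (prediction-level) codes
ancestor_codes = ["364.0", "364"]            # ancestor codes of interest

# ancestor_matrix[i, j] = 1 if ancestor_codes[j] is an ancestor of leaf_codes[i]
ancestor_matrix = np.array([
    [1, 1],   # 364.00 -> 364.0 -> 364
    [1, 1],   # 364.01 -> 364.0 -> 364
    [0, 1],   # 364.3  -> 364
])

leaf_prediction = np.array([1, 0, 1])        # binary leaf-level prediction vector

# The matrix product counts, for each ancestor, how many predicted leaves fall
# under it; clipping to {0, 1} gives the binary ancestor-level prediction.
ancestor_counts = leaf_prediction @ ancestor_matrix   # -> [1, 2]
ancestor_binary = np.clip(ancestor_counts, 0, 1)      # -> [1, 1]
```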

Once we have the ancestor-level values we can either report separate metrics for each level of the ontology, or a single metric on the combined information from different levels.

Figure 3: A comparison between predictions and gold standard. Ancestor vectors are concatenated with leaf vectors.

As prediction-level codes in MIMIC-III appear at different depths of the ontology (as seen in Figure 1), it is reasonable to report performance at different depths. Depending on the implementation of the conversion procedure, duplicates may appear.

The example presented in Figure 3 is neat in that at most one prediction is made for each of the shown code families on the leaf level. What about multiple predictions within the same family? One option would be to stick to binary values, as on the prediction level: if at least one prediction-level node within the family is positive, the family is considered positive (1), and negative (0) otherwise. As such, the value of an ancestor node is the result of a logical OR over its descendants. Standard P/R/F1 can then be applied for evaluation without further processing. Such an approach to hierarchical evaluation in multi-label classification was presented by Kosmopoulos et al. [4].
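Sketched below is one possible implementation of this set-based variant, under the same toy ancestor-matrix assumption as before. It follows the OR-over-descendants idea described above rather than reproducing the exact formulation of Kosmopoulos et al. [4], and the function names are made up for illustration.

```python
# Minimal sketch of set-based (binary) hierarchical evaluation: ancestor values
# are the logical OR of their descendants, then standard P/R/F1 is applied over
# the concatenation of leaf and ancestor vectors.
import numpy as np

def set_based_counts(leaf_pred, leaf_gold, ancestor_matrix):
    """Return (TP, FP, FN) over concatenated leaf and binary ancestor vectors."""
    anc_pred = np.clip(leaf_pred @ ancestor_matrix, 0, 1)   # OR over descendants
    anc_gold = np.clip(leaf_gold @ ancestor_matrix, 0, 1)
    pred = np.concatenate([leaf_pred, anc_pred])
    gold = np.concatenate([leaf_gold, anc_gold])
    tp = int(np.sum((pred == 1) & (gold == 1)))
    fp = int(np.sum((pred == 1) & (gold == 0)))
    fn = int(np.sum((pred == 0) & (gold == 1)))
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```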

Count-Preserving Hierarchical Evaluation (CoPHE)

Alternatively, we can extend to full counts, in which case each family value is the sum of the values of the family’s prediction-level codes. This results in ancestor values in the domain of natural numbers. Standard binary P/R/F1 do not work in this case (as TP, FP, and FN are defined for binary input), but we retain more information that can tell us about over- or under-prediction of the ancestor codes. Why is this important? The ontology may contain inexplicit rules, such as some code families allowing only a single code per document. For example, 401 (Essential Hypertension) has three descendant codes corresponding to malignant, benign, and unspecified hypertension respectively. From a logical standpoint, hypertension can be either malignant or benign, but not both at the same time, and would be considered unspecified only if it was stated to be present but not specified as malignant or benign.

Back to TP, FP, and FN. We are now dealing with vectors of natural numbers rather than binary vectors, so these counts need to be redefined.

Let x be the number of predicted codes within a certain code family f for a document d. Let y be the number of true gold standard codes within the same code family f for a document d.

TP(d, f) = min(x, y)
FP(d, f) = max(x - y, 0)
FN(d, f) = max(y - x, 0)

Here, min and max return the minimum and maximum of their two arguments, respectively.

TP represents the numeric overlap, FP and FN represent over-prediction and under-prediction respectively.
Remark: Note that the outputs of the redefined TP, FP, and FN are equivalent to those of their standard definitions assuming binary x and y.

We call this method a Count-Preserving Hierarchical Evaluation (CoPHE).

Figure 4: A comparison between predictions and gold standard showing the vector interpretation in CoPHE. Two phenomena of the non-binary ancestor evaluation are on display: (1) While there is a mismatch on the leaf level in the 401 family (401.1 predicted versus 401.9 expected), once translated to the direct ancestor level (401), the prediction and the true label both map to 401, allowing for a match on this level. (2) For parent 402 there are two leaves predicted while one is expected. This puts us in the non-binary scenario, with TP = 1, FP = 1, FN = 0. As with (1), on this ancestor level it is not the match between the leaves that matters, but how many times the ancestor is involved; in this case the ancestor (402.0) is over-predicted by 1.
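The following sketch implements these per-family counts directly from the definitions above. It is an illustration rather than the released implementation, and the one-parent, three-leaf family is made up; the toy usage mirrors the 402 family from Figure 4, where two predicted leaves against one expected leaf give TP = 1, FP = 1, FN = 0 on the ancestor level.

```python
# Minimal sketch of CoPHE-style count-preserving ancestor evaluation.
import numpy as np

def cophe_counts(leaf_pred, leaf_gold, ancestor_matrix):
    """Per-document TP/FP/FN over ancestor count vectors (natural numbers)."""
    x = leaf_pred @ ancestor_matrix   # predicted codes per ancestor family
    y = leaf_gold @ ancestor_matrix   # gold-standard codes per ancestor family
    tp = np.minimum(x, y).sum()       # numeric overlap
    fp = np.maximum(x - y, 0).sum()   # over-prediction
    fn = np.maximum(y - x, 0).sum()   # under-prediction
    return int(tp), int(fp), int(fn)

# Toy family: three sibling leaves sharing a single parent, as in the 402 example.
ancestor_matrix = np.array([[1], [1], [1]])
leaf_pred = np.array([1, 1, 0])   # two leaves predicted
leaf_gold = np.array([0, 0, 1])   # one leaf expected
print(cophe_counts(leaf_pred, leaf_gold, ancestor_matrix))   # (1, 1, 0)
```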

CoPHE is not meant to be used as a replacement for the existing metrics, but rather in tandem with them. In general, hierarchical metrics (set-based and CoPHE) are expected to produce scores that mitigate mismatches on the code level. It is also informative to compare set-based hierarchical results to those of CoPHE. Assuming no over-/under-prediction (which is not captured by the set-based metric) takes place, the FN and FP values for CoPHE will stay the same as for the set-based metric, with TP being greater than or equal to that of the set-based metric. This leads to CoPHE Precision and Recall (and consequently the F1 score) being higher than or equal to those of set-based hierarchical evaluation. Should CoPHE results be lower than those of set-based hierarchical evaluation, this is an indication that over-/under-prediction is taking place.

We have developed CoPHE for ICD-9 coding and made the code publicly available on GitHub. The approach can be adjusted to any label space with an acyclic graph structure. For further details, including results of prior-art models on MIMIC-III, please consult the publication.

References

[1] Falis, Matúš, et al. “CoPHE: A Count-Preserving Hierarchical Evaluation Metric in Large-Scale Multi-Label Text Classification.” 2021 Conference on Empirical Methods in Natural Language Processing. 2021.
[2] Johnson, Alistair EW, et al. “MIMIC-III, a freely accessible critical care database.” Scientific Data 3.1 (2016): 1-9.
[3] Maynard, Diana, Wim Peters, and Yaoyong Li. “Evaluating Evaluation Metrics for Ontology-Based Applications: Infinite Reflection.” LREC. 2008.
[4] Kosmopoulos, Aris, et al. “Evaluation measures for hierarchical classification: a unified view and novel approaches.” Data Mining and Knowledge Discovery 29.3 (2015): 820-865.

Edinburgh Geoparser Back On The Map

Photo credit: Timo Wielink on Unsplash

We are delighted to announce the release of the new version (v1.2) of the Edinburgh Geoparser, a tool to geoparse contemporary English text. Most importantly, it now comes with a new map display using OpenStreetMap.

We have also made a few fixes so that it runs on the latest versions of macOS and have added instructions on how to visualise the timeline display in different browsers.

The new geoparser also incorporates a gazetteer lookup that’s now supported by the University of Edinburgh Digital Library team.  We continue to support queries to all gazetteers that were distributed with the previous release of the geoparser (see the list here).

We are a small research team, so updating this technology regularly can be challenging, but we hope that with this new release the Edinburgh Geoparser will continue to be useful for place-based research and teaching. More information on how to download the new version and on its updated documentation can be found here.

Reflections on my first PhD Publication at the Second Workshop on Gender Bias in Natural Language Processing

This post originally appeared on Lucy Havens’ blog on 10 February 2021.

This winter (December 2020), I published a new research methodology for Natural Language Processing (NLP) researchers to consider, which I refer to as a bias-aware methodology. Earlier in the year, a couple months into my PhD research on using NLP to detect biases in language, I’d been relieved to see Blodgett et al.’s ‘Critical Survey’ confirm what I’d begun to suspect: NLP bias research was missing the human element.  As a researcher new to the NLP domain, I’d been shifting between frustration with the vagueness of existing NLP bias research and doubt in my own understanding.  Soon after reading the Survey, I came across Kate Crawford’s 2017 keynote, The Trouble with Bias.  Both the Survey and keynote discuss the harmful consequences of siloed technology research, and they both call for interdisciplinary and stakeholder collaboration throughout the development of technology systems.  The Survey was published three years after the keynote.  Why was there still a need to make the same calls?

I realized that, although there was a wealth of evidence supporting the need for interdisciplinary and stakeholder collaboration, there wasn’t guidance on how to go about engaging in such collaboration. Drawing on my background working at the intersection of multiple disciplines, I went to work creating a new methodology that would outline how to collaborate across disciplines and with system stakeholders. Though my work and studies have fallen under many different names (to name a few: Information Systems, Human-Computer Interaction, Customer Experience, Design Informatics), I consistently situate myself in the same sort of place: at the intersection of groups of people who do not typically work together. I enjoy adapting the tools of one discipline to another to enable new types of research questions to be asked and new insights to be discovered. To adapt one discipline’s tools for another, I listen closely to how people communicate, adopting distinct vocabularies and presentation styles depending on my audience. I employ human-centered design methods, observing and interviewing, even if only informally, to gather information about the goals and concerns of my collaborators. As those involved in anything participatory, user-centered, or customer experience-related have likely experienced, once you’re exposed to the methods, it’s difficult to stop yourself from seeing everything through a human-centered design lens. So, my PhD was inevitably going to include some form of human-centered design.

In the new methodology I propose with my co-authors in Situated Data, Situated Systems: A Methodology to Engage with Power Relations in Natural Language Processing Research, I’ve embedded interdisciplinary concepts and practices into three activities for researchers to execute in parallel: (1) Examining Power Relations, (2) Explaining the Bias of Focus, and (3) Applying NLP Methods.  The practice of participatory action research, which plays a part in all three activities, embeds stakeholder collaboration into the methodology as well.  I’m in the process of executing these three activities during my PhD research, so I will certainly refine the methodology over time (I’d also love feedback on how it suits your work and how you’d adjust it!).  That being said, the methodology does provide a starting point for all types of NLP research and development, facilitating critical reflection on power relations and their resulting biases that impact all NLP datasets and systems.  If your dataset or system has a huge community of potential stakeholders, the methodology asks you to make decisions based on the people at the margins of that stakeholder community, assembling as diverse a group of people as possible with whom you can collaborate.  If your project timeline does not allow adequate time for stakeholder collaboration, the methodology asks you to be detailed in the documentation of your work, stating the time, place and people that make up your project context, and the power relations between people in your project context.

NLP uses human language as a data source, meaning NLP datasets are inherently biased, and NLP systems built on those datasets are inherently biased.  Everyone has a unique combination of experiences that give them a particular perspective, or bias, and this isn’t necessarily a bad thing.  The problems arise when a particular perspective is presented as universal or neutral.  If we identify which perspectives are present in our research and, to the best of our ability, which perspectives are absent, we can help people who visit our work realize how they should adapt it to suit their context.  Adopting the bias-aware methodology requires a mindset shift, where the human element has just as much weight as the technological element.   We must set project timelines and funding models that allow for collaboration with adequately diverse groups of people. 

For more on why and how to use a bias-aware NLP research methodology, check out the published paper in the ACL Anthology or read the preprint on ArXiv! 

Citation:

Havens, Lucy, Melissa Terras, Benjamin Bach, and Beatrice Alex. 2020. “Situated Data, Situated Systems: A Methodology to Engage with Power Relations in Natural Language Processing Research.” Proceedings of the Second Workshop on Gender Bias in Natural Language Processing. Barcelona, Spain (Online), December 13, 2020, pp. 107-124. Association for Computational Linguistics. Available: https://www.aclweb.org/anthology/2020.gebnlp-1.10

By Lucy Havens

EdIE, a suite of Information Extraction tools for stroke related phenotypes

We are delighted to announce that our tools for Information Extraction from radiology reports are now available on GitHub and as a web demo online! Our language processing tools work on clinical text from brain radiology reports to extract information on stroke-related diseases. See a brief introduction to our online demo here.

EdIE-Viz demo

We release these tools in the hope that they will help speed up and improve clinical NLP research in this area by bringing the tools to the data, as in our experience assuming any form of patient data mobility is unrealistic.

We are very happy to collaborate with other researchers on clinical NLP projects, please get in touch for more information. Moreover, if you have any feedback or comments on our tools we would also be very happy to hear from you.

Our LOUHI 2020 workshop paper

The aforementioned tools accompany our workshop paper Not a cute stroke: Analysis of Rule- and Neural Network-based Information Extraction Systems for Brain Radiology Reports (Grivas et al., 2020) that we presented at LOUHI 2020. The recorded presentations will be available soon.

In the paper we discuss our insights from applying information extraction to brain radiology reports with the following three conclusions:

  1. The choice of metric matters and depends on your downstream task, as caveated previously in this blog post by Chris Manning. Directly optimising models for the F1 score (the harmonic mean of precision and recall) when doing named entity recognition (NER) has unintended consequences, such as preferring models that predict fewer entities. This can be seen when breaking down the F1 score into fine-grained error types (see the toy sketch after this list).
  2. When relying on semi-automatic annotations, neural language models trained on large corpora can be used to check for annotation inconsistencies and to gain more insight into your annotated data.
  3. Applying negation detection learned by neural models to new, external out-of-sample datasets leads to degraded performance and is still challenging, even in the era of large pre-trained language models.
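To make point (1) concrete, here is a toy sketch with made-up counts (the numbers are not from the paper): because F1 is the harmonic mean of precision and recall, a conservative model that predicts fewer entities, and therefore misses more gold entities, can still score a higher F1 than a model that recovers more entities at lower precision.

```python
# Toy illustration of how optimising F1 can favour models that predict fewer
# entities. Both hypothetical models are scored against 100 gold entities.
def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

liberal = f1(tp=80, fp=70, fn=20)      # predicts 150 entities, 80 correct -> F1 ~ 0.64
conservative = f1(tp=55, fp=5, fn=45)  # predicts 60 entities, 55 correct  -> F1 ~ 0.69
print(liberal, conservative)
```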

Questions from the audience

What is the reason for the rule-based system performing so well?

There is a lot of generalisable linguistic knowledge as well as medical expert knowledge incorporated in the rules that were developed by Claire Grover after discussions with medical experts. In addition, some mentions are generally straightforward to detect due to the specific vocabulary used. Moreover, the semi-automatic nature of the annotations also further contributes to the increase in scores.

Would you have any thoughts on metrics used for NER?

In the past, ACE and MUC have been used as fuzzier metrics than the strict metric of the CoNLL scorer, but the answer really depends on what our goal is. For hierarchical and fine-grained entity recognition, for example, metrics that take the hierarchy into account are likely more suitable. Overall, we highlighted that it is beneficial and insightful to break down errors into semantically meaningful groups, especially when comparing models. David S. Batista has a great introduction and discussion of metrics used for NER.

Other papers and talks @LOUHI2020

LOUHI had a lot of interesting related talks, a selection of which we briefly summarise below:

  • Hercules Dalianis presented his group’s work on de-identification strategies and their effect on downstream named entity recognition, published in The Impact of De-identification on Downstream Named Entity Recognition in Clinical Text (Berg et al., 2020). This is important as decisions taken at the anonymisation stage of clinical text (which is not usually done by researchers) may lead to unforeseen effects at the NLP stage.
  • Maciej Wiatrak presented a comparison of multi-task neural network architectures for entity linking published in Simple Hierarchical Multi-Task Neural End-To-End Entity Linking for Biomedical Text (Wiatrak and Iso-Sipila, 2020)
  • Aditya Khandelwal‘s presentation was on negation and speculation cue and scope detection as multi-task learning published in Multitask Learning of Negation and Speculation using Transformers (Khandelwal and Britto, 2020).
  • Florian Borchert presented work on their corpus of German clinical guidelines, which includes structural metadata, published in GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines (Borchert et al., 2020). This work is of interest to us since we work with clinical guidelines in English for a Clinical Guidelines Browser application.
  • Minghao Zhu presented experiments classifying social media posts reflecting the personal experience of the poster published in Identifying Personal Experience Tweets of Medication Effects Using Pre-trained RoBERTa Language Model and Its Updating (Zhu et al., 2020)
  • Guergana Savova, the keynote speaker, presented an interesting summary of work in this field titled Clinical NLP, some tasks and applications in medicine, and many of the points she raised rang true to us.

Group thanks and funding

We thank our wider collaborators in the Edinburgh Clinical NLP Group for their help and feedback on this work. We are also very grateful for the funding that made this work possible: the MRC Mental Health Data Pathfinder Award (MRC – MCPC17209), the Alan Turing Institute fellowships and project (EPSRC grant EP/N510129/1), the MRC Clinician Scientist Award (G0902303) and the Scottish Senior Clinical Fellowship (CAF/17/01).

Andreas Grivas, Edinburgh, 18/12/2020