Happy Geoparsing: The Edinburgh Geoparser v1.3 is out

Photo credit: Markus Winkler (Unsplash)

New Release

We have released version 1.3 of the Edinburgh Geoparser and updated the accompanying lesson on the Programming Historian. The Geoparser now runs with a free OpenStreetMap visualisation by default. Anita Hawes, Publishing Assistant at Programming Historian, recently made us aware that users of the Geoparser who followed our lesson were asked to enter credit card details when creating the Mapbox key required for the map visualisation. We want our language technology to be open and free, so we reacted quickly to fix that.

We have now changed the Geoparser’s visualisation component to use OpenStreetMap tiles by default. OpenStreetMap tiles can be used for light use free of charge (and without signing up to anything) in accordance with their Tile Usage Policy.

If you have a Mapbox account, you can continue to use it with the Geoparser by setting the GEOPARSER_MAP_KEY environment variable as before. Be aware, however, that Mapbox may charge you if you have given them a credit card number and exceed their limits on free use.
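For example, in a bash-like shell the variable can be set as follows (the token value below is a placeholder, not a real Mapbox key):

```shell
# Optional: keep using Mapbox tiles by exporting your token before
# running the Geoparser. "pk.your-token-here" is a placeholder -
# substitute your own Mapbox access token.
export GEOPARSER_MAP_KEY="pk.your-token-here"
```

Leaving the variable unset falls back to the default OpenStreetMap tiles.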

This is the only change we made in v1.3 compared to v1.2. If you don’t use the visualisation component, there is no need to update.

Figure 1: Examples of some geo-parsed exonyms (Vienna for Wien, Munich for München, Copenhagen for København, Venice for Venezia, Milan for Milano and Florence for Firenze).

Watch out for Exonyms

An exonym is a name used in one language for a place that has a different name in the local language, like Munich for München. The main disadvantage of using OpenStreetMap tiles (from the point of view of an English-language geoparser) is that they generally display place names in the language of the area or country rather than in English. This is a problem for exonyms, as a place name on the map might not match the name in the text. Despite this mismatch, it is fascinating to see how place names vary across languages. For example, check out the place name for Hungary:

To help you track locations, the Geoparser visualisation centres the map on the corresponding pin when you click on a recognised place name highlighted in the text, and displays the recognised place name when you hover over the pin (see Figure 1).

Happy New Geoparsing!

Volunteer to Help Save Ukrainian Cultural Heritage Online (SUCHO)

Here is an urgent message from Prof Melissa Terras on how to help preserve Ukrainian Cultural Heritage … please spread the word.

Dear Colleagues,

Trusted friends of mine have set up Save Ukrainian Cultural Heritage Online (SUCHO): https://www.sucho.org

They are asking for volunteers to help identify and archive sites and content, while they are still online. You do not have to read Ukrainian or Russian to help. 

You can submit items to be saved: https://docs.google.com/forms/d/e/1FAIpQLSffa64-l6qXqEumAcf38OEOrTFeYZEmF531PNv9ZgzNFbcgxQ/viewform

You can also volunteer to help put things in the Internet Archive, or use more advanced archiving software: https://docs.google.com/forms/d/e/1FAIpQLSc6KbhtEOI8zKsQmKT_waE1XlYEF1E6t-HzJ7Gc1EBfMvMg_A/viewform

Please do share with colleagues, and your students, and your networks. It’s one concrete thing we can do to help Ukraine, from afar.

You may also be interested in following the work of the Ukrainian Library Association, who are coordinating a National Digital Library in Ukraine: https://www.facebook.com/ula.org.ua/

Best wishes, and I hope you are doing ok at this difficult time.
Professor Melissa Terras
University of Edinburgh, College of Arts, Humanities and Social Sciences

Blood Is Thicker Than Water

A Hierarchical Evaluation Metric for Document Classification

This blog post serves as an introduction to the methods described in the paper CoPHE: A Count-Preserving Hierarchical Evaluation Metric in Large-Scale Multi-Label Text Classification [1] presented at EMNLP 2021. For a more detailed description, along with comparison against previous evaluation metrics in this setting, please refer to the full publication.


Evaluation in large-scale multi-label text classification, such as automated ICD-9 coding of discharge summaries in MIMIC-III [2], is treated in prior work as exact-match evaluation at the code level. Labelling is carried out at the document level (weak labelling), with each code appearing at most once per document; hence the prediction and gold standard for a document can be viewed as sets. The label space of MIMIC-III consists of leaf nodes within the ICD-9 tree (an example substructure is shown in Figure 1), and both the prediction and the gold standard are treated as flat, disregarding the ontological structure.

Figure 1: Example subgraph of the code 364 within the ICD-9 ontology. Leaf (prediction-level) nodes are represented with circular nodes, ancestor nodes are rectangular.

Within a structured label space, the concept of distance between labels naturally arises. If, for instance, we consider each edge within the ontology to be of equal weight, the code 410.01 Acute myocardial infarction of anterolateral wall, initial episode of care is closer to its sibling code 410.02 Acute myocardial infarction of anterolateral wall, subsequent episode of care than to a cousin code 410.11 Acute myocardial infarction of other anterior wall, initial episode of care, or to a more distantly related code, e.g., 401.9 Unspecified essential hypertension. The standard flat evaluation does not capture this: if the code 410.01 were mispredicted as any other code, the flat exact-match approach would treat all such errors as equivalent, disregarding how close the prediction is to the gold standard.

Previous work has incorporated the structural nature of the label space of ICD ontologies, such as [3]. That study, however, concerns a different task: information extraction with strong labels, where ICD codes are assigned to specific spans of text within the document. Strong labelling allows a prediction to be associated with a gold standard label and compared exactly on a case-by-case basis. This is, unfortunately, not possible in the weakly-labelled scenario of document-level ICD coding, where, if a label is mispredicted, we cannot state its corresponding gold standard label with certainty.

One of the approaches to creating a metric for the structured label space in [3] traces the distance on the ontology graph (a tree) between the closest common parent and either the prediction or the gold standard. We cannot reuse this method exactly, as we lack the knowledge of which gold standard codes correspond to which mispredictions. Instead, we make use of common ancestors.

Hierarchical Evaluation

One way to approach hierarchical evaluation in a weakly-labelled scenario is to evaluate not only the leaf-level predictions but also the codes’ ancestors. We can convert leaf-level predictions into ancestor predictions (e.g., by means of adjacency matrices) and compare those against their respective converted gold standard (Figure 2). The core idea is that codes appearing closer together within the ontology share more ancestors, thereby mitigating the error that arises from misclassification.

Figure 2: A conversion from leaf-level to parent for both the prediction vector and the gold standard label vector. A similar conversion can be done for at least one more (grandparent) level.
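As a minimal sketch of this conversion, a dictionary lookup stands in here for the adjacency-matrix multiplication; the leaf codes and their grouping under parents are illustrative, not the full ICD-9 hierarchy:

```python
# Toy leaf codes and their parents (illustrative grouping).
PARENT_OF = {"410.01": "410.0", "410.02": "410.0",
             "410.11": "410.1", "401.9": "401"}
PARENTS = ["410.0", "410.1", "401"]

def to_ancestor_counts(leaf_vector, leaves):
    """Convert a binary leaf-level vector into parent-level values:
    each parent's value is the number of its positive leaves."""
    counts = {p: 0 for p in PARENTS}
    for value, leaf in zip(leaf_vector, leaves):
        counts[PARENT_OF[leaf]] += value
    return [counts[p] for p in PARENTS]

leaves = ["410.01", "410.02", "410.11", "401.9"]
pred = [1, 1, 0, 0]   # binary leaf-level prediction vector
gold = [0, 1, 0, 1]   # binary leaf-level gold-standard vector

print(to_ancestor_counts(pred, leaves))  # [2, 0, 0]
print(to_ancestor_counts(gold, leaves))  # [1, 0, 1]
```

Clipping the resulting counts to {0, 1} gives the binary (set-based) ancestor view discussed below; keeping the raw counts gives the count-preserving view.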

Once we have the ancestor-level values we can either report separate metrics for each level of the ontology, or a single metric on the combined information from different levels.

Figure 3: A comparison between predictions and gold standard. Ancestor vectors are concatenated with leaf vectors.

As prediction-level codes in MIMIC-III appear at different depths (as seen in Figure 1), it is reasonable to report performance based on different depths of the ontology. Depending on the implementation of the transition procedure, duplicates may appear.

The example presented in Figure 3 is neat in that at most one prediction is made for each of the shown code families on the leaf level. What about multiple predictions within the same family? One option is to stick to binary values, as on the prediction level: if at least one prediction-level node within the family is positive, the family is considered positive (1), and negative (0) otherwise. The value of an ancestor node is thus the result of a logical OR over its descendants. Standard P/R/F1 can then be applied without further processing. Such an approach to hierarchical evaluation in multi-label classification was presented by Kosmopoulos et al. [4].

Count-Preserving Hierarchical Evaluation (CoPHE)

Alternatively, we can extend to full counts, in which case each family value is the sum of the values of the family’s prediction-level codes. This results in ancestor values in the domain of natural numbers. Standard binary P/R/F1 do not work in this case (as TP, FP, and FN are defined for binary input), but we retain more information that can tell us about over- or under-prediction for the ancestor codes. Why is this important? The ontology may contain inexplicit rules, such as some code families allowing only a single code per document – e.g., 401 (Essential Hypertension) has three descendant codes corresponding to malignant, benign, and unspecified hypertension respectively. From a logical standpoint, hypertension can be either malignant or benign, but not both at the same time, and would be considered unspecified only if it was stated to be present but not specified as malignant or benign.

Back to TP, FP, FN. We are dealing with vectors consisting of natural numbers now, rather than binary vectors. Hence we need to redefine these metrics.

Let x be the number of predicted codes within a certain code family f for a document d. Let y be the number of true gold standard codes within the same code family f for a document d.

TP(d, f) = min(x, y)
FP(d, f) = max(x - y, 0)
FN(d, f) = max(y - x, 0)

Where min and max are functions returning the minimum and maximum between two input numbers respectively.

TP represents the numeric overlap, FP and FN represent over-prediction and under-prediction respectively.
Remark: Note that the outputs of the redefined TP, FP, and FN are equivalent to those of their standard definitions assuming binary x and y.

We call this method a Count-Preserving Hierarchical Evaluation (CoPHE).
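The redefined counts translate directly into code; a minimal sketch, with the per-family counts x and y passed in as plain integers:

```python
def cophe_counts(x, y):
    """Count-preserving TP/FP/FN for one code family in one document:
    x = number of predicted codes in the family,
    y = number of gold-standard codes in the family."""
    tp = min(x, y)      # numeric overlap
    fp = max(x - y, 0)  # over-prediction
    fn = max(y - x, 0)  # under-prediction
    return tp, fp, fn

# With binary x and y, the counts reduce to the standard definitions:
assert cophe_counts(1, 1) == (1, 0, 0)  # true positive
assert cophe_counts(1, 0) == (0, 1, 0)  # false positive
assert cophe_counts(0, 1) == (0, 0, 1)  # false negative

# Non-binary case (as in Figure 4's 402 family: two leaves predicted,
# one expected):
print(cophe_counts(2, 1))  # (1, 1, 0): over-predicted by 1
```

Summing these per-family counts over all documents and families yields micro-averaged precision, recall, and F1 in the usual way.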

Figure 4: A comparison between predictions and gold standard showing the vector interpretation in CoPHE. Two phenomena of the non-binary ancestor evaluation are on display: (1) While there is a mismatch on the leaf level in the 401 family (401.1 predicted versus 401.9 expected), on the direct ancestor level both the prediction and the true label map to 401, allowing for a match on this level. (2) For parent 402, two leaves are predicted while one is expected. This puts us in the non-binary scenario, with TP = 1, FP = 1, FN = 0. As in (1), on this ancestor level the match between the leaves does not matter, only how many times the ancestor is involved. In this case the ancestor (402.0) is over-predicted by 1.

CoPHE is not meant to replace existing metrics, but to be used in tandem with them. In general, hierarchical metrics (both set-based and CoPHE) are expected to produce scores that mitigate mismatches on the code level. It is also informative to compare set-based hierarchical results with those of CoPHE. If no over- or under-prediction (which the set-based metric does not capture) takes place, the FN and FP values for CoPHE stay the same as for the set-based metric, with TP greater than or equal to the set-based value. This leads to CoPHE precision and recall (and consequently F1 score) higher than or equal to those of set-based hierarchical evaluation. Should CoPHE results be lower than the set-based ones, this indicates that over- or under-prediction is taking place.

We have developed CoPHE for ICD-9 coding and made the code publicly available on GitHub. The approach can be adapted to any label space with an acyclic graph structure. For further details, including results of prior-art models on MIMIC-III, please consult the publication.


[1] Falis, Matúš, et al. “CoPHE: A Count-Preserving Hierarchical Evaluation Metric in Large-Scale Multi-Label Text Classification.” 2021 Conference on Empirical Methods in Natural Language Processing. 2021.
[2] Johnson, Alistair EW, et al. “MIMIC-III, a freely accessible critical care database.” Scientific Data 3.1 (2016): 1-9.
[3] Maynard, Diana, Wim Peters, and Yaoyong Li. “Evaluating Evaluation Metrics for Ontology-Based Applications: Infinite Reflection.” LREC. 2008.
[4] Kosmopoulos, Aris, et al. “Evaluation measures for hierarchical classification: a unified view and novel approaches.” Data Mining and Knowledge Discovery 29.3 (2015): 820-865.

Edinburgh Geoparser Back On The Map

Photo credit: Timo Wielink on Unsplash

We are delighted to announce the release of the new version (v1.2) of the Edinburgh Geoparser, a tool to geoparse contemporary English text. Most importantly, it now comes with a new map display using OpenStreetMap.

We also made a few fixes so that it runs on the latest versions of macOS, and have added instructions on how to visualise the timeline display in different browsers.

The new geoparser also incorporates a gazetteer lookup that’s now supported by the University of Edinburgh Digital Library team.  We continue to support queries to all gazetteers that were distributed with the previous release of the geoparser (see the list here).

We are a small research team, so updating this technology regularly can be challenging, but we hope that with this new release the Edinburgh Geoparser will continue to be useful for place-based research and teaching. More information on how to download the new version and on its updated documentation can be found here.