We seek to quantify the mortality risk associated with mentions of medical concepts in textual electronic health records (EHRs). Recognising mentions of named entities of relevant types (e.g. conditions, symptoms, laboratory tests or behaviours) in text is a well-researched task. However, the level of risk associated with a mention depends in part on the textual context in which it appears, which may describe severity, temporal aspects, quantity, etc.
Because a given word can contribute differently to risk level depending on which risk factor (medical concept) it appears alongside, we propose a multi-task approach, called context-aware linear modelling (CALM), which can be implemented using appropriately regularised linear regression. To improve performance on risk factors unseen in the training data, e.g. rare diseases, we take into account their distributional similarity to other concepts.
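The multi-task idea can be illustrated with a minimal sketch: each concept (task) gets its own weight vector over context features, regularised towards a shared mean so tasks borrow strength, and an unseen concept backs off to a similarity-weighted mix of seen concepts. Everything below (concept names, the alternating solver, the similarity scores) is a hypothetical illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3 concepts (tasks), 5 context features each.
# Per-task true weights are perturbations of a shared vector,
# mimicking context words whose effect varies by concept.
n_feat = 5
true_shared = rng.normal(size=n_feat)
data = {}
for concept in ["sepsis", "fever", "cough"]:
    X = rng.normal(size=(20, n_feat))
    w_true = true_shared + 0.3 * rng.normal(size=n_feat)
    data[concept] = (X, X @ w_true + 0.1 * rng.normal(size=20))

def fit_multitask(data, lam_task=1.0, lam_share=1.0, n_iter=50):
    """Alternately solve each per-task ridge problem while
    shrinking task weights towards the current shared mean."""
    w = {t: np.zeros(X.shape[1]) for t, (X, _) in data.items()}
    for _ in range(n_iter):
        w_bar = np.mean(list(w.values()), axis=0)
        for t, (X, y) in data.items():
            # Closed-form ridge solution with an extra penalty
            # pulling w_t towards the shared mean w_bar.
            A = X.T @ X + (lam_task + lam_share) * np.eye(X.shape[1])
            b = X.T @ y + lam_share * w_bar
            w[t] = np.linalg.solve(A, b)
    return w

weights = fit_multitask(data)

# Unseen concept: fall back to a similarity-weighted combination of
# trained weights (scores here are made-up placeholders standing in
# for distributional similarities).
sims = {"sepsis": 0.7, "fever": 0.2, "cough": 0.1}
w_unseen = sum(s * weights[t] for t, s in sims.items())
```

The shared-mean penalty is one simple instantiation of multi-task regularisation; the paper's CALM formulation may differ in how tasks are coupled.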
The evaluation is based on a corpus of 531 reports from EHRs with 99,376 risk factors rated manually by experts. While CALM significantly outperforms single-task models, taking concept similarity into account further improves performance, reaching the level of inter-annotator agreement.
Our results show that automatic quantification of risk factors in EHRs can achieve performance comparable to human assessment, and that modelling the multi-task structure of the problem and handling rare concepts are both crucial for accuracy.