Consistency of variety of machine learning and statistical models in predicting clinical risks of individual patients: longitudinal cohort study using cardiovascular disease as exemplar

Citation formats

Bibtex

@article{fd342e25d57e49f18ca1993d1be55325,
title = "Consistency of variety of machine learning and statistical models in predicting clinical risks of individual patients: longitudinal cohort study using cardiovascular disease as exemplar",
abstract = "Objective To assess the consistency of machine learning and statistical techniques in predicting individual level and population level risks of cardiovascular disease and the effects of censoring on risk predictions. Design Longitudinal cohort study from 1 January 1998 to 31 December 2018. Setting and participants 3.6 million patients from the Clinical Practice Research Datalink registered at 391 general practices in England with linked hospital admission and mortality records. Main outcome measures Model performance including discrimination, calibration, and consistency of individual risk prediction for the same patients among models with comparable model performance. 19 different prediction techniques were applied, including 12 families of machine learning models (grid searched for best models), three Cox proportional hazards models (local fitted, QRISK3, and Framingham), three parametric survival models, and one logistic model. Results The various models had similar population level performance (C statistics of about 0.87 and similar calibration). However, the predictions for individual risks of cardiovascular disease varied widely between and within different types of machine learning and statistical models, especially in patients with higher risks. A patient with a risk of 9.5-10.5% predicted by QRISK3 had a risk of 2.9-9.2% in a random forest and 2.4-7.2% in a neural network. The differences in predicted risks between QRISK3 and a neural network ranged between –23.2% and 0.1% (95% range). Models that ignored censoring (that is, assumed censored patients to be event free) substantially underestimated risk of cardiovascular disease. Of the 223 815 patients with a cardiovascular disease risk above 7.5% with QRISK3, 57.8% would be reclassified below 7.5% when using another model. Conclusions A variety of models predicted risks for the same patients very differently despite similar model performances. The logistic models and commonly used machine learning models should not be directly applied to the prediction of long term risks without considering censoring. Survival models that consider censoring and that are explainable, such as QRISK3, are preferable. The level of consistency within and between models should be routinely assessed before they are used for clinical decision making.",
keywords = "Adult, Calibration, Cardiovascular Diseases/epidemiology, Decision Making, England/epidemiology, Female, Humans, Longitudinal Studies, Machine Learning, Male, Medical Record Linkage, Middle Aged, Models, Statistical, Predictive Value of Tests, Risk Assessment/methods, Survival Analysis",
author = "Li, Yan and Sperrin, Matthew and Ashcroft, {Darren M} and {Van Staa}, {Tjeerd Pieter}",
year = "2020",
month = nov,
day = "4",
doi = "10.1136/bmj.m3919",
language = "English",
volume = "371",
journal = "British Medical Journal",
issn = "0959-535X",
publisher = "BMJ",

}

RIS

TY - JOUR

T1 - Consistency of variety of machine learning and statistical models in predicting clinical risks of individual patients: longitudinal cohort study using cardiovascular disease as exemplar

AU - Li, Yan

AU - Sperrin, Matthew

AU - Ashcroft, Darren M

AU - Van Staa, Tjeerd Pieter

PY - 2020/11/4

Y1 - 2020/11/4

AB - Objective To assess the consistency of machine learning and statistical techniques in predicting individual level and population level risks of cardiovascular disease and the effects of censoring on risk predictions. Design Longitudinal cohort study from 1 January 1998 to 31 December 2018. Setting and participants 3.6 million patients from the Clinical Practice Research Datalink registered at 391 general practices in England with linked hospital admission and mortality records. Main outcome measures Model performance including discrimination, calibration, and consistency of individual risk prediction for the same patients among models with comparable model performance. 19 different prediction techniques were applied, including 12 families of machine learning models (grid searched for best models), three Cox proportional hazards models (local fitted, QRISK3, and Framingham), three parametric survival models, and one logistic model. Results The various models had similar population level performance (C statistics of about 0.87 and similar calibration). However, the predictions for individual risks of cardiovascular disease varied widely between and within different types of machine learning and statistical models, especially in patients with higher risks. A patient with a risk of 9.5-10.5% predicted by QRISK3 had a risk of 2.9-9.2% in a random forest and 2.4-7.2% in a neural network. The differences in predicted risks between QRISK3 and a neural network ranged between –23.2% and 0.1% (95% range). Models that ignored censoring (that is, assumed censored patients to be event free) substantially underestimated risk of cardiovascular disease. Of the 223 815 patients with a cardiovascular disease risk above 7.5% with QRISK3, 57.8% would be reclassified below 7.5% when using another model. Conclusions A variety of models predicted risks for the same patients very differently despite similar model performances. The logistic models and commonly used machine learning models should not be directly applied to the prediction of long term risks without considering censoring. Survival models that consider censoring and that are explainable, such as QRISK3, are preferable. The level of consistency within and between models should be routinely assessed before they are used for clinical decision making.

KW - Adult

KW - Calibration

KW - Cardiovascular Diseases/epidemiology

KW - Decision Making

KW - England/epidemiology

KW - Female

KW - Humans

KW - Longitudinal Studies

KW - Machine Learning

KW - Male

KW - Medical Record Linkage

KW - Middle Aged

KW - Models, Statistical

KW - Predictive Value of Tests

KW - Risk Assessment/methods

KW - Survival Analysis

U2 - 10.1136/bmj.m3919

DO - 10.1136/bmj.m3919

M3 - Article

C2 - 33148619

VL - 371

JO - British Medical Journal

JF - British Medical Journal

SN - 0959-535X

M1 - m3919

ER -