COMPUTATIONAL STRATEGIES FOR IDENTIFYING RISKS OF FALLS FROM LARGE-SCALE DATASETS

UoM administered thesis: PhD

  • Author: Muhannad Almohaimeed

Abstract

Patient records are a valuable resource that can serve many purposes, including advancing clinical research and developing future health initiatives. The number of resources that provide access to patient data generated within primary care settings has recently increased significantly. Medical databases, such as the Clinical Practice Research Datalink (CPRD), were established to make access to and use of patient records in research more efficient. As patient data become more available to researchers, they have the potential to drive further advances in the medical field. In interrogating such data, researchers normally use specific hypotheses defined a priori. However, other unexplored, yet equally important, areas may exist beyond these structured hypotheses, so this practice may reduce the impact of the data by inadvertently missing relevant hypotheses. Advanced methods for generating hypotheses are therefore vital to open such unexplored areas for research. This should help to increase the effectiveness of studies in elucidating underlying disease processes and the potential outcomes of healthcare interventions. There is considerable interest in applying machine learning and data mining techniques to discover patterns and correlations in large-scale datasets. Analysis of patient record data using such techniques could improve our understanding of the information included in these records. However, the representation of the data, as bags of diagnostic codes drawn from medical terminologies, is not amenable to the direct application of existing machine learning or data mining tools. The size of the datasets can also be an obstacle to applying such tools. To address these issues, this thesis undertakes the challenge of developing strategies for data exploration and hypothesis formulation from such resources.
This thesis describes novel strategies that map electronic patient records from CPRD datasets into appropriate and informative low-dimensional vector spaces. Considering the high dimensionality and complexity of the data, we characterise the problem by developing a method that determines a minimal subset of patients that adequately represents the whole dataset. The methodology then uses this subset and applies analysis of semantic similarity between patients, principal component analysis (PCA) and clustering to provide an effective representation of patient records. This helps to identify unusual and interesting patterns at the subgroup level, so that the different risks associated with different groups can be investigated. In the final stage, a two-hit based strategy, which incorporates a temporal dimension, is used to explore the combinations of risk factors most associated with increased risk of disease. The usefulness of this methodology was demonstrated mainly by assessing risk factors related to falls in older adults as an exemplar case study. Patients experiencing falls provided a particularly rich dataset for this study. A fall is a relatively common diagnosis in a population that has significant contact with general practitioners (older patients), making it a suitable case study to test the methodology. A fundamental outcome of this study was the wealth of data that could be produced from such analysis. The machine learning strategy for agnostic hypothesis formulation successfully identified new hypotheses for factors associated with falls. These hypotheses were then examined against existing literature and expert review to determine whether such data-driven methodologies can provide new insights into risk factors associated with falls.
In conclusion, we suggest that the strategies developed in this work would make the available patient record data more amenable to analysis through traditional data mining strategies, and would provide a much more straightforward environment for agnostic hypothesis generation in this area of research. Furthermore, no aspect of these strategies restricts them to falls. The same methodology could equally be applied to many other disease states extracted from any medical database that uses terms from taxonomies or ontologies.
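
As a purely illustrative sketch of two ideas mentioned in the abstract, the Python below treats each patient as a bag of codes and (a) scores patient similarity by set overlap and (b) counts ordered pairs of codes recorded before a first fall. It is not the thesis's implementation: the abstract does not specify the similarity measure or the details of the two-hit strategy, so Jaccard overlap is used here only as a simple stand-in for semantic similarity, and the ordered-pair count is one possible reading of a temporal "two-hit" combination. The records, code labels and function names are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records: each patient is a list of (day, code) events.
# The codes are made-up labels, not real CPRD or Read codes.
PATIENTS = {
    "p1": [(1, "dizziness"), (5, "sedative"), (9, "fall")],
    "p2": [(2, "sedative"), (3, "dizziness"), (8, "fall")],
    "p3": [(1, "dizziness"), (4, "sedative")],
    "p4": [(1, "dizziness"), (6, "sedative"), (7, "fall")],
}
OUTCOME = "fall"


def jaccard(a, b):
    """Set-overlap similarity between two bags of codes (assumed stand-in
    for semantic similarity): 0 means disjoint, 1 means identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def codes_before_outcome(events, outcome):
    """Codes in time order up to the first outcome event,
    or None if the outcome never occurs for this patient."""
    prior = []
    for _, code in sorted(events):
        if code == outcome:
            return prior
        prior.append(code)
    return None


def two_hit_counts(patients, outcome):
    """Count ordered pairs (earlier code, later code) of distinct codes
    seen before the outcome, at most once per patient."""
    counts = Counter()
    for events in patients.values():
        prior = codes_before_outcome(events, outcome)
        if prior is None:
            continue  # patient never had the outcome
        pairs = {
            (prior[i], prior[j])
            for i, j in combinations(range(len(prior)), 2)
            if prior[i] != prior[j]
        }
        counts.update(pairs)
    return counts


if __name__ == "__main__":
    bag = {pid: [code for _, code in ev] for pid, ev in PATIENTS.items()}
    print(jaccard(bag["p1"], bag["p3"]))
    print(two_hit_counts(PATIENTS, OUTCOME).most_common())
```

In a fuller pipeline, pairwise similarities of this kind would feed a dimensionality-reduction step such as PCA and then clustering, and the pair counts among patients with falls would be compared against those among patients without falls before any pair was treated as a candidate hypothesis.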

Details

Original language: English
Awarding Institution
Supervisors/Advisors
Award date: 1 Aug 2018