AMPLIFYING DATA CURATION EFFORTS TO IMPROVE THE QUALITY OF LIFE SCIENCE DATA

UoM administered thesis: PhD

  • Authors: Mariam Alqasab

Abstract

The massive amount of data derived from the biomedical literature makes maintaining data quality difficult. Biomedical database providers therefore curate their data, either with tools or by hiring domain experts (known as curators). Not all databases can afford curation, because it is expensive and time-consuming, especially when performed by human experts. Curation is crucial in all domains and is not limited to biocuration. In the biomedical field, keeping data curated can prevent harmful problems. For example, if a protein name is miswritten in a data record, a scientist may then use the incorrect name in all their experiments, causing confusion. In short, relying on uncurated data can produce incorrect results. The importance of data curation has led many researchers to develop approaches that make the curation process faster, more reliable and more efficient. In this thesis, we first propose a maturity model that describes the maturity levels of biomedical data curation. The model aims to help data providers identify limitations in their current curation methods and enhance their curation process. It was built from information gathered from five different biomedical databases and from a survey of the biocuration literature, and did not require extra input from curators. Second, we explore one possible approach, IQBot, to maximising the value obtained from human curators by automatically extracting information about data defects and corrections arising from the work that the curators carry out. This information is packaged in a source-independent form, allowing it to be used by the owners of other databases. To extract this information, we compared data from two consecutive versions of the data records.
We ran IQBot to monitor a real-world database (UniProtKB) and extract defects and defect corrections. When we compared the extracted defects and defect corrections with data from other databases, we found that those databases still contained out-of-date data in their records.
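
The idea of comparing two consecutive versions of the data records can be illustrated with a minimal sketch. This is not IQBot's implementation, which the abstract does not describe. The record layout (a dictionary keyed by record identifier, holding field/value pairs), the names `Correction` and `extract_corrections`, and the sample record are all assumptions made for illustration. A field whose value differs between versions is treated as a candidate defect (old value) paired with its correction (new value).

```python
from typing import Dict, List, NamedTuple

# Hypothetical record layout: record identifier -> {field name: value}.
Records = Dict[str, Dict[str, str]]


class Correction(NamedTuple):
    record_id: str
    field: str
    old_value: str  # candidate defect, as found in the earlier version
    new_value: str  # candidate correction, as found in the later version


def extract_corrections(old: Records, new: Records) -> List[Correction]:
    """Return every field whose value changed between two record versions.

    Only records and fields present in both versions are compared, so
    additions and deletions are ignored in this sketch.
    """
    corrections: List[Correction] = []
    for record_id in sorted(old.keys() & new.keys()):
        old_rec, new_rec = old[record_id], new[record_id]
        for field in sorted(old_rec.keys() & new_rec.keys()):
            if old_rec[field] != new_rec[field]:
                corrections.append(
                    Correction(record_id, field, old_rec[field], new_rec[field])
                )
    return corrections


# Invented sample data: a misspelled protein name fixed in the later version.
version_1 = {"REC1": {"protein_name": "Hemoglobn subunit alpha", "organism": "Homo sapiens"}}
version_2 = {"REC1": {"protein_name": "Hemoglobin subunit alpha", "organism": "Homo sapiens"}}

found = extract_corrections(version_1, version_2)
for c in found:
    print(f"{c.record_id}.{c.field}: {c.old_value!r} -> {c.new_value!r}")
```

Because each extracted pair carries only the old and new values, and not any database-specific structure, pairs like these could in principle be checked against the records of another database, which is the kind of source-independent reuse the abstract describes.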

Details

Original language: English
Awarding Institution
Supervisors/Advisors
Award date: 1 Aug 2019