A General Framework for Building Accurate and Understandable Genomic Models: A Study in Rice (Oryza Sativa)

UoM administered thesis: PhD

  • Author: Oghenejokpeme Orhobor


Rapid advances in genotyping and sequencing technologies are driving the generation of vast amounts of genomic data. These advances present a unique opportunity to improve our understanding of the environmental and genetic mechanisms that give rise to phenotypes. These data are technically hard to analyse because there are many attributes (often on the order of a million) and because vast quantities of background knowledge are relevant. Genotype data are most commonly used in genomic models to identify the genetic regions that control phenotypes and to predict the likelihood that members of a population will produce progeny with particular phenotypes. However, most of the data may be irrelevant for a given phenotype, leading to suboptimal, difficult-to-understand models. To meet this challenge, we propose a three-stage general framework that incorporates background knowledge into its model-building process by applying feature stability, inductive logic programming (ILP), and meta-learning. In the first stage, we identify associated markers using marker stability rather than traditional mixed models. In the second stage, we formalise the markers identified in the first stage and additional background knowledge as predicates in first-order logic and, using an ILP engine, identify frequent patterns that correspond to genetic configurations associated with a trait. In the final stage, the frequent patterns identified in the second stage are used as additional data for phenotype prediction. Using a diverse rice (Oryza sativa) population, we demonstrate that this framework (1) significantly outperforms the state of the art in identifying associated genomic regions, (2) identifies relevant genetic configurations, and (3) improves overall phenotype prediction.
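To make the idea of marker stability in the first stage more concrete, the sketch below shows one generic way a stability score could be computed: markers are repeatedly ranked on random subsamples of the population, and each marker is scored by how often it appears among the top-ranked markers. This is an illustration under stated assumptions, not the procedure used in the thesis. The ranking criterion (absolute Pearson correlation), the function names `abs_correlation` and `marker_stability`, all parameter values, and the synthetic data are hypothetical.

```python
import random
from math import sqrt


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return abs(sxy) / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def marker_stability(genotypes, phenotype, n_rounds=50, top_k=5,
                     subsample_frac=0.5, seed=0):
    """Fraction of subsampling rounds in which each marker ranks in the top_k.

    genotypes: list of rows (one per individual), each a list of marker values.
    phenotype: list of trait values, one per individual.
    The correlation-based ranking is an assumed stand-in for whatever
    association measure a real analysis would use.
    """
    rng = random.Random(seed)
    n_samples, n_markers = len(genotypes), len(genotypes[0])
    n_sub = max(2, int(subsample_frac * n_samples))
    counts = [0] * n_markers
    for _ in range(n_rounds):
        # Draw a random subsample of individuals without replacement.
        rows = rng.sample(range(n_samples), n_sub)
        y = [phenotype[i] for i in rows]
        scores = [abs_correlation([genotypes[i][j] for i in rows], y)
                  for j in range(n_markers)]
        # Record the top_k highest-scoring markers for this round.
        ranked = sorted(range(n_markers), key=scores.__getitem__, reverse=True)
        for j in ranked[:top_k]:
            counts[j] += 1
    return [c / n_rounds for c in counts]


# Synthetic data (hypothetical): 200 individuals, 100 markers coded 0/1/2,
# with the trait driven by markers 3 and 42 plus Gaussian noise.
data_rng = random.Random(1)
G = [[data_rng.randint(0, 2) for _ in range(100)] for _ in range(200)]
y = [2.0 * row[3] - 1.5 * row[42] + data_rng.gauss(0, 1) for row in G]

freq = marker_stability(G, y)
stable = [j for j, f in enumerate(freq) if f >= 0.8]  # assumed cut-off
```

Markers that are selected in most rounds are treated as stably associated with the trait; the 0.8 cut-off here is an arbitrary choice for the illustration.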


Original language: English
Awarding Institution
Award date: 1 Aug 2019