Deep Learning Uncovers Genomic Features of Cell-type and State

UoM administered thesis: PhD

  • Author: Mike Phuycharoen

Abstract

Genomic and epigenomic data are being obtained experimentally at an ever-increasing rate. As datasets become easier and cheaper to collect, computational methods for interpreting and integrating them gain in importance. This thesis addresses the problem of using omic data to identify functional elements in DNA sequences with machine learning. In particular, convolutional neural networks are used to identify binding sites of transcription factor proteins (TFs), as well as features of chromatin accessibility, in a set of mouse and human cell types. Deep learning attribution methods can provide explanations for model predictions, and the performance of different approaches is evaluated. Two main systems are analysed. First, the problem of differential and cooperative TF binding is illustrated in mouse branchial arch tissues, where MEIS TFs are known to co-bind with HOX to regulate tissue-specific developmental programmes. It is shown that deep neural networks outperform other commonly used computational methods in predicting binding of HOXA2 from differential MEIS data. Novel applications for indirect regularisation with data are introduced, allowing classification of small datasets. Second, a short time series of chromatin accessibility after immune stimulation is modelled in human CD4+ T cells. Sequence features characteristic of different dynamic trajectories are identified. An unsupervised approach is introduced for obtaining differential features without a priori class specification, along with a semi-supervised method for removing replicate bias from the differential metric. The methods are also applied to two further systems in mouse: MEF2D binding across three tissues, and OCT4 binding in embryonic stem cells. The deep learning models presented in this work show substantial improvements over k-mer counting and SVMs, and provide important motivation for further development of machine learning methods for genomic analysis.

Details

Original language: English
Awarding Institution
Supervisors/Advisors
Award date: 1 Aug 2020