Predicting Drug Target Proteins and Their Properties

UoM administered thesis: Doctoral Thesis

Authors:
  • Simon Bull


The discovery of drug targets is a vital component in the development of therapeutic treatments, as it is only through the modulation of a target's activity that a drug can alleviate symptoms or cure a disease. Accurate identification of drug targets is therefore an important part of any development programme, and has an outsized impact on the programme's success due to its position as the first step in the pipeline. This makes the stringent selection of potential targets all the more vital when attempting to control the increasing cost and time needed to complete a development programme, and to increase the throughput of the entire drug discovery pipeline.

In this work, a computational approach was taken to the investigation of protein drug targets. First, a new heuristic, Leaf, for the approximation of a maximum independent set was developed, and evaluated in terms of its ability to remove redundancy from protein datasets, the goal being to generate the largest possible non-redundant dataset. The ability of Leaf to remove redundancy was compared to that of pre-existing heuristics and an optimal algorithm, Cliquer. Not only did Leaf find unbiased non-redundant sets that were around 10% larger than those found by the commonly used PISCES algorithm, it also found sets that were no more than one protein smaller than the maximum possible size, as determined by Cliquer.

Following this, the human proteome was mined to discover properties of proteins that may be important in determining their suitability for pharmaceutical modulation. Data was gathered concerning each protein's sequence, post-translational modifications, secondary structure, germline variants, expression profile and target status. The data was then analysed to determine features for which the target and non-target proteins had significantly different values. This analysis was repeated for subsets of the proteome consisting of all GPCRs, ion channels, kinases and proteases, as well as for a subset consisting of all proteins that are implicated in cancer.

Next, machine learning was used to score the proteins in each dataset by their potential to serve as a drug target. For each dataset, this was accomplished by first inducing a random forest that could distinguish between its targets and non-targets, and then using the random forest to quantify the drug target likeness of the non-targets.

The properties that can best differentiate targets from non-targets were primarily found to be those that are directly related to a protein's sequence (e.g. secondary structure). Germline variants, expression levels and interactions between proteins had minimal discriminative power. Overall, the best indicators of drug target likeness were found to be the proteins' hydrophobicities, in vivo half-lives, propensity for being membrane bound and the fraction of non-polar amino acids in their sequences. In terms of predicting potential targets, datasets of proteases, ion channels and cancer proteins were able to induce random forests that were highly capable of distinguishing between targets and non-targets. The non-target proteins predicted to be targets by these random forests comprise the set of the most suitable potential future drug targets, and are therefore likely to produce the best results if used as the basis for building a drug development programme.
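The redundancy-removal problem mentioned in the abstract can be pictured as a graph in which each protein is a node and an edge joins any two proteins that are too similar; a non-redundant dataset is then an independent set of that graph. The following is a minimal sketch of a generic greedy heuristic for this problem. It is not the Leaf algorithm itself, whose details are not given here, and the function name, the input format and the tie-breaking rule are all assumptions made for illustration.

```python
def greedy_non_redundant(proteins, similar_pairs):
    """Return a set of proteins in which no two members are listed as similar.

    Generic greedy sketch (NOT the thesis's Leaf heuristic): repeatedly
    discard the protein with the most remaining similar neighbours.
    """
    # Build an undirected redundancy graph: an edge joins two proteins whose
    # similarity exceeds whatever threshold was used to produce similar_pairs.
    neighbours = {p: set() for p in proteins}
    for a, b in similar_pairs:
        if a != b:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)

    while neighbours:
        # Pick the highest-degree protein; ties are broken by name purely so
        # that the result is deterministic (an arbitrary, hypothetical rule).
        worst = max(neighbours, key=lambda p: (len(neighbours[p]), p))
        if not neighbours[worst]:
            break  # no edges remain, so the survivors are non-redundant
        for q in neighbours.pop(worst):
            neighbours[q].discard(worst)
    return set(neighbours)


# Hypothetical example: A is similar to B, C and D; D is similar to E.
kept = greedy_non_redundant(
    ["A", "B", "C", "D", "E"],
    [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")],
)
```

In this toy graph the heuristic removes A and then one of D or E, leaving three proteins, which is also the largest non-redundant set possible. On larger graphs a greedy rule of this kind gives no such guarantee, which is why a comparison against an exact solver is informative.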
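The scoring step described above, inducing a random forest on targets versus non-targets and then using it to quantify drug target likeness, might look roughly like the sketch below. It is a small pure-Python illustration under stated assumptions, not the thesis's implementation: the tree depth, the number of trees, the use of Gini impurity, and the two example features (scaled hydrophobicity and half-life) with their values are all hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def build_tree(rows, labels, rng, depth=0, max_depth=4):
    """Grow one tree; a leaf stores the fraction of targets it contains."""
    if depth == max_depth or len(set(labels)) < 2:
        return sum(labels) / len(labels)
    n_features = len(rows[0])
    # Consider only a random subset of features at each split.
    candidates = rng.sample(range(n_features), max(1, int(n_features ** 0.5)))
    best = None
    for f in candidates:
        # Dropping the largest value keeps both sides of the split non-empty.
        for threshold in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    if best is None:  # every candidate feature was constant in this node
        return sum(labels) / len(labels)
    _, f, threshold = best
    lo = [(r, y) for r, y in zip(rows, labels) if r[f] <= threshold]
    hi = [(r, y) for r, y in zip(rows, labels) if r[f] > threshold]
    return (
        f,
        threshold,
        build_tree([r for r, _ in lo], [y for _, y in lo], rng, depth + 1, max_depth),
        build_tree([r for r, _ in hi], [y for _, y in hi], rng, depth + 1, max_depth),
    )


def tree_score(tree, row):
    """Walk a tree down to a leaf and return that leaf's target fraction."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree


def train_forest(rows, labels, n_trees=100, seed=0):
    """Fit each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest


def target_likeness(forest, row):
    """Average the per-tree target fractions to get a score in [0, 1]."""
    return sum(tree_score(t, row) for t in forest) / len(forest)


# Hypothetical toy data: [scaled hydrophobicity, scaled half-life]; 1 = target.
rows = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.7], [0.7, 0.95],
        [0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.3, 0.05]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
forest = train_forest(rows, labels)
high = target_likeness(forest, [0.95, 0.9])  # resembles the known targets
low = target_likeness(forest, [0.0, 0.0])    # resembles the non-targets
```

Here the candidate proteins are scored separately from the training data. If the non-targets being ranked are also the negative training examples, as the abstract describes, one reasonable precaution would be to score each protein only with trees whose bootstrap sample did not include it (an out-of-bag estimate), which this sketch does not implement.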


Original language: English
Awarding Institution
Award date: 1 Aug 2015