Bioinformatic Approaches to Detect Transposable Element Insertions in High Throughput Sequence Data from Saccharomyces and Drosophila

UoM administered thesis: Unknown


Transposable elements (TEs) are mutagenic mobile DNA sequences whose excision and insertion are powerful drivers of evolution. Some TE families are known to target specific genome features, and studying their insertion preferences can provide information about both TE biology and the state of the genome at these locations. Investigating these preferences requires collecting large numbers of TE insertion sites from natural populations. Genome resequencing data can potentially provide a rich source of such insertion sites. The detection of these "non-reference" TE insertions is an active area of research, with many methods released but no comprehensive review yet performed. To drive forward knowledge of TE biology and the field of non-reference TE detection, we created McClintock, an integrated pipeline of six TE detection methods. McClintock lowers the barriers to using these methods by automating the creation of the diverse range of input files required, setting up all methods to run simultaneously, and standardising the output. To test McClintock and its component methods, we ran the pipeline on both simulated and real Saccharomyces cerevisiae data. Tests on simulated data reveal the general properties of the component methods' predictions as well as the limitations of simulated data for testing software systems. Overlap between results from the McClintock component methods shows that many insertions are detected by only one method, highlighting the need to run multiple TE detection methods to fully understand a resequenced sample. Because the insertion preferences of S. cerevisiae TEs are well characterised, real yeast population resequencing data can act as a biological validation for the predictions of McClintock. All component methods recreated previously known biological properties of S. cerevisiae TE insertions in natural population data. To demonstrate the versatility of McClintock, we applied the system to Drosophila melanogaster resequencing data.
Twenty-seven Schneider's cell lines were sequenced and analysed with McClintock. In addition to demonstrating the scalability of McClintock to larger genomes with more TE families, this exposed ongoing transposition in S2 cell lines. Likewise, the use of non-reference TE insertions as variable sites allowed us to recreate the relationships between S2 sub-lines, confirming that S1, S2, and S3 were most likely established separately. The results also suggest that there are several S2 sub-lines in use and that these sub-lines can differ from each other in TE content by hundreds of non-reference TE copies. Overall, this thesis demonstrates that the McClintock pipeline can highlight problems in TE detection from genome data, and that much can still be learned from this data source.


Original language: English
Awarding Institution
Award date: 1 Aug 2016