Extraction of database and software usage patterns from the bioinformatics literature

UoM administered thesis: Doctoral Thesis

  • Authors: Geraint Duck


Method forms the basis of scientific research, enabling criticism, selection and extension of current knowledge. However, methods are usually confined to the literature, where they are often difficult to find, understand, compare, or repeat. Bioinformatics and computational biology provide a rich opportunity for resource creation and discovery, with a rapidly expanding "resourceome". Many of these resources are difficult to find due to the large choice available, and there are only a limited number of sufficiently populated lists that can help inform resource selection. Text mining has enabled large-scale data analysis and extraction from within the scientific literature, and can therefore help explore the vast wealth of available resources, which form the basis of bioinformatics methods. Accordingly, this thesis aims to survey the computational biology literature, using text mining to extract database and software resource name mentions. By evaluating the common pairs and patterns of usage of these resources within such articles, an abstract approximation of the in silico methods employed within the target domain is developed.

Specifically, this thesis provides an analysis of the difficulties of resource name extraction from the literature, then uses this knowledge to develop bioNerDS, a rule-based system that can detect database and software name mentions within full-text documents (with a final F-score of 67%). bioNerDS is then applied to the full-text document corpus from PubMed Central, and the results are explored to identify differences in resource usage between domains (bioinformatics, biology and medicine), through time, and across journals and document sections. In particular, the well-established resources (e.g., BLAST, GO and GenBank) remain pervasive throughout the domains, although they are seeing a slight decline in usage. Statistical programs see high levels of usage, with R in bioinformatics and SPSS in medicine being frequently mentioned throughout the literature.

An overview of the common resource pairs has been generated by pairing database and software names which directly follow one another in text. Aggregating these resource pairs across the literature enables the generation of a network of common resource patterns within computational biology, which provides an abstract representation of the common in silico methods used. For example, sequence alignment tools remain an important part of several computational biology analysis pipelines, and GO is a strong network sink (primarily used for data annotation). The networks also show the emergence of proteomics and next generation sequencing resources, and provide a specialised overview of a typical phylogenetics method.

This work performs an analysis of common resource usage patterns, and thus provides an important first step towards in silico method extraction using text mining. This should have future implications for community best practice, both for resource and method selection.
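
The pairing and aggregation step described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the choice to skip a name that immediately repeats, and the example mention sequences are all assumptions made for illustration. It pairs each resource mention with the one that directly follows it, sums the pair counts across documents into a weighted directed network, and reports nodes that only receive edges (sinks).

```python
from collections import Counter


def consecutive_pairs(mentions):
    """Pair each resource mention with the mention that directly follows it.

    Assumption: a name immediately followed by itself is not counted as a pair.
    """
    return [(a, b) for a, b in zip(mentions, mentions[1:]) if a != b]


def build_network(documents):
    """Aggregate pair counts across all documents into weighted directed edges."""
    edges = Counter()
    for mentions in documents:
        edges.update(consecutive_pairs(mentions))
    return edges


def sinks(edges):
    """Return nodes that have incoming edges but no outgoing edges."""
    sources = {a for a, _ in edges}
    targets = {b for _, b in edges}
    return targets - sources


# Hypothetical mention sequences in document order, as a name tagger might emit them.
documents = [
    ["GenBank", "BLAST", "GO"],
    ["BLAST", "ClustalW", "GO"],
    ["GenBank", "BLAST", "ClustalW"],
]

network = build_network(documents)
for (source, target), count in sorted(network.items()):
    print(f"{source} -> {target}: {count}")
print("sinks:", sinks(network))
```

In this toy input GO only ever appears as the target of a pair, so it is reported as a sink, which mirrors the kind of observation the abstract makes about the real networks.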


Original language: English
Awarding Institution
Award date: 1 Aug 2015