The quantity of literature in the biomedical domain is growing exponentially. It is becoming impossible for researchers to cope with this ever-increasing amount of information. Text mining provides methods that can improve access to information of interest through information retrieval, information extraction and question answering. However, most of these systems focus on information presented in main body of text while ignoring other parts of the document such as tables and figures.
Tables present a potentially important component of research presentation, as authors often include more detailed information in tables than in textual sections of a document. Tables allow presentation of large amounts of information in relatively limited space, due to their structural flexibility and ability to present multi-dimensional information.
Table processing encapsulates specific challenges that table mining systems need to take into account. Challenges include a variety of visual and semantic structures in tables, variety of information presentation formats, and dense content in table cells. The work presented in this thesis examines a multi-layered approach to information extraction from tables in biomedical documents.
In this thesis we propose a representation model of tables and a method for table structure disentangling and information extraction. The model describes table structures and how they are read. We propose a method for information extraction that consists of: (1) table detection, (2) functional analysis, (3) structural analysis, (4) semantic tagging, (5) pragmatic analysis, (6) cell selection and (7) syntactic processing and extraction. In order to validate our approach, show its potential and identify remaining challenges, we applied our methodology to two case studies. The aim of the first case study was to extract baseline characteristics of clinical trials (number of patients, age, gender distribution, etc.) from tables. The second case study explored how the methodology can be applied to relationship extraction, examining extraction of drug-drug interactions.
Our method performed functional analysis with a precision score of 0.9425, recall score of 0.9428 and F1-score of 0.9426. Relationships between cells were recognized with a precision of 0.9238, recall of 0.9744 and F1-score of 0.9484. The information extraction methodology performance is the state-of-the-art in table information extraction recording an F1-score range of 0.82-0.93 for demographic data, adverse event and drug-drug interaction extraction, depending on the complexity of the task and available semantic resources.
Presented methodology demonstrated that information can be efficiently extracted from tables in biomedical literature. Information extraction from tables can be important for enhancing data curation, information retrieval, question answering and decision support systems with additional information from tables that cannot be found in the other parts of the document.