AUTOMATIC COMPILATION OF BILINGUAL TERMINOLOGIES FROM COMPARABLE CORPORA

UoM administered thesis: PhD

  • Authors:
  • Georgios Kontonatsios

Abstract

Bilingual terminological resources play a pivotal role in human and machine translation of technical text. Owing to the immense volume of newly produced terminology in the biomedical domain, existing resources suffer from low coverage and are available for only a limited number of languages. This creates a need for term alignment methods that accurately identify translations of terms. In this work, we focus on bilingual terminology induction from freely available comparable corpora, i.e. thematically related documents in two or more languages. We investigate different sources of information that determine translation equivalence, including: (a) the internal structure of terms (compositional clue), (b) the surrounding lexical context (contextual clue) and (c) the topic distribution of terms (topical clue). We present four novel compositional alignment methods and introduce several extensions to existing compositional, context-based and topic-based approaches. Furthermore, we combine the three translation clues in a single term alignment model and show substantial improvements over each translation signal considered in isolation. We examine the performance of the proposed term alignment methods on closely related language pairs (English-French, English-Spanish), on a more distant, low-resource language pair (English-Greek) and on an unrelated language pair (English-Japanese). As an application, we integrate automatically compiled bilingual terminologies with Statistical Machine Translation (SMT) systems to translate unknown terms more accurately. Results show that an up-to-date bilingual dictionary of terms improves the translation performance of SMT.

Details

Original language: English
Awarding Institution
Supervisors/Advisors
Award date: 1 Aug 2015