An Investigation into Approaches to Text-to-Speech Synthesis for Modern Standard Arabic

UoM administered thesis: PhD

Authors:
  • Dena Alabbad

Abstract

Text-to-speech synthesis (TTS) converts natural language text into speech. While there has been considerable work on this task, there is still a need to improve the quality of the generated speech. The quality of synthetic speech can be assessed with respect to two features: intelligibility, which indicates how reliably the generated speech can be understood; and naturalness, which measures how similar the synthetic speech is to human speech. The challenge in speech synthesis lies in improving these aspects. These problems are particularly acute for Arabic, which is less studied in the field of speech synthesis than other widely spoken languages, on account of its challenging nature. The aim of the research presented in this thesis was to investigate the best method of generating good quality synthetic speech for Modern Standard Arabic (MSA) with a small amount of training data. This research investigated the phonological characteristics of MSA that contribute to speech quality and employed these in the synthesis. Furthermore, the two dominant synthesis techniques, namely concatenative unit selection synthesis and HMM-based synthesis, were investigated. Two synthesis systems based on these techniques were implemented for MSA: a concatenative diphone-based unit selection system, called “SARA”, and an HMM-based system using the HTS toolkit. A number of experiments were conducted to test the effectiveness of employing the phonological rules of assimilation, pharyngealisation, lexical stress, and intonation in synthesis. A subjective evaluation of intelligibility and naturalness was conducted on the output speech of both implemented systems. The results revealed that using context-sensitive phonetic transcription in synthesis improves both intelligibility and naturalness.
The experiments conducted with the unit selection system, to test the impact of employing different combinations of phonetic rules in synthesis, showed that combinations of rules gave better results than the application of individual rules in isolation. Objective measures for synthetic speech quality were also investigated. The correctness measure obtained by using ASR techniques appeared to provide a useful tool for the automatic assessment of the intelligibility of synthesised speech for MSA, at least for systems that are minor variations of one another. An automatic naturalness evaluation system was also developed to rank sets of synthetic speech utterances by learning from human subjective ranking. The implemented algorithm was tested and the results revealed that the method is robust and sound in terms of ranking different versions of a specific system with minor variations, and hence that it provides a useful tool for filtering minor variants of a system during the development phase. Finally, comparative subjective and objective evaluations were conducted to compare the two implemented synthesis systems' outputs, which revealed that, with 45 minutes of MSA training data, the concatenative diphone-based unit selection synthesis outperforms the HMM-based synthesis system in both intelligibility and naturalness.

Details

Original language: English
Awarding Institution
Supervisors/Advisors
  • Allan Ramsay (Supervisor)
Award date: 31 Dec 2019