USING DATA-DRIVEN RESOURCES FOR OPTIMISING RULE-BASED SYNTACTIC ANALYSIS FOR MODERN STANDARD ARABIC

UoM administered thesis: PhD

  • Authors:
  • Mohamed Elbey

Abstract

This thesis is about optimising a rule-based parser for Modern Standard Arabic (MSA). Ambiguity is a major problem in NLP systems, and it is even worse in a language like MSA, because written MSA omits short vowels and for other reasons that will be discussed in Chapter 1.

Analysis of the original rule-based parser showed that much of the parsing work was unnecessary, because many edges were produced but not used in the final analysis. The first part of this thesis investigates whether integrating a part-of-speech (POS) tagger helps to speed up parsing. This is a well-known technique for Romance and Germanic languages, but its effectiveness has not been widely explored for MSA.

The second part of the thesis applies statistics and machine learning techniques and investigates their effect on the parser. This thesis is not about the accuracy of the parser; it is about finding ways to improve its speed. A new approach is discussed that has not previously been explored in statistical parsing: collecting statistics while parsing, and using them to learn strategies to be applied during the parsing process. The learning process involves all the moves of the parser: moves that lead to the final analysis (good moves) and moves that lead away from it (bad moves). The idea is to learn not only from positive data but also from negative data. The questions to be asked are:

  • Why is this move good, so that we can encourage it?
  • Why is this move bad, so that we can discourage it?

In the final part of the thesis, the two techniques (integrating a POS tagger and using the learning approach) are combined, and the effect of this combination on the parser is examined.
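The idea of learning from both good and bad parser moves can be illustrated with a minimal sketch. It is not the thesis's implementation: the `MoveStats` class, the tuple representation of a move, and the smoothed good/bad ratio used as a score are all assumptions made for illustration. The sketch only shows how counts gathered during parsing could be used to try promising moves first.

```python
from collections import Counter


class MoveStats:
    """Hypothetical store of per-move statistics collected while parsing."""

    def __init__(self):
        self.good = Counter()  # moves that contributed to the final analysis
        self.bad = Counter()   # moves that led away from it

    def record(self, move, led_to_final_analysis):
        """Log one move together with its outcome (positive or negative data)."""
        if led_to_final_analysis:
            self.good[move] += 1
        else:
            self.bad[move] += 1

    def score(self, move):
        """Smoothed estimate that a move is good; unseen moves score 0.5."""
        g = self.good[move]
        b = self.bad[move]
        return (g + 1) / (g + b + 2)

    def rank(self, moves):
        """Order candidate moves so that the most promising are tried first."""
        return sorted(moves, key=self.score, reverse=True)


# Illustrative moves: (action, left category, right category). Assumed format.
stats = MoveStats()
det_noun = ("combine", "DET", "NOUN")
verb_det = ("combine", "VERB", "DET")

for _ in range(3):
    stats.record(det_noun, True)
stats.record(det_noun, False)
for _ in range(4):
    stats.record(verb_det, False)

# The parser's agenda is reordered using the learned statistics.
print(stats.rank([verb_det, det_noun]))
```

In this sketch a move that has mostly led to the final analysis is ranked ahead of one that has mostly been wasted work, which is one simple way to encourage good moves and discourage bad ones without changing what the parser can ultimately derive.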

Details

Original language: English
Awarding Institution
Supervisors/Advisors
Award date: 31 Dec 2014