Reducing Out-of-Vocabulary in Morphology to Improve the Accuracy in Arabic Dialects Speech Recognition

by Khalid Abdulrahman Almeman

A thesis submitted to The University of Birmingham for the degree of Doctor of Philosophy

School of Computer Science
The University of Birmingham
March 2015

University of Birmingham Research Archive
e-theses repository

This unpublished thesis/dissertation is copyright of the author and/or third parties. The intellectual property rights of the author or third parties in respect of this work are as defined by The Copyright Designs and Patents Act 1988 or as modified by any successor legislation. Any use made of information contained in this thesis/dissertation must be in accordance with that legislation and must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the permission of the copyright holder.

Abstract

This thesis has two aims: developing resources for Arabic dialects and improving the speech recognition of Arabic dialects. Two important components are considered: the Pronunciation Dictionary (PD) and the Language Model (LM). Six parts are involved, which relate to finding and evaluating dialect resources and improving the performance of systems for the speech recognition of dialects. Three resources are built and evaluated: one tool and two corpora. The methodology that was used for building the multi-dialect morphology analyser involves the proposal and evaluation of linguistic and statistical bases. We obtained an overall accuracy of 94%. The dialect text corpora have four sub-dialects, with more than 50 million tokens. The multi-dialect speech corpora have 32 speech hours, which were collected from 52 participants. The resultant speech corpora have more than 67,000 speech files. The main objective is improvement in the PDs and LMs of Arabic dialects. The use of incremental methodology made it possible to check orthography and phonology rules incrementally.
We were able to distinguish the rules that positively affected the PDs. The Word Error Rate (WER) improved by 5.3% in MSA and 5% in Levantine. Three levels of morphemes were used to improve the LMs of dialects: stem, prefix+stem and stem+suffix. We checked the three forms using two different types of LMs. Eighteen experiments are carried out on MSA, the Gulf dialect and the Egyptian dialect, all of which yielded positive results, showing that WERs were reduced by 0.5% to 6.8%.

Acknowledgements

Many thanks to my supervisor, Dr. Mark Lee, for his support, guidance and advice during the PhD period. Special thanks to my parents, Madhawi Alfanikh and Abdulrahman Almeman, for their unfailing love and their encouragement. I am grateful to my wife, Amane Alsaheel, for her encouragement, support and quiet patience during my MSc and PhD. I wish to thank my thesis group members, Prof. John Barnden and Dr. Iain Styles, for their guidance and advice.

Contents

1 Introduction
  1.1 Problem statement
  1.2 Research questions
  1.3 A brief description of the work detailed in the thesis
  1.4 Summary of the key thesis contributions
    1.4.1 Presenting a methodology for building a multi dialect morphology analyser
    1.4.2 Demonstrating a methodology for collecting multi dialect Arabic text corpora automatically
    1.4.3 Developing a methodology for building a multi dialect speech corpus
    1.4.4 Building a multi dialect speech recognition system and comparing it to separated dialects tasks
    1.4.5 Improving of Arabic dialects PDs
    1.4.6 Improving of Arabic dialects LMs
  1.5 Thesis outline
  1.6 Resulting publications
  1.7 Resulting dialects resources
2 Modern Standard Arabic and Arabic Dialects
  2.1 Introduction
  2.2 MSA vs. dialects in usage
  2.3 The multiplicity of dialects
  2.4 Dialectic variation
  2.5 Morphology of the dialects
  2.6 Phonology of the dialects
  2.7 Some challenges for Arabic dialects
  2.8 Conclusions
3 Related Work
  3.1 Introduction
  3.2 Automatic multi-dialect analysis of Arabic
  3.3 Automatic building of Arabic multi dialect text corpora by bootstrapping dialect words
  3.4 Multi dialect speech parallel corpora
  3.5 A comparison of Arabic speech recognition for multi-dialect vs. specific dialects
  3.6 An incremental methodology for improving pronunciation dictionaries for Arabic
  3.7 Morpheme-based language models for improving the speech recognition of Arabic dialects
  3.8 Conclusions
4 Automatic Multi-Dialect Analysis of Arabic
  4.1 Introduction
  4.2 The motivations for multi-dialect morphology analysis
  4.3 Methodology
  4.4 Implementation
    4.4.1 The building of Arabic multi dialect morphology analyser webpage
  4.5 Evaluation
  4.6 Conclusions
5 Automatic Building of Arabic Multi Dialect Text Corpora by Bootstrapping Dialect Words
  5.1 Introduction
  5.2 The motivations for building a multi dialect written text corpora
  5.3 Methodology
  5.4 Results
  5.5 Evaluation
    5.5.1 Comparing results
    5.5.2 Analysis
    5.5.3 Error evaluation
  5.6 Conclusions
6 Multi Dialect Arabic Speech Parallel Corpora
  6.1 Introduction
  6.2 The need for Arabic multi dialect speech corpora
  6.3 Methodology
  6.4 Implementation
    6.4.1 Write MSA text and diacritise it
    6.4.2 Translate into dialects and diacritise them
    6.4.3 Recording
    6.4.4 Audio segmenting
  6.5 File organisation
  6.6 Results
    6.6.1 Parallel texts results
    6.6.2 Parallel speech results
  6.7 Evaluation
    6.7.1 Text evaluation
    6.7.2 Speech evaluation
  6.8 Conclusions
7 A Comparison of Arabic Speech Recognition for Multi-Dialect vs. Specific Dialects
  7.1 Introduction
  7.2 Data
  7.3 Recognition system
  7.4 Results
  7.5 Discussion
  7.6 Conclusions
8 An Incremental Methodology for Improving Pronunciation Dictionaries for Arabic Speech Recognition
  8.1 Introduction
  8.2 Why do we need to improve Arabic pronunciation dictionaries?
  8.3 Data
  8.4 Methodology
    8.4.1 The pronunciation dictionary rules
    8.4.2 The incremental methodology for improving Arabic pronunciation dictionary
  8.5 Recognition system and baseline result
  8.6 Results
  8.7 Evaluation
  8.8 Conclusions
9 Morpheme-Based Language Models for Improving the Speech Recognition of Arabic Dialects
  9.1 Introduction
  9.2 Data
  9.3 Methodology
  9.4 Recognition system and baseline result
  9.5 Automatic Speech Recognition (ASR) experiments results
  9.6 Evaluation
  9.7 Conclusions
10 Conclusions and Future Work
  10.1 Introduction
  10.2 The methodologies for building Arabic dialects resources
  10.3 Improving speech recognition for Arabic dialects
  10.4 How this research can be extended to multi-dialect approaches to other languages
  10.5 Future work
    10.5.1 Extending the multi dialect analyser
    10.5.2 Classifying text corpora
    10.5.3 Producing different version of speech corpora
    10.5.4 Extending incremental methodology by applying it to other dialects
    10.5.5 Returning to the original full word from prefix+stem or stem+stem after improving LMs
  10.6 Contributions
List of References

List of Tables

2.1 Arabic letters
2.2 The 13 combinations of short vowels for /b/ ب ‘Baa letter’
2.3 Percentage of shared unigrams, bigrams and trigrams in the Egyptian corpus (ECA) and the MSA corpus, and for the conversational British English corpus (BE) and American English corpus (AmE) (Kirchhoff and Vergyri, 2005)
2.4 Some changes in phones between Arabic dialects compared with MSA
3.1 Some existing MSA corpora
3.2 Some existing MSA speech corpora
3.3 Gulf speech corpora
3.4 Levantine speech corpora
3.5 Egyptian speech corpora
3.6 Microphone source speech corpora for MSA, Gulf, Egyptian and Levantine
4.1 Example of the output after the first layer
4.2 Example of segmented words
4.3 Example of analysed words after the last layer
4.4 Results before starting the experiments
4.5 Results after MSA analyser has adopted
4.6 The final results
5.1 Examples of categorised words and phrases
5.2 Total number of words
5.3 The estimation of how many pages we need per dialect
5.4 Results
5.5 Total results
5.6 Sentences counts and average of length
5.7 Comparing unknown words
5.8 Size comparisons
5.9 Frequency of frequencies of token types
5.10 10 greatest unigrams tokens frequencies (including function words)
5.11 10 greatest unigrams tokens frequencies (non function words)
5.12 Bigrams, trigrams and five-grams counts
5.13 Commonest bigram for all dialects corpora
5.14 Commonest trigram for all four corpora
6.1 New phones representation in dialects
6.2 Corpora distribution for sections
6.3 Example of some sentences
6.4 Recording attributes
6.5 Tokens count
6.6 Parallel texts sentences count
6.7 Speaker count
6.8 Speaker age
6.9 Files count
6.10 Utterances count
6.11 Phones count
6.12 Sharing sentences between four parallel corpora
6.13 The word overlap for MSA with Gulf, Egyptian and Levantine
6.14 Lexicon count
6.15 Speech contrast evaluation
7.1 One-to-one auto mapping for creating a baseline PD
7.2 The WER variation with the tied states and the number of densities in multi-dialect system using all four corpora
7.3 The best WERs for the four dialects when evaluated using multi-dialect data
7.4 The best WERs for the four dialects when evaluated against each dialect’s own data
7.5 The table shows Levantine dialect results when evaluated using MSA acoustic model with three different LMs
7.6 A Student’s t-test result
8.1 An one-to-one automapping for creating baseline PD
8.2 PD phonology and morphology rules
8.3 MSA Result
8.4 Levantine Result
9.1 Description of the corpora used for creating closed and open LMs
9.2 Reduction percentage in closed LM size - unique tokens
9.3 Reduction percentage in open LM size - unique tokens