
Statistical Parametric Methods for Articulatory-Based Foreign Accent Conversion (PDF)

194 Pages·2015·2.82 MB·English

STATISTICAL PARAMETRIC METHODS FOR ARTICULATORY-BASED FOREIGN ACCENT CONVERSION

A Dissertation by SANDESH ARYAL

Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Chair of Committee: Ricardo Gutierrez-Osuna
Committee Members: Yoonsuck Choe, Dylan Shell, Byung-Jun Yoon
Head of Department: Dilma Da Silva

December 2015
Major Subject: Computer Engineering
Copyright 2015 Sandesh Aryal

ABSTRACT

Foreign accent conversion seeks to transform utterances from a non-native (L2) speaker so that they appear to have been produced by the same speaker but with a native (L1) accent. Such accent-modified utterances have been suggested to be effective in pronunciation training for adult second-language learners. Accent modification involves separating the linguistic gestures and voice-quality cues in the L1 and L2 utterances, then transposing them across the two speakers. However, because of the complex interaction between these two sources of information, their separation in the acoustic domain is not straightforward; as a result, vocoding approaches to accent conversion produce a voice that is different from both the L1 and L2 speakers. In contrast, separation in the articulatory domain is straightforward, since linguistic gestures are readily available as articulatory data. However, because articulatory data are difficult to collect, conventional synthesis techniques based on unit selection are ill-suited for accent conversion: articulatory corpora are small, and unit selection cannot interpolate native sounds that are missing from the L2 corpus. To address these issues, this dissertation presents two statistical parametric methods for accent conversion that operate in the acoustic and articulatory domains, respectively.
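The core idea behind a cross-speaker statistical mapping, learning to predict one speaker's features from the other's on paired frames, can be illustrated with a toy affine least-squares map. This is a minimal sketch under illustrative assumptions (synthetic 4-dimensional features, a single global linear map); the methods evaluated in this dissertation use richer models such as GMM regression, which generalize this idea with multiple locally linear components.

```python
import numpy as np

def fit_linear_mapping(src, tgt):
    """Learn an affine map W so that tgt ~= [src, 1] @ W on paired frames."""
    X = np.hstack([src, np.ones((src.shape[0], 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X, tgt, rcond=None)
    return W

def apply_mapping(W, src):
    """Apply the learned affine map to new source frames."""
    X = np.hstack([src, np.ones((src.shape[0], 1))])
    return X @ W

# Synthetic paired frames: the "L2" target features are an exact affine
# function of the "L1" source features, so the map is recoverable.
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 4))   # e.g. 4-dim source spectral features
A = rng.normal(size=(4, 4))
tgt = src @ A + 0.5               # paired target features
W = fit_linear_mapping(src, tgt)
pred = apply_mapping(W, src)
print(np.allclose(pred, tgt, atol=1e-6))  # noiseless toy data -> True
```

In practice the frames would be spectral feature vectors paired across speakers (e.g. by forced alignment or acoustic similarity), and a mixture model would replace the single global map.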
The acoustic method uses a cross-speaker statistical mapping to generate L2 acoustic features from the trajectories of L1 acoustic features in a reference utterance. Our results show significant reductions in perceived non-native accent compared to the corresponding L2 utterance, as well as a strong voice similarity between the accent conversions and the original L2 utterance. Our second (articulatory-based) approach consists of building a statistical parametric articulatory synthesizer for the non-native speaker, then driving the synthesizer with articulatory trajectories from the reference L1 speaker. This statistical approach not only has low data requirements but also has the flexibility to interpolate sounds missing from the L2 corpus. In a series of listening tests, articulatory accent conversions were rated as more intelligible and less accented than their L2 counterparts. In the final study, we compare the two approaches. Our results show that the articulatory approach, despite its direct access to the native linguistic gestures, is less effective in reducing perceived non-native accent than the acoustic approach.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my advisor, Dr. Ricardo Gutierrez-Osuna, for his continuous guidance, encouragement, and support throughout my graduate studies, which helped me become a better researcher. I would also like to thank my committee members, Dr. Yoonsuck Choe, Dr. Dylan Shell, and Dr. Byung-Jun Yoon, for their valuable suggestions and feedback. I am also grateful to all my teachers, who helped prepare me to undertake this path. Thanks also go to my lab mates Jin, Avinash, Daniel, Jong, Rakesh, Folami, Rhushabh, Virendra, Chris, Anshul, Zelun, Difan, Adam, and Guanlong for keeping my spirits up, being always ready to help whenever needed, and making my time in the lab enjoyable.
Finally, I am very grateful to my parents and my family for their inspiration, encouragement, patience, unwavering support, and love. A special gratitude goes to my wife, Sushila, for being there when I needed her most.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES

1. INTRODUCTION
   1.1 Thesis outline
2. BACKGROUND
   2.1 Non-native accents
       2.1.1 Non-native accents in communication
       2.1.2 Pronunciation training for second language learners
   2.2 Speech production physiology
       2.2.1 Acoustic theory of speech
       2.2.2 Speaker's identity, voice and accent
       2.2.3 Modulation theory
   2.3 Speech analysis and synthesis
       2.3.1 Linear predictive analysis
       2.3.2 Cepstral analysis
       2.3.3 STRAIGHT analysis
       2.3.4 Sinusoidal analysis and harmonics plus noise model
   2.4 Articulatory speech processing
       2.4.1 Articulatory representation
       2.4.2 Articulatory measurements
       2.4.3 Articulatory normalization
       2.4.4 Articulatory synthesis
       2.4.5 Articulatory inversion
   2.5 Summary
3. LITERATURE REVIEW
   3.1 Foreign accent conversion
   3.2 Acoustic-based approach
   3.3 Articulatory-based approaches
   3.4 Evaluation of foreign accent conversion
       3.4.1 Acoustic quality assessment
       3.4.2 Intelligibility assessment
       3.4.3 Assessment of non-native accents
       3.4.4 Assessment of voice-identity
4. ACOUSTIC-BASED FOREIGN ACCENT CONVERSION USING VOICE CONVERSION
   4.1 Foreign accent conversion based on spectral mapping
       4.1.1 STRAIGHT feature extraction and synthesis
       4.1.2 Pairing acoustic vectors
       4.1.3 Cross-speaker spectral mapping
       4.1.4 Prosody modification
   4.2 Experimental
       4.2.1 Conversion from non-native to native accent
       4.2.2 Conversion from native to non-native accent
       4.2.3 Experimental corpus
   4.3 Results
       4.3.1 L1→L2 accent/voice conversion
       4.3.2 L2→L1 accent/voice conversion
       4.3.3 Correlation with differences in the L1 and L2 phonetic inventories
   4.4 Discussion
5. STATISTICAL PARAMETRIC ARTICULATORY FOREIGN ACCENT CONVERSION
   5.1 Method description
       5.1.1 Cross-speaker articulatory mapping
       5.1.2 Forward mapping
       5.1.3 System diagram
   5.2 Experimental corpus
       5.2.1 Experimental conditions
       5.2.2 Participant recruitment
   5.3 Results
       5.3.1 Accuracy of articulatory normalization
       5.3.2 Assessment of intelligibility
       5.3.3 Assessment of non-native accentedness
       5.3.4 Assessment of voice individuality
   5.4 Discussion
6. ARTICULATORY-BASED CONVERSION OF FOREIGN ACCENTS WITH DEEP NEURAL NETWORKS
   6.1 Deep neural networks in articulatory-acoustic mappings
   6.2 DNN-based foreign accent conversion method
       6.2.1 DNN-based forward mapping
       6.2.2 Global variance adjustment
   6.3 Performance of DNN-based forward mapping
       6.3.1 GMM-based baseline methods
       6.3.2 Experimental
       6.3.3 Experiment 1: Comparison of DNN vs. GMM
       6.3.4 Experiment 2: Context length
       6.3.5 Experiment 3: Network depth
       6.3.6 Experiment 4: Synthesis time
       6.3.7 Experiment 5: Subjective assessment
       6.3.8 Discussions on the performance of DNN-based forward mapping
   6.4 Evaluation of foreign accent conversion with DNN
   6.5 Results
       6.5.1 Intelligibility assessment
       6.5.2 Assessment of non-native accentedness
       6.5.3 Voice identity assessment
   6.6 Conclusion
7. ACOUSTIC VS. ARTICULATORY-BASED STRATEGIES
   7.1 Comparison between the articulatory and acoustic-based strategies
   7.2 Equivalent articulatory synthesizer for the acoustic-based strategy
       7.2.1 Training the cross-speaker forward mapping
   7.3 Experimental validation
       7.3.1 Experimental conditions
   7.4 Results
       7.4.1 Non-native accent evaluation
       7.4.2 Intelligibility assessment
       7.4.3 Voice identity assessment
   7.5 Discussions
8. CONCLUSIONS
   8.1 Summary
   8.2 Main contributions
   8.3 Future work
       8.3.1 Large scale validation
       8.3.2 Performance improvement
       8.3.3 Application of foreign accent conversion methods in computer aided pronunciation training
       8.3.4 Extension to other articulatory speech modification problems
REFERENCES
APPENDIX A: FORWARD MAPPING WITH DEEP NETWORKS
APPENDIX B: MECHANICAL TURK TEST SAMPLES
APPENDIX C: LIST OF PUBLICATIONS
APPENDIX D: PSI STATFAC TOOLBOX

LIST OF FIGURES

Figure 1: Human speech production physiology.
Figure 2: Acoustic theory of speech: the speech signal (right) is the convolution of the glottal source excitation signal (left) and the vocal tract filter response (middle).
Figure 3: A simplified computational model of speech production physiology.
Figure 4: A typical FFT and LPC spectrum of a nasal speech segment.
Figure 5: STRAIGHT analysis and synthesis.
Figure 6: Position of the 6 EMA pellets used in our study; UL: upper lip; LL: lower lip; LI: lower incisor; TT: tongue tip; TB: tongue blade; TD: tongue dorsum. An additional pellet (red cross-hair) was placed on the upper incisor and served as a reference.
Figure 7: Cepstral decomposition of speech into spectral slope and spectral detail (DCT: discrete cosine transform).
Figure 8: Articulatory foreign accent conversion based on unit selection (from Felps (2011)).
Figure 9: Foreign accent conversion method using cross-speaker statistical mappings.
Figure 10: (a) Conventional approach to voice conversion: source and target utterances are paired based on their ordering in a forced-aligned parallel corpus. (b) Our approach to accent conversion: source and target utterances are paired based on their acoustic similarity following vocal-tract-length normalization (VTLN). MCD: Mel cepstral distortion.
Figure 11: Shifting and scaling the L1 pitch trajectory to match the vocal range of the L2 speaker.
Figure 12: The number of missing phonemes in the L2 inventory and the proportion of listeners who found the AC12 synthesis less foreign-accented than the VC12 synthesis for each test sentence are highly correlated.
Figure 13: Articulatory accent conversion is a two-step process consisting of L1-L2 articulatory normalization and L2 forward mapping.
Figure 14: Overview of the cross-speaker articulatory normalization procedure. A separate set of parameters is obtained for each EMA pellet.
Figure 15: Block diagram of the accent conversion method (PM: pitch modification).
Figure 16: (a) Distribution of the six EMA pellet positions for the L1 speaker (solid markers) and L2 speaker (hollow markers) from a parallel corpus. Large differences can be seen in the span of the measured articulator positions (UL: upper lip; LL: lower lip; LI: lower incisor; TT: tongue tip; TB: tongue blade; TD: tongue dorsum). The upper incisor (UI) was used as a reference point. (b) Distribution of EMA pellet positions for the L1 speaker (solid markers) and L2 speaker (hollow markers) following articulatory normalization.
Figure 17: Trajectory of the tongue-tip pellet in L1 and L2 utterances of the word 'that'. The L1 trajectory normalized to the L2 articulatory space is also shown. Arrows indicate the direction of the trajectories.
Figure 18: (a) Distribution of tongue-tip position in frontal vowels for the L1 speaker (dark ellipses) and L2 speaker (light ellipses); ellipses represent the half-sigma contour of the distribution for each vowel. (b) Distribution of tongue-tip position in frontal vowels for the L1 speaker after articulatory mapping (dark) and the L2 speaker (light).
Figure 19: Box plot of (a) word accuracy and (b) subjective intelligibility ratings for , and utterances.
Figure 20: Word accuracy for and for the 46 test sentences. The diagonal dashed line represents . The sentences for which are above the dashed line, and vice versa.
Figure 21: Subjective evaluation of non-native accentedness. Participants were asked to determine which utterance in a pair was more native-like.
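Figure 11 depicts shifting and scaling the L1 pitch trajectory into the L2 speaker's vocal range. A common way to implement such a transformation (an illustrative assumption here, not necessarily the exact formula used in the dissertation) is mean-variance normalization of log-F0, leaving unvoiced frames (F0 = 0) untouched:

```python
import numpy as np

def convert_pitch(f0_l1, mu_l1, sd_l1, mu_l2, sd_l2):
    """Shift and scale an L1 F0 trajectory into the L2 vocal range.

    mu/sd are the mean and standard deviation of each speaker's log-F0.
    Unvoiced frames (f0 == 0) are passed through unchanged.
    """
    f0_out = np.array(f0_l1, dtype=float)  # copy, do not mutate input
    voiced = f0_out > 0
    logf0 = np.log(f0_out[voiced])
    logf0 = (logf0 - mu_l1) / sd_l1 * sd_l2 + mu_l2  # normalize, re-scale
    f0_out[voiced] = np.exp(logf0)
    return f0_out

# Example: a low-range L1 trajectory mapped toward a higher L2 range.
# With equal sd values this reduces to a multiplicative shift of F0.
f0 = np.array([0.0, 110.0, 120.0, 130.0, 0.0])
out = convert_pitch(f0, mu_l1=np.log(120), sd_l1=0.1,
                    mu_l2=np.log(220), sd_l2=0.1)
print(out.round(1))
```

In practice the log-F0 statistics would be estimated from the voiced frames of each speaker's training corpus.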
