Vocabulary and Environment Adaptation in Vocabulary-Independent Speech Recognition

Hsiao-Wuen Hon                          Kai-Fu Lee
School of Computer Science              Speech & Language Group
Carnegie Mellon University              Apple Computer, Inc.
Pittsburgh, Pennsylvania 15213          Cupertino, CA 95014

1 Abstract

In this paper, we look into the adaptation issues of vocabulary-independent (VI) systems. Just as with speaker adaptation in speaker-independent systems, two vocabulary adaptation algorithms [5] are implemented in order to tailor the VI subword models to the target vocabulary. The first algorithm generates vocabulary-adapted clustering decision trees by focusing on relevant allophones during tree generation, and reduces the VI error rate by 9%. The second algorithm, vocabulary-bias training, gives the relevant allophones more prominence by assigning more weight to them during Baum-Welch training of the generalized allophonic models, and reduces the VI error rate by 15%. Finally, in order to overcome the degradation caused by the different acoustic environments used for VI training and testing, CDCN and ISDCN, originally designed for microphone adaptation, are incorporated into our VI system; both reduce the degradation of VI cross-environment recognition by 50%.

2 Introduction

In the 89' and 91' DARPA Speech and Natural Language Workshops [8, 7], we have shown that accurate vocabulary-independent (VI) speech recognition is possible. However, there are many differences between tasks (vocabularies), such as the size of the vocabulary and the frequency of confusable words, which might affect the acoustic modeling techniques needed to achieve optimal performance in vocabulary-dependent (VD) systems. For example, whole-word models are often used in small-vocabulary tasks, while subword models must be used in large-vocabulary tasks. Moreover, within a limited vocabulary, it is possible to design special features to separate the confusable models. This is why discriminative training techniques, such as neural networks [10] and the maximum mutual information estimator (MMIE) [4], have had so much success in small-vocabulary tasks.

Just as with speaker adaptation in speaker-independent systems, it is desirable to implement vocabulary adaptation to tailor the VI system to the target vocabulary (task). Our first vocabulary adaptation algorithm builds vocabulary-adapted allophonic clustering decision trees for the target vocabulary based on only the relevant allophones. The adapted trees focus on only the relevant contexts to separate the relevant allophones, thus giving the resulting allophonic clusters more discriminative power for the target vocabulary. In an experiment adapting the allophone clustering trees to the Resource Management task, this algorithm achieved a 9% error reduction.

Our second vocabulary adaptation algorithm focuses on the relevant allophones during training of the generalized allophonic models, instead of during generation of the allophonic clustering decision trees. To achieve that, we give the relevant allophones more prominence by assigning more weight to them during Baum-Welch training of the generalized allophonic models. With this vocabulary-bias training, we are able to reduce the VI error rate by 15% for the Resource Management task.

We have also found that different recording environments for training and testing (CMU vs. TI) degrade performance significantly [6], even when the same microphone is used in both cases. Based on the framework of semi-continuous HMMs, we propose to update the codebook prototypes in discrete HMMs in order to fit speech vectors from new environments [5]. Moreover, codeword-dependent cepstral normalization (CDCN) and interpolated SNR-dependent cepstral normalization (ISDCN), proposed by Acero et al. [2] for microphone adaptation, are incorporated into our VI system to achieve environmental robustness. CDCN uses the speech knowledge represented in a codebook to estimate the noise and spectral equalization correction vectors for environmental normalization. In ISDCN, the SNR-dependent correction vectors are obtained via an EM algorithm that minimizes the VQ distortion. Both algorithms reduce the degradation of VI cross-environment recognition by 50%.

In this paper, we first describe our two vocabulary adaptation algorithms, vocabulary-adapted decision trees and vocabulary-bias training. Then we describe the codebook adaptation algorithm and two cepstral normalization techniques, CDCN and ISDCN, for environmental robustness. We also present results with these vocabulary and environment adaptation algorithms. Finally, we close with some concluding remarks about this work and future work.
3 Vocabulary Adaptation

Unlike most speaker adaptation techniques, our vocabulary adaptation algorithms take advantage only of an analysis of the target vocabulary and thus do not require any additional vocabulary-specific data. Two terms which play an essential role in our algorithms are defined as follows.

relevant allophones: those allophones which occur in the target vocabulary (task).

irrelevant allophones: those allophones which occur in the VI training set, but not in the target vocabulary (task).

In the 91' DARPA Speech and Natural Language Workshop [7], we showed that the decision-tree based generalized allophone is an adequate VI subword model. Figure 1 is an example of our VI subword unit, the generalized allophone, which is actually an allophonic cluster. The allophones in the white area are relevant allophones and the rest are irrelevant ones.

Figure 1: A generalized allophone (allophonic cluster)

3.1 Vocabulary-Adapted Decision Trees

Our first vocabulary adaptation algorithm changes the allophone clustering (the decision trees) so that the new set of subword models has more discriminative power for the target vocabulary. Since the clustering decision trees were built on the entire VI training set, the enormous number of irrelevant allophones might result in sub-optimal clustering of allophones for the target vocabulary.

To see this, consider the following scenario. Figure 2 shows a split in the original decision tree for the phone /k/ generated from the vocabulary-independent training set; the question associated with this split is "Is the left context a vowel?". Suppose all the left contexts for /k/ in the target vocabulary are vowels. The question for this split is then totally unsuitable for the target vocabulary, because the split assigns all the allophones for /k/ in the target vocabulary to one branch, and discrimination among those allophones becomes impossible.

Figure 2: A split (question), "Left = Vowel?", in the original decision tree for the phone /k/

On the other hand, if only the relevant allophones are considered for this split, the associated split question turns out to be the one among the relevant questions which separates the relevant allophones appropriately and therefore possesses the greatest discriminative ability among them. Figure 3 shows such an optimal split for the relevant allophones. The generation of the clustering decision trees is recursive, so the enormous number of irrelevant allophones prevents the generation procedure from concentrating on the relevant allophones and relevant questions, and results in sub-optimal trees for the relevant allophones.

Figure 3: The corresponding optimal split (question), "Right = Liquid?", for the relevant allophones of the phone /k/

Based on this analysis, our first adaptation algorithm builds vocabulary-adapted (VA) decision trees by using only the relevant allophones during the generation of the decision trees. The adapted trees are not only generated automatically, but also focus on the relevant questions to separate the relevant allophones, giving the resulting allophonic clusters more discriminative power for the target vocabulary.

Three potential problems come up when one examines the algorithm closely. First of all, some relevant allophones might not occur in the VI training set, since we cannot expect 100% allophone coverage for every task, especially for large-vocabulary tasks. Nevertheless, it is essential to have models for all the relevant allophones ready before generating the VA decision trees, because we need the entropy information of the models for each split. This is trivial for those relevant allophones which also occur in the VI training set: the corresponding allophonic models trained from the training data can be used directly. Because of the nature of decision trees, every allophone can find its closest generalized allophonic cluster by traversing the decision trees. Therefore, the corresponding generalized allophonic models can be used as the models for those relevant allophones not occurring in the VI training set during the generation of the VA clustering trees.

Secondly, if only the part of the VI training set which contains the relevant allophones were used to train the new generalized allophonic models, the new adapted models would be under-trained and less robust. Fortunately, we can retain the entire training set, again because of the nature of decision trees. All the allophones can find their generalized allophonic clusters by traversing the new VA decision trees, so the entire VI training set can contribute to the training of the new adapted generalized allophonic models and make them well-trained and robust.

Thirdly, the entropy criterion for splitting during the generation of decision trees is weighted by the counts (frequencies) of allophones [6]. By preferring to split nodes with large counts (allophones appearing frequently), the counts of the allophonic clusters become more balanced and the final generalized allophonic models are equally trainable. But since the VA decision trees are generated from the set of relevant allophones, which is not the same as the set of allophones used to train the generalized allophonic models, this balance property no longer holds. Some generalized allophonic models might have only a few (or even no) examples in the VI training set and thus cannot be well-trained. Fortunately, we can enhance the trainability of the VA subword models through gross validation with the entire VI training set. Gross validation for VA decision trees differs from conventional cross validation, which uses one part of the data to grow the trees and an independent part to prune them in order to predict new contexts. Since the relevant allophones already form only a small portion of the entire VI training set, dividing it further would prevent the learning algorithm from generating reliable VA decision trees. Instead, we grow the VA decision trees very deeply; replace the entropy reduction information of each split by traversing the trees with all the allophones (including irrelevant ones); and finally prune the trees based on the new entropy information. This prunes out splits of nodes without enough training support (too few examples), even though they might be relevant to the target vocabulary, so the resulting generalized allophonic models become more trainable.

The vocabulary-adapted decision tree learning algorithm, by emphasizing the relevant allophones while growing the decision trees and using gross validation with the entire VI training set, provides an ideal means of finding the equilibrium between adaptability to the target vocabulary and trainability on the VI training database.
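The vocabulary-adapted split selection described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the helper names (`entropy`, `split_gain`, `best_question`) and the encoding of an allophone as a (left-context, right-context) pair are invented for the example. The key point is only that candidate questions are scored on the relevant allophones alone, rather than on the whole VI training set.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

def split_gain(allophones, question):
    """Count-weighted entropy reduction of splitting `allophones`
    (a list of (allophone, count) pairs) by a boolean `question`."""
    parent, left, right = Counter(), Counter(), Counter()
    for allo, count in allophones:
        parent[allo] += count
        (left if question(allo) else right)[allo] += count
    n, nl, nr = sum(parent.values()), sum(left.values()), sum(right.values())
    return n * entropy(parent) - nl * entropy(left) - nr * entropy(right)

def best_question(allophones, questions, relevant):
    """Vocabulary-adapted split: score candidate questions on the
    relevant allophones only, instead of the whole VI training set."""
    adapted = [(a, c) for a, c in allophones if a in relevant]
    return max(questions, key=lambda q: split_gain(adapted, q))
```

This reproduces the /k/ scenario: a "left context is a vowel" question may score well over all allophones yet have zero gain over the relevant ones, so the adapted tree picks a question that actually separates the relevant allophones.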
Our next Secondly, if only the part of VI training set which con- adaptation algorithm is to give the relevant allophones more rains the relevant allophones is used to train new generalized prominence during the training of generalized allophonic allophonic models, the new adapted generalized allophonic models. models would be under-trained and less robust. Fortunately, we can retain the entire training set because of the the nature Since the VI training database is supposed to be very large, of decision trees. All the allophones could find their gener- it is reasonable to assume that the irrelevant allophones are alized allophonic clusters by traversing the new VA decision the majority of almost every cluster. Thus, the resulting allo- trees, so the entire VI training set could actually contribute phonic cluster will more likely represent the acoustic behavior to the training of new adapted generalized allophonic models of the set of irrelevant allophones, instead of the set of relevant and make them well-trained and robust. allophones. The entropy criterion for splitting during the generation of In order to make relevant allophones become the majority of decision trees is weighted by the counts (frequencies) of allo- the allophonic cluster without incorporating new vocabulary- phones 6. By preferring to split nodes with large counts (al- specific data, we must impose a bias toward the relevant al- lophones appearing frequently), the counts of the allophonic lophones during training. Since our VI system is based on cluster will become more balanced and the final generalized HMM approach, it is trivial to give the relevant allophones allophonic models will be equally trainable. Since the VA de- more prominence by assigning more weight to them during cision tress are generated from the set of relevant allophones Baum-Welch training. 
The simplest way is to multiply a which is not the same as the set of allophones to train the prominent weight to the parametric re-estimation equations generalized allophonic models. The balance feature of those for relevant allophones. models will be no longer valid. Some generalized allophonic The prominent weight can be a pre-defined constant, like models might only have few (or even none) examples in the VI 2.0 or 3.0, or a function of some variables. However, it is training set and thus cannot be well-trained. Fortunately, we better for the prominent weight to reflect the reliability of can enhance the trainability of VA subword models through the relevant allophones toward which we imposed a bias. gross validation with the entire VI training set. The gross If a relevant allophone occur rarely in the training set, we validation for VA decision trees is somehow different than the shouldn't assign a large weight to it because the statistics of conventional cross validation which uses one part of the data it is not reliable. On the other hand, we could assign larger to grow the trees and the other part of independent data to weights to those relevant allophones with enough examples prune the trees in order to predict new contexts. Since rele- in the training data. In our experiments, we use a simple vant allophones is already only a small portion of the entire VI function based on the frequencies of relevant allophones. All training set, further dividing it will prevent the learning algo- the irrelevant allophones have the weight 1.0 and the weight rithm from generating reliable AV decision trees. 
Instead, we for relevant allophones is given by the following function: grow the VA decision trees very deeply; replace the entropy reduction information of each split by traversing through the 1 + loya(Z) where x is the frequency of relevant allophones trees with all the allophones (including irrelevant ones); and finally prune the trees based on the new entropy informa- a is chosen to be the minimum number of training examples tion. This will prune out those splits of nodes without enough to train a reasonable model in our configuration. training support (too few examples) even though they might be relevant to the target vocabulary. Therefore the resulting Imposing a bias toward the relevant allophones is similar to generalized allophonic models will become more trainable. duplicating the training data of relevant allophones. For ex- ample, using aprominent weight of 2.0 for an training example The vocabulary-adapted decision tree learning algorithm, in the Baum-Welch re-estimation is like observing the same emphasizing the relevant allophones during growing of the training example twice. Therefore, our vocabulary-bias train- decision trees and using the gross validation with the entire VI ing algorithm is identical to duplicating the training exam- training set provides an ideal mean for finding the equilibrium ples of relevant allophones according to the weight function. between adaptability for the target vocabulary and trainability Based on the same principle, this adaptation algorithm can be with the VI training database. applied to other non-HMM systems by duplicating the train- ing data of relevant allophones to make relevant allophones 170 become the majority of the training data during training. The 2. continuously transforming the testing speech spectral resulting models will then be tailored to those relevant aUo- vectors x, into normalized vectors Yi, so that the dis- phones. 
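The weight function above is simple enough to sketch directly. The base value a = 50 below is an illustrative placeholder, not the value used in the paper, and the function name is invented:

```python
import math

def prominence_weight(count, is_relevant, a=50):
    """Weight applied to one allophone's Baum-Welch statistics.
    Irrelevant allophones keep weight 1.0; a relevant allophone with
    `count` training examples gets 1 + log_a(count), so the bias grows
    with how reliably the allophone is observed in the VI training set.
    The base `a` is the minimum number of examples needed to train a
    reasonable model (50 is an assumed placeholder value)."""
    if not is_relevant or count < 1:
        return 1.0
    return 1.0 + math.log(count, a)

def weighted_posterior(gamma, count, is_relevant, a=50):
    """Accumulating w * gamma in re-estimation is equivalent to
    observing the same training example w times."""
    return prominence_weight(count, is_relevant, a) * gamma
```

With a = 50, a relevant allophone seen exactly 50 times gets weight 2.0 (counted twice), one seen 2500 times gets weight 3.0, and one seen only once gets weight 1.0, i.e. no extra bias for unreliable statistics.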
4 Environment Adaptation

It is well known that when a system is trained and tested under different environments, recognition performance drops considerably [8]. Yet training and testing under different environments is very likely for VI systems, because the VI models can be used for any task, which could take place anywhere. Even if the recording hardware remains unchanged (microphones, A/D converters, pre-amplifiers, etc.), the other environmental factors, such as room size, background noise, microphone position and reverberation from surface reflections, are all beyond our control. For example, when comparing the recording environments of Texas Instruments (TI) and Carnegie Mellon University (CMU), a few differences were observed although both used the same close-talking microphone (Sennheiser HMD-414):

• Recording equipment. TI and CMU used different A/D devices, filters and pre-amplifiers, which might change the overall transfer function and thus produce different spectral tilts on the speech signals.

• Room. The TI recording took place in a sound-proof room, while the CMU recording took place in a big laboratory with much background noise (mostly paper rustle, keyboard noise, and other conversations). Therefore, CMU's data tends to contain more additive noise than TI's.

• Input level. The CMU recording process always adjusted the amplifier's gain control for different speakers to compensate for their varied sound volume. Since the sound volume of TI's female speakers tends to be much lower, TI probably did not adjust the gain control as CMU did. Therefore, the dynamic range of CMU's data tends to be larger.

4.1 Codebook Adaptation

The speech signal processing of our VI system is based on a characterization of speech by a codebook of prototypical models [7]. Typically, the performance of a codebook-based system degrades as the speech signal drifts through environmental changes, because of the increased distortion between the speech and the codebook. Therefore, two possible adaptation strategies are:

1. continuously updating the codebook prototypes to fit the testing speech spectral vectors x_t;

2. continuously transforming the testing speech spectral vectors x_t into normalized vectors y_t, so that the distribution of the y_t is close to that of the training data described by the codebook prototypes.

Our first environment adaptation algorithm follows the first strategy, while the two cepstral normalization algorithms described in Section 4.2 follow the second.

Semi-continuous HMMs (SCHMMs), or tied-mixture continuous HMMs [9, 3], have been proposed to extend discrete HMMs by replacing the discrete output distributions with a combination of the original discrete output probability distributions and continuous pdf's of the codebooks. SCHMMs can jointly re-estimate both the codebooks and the HMM parameters to achieve an optimal codebook/model combination according to a maximum likelihood criterion during training. They have been applied to several recognition systems with improved performance over discrete HMMs [9, 3].

The codebooks of our vocabulary-independent system can be modified to optimize the probability that the vocabulary-independent HMMs generate the data from the new environment, according to the SCHMM framework. Let mu_i denote the mean vector of codebook index i in the original codebook; then the new mean vector mu_i' is obtained from the following equation:

    mu_i' = ( sum_{m,t} gamma_i^m(t) x_t ) / ( sum_{m,t} gamma_i^m(t) )        (1)

where gamma_i^m(t) denotes the posterior probability of observing codeword i at time t using HMM m for speech vector x_t. Note that we did not use continuous Gaussian pdf's to represent the codebooks in Equation 1. Each mean vector of the new codebook is computed from the acoustic vectors x_t weighted by the corresponding posterior probabilities from the discrete forward-backward algorithm, without involving any continuous pdf computation. The new data from the different environment, x_t, is automatically aligned with the corresponding codewords in the forward-backward training procedure. If a frame is not closely associated with a given codeword in the HMM training procedure, the re-estimation of that codeword is correspondingly de-weighted by the posterior probability gamma_i^m(t), so that the new codebook is adjusted to fit the new data.
4.2 Cepstral Normalization

The environmental factors which differ between TI's and CMU's recording environments can roughly be classified into two complementary categories:

1. additive noise: noise from different sources, such as paper rustle, keyboard noise and other conversations;

2. spectral equalization: distortion from the convolution of the speech signal with an unknown channel, due to factors such as microphone position and reverberation from surface reflections.

Acero et al. [1, 2] proposed a series of environment normalization algorithms based on joint compensation for additive noise and equalization. They have been implemented successfully in SPHINX to achieve robustness to different microphones. Among those algorithms, codeword-dependent cepstral normalization (CDCN) is the most accurate, while interpolated SNR-dependent cepstral normalization (ISDCN) is the most efficient.(1) In this study, we incorporate these two algorithms to make our vocabulary-independent system more robust to environmental variations.

    x = z - w(q, n)        (2)

Equation 2 is the environmental compensation model, where x, z, w, q and n represent, respectively, the normalized vector, the observed vector, the correction vector, the spectral equalization vector and the noise vector. The CDCN algorithm attempts to determine the q and n that make the ensemble of compensated vectors x collectively closest to the set of locations of legitimate VQ codewords; the correction vector w is then obtained with an MMSE estimator based on q, n and the codebook. In ISDCN, q and n are determined by an EM algorithm aimed at minimizing the VQ distortion, and the final correction vector w also depends on the instantaneous SNR of the current input frame through a sigmoid function.

(1) The reader is referred to [1] for the detailed CDCN and ISDCN algorithms.
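The role of the sigmoid in Equation 2 can be sketched schematically: at low SNR the frame is dominated by additive noise, so the correction approaches n; at high SNR the channel term dominates, so it approaches q. This is a simplified illustration only; the sigmoid parameters below are invented, and the real CDCN/ISDCN algorithms estimate q and n with MMSE/EM procedures as described in [1]:

```python
import math

def sigmoid(snr_db, center=10.0, slope=0.5):
    """Interpolation weight in [0, 1] from instantaneous SNR (dB).
    `center` and `slope` are illustrative values, not published ones."""
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - center)))

def normalize_frame(z, q, n, snr_db):
    """Equation 2, x = z - w(q, n), with the correction vector w
    interpolated between the noise vector n (low SNR) and the
    spectral equalization vector q (high SNR), ISDCN-style."""
    alpha = sigmoid(snr_db)
    return [zi - (alpha * qi + (1.0 - alpha) * ni)
            for zi, qi, ni in zip(z, q, n)]
```

At very high SNR the normalized frame approaches z - q (pure channel compensation); at very low SNR it approaches z - n (pure noise subtraction).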
5 Experiments and Results

All experiments are evaluated on the speaker-independent DARPA Resource Management task. This is a 991-word continuous-speech task, and a standard word-pair grammar with perplexity 60 was used throughout. The test set, TI-TEST, consists of 320 sentences from 32 speakers (a random selection from the June 1988, February 1989 and October 1990 DARPA evaluation sets). In order to isolate the influence of cross-environment recognition, an identical test set, CMU-TEST, was collected at CMU from 32 speakers (different from the TI speakers).

Our baseline uses the 4-codebook discrete SPHINX system with decision-tree based generalized allophones as the VI subword units [7]. Table 1 shows that about 9% error reduction is achieved by adapting the decision trees to the Resource Management task, while about 15% error reduction is achieved by vocabulary-bias training on the same task. Nevertheless, when we tried to combine the two adaptation algorithms to tailor the vocabulary-independent models to the Resource Management task still further, no compound improvement was produced. This may be because both algorithms learn similar characteristics of the target task, or because their combination already reaches the limit of adaptation capability of our modeling technique without the help of vocabulary-specific data.

    Condition                  Error Rate    Error Reduction
    Baseline                   5.4%          N/A
    +VA decision trees         4.9%          9.3%
    +VB training               4.6%          14.8%
    +VA trees & VB training    4.6%          14.8%

Table 1: Results on Resource Management using the vocabulary-adapted decision tree and vocabulary-bias training algorithms

In the codebook adaptation experiments, the 4 codebooks used in our HMM-based system are updated according to Equation 1. We randomly selected 100, 300, 1000 and 2000 sentences from the TIRM database to form different adaptation sets. Two iterations were carried out on each adaptation set to estimate the new codebooks for TI's data, while the HMM parameters were held fixed. Table 2 shows the recognition results on the TI test set. Only marginal improvement is obtained by adapting the codebooks to the new environment, even with a large amount of adaptation data. This suggests that adapting the codebooks alone fails to produce adequate adaptation, because the HMM statistics used by the recognizer have not been updated.

    Adaptation Sentences    CMU-TEST    TI-TEST
    Baseline                5.4%        7.4%
    100                     N/A         7.1%
    300                     N/A         7.0%
    1000                    N/A         7.0%
    2000                    N/A         6.9%

Table 2: Vocabulary-independent results on TI-TEST after adapting the codebooks to TI's data

Table 3 shows the recognition error rates on the two test sets for VI systems incorporating CDCN and ISDCN. Note that our VI training set was recorded at CMU. The degradation of cross-environment recognition on TI-TEST is roughly reduced by 50%. As with most environment normalization algorithms, there is also a minor performance degradation for same-environment recognition when gaining robustness to other environments.

    Test Set    CMU-TEST    TI-TEST
    Baseline    5.4%        7.4%
    CDCN        5.6%        6.4%
    ISDCN       5.7%        6.5%

Table 3: Results for environment normalization using CDCN and ISDCN

Acknowledgements

This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 5167, under contract number N00039-85-C-0163. The authors would like to express their gratitude to Professor Raj Reddy and the CMU speech research group for their support.
6 Conclusions

In this paper, we have presented two vocabulary adaptation algorithms, vocabulary-adapted decision trees and vocabulary-bias training, that improve the performance of the vocabulary-independent system on the target task by tailoring the VI subword models to the target vocabulary. In the 91' DARPA Speech and Natural Language Workshop [7], we showed that our VI system was already slightly better than our VD system. With these two adaptation algorithms, which led to 9% and 15% error reductions respectively on the Resource Management task, the resulting VI system is far more accurate than our VD system. In [8], we demonstrated improved vocabulary-independent results with vocabulary-specific adaptation data. In the future, we plan to extend our adaptation algorithms with the help of vocabulary-specific data to achieve further adaptation to the target vocabulary (task).

CDCN and ISDCN have been successfully incorporated into the vocabulary-independent system and reduce the degradation of VI cross-environment recognition by 50%. In the future, we will keep investigating new environment normalization techniques to further reduce the degradation and ultimately achieve full environmental robustness across different acoustic environments. Moreover, environment adaptation with environment-specific data will also be explored, for adapting the VI system to a new environment once we have more knowledge about it.

Making a speech recognition system robust to new vocabularies and new environments is essential to making speech recognition applications feasible. Our results have shown that plentiful training data, careful subword modeling (decision-tree based generalized allophones) and suitable environment normalization can compensate for the lack of vocabulary- and environment-specific training. With the additional help of vocabulary adaptation, the vocabulary-independent system can be tailored to any task quickly and cheaply, and therefore facilitates speech applications tremendously.

References

[1] Acero, A. Acoustical and Environmental Robustness in Automatic Speech Recognition. Department of Electrical Engineering, Carnegie Mellon University, September 1990.

[2] Acero, A. and Stern, R. Environmental Robustness in Automatic Speech Recognition. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, 1990, pp. 849-852.

[3] Bellegarda, J. and Nahamoo, D. Tied Mixture Continuous Parameter Models for Large Vocabulary Isolated Speech Recognition. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, 1989, pp. 13-16.

[4] Brown, P. The Acoustic-Modeling Problem in Automatic Speech Recognition. Computer Science Department, Carnegie Mellon University, May 1987.

[5] Hon, H. Vocabulary-Independent Speech Recognition: The VOCIND System. School of Computer Science, Carnegie Mellon University, February 1992.

[6] Hon, H. and Lee, K. CMU Robust Vocabulary-Independent Speech Recognition System. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Ontario, Canada, 1991, pp. 889-892.

[7] Hon, H. and Lee, K. Recent Progress in Robust Vocabulary-Independent Speech Recognition. In: DARPA Speech and Language Workshop, Morgan Kaufmann Publishers, Asilomar, CA, 1991.

[8] Hon, H. and Lee, K. Towards Speech Recognition Without Vocabulary-Specific Training. In: DARPA Speech and Language Workshop, Morgan Kaufmann Publishers, Cape Cod, MA, 1989.

[9] Huang, X., Lee, K., and Hon, H. On Semi-Continuous Hidden Markov Modeling. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM, 1990, pp. 689-692.

[10] Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37 (1989), pp. 328-339.
