SYLLABLE-BASED LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION

Aravind Ganapathiraju, Jonathan Hamaker, Mark Ordowski, George R. Doddington (1), Joseph Picone

Institute for Signal and Information Processing, Mississippi State University, Mississippi State, MS 39762
Department of Defense, 9800 Savage Road, Ft. George G. Meade, MD 20755

ABSTRACT

Most large vocabulary continuous speech recognition (LVCSR) systems in the past decade have used a context-dependent phone as the fundamental acoustic unit. In this paper, we present one of the first robust LVCSR systems that uses a syllable-level acoustic unit for LVCSR on telephone-bandwidth speech. This effort is motivated by the inherent limitations in phone-based approaches, namely the lack of an easy and efficient way for modeling long-term temporal dependencies. A syllable unit spans a longer time frame, typically three phones, thereby offering a more parsimonious framework for modeling pronunciation variation in spontaneous speech. We present encouraging results which show that a syllable-based system exceeds the performance of a comparable triphone system both in terms of word error rate (WER) and complexity. The WER of the best syllable system reported here is 49.1% on a standard SWITCHBOARD evaluation, a small improvement over the triphone system. We also report results on a much smaller recognition task, OGI Alphadigits, which was used to validate some of the benefits syllables offer over triphones. The syllable-based system exceeds the performance of the triphone system by nearly 20%, an impressive accomplishment since the alphadigits application consists mostly of phone-level minimal pair distinctions.

EDICS: SA 1.6.2

CORRESPONDENCE: Aravind Ganapathiraju, Institute for Signal and Information Processing, Department of Electrical and Computer Engineering, PO Box 9571, Mississippi State University, Mississippi State, MS 39762. Phone: (601) 325-8335. Fax: (601) 325-3149. Email: [email protected]

(1) G. R. Doddington is now with the Information Technology Laboratory of the National Institute of Standards and Technology.

1. INTRODUCTION

For at least a decade the triphone has been the dominant method of modeling acoustics in speech recognition. However, triphones are a relatively inefficient decompositional unit due to the large number of triphone patterns with a nonzero probability of occurrence, leading to systems that require vast amounts of memory for model storage, and numerous models with poorly trained parameters. Moreover, since a triphone unit spans an extremely short time interval, integration of spectral and temporal dependencies is not easy. For applications such as the SWITCHBOARD (SWB) Corpus [1], where performance of phone-based approaches has stagnated over the past few years, we have shifted our focus to a larger acoustic context [2]. The syllable is one such acoustic unit whose appeal lies in its close connection to human speech perception and articulation, its integration of some coarticulation phenomena, and the potential for a relatively compact representation of conversational speech.

For example, consider the phrase "Did you get much back?" shown in Figure 1. This example (2) has been excised from a conversation in SWB. The first two words, "Did you," are merged into the word "get," resulting in a pronunciation "jh y uw g eh." Previous approaches to model such behavior involved the use of context-dependent (CD) phones [3]. However, since phones are deleted (or often extremely reduced), what is needed is higher-level information predicting the deletion of the phone.

(2) This example is available at http://www.isip.msstate.edu/projects/switchboard/faq/data/example_023.
Modeling words as sequences of phones, though logical, is not justified when we try to derive a one-to-one mapping between the acoustics and the phone. Recent attempts at pronunciation modeling [4,5] have demonstrated limited success at modeling such phenomena. For example, in SWB, as many as 80 different pronunciations of the word "and" have been labelled [6]. The example in Figure 1, though an extreme case, demonstrates the challenges for explicit phone-based pronunciation modeling in conversational speech.

A syllable, on the other hand, seems to be an intuitive unit for the representation of speech sounds. Without much difficulty listeners identify the number of syllables in a word [6] and, with a high degree of agreement, even the syllable boundaries. It is our conjecture in this paper that such behavior makes the syllable a more stable acoustic unit for speech recognition. The stability and robustness of syllables in the English language is further supported by comparing the phone and syllable deletion rates in conversational speech. In the analysis of data from a transcription project on SWB [6,7], it was estimated that the deletion rate for phones was 12%, compared to a deletion rate for syllables of less than 1%. The example in Figure 1 supports this observation. This suggests that explicit pronunciation modeling becomes a necessary feature of phone-based systems to accommodate the high phone deletion rate in conversational speech, and can perhaps be circumvented by a syllable-based system.

The use of an acoustic unit with a longer duration facilitates exploitation of temporal and spectral variations [8] simultaneously. Parameter trajectories and multi-path HMMs [9] are examples of techniques that can exploit the longer acoustic context, and yet have had marginal impact on phone-based systems. Recent research on stochastic segment modeling of phones [10] demonstrates that recognition performance can be improved by exploiting correlations in spectral and temporal structure. We believe that applying these ideas to a syllable-sized unit, which has a longer acoustic context, will result in significant improvements in speech recognition performance while maintaining a manageable level of complexity.

In this paper, we describe a syllable-based system and compare its performance with word-internal and cross-word triphone systems on two publicly available databases: SWITCHBOARD (SWB) [11] and Alphadigits (AD) [12]. Note that these evaluations span the spectrum of telephone-based continuous speech recognition applications, from spoken letters and digits to conversational speech. We focus on aspects of the syllable system which are significantly different from triphone systems, such as the model inventory and lexicon. We demonstrate some improvements that were achieved using monosyllabic word models and finite duration models.

2. BASELINE SYSTEMS

There are many ways to incorporate syllable information into a speech recognition system. Below, we describe several approaches that integrate syllable acoustic models with existing phone-based systems. We also describe the phone-based systems used as baselines, since performance on these demanding applications depends significantly on the complexity of the technology used.

2.1. DESIGN OF A SYLLABLE-BASED LEXICON

By definition, a syllable spans a longer period of time compared to a phonemic unit [13].
In English, we can categorize different types of syllables using their consonant (C) and vowel (V) content. For example, a syllable labeled CVC consists of a sequence of a consonant, vowel and consonant. An example of a CVC syllable is "_t_eh_n," pronounced "ten." Though other forms of syllables exist (for example, VVC and CCVVC), the CVC and CV syllables cover close to three-quarters of SWB [6]. The syllable is defined based on human perception and speech production phenomena, typically assisted by stress patterns within a word. For example, the word "atlas" intuitively appears to consist of two distinct segments. It is these segments that are called syllables. For ease of representation, syllables are typically represented as a concatenation of the baseform phones comprising the segment (_ae_t l_ax_s). This does not, however, mean that the acoustic segments always contain the phone sequence in its entirety.

Research conducted during the 1996 LVCSR Summer Workshop at the Center for Language and Speech Processing at Johns Hopkins University demonstrated that the use of syllable-level information and stress markings can reduce word error rate in triphone-based systems [14]. A by-product of this work was a high-quality dictionary annotated with stress and syllable information. The annotated dictionary was developed from a baseline dictionary of about 90,000 words. Stress markings were obtained from the Pronlex dictionary [15]. Pronunciations were marked for primary stress and secondary stress. Syllabifications were introduced automatically using software provided by NIST [16]. This software implements syllabification principles [13] based on permitted syllable-initial consonant clusters, syllable-final consonant clusters and prohibited onsets. In order to keep the number of pronunciations and syllable units manageable, only one syllabification per word was used.

One complication in using syllables is the existence of ambisyllabic consonants: consonants at syllable boundaries that belong, in part, to both the preceding and the following syllable. Examples of this phenomenon, as they appear in the syllable lexicon described above, are:

ABLE → _ey_b# _#b_ax_l
GRAMMAR → _g_r_ae_m# _#m_er

The "#" denotes ambisyllabicity and is used as a tag for the phone which is the most plausible boundary. Though no clear syllable boundary exists, it makes sense to assume that the above examples consist of two syllables each. The most commonly occurring variant for each word was chosen and the ambisyllabic markings were retained. Hence, some syllables appear in our syllabary multiple times, with the additional entries containing the ambisyllabic marker "#" (which can appear at the beginning or end of the syllable).

For the systems discussed in this paper, stress was ignored for two reasons. First, our goal was to keep the baseline system as simple as possible and to prevent an abundance of undertrained acoustic parameters. Second, the value of lexical stress information seemed questionable. Even though stress plays an important prosodic role in English, the use of stress marks would increase the number of syllables by an order of magnitude and would induce a combinatorial explosion of parameters. Our syllabified lexicon for SWB consisted of about 70,000 word entries and required 9,023 syllables for complete coverage of the 60+ hour training data [17]. As shown in Figure 2, 3% of the total number of syllables appearing in SWB, which is approximately 275 syllables, cover over 80% of the syllable tokens in the training data. Of the 70,000 words represented in the lexicon, approximately 40% have at least one ambisyllabic representation.
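The coverage statistic cited above (a few hundred syllables accounting for most syllable tokens) can be computed directly from a syllabified lexicon and the training transcriptions. The following Python sketch is illustrative only and is not part of the original system; the lexicon and word-list structures are hypothetical.

    from collections import Counter

    def syllable_coverage(syllabified_lexicon, training_words, target=0.80):
        # syllabified_lexicon: hypothetical dict mapping a word to its list of
        # syllable labels, e.g. {"ten": ["_t_eh_n"]}
        # training_words: the sequence of word tokens in the training transcriptions
        counts = Counter()
        for word in training_words:
            counts.update(syllabified_lexicon.get(word, []))
        total = sum(counts.values())
        if total == 0:
            return 0, 0.0
        covered, needed = 0, 0
        for syllable, count in counts.most_common():
            covered += count
            needed += 1
            if covered >= target * total:
                break
        return needed, covered / total

Running such a count over the training set is what produces curves like the one in Figure 2, where a small fraction of the 9,023-syllable inventory accounts for the bulk of the tokens.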
2.2. BASELINE TRIPHONE SYSTEM

All systems described in this paper are based on a standard LVCSR system developed from a commercially available package, HTK [18]. This system, though extremely powerful and flexible, did not support cross-word decoding for an application as large as SWB. Considering the exploratory nature of this work, we decided not to use context dependency modeling in our syllable systems. Context-dependent syllable models would introduce a few million models into the system, and this, in turn, would necessitate the use of clustering and/or state-tying. Many other features standard in state-of-the-art LVCSR systems [19], such as speaker adaptation and vocal tract length normalization, were similarly excluded from our study. Such techniques provide well-calibrated improvements in recognition performance, yet they add to the overall system complexity, complicate the system optimization process, and require a deeper understanding of a baseline system's performance before they can be successfully introduced. Therefore, our baseline syllable system is a word-internal context-independent system, while our baseline phone systems are word-internal context-dependent systems.

The phone-based system follows a fairly generic strategy for triphone training. The training procedure is essentially a four-stage process consisting of:

1. Flat-start monophone training: generation of monophone seed models with nominal values, and reestimation of these models using reference transcriptions.

2. Baum-Welch training of monophones: adjustment of the silence model, and reestimation of single-Gaussian monophones using the standard Viterbi alignment process.

3. Triphone creation: creation of triphone transcriptions from monophone transcriptions, initial triphone training, triphone clustering, state tying, and training of state-tied triphones.

4. Mixture generation: splitting of single Gaussian distributions into mixture distributions using an iterative divide-by-two clustering algorithm, and reestimation of triphone models with mixture distributions (a sketch of this splitting step is given at the end of this section).

The first two stages of training produce a context-independent monophone system. This system uses 42 monophones, a silence model and a word-level silence model (short pause). All phone models are 3-state left-to-right HMMs without skip states, and use a single Gaussian observation distribution. Ten hours of data was used for the flat-start process. This data was chosen to span the variability in the corpus. A new silence model was created at this stage which had additional transitions intended to create a more robust model capable of absorbing impulsive noises (common in the training data). In addition, a Viterbi alignment of the monophone transcriptions was performed based on the fully-trained monophone models. The monophone models were reestimated using these Viterbi alignments.

A context-dependent (CD) phone system was then bootstrapped from the context-independent (CI) system. The single-Gaussian monophone models generated from the CI system were clustered and used to seed the triphone models. Four passes of Baum-Welch reestimation were used to reestimate the triphone model parameters. The number of Gaussians was, however, reduced by tying states [20]. Finally, these models were increased to eight Gaussians per state using a standard divide-by-two clustering algorithm. The resulting system had 81,314 virtual triphones (i.e., all triphones possible using an inventory of 44 phones and the lexicon), 11,344 real triphones, 6,125 states and 8 Gaussian mixture components per state.
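The divide-by-two mixture growing used in stage 4 can be sketched as follows. This is a minimal illustration of the general technique, assuming diagonal-covariance Gaussians stored as NumPy arrays; it is not the HTK implementation used in the experiments.

    import numpy as np

    def split_mixtures(means, variances, weights, perturbation=0.2):
        # Clone every component, perturb the two copies in opposite directions
        # along the standard deviation, and share the original weight equally.
        offset = perturbation * np.sqrt(variances)
        new_means = np.concatenate([means + offset, means - offset])
        new_variances = np.concatenate([variances, variances])
        new_weights = np.concatenate([weights, weights]) / 2.0
        return new_means, new_variances, new_weights

Applied iteratively, with Baum-Welch reestimation after each doubling, this takes a single Gaussian per state to 2, 4 and finally 8 mixture components per state.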
2.3. PRELIMINARY SYLLABLE SYSTEM

The preliminary syllable system consisted of 9,023 syllable models. A standard left-to-right model topology with no skip states was used. The number of states in each model was set to one-half the median duration of the syllable, measured using a 10 msec frame duration. The duration information for each syllable was obtained from a Viterbi alignment based on a state-of-the-art triphone system. Syllable models were trained in a manner analogous to the baseline triphone system, excluding the triphone clustering stage (unnecessary for a context-independent system). The resulting models, like the baseline phone system, had 8 Gaussian mixture components per state.

The models in this system, however, were poorly trained due to the nature of SWB. Of the 9,000 syllables appearing in the training database, over 8,000 have fewer than 100 training tokens. From Figure 2 it is clear that a small portion of the syllabary is sufficient to cover nearly all of the database: 275 syllables cover 80% of the database. Hence, we chose to evaluate our approach using a system consisting of 800 syllables, replacing the remaining poorly trained syllables in the lexicon by their phonemic representation. An example is the word "access," which is represented in this hybrid system as "_ae_k s eh s." The symbol "_ae_k" represents a syllable model, while "s" and "eh" represent its phone constituents. Approximately 10% of the entries in the lexicon have syllable-only representations. Note that the phones used for recognition were not trained in the context of syllables, but were trained separately as CI phones using 32 Gaussian mixture components per state. All of these phone models, which we refer to as "glue" phones, consisted of three-state left-to-right models. This methodology of combining syllables and phones is not entirely appropriate and is the subject of ongoing research, since it is a combination of disparate model sets estimated in isolation. However, this approach represents a pragmatic solution to the problem of poorly trained acoustic models.

3. ENHANCED SYSTEMS

Though the use of syllables in conjunction with phones in lexical representations circumvented the problem of undertrained syllable models, model mismatch at phone-syllable junctions was still a significant problem. Below, we describe several modifications that were made to address this problem.

3.1. HYBRID SYLLABLE SYSTEM

We first approached the mismatch issue by building a system consisting of the 800 most frequent syllables and CI phones from scratch (rather than bootstrapping the models). Several important issues such as ambisyllabicity and resyllabification were ignored in this process. For example, if a syllable with an ambisyllabic marker was to be replaced by its phone representation, we ignored the marker altogether (for example, "shading" became "sh ey d d_ih_ng"). The number of states for each syllable model was proportional to its median duration, while the phone models used standard three-state left-to-right topologies with no skip states. The final models had 8 Gaussian mixture components per state.

An evaluation of the system on the 2,427-utterance test set resulted in a 57.8% word error rate (WER). An analysis of the errors occurring in this experiment revealed that a very high percentage of words with an all-phone lexical representation or a mixed syllable/phone lexical representation were in error. Table 1 provides an analysis of the errors by word category. This analysis motivated the development of a hybrid system using syllables and word-internal triphones.
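The lexicon manipulation described above — keeping only the most frequent syllables as whole units and spelling everything else out as phones — can be sketched as follows. This is an illustrative Python fragment, not the original tooling; the lexicon and count structures are hypothetical, and the syllable-to-phone expansion simply splits on the underscore separators and drops the "#" ambisyllabic markers.

    def build_hybrid_lexicon(syllabified_lexicon, syllable_counts, n_kept=800):
        # Keep the n_kept most frequent syllables as acoustic units; every other
        # syllable is replaced by its constituent phones ("glue" phones).
        kept = {syl for syl, _ in syllable_counts.most_common(n_kept)}
        hybrid = {}
        for word, syllables in syllabified_lexicon.items():
            units = []
            for syl in syllables:
                if syl in kept:
                    units.append(syl)
                else:
                    # e.g. "_ae_k" -> ["ae", "k"]; ambisyllabic '#' tags are dropped
                    units.extend(p for p in syl.replace("#", "").split("_") if p)
            hybrid[word] = units
        return hybrid

For a word such as "access," a frequent syllable like "_ae_k" survives as a unit while a rare syllable is expanded into glue phones, matching the hybrid entry "_ae_k s eh s" above.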
Following the approach above, CI phones were replaced with the corresponding CD phones (for example, "_ah_n k" became "_ah_n n-k", and "p _t_ih_ng" became "p+t _t_ih_ng"). Syllable models from the baseline syllable system and triphone models from the baseline word-internal triphone system were combined and reestimated using 4 passes of Baum-Welch over the entire training database. Table 2 gives the performance of the hybrid system with CD phones.

3.2. MONOSYLLABIC WORD MODELING

One reason SWB is a difficult corpus for LVCSR is the variability in word pronunciations. Since the syllable is a longer acoustic unit than the phone, the need to explicitly provide pronunciations for all variants can be alleviated. It is possible for the syllable model to automatically absorb the acoustic variation in the pronunciation of a word or syllable within the model parameters. A closer look at the training data in terms of its word content revealed some interesting facts. Table 3 shows the distribution of words in the 60+ hour training set. There were a total of 529 monosyllabic words in the training data. However, these 529 monosyllabic words covered 75% of the total number of word tokens in the training set. The top 200 monosyllabic words covered 71% of the total number of word tokens in the training set. Additionally, 82% of the recognition errors when using the hybrid syllable system were monosyllabic words. This suggested the need to explicitly model monosyllabic words.

In the monosyllabic word system, monosyllabic words with multiple pronunciations were represented by one model that covered all pronunciations. Table 4 provides some examples of this modification. Another example not mentioned in the table is the word "and." In the lexicon its only pronunciation is "_ae_n_d." In conversational speech, however, common alternative pronunciations can have deletions of "_ae," "_d," or both. Using the larger acoustic unit, the word model is not dependent on the lexical realization, and variation in pronunciation can be modeled by the HMM structure directly.

However, we decided to use separate models for words with different baseforms. Table 5 provides some examples of this modification. In spontaneous conversational speech, the monosyllabic word "no" has a significant durational variation depending on its position in a sentence. It is unlikely that the monosyllabic word "know" has this same characteristic. The difference in the duration of these models is, on average, 80 ms. This difference therefore necessitates two separate models for these homonyms, with a different number of states in each model. The word model "know" was constructed with 9 states and the word model "no" had 13 states. Another example of this type of monosyllabic word is "to," which is more likely to be pronounced as "_t_ax" than "_t_uw." In this case the number of states in the word model "to" is 4, compared to 10 states for the word "two" and 11 states for the word "too."
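The state counts quoted above follow the one-half-median-duration rule introduced earlier, computed from forced-alignment durations for each word model. Below is a minimal sketch, assuming the alignment durations are available as lists of frame counts; the variable names are hypothetical and not taken from the original system.

    import numpy as np

    def states_from_durations(durations_in_frames, min_states=3):
        # Number of HMM states = one-half the median duration in 10 ms frames.
        median = float(np.median(durations_in_frames))
        return max(min_states, int(round(median / 2.0)))

    # Homonyms with different durational behavior get separate word models, e.g.
    # n_states_no   = states_from_durations(aligned_frames["no"])     # 13 in the text above
    # n_states_know = states_from_durations(aligned_frames["know"])   #  9 in the text above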
Yet another ramification of introducing monosyllabic word models is the relationship between the word models and the syllables that previously represented them. The 800 syllables in the baseline system were replaced by 200 word models plus 632 syllables. Some of the syllables were trained only from monosyllabic word tokens, while others had training tokens from both monosyllabic and polysyllabic words. However, when using word models, some of the original syllables would have insufficient training material to reliably train both a word model and a syllable model. In other words, most of the occurrences of a given syllable were as a monosyllabic word, and separating the two kinds of occurrence resulted in a poorly trained syllable version of the model. The 200 word models were seeded with the most frequent syllable representation for each word.

The number of states in the syllable and word models was reestimated by relabeling the forced alignments with the 632 syllables and 200 word models. As before, the number of states for each model was one-half the median duration. The seed models of this monosyllabic word system (200 word models + 632 syllables + word-internal triphones), which were obtained from the hybrid syllable system discussed in the previous section, were reestimated using 4 iterations of Baum-Welch reestimation on the 60+ hour training set.

3.3. FINITE DURATION MODELING

As previously explained, a syllable is expected to be durationally more stable than a phone. However, when we examined the forced alignments produced with our hybrid syllable system, we noticed very long tails in the duration histograms for many syllables. The duration histogram for the syllable "_ae_n_d" is shown in Figure 3. The peak at the 8th frame in Figure 3 is an artifact of the requirement that all instances of the unit during forced alignment must be at least 8 frames long (the number of states in the model), since the models do not have skip states. We also observed a very high word deletion rate. The deletions are somewhat attributable to the long tails in the duration histograms of syllable models. These facts suggested a need for additional durational constraints on our models.

To explore the importance of durational models, we decided to evaluate a finite duration [21] topology. The finite duration topology we chose is shown in Figure 4. A model was created by using the corresponding infinite duration model as a seed, and replicating each state in the finite duration model P times, where P is obtained from

    P = E[S] + 2 · stddev(S) ,    (1)

and S is the number of frames that have been mapped to that state for a given syllable token. The number of times the state is replicated is roughly proportional to the self-loop probability for the given state. Assuming a Gaussian distribution for the duration of a syllable, the above equation guarantees that at least 90% of the training tokens for the syllable can be explained by the estimated probability.
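Equation (1) translates into a simple computation over the per-state frame counts observed in forced alignment. The sketch below is illustrative only, with hypothetical data structures rather than the original tooling, and it assumes the replicated states reuse the output distribution of the seed state from the infinite duration model.

    import numpy as np

    def finite_duration_replications(frames_per_state):
        # frames_per_state: for each emitting state, a list giving the number of
        # frames mapped to that state for each training token of the syllable.
        replications = []
        for samples in frames_per_state:
            s = np.asarray(samples, dtype=float)
            p = int(np.ceil(s.mean() + 2.0 * s.std()))   # P = E[S] + 2 * stddev(S)
            replications.append(max(1, p))
        return replications

Under the Gaussian-duration assumption stated above, allowing up to P copies of each state covers at least the 90% of training tokens cited in the text.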
