Toward Spontaneous Speech Recognition and Understanding

Sadaoki Furui
Tokyo Institute of Technology
Department of Computer Science
2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552 Japan
Tel/Fax: +81-3-5734-3480
[email protected]
http://www.furui.cs.titech.ac.jp/

0205-03 Outline
• Fundamentals of automatic speech recognition
• Acoustic modeling
• Language modeling
• Database (corpus) and task evaluation
• Transcription and dialogue systems
• Spontaneous speech recognition
• Speech understanding
• Speech summarization

0010-11 Speech recognition technology
[Figure: speech recognition applications plotted by speaking style (vertical axis: isolated words, connected speech, read speech, fluent speech, spontaneous speech) against vocabulary size in number of words (horizontal axis: 2, 20, 200, 2000, 20000, unrestricted), with frontier curves labeled 1980, 1990 and 2000. Applications shown: voice commands, digit strings, name dialing, directory assistance, form fill by voice, office dictation, word spotting, system-driven dialogue, network agent & intelligent messaging, transcription, 2-way dialogue, natural conversation.]

0201-01 Categorization of speech recognition tasks
• Category I (human to human, dialogue): Switchboard, CallHome (Hub5), meeting task
• Category II (human to human, monologue): Broadcast news (Hub4), lecture, presentation, voicemail
• Category III (human to machine, dialogue): ATIS, Communicator, information retrieval, reservation
• Category IV (human to machine, monologue): Dictation

Major speech recognition applications
• Conversational systems for accessing information services
  – Robust conversation using wireless handheld/hands-free devices in the real mobile computing environment
  – Multimodal speech recognition technology
• Systems for transcribing, understanding and summarizing ubiquitous speech documents such as broadcast news, meetings, lectures, presentations and voicemails

0010-12 Mechanism of state-of-the-art speech recognizers
[Block diagram: Speech input → Acoustic analysis → feature vectors $x_1 \ldots x_T$ → Global search → Recognized word sequence.]
• Global search: maximize $P(x_1 \ldots x_T \mid w_1 \ldots w_k)\, P(w_1 \ldots w_k)$ over $w_1 \ldots w_k$, which is equivalent to maximizing the posterior $P(w_1 \ldots w_k \mid x_1 \ldots x_T)$
• Phoneme inventory and pronunciation lexicon supply $P(x_1 \ldots x_T \mid w_1 \ldots w_k)$
• Language model supplies $P(w_1 \ldots w_k)$

0010-13 State-of-the-art algorithms in speech recognition
[Same block diagram as the previous slide, annotated with typical techniques.]
• Acoustic analysis: LPC or mel cepstrum, time derivatives, auditory models
• Cepstrum subtraction, SBR, MLLR
• Phoneme inventory: context-dependent, tied-mixture sub-word HMMs, learning from speech data
• Pronunciation lexicon
• Global search: frame synchronous, beam search, stack search, fast match, A* search
• Language model: bigram, trigram, FSN, CFG

0104-05 Digital sound spectrogram
[Figure: four panels over a common time axis (0.0 to 0.9): power (0 to 80 dB), spectrum (0 to 6 kHz), waveform, and pitch (0 to 15 ms).]

0109-22 Feature vector (short-time spectrum) extraction from speech
[Figure: a time window of one frame length slides along the signal in steps of one frame period; each frame yields one feature vector.]

0112-11 Spectral structure of speech
[Figure: the short-term speech spectrum (log magnitude against frequency f) decomposed into the spectral fine structure, governed by F0 (fundamental frequency), and the spectral envelope, governed by the resonances (formants). The envelope is labeled "What we are hearing".]

0112-12 Relationship between logarithmic spectrum and cepstrum
[Figure: the logarithmic spectrum is mapped to the cepstrum by the IDFT.]
• Spectral fine structure: fast periodical function of f
• Spectral envelope: slow periodical function of f
• The two components concentrate at different positions along the cepstrum axis t

0103-09 Comparison of spectral envelopes by LPC, LPC cepstrum, and FFT cepstrum methods
[Figure: log amplitude (0 to 10) against frequency (0 to 3 kHz). Curves: short-time spectrum, spectral envelope by LPC, spectral envelope by LPC cepstrum, spectral envelope by FFT cepstrum.]

0103-10 Cepstrum and delta-cepstrum coefficients
[Figure: parameter (vector) trajectory, showing the instantaneous vector (cepstrum) and the transitional (velocity) vector (delta-cepstrum).]

0103-11 MFCC-based front-end processor
[Block diagram: Speech → FFT → FFT-based spectrum → Mel-scale triangular filters → Log → DCT → Acoustic vector, with Δ and Δ² coefficients appended.]

0104-08 Structure of phoneme HMMs
[Figure: chain of phoneme models (Phoneme k-1, Phoneme k, Phoneme k+1) over time. Each phoneme model has three states, numbered 1, 2, 3, with output probabilities $b_1(x)$, $b_2(x)$, $b_3(x)$ over feature vectors x. Transition probabilities shown in the figure: 0.2, 0.4, 0.7, 0.5, 0.6, 0.3, 0.3.]

0104-07 Units of speech (after J. Makhoul & R. Schwartz)
[Figure: the words "grey whales" shown at successive levels: words, phonemes, allophones, allophone models, spectrogram (frequency in kHz), and speech signal (amplitude), against time in seconds.]

0103-14 Language model is crucial!
• Rudolph the red nose reindeer.
• Rudolph the Red knows rain, dear.
• Rudolph the Red Nose reigned here.
• This new display can recognize speech.
• This nudist play can wreck a nice beach.
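
The homophone examples above can be tied back to the global search criterion, which maximizes P(x|w) P(w) over word sequences. The following is a minimal sketch, not the method of any particular recognizer: the training text, the acoustic log-likelihoods, and the function names (`train_bigram`, `log_prob`, `recognize`) are all hypothetical, and add-one smoothing is assumed only to keep the example short.

```python
import math
from collections import Counter

# Hypothetical toy training text; a real language model uses a large corpus.
CORPUS = [
    "this new display can recognize speech",
    "the system can recognize speech",
    "we can recognize speech",
    "a nice beach",
]

def train_bigram(sentences):
    """Count unigrams and bigrams, with <s> and </s> as sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    """log P(w_1 ... w_k) under an add-one smoothed bigram model."""
    vocab_size = len(unigrams) + 1  # +1 reserves probability mass for unseen words
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total

def recognize(hypotheses, unigrams, bigrams):
    """Return the word sequence maximizing log P(x|w) + log P(w)."""
    best = max(hypotheses, key=lambda h: h[1] + log_prob(h[0], unigrams, bigrams))
    return best[0]

# (word sequence, hypothetical acoustic log-likelihood log P(x|w))
HYPOTHESES = [
    ("this new display can recognize speech", -120.0),
    ("this nudist play can wreck a nice beach", -119.5),
]

if __name__ == "__main__":
    unigrams, bigrams = train_bigram(CORPUS)
    print(recognize(HYPOTHESES, unigrams, bigrams))
```

With these made-up numbers the acoustic score alone slightly prefers the second hypothesis, and the language model term reverses the decision because most of its word pairs never occur in the training text.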
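
The delta-cepstrum and MFCC front-end slides earlier in this section append transitional features (Δ and Δ²) to each instantaneous cepstral vector. The slides do not say how the derivative is estimated; the sketch below assumes one common choice, a regression slope over a window of ±2 frames with edge frames repeated. The function names `delta` and `add_dynamic_features` and the window size are illustrative assumptions.

```python
def delta(frames, half_window=2):
    """Regression-slope estimate of the time derivative of each coefficient.

    frames: list of equal-length lists (one cepstral vector per frame).
    Assumed formula: d_t = sum_k k * c_{t+k} / sum_k k^2, k = -K..K.
    """
    if half_window < 1:
        raise ValueError("half_window must be at least 1")
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    norm = sum(k * k for k in range(-half_window, half_window + 1))
    out = []
    for t in range(n):
        vec = []
        for d in range(dim):
            acc = 0.0
            for k in range(-half_window, half_window + 1):
                idx = min(max(t + k, 0), n - 1)  # repeat the first/last frame at the edges
                acc += k * frames[idx][d]
            vec.append(acc / norm)
        out.append(vec)
    return out

def add_dynamic_features(frames, half_window=2):
    """Concatenate static, delta, and delta-delta coefficients per frame."""
    d1 = delta(frames, half_window)
    d2 = delta(d1, half_window)
    return [c + v + a for c, v, a in zip(frames, d1, d2)]

if __name__ == "__main__":
    # Hypothetical 2-dimensional "cepstra": one coefficient ramps, one is constant.
    frames = [[float(t), 2.0] for t in range(10)]
    for vector in add_dynamic_features(frames):
        print(vector)
```

For a coefficient that changes linearly over time, the regression slope recovers that rate of change away from the edges, which is the "velocity" reading of the delta-cepstrum.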