
BSTJ 60: 8. October 1981: Improving the Quality of a Noisy Speech Signal. (Sondhi, M.M.; Schmidt, C.E.; Rabiner, L.R.)



Improving the Quality of a Noisy Speech Signal

By M. M. SONDHI, C. E. SCHMIDT, and L. R. RABINER

(Manuscript received December 18, 1980)

In this paper we discuss the problem of reducing the noise level of a noisy speech signal. Several variants of the well-known class of "spectral subtraction" techniques are described. The basic implementation consists of a channel vocoder in which both the noise spectral level and the overall (signal + noise) spectral level are estimated in each channel, and the gain of each channel is adjusted on the basis of the relative noise level in that channel. Two improvements over previously known techniques have been studied. One is a noise level estimator based on a slowly varying, adaptive noise level histogram. The other is a nonlinear smoother based on inter-channel continuity constraints for eliminating the so-called "musical tones" (i.e., narrow-band noise bursts of varying pitch). Informal listening indicates that at modest signal-to-noise ratios (greater than about 8 dB) substantial noise reduction is achieved with little degradation of the speech quality.

I. INTRODUCTION

The idea that a vocoder may be used to improve the quality of a noisy speech signal has been around for about twenty years. To the best of our knowledge, the first such proposal was made in 1960 by M. R. Schroeder.¹ The basic idea of this proposal can be explained with the help of Fig. 1, as follows.

Figure 1a shows a typical short-term magnitude spectrum of a voiced portion of a noisy speech signal. Let $S(\omega)$ denote the envelope of this spectrum. (Recall that the "channel gains" of a vocoder are estimates of this envelope at the center frequencies of the channels. The fine structure of the spectrum is attributed to the harmonics of the fundamental voice frequency.)

Figure 1b shows a "formant equalized" version, $S'(\omega)$, of the envelope. The peaks in $S$ and $S'$ occur at the same frequencies, but the peaks of $S'$ are all of equal height.
The proposal is, essentially, to generate a signal with a fine structure as close as possible to that of the original speech signal, but with an envelope given by $S\,S'^{\,n}$, where $n$ is some integer, say, 1 or 2. Except for a scale factor, the spectral envelope of the resulting signal is the same as that of the original signal at the formant peaks, but is considerably reduced in the valleys. As shown in Fig. 1c, this processing effectively reduces the overall noise level. Of course, the formant peaks also become sharper, i.e., the formant bandwidths get reduced.

Reference 1 describes two implementations of this idea: a frequency domain method, in which the envelope is modified by modifying the channel gains of a self-excited channel vocoder, and a time domain method, in which the same effect is achieved by repeated convolution.

In many practical cases of interest, the noise is additive and uncorrelated with the speech signal. In such a situation, if it were possible to estimate the spectral level of the noise as a function of frequency, then the noise reduction could be achieved in a somewhat different manner. Suppose the noisy speech is applied to the input of a channel vocoder (see Section II for a detailed description). Let the output of the $k$th channel be $y_k = s_k + n_k$, where $s_k$ is the speech signal and $n_k$ is the noise signal in that channel. Let $N_k^2$ be the average power of the noise and $S_k^2$ that of the speech signal. Then, assuming that the noise and speech are uncorrelated, the average power of the noisy speech is given by

$$Y_k^2 = S_k^2 + N_k^2. \qquad (1)$$

Now $Y_k^2$ can be estimated directly from the output signal $y_k$. If an estimate of $N_k^2$ is available, as postulated, then $(Y_k^2 - N_k^2)^{1/2}$
provides an estimate of the magnitude of the signal alone in the $k$th channel. Thus, if the level of the channel signal is multiplied by the ratio of this estimated signal power to the overall power, a noise reduction is achieved.

At the suggestion of M. R. Schroeder, this "spectral subtraction" idea was implemented as a BLODI language computer program by one of us (MMS) in collaboration with Sally Severs. Besides spectral subtraction, one other feature was incorporated into this implementation. It had recently been demonstrated that autocorrelation and cepstrum pitch extraction are quite accurate and reliable for noisy speech signals with signal-to-noise ratios (s/n) as low as 6 dB. Such extractors can provide a clean excitation signal even from a highly noisy speech signal. Therefore, the self-excitation described in Ref. 1 was replaced by a voiced-unvoiced (buzz-hiss) signal derived from an autocorrelation pitch extractor.

Although this implementation demonstrated the feasibility of the basic idea, the computer facilities available at the time did not allow a thorough investigation of the effects of changing various parameters and configurations. Also, since digital hardware was not yet readily available, it did not appear likely that such noise-stripping techniques would find application in the immediate future. For these reasons these techniques were not actively pursued at the time.

Since the mid-seventies, presumably due to the vastly improved digital technology and renewed military interest, noise-stripping techniques have again attracted considerable attention. The recent interest in this problem appears to have started in 1974, when Weiss et al. independently discovered the spectral subtraction method. Except for the fact that the filter bank of the channel vocoder was replaced by short-term
Fourier analysis, the implementation of Weiss et al. was quite similar to the one described above. During the past five or six years several studies have explored this and other methods for noise removal. Notable among these is the work of Boll, Berouti et al., and McAulay and Malpass. A review of these and other studies is given in a recent paper by Lim and Oppenheim.

In view of the current interest in noise removal, we have recently been experimenting with the spectral subtraction method by computer simulation. Subsequent sections of this paper describe the results of our experiments.

From the brief description given above, it is clear that spectral subtraction is expected to be useful only in cases where the noise is additive. With this constraint, there are basically two types of situations in which this method might find application:

(i) The speech may be produced in a noisy environment, e.g., the cockpit of an airplane. In such a situation the noise spectrum is unknown a priori. This information must be estimated from the noisy speech signal itself, e.g., during intervals of silence between speech bursts. The algorithm for estimating the noise spectrum is, therefore, one of the most important parts of the simulations described later.

(ii) The speech itself may be generated in a quiet environment but might be transformed to a noisy signal because of the action of a coding device. Examples where such noise may be modelled as additive are pulse-code modulation (PCM) coders, and delta modulators whose step size is chosen such that granular noise predominates over slope-overload noise. In such cases, both the level of the noise and its spectral composition might be known a priori. Use of this a priori information simplifies the system and improves its performance.

There is a third way in which noise may enter the communication channel additively. The speech signal may be generated in a quiet environment but the listener may be in a noisy environment.
A message sent over the public address system at a busy railway station is such an example. In this case, the problem is to preprocess the speech signal in such a way that its intelligibility is least impaired by the noise. Some work on this problem has been reported in the literature; however, we will not deal with this problem here.

Before turning to a description of our simulations, it is worth emphasizing that we deliberately used the word "quality" rather than "intelligibility" in the title of this paper. Ideally, of course, one would like the intelligibility also to be increased. However, this is not absolutely essential. It is quite annoying and fatiguing to have to listen to a noisy speech signal for any length of time. Therefore, a device that reduces or eliminates the noise can be quite useful even if the cleaner signal is no more intelligible than the noisy one.

II. THE BASIC STRUCTURES

Two basic channel vocoder configurations for implementing spectral subtraction were simulated. For reasons that will become apparent from the following descriptions, we call these configurations self-excited and pitch-excited, respectively.

2.1 The self-excited configuration

A block diagram of the self-excited method of noise removal is shown in Fig. 2. The noisy speech, sampled 10,000 times per second, is first passed through a bank of $N$ equispaced bandpass filters that span the telephone channel bandwidth (approximately 200 to 3200 Hz). The processing of the bandpass filter outputs is identical for each channel. In the $k$th channel, the following operations are performed on the output:

(i) The level (magnitude) of the noisy speech signal, $Y_k$, is estimated.
(ii) In a parallel path, the level of the noise, $N_k$, is estimated.
(iii) The estimates $N_k$ and $Y_k$ are combined to give an estimate $\hat{S}_k$ of the level of the uncorrupted speech signal in the $k$th channel.
(iv) The adjusted channel signal is computed by the relation

$$\hat{s}_k = \frac{\hat{S}_k}{Y_k}\, y_k. \qquad (2)$$

Clearly $\hat{s}_k$ has the desired estimated magnitude $\hat{S}_k$. The sum $\hat{s} = \sum_{k=1}^{N} \hat{s}_k$
then provides the final processed output.

2.2 The pitch-excited configuration

A block diagram of the pitch-excited method is shown in Fig. 3. The estimates $\hat{S}_k$, $k = 1, 2, \ldots, N$, are obtained exactly as in the case of the self-excited configuration. However, the adjusted channel signals are obtained differently:

(i) The noisy speech signal is first processed by a pitch extractor, which also provides the voiced/unvoiced classification. The particular pitch extractor used is described in Ref. 11.
(ii) The output of the pitch extractor is used to provide a clean excitation signal, which consists of Gaussian noise during unvoiced portions and a train of impulses at the pitch rate during voiced segments.
(iii) This clean excitation signal is passed through a bank of bandpass filters, identical to the ones shown in Fig. 2, to give channel signals, $x_k$, which are approximately equal in magnitude.
(iv) The adjusted channel signal is computed as

$$\hat{s}_k = \hat{S}_k\, x_k. \qquad (3)$$

As before, $\hat{s}_k$ has the correct magnitude and, as before, the sum of these adjusted channel signals gives the final processed output.

As discussed in the next section, the estimates $\hat{S}_k$ are computed every 0.01 s (i.e., 100 times a second). In our initial experiments the channel gains were held constant between estimates. In this case, the gain jumps in value every 0.01 s, producing annoying audible clicks. These clicks were eliminated by replacing each jump by a linear interpolation of the channel gains over 6 speech samples (i.e., over 0.6 ms).

III. ALTERNATIVE CONFIGURATIONS SIMULATED

Several modified versions of the basic configurations of Figs. 2 and 3 have been simulated, and several sentences have been processed with these simulations. The alternatives that we have studied in some detail are: two choices for the number of channels; two methods of estimating $Y_k$; two methods of estimating $N_k$; and two methods of estimating $\hat{S}_k$.
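Before the alternatives are taken up one by one, the channel-gain computation they share, eq. (2) combined with the linear gain interpolation described at the end of Section II, can be sketched as follows. This is a minimal illustration, not the authors' simulation. It assumes the 10,000-samples/s rate of Section 2.1, so that a gain update every 0.01 s corresponds to a hop of 100 samples, and the names `frame_gains`, `interpolate_gains`, and `self_excited_output` are hypothetical.

```python
from typing import List, Sequence

HOP = 100   # samples per 0.01-s gain update, assuming 10,000 samples/s
RAMP = 6    # samples over which each gain jump is linearly interpolated


def frame_gains(S_hat: Sequence[float], Y: Sequence[float],
                eps: float = 1e-12) -> List[float]:
    """Per-frame gain S_hat_k / Y_k of eq. (2); eps avoids division by zero."""
    return [s / max(y, eps) for s, y in zip(S_hat, Y)]


def interpolate_gains(gains: Sequence[float], hop: int = HOP,
                      ramp: int = RAMP) -> List[float]:
    """Expand per-frame gains to per-sample gains.

    Each gain is held for `hop` samples, except that the jump from the
    previous frame's gain is replaced by a linear ramp over `ramp` samples.
    """
    out: List[float] = []
    prev = gains[0]
    for g in gains:
        for i in range(hop):
            if i < ramp:
                # linear ramp ending exactly at the new gain on sample ramp-1
                out.append(prev + (g - prev) * (i + 1) / ramp)
            else:
                out.append(g)
        prev = g
    return out


def self_excited_output(channels: Sequence[Sequence[float]],
                        S_hat: Sequence[Sequence[float]],
                        Y: Sequence[Sequence[float]],
                        hop: int = HOP, ramp: int = RAMP) -> List[float]:
    """Sum over k of s_hat_k(t) = g_k(t) * y_k(t), with g_k from eq. (2).

    channels[k] holds the bandpass output y_k; S_hat[k] and Y[k] hold one
    value per 0.01-s frame, so len(channels[k]) == hop * len(Y[k]).
    """
    n = len(channels[0])
    out = [0.0] * n
    for y_k, S_k, Y_k in zip(channels, S_hat, Y):
        g = interpolate_gains(frame_gains(S_k, Y_k), hop, ramp)
        if len(g) != len(y_k) or len(y_k) != n:
            raise ValueError("each channel must have hop * number-of-frames samples")
        for t in range(n):
            out[t] += g[t] * y_k[t]
    return out
```

For the pitch-excited configuration, eq. (3) would replace the product `g[t] * y_k[t]` with the interpolated $\hat{S}_k$ times the clean-excitation channel signal $x_k$; the interpolation step is the same.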
The alternatives listed above will now be described.

Two filter-bank designs were simulated, each with equispaced filters. In one design 16 channels (200 Hz wide) were used, and in the other 32 channels (100 Hz wide). The filter responses and the sum of the responses for each design are shown in Fig. 4. (Each filter was a linear-phase finite impulse response (FIR) filter of duration 88 samples in the 16-channel filter bank and 176 samples in the 32-channel filter bank.)

The two methods of estimating the magnitude, $Y_k$, of the noisy channel signal are shown in Fig. 5. Either $|y_k|$ or $y_k^2$ is low-pass filtered to 30 Hz. In the second case, the square root of the output of the low-pass filter is computed. The impulse and frequency responses of the low-pass filter (an infinite impulse response (IIR) Bessel filter) are shown in Fig. 6.

The choice of bandwidth of the low-pass filter is governed by a compromise between two requirements. For accurate estimation of $Y_k$, the averaging time should be as large as possible, i.e., the filter bandwidth should be as small as possible. On the other hand, the spectrum of speech varies with time, and the bandwidth should be as large as possible to track these variations. The usual compromise cutoff frequency in channel vocoders is about 20 Hz. Note that the outputs of the low-pass filters need be sampled only 60 times/s. To allow for the roll-off of the filter, the sampling rate was chosen as 100/s. Somewhat surprisingly, a much higher sampling rate was found to degrade performance. We will explain this paradox in Section IV (The Musical Tones).

3.3 Estimating $N_k$

During intervals of silence in the speech, the input signal consists of noise alone. Therefore, one possible estimate for $N_k$ is the smallest value attained by $Y_k$. However, because of statistical fluctuations, $Y_k$ quite rapidly takes on an unrealistically low value.
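The downward bias of the running minimum, and the two $Y_k$ estimators of Fig. 5, can be seen in a small simulation. This is an illustration under stated assumptions, not the authors' simulation: a two-pole resonator (centred at 1 kHz, about 200 Hz wide) stands in for one FIR channel filter, a one-pole low-pass stands in for the IIR Bessel filter, the sampling rate is taken as 10,000 samples/s, and `resonator` and `channel_level` are hypothetical names. The input is Gaussian noise alone, so the true noise level is constant, yet the minimum of the estimated $Y_k$ over a few seconds falls noticeably below its median.

```python
import math
import random
import statistics
from typing import List, Sequence

FS = 10_000   # samples/s (assumed from Section 2.1)
FRAME = 100   # samples per 0.01-s frame, i.e. levels read 100 times/s


def resonator(x: Sequence[float], f0: float, bw: float,
              fs: float = FS) -> List[float]:
    """Two-pole bandpass resonator standing in for one FIR channel filter."""
    r = math.exp(-math.pi * bw / fs)              # pole radius sets the bandwidth
    c1 = 2.0 * r * math.cos(2.0 * math.pi * f0 / fs)
    c2 = -r * r
    y1 = y2 = 0.0
    out: List[float] = []
    for v in x:
        y = v + c1 * y1 + c2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def channel_level(y: Sequence[float], fc: float = 30.0, square: bool = False,
                  fs: float = FS, frame: int = FRAME) -> List[float]:
    """Estimate Y_k by low-pass filtering |y| (or y**2, then taking sqrt).

    A one-pole low-pass with cutoff fc stands in for the IIR Bessel filter;
    its output is read once per frame.
    """
    a = math.exp(-2.0 * math.pi * fc / fs)
    state = 0.0
    levels: List[float] = []
    for n, v in enumerate(y):
        state = (1.0 - a) * (v * v if square else abs(v)) + a * state
        if (n + 1) % frame == 0:
            levels.append(math.sqrt(state) if square else state)
    return levels


if __name__ == "__main__":
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(5 * FS)]   # 5 s of noise alone
    y_k = resonator(noise, f0=1000.0, bw=200.0)
    Y = channel_level(y_k)[10:]            # skip frames while the filters settle
    drop_db = 20.0 * math.log10(statistics.median(Y) / min(Y))
    print(f"min(Y_k) lies {drop_db:.1f} dB below the median of Y_k")
```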
Therefore, this estimate is quite unsatisfactory. In order to avoid such problems with outliers, the method schematized in Fig. 7 has been simulated.

As a first step, the magnitude of $y_k$ is estimated by a procedure identical to that of Fig. 5, except that the low-pass filter has a cutoff frequency of 10 Hz instead of 30 Hz. (The impulse response of the 10-Hz filter is quite similar to that of the 30-Hz filter, with the time scaled by a factor of 3.) As before, the cutoff frequency of the low-pass filter should be chosen no larger than necessary to follow the time variations of the noise spectrum. Our choice of 10 Hz is an extremely conservative value. For most applications a cutoff frequency of 1 Hz or less should suffice.

Analogously to the estimation of $Y_k$, we have two ways of estimating $N_k$, which differ only in the type of nonlinearity used. Figure 7 shows the front end of the alternate noise estimator that we have simulated. Let $Z_k(n)$ be the estimate of the magnitude of $y_k$ obtained by one of these methods, sampled every 0.01 s. Then the algorithm for finding the noise level is as follows:

(i) Store $Z_k(n)$, $n = 1, \ldots, Q$, in a buffer of size $Q$.
(ii) Find the smallest value such that the next higher value is within […] dB of it. Call this smallest value MIN.
(iii) Make a histogram with 1-dB bins of all the values that lie in the range MIN to MAX = MIN + 15 dB.
(iv) Declare $K$ times the magnitude corresponding to the peak of the histogram as the noise level.
(v) Get the next sample.
(vi) If this sample is greater than MAX, discard it and go to step (v).
(vii) If the sample is less than MAX, replace the oldest sample in the buffer by the new sample and go to step (ii).

After some experimentation, $Q = 100$ and $K = 3$ or $2.5$ were found to be most satisfactory for the range of s/n considered. All experiments to be described later were performed with these values of $Q$ and $K$.
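Steps (i) to (vii) can be sketched as a small class operating on one channel's sequence $Z_k(n)$. This is a minimal reading of the algorithm, not the authors' code, and several details are assumptions: values are converted to dB before histogramming; the "magnitude corresponding to the peak" is taken at the centre of the peak 1-dB bin; a sample exactly equal to MAX is accepted; ties between bins go to the lower bin; and because the step (ii) threshold is not legible in the scan, `gap_db = 1.0` is only a placeholder. `HistogramNoiseEstimator`, `update`, and `to_db` are hypothetical names.

```python
import math
from collections import deque
from typing import Deque, Dict, Sequence


def to_db(z: float) -> float:
    """Magnitude to dB; the floor avoids log10(0)."""
    return 20.0 * math.log10(max(z, 1e-12))


class HistogramNoiseEstimator:
    """Sketch of the Section 3.3 noise-level algorithm for one channel.

    `initial` holds the first Q values of Z_k(n) (Q = 100 in the text);
    K = 2.5 is one of the two quoted values. `gap_db` is a placeholder for
    the step (ii) threshold, which is not legible in the scanned source.
    """

    def __init__(self, initial: Sequence[float], K: float = 2.5,
                 gap_db: float = 1.0, span_db: float = 15.0) -> None:
        # step (i): buffer of Q values in dB, oldest on the left
        self.buf: Deque[float] = deque(to_db(z) for z in initial)
        self.K, self.gap_db, self.span_db = K, gap_db, span_db
        self.max_db = 0.0
        self.level = self._estimate()

    def _estimate(self) -> float:
        vals = sorted(self.buf)
        # step (ii): smallest value whose next higher value is within gap_db
        mn = vals[-1]
        for lo, hi in zip(vals, vals[1:]):
            if hi - lo <= self.gap_db:
                mn = lo
                break
        self.max_db = mn + self.span_db            # MAX = MIN + 15 dB
        # step (iii): histogram with 1-dB bins over [MIN, MAX]
        counts: Dict[int, int] = {}
        for v in vals:
            if mn <= v <= self.max_db:
                b = int(v - mn)
                counts[b] = counts.get(b, 0) + 1
        # bins were inserted in ascending order, so max() breaks ties low
        peak = max(counts, key=lambda b: counts[b])
        # step (iv): K times the magnitude at the centre of the peak bin
        return self.K * 10.0 ** ((mn + peak + 0.5) / 20.0)

    def update(self, z: float) -> float:
        """Steps (v)-(vii) for one new sample; returns the current noise level."""
        v = to_db(z)
        if v <= self.max_db:
            # step (vii): replace the oldest sample, then redo steps (ii)-(iv)
            self.buf.popleft()
            self.buf.append(v)
            self.level = self._estimate()
        # step (vi): a sample above MAX is simply discarded
        return self.level
```

With a steady noise of magnitude 0.1 in the buffer, a single very low sample is ignored by step (ii), a loud speech-onset sample is discarded by step (vi), and a slow change in noise level is tracked once the buffer has turned over, matching the three properties claimed in the text.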
Careful consideration of the above algorithm should convince the reader that this procedure ignores occasional low values of $Z_k$, guards against sudden increases in $Z_k$ due to the onset of speech, and, finally, allows adaptation to a slowly varying noise level.

3.4 Estimating $\hat{S}_k$

As mentioned in the introduction, under the assumption that $s_k$ and $n_k$ are uncorrelated, $\hat{S}_k$ should be estimated as $\hat{S}_k = Y_k\,(1 - N_k^2/Y_k^2)^{1/2}$. However, even if the assumption is strictly valid, there are statistical fluctuations because of the finite averaging time. Therefore, the estimated value of $Y_k$ is sometimes less than that of $N_k$. In such cases, $\hat{S}_k$ is set to zero. Thus, our first procedure for estimating $\hat{S}_k$ is

$$\hat{S}_k = \begin{cases} Y_k\left(1 - N_k^2/Y_k^2\right)^{1/2}, & Y_k > N_k, \\ 0, & Y_k \le N_k. \end{cases} \qquad (4)$$

A second estimate that we have tried is […] (5)

IV. THE MUSICAL TONES

We have processed several speech signals through a variety of noise-stripping algorithms obtained by selecting from the alternatives listed above. The results will be discussed in detail in the next section.
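For reference, the first $\hat{S}_k$ rule of Section 3.4, eq. (4), and the resulting per-channel gain $\hat{S}_k/Y_k$ used in eq. (2) can be sketched per 0.01-s frame as below. This is a minimal illustration with hypothetical function names (`s_hat`, `channel_gains`) and made-up example levels; only eq. (4) is sketched.

```python
import math
from typing import List, Sequence


def s_hat(Y: float, N: float) -> float:
    """Eq. (4): Y_k * (1 - N_k**2 / Y_k**2) ** 0.5 when Y_k > N_k, else 0.

    Finite averaging can make the estimated Y_k fall below N_k even when the
    additive, uncorrelated noise model holds; the estimate is then set to 0.
    """
    if Y <= N:
        return 0.0
    return Y * math.sqrt(1.0 - (N / Y) ** 2)


def channel_gains(Y: Sequence[float], N: Sequence[float]) -> List[float]:
    """Gains S_hat_k / Y_k of eq. (2) for one frame of all channels."""
    return [s_hat(y, n) / y if y > 0.0 else 0.0 for y, n in zip(Y, N)]


if __name__ == "__main__":
    # Hypothetical levels for one 4-channel frame; speech in the first two.
    Y = [5.0, 2.0, 1.0, 0.8]
    N = [3.0, 1.0, 1.2, 1.0]
    print(channel_gains(Y, N))
```

Because the clamp switches a channel fully off whenever the fluctuating $Y_k$ dips below $N_k$, the gains of noise-only channels can turn on and off at random from frame to frame; this kind of isolated, short-lived channel activity is the sort of effect the musical-tones discussion is concerned with.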
