Augmented Role Filling Capabilities for Semantic Interpretation of Spoken Language

Lewis Norton, Marcia Linebarger, Deborah Dahl, and Nghi Nguyen
Unisys Center for Advanced Information Technology
Paoli, Pennsylvania 19301

ABSTRACT

This paper describes recent work on the Unisys ATIS Spoken Language System, and reports benchmark results on natural language, spoken language, and speech recognition. We describe enhancements to the system's semantic processing for handling non-transparent argument structure and enhancements to the system's pragmatic processing of material in answers displayed to the user. We found that the system's score on the natural language benchmark test decreased from 48% to 36% without these enhancements. We also report results for three spoken language systems: Unisys natural language coupled with MIT-Summit speech recognition, Unisys natural language coupled with MIT-Lincoln Labs speech recognition, and Unisys natural language coupled with BBN speech recognition. Speech recognition results are reported on the results of the Unisys natural language selecting a candidate from the MIT-Summit N-best (N=16).

INTRODUCTION

Improving the performance of spoken language systems requires addressing issues along several fronts, including basic improvements in natural language processing and speech recognition as well as issues of integration of these components in spoken language systems. In this paper we report the results of our recent work in each of these areas.*

One major area of work has been in the semantic and pragmatic components of the Unisys natural language processing system. The work in semantics enhances the robustness of semantic processing by allowing parses which do not directly express the argument structure expected by semantics to nevertheless be processed in a rule-governed way. In the area of pragmatics we have extended our techniques for bringing material displayed to the user into the dialog context to handle several additional classes of references to material in the display.

In the area of integration of speech and natural language, we report on an experiment with three spoken language systems, coupling the same Unisys natural language system to three different speech recognizers as shown in Figure 1.

[Figure 1: Unisys natural language coupled with multiple speech recognizers.]

We believe this is a very promising technique for evaluating the components of spoken language systems. Using this technique we can make a very straightforward comparison of the performance of the recognizers in a spoken language context. Furthermore, this technique also allows us to make a fine-grained comparison of the interaction between speech and natural language in the three systems by looking at such questions as the relative proportion of speech recognizer outputs that fail to parse, fail to receive a semantic analysis, and so on. Finally, we report on speech recognition results obtained by filtering the N-best (N=16) from MIT-Summit through the Unisys natural language system. We note that there was a higher error rate for context-dependent speech as compared to context-independent speech (54.6% compared to 45.8%) and suggest two hypotheses which may account for this difference.

*This work was supported by DARPA contract N000014-89-C0171, administered by the Office of Naval Research. We are grateful to Victor Zue of MIT, Doug Paul of MIT Lincoln Laboratories and John Makhoul of BBN for making output from their speech recognition systems available to us. We also wish to thank Tim Finin, Rich Fritzson, Don McKay, Jim Meidinger, and Jan Pastor of Unisys and Lynette Hirschman of MIT for their contributions to this work.
SEMANTICS

When evaluating our system after the Hidden Valley workshop, we observed two phenomena about PUNDIT (the Unisys natural language understanding system [3]) that warranted improvement. The first was that PUNDIT's semantic interpreter was sometimes failing to correctly recognize predicate-argument relationships for syntactic constituents that were not immediately associated with their intended head. The second was that PUNDIT was producing different representations for queries with different syntactic/lexical content but identical (or nearly identical) semantic content. We see both of these shortcomings as due to what we will term "non-transparent argument structure": syntactic representations in which syntactic constituents are not associated with their intended head, or semantic representations in which predicate-argument relationships are underspecified. Our approach to dealing with these shortcomings has been to maintain a rule-governed approach to role-filling despite non-transparent syntactic and semantic structures. We believe that the extensions we are about to describe are especially relevant to Spoken Language Understanding, because non-transparent argument structure appears to be particularly characteristic of spontaneous spoken utterances, for reasons we will sketch below.
The semantic interpreter and non-transparent parses

Semantic interpretation in PUNDIT is the process of instantiating the arguments of case frame structures called decompositions, which are associated with verbs and selected nouns and adjectives [7]. The arguments of decompositions are assigned thematic role labels such as agent, patient, source, and so forth. Semantic interpretation is a search for suitable grammatical role/thematic role correspondences, using syntax-semantics mapping rules, which specify what syntactic constituents may fill a particular role, and semantic class constraints, which specify the semantic properties required of potential fillers. The syntactic constraints on potential role fillers are of two types: CATEGORIAL constraints, which require that the potential filler be of a certain grammatical type such as subject, object, or prepositional phrase; and ATTACHMENT constraints, which require that the potential filler occur within the phrase headed by the predicate of which it is to be an argument. The categorial constraints are stated explicitly in the syntax-semantics mapping rules; the latter are implicit in the functioning of the semantic interpreter. For example, the source role of flight_C, the domain model predicate associated with the noun "flight", can, in accordance with the syntax-semantics mapping rules, be filled by the entity associated with the object of a "from"-pp occurring within the same noun phrase as "flight" (The flight from Boston takes three hours). Unfortunately, the parse does not always express the argument structure of the sentence as transparently as it does in this example; constituents that should provide role fillers for a predicate are not always syntactically associated with the predicate. There are several causes for such a mismatch between the parse and the intended interpretation. They include (1) a variety of syntactic deformations which we will refer to as extraposition (What flights do you have to Boston, where the "to"-pp belongs in the subject np; I need ticket information from Boston to Dallas, where the pp's modify the prenominal noun "ticket", not the head noun "information"; or I want a cheaper flight than Delta 66, where the "than"-pp modifies "cheaper", not "flight"), (2) metonymy (I want the $50.00 flight, where the speaker means that s/he wants the flight whose FARE is $50.00), and (3) suboptimal parses (e.g., parses with incorrect pp-attachment).
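
The machinery just described can be pictured with a small sketch. This is an illustration only, not PUNDIT's implementation: the dictionary layout, the try_fill helper, and the encoding of the flight_C rules below are our own assumptions, and the ATTACHMENT constraint is left implicit, as in the text.

    # Illustrative sketch (not PUNDIT's code): a decomposition is a case frame
    # whose thematic roles are filled via syntax-semantics mapping rules. Each
    # rule states the grammatical category a filler must have (the CATEGORIAL
    # constraint) and the semantic class it must belong to.

    flight_C = {
        "predicate": "flight_C",                 # domain predicate for the noun "flight"
        "roles": {"source": None, "destination": None},
        "rules": [
            {"role": "source", "category": "pp(from)", "class": "city"},
            {"role": "destination", "category": "pp(to)", "class": "city"},
        ],
    }

    def try_fill(decomp, category, entity, semantic_class):
        """Fill the first open role whose mapping rule matches both constraints."""
        for rule in decomp["rules"]:
            if (decomp["roles"][rule["role"]] is None
                    and rule["category"] == category
                    and rule["class"] == semantic_class):
                decomp["roles"][rule["role"]] = entity
                return True
        return False

    # "The flight from Boston takes three hours": the "from"-pp inside the noun
    # phrase headed by "flight" fills the source role.
    try_fill(flight_C, "pp(from)", "boston", "city")
    print(flight_C["roles"])   # {'source': 'boston', 'destination': None}
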
Our changes to the semantic interpreter allow it to fill roles correctly in cases such as the above, utilizing its existing knowledge of syntax-semantics correspondences, but relaxing certain expectations about the syntactic attachment of role-filling constituents. Thus the CATEGORIAL constraints remain in force, but the ATTACHMENT constraints have been loosened somewhat. The system now identifies prepositional phrases and adverbs which have not filled a role in the predicate with which they are syntactically associated, and offers them as role fillers to fillers of this predicate. This strategy applies recursively to fillers of fillers of roles; for example, in What types of ground transportation services are available from the airport in Atlanta to downtown Atlanta?, the two final pp's ultimately fill roles in the decomposition associated with "ground transportation" since neither "types" nor "services" has mapping rules to consume them. The same mechanism already in place for role-filling is employed in these cases, the only difference being that unused syntactic constituents are propagated downward. Note that we continue to take syntax into account; we do not wish to ignore the syntax of leftover constituents and fill roles indiscriminately on the basis of semantic properties alone.
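
A minimal sketch of this loosened ATTACHMENT strategy follows, under the same illustrative assumptions as above; the function names and data layout are ours, and the semantic class check is omitted for brevity. Leftover prepositional phrases and adverbs are offered, recursively, to the fillers of the predicate's roles.

    # Illustrative sketch: modifiers that fill no role in the predicate they are
    # attached to are propagated downward to the fillers of its roles, then to
    # fillers of fillers, and so on.

    def fill_roles(decomp, modifiers):
        """Consume modifiers that match an open role of this decomposition;
        return the unconsumed remainder."""
        leftover = []
        for mod in modifiers:
            role = decomp["mapping"].get(mod["category"])
            if role and decomp["roles"].get(role) is None:
                decomp["roles"][role] = mod["entity"]
            else:
                leftover.append(mod)
        return leftover

    def propagate(decomp, modifiers):
        """Role filling with downward propagation of leftover constituents."""
        leftover = fill_roles(decomp, modifiers)
        for filler in decomp.get("fillers", []):
            if not leftover:
                break
            leftover = propagate(filler, leftover)
        return leftover   # anything still unconsumed

    # "What types of ground transportation services are available from the
    # airport in Atlanta to downtown Atlanta": neither "types" nor "services"
    # consumes the final pp's, so they propagate down to ground_transportation.
    ground_transportation = {"mapping": {"pp(from)": "source", "pp(to)": "destination"},
                             "roles": {"source": None, "destination": None}, "fillers": []}
    services = {"mapping": {}, "roles": {}, "fillers": [ground_transportation]}
    types = {"mapping": {}, "roles": {}, "fillers": [services]}

    propagate(types, [{"category": "pp(from)", "entity": "the airport in Atlanta"},
                      {"category": "pp(to)", "entity": "downtown Atlanta"}])
    print(ground_transportation["roles"])   # both roles now filled
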
We conducted an experiment to assess the effects of these changes upon the system's performance, using a set of 138 queries (both class A and non-class A) on which the system was previously trained. The measure of performance used was the standard ATIS metric of the number of correct answers minus the number incorrect. Disabling the semantic changes described above lowered the system's score from 82 to 63, a decrease of 23%.

The application module and non-transparent semantic representations

Our second improvement was directed at cases where PUNDIT's semantic interpreter may have correctly represented the meaning of a sentence but in an irregular way. For example, the instantiated decomposition produced for "flights from Boston" is:

    flight_C(flight1, source(boston), ...)

while "flights leaving Boston" resulted in:

    flight_C(flight1, source(_), ...)
    leaveP(leave1,
           flight(flight1),
           source(boston), ...)

Clearly it would be preferable for the flight_C decomposition to be the same in both cases, but in the second case the source role of the decomposition associated with flight1 was unfilled, although it could be inferred from the leaveP decomposition that the flight's source was Boston. In other words, PUNDIT had not captured a basic synonymy relation between these np's.

Our response to this was to augment the semantic interpreter with a routine which can perform inferences involving more than one decomposition. The actual inferences are expressed in the form of rules which are domain-dependent; the inference-performing mechanism is domain-independent. For the above example, we have written a rule which, paraphrased in English, says that if a verb is one of a class of motion verbs used to express flying (e.g., "leave"), and if the source role of this verb is filled, propagate that filler to the source role of the flight involved. Thus the flight_C decomposition becomes the same for both inputs. Thirty-four such rules have been written for the ATIS domain, and we estimate that they are applicable to 10% to 15% of the training data.
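
Such rules can be pictured as simple filler-propagation statements over instantiated decompositions. The sketch below paraphrases the English rule in the text; the encoding, the MOTION_VERBS set, and the helper name are illustrative assumptions rather than the system's actual rule syntax.

    # Illustrative sketch of one domain-dependent inference rule with a
    # domain-independent driver: if a motion verb used to express flying has its
    # source role filled, propagate that filler to the source role of the flight
    # involved, so "flights leaving Boston" yields the same flight_C
    # decomposition as "flights from Boston".

    MOTION_VERBS = {"leaveP", "departP"}   # assumed class of flying motion verbs

    def propagate_flight_source(decompositions):
        for d in decompositions:
            if d["predicate"] in MOTION_VERBS and d["roles"].get("source"):
                flight = d["roles"].get("flight")            # the flight involved
                if flight is not None and flight["roles"].get("source") is None:
                    flight["roles"]["source"] = d["roles"]["source"]

    # "flights leaving Boston"
    flight1 = {"predicate": "flight_C", "roles": {"source": None}}
    leave1 = {"predicate": "leaveP", "roles": {"flight": flight1, "source": "boston"}}

    propagate_flight_source([leave1])
    print(flight1["roles"])   # {'source': 'boston'}, as for "flights from Boston"
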
The payoff from this extension comes in the use of PUNDIT's output by application modules. For the ATIS domain, the application module is the program that takes PUNDIT's output and uses it to formulate a query to the ATIS DB. It is obviously advantageous for the creation and maintenance of an application module that its input be regularized to the greatest extent possible, thus making such a module simpler, and avoiding duplication of code to compensate for non-regularized input in different application modules.

When we ran the same set of 138 queries used in the experiment described in the previous subsection without the rules just discussed (but with the semantics improvements of the previous subsection), the system's score dropped from 82 to 62, or 24%. There appears to be little interaction between the semantics improvements and the rules of this subsection; they apply to different phenomena in input data.

PRAGMATICS

In our June 1990 workshop paper [6], we described a feature of our system which we included to handle correctly a particular kind of discourse phenomenon. In particular, in the ATIS domain there are frequent references to flights by flight number (e.g., "Delta flight 123") which the user means to be unambiguous, but which in general have to be disambiguated in context. The reason is that the user learned about "Delta 123" from some previous answer, where it was returned as one of the flights between two cities City1 and City2. The problem is that "Delta 123" may have additional legs; for instance it may go on from City2 to City3. The user, when asking for the fares for "Delta 123", is presumably interested only in the City1 to City2 fare, not the City2 to City3 one and not the City1 to City3 one. So our system looked back at previous answers to find a mention of "Delta 123", thereby determining the flight leg of interest.

This kind of disambiguation can take other forms, and we have added some of them to our system since June. One of these capabilities is illustrated by the two queries What does LH mean? and What does EQP mean? Without context, the first of these cannot be correctly answered, because "LH" is a code for both an airline and a fare class. The second of these queries would yield a table with two rows, one row for each table for which "EQP" is one of the table's column headings. In both of these queries, however, the user is asking for clarification of something which has been presented as part of a previous answer display. So what our system needs to do, and does do, is refer back to previous answers much in the spirit of the "Delta 123" example above. For the first query, we will find the most recent answer which has "LH" as a column entry in some row; for the second we will find the most recent answer which has "EQP" as a column heading. Our system can then make the proper disambiguation and present the user with an appropriate cooperative response to the follow-up query. There were only a handful of follow-up queries of this form in the training data, but the extension to handle them was easy to add given the code in place to handle the "Delta 123" example.

Similarly, the training data contained numerous instances of queries such as What are the classes? In the absence of context, the best answer to this seems to be a list of the more than 100 different fare classes. However, queries such as these invariably follow the display of some fare classes in either flight tables or fare tables. The cooperative response, then, is to display a table of fare classes whose rows have been limited to those classes previously mentioned in the most recent flight or fare table. Our system also uses a generalization of this algorithm to filter requests for other kinds of codes, such as restrictions, ground transportation codes, aircraft codes, and meal codes. In all, from the TI training data [2] we have noticed 19 follow-up queries (out of 912) which now get the correct answer in context because of this extension to our system; there may be more queries which require this extension that we have not yet processed correctly for other reasons.

We make it possible to refer to previous answer tables in our system by means of the following mechanism. Whenever an answer table is returned, a discourse entity representing it is added to the discourse context, and a semantics for this entity is provided. Roughly speaking, if the query leading to the answer table is a request for X, the semantics can be thought of as being "the table of X" [6]. For example, if the query was a request for flights from City1 to City2, the semantics assigned to the discourse entity representing the answer is "the table of flights from City1 to City2". Note that we do NOT create discourse entities for each row (particular flights from City1 to City2 in the example) or for each column entry in a row (e.g., the departure time of a particular flight from City1 to City2). Doing so would make the discourse context unmanageably large. But the table (complete with column headings) is available and accessible to our system, and can be searched for particular values when it is desirable to do so, as in the capabilities being described in this section.
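
A sketch of how such answer-table lookups might be organized is given below. The data shapes and helper names are assumptions for illustration, not the system's code, but they follow the behavior described above: previous answers are searched most recent first, and code listings are restricted to the most recent relevant table.

    # Illustrative sketch of the answer-table mechanism: each answer returned to
    # the user becomes one discourse entity ("the table of X"), and follow-up
    # queries are resolved by searching those tables, most recent first.

    answer_tables = []   # discourse context: most recent answer last

    def add_answer(description, headings, rows):
        answer_tables.append({"description": description,
                              "headings": headings, "rows": rows})

    def most_recent_with_cell(value):
        """'What does LH mean?': latest table containing the code as a cell."""
        for table in reversed(answer_tables):
            if any(value in row for row in table["rows"]):
                return table
        return None

    def most_recent_with_heading(value):
        """'What does EQP mean?': latest table with the code as a column heading."""
        for table in reversed(answer_tables):
            if value in table["headings"]:
                return table
        return None

    def classes_in_most_recent_table():
        """'What are the classes?': restrict the fare-class listing to classes
        shown in the most recent flight or fare table, not all 100+ classes."""
        for table in reversed(answer_tables):
            if "CLASS" in table["headings"]:
                col = table["headings"].index("CLASS")
                return sorted({row[col] for row in table["rows"]})
        return None   # no context: fall back to the full listing

    add_answer("flights from Boston to Dallas",
               ["AIRLINE", "FLIGHT", "CLASS"],
               [["LH", "123", "Y"], ["DL", "66", "F"]])
    print(most_recent_with_cell("LH")["description"])   # resolved via a previous answer
    print(classes_in_most_recent_table())               # ['F', 'Y']
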
The techniques just described depend on the availability of previous ANSWERS. Some of the follow-up queries which they enable to be answered correctly could perhaps be handled by reference to previous QUERIES only, particularly in the special case where there is known to be only one previous query. We believe that our techniques are superior for at least two reasons. First, in the presence of more than one previous query, the answers to those previous queries are for our system a more compact and modular representation of the content of those queries than the discourse entities created while analysing the queries themselves; in short, it is simply easier to find what we want in the answers rather than in our representations of the queries. Second, there are follow-up queries which cannot be answered unless reference is made to previous answers, so such techniques are necessary in a complete system. Therefore, why not use them whenever they can be used, even when alternative techniques might be available?

The February 1991 D1 pairs test, which limited context dependency to dependency which could be resolved by examination of a single previous query (and not its answer), provides additional data on the applicability of these methods. In particular, 27 of the 38 pairs involved the disambiguation of a flight number to the flight leg of interest. It appears that four additional queries can be successfully answered by the technique we discussed above for handling the query What are the classes? The remaining 7 queries appear to be such that reference to previous answers is not helpful.

SPOKEN LANGUAGE SYSTEMS

We describe here the five spoken language tests in which we participated. Our methodology in these tests has been to couple the speech recognition output from different recognizers to the same natural language processing system. Because the natural language component and the application module are held constant in these systems, this methodology provides us with a means of comparing the performance of speech recognizers in a spoken language context.

Class A: Unisys PUNDIT system coupled with MIT SUMMIT

The spoken language system used in this test consists of the Unisys PUNDIT natural language system coupled via an N-best interface to the MIT SUMMIT speech recognition system. We will refer to this system as Unisys-MIT. These results were run with N=16, except for 4 utterances which could not be run at N=16 because of insufficient memory in the speech recognition system. N=1 was used for these utterances. SUMMIT produced the N-best and PUNDIT selected from the N-best the most acoustically probable utterance which also passed PUNDIT's syntactic, semantic, and pragmatic constraints. PUNDIT then processed the selected candidate to produce the spoken language system output. The value of N of 16 was selected on the basis of experiments reported in [1], which demonstrated that using larger N's than 10-15 leads to a situation where the chance of getting an F begins to outweigh the possible benefit of additional T's.

The SUMMIT system is a speaker-independent continuous speech recognition system developed at the MIT Laboratory of Computer Science. It is described in [10].

Unisys PUNDIT coupled with Lincoln Labs Speech Recognizer

The spoken language system used in this test consists of the Unisys PUNDIT natural language system loosely coupled to the MIT Lincoln Labs speech recognition system. The Lincoln Labs system selected the top-1 output, which PUNDIT then processed to produce the spoken language system output. The Lincoln Labs system is a speaker independent continuous speech recognition system which was trained on a corpus of 5020 training sentences from 37 speakers. It used a bigram backoff language model of perplexity 17.8. The system is described in more detail in [8].

Class A: Unisys PUNDIT system coupled with BBN BYBLOS

In this test N-best output from the BBN BYBLOS system as described in [4] was input to PUNDIT. As in the system which used the MIT N-best, we used an N of 16. The N-best from BBN was the output from BYBLOS rescored using cross-word models and a 4-gram model and then reordered before input to the natural language system.

Optional Class A Tests

We also report on spoken language results on the optional class A test, using both the Unisys-MIT system and the Unisys-BBN systems described above.

SPEECH RECOGNITION TESTS

The speech recognition tests were done using the natural language constraints provided by the Unisys PUNDIT natural language system to select one candidate from the N-best output of the MIT Laboratory of Computer Science SUMMIT speech recognition system. Using an N of 16, PUNDIT selected the first candidate of the N-best which passed its natural language constraints based on syntactic, semantic and pragmatic knowledge. If all candidates were rejected by the natural language system, the first candidate in the N-best was considered to be the recognized string.
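
The selection procedure just described amounts to a simple filter over the N-best list. The sketch below uses stand-in predicate functions for PUNDIT's syntactic, semantic, and pragmatic checks; it illustrates the interface, not the actual implementation.

    # Illustrative sketch of the N-best interface: return the most acoustically
    # probable candidate that passes the natural language constraints; if every
    # candidate is rejected, fall back to the top-1 hypothesis.

    def select_from_nbest(candidates, parses, gets_semantics, gets_pragmatics, n=16):
        """candidates are ordered from most to least acoustically probable."""
        for hypothesis in candidates[:n]:
            if parses(hypothesis) and gets_semantics(hypothesis) and gets_pragmatics(hypothesis):
                return hypothesis        # first acceptable candidate ends the search
        return candidates[0]             # all rejected: score the top-1 string
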
BENCHMARK RESULTS

Natural Language Common Task Evaluation

Unisys attempted all four of the natural language tests: both the required and the optional class A and class D1 tests. Our scores as released by NIST are as shown in Table 1.

    Class       Number        T    F    Score
    Class A     145 queries   84   14   48.3%
    Class D1    38 pairs      24   0    63.2%
    Class AO    11 queries    1    0    9.1%
    Class D1O   2 pairs       0    1    -50%

    Table 1: Unisys System Scores

The overall level of success is unimpressive. For the class A test, which corresponds most closely to the test last June, our performance is not much better, in spite of eight more months of work on our system. (If the scoring algorithm in effect now had been in effect in June, our score then would have been 42.2%.) As this paper is being written, we have not had the time to examine our performance on a sentence by sentence basis. It appears likely, however, that the amount of training data has not yet adequately covered the full range of the various ways that people can formulate queries to the ATIS database.

We are fairly pleased that our "false alarm" rate has not gone up since June. It was 11% then; if we take the 196 sentences involved in the latest 4 tests as a single group, we find our rate of F's to be less than 8%. When we discuss our spoken language results in a subsequent section, we will see that although the rate of correct answers drops noticeably when a speech recognizer is added to the system, the rate of incorrect answers does not appear to increase. The importance of a low "false alarm" rate is well appreciated by spoken language understanding researchers; from a user's point of view nothing could be worse than an answer which is wrong although the user may have no way of telling it is wrong. It will be important to lower the rate of such errors to a level well below 5%.
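
The weighted scores quoted throughout this paper are consistent with counting correct answers minus incorrect answers over the total number of utterances, with NA items contributing zero. Under that reading, which is our assumption rather than an official scoring script, the scores in Table 1 and the spoken language scores below can be reproduced as follows.

    # Hedged helper reproducing the weighted scores under the assumption
    # score = 100 * (T - F) / total, with NA counting zero.

    def weighted_score(t, f, total):
        return 100.0 * (t - f) / total

    print(round(weighted_score(84, 14, 145), 1))   # 48.3  (Class A natural language)
    print(round(weighted_score(24, 0, 38), 1))     # 63.2  (D1 pairs)
    print(round(weighted_score(29, 15, 145), 1))   # 9.7   (Unisys-MIT spoken language)
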
Our best performance came on the D1 pairs test. One would have expected a lower score on any test that requires two consecutive sentences to be understood than on a test of self-contained sentences. While we wish we could claim that our work discussed in the earlier section on pragmatics was instrumental in achieving our score, it appears that much of what we added to our system did not come into play in this test. A more likely explanation of the unexpectedly high score is that when a user queries the system in a mode which utilizes follow-up queries, he or she tends to use simpler individual queries. Perhaps a user who does not use follow-up queries is trying to put more into each individual query. Some evidence for this is that our score for just the 20 distinct class A antecedent sentences for the D1 pairs test was 75%, well above our 48.3% score for all the class A sentences. Even more striking is the fact that of the 9 speakers represented in this round of tests, only two contributed more than 3 pairs to the class D1 test: speakers CK and CO contributed 13 pairs each. Our scores restricted to just those two speakers were 93% for the class A test and 65% for the class D1 test (100% for speaker CO in the class D1 test!).

The optional tests clearly were too small to have much significance. It is not surprising that our system proved to be incapable at this point of dealing with extraneous words in the input queries, for we have made no efforts as yet to compensate for such inputs. These tests will be useful as a benchmark for comparison after we have addressed such issues.

Semantics Extensions and the Common Task Tests

In the section on semantics we reported the results of two experiments that we ran to assess the effects of extensions to our system. We performed the same tests using the data of the latest class A test of 145 queries. When the extensions to our semantic interpreter were removed, our performance dropped to 72 T, 19 F, or 36.6%, a decrease of 24% from our score of 48.3%. This reinforces our belief that these extensions are very important and useful. When we ran the test without the rules relating multiple decompositions, our performance was 83 T, 14 F, or 47.6%, a decrease of less than 2%. This latter finding was most surprising: basically it implies that in the 1991 test data there were virtually no constructions of the kind which those rules enable us to process, because the absence of the rules relating the decompositions corresponding to those constructions resulted in almost no reduction in our score. In particular, there must have been no nouns modified by relative clauses ("flights that arrive before noon") or participial modifiers ("flights serving dinner"). This has some implication regarding the distribution of various forms of syntactic expression across speakers, for phenomena which were clearly significant in our training data apparently were absent from 9 speakers' worth of test data.

The above experiments imply that our system as of last June would have gotten a score of less than 35% on the current class A test, for the extensions discussed in the section on semantics were not the only improvements we have made to our system. This is another indication of variability among speakers; for our system the 5 speakers of last June's test were easier to process. It appears to us that larger test sets are necessary to make a broad evaluation of natural language understanding capabilities. (We do not extend this suggestion to tests involving speech input because of the level of effort that would consume.) We have already noted the absence of relative clauses and participial modifiers in the recent class A test. We also noticed that 23 of 145, or 16%, of the sentences used the word "available", usually in constructions like "what X is available", while this word only appeared in 4% of the pilot training data. In the class D1 test, there were few discourse phenomena represented, and we noted in an earlier section that over 70% (27 of 38) of the D1 pairs involved just the phenomenon of flight leg disambiguation. Tests of such size, then, are not broadly representative of the range of query formulations in the ATIS domain.

Related to the last point is the suspicion that the few thousand sentences of training data are themselves too few to represent the range of user queries for this domain. We have noticed that fewer new words are appearing in the more recent sets of training data, so vocabulary closure is probably occurring. Even so, in the 145 class A queries of the recent test, our system found 12 with unknown words, or 8% of the queries. This was actually higher than the 5.5% we experienced with the test last June, but that is more a comment on the variability due to small test size. It is an open question whether more and more training data is the answer to making our systems more complete, however. After all, larger volumes of data are both expensive to collect and expensive to train from. The lack of closure for the syntactic and semantic variation in user queries presents a challenge for further research in spoken language understanding. It may well be that we will have to begin studying reasonable ways in which the variation in the range of user expression can be limited, without unduly constraining the user in the natural performance of the task.
Spoken Language Evaluations

Unisys-MIT: The spoken language results for this system were 29 T, 15 F, and 101 NA, for a weighted score of 9.7%. The system examined an average of 6.5 candidates in the N-best before finding an acceptable one. Of all candidates considered by the system, we found that 85% were rejected by the syntax component and 3% by the semantics/pragmatics component, and 11% were accepted by both components. It should be pointed out that the syntax component uses a form of compiled semantics constraints during its search for parses [5], thus the results for purely syntactic rejection are not as high as appears in this comparison, because some semantic constraints are applied during parsing. After a candidate is accepted by both syntax and semantics, the search in the N-best is terminated. However, the application component, which contains a great deal of information about domain-specific pragmatics, can also reject syntactically and semantically acceptable inputs for which it cannot construct a sensible database query. In fact, a syntactically and semantically acceptable candidate was found in 75% of the N-best candidate lists, but a call was made for only 30% of inputs. The application component was not able to make a sensible call for the remaining inputs.

The false alarm (or F) rate we observed in this test was around 10%, which is consistent with our previous spoken language results [1] and with our natural language results, as discussed above.

Unisys-BBN: This system received a score of 77 T, 20 F and 48 NA for a weighted score of 39.3%. In this system 74% of all inputs were rejected by syntax, 11% of inputs were accepted by syntax but rejected by semantics and 15% were accepted by both syntax and semantics. The false alarm rate is 13.8%, which is slightly higher but in the same range as previous false alarm rates.

As can be seen in Figure 2, in general the system found an acceptable candidate earlier in the N-best with the BBN N-best than with the MIT N-best. The average location of the selected candidate in the N-best with the BBN data was 3.8 compared to 6.5 with the MIT N-best.

[Figure 2: Comparison of location of the accepted candidate in the N-best (N=16) for the Unisys-MIT and Unisys-BBN systems.]

Unisys-LL: Using the top-one candidate from the Lincoln Labs speech recognizer the spoken language results for this system were 32 T, 5 F and 108 NA for a weighted score of 18.6%. The false alarm rate for this system was only 3.4%, which is lower than that for the other spoken language and natural language systems on which we report in this paper. There is no obvious explanation for this. The simple hypothesis of better speech recognition in the Unisys-LL system will not suffice, because the BBN system has better speech recognition but the false alarm rate is higher than the Unisys-LL rate. In addition, the Unisys system's performance on the NL test tells us how the system would do given perfect speech recognition, and the false alarm rate there is around 8%. One possible hypothesis is that the bigram language model used in the Lincoln Labs system is in some sense more conservative than the language models used in the BBN and MIT systems and consequently prevents some of the inputs which might have led to an F in the natural language system from being recognized well enough for the natural language system to generate an F.

In this system, based on one input per utterance, we found that 59% of the inputs failed to receive a syntactic analysis (including compiled semantics, as discussed above) and 2% failed to receive a semantic analysis. No database call could be generated for 13% of the inputs and a call was made for the remaining 25% of the inputs.
Evaluation of the Natural Language System: In [1] we reported on a technique for evaluation of the natural language component of our spoken language system, based on the question of how often did the natural language system do the right thing. If the reference answer for an utterance is found in the N-best, the right thing for the natural language system is to find the reference answer (or a semantic equivalent) in the N-best and give the right answer. The operational definition of doing the right thing, then, is for the system to receive a "T" on such inputs. On the other hand if the reference answer is not in the N-best the right thing for the system to do is to either find a semantic equivalent to the reference answer or to reject all inputs. Thus, doing the right thing in the case of no reference answer can be operationally defined as "T" + "NA".

                    Reference     Reference
                    in N-best     not in N-best   Overall
    Pundit right    54%           90%             84%
    Pundit wrong    45%           10%             16%

    Table 2: PUNDIT's performance on Class A (145 queries), depending on whether or not the reference query occurred in the N-best (N=16) from MIT-LCS SPREC.

                    Reference     Reference
                    in N-best     not in N-best   Overall
    Pundit right    69%           81%             73%
    Pundit wrong    31%           19%             27%

    Table 3: PUNDIT's performance on Class A (145 queries), depending on whether or not the reference query occurred in the N-best (N=16) from BBN SPREC.
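
The operational definition above can be written directly as a per-utterance check. The function names and the (reference_in_nbest, score) representation are assumptions for illustration.

    # Illustrative sketch of the evaluation behind Tables 2 and 3: the natural
    # language component "did the right thing" if it scored T when the reference
    # answer (or a semantic equivalent) was in the N-best, and T or NA when it
    # was not.

    def did_right_thing(reference_in_nbest, score):
        """score is one of 'T', 'F', 'NA' for the utterance."""
        if reference_in_nbest:
            return score == "T"
        return score in ("T", "NA")

    def right_rate(results):
        """results: list of (reference_in_nbest, score) pairs."""
        right = sum(did_right_thing(in_nbest, s) for in_nbest, s in results)
        return 100.0 * right / len(results)
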
Several interesting comparisons can be made based on Tables 2 and 3. To begin with, it seems clear that the BBN N-best is better than the MIT N-best based on three quite distinct measures: first of all the speech recognition score is better (16.1% word error rate for BBN vs. 43.6% word error rate for MIT); secondly, the spoken language score (with the natural language system held constant) for Unisys-BBN is better than Unisys-MIT (39.3% for Unisys-BBN vs. 9.7% for Unisys-MIT); and thirdly, the reference answer occurred in MIT's top 16 candidates only 15% of the time vs. 65% of the time for the BBN N-best. Thus this experiment allows us to ask the question of what effect does better speech recognition have on the interaction between speech recognition and natural language?

In the case where the reference answer is in the N-best, PUNDIT does much better with the BBN N-best. Since less search in the N-best is required with BBN data the reference answer or equivalent is likely to be found sooner, and consequently there will be fewer chances for PUNDIT to find a syntactically and semantically acceptable sentence in the N-best which differs crucially from what was uttered. On the other hand, PUNDIT actually does better with the poorer speech recognizer output from MIT when the reference answer is not in the N-best. We suspect that the poorer speech recognizer output is in some sense easier to reject; that is, it is more likely to seriously violate the syntactic and semantic constraints of English. If this is so then it is possible that a relatively accepting natural language system might work well with worse speech recognition outputs (because even a relatively accepting natural language system can reject very bad inputs), but with better speech recognizer output one might get good performance with a stricter natural language system. We plan to test this hypothesis in future research.

It is natural to ask why we should care about what to do with poorer speech recognizer output; one would think that we should use the best recognizer output possible. The answer is that many potential applications have requirements such as large vocabulary size which are somewhat at odds with high accuracy, consequently the best recognizer output available may nevertheless be relatively inaccurate. Thus it is important to have speech/natural language integration strategies which allow us to fine tune the interaction to compensate for less accurate speech recognition.

Optional Class A: We used both the Unisys-MIT system and the Unisys-BBN system for this test. For both speech recognizers in this test of eleven utterances with verbal deletions we received two T's and zero F's for a weighted score of 18.2%. There is too little test data in this condition to draw reliable conclusions from the results.

Comparison of Spoken Language Systems

We believe coupling of a single natural language system with multiple speech recognition systems has the potential for being a very useful technique for comparing speech recognizers in a spoken language context. Of course speech recognizers can be compared on the basis of word and sentence accuracy, but we do not know how direct the mapping is between these measures of performance and spoken language performance. The most direct comparison for spoken language evaluation, then, is to define an experimental condition in which the systems to be compared differ only in the speech recognition component. Not only is this strategy useful for comparing system level measurements of performance of speech recognizers, but it is also useful for more fine grained analyses of the interaction between the speech recognition component and the natural language system.

Figure 3 shows the distribution of T's, F's and NA's for specific queries across the three systems. Note that for 52 queries, or 36% of the total, the systems received the same score, although in no case did all three systems receive an "F". The largest difference among the three systems was in the number of cases where Unisys-BBN received a "T" but the other two systems received an "NA". This occurred for 13 queries.

Another interesting comparison is to look at the cases where Unisys-MIT and Unisys-BBN issued a call based on the first candidate in the N-best, since this corresponds to the one-best interface used in Unisys-LL. In Unisys-MIT twenty-seven calls were issued based on the first candidate, out of a total of 45 calls. Of the calls issued on the first candidate, 15 received a score of T and 12 received a score of F, for a weighted score of 2%. In Unisys-BBN the first candidate was selected from the N-best 70% of the time. 26 of these candidates resulted in scores of "F" and 42 resulted in a "T" for a weighted score of 11%.

Overall, the number of calls made was quite similar for the Unisys-LL and Unisys-MIT systems (25% of utterances for Unisys-LL and 30% for Unisys-MIT), but it was much higher for Unisys-BBN (67%). In all three systems most of the inputs were rejected by the syntax component (59% of all inputs for Unisys-LL, 74% of all inputs for Unisys-BBN and 85% of all inputs for Unisys-MIT). We can compare this to a baseline syntactic failure of 14% of inputs on the Unisys natural language test. (Note that since multiple inputs per utterance are possible with the N-best systems, the N-best vs. one-best systems are not strictly comparable.)
Speech Recognition Evaluations

Using speech recognition data from MIT, we submitted results for the Class A, Class D1, Class AO and Class D1O speech recognition tests, shown in Tables 4, 5, 6, and 7.

As expected, we observed a higher error rate for the optional tests, which contained verbal deletions, and we also observed a wide range of performance across speakers. The comparison of D1 pairs and Class A speech recognition showed poorer word recognition in the D1 pairs than in the Class A test. An average 45.8% word error rate was observed for the Class A utterances compared to a 54.6% error rate for the D1 utterances. As Tables 4 and 6 show, this was fairly consistent across speakers, except for speaker CJ. There are at least two hypotheses which may explain this higher error in context-dependent spontaneous utterances. One hypothesis suggests that the higher error rate may be due in part to the presence of prosodic phenomena common in dialog such as destressing of "old" information. Because the specific dialog context affects the pronunciation of words corresponding to old and new information, the training data used so far may not provide a complete sample of how words are pronounced in a wide range of dialog contexts, consequently leading to poorer word recognition. Another hypothesis is based on the fact that the context-dependent sentences contain many references to flight numbers. Flight numbers may be difficult to recognize because there is very little opportunity for syntactic or semantic information to constrain which number was uttered.

    Speaker   Corr   Sub    Del    Ins    Err    S. Err
    CE        56.0   33.3   10.6   5.1    49.1   95.0
    CH        47.4   44.7   7.9    23.7   76.3   100.0
    CI        45.6   46.8   7.6    24.0   78.4   100.0
    CJ        75.8   18.3   5.9    3.1    27.3   84.6
    CK        56.9   29.4   13.7   2.0    45.1   91.7
    CL        75.0   22.5   2.5    9.7    34.7   84.6
    CM        31.1   62.1   6.8    15.9   84.8   100.0
    CO        74.1   19.1   6.8    8.6    34.6   100.0
    CP        71.7   26.5   1.8    7.5    35.8   88.9
    Average   63.5   30.3   6.2    9.3    45.8   91.2

    Table 4: System Summary Percentages by Speaker for Class A

    Speaker   Corr   Sub    Del    Ins    Err    S. Err
    CE        59.5   40.5   0.0    16.2   56.8   100.0
    CI        21.9   71.9   6.2    21.9   100.0  100.0
    CJ        72.6   17.7   9.7    1.6    29.0   100.0
    CM        42.7   50.7   6.7    21.3   78.7   100.0
    Average   51.5   42.2   6.3    14.6   63.1   100.0

    Table 5: System Summary Percentages by Speaker for Class AO

    Speaker   Corr   Sub    Del    Ins    Err    S. Err
    CH        40.0   53.3   6.7    13.3   73.3   100.0
    CI        24.2   46.3   29.5   15.8   91.6   100.0
    CJ        83.9   10.7   5.4    3.6    19.6   50.0
    CK        48.0   39.9   12.2   2.7    54.7   100.0
    CL        66.7   31.7   1.7    8.3    41.7   83.3
    CO        54.7   27.3   18.0   3.1    48.4   94.4
    CP        80.0   20.0   0.0    8.0    28.0   100.0
    Average   52.0   34.1   13.9   6.6    54.6   91.4

    Table 6: System Summary Percentages by Speaker for Class D1

    Speaker   Corr   Sub    Del    Ins    Err    S. Err
    CM        34.8   65.2   0.0    26.1   91.3   100.0
    Average   59.6   40.4   0.0    15.8   56.1   100.0

    Table 7: System Summary Percentages by Speaker for Class D1O
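
The per-speaker columns in Tables 4 through 7 appear to follow the usual NIST convention, in which the word error rate sums substitutions, deletions, and insertions over the reference words (so Err can exceed 100 minus Corr when there are insertions), and S. Err is the percentage of sentences containing at least one error. A small consistency check under that assumption:

    # Assumed relation among the table columns: Err = Sub + Del + Ins, with
    # Corr = 100 - Sub - Del; this is our reading, not a statement from the paper.

    def word_error_rate(sub, deletions, insertions):
        return sub + deletions + insertions

    # Speaker CE in Table 4: Sub 33.3, Del 10.6, Ins 5.1
    print(round(word_error_rate(33.3, 10.6, 5.1), 1))   # 49.0, close to the reported 49.1
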
CONCLUSIONS

In this paper we presented benchmark test results on natural language understanding, spoken language understanding and speech recognition. Our weighted score for the Class A natural language test was 48.3%, for the D1 pairs, 63.2%, for the Class AO test, 9.1%, and for the Class D1O test, -50%. We presented five benchmark tests of spoken language systems: Unisys-MIT on Class A, which received a weighted score of 9.7%, Unisys-MIT on Class AO, which received a weighted score of 18.2%, Unisys-LL on Class A, which received a weighted score of 18.6%, Unisys-BBN on Class A, which received a weighted score of 39.3%, and Unisys-BBN on Class AO, which received a weighted score of 18.2%. Finally, we presented speech recognition results using the Unisys natural language system as a filter on the N-best output of the MIT SUMMIT system.

The semantics enhancements to the natural language system are motivating us to revisit the tightly integrated architecture of semantics/pragmatics processing in our system, because with these enhancements, semantic information regarding a discourse entity can become available to the processing at a much later point than previously. Thus, pragmatic processing must be invoked at a later point to ensure that the relevant semantic information has been exploited.

The spoken language results are especially interesting, because we are now beginning to be able to look at the interactions of the natural language system with different speech recognizers, and see how to tune the natural language system to make the best use of the information available from the various speech recognizers. We believe that it is important to make these kinds of comparisons and we are planning to work with at least one other speech recognition system using the N-best interface. We also plan to begin exploring more tightly coupled systems using the stack decoder architecture [9].

REFERENCES

[1] Deborah A. Dahl, Lynette Hirschman, Lewis M. Norton, Marcia C. Linebarger, David Magerman, and Catherine N. Ball. Training and evaluation of a spoken language understanding system. In Proceedings of the DARPA Speech and Language Workshop, Hidden Valley, PA, June 1990.

[2] Charles T. Hemphill, John J. Godfrey, and George R. Doddington. The ATIS spoken language systems pilot corpus. In Proceedings of the DARPA Speech and Language Workshop, Hidden Valley, PA, June 1990.

[3] L. Hirschman, M. Palmer, J. Dowding, D. Dahl, M. Linebarger, R. Passonneau, F.-M. Lang, C. Ball, and C. Weir. The PUNDIT natural-language processing system. In AI Systems in Government Conference. Computer Society of the IEEE, March 1989.

[4] F. Kubala, S. Austin, C. Barry, J. Makhoul, P. Placeway, and R. Schwartz. BYBLOS speech recognition benchmark results. In Proceedings of the DARPA Speech and Natural Language Workshop, Asilomar, CA, February 1991.

[5] F.-M. Lang and L. Hirschman. Improved portability and parsing through interactive acquisition of semantic information. In Proceedings of the Second Conference on Applied Natural Language Processing, Austin, TX, February 1988.

[6] Lewis M. Norton, Deborah A. Dahl, Donald P. McKay, Lynette Hirschman, Marcia C. Linebarger, David Magerman, and Catherine N. Ball. Management and evaluation of interactive dialog in the air travel domain. In Proceedings of the DARPA Speech and Language Workshop, Hidden Valley, PA, June 1990.

[7] Martha Palmer. Semantic Processing for Finite Domains. Cambridge University Press, Cambridge, England, 1990.

[8] D. B. Paul. New results with the Lincoln tied-mixture HMM CSR system. In Proceedings of the DARPA Speech and Natural Language Workshop, February 1991.

[9] Douglas B. Paul. A CSR-NL interface specification. In Proceedings of the DARPA Speech and Natural Language Workshop, 1989.

[10] V. Zue, J. Glass, D. Goddeau, D. Goodine, L. Hirschman, H. Leung, M. Phillips, J. Polifroni, and S. Seneff. Development and preliminary evaluation of the MIT ATIS system. In Proceedings of the DARPA Speech and Natural Language Workshop, February 1991.
[Figure 3: Query-by-query comparison of results from the three spoken language systems (three speech recognizers coupled with Unisys NL); shaded cells represent agreement among all three systems.]
