<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1060"> <Title>A Relaxation Method for Understanding Speech Utterances 1</Title> <Section position="4" start_page="300" end_page="300" type="evalu"> <SectionTitle> EVALUATION PROCEDURE </SectionTitle> <Paragraph position="0"> The DARPA community has been developing an evaluation scheme over the past year and a half, based on a comparison between an answer produced by the system and a set of two &quot;rain/max&quot; answers provided by trained annotators, specifying the minimum and maximum requirement for expected entries from the database, where the maximum table addresses the overgeneration issue.</Paragraph> <Paragraph position="1"> The sentences for a given dialogue are presented in order to the system being tested, and it must deal with the sentence in context to come up with an appropriate answer 3.</Paragraph> <Paragraph position="2"> No partial credit is given for a &quot;nearly correct&quot; answer, and systems are penalized for wrong answers, so that the score is defined as the difference between percent correct and percent incorrect.</Paragraph> <Paragraph position="3"> The DARPA community have also been involved in a joint effort to collect a large body of common training and testing data within the ATIS domain. The data have been collected at five sites (AT~T, BBN, MIT, CMU, and SRI) under a variety of different conditions, sometimes including a speech recognizer (SRI), sometimes allowing only verbal feedback (AT&T), sometimes typed verbatim to the site-dependent system (MIT), and sometimes including a &quot;wizard&quot; who could rephrase the question before submitting it to the data collection system (BBN, CMU). The sentences have been labelled according to four distinct categories: Class A (no context required), Class D1 (context-dependent on only one preceding sentence), Class D (context dependent on multiple sentences) ~, and Class X (unanswerable).</Paragraph> <Paragraph position="4"> INPUT: &quot;(WHAT ARE THE MEALS) AND (AIRCRAFT FOR FLIGHT TWO EIGHTY ONE) AND ALSO FOR (FLIGHT TWO OH ONE)&quot; FRAME: \[Existential clause Topic: \[(aircraft meal) for;\[(flight) number:(281, 201)\]\]\] RESPONSE: Here are meals for flights 281 and 201 from boston to denver. AIRLINE NUMBER FROM TO MEALS SERVED UNITED 201 BOS DEN BREAKFAST UNITED 281 BOS DEN DINNER Here are the aircraft for flights 281 and 201 from boston to denver. AIRLINE NUMBER FROM TO AIRCRAFT COMPANY UNITED 201 BOS DEN DC8 MCDONNELL DOUGLAS UNITED 281 BOS DEN DC8 MCDONNELL DOUGLAS Figure I: Examplesentencetoillustr~erobust-parsing mechanism. Parenthesesin thesentenceindicate parsed phr~es. INPUT: &quot;(WHAT ARE THE CHEAPEST FLIGHTS FROM BOSTON TO ATLANTA) WITH (CONNECTING FLIGHTS IN PHILADELPHIA)&quot; RESPONSE: Here are the cheapest connecting flights from Boston to Atlanta connecting in Philadelphia. <shows table> INPUT: &quot;(I WOULD LIKE TO FLY FROM BOSTON TO ATLANTA AND STOP IN PHILADELPHIA) (ARE THERE ANY PLANES) (THAT) DO (THAT)&quot; RESPONSE: Here are the aircraft for the flights with stops from Boston to Atlanta connecting in Philadelphia. <shows table> INPUT: &quot;(HOW LONG) DOES (FLIGHT TWO EIGHTY ONE) REMAIN (IN DENVER) RESPONSE: Here are the connection times for flight 281 from Boston to San Francisco connecting in Denver. <shows table> total of 362 sentences in this set, 290 were &quot;answerable,&quot; (Class A, D, or D1). The second set is the February '92 test set, released just prior to this meeting. 
<Section position="5" start_page="300" end_page="303" type="evalu">
<SectionTitle> RESULTS </SectionTitle>
<Paragraph position="0"> We report here on the results for the two DARPA test releases, and on three different systems: (1) the MIT NL (text input) system, (2) the MIT Spoken Language System (recognizer included), and (3) the MIT-SRI system (the MIT NL component operating on outputs from a recognizer developed at SRI [3]). For the October '91 NL-only experiment, we give a breakdown of performance for those sentences that required robust parsing against those that received a full parse, in order to assess how much robust parsing helped. For the February '92 test set, we provide a detailed discussion of the errors for the text-input condition. We use the MIT-SRI results in an experiment to address the question of whether it is valid to penalize systems one-to-one for incorrect answers.</Paragraph>
<Paragraph position="1"> October '91 Test Results
A breakdown of the results for our system on text input on the October '91 test set, with robust parsing included, is given in Figure 3. All of the sentences in the columns labelled &quot;robust&quot; would have received a NO ANSWER response without the robust parser, so over half of those answers must be correct in order to yield a net gain in score. For the Class A and Class D1 sentences, this requirement was met by a comfortable margin. Although the robustly parsed Class D sentences yielded more incorrect answers than correct ones, this result is misleading, because the majority of the errors were not due to failures in the robust-parsing algorithm. For instance, five sentences concerned a fare &quot;less than one thousand dollars&quot;; a minor bug in the number-interpretation routine led to an incorrect answer to all of these questions. An additional four sentences failed due to a minor problem in the external history mechanism. Overall, we were quite encouraged by the result of this evaluation, which indicates that the robust-parsing mechanism provides a powerful enhancement of the system's capabilities.</Paragraph>
<Paragraph position="2"> February '92 Test Results
Table 1 gives performance results for the February '92 test set. For the text-input condition, 80% of the queries were answered correctly; this number dropped to 61% in speech-input mode. The number of incorrect answers remained almost constant at 13%, with a correspondingly large increase in unanswered questions, from 7% to 25%.</Paragraph>
<Paragraph position="3"> This is a direct result of our change in rejection strategy in going from text-input to speech-input mode. We examined in detail all of the sentences for which our text-input system produced an incorrect answer, categorizing the errors in the hope of assessing how far we are from the ideal goal of an error-free system.</Paragraph>
<Paragraph position="4"> A breakdown of the categorizations is given in Table 2. Seventeen answers fell into the category &quot;correct,&quot; which is to say that the system produced the answer we expected it to produce, and we feel that, although the answer does not match the comparator's requirement, it is nonetheless a reasonable one.
For instance, the answer we gave to the question &quot;what is the stopover&quot; was the location of the stop, whereas the comparator expects, inexplicably, the number of stops instead. There were five essentially identical sentences asking for the number of Delta flights in differing fare classes. In all cases our count was off by one, because we included a connecting flight one of whose legs was a Delta flight.</Paragraph>
<Paragraph position="5"> This was a consequence of a misunderstanding on our part of the rules, so we feel that the system did the right thing in this case. Two sentences were a result of the comparator refusing to accept &quot;NIL&quot; and a null string as the same thing. Other &quot;correct&quot; sentences involved an interpretation of the context, often in cases where the subject was speaking &quot;computerese,&quot; and we think our interpretation is a valid one. Given that 20% of the errors are in this category, we believe that the comparator evaluation is probably overly rigid. It might make sense to allow some flexibility in overruling the comparator's result on a case-by-case basis.</Paragraph>
<Paragraph position="6"> There were 32 sentences in the category &quot;easily fixed.&quot; It took two days' time to correct the mistakes for these sentences, although they were distributed over all aspects of the system (parse failure, meaning representation, discourse mechanism, and query generation). Some of them were clearly bugs, whereas others were simply due to incomplete understanding (such as generalizing &quot;this afternoon&quot; to mean &quot;today&quot; as well as &quot;in the afternoon&quot;). Six of the sentences failed due to a deficiency in our discourse mechanism specific to questions about airlines. These involved an anaphoric reference to a set of phantom flights, implied by a preceding question about an airline.</Paragraph>
<Paragraph position="7"> The system understood &quot;those flights&quot; only in the context of an existing set of flights that had been generated through a call to the database. Thus, in the sequence &quot;Does Delta fly between Boston and Denver,&quot; followed by &quot;Show me those flights,&quot; the system was unable to understand which flights were intended. This was an interesting discourse situation, and we were happy to uncover this inadequacy in our system. Overall, while it is encouraging that it was easy to correct so many errors, it is also problematic that we continue to uncover such &quot;minor&quot; problems in unseen data. It is unclear how many more sets of 1000 sentences will be necessary before new bugs and inadequacies are no longer encountered.</Paragraph>
<Paragraph position="8"> Twenty-seven sentences were judged as more difficult to correct, and their problems are about equally divided between the categories &quot;complex meaning&quot; and &quot;difficult context.&quot; A particularly troublesome set for context comprises sentences spoken by subjects who chronically speak a staccato computerese that is difficult to distinguish from normal fragments. We are less concerned about these sentences, because, if these subjects were using our system interactively, its feedback would communicate quite clearly how their staccato sentences were being interpreted, thus keeping the dialogue coherent. Another set of sentences, very difficult yet probably not fruitful to correct, comprises &quot;stage-setting&quot; sentences that tend to ask for too much information, such as the test-set sentence &quot;Please give me flight information from Denver to Pittsburgh to Atlanta and return to Denver.&quot; Our system provides a large subset of the flights requested, which is surely information overload anyway, leading the subject, in an interactive mode, to follow up with a sentence asking for information about only one leg of the trip.</Paragraph>
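As an illustration of the &quot;phantom flights&quot; discourse situation described above, the following is a minimal Python sketch of the kind of history mechanism that would be needed: it records the flight constraints implied by a yes/no question about an airline, so that a follow-up anaphor such as &quot;show me those flights&quot; can be resolved even though no flight table was ever retrieved. The class, method names, and constraint representation are hypothetical and are not taken from the MIT system.

# Hypothetical sketch: record the flight set *implied* by a yes/no airline
# question, so a later anaphor ("those flights") can be resolved without an
# earlier database retrieval.  All names here are illustrative assumptions.
class DiscourseHistory:
    def __init__(self):
        self.flight_constraints = None   # constraints from the last flight mention

    def record_yes_no_query(self, constraints):
        # e.g. {"airline": "DL", "from": "BOS", "to": "DEN"} for
        # "Does Delta fly between Boston and Denver?"
        self.flight_constraints = constraints

    def resolve(self, phrase):
        # "those flights" refers back to the implied (phantom) flight set
        if phrase == "those flights" and self.flight_constraints is not None:
            return dict(self.flight_constraints)
        raise ValueError("no antecedent flight set in the discourse history")

history = DiscourseHistory()
history.record_yes_no_query({"airline": "DL", "from": "BOS", "to": "DEN"})
print(history.resolve("those flights"))   # constraints reusable in the follow-up query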
<Paragraph position="9"> The eleven remaining errors were distributed among three categories. Three were due to an incorrect analysis of a context-setting query. False starts that were deadly for the robust parser accounted for four errors.</Paragraph>
<Paragraph position="10"> For instance, a stutter on the word &quot;a&quot; (&quot;A a flight&quot;) produced the interpretation &quot;AA&quot; (American Airlines). Such problems are very difficult to repair, and we see no near-term solutions. An additional four errors were labelled &quot;uninteresting,&quot; either because our system will never see such a sentence in actual operation (a request for a definition of a code like &quot;DDEN,&quot; which is never displayed to the user by our system) or because the sentence is hopelessly obscure, such that a similar sentence would never recur.</Paragraph>
<Section position="1" start_page="302" end_page="303" type="sub_section">
<SectionTitle> The MIT-SRI System </SectionTitle>
<Paragraph position="0"> The SRI researchers have provided us with their recognizer's outputs for three sets of data: a training-data subset, the October '91 test set, and the February '92 test set. We used the training data to develop an appropriate rejection mechanism, and we then applied the results to both test sets. We decided to use the same rejection criterion for this test as for the NL-input test, without screening context-dependent sentences requiring a robust parse, as we had done for the MIT recognizer inputs.</Paragraph>
<Paragraph position="1"> Interestingly, the error for the &quot;MIT-SRI&quot; system on the October '91 test set was only ten percentage points higher than that for text input, whereas the performance drop was much greater for the February '92 test set (18.2 points). We don't fully understand this difference, but apparently the recognition errors were more disruptive for the February '92 test set than for the October '91 test set. Although the SRI recognizer has a significantly better SPREC performance than the MIT recognizer (11.0% error vs. 18.0%), our SLS system was apparently not able to take advantage of this improvement.</Paragraph>
<Paragraph position="2"> The error for the MIT-only system was only 2% higher than that of the MIT-SRI system. We can think of at least two factors that may account for this surprising result. The first is that our recognizer results were obtained by filtering 10 N-best outputs from our recognizer through TINA: if TINA could find a parsable hypothesis, then that one was selected as the recognizer output. This meant that small errors in prepositions and the like were more likely to be corrected. The second factor is the more rigid rejection criterion used for the MIT recognizer. A larger percentage of the MIT-SRI sentences were incorrect (19% vs. 14%), and we suspect that, had we used the same rejection criterion for the SRI recognizer as for the MIT recognizer, the performance would have improved.</Paragraph>
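The N-best filtering strategy just described can be summarized in a few lines. The Python sketch below is an illustration under assumed interfaces (the parses predicate stands in for a call to the TINA parser) and is not the actual MIT implementation.

# Sketch of N-best filtering: scan the recognizer's ranked hypotheses and
# return the first one the parser accepts; otherwise fall back to the top
# hypothesis.  `parses` stands in for the TINA parser (an assumption here).
def filter_nbest(hypotheses, parses, n=10):
    """hypotheses: recognizer outputs, best first; parses: str -> bool."""
    for hyp in hypotheses[:n]:
        if parses(hyp):
            return hyp              # first parsable hypothesis wins
    return hypotheses[0] if hypotheses else None

Selecting the first parsable hypothesis tends to repair small recognition errors, such as swapped prepositions, while leaving genuinely unparsable utterances to the rejection mechanism.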
<Paragraph position="3"> We strongly suspect that penalizing sentences one-to-one for incorrect answers is too steep a penalty. Because current system capabilities generally include a good discourse model as well as an ability to handle sentence fragments, it is often the case that a partially understood query provides valid information that the system can make use of in a follow-up query. For instance, if the user said, &quot;Show me all flights from Boston to Dallas leaving Tuesday morning before ten,&quot; and the system misunderstood &quot;Tuesday&quot; as &quot;Thursday,&quot; the user could simply say in a follow-up query, &quot;On Tuesday,&quot; and the system would be able to deliver a completely correct answer. On the other hand, if the system instead refused to answer the first question (so as to maximize score), the user would have to repeat the entire sentence in order to retain the other conditions.</Paragraph>
<Paragraph position="4"> The only way to clearly assess whether or not systems should err in the direction of answering too much is to compare user-satisfaction tests under A/B conditions.</Paragraph>
<Paragraph position="5"> Short of this, however, it is still possible to devise an experiment to assess the degree of correctness of those answers that the recognizer misunderstood. To do this, we selected a subset of 62 utterances from the February '92 test material, representing all queries that had been correctly answered (according to the comparator) by our NL system but incorrectly answered by the joint MIT-SRI SLS system. We have available to us a frame-based evaluation procedure that we use internally for comparing semantic frames generated by the recognizer against those generated from the true orthography. The scoring involves comparing a set of key/value pairs representing the set of attributes mentioned in the sentence, such as &quot;source,&quot; &quot;departure-time,&quot; &quot;fare-code,&quot; and &quot;flight-number.&quot; The score is computed as (correct - insertion) / (correct + substitution + deletion), where &quot;correct&quot; means that both the key and the value are identical between the hypothesis (NL answer) and the reference (recognizer answer).</Paragraph>
Table 3 (caption, truncated): ... test sentences whose orthography was correctly understood by the NL component but whose SRI recognizer outputs were incorrect. See text for further details.
<Paragraph position="7"> The result is shown in Table 3. There were on average about 3.5 attributes per sentence to be identified. The system correctly identified more than 3 out of every 4 attributes, with an insertion rate (recognizing additional, false attributes) of 10%. This suggests to us that users would be better served if the system answered most of these questions than if it simply replied with a canned phrase such as &quot;I'm sorry, I didn't understand you,&quot; requiring the user to reinstantiate even those attributes that had been correctly recognized.</Paragraph>
</Section> </Section> </Paper>