<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1016"> <Title>EVALUATION OF THE CMU ATIS SYSTEM</Title> <Section position="5" start_page="102" end_page="103" type="evalu"> <SectionTitle> RESULTS </SectionTitle> <Paragraph position="0"> Our current system has a lexicon of 710 words and uses a bigram language model of perplexity 49. Six noise models are included in the lexicon. We used the version of Sphinx produced by Hon \[9\], which includes between-word triphone models. The vocabulary-independent phone models generated by Hon \[9\] were used to compile the word models for the system. No task-specific acoustic training was done. We have not yet added the out-of-vocabulary models to the system.</Paragraph> <Paragraph position="1"> The DARPA ATIS0 training set consists of approximately 700 utterances gathered by Texas Instruments and distributed by NIST. This data was gathered and distributed before the June 1990 evaluations. The data was gathered using a &quot;wizard&quot; paradigm. Subjects were asked to perform an ATIS scenario.</Paragraph> <Paragraph position="2"> They were given a task to perform and told that they were to use a speech understanding computer to get information. A hidden experimenter listened to the subjects and provided the appropriate information from the database. The transcripts from this set were used to train our language model. This includes the bigram model for the recognizer and the grammar for the parser. Since this amount of data is not nearly enough to train a language model, we chose to &quot;pad&quot; our bigrams. Bigrams were generated based on tag pairs rather than word pairs. Words in our lexicon were put into categories represented by tags. The June90 training corpus was tagged according to this mapping. We then generated a word-pair file from the Phoenix finite-state ATIS grammar. This file was used to initialize the tag bigram counts. The tagged corpus was then used to add to the counts and the bigram file was generated. It is a &quot;padded&quot; bigram in the sense that the grammar is used to ensure a count of at least 1 for all &quot;legal&quot; tag pairs. This procedure yielded a bigram language model which has a perplexity of 49. The class A set contains 145 utterances that are processed individually without context. All utterances in the test set were &quot;Class-A&quot;, that is, answerable, context-independent and with no disfluencies. The class D1 set contains 38 utterance pairs. These are intended to test dialog capability. The first utterance of a pair is a Class-A utterance that sets the context for the second. Only scores for the second utterance are reported for this set.</Paragraph> <Paragraph position="3"> We processed both transcript and speech input for each set.</Paragraph> <Paragraph position="4"> Tables 1-4 show the results of this evaluation.</Paragraph> <Paragraph position="5"> Utterances were scored correct if the answer output by the system matched the reference answer for the utterance. The reference answer is database output, not a word string. Systems are allowed to output a NO_ANSWER response, indicating that the utterance was misunderstood. Any output that was not correct or NO_ANSWER was scored incorrect. The Weighted Score is computed as (1 - (2 * percent false + percent NO_ANSWER)).</Paragraph>
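To make the scoring arithmetic above concrete, here is a minimal sketch of the Weighted Score computation. It is illustrative only (not the official NIST scoring code), and the function name, argument names, and example counts are ours:

    def weighted_score(num_correct, num_incorrect, num_no_answer):
        # Weighted Score = 1 - (2 * fraction incorrect + fraction NO_ANSWER),
        # with the fractions on a 0-1 scale rather than as percentages.
        total = num_correct + num_incorrect + num_no_answer
        fraction_false = num_incorrect / total
        fraction_no_answer = num_no_answer / total
        return 1.0 - (2.0 * fraction_false + fraction_no_answer)

    # Hypothetical example: 100 correct, 30 incorrect, 15 NO_ANSWER out of 145
    # scored utterances gives 1 - (2 * 0.207 + 0.103), or roughly 0.48.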
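The padded tag-bigram construction described earlier in this section can be sketched as follows. This is not our actual tooling; the data structures (a list of grammar-derived tag pairs and a tagged corpus) are assumptions made for illustration:

    from collections import defaultdict

    def build_padded_tag_bigrams(grammar_tag_pairs, tagged_corpus):
        # grammar_tag_pairs: (tag, tag) pairs derived from the finite-state
        #     grammar; each "legal" pair is guaranteed a count of at least 1.
        # tagged_corpus: training utterances, each given as a list of tags.
        counts = defaultdict(int)
        for t1, t2 in grammar_tag_pairs:
            counts[(t1, t2)] = 1          # initialize counts from the grammar
        for tags in tagged_corpus:
            for t1, t2 in zip(tags, tags[1:]):
                counts[(t1, t2)] += 1     # add counts from the tagged corpus
        return counts

    # The bigram file for the recognizer would then be generated from these
    # tag-pair counts (the expansion back to word bigrams is omitted here).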
<Paragraph position="6"> Table 1 shows the results for class A utterances. For these, the system produced the correct answer for 80.7 percent of the transcript input and 61.4 percent of the speech input. The performance for transcript input reflects the grammatical and semantic coverage of the parser and application program. The performance for the speech input reflects additional errors made in the recognition stage. Recognition performance for these utterances is shown in Table 2. Word substitutions, deletions and insertions are summed to give the word error measure of 28.7 percent. A string error rate of 79 percent means that only twenty-one percent of the utterances contained no errors. However, 61 percent of the utterances gave correct answers. This illustrates the ability of the parser to handle minor misrecognitions in the recognized string.</Paragraph> <Paragraph position="7"> The D1 test set is designed to provide a test of dialog capability. The utterances are specified in pairs. The first utterance is processed normally and is used to set the context for the second utterance of the pair. Missing the first utterance can lead to incorrectly interpreting the second. Tables 3 and 4 show the understanding performance and speech recognition rates for the D1 test set. While the recognition results are comparable to those for set A, the understanding performance is significantly worse. This is due in large part to pairs in which we missed the first utterance, causing the context for the second to be wrong. We feel that recognition error rates for spontaneous input will improve considerably with the addition of out-of-vocabulary models and with better lexical and grammatical coverage.</Paragraph> </Section> <Section position="6" start_page="103" end_page="104" type="evalu"> <SectionTitle> ERROR ANALYSIS </SectionTitle> <Paragraph position="0"> In order to interpret the performance of the system, it is useful to look at the source of the errors. Table 5 shows the percentage of errors from various sources.</Paragraph> <Paragraph position="1"> Twenty-five percent of our errors were a result of lack of grammatical coverage. This includes unknown words for concepts that the system has. For example, the system knew day names (Monday, Tuesday, etc.) but not plural day names (Mondays, etc.) since these had not been seen in the training data. This category also contains errors where all words were known but the specific word sequence used did not match any phrase patterns.</Paragraph> <Paragraph position="2"> Twenty percent of the errors were due to a lack of semantic coverage. In this case, there were no frames for the type of question being asked or no slots for the type of information being provided. For example, one utterance requested &quot;a general description of the aircraft&quot;. Our system allows you to ask about specific attributes of an aircraft but does not have the notion of a &quot;general description&quot; that maps to a subset of these attributes. Twenty-five percent of the errors were due to outputting the wrong field from the database for the CAS answer. In these cases, the utterance was correctly understood and a reasonable answer was output, but it was not the specific answer required by the CAS specifications. For example, when asked for cities near the Denver airport, we output the city name &quot;DENVER&quot; rather than the city code &quot;DDEN&quot; as required by CAS.</Paragraph>
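As an illustration of this last error category, the sketch below projects a query result onto the single column the CAS specification requires rather than a human-readable one. The row format, column names, and helper function are hypothetical and are not the CMU back end:

    # Hypothetical rows returned for a query about cities near the Denver airport.
    rows = [{"city_name": "DENVER", "city_code": "DDEN"}]

    def format_cas_answer(rows, required_column):
        # CAS scores the literal database output, so emitting city_name
        # instead of city_code is counted wrong even when the utterance
        # was understood correctly.
        return [row[required_column] for row in rows]

    # format_cas_answer(rows, "city_code") -> ["DDEN"], the answer CAS expects.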
<Paragraph position="3"> Ten percent of the errors were due to utterances that our system considered unanswerable. For CAS evaluation runs, we map all system error messages to a NO_ANSWER response. For example, one utterance asked for ground transportation from Atlanta to Baltimore. Our system recognized that this was outside the abilities of the database and generated an error message that was mapped to NO_ANSWER. The reference answer was the null list &quot;0&quot;.</Paragraph> <Paragraph position="4"> The other twenty percent of the errors were due to coding bugs in the back end.</Paragraph> <Paragraph position="5"> The first two categories (grammatical and semantic errors) are errors in the &quot;understanding&quot; part of the system. Forty-five percent of our total errors were due to not correctly interpreting the input. The other fifty-five percent of the errors were generation errors. That is, the utterance was correctly interpreted but the correct answer was not generated.</Paragraph> </Section> </Paper>