<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1024">
<Title>DARPA ATIS Test Results</Title>
<Section position="2" start_page="114" end_page="121" type="abstr">
<SectionTitle> Preliminary Results </SectionTitle>
<Paragraph position="0"> Results were reported to NIST for a total of seven systems by June 19th: two systems from BBN, two from CMU, and one each from MIT, SRI, and Unisys. The system designated as cmu-spi (&quot;spi&quot; => speech input) was the only one for which the input consisted of the speech waveform. For the other systems the input consisted of the SNOR transcriptions.</Paragraph>
<Paragraph position="1"> Subsequently, reformatted results for three systems were accepted: &quot;cmu-r&quot;, &quot;cmu-spir&quot;, and &quot;mit-r&quot;.</Paragraph>
<Paragraph position="2"> The C/S-format input provided for an answer of the form NO_ANSWER to indicate that the system failed to provide an answer for any of several reasons (e.g., failure to recognize the words, failure to parse, failure to produce a valid database query, etc.). Some sites made considerable use of this option; others (e.g., MIT) initially did not, partially due to miscommunication about the validity of this option.</Paragraph>
<Paragraph position="3"> Some trivial fixes were made in the format used for submission of results from some of the sites. One site initially omitted an answer for one of the queries, throwing subsequent REF-HYP alignment off; we inserted a NO_ANSWER response for them.</Paragraph>
<Paragraph position="4"> Since there was miscommunication about the use of the &quot;NO_ANSWER&quot; response, we also changed one system's stock response meaning &quot;system can't handle it&quot; to &quot;NO_ANSWER&quot; for them, and allowed another site to submit revised results with &quot;NO_ANSWER&quot; in place of some of their responses. In the table of results, the revised systems are &quot;cmu-r&quot;, &quot;cmu-spir&quot;, and &quot;mit-r&quot;.</Paragraph>
<Paragraph position="5"> Responding to several complaints from sites about specific items in the test reference material, we corrected one reference answer (bd0071s) and changed the classification of three queries (bm0011s, bp0081s, and bw00sls) from Class A to Class X (in effect deleting these from the test set, reducing the test set size to 90 valid Class A queries). The classification disputes all centered on ambiguity, one of the hardest calls to make.</Paragraph>
<Paragraph position="6"> If similar limitations on what is evaluable are made for the next round, we would like both an explicit principle for deciding when ambiguity is present and a procedure for adjudicating disputes to be agreed on early. The detailed results are given in Table 1a for Class A queries with only lexical items that appear at least once in the training data, and in Table 2a for Class A queries with &quot;new&quot; morphemes, words, or idioms. Table 3 presents a complete summary of the results for the entire 90 sentence-utterance test set.</Paragraph>
<Paragraph position="7"> Since the Class A test queries are not context-dependent, the ordering of these queries is not significant. As an aid in analysis, for the results presented in Tables 1a and 2a, queries have been (roughly) rank-ordered by increasing apparent difficulty. Note that queries toward the top of both parts of the table resulted in more &quot;T&quot; answers than &quot;F&quot; or &quot;NA&quot;, while queries toward the bottom of the table resulted in more &quot;F&quot; and &quot;NA&quot; answers. Not surprisingly, there appears to be a general trend toward increasing apparent difficulty with increased length of the utterance (number of words).</Paragraph>
<Paragraph position="8"> Table 3 shows that the number of correct answers from the various systems ranged from 25 to 58. Note also that for the system for which speech waveform data was used as input (cmu-spir), 35 of the queries were answered correctly. Comparing results from similar systems for the two subsets of the data (Tables 1b and 2b), note that the ratios of the numbers of correctly recognized queries in the two subsets vary from 1.9 to 4.6, with better performance, of course, on the subset for which all lexical items occurred at least once in the training data.</Paragraph>
<Paragraph position="9"> Comparisons such as these are complicated, however, by the fact that different systems returned NO_ANSWER for anywhere from 0 to 60 of the queries. Perhaps a more appropriate denominator for computing the percentage of correct responses would have been the number of queries for which an answer was provided.</Paragraph>
<Paragraph position="11"/>
</Section>
</Paper>
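As a worked illustration of the denominator issue raised in the final paragraph, the following minimal Python sketch (not part of the original paper; the function name is invented and the split of incorrect vs. NO_ANSWER responses is hypothetical) computes the percentage of correct responses both over all test queries and over only those queries for which an answer was provided:

# Illustrative only: a sketch of the two percent-correct figures discussed
# above. Response labels "T" (true), "F" (false), and "NO_ANSWER" follow the
# scoring categories described in the text; the example counts below are
# hypothetical, except for the 90-query test set size and the 58-correct
# maximum reported in Table 3.

def percent_correct(responses):
    """responses: list of labels, each 'T', 'F', or 'NO_ANSWER'."""
    n_correct = sum(1 for r in responses if r == "T")
    n_total = len(responses)                                # all test queries
    n_answered = sum(1 for r in responses if r != "NO_ANSWER")
    pct_of_all = 100.0 * n_correct / n_total
    # Alternative denominator suggested in the text: only the queries
    # for which the system actually provided an answer.
    pct_of_answered = 100.0 * n_correct / n_answered if n_answered else 0.0
    return pct_of_all, pct_of_answered

# Hypothetical system: 58 of 90 queries correct, 12 incorrect, 20 NO_ANSWER.
example = ["T"] * 58 + ["F"] * 12 + ["NO_ANSWER"] * 20
print(percent_correct(example))   # approx. (64.4, 82.9)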