<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2305"> <Title>Combining Acoustic Confidences and Pragmatic Plausibility for Classifying Spoken Chess Move Instructions</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Results and Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Cost Measure </SectionTitle> <Paragraph position="0"> We evaluate the task of selecting correct hypotheses with two different metrics: i) classification accuracy and ii) a simple cost measure that computes a score for different classifications on the basis of their confusion matrices.</Paragraph> <Paragraph position="1"> Table 1 shows how we derived costs from the additional number of steps (verbal and non-verbal) that have to be taken in order to carry out a user move instruction. Note that the cost measure is not validated against user judgements and should therefore only be considered an indicator of the (relative) quality of a classification.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Results </SectionTitle> <Paragraph position="0"> Table 2 lists the different systems with their accuracy and associated cost. Here and in subsequent tables, FH17 and FH45 refer to the first hypothesis baselines with confidence thresholds 17 and 45 respectively, FLM to the first legal move baseline, and Exp1 and Exp2 to Experiments 1 and 2 respectively.</Paragraph> <Paragraph position="1"> The most striking result in Table 2 is the huge classification improvement between the first hypothesis and the first legal move baselines. For our domain, this shows a clear advantage of n-best recognition processing filtered with &quot;hard&quot; domain constraints (i.e. legal moves) over single-best processing.</Paragraph> <Paragraph position="2"> Note that the results for Exp1 and Exp2 in Table 2 are given &quot;by utterance&quot; (i.e. 
they do not reflect the classification performance for individual hypotheses from the n-best lists and the lists of all legal moves). Note also that the different baselines and the machine learning systems have access to different information sources, and what counts as a correct or incorrect classification therefore varies. For example, the gold standard for the first hypothesis baseline only considers the best recognition result for each move instruction. If this is not the one intended by the speaker, it counts as incorrect in the gold standard. On the other hand, the first legal move among the 10-best recognition hypotheses for the same utterance might well be the correct one and would therefore count as correct in the gold standard for the FLM baseline.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Comparing Classification Systems </SectionTitle> <Paragraph position="0"> We use the chi-square test of independence to compute whether the classification results are significantly different from each other. Table 3 reports significance results for comparing the different classifications of the test data (for all test games). The table entries give the differences in cost and the level of statistical difference between the confusion matrices as computed by the chi-square test, and should be read row by row. For example, the top row in Table 3 compares the classification from Exp2 to all other classifications. The value -1054*** means that the cost compared to FH17 is reduced by 1054 and that the confusion matrices are significantly different at p &lt; .001. Tables 4 and 5 compare the performance of the different systems for strong and weak test games respectively (a variable controlled for during data collection).</Paragraph> <Paragraph position="1"> The results show that the machine learning systems perform better for the strong test data. 
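For readers reproducing the significance analysis above, the pairwise comparison can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the correct/incorrect counts are invented, and the chi-square statistic is computed by hand on a 2x2 contingency table built from two classifiers' outcomes.

```python
# Illustrative sketch: chi-square test of independence on a 2x2
# contingency table of correct/incorrect counts for two classifiers.
# All counts are hypothetical.

def chi_square_2x2(table):
    """Return the chi-square statistic for a 2x2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical correct/incorrect counts for two classifications:
table = [[620, 380],   # classifier A: correct, incorrect
         [540, 460]]   # classifier B: correct, incorrect
stat = chi_square_2x2(table)
# With df = 1, the critical value at the .001 level is about 10.83;
# a larger statistic means the two classifications differ significantly.
print(round(stat, 2), stat > 10.83)
```

In practice a library routine (e.g. one that also applies a continuity correction) would normally be preferred; the hand-rolled version above only illustrates the comparison reported in Table 3.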
We conjecture that the poorer results for the weak data are due to the larger number of bad moves in these games, which receive a low evaluation score and might therefore be considered incorrect by the learners.</Paragraph> </Section> </Section> </Paper>
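The first-legal-move (FLM) baseline of Section 5.2, which filters n-best recognition output with the &quot;hard&quot; domain constraint of move legality, can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the moves, confidence scores, and the legality check are hypothetical stand-ins.

```python
# Illustrative sketch of the first-legal-move (FLM) baseline:
# scan an n-best recognition list in order and return the first
# hypothesis that is a legal move. All data below is hypothetical.

def first_legal_move(nbest, legal_moves):
    """Return the first hypothesis in the n-best list that is legal,
    or None if no hypothesis survives the hard domain constraint."""
    for hypothesis, confidence in nbest:
        if hypothesis in legal_moves:
            return hypothesis
    return None

# Hypothetical 10-best list of (move, acoustic confidence), best first:
nbest = [("e2e5", 52),  # top recognition hypothesis, but not a legal move
         ("e2e4", 47),  # first legal move in the list
         ("b1c3", 31)]
legal_moves = {"e2e4", "e2e3", "b1c3", "g1f3"}
print(first_legal_move(nbest, legal_moves))  # prints e2e4
```

This mirrors why FLM beats the single-best baselines in Table 2: a misrecognised top hypothesis is discarded as soon as the domain rules out the move, letting a lower-ranked but legal (and often intended) hypothesis through.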