<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1044">
  <Title>Combining Acoustic and Pragmatic Features to Predict Recognition Performance in Spoken Dialogue Systems</Title>
  <Section position="8" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
7 Results and Evaluation
</SectionTitle>
    <Paragraph position="0"> The middle part of Table 2 shows the classification results for TiMBL and RIPPER when run with default parameter settings (the other results are included for comparison). The individual rows show the performance when different combinations of feature groups are used for training. The results for the three-way classification are included for comparison with the baseline system and are obtained by combining the two classes clarify and reject.</Paragraph>
    <Paragraph position="1"> Note that we do not evaluate the performance of the learners for classifying the individual recognition hypotheses but the classification of (whole) user utterances (i.e. including the selection procedure to choose from the classified hypotheses).</Paragraph>
    <Paragraph position="2"> The results show that both learners profit from the addition of more features concerning dialogue context and task context for classifying user speech input appropriately. The only exception from this trend is a slight performance decrease when task features are added in the four-way classification for RIPPER. Note that both learners already outperform the baseline results even when only recognition features are considered. The most striking result is the performance gain for TiMBL (almost 10%) when we include the dialogue features. As soon as dialogue features are included, TiMBL also performs slightly better than RIPPER.</Paragraph>
    <Paragraph position="3"> Note that the introduction of (limited) task features, in addition to the DIAL and UTT features, did not have dramatic impact in this study. One aim for future work is to define and analyze the influence of further task related features for classification.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.1 Optimizing TiMBL Parameters
</SectionTitle>
      <Paragraph position="0"> In all of the above experiments we ran the machine learners with their default parameter settings.</Paragraph>
      <Paragraph position="1"> However, recent research (Daelemans and Hoste, 2002; Marsi et al., 2003) has shown that machine learners often profit from parameter optimization (i.e. finding the best performing parameters on some development data). We therefore selected 40 possible parameter combinations for TiMBL (varying the number of nearest neighbors, feature weighting, and class voting weights) and nested a parameter optimization step into the &amp;quot;leave-one-out&amp;quot; evaluation paradigm (cf. Figure 2).5 Note that our optimization method is not as sophisticated as the &amp;quot;Iterative Deepening&amp;quot; approach 5We only optimized parameters for TiMBL because it performed better with default settings than RIPPER and because the findings in (Daelemans and Hoste, 2002) indicate that TiMBL profits more from parameter optimization.</Paragraph>
      <Paragraph position="2">  1. Set aside the recognition hypotheses for one of the user utterances.</Paragraph>
      <Paragraph position="3"> 2. Randomly split the remaining data into an 80% training and 20% test set.</Paragraph>
      <Paragraph position="4"> 3. Run TiMBL with all possible parameter settings on the generated training and test sets and store the best performing settings.</Paragraph>
      <Paragraph position="5"> 4. Classify the left-out hypotheses with the recorded parameter settings.</Paragraph>
      <Paragraph position="6"> 5. Iterate.</Paragraph>
      <Paragraph position="7">  described by (Marsi et al., 2003) but is similar in the sense that it computes a best-performing parameter setting for each data fold.</Paragraph>
      <Paragraph position="8"> Table 3 shows the classification results when we run TiMBL with optimized parameter settings and using all feature groups for training.</Paragraph>
      <Paragraph position="9">  Table 3 shows a remarkable 9% improvement for the 3-way and 4-way classification in both accuracy and weighted f-score, compared to using TiMBL with default parameter settings. In terms of WER, the baseline system (cf. Table 1) accepted 233 user utterances with a WER of 21.51%, and in contrast, TiMBL with optimized parameters (Ti OP) only accepted 169 user utterances with a WER of 4.05%. This low WER reflects the fact that if the machine learning system accepts an user utterance, it is almost certainly the correct one. Note that although the machine learning system in total accepted far fewer utterances (169 vs. 233) it accepted more correct utterances than the baseline (159 vs. 154).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> The baseline accuracy for the 3-class problem is 65.68% (61.81% weighted f-score). Our best results, obtained by using TiMBL with parameter op- null timization, show a 25% weighted f-score improvement over the baseline system.</Paragraph>
      <Paragraph position="1"> We can compare these results to a hypothetical &amp;quot;oracle&amp;quot; system in order to obtain an upper bound on classification performance. This is an imaginary system which performs perfectly on the experimental data given the 10-best recognition output. The oracle results reveal that for 18 of the in-grammar utterances the 10-best recognition hypotheses do not include the correct logical form at all and therefore have to be classified as clarify or reject (i.e. it is not possible to achieve 100% accuracy on the experimental data). Table 2 shows that our best results are only 8%/12% (absolute) away from the optimal performance.</Paragraph>
      <Paragraph position="2">  We use the kh2 test of independence to statistically compare the different classification results. However, since kh2 only tells us whether two classifications are different from each other, we introduce a simple cost measure (Table 4) for the 3-way classification problem to complement the kh2 results.6  havior of a dialogue system is to accept correctly recognized utterances and ignore crosstalk (cost 0). The worst a system can do is to accept misrecognized utterances or utterances that were not addressed to the system. The remaining classes are as6We only evaluate the 3-way classification problem because there are no baseline results for the 4-way classification available. null signed a value in-between these two extremes. Note that the cost assignment is not validated against user judgments. We only use the costs to interpret the kh2 levels of significance (i.e. as an indicator to compare the relative quality of different systems).</Paragraph>
      <Paragraph position="3"> Table 5 shows the differences in cost and kh2 levels of significance when we compare the classification results. Here, Ti OP stands for TiMBL with optimized parameters and the stars indicate the level of statistical significance as computed by the kh2 statistics ([?][?][?] indicates significance at p = .001, [?][?] at  The cost measure shows the strict ordering: Oracle &lt; Ti OP &lt; TiMBL &lt; RIPPER &lt; Baseline.</Paragraph>
      <Paragraph position="4"> Note however that according to the kh2 test there is no significant difference between the oracle system and TiMBL with optimized parameters. Table 5 also shows that all of our experiments significantly out-perform the baseline system.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>