<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1302">
  <Title>Hiroshi Tsujino +</Title>
  <Section position="7" start_page="13" end_page="14" type="evalu">
    <SectionTitle>
5 Experimental Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="13" end_page="13" type="sub_section">
      <SectionTitle>
5.1 Implementation
</SectionTitle>
      <Paragraph position="0"> We implemented a Japanese multi-domain spoken dialogue system with five domain experts: restaurant, hotel, temple, weather, and bus. The specifications of each expert are listed in Table 4. When a slot overlaps between the vocabularies of two or more domains, our architecture treats it as a common slot, whose value is shared among the domains while interacting with the user. In our system, place names are treated as a common slot.</Paragraph>
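A common slot can be sketched as a value that propagates to every expert that defines it. This is a minimal illustration only; the frame representation, expert names, and function below are assumptions, not the paper's implementation.

```python
# Hypothetical frame representation: each domain expert holds its own slots,
# and slots listed in COMMON_SLOTS are shared across all experts.
COMMON_SLOTS = {"place"}

experts = {
    "restaurant": {"place": None, "genre": None},
    "hotel": {"place": None, "price": None},
    "weather": {"place": None, "date": None},
}

def fill_slot(experts, domain, slot, value):
    """Fill a slot in one expert; propagate common slots to all experts."""
    targets = experts.keys() if slot in COMMON_SLOTS else [domain]
    for d in targets:
        if slot in experts[d]:
            experts[d][slot] = value

fill_slot(experts, "restaurant", "place", "Arashiyama")
# The place name is now visible to the weather and hotel experts as well.
```

With this design, a user who mentions "Arashiyama" while searching for a restaurant does not have to repeat the place name when switching to the weather domain.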
      <Paragraph position="1"> We adopted Julian as the grammar-based speech recognizer (Kawahara et al., 2004). The grammar rules for the speech recognizer can be automatically generated from those used in the language understanding modules in each domain.</Paragraph>
      <Paragraph position="2"> As the acoustic model, we adopted a 3,000-state PTM triphone model (Kawahara et al., 2004).</Paragraph>
    </Section>
    <Section position="2" start_page="13" end_page="13" type="sub_section">
      <SectionTitle>
5.2 Collecting Dialogue Data
</SectionTitle>
      <Paragraph position="0"> We collected dialogue data from 10 subjects using a baseline system. First, the subjects used the system by following a sample scenario, to become accustomed to the timing of speaking. They then used the system by following three scenarios, each of which involved at least three domains but explicitly mentioned neither an actual temple name nor a domain. One of the scenarios is shown in Figure 5. Domain selection in the baseline system was performed with the baseline method that will be described in Section 5.4, in which a was set to 40 after preliminary experiments.</Paragraph>
      <Paragraph position="1"> In the experiments, we obtained 2,205 utterances (221 per subject, 74 per dialogue). The speech recognition accuracy was 63.3%, which was rather low. This was because the subjects tended to repeat similar utterances even after misrecognitions occurred due to out-of-grammar or out-of-vocabulary utterances. Another reason was that the dialogues of subjects with worse speech recognition results got longer, which increased the total number of misrecognitions.</Paragraph>
      <Paragraph position="2"> Scenario (Figure 5): Tomorrow or the day after, you are planning a sightseeing tour of Kyoto. Please find a shrine you want to visit in the Arashiyama area and, after considering the weather, decide on which day you will visit the shrine. Please ask for the temperature on the day of travel. Also find out how to get to the shrine, whether you can take a bus there from Kyoto Station, when the shrine closes, and what the entrance fee is.</Paragraph>
    </Section>
    <Section position="3" start_page="13" end_page="13" type="sub_section">
      <SectionTitle>
5.3 Construction of the Domain Classifier
</SectionTitle>
      <Paragraph position="0"> We used the 2,205 utterances collected with the baseline system to construct a domain classifier. We used C5.0 (Quinlan, 1993) as the classifier, with the features described in Section 4. Reference labels were assigned by hand to each utterance, based on the domains the system had selected and on transcriptions of the user's utterances, as follows.</Paragraph>
      <Paragraph position="1"> Label (I): When the correct domain for the user's utterance is the same as the domain of the previous system response.</Paragraph>
      <Paragraph position="2"> Label (II): Except for case (I), when the correct domain for the user's utterance is the domain in which the highest-scored speech recognition result among the N-best candidates can be interpreted.</Paragraph>
      <Paragraph position="3"> Label (III): Domains other than (I) and (II).</Paragraph>
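The three labeling rules above amount to a simple cascaded decision. The function below is an illustrative sketch (the names are hypothetical; the paper does not give an implementation):

```python
# Sketch of the reference-labeling rules (I)-(III).
# correct_domain: the true domain of the user's utterance
# previous_domain: the domain of the previous system response
# top_hypothesis_domain: the domain of the best-scored N-best candidate
def reference_label(correct_domain, previous_domain, top_hypothesis_domain):
    if correct_domain == previous_domain:
        return "I"    # same domain as the previous system response
    if correct_domain == top_hypothesis_domain:
        return "II"   # domain of the best-scored recognition candidate
    return "III"      # any other domain
```

Because the rules are deterministic given these three inputs, the annotator only has to supply the correct domain; the label itself follows mechanically.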
    </Section>
    <Section position="4" start_page="13" end_page="14" type="sub_section">
      <SectionTitle>
5.4 Evaluation of Domain Selection
</SectionTitle>
      <Paragraph position="0"> We compared the performance of our domain selection with that of the baseline method described below.</Paragraph>
      <Paragraph position="1"> Baseline method: The domain having the interpretation with the highest score among the N-best speech recognition candidates was selected, after adding a bonus a to the acoustic likelihood of the speech recognizer when the domain was the same as the previous one. We calculated the accuracy of domain selection for various values of a.</Paragraph>
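The baseline rule can be sketched as an argmax over the N-best list with a same-domain bonus. The representation of candidates and the score scale below are assumptions for illustration:

```python
# Sketch of the baseline: choose the domain whose best interpretation has the
# highest score, adding a bonus `a` when the domain matches the previous one.
def baseline_select(nbest, previous_domain, a):
    """nbest: list of (domain, acoustic_score) pairs from the recognizer."""
    def adjusted(item):
        domain, score = item
        bonus = a if domain == previous_domain else 0.0
        return score + bonus
    return max(nbest, key=adjusted)[0]

nbest = [("bus", 100.0), ("weather", 80.0)]
# With a = 0 the top-scored candidate wins; a large bonus keeps the previous domain.
assert baseline_select(nbest, "weather", 0) == "bus"
assert baseline_select(nbest, "weather", 40) == "weather"
```

This makes the trade-off explicit: raising a suppresses spurious domain switches caused by misrecognition, but an overly large a prevents legitimate switches.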
      <Paragraph position="2">  (Footnote: Although only one of the authors assigned the labels, they could be assigned without ambiguity, since the labels are defined automatically as described above. The annotator only needed to judge whether the user's request concerned the same domain as the previous system response, or a domain appearing in the speech recognition result.)</Paragraph>
      <Paragraph position="3">  Our method was evaluated with a 10-fold cross validation; that is, one tenth of the 2,205 utterances was used as test data, and the remainder was used as training data. The process was repeated 10 times, and the average of the accuracies was computed.</Paragraph>
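The evaluation procedure can be sketched as follows; `train` and `evaluate` stand in for the C5.0 classifier, and the fold-splitting details are assumptions:

```python
import random

# Sketch of 10-fold cross validation over the collected utterances:
# split into ten folds, train on nine, test on one, and average accuracy.
def cross_validate(samples, train, evaluate, k=10, seed=0):
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # near-equal interleaved folds
    accuracies = []
    for i in range(k):
        test_idx = set(folds[i])
        train_set = [samples[j] for j in idx if j not in test_idx]
        test_set = [samples[j] for j in folds[i]]
        model = train(train_set)
        accuracies.append(evaluate(model, test_set))
    return sum(accuracies) / k
```

With 2,205 utterances each fold holds roughly 220 test utterances, matching the "one tenth" split described above.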
      <Paragraph position="4"> Accuracies for domain selection were calculated per utterance. When there were several domains that had the same score after domain selection, one domain was randomly selected among them as an output.</Paragraph>
      <Paragraph position="5"> Figure 6 shows the number of domain selection errors in the baseline method, categorized by their reference labels, as a changed. As a increases, so does the system's tendency to keep the previous domain. The condition a = 0 corresponds to a method in which domains are selected based only on the speech recognition results, that is, with no constraint toward keeping the current domain. As Figure 6 shows, the number of errors whose reference label is "a domain in the previous response" (choice (I)) decreases as a gets larger, because incorrect domain transitions due to speech recognition errors are suppressed by the constraint to keep the domain. Conversely, errors whose label is "a domain with the highest speech recognition score" (choice (II)) increase, because the incentive for keeping the previous domain becomes too strong. The smallest number of errors was 634 at a = 35, for a domain selection error rate of 28.8% (= 634/2205). There were 371 errors whose reference labels were neither "a domain in the previous response" nor "a domain with the highest speech recognition score"; these cannot be detected under conventional frameworks, no matter how a is changed. We also calculated the classification accuracy of our method. Table 5 shows the results as a confusion matrix. The left-hand figure in each cell denotes the number of outputs of the baseline method, while the right-hand figure denotes the number of outputs of our method. Correct outputs lie in the diagonal cells, and domain selection errors in the off-diagonal cells. Total accuracy increased by 5.3 points, from 71.2% to 76.5%, and the number of domain selection errors was reduced from 634 to 518, an error reduction rate of 18.3% (= 116/634). 
There was no output of the baseline method for "other domains" (III), in the third column, because conventional frameworks do not take this choice into consideration. Our method detected this kind of error in 157 of the 371 utterances, which prevents such errors from propagating. Moreover, the accuracies for (I) and (II) did not get worse. Precision for (I) improved from 0.77 to 0.83, and the F-measure for (I) also improved from 0.83 to 0.86. Although recall for (II) got worse, its precision improved from 0.52 to 0.62, and consequently the F-measure for (II) improved slightly from 0.61 to 0.62. These results show that our method can detect the newly introduced choice (III) without degrading the existing classification accuracy. The following features played an important role in the decision tree. Features representing confidence in the previous domain appeared in the upper part of the tree, including "the number of affirmatives after entering the domain" (P1), "the ratio of the user's negative answers in the domain" (P9), "the number of turns after entering the domain" (P6), and "the number of slots changed by the user's utterances after entering the domain" (P5). Also important were "whether the domain with the highest score has appeared before" (C8) and "whether the interpretation of the current user's utterance is negative" (C2).</Paragraph>
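The aggregate figures reported above can be recomputed directly from the error counts. This is a quick arithmetic check, not code from the paper:

```python
# Recomputing the reported rates from the raw error counts.
total_utterances = 2205
baseline_errors = 634   # best baseline result (a = 35)
proposed_errors = 518   # our method

baseline_error_rate = baseline_errors / total_utterances          # about 28.8%
baseline_accuracy = 1.0 - baseline_error_rate                     # about 71.2%
proposed_accuracy = 1.0 - proposed_errors / total_utterances      # about 76.5%
error_reduction = (baseline_errors - proposed_errors) / baseline_errors  # about 18.3%
```

Each derived value agrees with the percentages quoted in the text, including the 5.3-point accuracy gain.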
    </Section>
  </Section>
</Paper>