<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0221">
  <Title>Training a Dialogue Act Tagger For Human-Human and Human-Computer Travel Dialogues</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
3 Results
</SectionTitle>
    <Paragraph position="0"> Given the corpora and features described above, we constructed a set of training and test les for use with the RIPPER engine. Each spoken dialogue utterance by the system or by the human travel agent in the corpora are represented in terms of the features and class values described above. One of the primary goals in these experiments is to test the ability of the trained DATE tagger to learn and apply general rules for dialogue act tagging. In the HC data, we examine how a DATE tagger trained on the June-2000 corpus performs on the October-2001 corpus, with and without 2000 labelled examples of October-2001 training data. For the HH data, we examine how a DATE tagger trained on the two HC corpora (June-2000 and October-2001) performs on the CMU-corpus, with and without 305 utterances of HH labelled training data. We rst report accuracy results for a DATE tagger trained and tested on the HC June-2000 and October-2001 corpora and then report results for the HH CMU-corpus.</Paragraph>
    <Paragraph position="1"> Human-Computer Results: Table 1 shows that the reported accuracies for the HC experiments are signifcantly better than the baseline in each case and the differences between the rows are also statistically signi cant. The rst row shows that the accuracy of a DATE tagger trained and tested using four-fold cross-validation on the June-2000 data is 98.5a0 with a standard error of only 0.11a0 . This indicates that after training on 75a0 of the data, there are few unexpected utterances in the remaining 25a0 . However, the second row shows that a DATE tagger trained on the 9 systems represented in the June-2000 corpus and tested on the (subset) 8 systems represented in the October-2001 corpus only achieves 71.82a0 accuracy. This roughly matches our earlier nding in Section 2.3 that during the interval from June-2000 to April-2001 when the 2001 data collection began, many changes had been made to the Communicator systems and that the learned rules from the June-2000 data were not able to generalize as well to the October-2001 corpus.. The third row shows that the overall variation in the data is still low: when 2000 labelled examples of the October-2001 data are added to the June-2000 data for training, the accuracy increases to 93.82a0 .</Paragraph>
    <Paragraph position="2"> This suggests that adding a small amount of new labelled training data for successive versions of a system would support high accuracy DATE tagging for the new version of the system.</Paragraph>
    <Paragraph position="3"> Some of the rules that RIPPER learned from the HC corpora for predicting the DATE tag for utterances requesting information about the origin city, e.g. What city are you departing from?, and requesting information about the destination city, e.g. Where are you traveling to?, are shown in Figure 8. The gure shows that all of the rules for both about task:request info:orig city and small about task:request info:dest city utilize the utterance string feature. This suggests that single words in utterances can be regarded as reliable indicators  of DATE tags. More interestingly, the words utilized are intuitively plausible for the travel planning domain. For example, the learned question words such as which, where and would are signi cant for utterances that have request info as their SPEECH-ACT dimension. The words city, airport, from, destination and departing are signi cant predictors of utterances that have orig city and dest city as their task dimension.</Paragraph>
    <Paragraph position="4"> if utt-string contains city a4 utt-string contains from a4 pattern-length a5 38 or if utt-string contains airport a4 pattern-length a5 38 or if utt-string contains city a4 pattern-length a5 17 a4 pattern-length a6 15 or if utt-string contains from a4 pattern-length a5 66 a4 utt-string contains Where or if utt-string contains city a4 utt-string contains say or if utt-string contains DEPARTING or if utt-string contains which a4 utt-string contains From or if utt-string contains city a4 system-name=IBM a4 utt-string contains departure or if utt-string contains y a4 utt-string contains which a4 left-sys-utt-string contains city or if utt-string contains y a4 utt-string contains O then about task:request info:orig city if utt-string contains where a4 utt-string contains must  cause the DATE scheme describes utterances in terms of SPEECH-ACT, CONVERSATIONAL-DOMAIN and TASK dimensions, it is also possible to extract from the composite labels and examine the DATE tagger performance for the individual dimensions. Here we focus on the SPEECH-ACT dimension since, as mentioned above, it is more likely to generalize to HH travel dialogues and to other task domains. Table 2 shows the results for a DATE tagger trained and tested on only the SPEECH-ACT dimension. The reported accuracies are signifcantly better than the baseline in each case and the differences between the rows are also statistically signi cant. The results support our original hypothesis, showing that the June-2000 SPEECH-ACT DATE tagger generalizes more readily to the October-2001 corpus, with an accuracy of 82.57a0 (Row 2). Furthermore, as before, even a small amount of training data from the 2001 corpus makes a signi cant improvement in accuracy to 95.68a0 (Row 3), which is close to the 99.1a0 accuracy (Row 1) reported for training and testing on the June-2000 corpus as estimated by 4fold cross-validation.</Paragraph>
    <Paragraph position="5"> Human-Human Results: In order to examine whether there is any generalization from labelled HC data to HH data for the same task, we apply a DATE tagger trained on only the SPEECH-ACT dimension. The rst row of Table 3 shows that when a DATE tagger is trained on only the HC corpus and tested on the HH corpus that the accuracy is 36.72a0 (a signi cant improvement over the baseline). This result demonstrates quantitatively that the HC data can be used to improve performance of a tagger for HH data.</Paragraph>
    <Paragraph position="6"> Now, let us consider a situation where we only have 305 HH labelled utterances from 10 of the HH dialogues to train a DATE tagger. Row 2 shows that we achieve 48.75a0 accuracy when testing on the remainder of the HH corpus. However if we add the HC data to the training set, the accuracy improves signi cantly to 55.48a0 (Row 3). Again this result demonstrates quantitatively that the HC data can improve performance of a tagger for HH data.</Paragraph>
    <Paragraph position="7"> Row 4 shows that the utility of the HC corpus decreases if larger amounts of HH labelled data are available; using 95a0 of the data to train and test- null Majority Class, Acc. = Accuracy, SE = Standard Error) ing on 5a0 with 20-fold cross-validation achieves an accuracy of 76.56a0 .</Paragraph>
    <Paragraph position="8"> Examination of the errors that the tagger makes indicates both similarities and differences between HH and HC dialogues. For example, information is presented in small installments in the HH dialogues whereas information presentation utterances in the HC dialogues tend to be very long. The information presentation utterances in HH dialogues then appear to be syntactically similar to the implicit con rmations in the HC data. Finally, some utterance types that are very frequent in the HC data such as instructions rarely occur in the HH dialogues.</Paragraph>
    <Paragraph position="9"> The rules that are learned for a DATE tagger trained on the HC corpora and the HH CMU-corpus for the offer SPEECH-ACT are in Figures 9 and 10.</Paragraph>
    <Paragraph position="10"> There are two main conclusions that can be drawn from these gures about the generalization from HC to HH corpora in the SPEECH-ACT dimension. First, in general, a larger number of rules are learned for the HH data, suggesting that there is greater variation for the same speech act in HH dialogues. While this is not surprising, there is also signi cant overlap in the features and values used in the rules. For example, the utterance string feature utilizes words such as select, ight, do, okay, ne, these in both rule sets.</Paragraph>
  </Section>
class="xml-element"></Paper>