<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2302">
  <Title>Stochastic Language Generation in a Dialogue System: Toward a Domain Independent Generator</Title>
  <Section position="7" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> This paper is evaluating two surface generation design decisions: the effectiveness of stochastic (word forest based) surface generation with domain independent language models, and the benefits of using dialogue vs.</Paragraph>
    <Paragraph position="1"> newswire models. Evaluating any natural language generation system involves many factors, but we focused on two of the most important aspects to evaluate, the content and clarity (naturalness) of the output (English utterances). This section briefly describes previous automatic evaluation approaches that we are avoiding, followed by the human evaluation we have performed on our system.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Automatic Evaluation
</SectionTitle>
      <Paragraph position="0"> Evaluating generation is particularly difficult due to the diverse amount of correct output that can be generated.</Paragraph>
      <Paragraph position="1"> There are many ways to present a given semantic representation in English and what determines quality of content and form are often subjective measures. There are two general approaches to a surface generation evaluation. The first uses human evaluators to score the output with some pre-defined ranking measure. The second uses a quantitative automatic approach usually based on n-gram presence and word ordering. Bangalore et al. describe some of the quantitative measures that have been used in (Bangalore et al., 2000). Callaway recently used quantitative measures in an evaluation between symbolic and stochastic surface generators in (Callaway, 2003).</Paragraph>
      <Paragraph position="2"> The most common quantitative measure is Simple String Accuracy. This metric uses an ideal output string and compares it to a generated string using a metric that combines three word error counts; insertion, deletion, and substitution. One variation on this approach is tree-based metrics. These attempt to better represent how bad a bad result is. The tree-based accuracy metrics do not compare two strings directly, but instead build a dependency tree for the ideal string and attempt to create the same dependency tree from the generated string. The score is dependent not only on word choice, but on positioning at the phrasal level. Finally, the most recent evaluation metric is the Bleu Metric from IBM(Papineni et al., 2001).</Paragraph>
      <Paragraph position="3"> Designed for Machine Translation, it scores generated sentences based on the n-gram appearance from multiple ideal sentences. This approach provides more than one possible realization of an LF and compares the generated sentence to all possibilities.</Paragraph>
      <Paragraph position="4"> Unfortunately, the above automatic metrics are very limited in mimicking human scores. The Bleu metric can give reasonable scores, but the results are not as good when only one human translation is available. These automatic metrics all compare the desired output with the actual output. We decided to ignore this evaluation because it is too dependent on syntactic likeness. The following two sentences represent the same semantic meaning yet appear very different in structure: The injured person is still waiting at the hospital.</Paragraph>
      <Paragraph position="5"> The person with the injury at the hospital is still waiting. The scoring metrics would judge very harshly, yet a human evaluator should see little difference in semantic content. Clearly, the first is indeed better in naturalness (closeness to human English dialogue), but both content and naturalness cannot be measured with the current quantitative (and many human study) approaches.</Paragraph>
      <Paragraph position="6"> Although it is very time consuming, human evaluation continues to be the gold standard for generation evaluation.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Evaluation Methodology
</SectionTitle>
      <Paragraph position="0"> Our evaluation does not compare an ideal utterance with a generated one. We use a real human-human dialogue transcript and replace every utterance of one of the participants with our generated output. The evaluators are thereby reading a dialogue between a human and a computer generated human, yet it is based on the original human-human dialogue. Through this approach, we can present the evaluators with both our generated and the original transcripts (as the control group). However, they do not know which is artificial, or even that any of them are not human to human. The results will give an accurate portrayal of how well the system generates dialogue. The two aspects of dialogue that the evaluators were asked to measure for each utterance were understandability (semantically within context) and naturalness.</Paragraph>
      <Paragraph position="1"> There have been many metrics used in the past. Metrics range from scoring each utterance with a subjective score (Good,Bad) to using a numeric scale. Our evaluators use a numeric scale from 0 to 5. The main motivation for this is so we can establish averages and performance results more easily. The final step is to obtain a suitable domain of study outside the typical air travel domain.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Domain Description and Dialogue Construction
</SectionTitle>
      <Paragraph position="0"> A good dialogue evaluation is one in which all aspects of a natural dialogue are present and the only aspect that has been changed is how the surface generation presents the required information. By replacing one speaker's utterances with our generated utterances in a transcript of a real conversation, we guarantee that grounding and turn-taking are still present and our evaluation is not hindered by poor dialogue cues. The TRIPS Monroe Corpus (Stent, 2000) works well for this task.</Paragraph>
      <Paragraph position="1"> There are 20 dialogues in the Monroe Corpus. Each dialogue is a conversation between two English speakers. Twenty different speakers were used to construct the dialogues. Each participant was given a map of Monroe County, NY and a description of a task that needed to be solved. There were eight different disaster scenarios ranging from a bomb attack to a broken leg and the participants were to act as emergency dispatchers (this domain is often referred to as the 911 Rescue Domain).</Paragraph>
      <Paragraph position="2"> One participant U was given control of solving the task, and the other participant S was told that U had control.</Paragraph>
      <Paragraph position="3"> S was to assist U in solving the task. At the end of the discussion, U was to summarize the final plan they had created together.</Paragraph>
      <Paragraph position="4"> The average dialogue contains approximately 500 utterances. We chose three of the twenty dialogues for our evaluation. The three were the shorter dialogues in length (Three of the only four dialogues that are less than 250 utterances long. Many are over 800 utterances.). This was needed for practical reasons so the evaluators could conduct their rankings in a reasonable amount of time and still give accurate rankings. The U and S speakers for each dialogue were different.</Paragraph>
      <Paragraph position="5"> We replaced the S speaker in each of the dialogues with generated text, created by the following steps:  * Parse each S utterance into its LF with the TRIPS parser.</Paragraph>
      <Paragraph position="6"> * Convert the LF to the AMR grammar format.</Paragraph>
      <Paragraph position="7"> * Send the AMR to HALogen.</Paragraph>
      <Paragraph position="8"> * Generate the top sentence from this conversion us null ing our chosen LM.</Paragraph>
      <Paragraph position="9"> We hand-checked for correctness each AMR that is created from the LF. The volatile nature of a dialogue system under development assured us that many of the utterances were not properly parsed. Any errors in the AMR were fixed by hand and hand constructed when no parse could be made. The fixes were done before we tried to generate the S speaker in the evaluation dialogues.</Paragraph>
      <Paragraph position="10"> We are assuming perfect input to generation. This evaluation does not evaluate how well the conversion from the LF to the AMR is performing. Our goal of generating natural dialogue from a domain-independent LM can be fully determined by analyzing the stochastic approach in isolation. Indeed, the goal of a domain independent generator is somewhat dependent on the conversion from our domain independent LF, but we found that the errors from the conversion are not methodological errors. The errors are simple lexicon and code errors that do not relate to domain-specifics. Work is currently underway to repair such inconsistencies.</Paragraph>
      <Paragraph position="11"> Each of the S participant's non-dialogue-management utterances were replaced with our generated utterances. The grounding, turn-taking and acknowledgment utterances were kept in their original form. We plan on generating these latter speech acts with templates and are only testing the stochastic generation in this evaluation. The U speaker remained in its original state. The control groups will identify any bias that U may have over S (i.e. if U speaks 'better' than S in general), but testing the generation with the same speaker allows us to directly compare our language models.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Language Model Construction
</SectionTitle>
      <Paragraph position="0"> We evaluated two language models. The first is a news source model trained on 250 million words with a vocabulary of 65,529 from the WSJ, AP and other online news sources as built in (Langkilde-Geary, 2002). This model will be referred to as the WSJ LM. The second language model was built from the Switchboard Corpus (J. Godfrey, 1992), a corpus of transcribed conversations and not newswire text. The corpus is comprised of 'spontaneous' conversations recorded over the phone, including approximately 2 million words with a vocabulary of 20,363.</Paragraph>
      <Paragraph position="1"> This model will be referred to as the SB LM. Both models are trigram, open vocabulary models with Witten-Bell smoothing. The Switchboard Corpus was used because it contrasts the newswire corpus in that it is in the genre of dialogue yet does not include the Monroe Corpus that the evaluation was conducted on.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.5 Evaluators
</SectionTitle>
      <Paragraph position="0"> Ten evaluators were chosen, all were college undergraduates between the ages of 18-21. None were linguistics or computer science majors. Each evaluator received three transcripts, one from each of our three chosen dialogues.</Paragraph>
      <Paragraph position="1"> One of these three was the original human to human dialogue. The other two had the S speaker replaced by our surface generator. Half of the evaluators received generations using the WSJ LM and the other half received the SB LM. They ranked each utterance for understandability and naturalness on scales between 0 and 5. A comparison of the human and generated utterances is given in figure 8 in the appendix.</Paragraph>
      <Paragraph position="2">  for the two original human speakers, U and S. The ten evaluators are listed by number, 0 to 9. Evaluators rated the content (understandability) and clarity (naturalness) of each utterance on a 0-5 scale. S was rated slightly higher than U.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
6 Results
</SectionTitle>
    <Paragraph position="0"> Figure 4 compares the control dialogues as judged by the human evaluators by giving the percent difference between the two human speakers. It is apparent that the U speaker is judged worse than the S speaker in the average of the three dialogues. We see the S speaker is scored 3.24% higher in understanding and 1.85% higher in naturalness. Due to the nature of the domain, the U speaker tends to make more requests and short decisions while the S speaker gives much longer descriptions and reasons for his/her actions. It is believed the human evaluators tend to score shorter utterances more harshly because they aren't 'complete sentences' as most people are used to seeing in written text. We believe this also explains the discrepancy of evaluator 9's very high scores for the S speaker. Evaluator 9 received dialogue 10 as his control dialogue.</Paragraph>
    <Paragraph position="1"> Dialogue 10's S speaker tended to have much longer utterances than any of the other five speakers in the three dialogues. It is possible that this evaluator judged shorter utterances more harshly.</Paragraph>
    <Paragraph position="2"> Figure 5 shows the comparison between using the two LMs as well as the human control group. The scores shown are the average utterance scores over all evaluators and dialogues. The dialogue management (grounding, turn-taking, etc.) utterance scores are not included in these averages. Since we do not generate these types of utterances, it would be misleading to include them in our evaluation. As figure 5 shows, the difference between the two LMs is small. Both received a lower naturalness score than understandability. It is clear that we are able to generate utterances that are understood, but yet are slightly less natural than a human speaker.</Paragraph>
    <Paragraph position="3"> Figure 6 shows the distribution of speech acts in each of the 3 evaluation dialogues. Due to the nature of the  derstandability and naturalness with the dialogue management utterances removed. The first compares the S speaker generated with the WSJ LM, the second compares the S speaker generated with the SB LM, and the third is the S speaker using the original human utterances.</Paragraph>
    <Paragraph position="4"> peratives. Since the two participants in the dialogues work together and neither has more information about the rescue problem than the other, there are not many questions. Rather, it is mostly declaratives and acknowledg- null act across all evaluators. Note that the numbers are only for the S speaker in each dialogue because only S was generated with the surface generator. Since each evaluator scored 2 computer dialogues and 1 human (control) dialogue, the LM numbers are averaged across twice as many examples. The understandability scores for the WSJ and SB LMs are relatively the same across all acts, but naturalness is slightly less in the SB LM. Comparing the human scores to both out-of-domain LMs, we see that declaratives averaged almost a 0.5 point loss from the human control group in both understandability and naturalness. Imperatives suffer an even larger decrease with an approximate 0.7 loss in understandability. The SB LM actually averaged over 1.0 decrease in naturalness. The interrogatives ranged from a 0.5 to 0 loss.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Discussion
</SectionTitle>
      <Paragraph position="0"> We can conclude from figure 5 that the evaluators were relatively consistent among each other in rating understandability, but not as much so with naturalness. The comparison between the WSJ and SB LMs is inconclusive because we see in figure 5 that even though the evaluators gave the WSJ utterances higher absolute scores than the SB utterances, the percent difference from how they ranked the human U speaker is lower. The fact that it is inconclusive is somewhat surprising because intuition leads us to believe that the dialogue-based SB would perform better than the newswire-based WSJ. One reason may be because the nature of the Monroe Corpus does not include many dialogue specific acts such as questions and imperatives. However, declaratives are well represented and we can conclude that the newswire WSJ LM is as effective as the dialogue SB model for generating dialogue declaratives. Also, it is of note that the WSJ LM out-performed the SB LM in naturalness for most speech act types (as seen in figure 7) as well.</Paragraph>
      <Paragraph position="1"> The main result from this work is that an out-of-domain language model cannot only be used in a stochastic dialogue generation system, but the large amount of available newswire can also be effectively utilized. We found only a 7.28% decrease in understandability and an 11.58% decrease in naturalness using our newswire LM.</Paragraph>
      <Paragraph position="2"> This result is exciting. These percentages correspond to ranking an utterance 4.64 and 4.42 instead of a perfect 5.00 and 5.00. The reader is encouraged to look at the output of the generation in the appendix, figure 8.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Future Work
</SectionTitle>
      <Paragraph position="0"> We have created a new grammar to generate from the LF that recognizes the full set of thematic roles. In addition, we have linked our dialogue system's lexicon to the generation module instead of WordNet, resulting in a fully integrated component to be ported to new domains with little effort. It remains to run an evaluation of this design.</Paragraph>
      <Paragraph position="1"> Also, stochastic generation favors other avenues of generation research, such as user adaptation. Work is being done to adapt to the specific vocabulary of the human user using dynamic language models. We hope to create an adaptive, natural generation component from this effort.</Paragraph>
      <Paragraph position="2"> Finally, we are looking into random weighting approaches for the generation grammar rules and resulting word forest in order to create dynamic surface generation. One of the problems of template-based approaches is that the generation is too static. Our corpus-based approach solves much of the problem, but there is still a degree of 'sameness' that is generated among the utterances.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>