<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0306">
<Title>Stochastic Language Generation for Spoken Dialogue Systems</Title>
<Section position="4" start_page="29" end_page="30" type="evalu">
<SectionTitle>
3 Evaluation
</SectionTitle>
<Paragraph position="0"> It is generally difficult to evaluate a generation system empirically. In the context of spoken dialogue systems, evaluating NLG becomes an even harder problem. One reason is simply that very little effort has gone into building generation engines for spoken dialogue systems. Another is that it is hard to separate NLG from the rest of the system; in particular, it is hard to separate the evaluation of language generation from that of speech synthesis.</Paragraph>
<Paragraph position="1"> As a simple solution, we have conducted a comparative evaluation by running two identical systems that vary only in the generation component.</Paragraph>
<Paragraph position="2"> In this section we present results from two preliminary evaluations of the generation algorithms described in the previous sections.</Paragraph>
<Section position="1" start_page="29" end_page="30" type="sub_section">
<SectionTitle>
3.1 Content Planning: Experiment
</SectionTitle>
<Paragraph position="0"> For the content planning part of the generation system, we conducted a comparative evaluation of the two different generation algorithms: old/new and bigrams. Twelve subjects had two dialogues each, one with the old/new generation system and one with the bigrams generation system (in counterbalanced order); all other modules were held fixed. Afterwards, each subject answered seven questions on a usability survey. Each subject was then given transcribed logs of his/her dialogues and asked to rate each system utterance on a scale of 1 to 3 (1 = good; 2 = okay; 3 = bad).</Paragraph>
</Section>
<Section position="2" start_page="30" end_page="30" type="sub_section">
<SectionTitle>
3.2 Content Planning: Results
</SectionTitle>
<Paragraph position="0"> For the usability survey, the results seem to indicate a preference among subjects for the old/new system, but the difference is not statistically significant (p = 0.06). However, in response to the question &quot;During the session, which system's responses were easier to understand?&quot;, six of the twelve subjects chose the bigram system, compared to three subjects who chose the old/new system.</Paragraph>
</Section>
<Section position="3" start_page="30" end_page="30" type="sub_section">
<SectionTitle>
3.3 Surface Realization: Experiment
</SectionTitle>
<Paragraph position="0"> For surface realization, we conducted a batch-mode evaluation. We picked six recent calls to our system and ran two generation algorithms (template-based generation and stochastic generation) on the input frames. We then presented the generated dialogues, consisting of the decoder output of the user utterances and the corresponding system responses for each of the two generation algorithms, to seven subjects. For each utterance that differed between the two systems, subjects selected the output they preferred.</Paragraph>
<Paragraph position="1"> The results show a trend that subjects preferred stochastic generation over template-based generation, but a t-test shows no significant difference (p = 0.18). We are in the process of designing a larger evaluation.</Paragraph>
</Section>
</Section>
</Paper>
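The paper reports significance tests (p = 0.06 for the usability survey, p = 0.18 for the surface realization preferences) but does not show how they were computed. The following is a minimal illustrative sketch, not the authors' actual analysis: it runs a paired t-test (scipy.stats.ttest_rel) over per-subject usability scores for the two content-planning systems. All scores, variable names, and the choice of a paired t-test are assumptions made for illustration only.

# Minimal sketch of a paired t-test over per-subject survey scores,
# as might underlie the comparison reported in Section 3.2.
# The scores below are invented for illustration; they are NOT the
# paper's data. Requires scipy.
from scipy.stats import ttest_rel

# One aggregate usability score per subject (12 subjects), per system.
oldnew_scores = [4.1, 3.8, 4.5, 3.9, 4.2, 4.0, 3.7, 4.4, 4.1, 3.9, 4.3, 4.0]
bigram_scores = [3.9, 3.6, 4.2, 4.0, 3.8, 3.9, 3.5, 4.1, 4.0, 3.7, 4.2, 3.8]

# Paired test: each subject used both systems (counterbalanced order),
# so the samples are dependent and a paired t-test is appropriate.
t_stat, p_value = ttest_rel(oldnew_scores, bigram_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

A p-value near 0.06, as reported, would indicate a trend favoring the old/new system that falls just short of the conventional 0.05 significance threshold.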