<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1057">
<Title>Using a Randomised Controlled Clinical Trial to Evaluate an NLG System</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle>2 Evaluation of NLG Systems</SectionTitle>
<Paragraph position="0">Evaluation is becoming increasingly important in NLG, as in other areas of NLP; see Mellish and Dale (1998) for a summary of NLG evaluation.</Paragraph>
<Paragraph position="1">As Mellish and Dale point out, we can evaluate the effectiveness of underlying theories, general properties of NLG systems and texts (such as computational speed or text understandability), or the effectiveness of the generated texts in an actual task or application context. Theory evaluations are typically done by comparing the predictions of a theory to what is observed in a human-authored corpus (for example, Yeh and Mellish, 1997). Evaluations of text properties are typically done by asking human judges to rate the quality of generated texts (for example, Lester and Porter, 1997); sometimes human-authored texts are included in the rated set (without judges knowing which texts are human-authored) to provide a baseline. Task evaluations (for example, Young, 1999) are typically done by showing human subjects different texts and measuring differences in an outcome variable, such as success in performing a task.</Paragraph>
<Paragraph position="2">However, despite the above work, we are not aware of any previous evaluation that has compared the effectiveness of NLG texts at meeting a communicative goal against the effectiveness of non-NLG control texts. Young's task evaluation, which may be the most rigorous previous task evaluation of an NLG system, compared the effectiveness of texts generated by different NLG algorithms, while the IDAS task evaluation (Levine and Mellish, 1995) did not include a control text of any kind. Coch (1996) and Lester and Porter (1997) compared NLG texts to human-written and (in Coch's case) mail-merge texts, but the comparisons were based on judgements by human domain experts; they did not measure the actual impact of the texts on users. Carenini and Moore (2000) probably came closest to a controlled evaluation of NLG versus non-NLG alternatives, because they compared the impact of NLG argumentative texts to a no-text control (where users had access to the underlying data but were not given any texts arguing for a particular choice).</Paragraph>
<Paragraph position="3">Task evaluations that compare the effectiveness of texts from NLG systems to the effectiveness of non-NLG alternatives (mail-merge texts, human-written texts, or fixed texts) are expensive and difficult to organise, but we believe they are essential to the progress of NLG, both scientifically and technologically. In this paper we describe such an evaluation, which we performed on the STOP system. The evaluation was indeed expensive and time-consuming, and it was ultimately disappointing in that it suggested STOP texts were no more effective than control texts, but we believe that this kind of evaluation was essential to the project. We hope that our description of the STOP clinical trial and what we learned from it will encourage other researchers to consider performing effectiveness evaluations of NLG systems against non-NLG alternatives.</Paragraph>
</Section>
</Paper>