<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1040">
  <Title>Comparing Automatic and Human Evaluation of NLG Systems</Title>
  <Section position="3" start_page="0" end_page="314" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="313" type="sub_section">
      <SectionTitle>
2.1 Evaluation of NLG systems
</SectionTitle>
      <Paragraph position="0"> NLG systems have traditionally been evaluated using human subjects (Mellish and Dale, 1998).</Paragraph>
      <Paragraph position="1"> NLG evaluations have tended to be of the intrinsic type (Sparck Jones and Galliers, 1996), involving subjects reading and rating texts; usually subjects  are shown both NLG and human-written texts, and the NLG system is evaluated by comparing the ratings of its texts and human texts. In some cases, subjects are shown texts generated by several NLG systems, including a baseline system which serves asanother point ofcomparison. Thismethodology was first used in NLG in the mid-1990s by Coch (1996) and Lester and Porter (1997), and continues to be popular today.</Paragraph>
      <Paragraph position="2"> Other, extrinsic, types of human evaluations of NLG systems include measuring the impact of different generated texts on task performance (Young, 1999), measuring how much experts post-edit generated texts (Sripada et al., 2005), and measuring how quickly people read generated texts (Williams and Reiter, 2005).</Paragraph>
      <Paragraph position="3"> In recent years there has been growing interest in evaluating NLG texts by comparing them to a corpus of human-written texts. As in other areas of NLP, the advantages of automatic corpus-based evaluation are that it is potentially much cheaper and quicker than human-based evaluation, and also that it is repeatable. Corpus-based evaluation was first used in NLG by Langkilde (1998), who parsed texts from a corpus, fed the output of her parser to her NLG system, and then compared the generated texts to the original corpus texts.</Paragraph>
      <Paragraph position="4"> Similar evaluations have been used e.g. by Bangaloreetal.(2000) andMarciniak andStrube(2004). Such corpus-based evaluations have sometimes beencriticised inthe NLG community, forexample by Reiter and Sripada (2002). Grounds for criticism include the fact that regenerating a parsed text is not a realistic NLG task; that texts can be very different from a corpus text but still effectively meet the system's communicative goal; and that corpus texts areoften notofhigh enough quality to form a realistic test.</Paragraph>
    </Section>
    <Section position="2" start_page="313" end_page="313" type="sub_section">
      <SectionTitle>
2.2 Automatic evaluation of generated texts in MT and Summarisation
</SectionTitle>
      <Paragraph position="0"> in MT and Summarisation The MT and document summarisation communities have developed evaluation metrics based on comparing output texts to a corpus of human texts, and have shown that some of these metrics are highly correlated with human judgments.</Paragraph>
      <Paragraph position="1"> The BLEU metric (Papineni et al., 2002) in MT has been particularly successful; for example MT05, the 2005 NIST MT evaluation exercise, used BLEU-4 as the only method of evaluation. BLEU is a precision metric that assesses the quality of a translation in termsof the proportion of itsword n-grams (n = 4 has become standard) that it shares with one or more high-quality reference translations. BLEU scores range from 0 to 1, 1 being the highest which can only be achieved by a translation if all its substrings can be found in one of the reference texts (hence a reference text will always score 1). BLEU should be calculated on a large testsetwithseveral reference translations (fourappears to be standard in MT). Properly calculated BLEU scores have been shown to correlate reliably with human judgments (Papineni et al., 2002).</Paragraph>
      <Paragraph position="2"> The NIST MT evaluation metric (Doddington, 2002) is an adaptation of BLEU, but where BLEU givesequal weight toall n-grams, NIST gives more importance to less frequent (hence more informative) n-grams. BLEU's ability to detect subtle but important differences in translation quality has been questioned, some research showing NIST to be more sensitive (Doddington, 2002; Riezler and Maxwell III, 2005).</Paragraph>
      <Paragraph position="3"> The ROUGE metric (Lin and Hovy, 2003) was conceived asdocument summarisation's answer to BLEU, but it does not appear to have met with the same degree of enthusiasm. There are several different ROUGE metrics. The simplest is ROUGE-N, which computes the highest proportion in any reference summary of n-grams that are matched by the system-generated summary. A procedure is applied that averages the score across leave-one-out subsets of the set of reference texts. ROUGE-N is an almost straightforward n-gram recall metric between two texts, and has several counter-intuitive properties, including thatevenatextcomposed entirely of sentences from reference texts cannot score 1 (unless there is only one reference text). There are several other variants of the ROUGE metric, and ROUGE-2, along with ROUGE-SU (based on skip bigrams and unigrams), were among the official scores for the DUC 2005 summarisation task.</Paragraph>
    </Section>
    <Section position="3" start_page="313" end_page="314" type="sub_section">
      <SectionTitle>
2.3 SUMTIME
</SectionTitle>
      <Paragraph position="0"> The SUMTIME project (Reiter et al., 2005) developed an NLG system which generated textual weather forecasts from numerical forecast data.</Paragraph>
      <Paragraph position="1"> The SUMTIME system generates specialist forecasts for offshore oil rigs. It has two modules: a content-determination module that determines the content of the weather forecast by analysing the numerical data using linear segmentation and  other data analysis techniques; and a microplanning and realisation module which generates texts based on this content by choosing appropriate words, deciding on aggregation, enforcing the sublanguage grammar, and so forth. SUMTIME generates very high-quality texts, in some cases forecast users believe SUMTIME texts are better than human-written texts (Reiter et al., 2005).</Paragraph>
      <Paragraph position="2"> SUMTIME is a knowledge-based NLG system.</Paragraph>
      <Paragraph position="3"> While its design was informed by corpus analysis (Reiter et al., 2003), the system is based on manually authored rules and code.</Paragraph>
      <Paragraph position="4"> As part of the project, the SUMTIME team created a corpus of 1045 forecasts from the commercial output of five different forecasters and the input data (numerical predictions of wind, temperature, etc) that the forecasters examined when they wrote the forecasts (Sripada et al., 2003). In other words, the SUMTIME corpus contains both the inputs (numerical weather predictions) and the outputs (forecast texts) ofthe forecast-generation process. The SUMTIME team also derived a content representation (called 'tuples') from the corpus texts similar to that produced by SUMTIME's content-determination module. The SUMTIME microplanner/realiser can be driven by these tuples; this mode (combining human content determination with SUMTIME microplanning and realisation) is called SUMTIME-Hybrid. Table 1 includes an example of the tuples extracted from the corpus text (row 1), and a SUMTIME-Hybrid text produced from the tuples (row 5).</Paragraph>
      <Paragraph position="5"> 2.4 pCRU language generation Statistical NLG has focused on generate-and-select models: a set of alternatives is generated and one is selected with a language model. This technique is computationally very expensive. Moreover, the only type of language model used in NLG are n-gram models which have the additional disadvantage of a general preference for shorter realisations, which can be harmful in NLG (Belz, 2005). pCRU1 language generation (Belz, 2006) is a language generation framework that was designed to facilitate statistical generation techniques that are more efficient and less biased. In pCRU generation, a base generator is encoded as a set of generation rules made up of relations with zero or more atomic arguments. The base generator</Paragraph>
    </Section>
    <Section position="4" start_page="314" end_page="314" type="sub_section">
      <SectionTitle>
1Probabilistic Context-free Representational Underspeci-
</SectionTitle>
      <Paragraph position="0"> fication.</Paragraph>
      <Paragraph position="1"> is then trained on raw text corpora to provide a probability distribution over generation rules. The resulting PCRU generator can be run in several modes, including the following: Random: ignoring pCRU probabilities, randomly select generation rules.</Paragraph>
      <Paragraph position="2"> N-gram: ignoring pCRU probabilities, generate set of alternatives and select the most likely according to a given n-gram language model.</Paragraph>
      <Paragraph position="3"> Greedy: select the most likely among each set of candidate generation rules.</Paragraph>
      <Paragraph position="4"> Greedy roulette: select rules with likelihood proportional to their pCRU probability.</Paragraph>
      <Paragraph position="5"> The greedy modes are deterministic and therefore considerably cheaper in computational terms than the equivalent n-gram method (Belz, 2005).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML