<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1040">
  <Title>Comparing Automatic and Human Evaluation of NLG Systems</Title>
  <Section position="8" start_page="318" end_page="319" type="concl">
    <SectionTitle>
7 Conclusions
</SectionTitle>
    <Paragraph position="0"> Corpus quality plays a significant role in automatic evaluation of NLG texts. Automatic metrics can be expected to correlate very highly with human judgments only if the reference texts used are of high quality, or rather, can be expected to be judged high quality by the human evaluators. This is especially important when the generated texts are of similar quality to human-written texts.</Paragraph>
    <Paragraph position="1"> In MT, high-quality texts vary less than generally in NLG, so BLEU scores against 4 reference translations from reputable sources (as in MT '05) are a feasible evaluation regime. It seems likely thatforautomatic evaluation in NLG, alarger number of reference texts than four are needed.</Paragraph>
    <Paragraph position="2"> In our experiments, we have found NIST a more reliable evaluation metric than BLEU and in particular ROUGE which did not seem to offer any advantage over simple string-edit distance. We also found individual experts' judgments are not likely to correlate highly with average expert opinion, in fact less likely than NIST scores. This seems to imply that if expert evaluation can only be done with one or two experts, but a high-quality reference corpus is available, then a NIST-based evaluation may produce more accurate results than an expert-based evaluation.</Paragraph>
    <Paragraph position="3"> It seems clear that for automatic corpus-based evaluation to work well, we need high-quality reference texts written by many different authors and large enough to give reasonable coverage of phenomena such as variation for variation's sake.</Paragraph>
    <Paragraph position="4"> Metrics that do not exclusively reward similarity with reference texts (such as NIST) are more likely to correlate well with human judges, but all of the existing metrics that we looked at still penalised generators that do not always choose the most frequent variant.</Paragraph>
    <Paragraph position="5"> The results we have reported here are for a relatively simple sublanguage and domain, and more empirical research needs to be done on how well different evaluation metrics and methodologies (including different types of human evaluations) correlate with each other. In order to establish reliable and trusted automatic cross-system  evaluation methodologies, it seems likely that the NLG community will need to establish how to collect large amounts of high-quality reference texts and develop new evaluation metrics specifically for NLG that correlate more reliably with human judgments of text quality and appropriateness. Ultimately, research should also look at developing new evaluation techniques that correlate reliably with the real world usefulness of generated texts.</Paragraph>
    <Paragraph position="6"> In the shorter term, we recommend that automatic evaluations of NLG systems be supported by conventional large-scale human-based evaluations.</Paragraph>
  </Section>
class="xml-element"></Paper>