<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1032">
  <Title>Re-evaluating the Role of BLEU in Machine Translation Research</Title>
  <Section position="7" start_page="254" end_page="255" type="concl">
    <SectionTitle>
6 Conclusions
</SectionTitle>
    <Paragraph position="0"> In this paper we have shown theoretical and practical evidence that Bleu may not correlate with human judgment to the degree that it is currently believed to do. We have shown that Bleu's rather coarse model of allowable variation in translation can mean that an improved Bleu score is not sufficient to reflect a genuine improvement in translation quality. We have further shown that it is not necessary to receive a higher Bleu score in order to be judged to have better translation quality by human subjects, as illustrated in the 2005 NIST Machine Translation Evaluation and our experiment manually evaluating Systran and SMT translations. null What conclusions can we draw from this? Should we give up on using Bleu entirely? We think that the advantages of Bleu are still very strong; automatic evaluation metrics are inexpensive, and do allow many tasks to be performed that would otherwise be impossible. The important thing therefore is to recognize which uses of Bleu are appropriate and which uses are not.</Paragraph>
    <Paragraph position="1"> Appropriate uses for Bleu include tracking broad, incremental changes to a single system, comparing systems which employ similar translation strategies (such as comparing phrase-based statistical machine translation systems with other phrase-based statistical machine translation systems), and using Bleu as an objective function to optimize the values of parameters such as feature weights in log linear translation models, until a better metric has been proposed.</Paragraph>
    <Paragraph position="2"> Inappropriate uses for Bleu include comparing systems which employ radically different strategies(especiallycomparingphrase-basedstatistical null machine translation systems against systems that do not employ similar n-gram-based approaches), trying to detect improvements for aspects of translation that are not modeled well by Bleu, and monitoring improvements that occur infrequently within a test corpus.</Paragraph>
    <Paragraph position="3"> These comments do not apply solely to Bleu.</Paragraph>
    <Paragraph position="4">  Meteor (Banerjee and Lavie, 2005), Precision and Recall (Melamed et al., 2003), and other such automatic metrics may also be affected to a greater or lesser degree because they are all quite rough measures of translation similarity, and have inexact models of allowable variation in translation. Finally, that the fact that Bleu's correlation with human judgments has been drawn into question may warrant a re-examination of past work which failed to show improvements in Bleu. For example, work which failed to detect improvements in translation quality with the integration of word sense disambiguation (Carpuat and Wu, 2005), or work which attempted to integrate syntactic information but which failed to improve Bleu (Charniak et al., 2003; Och et al., 2004) may deserve a second look with a more targeted manual evaluation. null</Paragraph>
  </Section>
class="xml-element"></Paper>