<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3250"> <Title>Statistical Significance Tests for Machine Translation Evaluation</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Statistical Machine Translation </SectionTitle> <Paragraph position="0"> Statistical machine translation was introduced by work at IBM [Brown et al., 1990, 1993]. Currently, the most successful such systems employ so-called phrase-based methods, which translate input text by translating sequences of words at a time [Och, 2002; Zens et al., 2002; Koehn et al., 2003; Vogel et al., 2003; Tillmann, 2003]. Phrase-based machine translation systems make use of a language model trained for the target language and a translation model trained from a parallel corpus. The translation model is typically broken down into several components, e.g., a reordering model, a phrase translation model, and a word translation model.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Automatic Evaluation </SectionTitle> <Paragraph position="0"> Adequately evaluating the quality of any translation is difficult, since it is not entirely clear what the focus of the evaluation should be. Surely, a good translation has to adequately capture the meaning of the foreign original. However, pinning down all the nuances is hard, and differences in emphasis are often introduced by the interpretation of the translator. At the same time, it is desirable to have fluent output that can be read easily. These two goals, adequacy and fluency, are the main criteria in machine translation evaluation.</Paragraph> [Table 1: ... measured by the BLEU score and n-gram precision] <Paragraph position="1"> Human judges may be asked to evaluate the adequacy and fluency of translation output, but this is a laborious and expensive task. Papineni et al. [2002] addressed the evaluation problem by introducing an automatic scoring metric, called BLEU, which allowed the automatic calculation of translation quality. The system output is compared against a reference translation of the same source text.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 BLEU: A Closer Look </SectionTitle> <Paragraph position="0"> Formally, the BLEU metric is computed as follows. Given the precision $p_n$ of n-grams of size up to $N$ (usually $N = 4$), the length of the test set in words ($c$) and the length of the reference translation in words ($r$),
$$ \mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big( \frac{1}{N} \sum_{n=1}^{N} \log p_n \Big), \qquad \mathrm{BP} = \min\big(1, e^{1 - r/c}\big). $$
</Paragraph> <Paragraph position="1"> The effectiveness of the BLEU metric has been demonstrated by showing that it correlates with human judgment.</Paragraph>
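<Paragraph position="2"> As a concrete illustration, the following minimal Python sketch implements the formula above, assuming the brevity penalty takes the form BP = min(1, e^{1-r/c}); the function name and the sample numbers are illustrative assumptions, not values from the paper. Note how a single n-gram precision of zero drives the whole score to zero, which is one reason BLEU is not meaningful on individual sentences.
```python
import math

def bleu(precisions, c, r):
    """Sketch of the BLEU formula: the geometric mean of the n-gram
    precisions p_1..p_N, scaled by the brevity penalty
    BP = min(1, e^(1 - r/c)), where c is the length of the system
    output (test set) in words and r the reference length in words."""
    n = len(precisions)                               # N, usually 4
    if any(p <= 0.0 for p in precisions):
        return 0.0                                    # log(0) is undefined: the score collapses to 0
    brevity_penalty = min(1.0, math.exp(1.0 - r / c))
    mean_log_precision = sum(math.log(p) for p in precisions) / n
    return brevity_penalty * math.exp(mean_log_precision)

# Hypothetical numbers in the spirit of Table 1: unigram precision near 60%,
# 4-gram precision around 15%, output slightly shorter than the reference.
print(bleu([0.60, 0.35, 0.22, 0.147], c=980, r=1000))  # roughly 0.28
```
</Paragraph>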
<Paragraph position="3"> Let us highlight two properties of the BLEU metric: the reliance on higher n-grams and the brevity penalty BP. First, consider Table 1. Six different systems are compared there (we will return to the nature of these systems later). While the unigram precision of the three systems hovers around 60%, the differences in 4-gram precision are much larger: the Finnish system's 4-gram precision (7.8%) is only roughly half that of the Spanish system (14.7%). This accounts for the relatively large gap in overall BLEU (28.9% vs. 20.2%). Higher n-grams (and we could go beyond 4) not only measure the syntactic cohesion and semantic adequacy of the output, but also give the metric its discriminatory power.</Paragraph> <Paragraph position="4"> The other property worth noting is the strong influence of the brevity penalty. Since BLEU is a precision-based method, the brevity penalty assures that a system cannot score well simply by producing overly short output consisting only of words it is confident about. It has become common practice to include a word penalty component in statistical machine translation systems that biases the output toward being either longer or shorter. This is especially relevant for the BLEU score, which harshly penalizes translation output that is too short. To illustrate this point, see Figure 1: the BLEU scores for both the Spanish and the Portuguese system drop off when a large word penalty is introduced into the translation model, forcing shorter output. This is not the case for GTM, a similar n-gram precision/recall metric proposed by Melamed et al. [2003] that has no explicit brevity penalty.</Paragraph> <Paragraph position="5"> The BLEU metric also works with multiple reference translations. However, we often do not have the luxury of multiple translations of the same source material. Fortunately, it has not been shown so far that having only a single reference translation causes serious problems.</Paragraph> <Paragraph position="6"> While BLEU has become the most popular metric for machine translation evaluation, some of its shortcomings have become apparent. It does not work on single sentences, since 4-gram precision is often 0. It is also hard to interpret: what a BLEU score of 28.9% means is not intuitive and depends, e.g., on the number of reference translations used.</Paragraph> <Paragraph position="7"> Some researchers have recently used relative human BLEU scores, comparing machine BLEU scores against the scores of high-quality human translations. However, the resulting numbers are unrealistically high.</Paragraph> </Section> </Section> </Paper>