File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/e06-1032_metho.xml
Size: 22,020 bytes
Last Modified: 2025-10-06 14:10:04
<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1032"> <Title>Re-evaluating the Role of BLEU in Machine Translation Research</Title> <Section position="3" start_page="0" end_page="250" type="metho"> <SectionTitle> 2 BLEU Detailed </SectionTitle> <Paragraph position="0"> The rationale behind the development of Bleu (Papinenietal., 2002)isthathumanevaluationofmachine translation can be time consuming and expensive. An automatic evaluation metric, on the other hand, can be used for frequent tasks like monitoringincrementalsystemchangesduringdevelopment, which are seemingly infeasible in a manual evaluation setting.</Paragraph> <Paragraph position="1"> The way that Bleu and other automatic evaluation metrics work is to compare the output of a machine translation system against reference human translations. Machine translation evaluation metrics differ from other metrics that use a reference, like the word error rate metric that is used Orejuela appeared calm as he was led to the American plane which will take him to Miami, Florida.</Paragraph> <Paragraph position="2"> Orejuela appeared calm while being escorted to the plane that would take him to Miami, Florida.</Paragraph> <Paragraph position="3"> Orejuela appeared calm as he was being led to the American plane that was to carry him to Miami in Florida.</Paragraph> <Paragraph position="4"> Orejuela seemed quite calm as he was being led to the American plane that would take him to Miami in Florida.</Paragraph> <Paragraph position="5"> Appeared calm when he was taken to the American plane, which will to Miami, Florida.</Paragraph> <Paragraph position="6"> a hypothesis translation from the 2005 NIST MT</Paragraph> <Section position="1" start_page="249" end_page="250" type="sub_section"> <SectionTitle> Evaluation </SectionTitle> <Paragraph position="0"> in speech recognition, because translations have a degree of variation in terms of word choice and in terms of variant ordering of some phrases.</Paragraph> <Paragraph position="1"> Bleu attempts to capture allowable variation in word choice through the use of multiple reference translations (as proposed in Thompson (1991)).</Paragraph> <Paragraph position="2"> In order to overcome the problem of variation in phrase order, Bleu uses modified n-gram precision instead of WER's more strict string edit distance.</Paragraph> <Paragraph position="3"> Bleu's n-gram precision is modified to eliminate repetitions that occur across sentences. For example, even though the bigram &quot;to Miami&quot; is repeated across all four reference translations in Table 1, it is counted only once in a hypothesis translation. Table 2 shows the n-gram sets created from the reference translations.</Paragraph> <Paragraph position="4"> Papineni et al. (2002) calculate their modified precision score, pn, for each n-gram length by summing over the matches for every hypothesis sentence S in the complete corpus C as:</Paragraph> <Paragraph position="6"> Counting punctuation marks as separate tokens, the hypothesis translation given in Table 1 has 15 unigram matches, 10 bigram matches, 5 trigram matches (these are shown in bold in Table 2), and three 4-gram matches (not shown). The hypothesis translation contains a total of 18 unigrams, 17 bigrams, 16 trigrams, and 15 4-grams. 
<Paragraph position="7"> Table 2: The n-grams extracted from the reference translations (4-grams not shown); in the original table, the n-grams matched by the hypothesis translation are marked in bold. 1-grams: American, Florida, Miami, Orejuela, appeared, as, being, calm, carry, escorted, he, him, in, led, plane, quite, seemed, take, that, the, to, to, to, was, was, which, while, will, would, ,, .</Paragraph>
<Paragraph position="8"> 2-grams: American plane, Florida ., Miami ,, Miami in, Orejuela appeared, Orejuela seemed, appeared calm, as he, being escorted, being led, calm as, calm while, carry him, escorted to, he was, him to, in Florida, led to, plane that, plane which, quite calm, seemed quite, take him, that was, that would, the American, the plane, to Miami, to carry, to the, was being, was led, was to, which will, while being, will take, would take, , Florida</Paragraph>
<Paragraph position="9"> 3-grams: American plane that, American plane which, Miami , Florida, Miami in Florida, Orejuela appeared calm, Orejuela seemed quite, appeared calm as, appeared calm while, as he was, being escorted to, being led to, calm as he, calm while being, carry him to, escorted to the, he was being, he was led, him to Miami, in Florida ., led to the, plane that was, plane that would, plane which will, quite calm as, seemed quite calm, take him to, that was to, that would take, the American plane, the plane that, to Miami ,, to Miami in, to carry him, to the American, to the plane, was being led, was led to, was to carry, which will take, while being escorted, will take him, would take him, , Florida .</Paragraph>
<Paragraph position="10"> If the complete corpus consisted of this single sentence, then the modified precisions would be p1 = .83, p2 = .59, p3 = .31, and p4 = .2. Each pn is combined and can be weighted by specifying a weight wn. In practice each pn is generally assigned an equal weight.</Paragraph>
<Paragraph position="11"> Because Bleu is precision based, and because recall is difficult to formulate over multiple reference translations, a brevity penalty is introduced to compensate for the possibility of proposing high-precision hypothesis translations which are too short. The brevity penalty is calculated as: BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \leq r \end{cases}</Paragraph>
<Paragraph position="12"> where c is the length of the corpus of hypothesis translations, and r is the effective reference corpus length. (Footnote 1: The effective reference corpus length is calculated as the sum of the lengths of the single reference translation from each set that is closest in length to the corresponding hypothesis translation.) Thus, the Bleu score is calculated as: Bleu = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)</Paragraph>
<Paragraph position="13"> A Bleu score can range from 0 to 1, where higher scores indicate closer matches to the reference translations, and where a score of 1 is assigned to a hypothesis translation which exactly matches one of the reference translations. A score of 1 is also assigned to a hypothesis translation which has matches for all its n-grams (up to the maximum n measured by Bleu) in the clipped reference n-grams, and which has no brevity penalty.</Paragraph>
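To make the score combination concrete, here is a minimal sketch that combines the modified precisions worked out above with the brevity penalty. It is an illustration of the formulas in this section, not a full BLEU implementation, and the function name and the zero-precision guard are choices made here rather than anything specified in the paper.

```python
import math

def bleu_from_precisions(p_ns, c, r, weights=None):
    """Combine modified n-gram precisions with the brevity penalty.

    p_ns    -- modified precisions p1 .. pN
    c       -- length of the hypothesis corpus
    r       -- effective reference corpus length
    weights -- wn values; uniform weights by default, the usual choice
    """
    if weights is None:
        weights = [1.0 / len(p_ns)] * len(p_ns)
    bp = 1.0 if c > r else math.exp(1.0 - float(r) / c)
    # log(0) is undefined; a zero precision simply drives the score to zero.
    if any(p == 0 for p in p_ns):
        return 0.0
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, p_ns)))

# The single-sentence example from Tables 1 and 2: p1-p4 as computed in the text,
# an 18-token hypothesis, and an effective reference length of 18 (the reference
# closest in length to the hypothesis also has 18 tokens under this tokenization).
print(bleu_from_precisions([0.83, 0.59, 0.31, 0.20], c=18, r=18))  # roughly 0.42
```

Note that real Bleu is a corpus-level statistic: the clipped counts and lengths are accumulated over all segments of the test set before the precisions and brevity penalty are computed.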
<Paragraph position="14"> The primary reason that Bleu is viewed as a useful stand-in for manual evaluation is that it has been shown to correlate with human judgments of translation quality. Papineni et al. (2002) showed that Bleu correlated with human judgments in its rankings of five Chinese-to-English machine translation systems, and in its ability to distinguish between human and machine translations. Bleu's correlation with human judgments has been further tested in the annual NIST Machine Translation Evaluation exercise, wherein Bleu's rankings of Arabic-to-English and Chinese-to-English systems are verified by manual evaluation.</Paragraph>
<Paragraph position="15"> In the next section we discuss theoretical reasons why Bleu may not always correlate with human judgments.</Paragraph>
</Section>
<Section position="4" start_page="250" end_page="251" type="metho"> <SectionTitle> 3 Variations Allowed By BLEU </SectionTitle>
<Paragraph position="0"> While Bleu attempts to capture allowable variation in translation, it goes much further than it should. In order to allow some amount of variant order in phrases, Bleu places no explicit constraints on the order in which matching n-grams may occur. To allow variation in word choice in translation, Bleu uses multiple reference translations, but puts very few constraints on how n-gram matches can be drawn from the multiple reference translations. Because Bleu is underconstrained in these ways, it allows a tremendous amount of variation - far beyond what could reasonably be considered acceptable variation in translation.</Paragraph>
<Paragraph position="1"> In this section we examine various permutations and substitutions allowed by Bleu. We show that for an average hypothesis translation there are millions of possible variants that would each receive a similar Bleu score. We argue that because the number of translations that score the same is so large, it is unlikely that all of them will be judged to be identical in quality by human annotators. This means that it is possible for two items to receive identical Bleu scores even though human judges rate one as clearly worse. It is therefore also possible to obtain a higher Bleu score without any genuine improvement in translation quality. In Sections 3.1 and 3.2 we examine ways of synthetically producing such variant translations.</Paragraph>
<Section position="1" start_page="250" end_page="251" type="sub_section"> <SectionTitle> 3.1 Permuting phrases </SectionTitle>
<Paragraph position="0"> One way in which variation can be introduced is by permuting phrases within a hypothesis translation. A simple way of estimating a lower bound on the number of ways that phrases in a hypothesis translation can be reordered is to examine bigram mismatches. Phrases that are bracketed by these bigram mismatch sites can be freely permuted, because reordering a hypothesis translation at these points will not reduce the number of matching n-grams and thus will not reduce the overall Bleu score.</Paragraph>
<Paragraph position="1"> Here we denote bigram mismatches for the hypothesis translation given in Table 1 with vertical bars: Appeared calm | when | he was | taken | to the American plane | , | which will | to Miami , Florida .</Paragraph>
<Paragraph position="2"> We can randomly produce other hypothesis translations that have the same Bleu score but are radically different from each other. Because Bleu only takes order into account through rewarding matches of higher order n-grams, a hypothesis sentence may be freely permuted around these bigram mismatch sites without reducing the Bleu score. Thus: which will | he was | , | when | taken | Appeared calm | to the American plane | to Miami , Florida . receives an identical score to the hypothesis translation in Table 1.</Paragraph>
<Paragraph position="3"> If b is the number of bigram matches in a hypothesis translation, and k is its length, then there are (k - b)! possible ways to generate similarly scored items using only the words in the hypothesis translation. Thus for the example hypothesis translation there are at least 40,320 different ways of permuting the sentence and receiving a similar Bleu score. The number of permutations varies with respect to sentence length and number of bigram mismatches. Therefore as a hypothesis translation approaches being an identical match to one of the reference translations, the amount of variance decreases significantly.</Paragraph>
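A lower bound of this sort can be computed directly. The sketch below is an illustration, not code from the paper: it reuses the lower-cased, punctuation-split tokenization assumed in the earlier sketches, splits the Table 1 hypothesis at bigram-mismatch sites, and counts the factorial number of reorderings that leave every n-gram match intact.

```python
import math

def permutable_segments(hypothesis, references):
    """Split the hypothesis wherever the adjacent bigram occurs in no reference;
    the resulting segments can be reordered without changing Bleu's n-gram matches."""
    ref_bigrams = set()
    for ref in references:
        ref_bigrams.update(zip(ref, ref[1:]))
    segments, current = [], [hypothesis[0]]
    for prev, word in zip(hypothesis, hypothesis[1:]):
        if (prev, word) in ref_bigrams:
            current.append(word)
        else:
            segments.append(current)
            current = [word]
    segments.append(current)
    return segments

references = [r.split() for r in [
    "orejuela appeared calm as he was led to the american plane which will take him to miami , florida .",
    "orejuela appeared calm while being escorted to the plane that would take him to miami , florida .",
    "orejuela appeared calm as he was being led to the american plane that was to carry him to miami in florida .",
    "orejuela seemed quite calm as he was being led to the american plane that would take him to miami in florida .",
]]
hypothesis = "appeared calm when he was taken to the american plane , which will to miami , florida .".split()

segments = permutable_segments(hypothesis, references)
print(" | ".join(" ".join(s) for s in segments))  # the 8 segments marked in the text
print(math.factorial(len(segments)))              # 8! = 40,320 similarly scored reorderings
```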
<Paragraph position="4"> So, as translations improve, spurious variation goes down. However, at today's levels the amount of variation that Bleu admits is unacceptably high. Figure 1 gives a scatterplot of each of the hypothesis translations produced by the second best Bleu system from the 2005 NIST MT Evaluation. The number of possible permutations for some translations is greater than 10^73.</Paragraph>
<Paragraph position="5"> Figure 1: A hypothesis translation's Bleu score plotted against its number of possible permutations due to bigram mismatches, for an entry in the 2005 NIST MT Eval.</Paragraph>
</Section>
<Section position="2" start_page="251" end_page="254" type="sub_section"> <SectionTitle> 3.2 Drawing different items from the reference set </SectionTitle>
<Paragraph position="0"> In addition to the factorial number of ways that similarly scored Bleu items can be generated by permuting phrases around bigram mismatch points, additional variation may be synthesized by drawing different items from the reference n-grams. For example, since the hypothesis translation from Table 1 has a length of 18 with 15 unigram matches, 10 bigram matches, 5 trigram matches, and three 4-gram matches, we can artificially construct an identically scored hypothesis by drawing an identical number of matching n-grams from the reference translations. Therefore the far less plausible: was being led to the | calm as he was | would take | carry him | seemed quite | when | taken would receive the same Bleu score as the hypothesis translation from Table 1, even though human judges would assign it a much lower score.</Paragraph>
<Paragraph position="1"> This problem is made worse by the fact that Bleu equally weights all items in the reference sentences (Babych and Hartley, 2004). Therefore omitting content-bearing lexical items does not carry a greater penalty than omitting function words.</Paragraph>
<Paragraph position="2"> The problem is further exacerbated by Bleu not having any facilities for matching synonyms or lexical variants. Therefore words in the hypothesis that did not appear in the references (such as when and taken in the hypothesis from Table 1) can be substituted with arbitrary words, because they do not contribute towards the Bleu score. Under Bleu, we could just as validly use the words black and helicopters as we could when and taken.</Paragraph>
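This point is easy to check with an off-the-shelf implementation. The sketch below assumes NLTK is installed (its sentence_bleu follows the modified-precision and brevity-penalty definitions of Section 2; the paper itself does not say which implementation it used) and shows that replacing the two unmatched words with unrelated vocabulary leaves the score unchanged.

```python
from nltk.translate.bleu_score import sentence_bleu

references = [r.split() for r in [
    "orejuela appeared calm as he was led to the american plane which will take him to miami , florida .",
    "orejuela appeared calm while being escorted to the plane that would take him to miami , florida .",
    "orejuela appeared calm as he was being led to the american plane that was to carry him to miami in florida .",
    "orejuela seemed quite calm as he was being led to the american plane that would take him to miami in florida .",
]]

hyp = "appeared calm when he was taken to the american plane , which will to miami , florida .".split()
# Replace the two words that match nothing in the references ("when", "taken")
# with arbitrary vocabulary ("black", "helicopters").
variant = ["black" if w == "when" else "helicopters" if w == "taken" else w for w in hyp]

print(sentence_bleu(references, hyp))      # identical scores: the substituted words
print(sentence_bleu(references, variant))  # never contributed any n-gram match
```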
<Paragraph position="3"> The lack of recall combined with naive token identity means that there can be overlap between similar items in the multiple reference translations. For example, we can produce a translation which contains both the words carry and take even though they arise from the same source word. The chance of problems of this sort being introduced increases as we add more reference translations.</Paragraph>
</Section>
<Section position="3" start_page="251" end_page="254" type="sub_section"> <SectionTitle> 3.3 Implication: BLEU cannot guarantee correlation with human judgments </SectionTitle>
<Paragraph position="0"> Bleu's inability to distinguish between randomly generated variations in translation hints that it may not correlate with human judgments of translation quality in some cases. As the number of identically scored variants goes up, the likelihood that they would all be judged equally plausible goes down. This is a theoretical point, and while the variants are artificially constructed, it does highlight the fact that Bleu is quite a crude measurement of translation quality.</Paragraph>
<Paragraph position="1"> A number of prominent factors contribute to Bleu's crudeness: * Synonyms and paraphrases are only handled if they are in the set of multiple reference translations.</Paragraph>
<Paragraph position="2"> * The scores for words are equally weighted, so missing out on content-bearing material brings no additional penalty.</Paragraph>
<Paragraph position="3"> * The brevity penalty is a stop-gap measure to compensate for the fairly serious problem of not being able to calculate recall.</Paragraph>
<Paragraph position="4"> Each of these failures contributes to an increased amount of inappropriately indistinguishable translations in the analysis presented above.</Paragraph>
<Paragraph position="5"> Given that Bleu can theoretically assign equal scores to translations of obviously different quality, it follows that a higher Bleu score is not necessarily indicative of a genuine improvement in translation quality. This raises the question of whether this is only a theoretical concern or whether Bleu's inadequacies can come into play in practice. In the next section we give two significant examples that show that Bleu can indeed fail to correlate with human judgments in practice.</Paragraph>
</Section> </Section>
<Section position="5" start_page="251" end_page="254" type="metho"> <SectionTitle> 4 Failures in Practice: the 2005 NIST MT Eval, and Systran v. SMT </SectionTitle>
<Paragraph position="0"> The NIST Machine Translation Evaluation exercise has run annually for the past five years as part of DARPA's TIDES program. The quality of Chinese-to-English and Arabic-to-English translation systems is evaluated both by using Bleu score and by conducting a manual evaluation. As such, the NIST MT Eval provides an excellent source of data that allows Bleu's correlation with human judgments to be verified. Last year's evaluation exercise (Lee and Przybocki, 2005) was startling in that Bleu's rankings of the Arabic-English translation systems failed to fully correspond to the manual evaluation. In particular, the entry that was ranked 1st in the human evaluation was ranked 6th by Bleu. In this section we examine Bleu's failure to correctly rank this entry.</Paragraph>
<Paragraph position="1"> The manual evaluation conducted for the NIST MT Eval is done by English speakers without reference to the original Arabic or Chinese documents.</Paragraph>
<Paragraph position="2"> Two judges assigned each sentence in the hypothesis translations a subjective 1-5 score along two axes: adequacy and fluency (LDC, 2005). Table 3 gives the interpretations of the scores. When first evaluating fluency, the judges are shown only the hypothesis translation. They are then shown a reference translation and are asked to judge the adequacy of the hypothesis sentences.</Paragraph>
<Paragraph position="3"> Table 4: Two hypothesis translations with similar Bleu scores but different human scores, and one of four reference translations. Hypothesis from the system ranked 2nd by Bleu (n-gram matches: 27 unigrams, 20 bigrams, 15 trigrams, and ten 4-grams; human scores: adequacy 3,2; fluency 3,2): Iran has already stated that Kharazi's statements to the conference because of the Jordanian King Abdullah II in which he stood accused Iran of interfering in Iraqi affairs. Hypothesis from the entry ranked 6th by Bleu but 1st in the human evaluation: Iran already announced that Kharrazi will not attend the conference because of the statements made by the Jordanian Monarch Abdullah II who has accused Iran of interfering in Iraqi affairs.</Paragraph>
<Paragraph position="4"> Table 4 gives a comparison between the output of the system that was ranked 2nd by Bleu (top) and of the entry that was ranked 6th in Bleu but 1st in the human evaluation (bottom). The example is interesting because the number of matching n-grams for the two hypothesis translations is roughly similar but the human scores are quite different. The first hypothesis is less adequate because it fails to indicate that Kharazi is boycotting the conference, and because it inserts the word stood before accused, which makes Abdullah's actions less clear. The second hypothesis contains all of the information of the reference, but uses some synonyms and paraphrases which would not be picked up on by Bleu: will not attend for would boycott and interfering for meddling.</Paragraph>
<Paragraph position="5"> Figures 2 and 3 plot the average human score for each of the seven NIST entries against its Bleu score. It is notable that one entry received a much higher human score than would be anticipated from its low Bleu score. The offending entry was unusual in that it was not fully automatic machine translation; instead the entry was aided by monolingual English speakers selecting among alternative automatic translations of phrases in the Arabic source sentences and post-editing the result (Callison-Burch, 2005). The remaining six entries were all fully automatic machine translation systems; in fact, they were all phrase-based statistical machine translation systems that had been trained on the same parallel corpus, and most used Bleu-based minimum error rate training (Och, 2003) to optimize the weights of their log linear models' feature functions (Och and Ney, 2002).</Paragraph>
<Paragraph position="6"> Figure 2: Bleu scores plotted against human judgments of adequacy, with R^2 = 0.14 when the outlier entry is included. Figure 3: Bleu scores plotted against human judgments of fluency, with R^2 = 0.002 when the outlier entry is included.</Paragraph>
<Paragraph position="7"> This opens the possibility that in order for Bleu to be valid, only sufficiently similar systems should be compared with one another. For instance, when measuring correlation using Pearson's correlation coefficient we get a very low correlation of R^2 = 0.14 when the outlier in Figure 2 is included, but a strong R^2 = 0.87 when it is excluded. Similarly, Figure 3 goes from R^2 = 0.002 to a much stronger R^2 = 0.742.</Paragraph>
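Computing this kind of correlation is a one-liner. The sketch below uses NumPy's corrcoef to obtain the squared Pearson correlation; the seven score pairs are invented placeholder values for illustration only, not the actual 2005 NIST MT Eval scores, arranged so that the final entry behaves like the human-aided outlier described above.

```python
import numpy as np

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r

# Placeholder numbers purely for illustration -- NOT the scores from the
# 2005 NIST MT Eval. Imagine seven systems' Bleu scores and average adequacy
# scores, where the last entry is an outlier with a low Bleu score but a high
# human score.
bleu_scores = [0.46, 0.45, 0.44, 0.42, 0.41, 0.40, 0.33]
adequacy    = [3.2, 3.1, 3.0, 2.9, 2.8, 2.7, 3.6]

print(r_squared(bleu_scores, adequacy))            # correlation with the outlier included
print(r_squared(bleu_scores[:-1], adequacy[:-1]))  # correlation once the outlier is dropped
```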
<Paragraph position="8"> Systems which explore different areas of translation space may produce output which has differing characteristics, and might end up in different regions of the human score / Bleu score graph. We investigated this by performing a manual evaluation comparing the output of two statistical machine translation systems with a rule-based machine translation system, and seeing whether Bleu correctly ranked the systems. We used Systran for the rule-based system, and used the French-English portion of the Europarl corpus (Koehn, 2005) to train the SMT systems and to evaluate all three systems. We built the first phrase-based SMT system with the complete set of Europarl data (14-15 million words per language), and optimized its feature functions using minimum error rate training in the standard way (Koehn, 2004). We evaluated it and the Systran system with Bleu using a set of 2,000 held-out sentence pairs, using the same normalization and tokenization schemes on both systems' output. We then built a number of SMT systems with various portions of the training corpus, and selected one that was trained with 1/64 of the data, which had a Bleu score that was close to, but still higher than, that of the rule-based system.</Paragraph>
<Paragraph position="9"> We then performed a manual evaluation in which three judges assigned fluency and adequacy ratings to the English translations of 300 French sentences for each of the three systems. These scores are plotted against the systems' Bleu scores in Figure 4. The graph shows that the Bleu score for the rule-based system (Systran) vastly underestimates its actual quality. This serves as another significant counter-example to Bleu's correlation with human judgments of translation quality, and further increases the concern that Bleu may not be appropriate for comparing systems which employ different translation strategies.</Paragraph>
<Paragraph position="10"> Figure 4: Bleu scores plotted against human judgments of fluency and adequacy, showing that Bleu vastly underestimates the quality of a non-statistical system.</Paragraph>
</Section> </Paper>