File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/p06-2070_intro.xml
Size: 3,456 bytes
Last Modified: 2025-10-06 14:03:42
<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2070">
<Title>Stochastic Iterative Alignment for Machine Translation Evaluation</Title>
<Section position="4" start_page="539" end_page="540" type="intro">
<SectionTitle> 2 Recap of BLEU, ROUGE-W and METEOR </SectionTitle>
<Paragraph position="0"> The most commonly used automatic evaluation metrics, BLEU (Papineni et al., 2002) and NIST (Doddington, 2002), are based on the assumption that the closer a machine translation is to a professional human translation, the better it is (Papineni et al., 2002). For every hypothesis, BLEU computes the fraction of n-grams that also appear in the reference sentences, together with a brevity penalty. NIST uses a similar strategy to BLEU but further considers that n-grams with different frequencies should be treated differently in the evaluation (Doddington, 2002). BLEU and NIST have been shown to correlate closely with human judgments in ranking MT systems of different quality (Papineni et al., 2002; Doddington, 2002).

[Figure 1: mt1: Life is like one nice chocolate in box; ref: Life is just like a box of tasty chocolate; mt2: Life is of one nice chocolate in box]

ROUGE-W is based on the weighted longest common subsequence (LCS) between the MT output and the reference. The common subsequences in ROUGE-W are not necessarily strict n-grams, and gaps are allowed in both the MT output and the reference. Because of this flexibility, long common subsequences are feasible in ROUGE-W and can help to reflect the sentence-wide similarity of the MT output and the references. ROUGE-W uses a weighting strategy in which an LCS containing strict n-grams is favored. Figure 1 gives two examples that show how ROUGE-W searches for the LCS. For mt1, ROUGE-W will choose either "life is like chocolate" or "life is like box" as the LCS, since neither of the sequences "like box" and "like chocolate" is a strict n-gram and thus they make no difference in ROUGE-W (the only strict n-gram in the two candidate LCSs is "life is"). For mt2, there is only one choice of LCS: "life is of chocolate". The LCSs of mt1 and mt2 have the same length and the same number of strict n-grams, so they get the same score in ROUGE-W. But it is clear to us that mt1 is better than mt2. It is easy to verify that mt1 and mt2 share the same number of common 1-grams, 2-grams, and skipped 2-grams with the reference (they have no common n-grams longer than 2 words), so BLEU and ROUGE-S are also unable to differentiate them.</Paragraph>
<Paragraph position="1"> METEOR is a metric sitting in the middle between the n-gram based metrics and the loose sequence based metrics. It has several phases, and in each phase a different matching technique (EXACT, PORTER-STEM, WORD-NET) is used to build an alignment between the MT output and the reference. METEOR does not require the alignment to be monotonic, which means crossing word mappings (e.g., "a b" mapped to "b a") are allowed, though they incur a penalty. Figure 2 shows the METEOR alignments for the same example as ROUGE. Though the two alignments have the same number of word mappings, mt2 has more crossed word mappings than mt1 and thus receives less credit in METEOR. Both ROUGE and METEOR normalize their evaluation result by the MT output length (precision) and the reference length (recall), and the final score is computed as the F-mean of the two.</Paragraph>
</Section>
</Paper>
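The clipped n-gram matching and brevity penalty behind BLEU can be made concrete with a small sketch. The Python function below is not the authors' or the official BLEU implementation; it is a minimal illustration, assuming whitespace tokenization, no smoothing, and the standard clipped counts and brevity penalty described in Papineni et al. (2002).

from collections import Counter
import math

def bleu(hypothesis, references, max_n=4):
    """Toy BLEU: geometric mean of modified n-gram precisions times a
    brevity penalty.  Whitespace tokenization, no smoothing."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        # clip each hypothesis n-gram count by its maximum count in any reference
        max_ref = Counter()
        for ref in refs:
            ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            for g, c in ref_ngrams.items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        log_prec_sum += math.log(max(clipped, 1e-9) / total)
    # brevity penalty: hypotheses shorter than the closest reference are penalized
    ref_len = min((len(r) for r in refs), key=lambda rl: abs(rl - len(hyp)))
    bp = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / max(len(hyp), 1))
    return bp * math.exp(log_prec_sum / max_n)

Because only clipped counts enter the score, mt1 and mt2 from Figure 1, which share the same 1-gram and 2-gram matches with the reference, receive the same BLEU score, as the paragraph above notes.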
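The weighted LCS used by ROUGE-W can be sketched in the same way. This is an illustrative reimplementation of the published WLCS recurrence (Lin, 2004), assuming the weighting function f(k) = k**alpha with alpha = 1.2; the parameters and tokenization used in the actual ROUGE-W tool may differ.

def wlcs(x, y, alpha=1.2):
    """Weighted LCS: consecutive matches are rewarded via f(k) = k**alpha,
    so runs that form strict n-grams score higher than the same number of
    scattered matches."""
    f = lambda k: k ** alpha
    m, n = len(x), len(y)
    c = [[0.0] * (n + 1) for _ in range(m + 1)]   # accumulated weighted score
    w = [[0] * (n + 1) for _ in range(m + 1)]     # length of consecutive run ending here
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
                w[i][j] = 0
    return c[m][n]

def rouge_w(hypothesis, reference, alpha=1.2, beta=1.0):
    """F-mean of WLCS-based precision and recall, mapped back through the
    inverse weighting function f^{-1}(x) = x**(1/alpha)."""
    hyp, ref = hypothesis.split(), reference.split()
    score = wlcs(hyp, ref, alpha)
    r = (score / len(ref) ** alpha) ** (1 / alpha)
    p = (score / len(hyp) ** alpha) ** (1 / alpha)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

The weighting only distinguishes how consecutive the matched words are; as the example above shows, it does not by itself separate mt1 from mt2, since their best weighted LCSs have the same length and the same strict n-gram content.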
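Finally, a toy, exact-match-only version of METEOR scoring shows how the alignment, the recall-weighted F-mean, and the fragmentation penalty fit together. The 9:1 recall weighting and the 0.5 * (chunks/matches)**3 penalty follow the published METEOR description; the greedy leftmost alignment here is a simplifying assumption (real METEOR also uses PORTER-STEM and WORD-NET stages and searches for the alignment with the fewest crossings/chunks).

def meteor_exact(hypothesis, reference):
    """Toy METEOR with exact matching only: align unigrams, combine unigram
    precision and recall into a recall-weighted F-mean, then discount by a
    fragmentation penalty that grows as the matches break into more chunks."""
    hyp, ref = hypothesis.split(), reference.split()
    # greedy alignment: each hypothesis token maps to the leftmost unused
    # identical reference token (a simplification of METEOR's search)
    used = [False] * len(ref)
    align = []                                   # (hyp index, ref index) pairs
    for i, tok in enumerate(hyp):
        for j, rtok in enumerate(ref):
            if not used[j] and tok == rtok:
                used[j] = True
                align.append((i, j))
                break
    m = len(align)
    if m == 0:
        return 0.0
    p, r = m / len(hyp), m / len(ref)
    fmean = 10 * p * r / (r + 9 * p)             # recall weighted 9:1
    # chunks: maximal runs of matches that are adjacent in both strings
    chunks = 1
    for (i0, j0), (i1, j1) in zip(align, align[1:]):
        if not (i1 == i0 + 1 and j1 == j0 + 1):
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return fmean * (1 - penalty)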