<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1077"> <Title>Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics</Title> <Section position="3" start_page="0" end_page="2" type="intro"> <SectionTitle> 2 BLEU and N-gram Co-Occurrence </SectionTitle>
<Paragraph position="0"> To evaluate machine translations automatically, the machine translation community recently adopted an n-gram co-occurrence scoring procedure, BLEU (Papineni et al. 2001). In two recent large-scale machine translation evaluations sponsored by NIST, a closely related automatic evaluation method, simply called the NIST score, was used.</Paragraph>
<Paragraph position="1"> The NIST (NIST 2002) scoring method is based on BLEU.</Paragraph>
<Paragraph position="2"> The main idea of BLEU is to measure the similarity between a candidate translation and a set of reference translations with a numerical metric.</Paragraph>
<Paragraph position="3"> Papineni et al. used a weighted average of variable-length n-gram matches between system translations and a set of human reference translations, and showed that this weighted average metric correlates highly with human assessments.</Paragraph>
<Paragraph position="4"> BLEU measures how well a machine translation overlaps with multiple human translations using n-gram co-occurrence statistics. N-gram precision in BLEU is computed as follows:
$$p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in C} Count_{clip}(n\text{-gram})}{\sum_{C' \in \{Candidates\}} \sum_{n\text{-gram} \in C'} Count(n\text{-gram})}$$
where Count_clip(n-gram) is the maximum number of n-grams co-occurring in a candidate translation and a reference translation, and Count(n-gram) is the number of n-grams in the candidate translation. To prevent very short translations from maximizing their precision scores, BLEU adds a brevity penalty, BP, to the formula:
$$BP = \begin{cases} 1 & \text{if } |c| > |r| \\ e^{1-|r|/|c|} & \text{if } |c| \le |r| \end{cases}$$
where |c| is the length of the candidate translation and |r| is the length of the reference translation. The BLEU formula is then written as follows:</Paragraph>
<Paragraph position="5"> $$BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$ </Paragraph>
<Paragraph position="6"> The weighting factor, w_n, is set at 1/N. Although BLEU has been shown to correlate well with human assessments, it has a few aspects that can be improved. First, the subjective application of the brevity penalty can be replaced with a recall-related parameter that is sensitive to reference length. Although the brevity penalty penalizes candidate translations with low recall by a factor of $e^{1-|r|/|c|}$, it would be preferable to use the traditional recall measure that is well known in NLP, as suggested by Melamed (2003). Of course, we have to make sure the resulting composite function of precision and recall still correlates highly with human judgments.</Paragraph>
<Paragraph position="9"> Second, although BLEU uses higher-order n-gram (n > 1) matches to favor candidate sentences with consecutive word matches and to estimate their fluency, it does not consider sentence-level structure. For example, given the following sentences:
S1. police killed the gunman
S2. police kill the gunman
S3. the gunman kill police
For the purpose of explanation, we only consider BLEU with unigrams and bigrams, i.e. N = 2, and call this BLEU-2. Using S1 as the reference and S2 and S3 as the candidate translations, S2 and S3 would have the same BLEU-2 score, since they both have one bigram and three unigram matches.</Paragraph>
<Paragraph position="10"> However, S2 and S3 have very different meanings. Third, BLEU is a geometric mean of unigram to N-gram precisions. Any candidate translation without an N-gram match has a per-sentence BLEU score of zero.
Although BLEU is usually calculated over the whole test corpus, it is still desirable to have a measure that works reliably at the sentence level for diagnostic and introspection purposes. To address these issues, we propose three new automatic evaluation measures based on longest common subsequence statistics and skip-bigram co-occurrence statistics in the following sections.</Paragraph> </Section> </Paper>
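
As an illustration of the scoring described above, the following is a minimal Python sketch of sentence-level BLEU-2: a single reference, uniform weights w_n = 1/N, clipped n-gram precision, and the brevity penalty. The helper names (ngrams, modified_precision, bleu) are ours for illustration, not taken from any particular toolkit. Run on S1-S3 above, it reproduces the tie noted in the text: S2 and S3 receive the same BLEU-2 score (0.5 under this formulation) despite their very different meanings.

    from collections import Counter
    from math import exp, log

    def ngrams(tokens, n):
        # Count the n-grams of length n in a token list.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def modified_precision(candidate, reference, n):
        # Clipped n-gram precision: each candidate n-gram count is capped by
        # its count in the reference translation.
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        return clipped / total if total else 0.0

    def bleu(candidate, reference, max_n=2):
        # Sentence-level BLEU: brevity penalty times the geometric mean of
        # p_1..p_N with uniform weights 1/N.
        c, r = len(candidate), len(reference)
        bp = 1.0 if c > r else exp(1 - r / c)
        precisions = [modified_precision(candidate, reference, n)
                      for n in range(1, max_n + 1)]
        if any(p == 0.0 for p in precisions):
            return 0.0  # a missing n-gram order zeroes out the per-sentence score
        return bp * exp(sum(log(p) / max_n for p in precisions))

    reference = "police killed the gunman".split()            # S1
    print(bleu("police kill the gunman".split(), reference))  # S2
    print(bleu("the gunman kill police".split(), reference))  # S3 (same score as S2)

A corpus-level implementation would instead pool the clipped counts over all sentence pairs before computing the precisions, which is why the zero-precision problem is far less visible when BLEU is reported over a whole test corpus.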