File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/w05-0903_metho.xml
Size: 25,169 bytes
Last Modified: 2025-10-06 14:09:58
<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0903"> <Title>Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 17-24, Ann Arbor, June 2005. c(c)2005 Association for Computational Linguistics Preprocessing and Normalization for Automatic Evaluation of Machine Translation</Title> <Section position="4" start_page="17" end_page="17" type="metho"> <SectionTitle> em1 [?]Ek nem1 ,k. 2.1.1 WER </SectionTitle> <Paragraph position="0"> The word error rate is defined as the Levenshtein distance dL(Ek, tildewideEr,k) between a candidate sentence Ek and a reference sentence tildewideEr,k, divided by the reference length I[?]k for normalization.</Paragraph> <Paragraph position="1"> For a whole candidate corpus with multiple references, we define the WER to be:</Paragraph> <Paragraph position="3"> minr dLparenleftbigEk, tildewideEr,kparenrightbig Note that the WER of a single sentence can be calculated as the WER for a corpus of size K = 1.</Paragraph> </Section> <Section position="5" start_page="17" end_page="17" type="metho"> <SectionTitle> 2.1.2 PER </SectionTitle> <Paragraph position="0"> The position independent error rate (Tillmann et al., 1997) ignores the ordering of the words within a sentence. Independent of the word position, the minimum number of deletions, insertions, and substitutions to transform the candidate sentence into the reference sentence is calculated. Using the counts ne,r, ~ne,r,k of a word e in the candidate sentence Ek, and the reference sentence tildewideEr,k, we can calculate this distance as</Paragraph> <Paragraph position="2"> This distance is then normalized into an error rate, the PER, as described in section 2.1.1.</Paragraph> <Paragraph position="3"> A promising approach is to compare bigram or arbitrary m-gram count vectors instead of unigram count vectors only. This will take into account the ordering of the words within a sentence implicitly, although not as strong as the WER does.</Paragraph> </Section> <Section position="6" start_page="17" end_page="17" type="metho"> <SectionTitle> 2.1.3 BLEU </SectionTitle> <Paragraph position="0"> BLEU (Papineni et al., 2001) is a precision measure based on m-gram count vectors. The precision is modified such that multiple references are combined into a single m-gram count vector, ~ne,k := maxr ~ne,r,k. Multiple occurrences of an m-gram in the candidate sentence are counted as correct only up to the maximum occurrence count within the reference sentences. Typically, m = 1,...,4.</Paragraph> <Paragraph position="1"> To avoid a bias towards short candidate sentences consisting of &quot;safe guesses&quot; only, sentences shorter than the reference length will be penalized with a brevity penalty.</Paragraph> <Paragraph position="3"> parenrightBigbracerightbigg with the geometric mean gm and a brevity penalty</Paragraph> <Paragraph position="5"> parenrightBigparenrightbigg In the original BLEU definition, the smoothing term sm is zero. To allow for sentence-wise evaluation, Lin and Och (2004) define the BLEU-S measure with s1 := 1 and sm>1 := 0. We have adopted this technique for this study.</Paragraph> </Section> <Section position="7" start_page="17" end_page="18" type="metho"> <SectionTitle> 2.1.4 NIST </SectionTitle> <Paragraph position="0"> The NIST score (Doddington, 2002) extends the BLEU score by taking information weights of the m-grams into account. 
<Section position="7" start_page="17" end_page="18" type="metho">
<SectionTitle> 2.1.4 NIST </SectionTitle>
<Paragraph position="0"> The NIST score (Doddington, 2002) extends the BLEU score by taking information weights of the m-grams into account. The NIST information weight of an m-gram $e_1^m$ is defined over the reference corpus as</Paragraph>
<Paragraph position="1"> $$I\big(e_1^m\big) := \log_2 \frac{\sum_{k,r} \tilde{n}_{e_1^{m-1},r,k}}{\sum_{k,r} \tilde{n}_{e_1^m,r,k}}$$ </Paragraph>
<Paragraph position="2"> Note that the weight of a phrase that occurs in many reference sentences is considered to be lower than the weight of a phrase that occurs only once. The NIST score is the sum of the information weights of the co-occurring m-grams, summed up separately for each m = 1,...,5 and normalized by the total m-gram count:</Paragraph>
<Paragraph position="3"> $$\mathrm{NIST} := \mathrm{bp}' \cdot \sum_{m=1}^{5} \frac{\sum_k \sum_{e_1^m \in E_k \cap \tilde{E}_k} I\big(e_1^m\big)}{\sum_k \sum_{e_1^m \in E_k} 1}$$ </Paragraph>
<Paragraph position="4"> As in BLEU, there is a brevity penalty to avoid a bias towards short candidates:</Paragraph>
<Paragraph position="5"> $$\mathrm{bp}' := \exp\Big( \beta \, \log^2 \min\Big( \frac{\sum_k I_k}{\sum_k \bar{I}_k},\ 1 \Big) \Big)$$ where $\bar{I}_k$ denotes the average reference length for sentence $k$ and $\beta$ is chosen such that $\mathrm{bp}' = 0.5$ for a candidate corpus that is two thirds of the average reference length.</Paragraph>
<Paragraph position="6"> Due to the information weights, the value of the NIST score depends strongly on the selection of the reference corpus. This must be taken into account when comparing NIST scores of different evaluation campaigns.</Paragraph>
<Section position="1" start_page="18" end_page="18" type="sub_section">
<SectionTitle> 2.2 Other measures </SectionTitle>
<Paragraph position="0"> Lin and Och (2004) introduce a family of three measures named ROUGE. ROUGE-S is a skip-bigram F-measure. ROUGE-L and ROUGE-W are measures based on the length of the longest common subsequence of the sentences. ROUGE-S has a structure similar to the bigram PER presented here.</Paragraph>
<Paragraph position="1"> We expect ROUGE-L and ROUGE-W to have properties similar to those of WER.</Paragraph>
<Paragraph position="2"> In Leusch et al. (2003), we described INVWER, a word error rate enhanced by block transposition edit operations. As the structure and scores of INVWER are similar to those of WER, we have omitted INVWER experiments in this paper.</Paragraph>
</Section>
</Section>
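A minimal Python sketch of the m-gram information weights that distinguish NIST from BLEU in Section 2.1.4. It assumes the usual reading of the definition above (the log2 ratio between the reference count of the (m-1)-gram prefix and the count of the full m-gram, with the total number of reference words as the numerator for unigrams); all names are ours, not an official implementation.

```python
# Sketch (our reading of the NIST information weights above): counts are
# collected over the reference corpus, and the weight of an m-gram is
# log2(count of its (m-1)-gram prefix / count of the full m-gram).
import math
from collections import Counter


def mgram_counts(reference_corpus, max_m=5):
    """reference_corpus: iterable of token lists."""
    counts = Counter()
    total_words = 0
    for sentence in reference_corpus:
        total_words += len(sentence)
        for m in range(1, max_m + 1):
            for i in range(len(sentence) - m + 1):
                counts[tuple(sentence[i:i + m])] += 1
    return counts, total_words


def info_weight(mgram, counts, total_words):
    """I(e_1^m); for unigrams the numerator is the total reference word count
    (an assumption following the standard NIST definition)."""
    numerator = total_words if len(mgram) == 1 else counts[mgram[:-1]]
    return math.log2(numerator / counts[mgram])


# Example: rarer (more informative) m-grams receive larger weights.
refs = ["the cat sat on the mat".split(), "the dog sat on the rug".split()]
counts, total = mgram_counts(refs)
print(info_weight(("the",), counts, total))        # frequent unigram, low weight
print(info_weight(("the", "mat"), counts, total))  # rare bigram, higher weight
```

The example also makes the dependence on the reference corpus visible: changing the references changes every weight, which is why NIST scores from different campaigns are not directly comparable.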
<Section position="8" start_page="18" end_page="19" type="metho">
<SectionTitle> 3 Preprocessing and normalization </SectionTitle>
<Paragraph position="0"> Although the general idea is clear, there are still several details to be specified when implementing and using an automatic evaluation measure. We are going to investigate the following problems. The first detail we have to state more precisely is the term &quot;word&quot; in the above formulae. A common approach for western languages is to consider spaces as separators of words. The role of punctuation marks in tokenization is arguable, though. A punctuation mark can separate words, it can be part of a word, and it can be a word of its own. Equally, it can be entirely irrelevant for evaluation.</Paragraph>
<Paragraph position="1"> Along the same lines, it has to be specified whether we consider words to be equal if they differ only with respect to upper and lower case. For the IWSLT evaluation, Paul et al. (2004) give an introduction to how the handling of punctuation and case information may affect automatic MT evaluation.</Paragraph>
<Paragraph position="2"> Also, a method to calculate the &quot;reference length&quot; must be specified if there are multiple reference sentences of different lengths.</Paragraph>
<Paragraph position="3"> Since we want to compare automatic evaluation with human evaluation, we have to clarify some questions about assessing human evaluation as well: Large evaluation tasks are usually distributed to several human evaluators. To smooth evaluation noise, it is common practice to have each candidate sentence evaluated by at least two human judges independently. Therefore there are several evaluation scores for each candidate sentence. We require a single score for each system, though. Consequently, we have to specify how to combine the evaluator scores into sentence scores and then the sentence scores into a system score.</Paragraph>
<Paragraph position="4"> Different definitions of this combination will have a significant impact on both automatic and human evaluation scores.</Paragraph>
<Section position="1" start_page="18" end_page="19" type="sub_section">
<SectionTitle> 3.1 Tokenization and punctuation </SectionTitle>
<Paragraph position="0"> The importance of punctuation as well as the strictness of punctuation rules depends on the language. In most western languages, correct punctuation can vastly improve the legibility of texts. Marks like the full stop or the comma separate words. Other marks like apostrophes and hyphens can be used to join words, forming new words in the process. For example, the spelling &quot;There's&quot; is a contraction of &quot;There is&quot;.</Paragraph>
<Paragraph position="1"> Similar phenomena can be found in other languages, although the set of critical characters may vary. Even when evaluating English translations, the candidate sentences may contain source language parts like proper names, which should thus be treated according to the source language.</Paragraph>
<Paragraph position="2"> From the viewpoint of an automatic evaluation measure, we have to decide which units we consider to be words of their own.</Paragraph>
<Paragraph position="3"> We have studied four tokenization methods. The simplest method is keeping the original sentences and considering only spaces as word separators.</Paragraph>
<Paragraph position="4"> Alternatively, we can consider all punctuation marks to separate words, but then remove them completely.</Paragraph>
<Paragraph position="5"> The mteval tool (Papineni, 2002) improves this scheme by keeping all punctuation marks as separate words, except for decimal points and hyphens joining compounds. We have extended this scheme by implementing a treatment of common English contractions. Table 1 illustrates these methods.</Paragraph>
</Section>
<Section position="2" start_page="19" end_page="19" type="sub_section">
<SectionTitle> 3.2 Case sensitivity </SectionTitle>
<Paragraph position="0"> In western languages, maintaining correct upper and lower case can improve the readability of a text. Unfortunately, although the case of a word depends on its word class, this classification is not always unambiguous. What is more, the first word in a sentence is always written in upper case. This lowers the significance of case information in MT evaluation, as even a valid reordering of words between candidate and reference sentence may lead to conflicting cases. Consequently, we investigated if and how case information can be exploited for automatic evaluation.</Paragraph>
</Section>
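The mteval-like scheme and the contraction treatment of Section 3.1 can be approximated with a few regular expressions. The sketch below only illustrates the kind of behaviour shown in Table 1; it is neither the actual mteval script nor the authors' tokenizer, and the contraction list and placeholder tokens are our own.

```python
# Rough approximation (not the official mteval script) of the tokenization
# schemes of Section 3.1: punctuation marks become separate tokens, decimal
# points and hyphens joining compounds are kept, and common English
# contractions are split off as separate tokens.
import re

CONTRACTIONS = ["n't", "'s", "'re", "'ll", "'ve", "'d", "'m"]  # assumed list


def tokenize(sentence, split_punct=True, split_contractions=True):
    s = sentence
    if split_contractions:
        for contr in CONTRACTIONS:
            s = re.sub(re.escape(contr) + r"\b", " " + contr, s)
    if split_punct:
        s = re.sub(r"(\d)\.(\d)", r"\1<DOT>\2", s)  # protect decimal points
        s = re.sub(r"(\w)-(\w)", r"\1<HYP>\2", s)   # protect hyphenated compounds
        # punctuation becomes its own token; apostrophes stay attached so that
        # the split contractions above remain intact
        s = re.sub(r"([^\w\s<>'])", r" \1 ", s)
        s = s.replace("<DOT>", ".").replace("<HYP>", "-")
    return s.split()


print(tokenize('Powell said: "we would not be alone; that is for sure."'))
# -> ['Powell', 'said', ':', '"', 'we', 'would', 'not', 'be', 'alone', ';',
#     'that', 'is', 'for', 'sure', '.', '"']
```

With both flags set to False the function reduces to the simplest scheme (spaces only); dropping all tokens matched by the punctuation pattern instead of keeping them would give the "remove punctuation completely" variant.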
<Section position="3" start_page="19" end_page="19" type="sub_section">
<SectionTitle> 3.3 Reference length </SectionTitle>
<Paragraph position="0"> Each automatic evaluation measure we have taken into account depends on the calculation of a reference length: WER, PER, and ROUGE are normalized by it, whereas NIST and BLEU incorporate it for the determination of the brevity penalty. In MT evaluation practice, there are multiple reference sentences for each candidate sentence, each with a different length. It is thus not intuitively clear what the &quot;reference length&quot; is.</Paragraph>
<Paragraph position="1"> A simple choice here is the average length of the reference sentences. Though this is the modus operandi for NIST, it is problematic with brevity penalty or F-measure based scores, as even candidate sentences that are identical to a shorter-than-average reference sentence - which we would intuitively consider to be &quot;optimal&quot; - will then receive a sub-optimal score. BLEU incorporates a different method for the determination of the reference length in its default implementation: the reference length here is the reference sentence length which is closest to the candidate length. If there is more than one, the shortest of them is chosen.</Paragraph>
<Paragraph position="2"> For measures based on the comparison of single sentences, such as WER, PER, and ROUGE, at least two more methods deserve consideration (sketched below): * The average length of the sentences with the lowest absolute distance or highest similarity to the candidate sentence. We call this method &quot;average nearest-sentence length&quot;.</Paragraph>
<Paragraph position="3"> * The length of the sentence with the lowest relative error rate or the highest relative similarity. We call this method &quot;best length&quot;. Note that when using this method, the error rate is not computed from the minimum absolute distance, but from the distance that leads to the minimum relative error.</Paragraph>
<Paragraph position="4"> Other strategies we studied, e.g. the minimum length of the reference sentences, did not show any theoretical or experimental advantage over the methods mentioned here. Thus we will not discuss them in this paper.</Paragraph>
</Section>
<Section position="4" start_page="19" end_page="19" type="sub_section">
<SectionTitle> 3.4 Sentence boundaries </SectionTitle>
<Paragraph position="0"> The position of a word within a sentence can be quite significant for the correctness of the sentence.</Paragraph>
<Paragraph position="1"> WER, INVWER, and ROUGE-L take the word ordering into account explicitly. This is not the case with m-gram PER, BLEU, or NIST, although the positions of inner words are regarded implicitly through m-gram overlap.</Paragraph>
<Paragraph position="2"> To model the position of words at the beginning or the end of a sentence, one can enclose the sentence in artificial sentence boundary words. Although this is a common approach in language modelling, to our knowledge it has not yet been applied to MT evaluation.</Paragraph>
</Section>
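The reference-length strategies of Section 3.3 can be made concrete with a short sketch. The function names and the callback-style interface are ours, and the code is one possible reading of the definitions rather than the authors' implementation; the distance argument would be, e.g., the word-level Levenshtein distance from the WER sketch above.

```python
# Sketch of the reference-length strategies of Section 3.3 for a sentence-wise,
# distance-based measure such as WER (one possible reading, not the authors' code).

def average_length(refs, cand=None, distance=None):
    """Average length of all reference sentences (NIST-style)."""
    return sum(len(r) for r in refs) / len(refs)


def closest_length(refs, cand, distance=None):
    """BLEU default: length closest to the candidate length, shortest on ties."""
    return min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]


def nearest_sentence_length(refs, cand, distance):
    """Average length of the references with the lowest absolute distance
    to the candidate ("average nearest-sentence length")."""
    dists = [distance(cand, r) for r in refs]
    nearest = [len(r) for r, d in zip(refs, dists) if d == min(dists)]
    return sum(nearest) / len(nearest)


def best_length_error_rate(cand, refs, distance):
    """"Best length": the error rate is the minimum *relative* error, i.e. the
    distance to each reference divided by that reference's own length."""
    return min(distance(cand, r) / len(r) for r in refs)


def sentence_error_rate(cand, refs, distance, ref_length):
    """Minimum absolute distance, normalized by a chosen reference length."""
    return min(distance(cand, r) for r in refs) / ref_length(refs, cand, distance)
```

Note that best_length_error_rate intentionally does not reuse sentence_error_rate: as stated in Section 3.3, the "best length" method selects the distance that minimizes the relative error, which is not necessarily the minimum absolute distance.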
<Section position="5" start_page="19" end_page="19" type="sub_section">
<SectionTitle> 3.5 Evaluator normalization </SectionTitle>
<Paragraph position="0"> For human evaluation, it has to be specified how to handle evaluator bias, and how to combine sentence scores into system scores.</Paragraph>
<Paragraph position="1"> Regarding evaluator bias, even accurate evaluation guidelines will not prevent a measurable discrepancy between the scores assigned by different human evaluators.</Paragraph>
<Paragraph position="2"> The 2003 TIDES MT evaluation may serve as an example here: Since the candidate sentences of the participating systems were randomly distributed among ten human evaluators, one would expect the assessed scores to be independent of the evaluator. Figure 1 (score distribution per human evaluator, TIDES CE corpus) indicates that this is not the case, as the evaluators can clearly be distinguished by the number of good and bad marks they assigned.</Paragraph>
<Paragraph position="3"> (0,1) evaluator normalization overcomes this bias: For each human evaluator, the average sentence score given by him or her and its variance are calculated. The scores are then normalized to an expectation of 0 and a standard deviation of 1 (Doddington, 2003), separately for each evaluator. Evaluator normalization should be unnecessary for system evaluation, as the evaluator biases tend to cancel out over the large number of candidate sentences if the assignment of evaluators to systems is random enough. Moreover, with (0,1) normalization the calculated system scores are relative, not absolute, scores. As such they can only be compared with scores from the same evaluation.</Paragraph>
<Paragraph position="4"> Whereas the assessments by the human evaluators are given on the sentence level, our interest may lie in the evaluation of whole candidate systems. Depending on the number of assessments per candidate sentence, different combination methods for the sentence scores can be considered, e.g. the mean or the median. As our data consisted of only two or three human assessments per sentence, we have applied only the mean in our experiments.</Paragraph>
<Paragraph position="5"> It also has to be defined how a system score is calculated from the sentence scores. All of the automatic evaluation measures implicitly weight the candidate sentences by their length. Consequently, we applied a weighting by length on the sentence level to the human evaluation scores as well.</Paragraph>
</Section>
</Section>
<Section position="9" start_page="19" end_page="22" type="metho">
<SectionTitle> 4 Experimental results </SectionTitle>
<Paragraph position="0"> To assess the impact of the preprocessing steps mentioned above, we calculated scores for several automatic evaluation measures with varying preprocessing, reference length calculation, etc. on three evaluation test sets from international MT evaluation campaigns. We then compared these automatic evaluation results with human evaluation of adequacy and fluency by determining a correlation coefficient between human and automatic evaluation. We chose Pearson's r for this. Although all evaluation measures were calculated using length weighting, we did not apply any weighting when calculating the sentence level correlation.</Paragraph>
<Paragraph position="1"> Regarding the m-gram PER, we had studied m-gram lengths of up to 8, both separately and in combination with shorter m-gram lengths, in previous experiments. However, m-gram lengths greater than 4 did not show noteworthy correlation. For this reason, we leave out these results in this paper.</Paragraph>
<Paragraph position="2"> For the sake of clarity, we also leave out measures that behave very similarly to related measures, e.g. INVWER and WER, 2-PER and 1-PER, or BLEU and BLEU-S.</Paragraph>
<Paragraph position="3"> Since WER and PER are error measures, whereas BLEU and NIST are similarity measures, the correlation coefficients with human evaluation will have opposite signs. For convenience, we will look at the absolute coefficients only.</Paragraph>
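The (0,1) evaluator normalization of Section 3.5 and the length-weighted combination of sentence scores into a system score can be sketched as follows. The data layout (a flat list of (evaluator, sentence_id, score) judgments) and all names are our own assumptions, not the setup used by the authors; Pearson's r between the resulting human scores and an automatic score can then be computed with any statistics package.

```python
# Sketch of (0,1) evaluator normalization (Section 3.5) and of the
# length-weighted combination of sentence scores into a system score.
# Assumed data layout: judgments = [(evaluator_id, sentence_id, raw_score), ...]
from collections import defaultdict
from statistics import mean, pstdev


def normalize_evaluators(judgments):
    """Shift and scale each evaluator's scores to mean 0 and std deviation 1."""
    by_evaluator = defaultdict(list)
    for evaluator, _, score in judgments:
        by_evaluator[evaluator].append(score)
    stats = {ev: (mean(s), pstdev(s) or 1.0) for ev, s in by_evaluator.items()}
    return [(ev, sid, (score - stats[ev][0]) / stats[ev][1])
            for ev, sid, score in judgments]


def sentence_scores(judgments):
    """Combine the two or three assessments of each sentence by their mean."""
    by_sentence = defaultdict(list)
    for _, sid, score in judgments:
        by_sentence[sid].append(score)
    return {sid: mean(scores) for sid, scores in by_sentence.items()}


def system_score(sent_scores, sent_lengths):
    """Length-weighted mean of sentence scores, mirroring the implicit length
    weighting of the automatic measures."""
    total = sum(sent_lengths[sid] for sid in sent_scores)
    return sum(s * sent_lengths[sid] for sid, s in sent_scores.items()) / total


# Example with two evaluators who use the 1-5 scale differently.
judgments = [("A", 1, 4), ("B", 1, 2), ("A", 2, 5), ("B", 2, 3)]
scores = sentence_scores(normalize_evaluators(judgments))
print(system_score(scores, {1: 20, 2: 10}))
```

In the example, evaluator B is systematically harsher than evaluator A; after normalization both agree that sentence 2 is better than sentence 1, which is exactly the bias removal described above.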
<Section position="1" start_page="19" end_page="21" type="sub_section">
<SectionTitle> 4.1 Corpora </SectionTitle>
<Paragraph position="0"> From the 2003 TIDES evaluation campaign, we included both the Chinese-English and the Arabic-English test corpus in our experiments. Both were provided with adequacy and fluency scores between 1 and 5 for seven and six candidate sets, respectively.</Paragraph>
<Paragraph position="1"> As we wanted to perform experiments on a corpus with a larger number of MT systems, we also included the IWSLT BTEC 2004 Chinese-English evaluation (Akiba et al., 2004). We restricted our experiments to the eleven MT systems that had been trained on a common training corpus.</Paragraph>
<Paragraph position="2"> Corpus statistics can be found in Table 2.</Paragraph>
</Section>
<Section position="2" start_page="21" end_page="21" type="sub_section">
<SectionTitle> 4.2 Experimental baseline </SectionTitle>
<Paragraph position="0"> In our first experiment, we studied the correlation of the different evaluation measures with human evaluation under &quot;baseline&quot; conditions. These included no sentence boundaries, but tokenization with treatment of abbreviations (see Table 1). For sentence evaluation, the conditions included evaluator normalization. Case information was removed. We used these settings in the other experiments, too, if not stated otherwise.</Paragraph>
<Paragraph position="1"> Figure 2 shows the correlation between automatic and human scores (left: sentence level; right: system level). On the TIDES corpora the system level correlation is particularly high, at a moderate sentence level correlation. We assume the latter is due to the poor sentence-level inter-annotator agreement on these corpora, which is then smoothed out at the system level. On the BTEC corpus, a high sentence level correlation accompanies a significantly lower system level correlation. Note that due to the much lower number of samples at the system level (e.g. 5 vs. 5500), small changes in the sentence level correlation are more likely to be significant than such changes at the system level.</Paragraph>
<Paragraph position="2"> We have verified these effects by inspecting the rank correlation on both levels, as well as by experiments on other corpora. Although these experiments support our findings, we have omitted the results here for the sake of clarity.</Paragraph>
</Section>
<Section position="3" start_page="21" end_page="21" type="sub_section">
<SectionTitle> 4.3 Evaluator normalization </SectionTitle>
<Paragraph position="0"> We studied the effect of (0,1)-normalization of the scores assigned by human evaluators. The NIST measure showed a behavior very similar to that of the other measures and is thus left out of the graph.</Paragraph>
<Paragraph position="1"> The correlation of all automatic measures, both with fluency and with adequacy, increases significantly at the sentence level (Figure 3). We do not notice a positive effect at the system level, which confirms the assumption stated in Section 3.5.</Paragraph>
</Section>
<Section position="4" start_page="21" end_page="22" type="sub_section">
<SectionTitle> 4.4 Tokenization and case normalization </SectionTitle>
<Paragraph position="0"> The impact of case information was analyzed in our next experiment. Figure 4 (again without the NIST measure, as it shows a behavior similar to the other measures; left: sentence level, right: system level correlation) indicates that it is advisable to disregard case information when looking into adequacy on the sentence level. Surprisingly, this also holds for fluency. We do not find a clear tendency on whether or not to regard case information at the system level. Figure 5 indicates that the way of handling punctuation we proposed does pay off when evaluating adequacy. For fluency, our results were contradictory: a slight gain on the Arabic-English corpus is accompanied by a slight decay on the Chinese-English corpus. We did not investigate the BTEC corpus here, as most systems stuck to the tokenization guidelines for this evaluation.</Paragraph>
</Section>
<Section position="5" start_page="22" end_page="22" type="sub_section">
<SectionTitle> 4.5 Reference length </SectionTitle>
<Paragraph position="0"> The dependency of evaluation measures on the selection of the reference length is rarely covered in the literature. However, as we can see in Figure 6 (left: sentence level, right: system level correlation), our experiments indicate a significant impact. The three methods selected here are the defaults for WER/PER, NIST, and BLEU, respectively. For the distance based evaluation measures, represented by WER here, taking the length of the sentence leading to the best score yields the best correlation with both fluency and adequacy. Taking the average length instead seems to be the worst choice.</Paragraph>
<Paragraph position="1"> For brevity penalty based measures, the effect is not as clear: On both TIDES corpora there is no significant difference in correlation between using the average length and the nearest length. On the BTEC corpus, choosing the nearest sentence length leads to a significantly higher correlation than choosing the average length. We assume this is due to the high number of reference sentences in this corpus.</Paragraph>
</Section>
<Section position="6" start_page="22" end_page="22" type="sub_section">
<SectionTitle> 4.6 Sentence boundaries </SectionTitle>
<Paragraph position="0"> As sentence boundaries only influence measures based on m-gram count vectors, we have restricted our experiments to bigram PER, BLEU-S, and NIST here. Including sentence boundaries (Figure 7) has a positive effect on the correlation with fluency and adequacy for both bigram PER and BLEU-S.</Paragraph>
<Paragraph position="1"> Sentence-initial boundaries seem to be more important than sentence-final ones here. For the NIST measure, we do not find any significant effect.</Paragraph>
</Section>
</Section>
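To make the sentence-boundary treatment of Sections 3.4 and 4.6 concrete, here is a minimal sketch of how artificial boundary words enter the bigram counts; the boundary token strings are arbitrary placeholders of our choosing.

```python
# Sketch: artificial sentence-boundary words (Sections 3.4 and 4.6). With the
# padding, a bigram-based measure such as 2-PER or BLEU-S also "sees" which
# word a sentence starts and ends with.
from collections import Counter

BOS, EOS = "<s>", "</s>"  # arbitrary placeholder tokens


def add_boundaries(tokens, initial=True, final=True):
    return ([BOS] if initial else []) + list(tokens) + ([EOS] if final else [])


def bigram_counts(tokens):
    return Counter(zip(tokens, tokens[1:]))


# The bigram ('<s>', 'powell') now rewards or penalizes the choice of the
# sentence-initial word, which a plain bigram count vector would ignore.
print(bigram_counts(add_boundaries("powell said that".split())))
```

The initial and final flags mirror the experimental distinction drawn in Section 4.6 between sentence-initial and sentence-final boundaries.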
<Section position="10" start_page="22" end_page="23" type="metho">
<SectionTitle> 5 Discussion </SectionTitle>
<Paragraph position="0"> In a perfect MT world, any dependency of an evaluation on case information or tokenization should be nonexistent, as MT systems already have to deal with both in the translation process, and could be designed to produce output according to evaluation campaign guidelines. Once all translation systems stick to the same specifications, no further preprocessing steps should be necessary.</Paragraph>
<Paragraph position="1"> In practice, there will be some systems that step out of line. If we then choose strict rules regarding case information and punctuation, automatic error measures will penalize these systems rather hard, whereas the penalty is rather low if we choose lax rules. In this situation, case information will have a large effect on the correlation between automatic and human evaluation, depending on whether the candidate systems involved receive a good or a bad human evaluation. It is vital to keep this in mind when drawing conclusions regarding system evaluation, despite the obvious importance of case information in natural languages.</Paragraph>
<Paragraph position="2"> These considerations also hold for the treatment of punctuation marks, as special care would be unnecessary if all systems stuck to the tokenization specifications. In practice, MT systems differ in the way they generate and handle punctuation marks. Therefore, appropriate preprocessing steps are advisable.</Paragraph>
<Paragraph position="3"> Our experiments suggest that sentence boundaries increase the correlation between automatic scores and adequacy, both at the sentence and at the system level.</Paragraph>
<Paragraph position="4"> For fluency, the improvement is less significant and depends mainly on the sentence-initial boundary.</Paragraph>
<Paragraph position="5"> For brevity penalty based measures, we have found that choosing the nearest sentence length yields the highest correlation with human evaluation. For distance based measures, on the other hand, it seems advisable to choose the sentence that leads to the best relative score as the one that determines the reference length.</Paragraph>
</Section>
</Paper>