<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1031">
  <Title>CDER: Efficient MT Evaluation Using Block Movements</Title>
  <Section position="2" start_page="0" end_page="241" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Research in machine translation (MT) depends heavily on the evaluation of its results. Especially for the development of an MT system, an evaluation measure is needed which reliably assesses the quality of MT output. Such a measure helps analyze the strengths and weaknesses of different translation systems, or of different versions of the same system, by comparing output at the sentence level. In most applications of MT, understandability for humans, in terms of both readability and semantic correctness, should be the evaluation criterion. But as human evaluation is tedious and cost-intensive, automatic evaluation measures are used in most MT research tasks. A high correlation between these automatic evaluation measures and human evaluation is therefore desirable.</Paragraph>
    <Paragraph position="1"> State-of-the-art measures such as BLEU (Papineni et al., 2002) or NIST (Doddington, 2002) aim at measuring translation quality on the document level[1] rather than on the level of single sentences. They are thus not well suited for sentence-level evaluation. The introduction of smoothing (Lin and Och, 2004) solves this problem only partially.</Paragraph>
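As a brief illustration of why such document-level measures are brittle at the sentence level, the following sketch computes a smoothed sentence-level BLEU-like score. It is a minimal illustration under an assumed add-constant smoothing, not the exact scheme of Lin and Och (2004): without smoothing, a single n-gram order with zero matches drives the whole geometric mean to zero.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(hyp, ref, max_n=4, smooth=1.0):
    """Smoothed sentence-level BLEU-like score (illustrative add-constant smoothing)."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped n-gram matches, as in BLEU
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        prec = (matches + smooth) / (total + smooth)
        if prec == 0.0:                      # only possible when smooth == 0
            return 0.0
        log_prec_sum += math.log(prec)
    # brevity penalty, as in BLEU
    brevity = min(1.0, math.exp(1.0 - len(ref) / max(len(hyp), 1)))
    return brevity * math.exp(log_prec_sum / max_n)

hyp = "the cat sat on mat".split()
ref = "the cat is on the mat".split()
print(sentence_bleu(hyp, ref, smooth=0.0))  # 0.0: no 3- or 4-gram matches, score collapses
print(sentence_bleu(hyp, ref, smooth=1.0))  # small but non-zero, usable at the sentence level
```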
    <Paragraph position="2"> In this paper, we will present a new automatic error measure for MT, CDER, which is designed for assessing MT quality on the sentence level. Like the well-known word error rate (WER), it is based on edit distance, but it additionally allows for the reordering of blocks. Nevertheless, because reordering incurs a cost, the ordering of the words in a sentence remains relevant for the measure. In this respect, the new measure differs significantly from the position-independent error rate (PER) of Tillmann et al. (1997). In general, finding an optimal solution for such a reordering problem is NP-hard, as shown by Lopresti and Tomkins (1997). In previous work, researchers have tried to reduce the complexity, for example by restricting the possible permutations on the block level, or by using approximations or heuristics during the calculation. Nevertheless, most of the resulting algorithms still have high run times and are hardly applied in practice, or they give only a rough approximation. An overview of some better-known measures can be found in Section 3.1. In contrast, our new measure can be calculated very efficiently. This is achieved by requiring complete and disjoint coverage of the blocks only for the reference sentence, and not for the candidate translation. We will present an algorithm which computes the new error measure in quadratic time.</Paragraph>
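To make the efficiency argument concrete, the following is a minimal sketch of the kind of dynamic program described above: a Levenshtein-style recursion in which the reference (which must be covered completely and in order) indexes the rows and the candidate indexes the columns, extended with a constant-cost "long jump" that lets the alignment continue at any candidate position. The uniform operation costs and the jump cost of 1 are illustrative assumptions, not necessarily the exact costs used by CDER.

```python
def cder_like_distance(cand, ref, jump_cost=1):
    """Edit distance with block movements, sketched as a per-row jump relaxation.

    Runs in O(len(ref) * len(cand)) time: each row is filled with the standard
    edit operations, then relaxed once with the row minimum plus jump_cost.
    """
    I, J = len(ref), len(cand)
    # Row i = 0: no reference word covered yet; cost j models skipping the
    # first j candidate words one by one ...
    prev = list(range(J + 1))
    # ... but a single long jump may also move the start position anywhere.
    prev = [min(v, min(prev) + jump_cost) for v in prev]
    for i in range(1, I + 1):
        cur = [prev[0] + 1] + [0] * J              # ref[i-1] left uncovered (deletion)
        for j in range(1, J + 1):
            sub = 0 if ref[i - 1] == cand[j - 1] else 1
            cur[j] = min(prev[j - 1] + sub,        # match / substitution
                         prev[j] + 1,              # deletion of ref[i-1]
                         cur[j - 1] + 1)           # insertion: skip cand[j-1]
        best = min(cur)                            # cheapest cell in this row
        cur = [min(v, best + jump_cost) for v in cur]   # long jump anywhere in the candidate
        prev = cur
    # Ending at the last candidate position follows the usual edit-distance convention.
    return prev[J]

cand = "the meeting will take place tomorrow , the president said".split()
ref = "the president said that the meeting will take place tomorrow".split()
print(cder_like_distance(cand, ref))  # far smaller than the monotone edit distance (WER)
```

The jump relaxation never has to be re-propagated, because jumping directly to a cell is always at least as cheap as jumping elsewhere and then inserting; two passes per row therefore suffice, which is where the quadratic run time comes from.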
    <Paragraph position="3"> The new evaluation measure will be investigated and compared to state-of-the-art methods on two translation tasks. The correlation with human assessment will be measured for several different statistical MT systems. We will see that the new measure significantly outperforms the existing approaches.</Paragraph>
    <Paragraph position="4"> [1] The n-gram precisions are measured at the sentence level and then combined into a score over the whole document.
As a further improvement, we will introduce word-dependent substitution costs. This method will be applicable to the new measure as well as to established measures such as WER and PER.</Paragraph>
    <Paragraph position="5"> Starting from the observation that substituting a word with a similar one is likely to affect translation quality less than substituting it with a completely different word, we will show how the similarity of words can be accounted for in automatic evaluation measures.</Paragraph>
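As a simple illustration of such word-dependent substitution costs (an assumed instantiation for this sketch, not necessarily the scheme developed in Section 4), the substitution cost can be scaled by a normalized character-level edit distance, so that replacing a word with a morphological variant is much cheaper than replacing it with an unrelated word.

```python
def char_edit_distance(a, b):
    """Plain character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb),  # substitution / match
                           prev[j] + 1,               # deletion
                           cur[j - 1] + 1))           # insertion
        prev = cur
    return prev[-1]

def sub_cost(w1, w2):
    """Word-dependent substitution cost in [0, 1].

    Identical words cost 0, completely dissimilar words cost about 1, and
    morphological variants fall in between. Normalizing by the longer
    word's length is an illustrative choice.
    """
    if w1 == w2:
        return 0.0
    return char_edit_distance(w1, w2) / max(len(w1), len(w2))

print(sub_cost("house", "houses"))    # about 0.17: a cheap, near-match substitution
print(sub_cost("house", "tomorrow"))  # close to 1: a full-cost substitution
```

In the CDER-like sketch above, a function of this kind could take the place of the fixed 0/1 substitution cost.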
    <Paragraph position="6"> This paper is organized as follows: In Section 2, we will present the state of the art in MT evaluation and discuss the problem of block reordering. Section 3 will introduce the new error measure CDER and will show how it can be calculated efficiently. The concept of word-dependent substitution costs will be explained in Section 4. In Section 5, experimental results on the correlation of human judgment with the CDER and other well-known evaluation measures will be presented. Section 6 will conclude the paper and give an outlook on possible future work.</Paragraph>
  </Section>
</Paper>