<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0904">
  <Title>Syntactic Features for Evaluation of Machine Translation</Title>
  <Section position="2" start_page="0" end_page="25" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Evaluation has long been a stumbling block in the development of machine translation systems, due to the simple fact that there are many correct translations for a given sentence. Human evaluation of system output is costly in both time and money, leading to the rise of automatic evaluation metrics in recent years. The most commonly used automatic evaluation metrics, BLEU (Papineni et al., 2002) and NIST (Doddington, 2002), are based on the assumption that "the closer a machine translation is to a professional human translation, the better it is" (Papineni et al., 2002). For every hypothesis, BLEU computes the fraction of n-grams that also appear in the reference sentences, combined with a brevity penalty. NIST uses a similar strategy but additionally treats n-grams of different frequencies differently: it introduces information weights, so that rarely occurring n-grams count more toward the score than frequently occurring ones (Doddington, 2002). BLEU and NIST have been shown to correlate closely with human judgments when ranking MT systems of differing quality (Papineni et al., 2002; Doddington, 2002).</Paragraph>
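    To make the n-gram counting concrete, the following is a minimal sketch (in Python, for illustration only; it is not the official BLEU or NIST implementation, and the function names are our own) of a single-reference BLEU-style score: clipped n-gram precisions combined through a geometric mean and scaled by a brevity penalty. NIST additionally weights each n-gram by an information weight derived from its corpus frequency, which this sketch omits.

        import math
        from collections import Counter

        def ngram_counts(tokens, n):
            """Multiset of the n-grams occurring in a token list."""
            return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

        def bleu_sketch(hypothesis, reference, max_n=4):
            """Toy single-reference BLEU: geometric mean of clipped n-gram
            precisions times a brevity penalty (illustrative only)."""
            hyp, ref = hypothesis.split(), reference.split()
            precisions = []
            for n in range(1, max_n + 1):
                hyp_ngrams = ngram_counts(hyp, n)
                ref_ngrams = ngram_counts(ref, n)
                # Clip each hypothesis n-gram count by its count in the reference.
                clipped = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
                precisions.append(clipped / max(sum(hyp_ngrams.values()), 1))
            if min(precisions) == 0:
                return 0.0  # geometric mean collapses when any precision is zero
            log_mean = sum(math.log(p) for p in precisions) / max_n
            brevity_penalty = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
            return brevity_penalty * math.exp(log_mean)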
    <Paragraph position="1"> In the 2003 Johns Hopkins Workshop on Speech and Language Engineering, experiments on MT evaluation showed that BLEU and NIST do not correlate well with human judgments at the sentence level, even when they correlate well over large test sets (Blatz et al., 2003). Kulesza and Shieber (2004) use a machine learning approach to improve the correlation at the sentence level. Their method rests on the assumption that higher classification accuracy in discriminating human- from machine-generated translations will yield closer correlation with human judgments; it uses support vector machine (SVM) based learning to weight multiple metrics such as BLEU, NIST, and WER (minimal word error rate).</Paragraph>
    <Paragraph position="2"> The SVM is trained to differentiate MT hypotheses from professional human translations, and the distance from a hypothesis's metric vector to the hyperplane of the trained SVM is then taken as the final score for that hypothesis.</Paragraph>
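    As a rough illustration of this setup (a sketch only, using scikit-learn and made-up metric values rather than anything from the original work), each translation is represented by a vector of metric scores, a linear SVM is trained to separate human from machine translations, and the signed distance to the learned hyperplane is read off as the sentence-level score:

        import numpy as np
        from sklearn.svm import SVC

        # Each row is a vector of automatic metric scores for one translation,
        # e.g. [BLEU, NIST, WER]; all values here are invented for illustration.
        features = np.array([
            [0.62, 7.1, 0.21],   # professional human translation
            [0.58, 6.8, 0.25],   # professional human translation
            [0.31, 4.2, 0.55],   # machine translation
            [0.28, 3.9, 0.60],   # machine translation
        ])
        labels = np.array([1, 1, 0, 0])  # 1 = human, 0 = machine

        # Linear SVM trained to discriminate human- from machine-generated output.
        svm = SVC(kernel="linear")
        svm.fit(features, labels)

        # The signed distance of a new hypothesis's metric vector to the hyperplane
        # serves as its quality score: larger means "more human-like".
        new_hypothesis = np.array([[0.45, 5.5, 0.40]])
        print(svm.decision_function(new_hypothesis)[0])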
    <Paragraph position="3"> While the machine learning approach improves correlation with human judgments, all the metrics discussed are based on the same type of information: n-gram subsequences of the hypothesis translations.</Paragraph>
    <Paragraph position="4"> This type of feature cannot capture the grammaticality of a sentence, in part because it does not take sentence-level information into account. For example, a sentence can achieve an excellent BLEU score without containing a verb. As MT systems improve, the shortcomings of n-gram based evaluation are becoming more apparent. State-of-the-art MT output often contains roughly the correct words and concepts but does not form a coherent sentence. Often the intended meaning can be inferred; often it cannot. Evidence that we are reaching the limits of n-gram based evaluation was provided by Charniak et al. (2003), who found that a syntax-based language model improved the fluency and semantic accuracy of their system but lowered their BLEU score.</Paragraph>
    <Paragraph position="5"> With the progress of MT research in recent years, we are no longer satisfied with merely getting the correct words into the translations; we also expect them to be well formed and readable. This presents new challenges to MT evaluation. As discussed above, the existing word-based metrics cannot give a clear evaluation of a hypothesis's fluency. In the BLEU metric, for example, the overlapping fractions of n-grams longer than one word serve as a rough measure of the fluency of the hypothesis.</Paragraph>
    <Paragraph position="6"> Consider the following simple example: Reference: I had a dog.</Paragraph>
    <Paragraph position="7">  Hypothesis 1: I have the dog.</Paragraph>
    <Paragraph position="8"> Hypothesis 2: A dog I had.</Paragraph>
    <Paragraph position="9">  If we use BLEU to evaluate the two sentences, hypothesis 2 has two bigrams, "a dog" and "I had", which are also found in the reference, while hypothesis 1 has no bigrams in common with the reference. Thus hypothesis 2 will get a higher score than hypothesis 1. This result is obviously incorrect. However, if we evaluate their fluency based on syntactic similarity with the reference, we get the desired result. Figure 1 shows syntactic trees for the example sentences, from which we can see that hypothesis 1 has exactly the same syntactic structure as the reference, while hypothesis 2 has a very different one. Thus the evaluation of fluency can be recast as computing the syntactic similarity of the hypothesis to the references.</Paragraph>
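    The bigram counts in this example are easy to verify mechanically; the short sketch below (a toy illustration, not a full BLEU computation) prints the bigrams each hypothesis shares with the reference.

        from collections import Counter

        def bigrams(sentence):
            """Lowercased bigram multiset of a whitespace-tokenized sentence."""
            tokens = sentence.lower().rstrip(".").split()
            return Counter(zip(tokens, tokens[1:]))

        reference = "I had a dog."
        for name, hyp in [("Hypothesis 1", "I have the dog."),
                          ("Hypothesis 2", "A dog I had.")]:
            ref_bigrams, hyp_bigrams = bigrams(reference), bigrams(hyp)
            shared = [gram for gram in hyp_bigrams if gram in ref_bigrams]
            print(name, "shares", shared)
        # Hypothesis 1 shares no bigrams with the reference; hypothesis 2 shares
        # ('a', 'dog') and ('i', 'had'), so bigram overlap alone ranks it higher.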
    <Paragraph position="10"> This paper develops a number of syntactically motivated evaluation metrics computed by automatically parsing both reference and hypothesis sentences. Our experiments measure how well these metrics correlate with human judgments, both for individual sentences and over a large test set translated by MT systems of varying quality.</Paragraph>
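    One simple way to turn this intuition into a number (a generic sketch for illustration, not necessarily one of the metrics this paper defines in later sections) is to parse both sentences and measure how many internal tree productions the hypothesis shares with the reference, here using hand-written parses for the example above:

        from collections import Counter
        from nltk import Tree

        def structure(parse_string):
            """Multiset of non-lexical productions, i.e. the internal tree structure."""
            tree = Tree.fromstring(parse_string)
            return Counter(str(p) for p in tree.productions() if p.is_nonlexical())

        # Hand-written parses for the example sentences (illustrative only).
        ref  = "(S (NP (PRP I)) (VP (VBD had) (NP (DT a) (NN dog))))"
        hyp1 = "(S (NP (PRP I)) (VP (VBP have) (NP (DT the) (NN dog))))"
        hyp2 = "(S (NP (DT A) (NN dog)) (NP (PRP I)) (VP (VBD had)))"

        def similarity(hyp, reference):
            """Fraction of the hypothesis's productions also found in the reference."""
            h, r = structure(hyp), structure(reference)
            shared = sum(min(count, r[prod]) for prod, count in h.items())
            return shared / sum(h.values())

        print(similarity(hyp1, ref))  # 0.75: nearly the same structure as the reference
        print(similarity(hyp2, ref))  # 0.50: a markedly different structure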
  </Section>
</Paper>