<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1610">
  <Title>Re-evaluating Machine Translation Results with Paraphrase Support</Title>
  <Section position="4" start_page="77" end_page="78" type="metho">
    <SectionTitle>
2 N-gram Co-occurrence Statistics
</SectionTitle>
    <Paragraph position="0"> Being an $8 billion industry (Browner, 2006), MT calls for rapid development and the ability to differentiate good systems from less adequate ones. The evaluation process consists of comparing system-generated peer translations to human written reference translations and assigning a numeric score to each system. While human assessments are still the most reliable evaluation measurements, it is not practical to solicit manual evaluations repeatedly while making incremental system design changes that would only result in marginal performance gains. To overcome the monetary and time constraints associated with manual evaluations, automated procedures have been successful in delivering benchmarks for performance hill-climbing with little or no cost.</Paragraph>
    <Paragraph position="1"> While a variety of automatic evaluation methods have been introduced, the underlining comparison strategy is similar--matching based on lexical identity. The most prominent implementation of this type of matching is demonstrated in BLEU (Papineni et al, 2002). The remaining part of this section is devoted to an overview of BLEU, or the BLEU-esque philosophy.</Paragraph>
    <Section position="1" start_page="77" end_page="77" type="sub_section">
      <SectionTitle>
2.1 The BLEU-esque Matching Philosophy
</SectionTitle>
      <Paragraph position="0"> The primary task that a BLEU-esque procedure performs is to compare n-grams from the peer translation with the n-grams from one or more reference translations and count the number of matches. The more matches a peer translation gets, the better it is.</Paragraph>
      <Paragraph position="1"> BLEU is a precision-based metric, which is the ratio of the number of n-grams from the peer translation that occurred in reference translations to the total number of n-grams in the peer translation. The notion of Modified n-gram Precision was introduced to detect and avoid rewarding false positives generated by translation systems. To gain high precision, systems could potentially over-generate &amp;quot;good&amp;quot; n-grams, which occur multiple times in multiple references. The solution to this problem was to adopt the policy that an ngram, from both reference and peer translations, is considered exhausted after participating in a match. As a result, the maximum number of matches an n-gram from a peer translation can receive, when comparing to a set of reference translations, is the maximum number of times this n-gram occurred in any single reference translation. Papineni et al. (2002) called this capping technique clipping. Figure 1, taken from the original BLEU paper, demonstrates the computation of the modified unigram precision for a peer translation sentence.</Paragraph>
      <Paragraph position="2"> To compute the modified n-gram precision,</Paragraph>
      <Paragraph position="4"> , for a whole test set, including all translation segments (usually in sentences), the formula is:</Paragraph>
    </Section>
    <Section position="2" start_page="77" end_page="78" type="sub_section">
      <SectionTitle>
2.2 Lack of Paraphrasing Support
</SectionTitle>
      <Paragraph position="0"> Humans are very good at finding creative ways to convey the same information. There is no one definitive reference translation in one language for a text written in another. Having acknowledged this phenomenon, however natural it is, human evaluations on system-generated translations are much more preferred and trusted. However, what humans can do with ease puts machines at a loss. BLEU-esque procedures recognize equivalence only when two n-grams exhibit the same surface-level representations, i.e. the same lexical identities. The BLEU implementation addresses its deficiency in measuring semantic closeness by incorporating the comparison with multiple reference translations. The rationale is that multiple references give a higher chance that the n-grams, assuming correct translations, appearing in the peer translation would be rewarded by one of the reference's n-grams.</Paragraph>
      <Paragraph position="1"> The more reference translations used, the better  the matching and overall evaluation quality. Ideally (and to an extreme), we would need to collect a large set of human-written translations to capture all possible combinations of verbalizing variations before the translation comparison procedure reaches its optimal matching ability.</Paragraph>
      <Paragraph position="2"> One can argue that an infinite number of references are not needed in practice because any matching procedure would stabilize at a certain number of references. This is true if precision measure is the only metric computed. However, using precision scores alone unfairly rewards systems that &amp;quot;under-generate&amp;quot;--producing unreasonably short translations. Recall measurements would provide more balanced evaluations. When using multiple reference translations, if an n-gram match is made for the peer, this n-gram could appear in any of the references. The computation of recall becomes difficult, if not impossible. This problem can be reversed if there is crosschecking for phrases occurring across references--paraphrase recognition. BLEU uses the calculation of a brevity penalty to compensate the lack of recall computation problem. The brevity penalty is computed as follows: Then, the BLEU score for a peer translation is computed as: BLEU's adoption of the brevity penalty to offset the effect of not having a recall computation has drawn criticism on its crudeness in measuring translation quality. Callison-Burch et al. (2006) point out three prominent factors: * ``Synonyms and paraphrases are only handled if they are in the set of multiple reference translations [available].</Paragraph>
      <Paragraph position="3"> * The scores for words are equally weighted so missing out on content-bearing material brings no additional penalty. null * The brevity penalty is a stop-gap measure to compensate for the fairly serious problem of not being able to calculate recall.&amp;quot; null With the introduction of ParaEval, we will address two of these three issues, namely the paraphrasing problem and providing a recall measure. null</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="78" end_page="80" type="metho">
    <SectionTitle>
3 ParaEval for MT Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
3.1 Overview
</SectionTitle>
      <Paragraph position="0"> Reference translations are created from the same source text (written in the foreign language) to the target language. Ideally, they are supposed to be semantically equivalent, i.e. overlap completely. However, as shown in Figure 2, when matching based on lexical identity is used (indicated by links), only half (6 from the left and 5 from the right) of the 12 words from these two sentences are matched. Also, &amp;quot;to&amp;quot; was a mismatch. In applying paraphrase matching for MT evaluation from ParaEval, we aim to match all shaded words from both sentences.</Paragraph>
    </Section>
    <Section position="2" start_page="78" end_page="79" type="sub_section">
      <SectionTitle>
3.2 Paraphrase Acquisition
</SectionTitle>
      <Paragraph position="0"> The process of acquiring a large enough collection of paraphrases is not an easy task. Manual corpus analyses produce domain-specific collections that are used for text generation and are application-specific. But operating in multiple domains and for multiple tasks translates into multiple manual collection efforts, which could be very time-consuming and costly. In order to facilitate smooth paraphrase utilization across a variety of NLP applications, we need an unsupervised paraphrase collection mechanism that can be easily conducted, and produces paraphrases that are of adequate quality and can be readily used with minimal amount of adaptation effort.</Paragraph>
      <Paragraph position="1"> Our method (Anonymous, 2006), also illustrated in (Bannard and Callison-Burch, 2005), to automatically construct a large domain-independent paraphrase collection is based on the assumption that two different phrases of the same meaning may have the same translation in a Figure 2. Two reference translations. Grey areas are matched by using BLEU.</Paragraph>
      <Paragraph position="2">  foreign language. Phrase-based Statistical Machine Translation (SMT) systems analyze large quantities of bilingual parallel texts in order to learn translational alignments between pairs of words and phrases in two languages (Och and Ney, 2004). The sentence-based translation model makes word/phrase alignment decisions probabilistically by computing the optimal model parameters with application of the statistical estimation theory. This alignment process results in a corpus of word/phrase-aligned parallel sentences from which we can extract phrase pairs that are translations of each other. We ran the alignment algorithm from (Och and Ney, 2003) on a Chinese-English parallel corpus of 218 million English words, available from the Linguistic Data Consortium (LDC). Phrase pairs are extracted by following the method described in (Och and Ney, 2004) where all contiguous phrase pairs having consistent alignments are extraction candidates. Using these pairs we build paraphrase sets by joining together all English phrases that have the same Chinese translation.</Paragraph>
      <Paragraph position="3"> Figure 3 shows an example word/phrase alignment for two parallel sentence pairs from our corpus where the phrases &amp;quot;blowing up&amp;quot; and &amp;quot;bombing&amp;quot; have the same Chinese translation. On the right side of the figure we show the paraphrase set which contains these two phrases, which is typical in our collection of extracted paraphrases.</Paragraph>
      <Paragraph position="4"> Although our paraphrase extraction method is similar to that of (Bannard and Callison-Burch, 2005), the paraphrases we extracted are for completely different applications, and have a broader definition for what constitutes a paraphrase. In (Bannard and Callison-Burch, 2005), a language model is used to make sure that the paraphrases extracted are direct substitutes, from the same syntactic categories, etc. So, using the example in Figure 3, the paraphrase table would contain only &amp;quot;bombing&amp;quot; and &amp;quot;bombing attack&amp;quot;. Paraphrases that are direct substitutes of one another are useful when translating unknown phrases.</Paragraph>
      <Paragraph position="5"> For instance, if a MT system does not have the Chinese translation for the word &amp;quot;bombing&amp;quot;, but has seen it in another set of parallel data (not involving Chinese) and has determined it to be a direct substitute of the phrase &amp;quot;bombing attack&amp;quot;, then the Chinese translation of &amp;quot;bombing attack&amp;quot; would be used in place of the translation for &amp;quot;bombing&amp;quot;. This substitution technique has shown some improvement in translation quality (Callison-Burch et al., 2006).</Paragraph>
    </Section>
    <Section position="3" start_page="79" end_page="80" type="sub_section">
      <SectionTitle>
3.3 The ParaEval Evaluation Procedure
</SectionTitle>
      <Paragraph position="0"> We adopt a two-tier matching strategy for MT evaluation in ParaEval. At the top tier, a paraphrase match is performed on system-translated sentences and corresponding reference sentences.</Paragraph>
      <Paragraph position="1"> Then, unigram matching is performed on the words not matched by paraphrases. Precision is measured as the ratio of the total number of words matched to the total number of words in the peer translation.</Paragraph>
      <Paragraph position="2"> Running our system on the example in Figure 2, the paraphrase-matching phase consumes the words marked in grey and aligns &amp;quot;have been&amp;quot; and &amp;quot;to be&amp;quot;, &amp;quot;completed&amp;quot; and &amp;quot;fully&amp;quot;, &amp;quot;to date&amp;quot; and &amp;quot;up till now&amp;quot;, and &amp;quot;sequence&amp;quot; and &amp;quot;sequenced&amp;quot;. The subsequent unigram-matching aligns words based on lexical identity.</Paragraph>
      <Paragraph position="3"> We maintain the computation of modified uni-gram precision, defined by the BLEU-esque Philosophy, in principle. In addition to clipping individual candidate words with their corresponding maximum reference counts (only for words not matched by paraphrases), we clip candidate paraphrases by their maximum reference paraphrase counts. So two completely different phrases in a reference sentence can be counted as two occurrences of one phrase. For example in Figure 4, candidate phrases &amp;quot;blown up&amp;quot; and &amp;quot;bombing&amp;quot; matched with three phrases from the references, namely &amp;quot;bombing&amp;quot; and two instances of &amp;quot;explosion&amp;quot;. Treating these two candidate phrases as one (paraphrase match), we can see its clip is 2 (from Ref 1, where &amp;quot;bombing&amp;quot; and &amp;quot;explosion&amp;quot; are counted as two occurrences of a single phrase). The only word that was matched by its lexical identity is &amp;quot;was&amp;quot;. The modified uni-gram precision calculated by our method is 4/5, where as BLEU gives 2/5.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="80" end_page="81" type="metho">
    <SectionTitle>
4 Evaluating ParaEval
</SectionTitle>
    <Paragraph position="0"> To be effective in MT evaluations, an automated procedure should be capable of distinguishing good translation systems from bad ones, human translations from systems', and human translations of differing quality. For a particular evaluation exercise, an evaluation system produces a ranking for system and human translations, and compares this ranking with one created by human judges (Turian et al., 2003). The closer a system's ranking is to the human's, the better the evaluation system is.</Paragraph>
    <Section position="1" start_page="80" end_page="80" type="sub_section">
      <SectionTitle>
4.1 Validating ParaEval
</SectionTitle>
      <Paragraph position="0"> To test ParaEval's ability, NIST 2003 Chinese MT evaluation results were used (NIST 2003).</Paragraph>
      <Paragraph position="1"> This collection consists of 100 source documents in Chinese, translations from eight individual translation systems, reference translations from four humans, and human assessments (on fluency and adequacy). The Spearman rank-order coefficient is computed as an indicator of how close a system ranking is to gold-standard human ranking. It should be noted that the 2003 MT data is separate from the corpus that we extracted paraphrases from.</Paragraph>
      <Paragraph position="2"> For comparison purposes, BLEU</Paragraph>
      <Paragraph position="4"> run. Table 1 shows the correlation figures for the two automatic systems with the NIST rankings on fluency and adequacy. The lower and higher 95% confidence intervals are labeled as &amp;quot;L-CI&amp;quot; and &amp;quot;H-CI&amp;quot;. To estimate the significance of the rank-order correlation figures, we applied bootstrap resampling to calculate the confidence intervals. In each of 1000 runs, systems were ranked based on their translations of 100 randomly selected documents. Each ranking was compared with the NIST ranking, producing a correlation score for each run. A t-test was then  Results shown are from BLEU v.11 (NIST).</Paragraph>
      <Paragraph position="5"> performed on the 1000 correlation scores. In both fluency and adequacy measurements, ParaEval correlates significantly better than BLEU. The ParaEval scores used were precision scores. In addition to distinguishing the quality of MT systems, a reliable evaluation procedure must be able to distinguish system translations from humans' (Lin and Och, 2004). Figure 5 shows the overall system and human ranking. In the upper left corner, human translators are grouped together, significantly separated from the automatic MT systems clustered into the lower right corner.</Paragraph>
    </Section>
    <Section position="2" start_page="80" end_page="81" type="sub_section">
      <SectionTitle>
4.2 Implications to Word-alignment
</SectionTitle>
      <Paragraph position="0"> We experimented with restricting the paraphrases being matched to various lengths. When allowing only paraphrases of three or more words to match, the correlation figures become stabilized and ParaEval achieves even higher correlation with fluency measurement to 0.7619 on the Spearman ranking coefficient.</Paragraph>
      <Paragraph position="1"> This phenomenon indicates to us that the bi-gram and unigram paraphrases extracted using SMT word-alignment and phrase extraction programs are not reliable enough to be applied to evaluation tasks. We speculate that word pairs extracted from (Liang et al., 2006), where a bidirectional discriminative training method was used to achieve consensus for word-alignment  (mostly lower n-grams), would help to elevate the level of correlation by ParaEval.</Paragraph>
    </Section>
    <Section position="3" start_page="81" end_page="81" type="sub_section">
      <SectionTitle>
4.3 Implications to Evaluating Paraphrase
Quality
</SectionTitle>
      <Paragraph position="0"> Utilizing paraphrases in MT evaluations is also a realistic way to measure the quality of paraphrases acquired through unsupervised channels.</Paragraph>
      <Paragraph position="1"> If a comparison strategy, coupled with paraphrase matching, distinguishes good and bad MT and summarization systems in close accordance with what human judges do, then this strategy and the paraphrases used are of sufficient quality.</Paragraph>
      <Paragraph position="2"> Since our underlining comparison strategy is that of BLEU-1 for MT evaluation, and BLEU has been proven to be a good metric for their respective evaluation tasks, the performance of the overall comparison is directly and mainly affected by the paraphrase collection.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="81" end_page="82" type="metho">
    <SectionTitle>
5 ParaEval's Support for Recall Computation
</SectionTitle>
    <Paragraph position="0"> putation Due to the use of multiple references and allowing an n-gram from the peer translation to be matched with its corresponding n-gram from any of the reference translations, BLEU cannot be used to compute recall scores, which are conventionally paired with precision to detect lengthrelated problems from systems under evaluation.</Paragraph>
    <Section position="1" start_page="81" end_page="81" type="sub_section">
      <SectionTitle>
5.1 Using Single References for Recall
</SectionTitle>
      <Paragraph position="0"> The primary goal in using multiple references is to overcome the limitation in matching on lexical identity. More translation choices give more variations in verbalization, which could lead to more matches between peer and reference translations. Since MT results are generated and evaluated at a sentence-to-sentence level (or a segment level, where each segment may contain a small number of sentences) and no text condensation is employed, the number of different and correct ways to state the same sentence is small. This is in comparison to writing generic multi-document summaries, each of which contains multiple sentences and requires significant amount of &amp;quot;rewriting&amp;quot;. When using a large collection of paraphrases while evaluating, we are provided with the alternative verbalizations needed. This property allows us to use single references to evaluate MT results and compute recall measurements.</Paragraph>
    </Section>
    <Section position="2" start_page="81" end_page="81" type="sub_section">
      <SectionTitle>
5.2 Recall and Adequacy Correlations
</SectionTitle>
      <Paragraph position="0"> When validating the computed recall scores for MT systems, we correlate with human assessments on adequacy only. The reason is that according to the definition of recall, the content coverage in references, and not the fluency reflected from the peers, is being measured. Table 2 shows ParaEval's recall correlation with NIST 2003 Chinese MT evaluation results on systems ranking. We see that ParaEval's correlation with adequacy has improved significantly when using recall scores to rank than using precision scores.</Paragraph>
    </Section>
    <Section position="3" start_page="81" end_page="82" type="sub_section">
      <SectionTitle>
5.3 Not All Single References are Created
Equal
</SectionTitle>
      <Paragraph position="0"> Human-written translations differ not only in word choice, but also in other idiosyncrasies that cannot be captured with paraphrase recognition.</Paragraph>
      <Paragraph position="1"> So it would be presumptuous to declare that using paraphrases from ParaEval is enough to allow using just one reference translation to evaluate. Using multiple references allow more paraphrase sets to be explored in matching.</Paragraph>
      <Paragraph position="2"> In Table 3, we show ParaEval's correlation figures when using single reference translations.</Paragraph>
      <Paragraph position="3"> E01-E04 indicate the sets of human translations used correspondingly.</Paragraph>
      <Paragraph position="4"> Notice that the correlation figures vary a great deal depending on the set of single references used. How do we differentiate human translations and know which set of references to use? It is difficult to quantify the quality that a human written translation reflects. We can only define &amp;quot;good&amp;quot; human translations as translations that are written not very differently from what other humans would write, and &amp;quot;bad&amp;quot; translations as the ones that are written in an unconventional fashion. Table 4 shows the differences between the four sets of reference translations when com- null paring one set of references to the other three.</Paragraph>
      <Paragraph position="5"> The scores here are the raw ParaEval precision scores. E01 and E03 are better, which explains the higher correlations ParaEval has using these two sets of references individually, shown in Table 3.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>