<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3112">
  <Title>Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation</Title>
  <Section position="3" start_page="86" end_page="87" type="metho">
    <SectionTitle>
2 Automatic MT evaluation
</SectionTitle>
    <Paragraph position="0"> The insensitivity of BLEU and NIST to perfectly legitimate variation has been raised, among others, in (Callison-Burch et al., 2006), but the criticism is widespread. Even the creators of BLEU point out that it may not correlate particularly well with human judgment at the sentence level (Papineni et al., 2002), a problem also noted by (Och et al., 2003) and (Russo-Lassner et al., 2005). A side effect of this phenomenon is that BLEU is less reliable for smaller data sets, so the advantage it provides in the speed of evaluation is to some extent counterbalanced by the time spent by developers on producing a sufficiently large test data set in order to obtain a reliable score for their system.</Paragraph>
    <Paragraph position="1"> Recently a number of attempts to remedy these shortcomings have led to the development of other automatic machine translation metrics. Some of them concentrate mainly on the word reordering aspect, like Maximum Matching String (Turian et al., 2003) or Translation Error Rate (Snover et al., 2005). Others try to accommodate both syntactic and lexical differences between the candidate translation and the reference, like CDER (Leusch et al., 2006), which employs a version of edit distance for word substitution and reordering; METEOR (Banerjee and Lavie, 2005), which uses stemming and WordNet synonymy; and a linear regression model developed by (Russo-Lassner et al., 2005), which makes use of stemming, Word-Net synonymy, verb class synonymy, matching noun phrase heads, and proper name matching.</Paragraph>
    <Paragraph position="2"> A closer examination of these metrics suggests that the accommodation of lexical equivalence is as difficult as the appropriate treatment of syntactic variation, in that it requires considerable external knowledge resources like WordNet, verb class databases, and extensive text preparation: stemming, tagging, etc. The advantage of our method is that it produces relevant paraphrases with nothing more than the evaluation bitext and a widely available word and phrase alignment software, and therefore can be used with any existing evaluation metric.</Paragraph>
  </Section>
  <Section position="4" start_page="87" end_page="89" type="metho">
    <SectionTitle>
3 Contextual bitext-derived paraphrases
</SectionTitle>
    <Paragraph position="0"> The method presented in this paper rests on a combination of two simple ideas. First, the components necessary for automatic MT evaluation like BLEU or NIST, a source text and a reference text, constitute a miniature parallel corpus, from which word and phrase alignments can be extracted automatically, much like during the training for a statistical machine translation system. Second, target language words e</Paragraph>
    <Paragraph position="2"> aligned as the likely translations to a source language word f i are often synonyms or near-synonyms of each other. This also holds for phrases: target language phrases ep</Paragraph>
    <Paragraph position="4"> aligned with a source language phrase fp i are often paraphrases of each other. For example, in our experiment, for the French word question the most probable automatically aligned English translations are question, matter, and issue, which in English are practically synonyms. Section 3.2 presents more examples of such equivalent expressions. null</Paragraph>
    <Section position="1" start_page="87" end_page="89" type="sub_section">
      <SectionTitle>
3.2 Experimental design
</SectionTitle>
      <Paragraph position="0"> For our experiment, we used two test sets, each consisting of 2000 sentences, drawn randomly from the test section of the Europarl parallel corpus. The source language was French and the target language was English. One of the test sets was translated by Pharaoh trained on 156,000 French-English sentence pairs. The other test set was translated by Logomedia, a commercially available rule-based MT system. Each test set consisted therefore of three files: the French source file, the English translation file, and the English reference file.</Paragraph>
      <Paragraph position="1"> Each translation was evaluated by the BLEU and NIST metrics first with the single reference, then with the multiple references for each sentence using the paraphrases automatically generated from the source-reference mini corpus. A subset of a 100 sentences was randomly extracted from each test set and evaluated by two independent human judges with respect to accuracy and fluency; the human scores were then compared to the BLEU and NIST scores for the single-reference and the automatically generated multiple-reference files.</Paragraph>
      <Paragraph position="2"> Word alignment and phrase extraction We used the GIZA++ word alignment software null  to produce initial word alignments for our miniature bilingual corpus consisting of the source French file and the English reference file, and the refined word alignment strategy of (Och and Ney, 2003; Koehn et al., 2003; Tiedemann, 2004) to obtain improved word and phrase alignments.</Paragraph>
      <Paragraph position="3"> For each source word or phrase f i that is aligned with more than one target words or phrases, its possible translations e</Paragraph>
      <Paragraph position="5"> placed in a list as equivalent expressions (i.e.</Paragraph>
      <Paragraph position="6"> synonyms, near-synonyms, or paraphrases of each  other). A few examples are given in (1). (1) agreement - accordance  adopted - implemented matter - lot - case funds - money arms - weapons area - aspect question - issue - matter we would expect - we certainly expect bear on - are centred around Alignment divides target words and  phrases into equivalence sets; each set corresponds to one source word/phrase that was originally aligned with the target elements. For example, for the French word citoyens three English words were deemed to be the most appropriate translations: people, public, and citizens; therefore these three words constitute an equivalence set. Another French word population was aligned with two English translations: population and people; so the word people appears in two equivalence set (this gives rise to the question of equivalence transitivity, which will be discussed in Section 3.3). From the 2000-sentence evaluation bitext we derived 769 equivalence sets, containing in total 1658 words or phrases. Each set contained on average two or three elements. In effect, we produced at least one equivalent expression for 1658 English words or phrases.</Paragraph>
      <Paragraph position="7"> An advantage of our method is that the target paraphrases and words come ordered with re-</Paragraph>
      <Paragraph position="9"> spect to their likelihood of being the translation of the source word or phrase - each of them is assigned a probability expressing this likelihood, so we are able to choose only the most likely translations, according to some experimentally established threshold. The experiment reported here was conducted without such a threshold, since the word and phrase alignment was of a very high quality.</Paragraph>
      <Paragraph position="10">  Domain-specific lexical and syntactic paraphrases It is important to notice here how the paraphrases produced are more appropriate to the task at hand than synonyms extracted from a general-purpose thesaurus or WordNet. First, our paraphrases are contextual - they are restricted to only those relevant to the domain of the text, since they are derived from the text itself. Given the context provided by our evaluation bitext, the word area in (1) turns out to be only synonymous with aspect, and not with land, territory, neighbourhood, division, or other synonyms a general-purpose thesaurus or WordNet would give for this entry. This allows us to limit our multiple references only to those that are likely to be useful in the context provided by the source text. Second, the phrase alignment captures something neither a thesaurus nor WordNet will be able to provide: a certain amount of syntactic variation of paraphrases. Therefore, we know that a string such as we would expect in (1), with the sequence noun-aux-verb, might be paraphrased by we certainly expect, a sequence of noun-adv-verb.</Paragraph>
      <Paragraph position="11"> Open and closed class items One important conclusion we draw from analysing the synonyms obtained through word alignment is that equivalence is limited mainly to words that belong to open word classes, i.e. nouns, verbs, adjectives, adverbs, but is unlikely to extend to closed word classes like prepositions or pronouns. For instance, while the French preposition a can be translated in English as to, in, or at, depending on the context, it is not the case that these three prepositions are synonymous in English. The division is not that clear-cut, however: within the class of pronouns, he, she, and you are definitely not synonymous, but the demonstrative pronouns this and that might be considered equivalent for some purposes. Therefore, in our experiment we exclude prepositions and in future work we plan to examine the word alignments more closely to decide whether to exclude any other words.</Paragraph>
      <Paragraph position="12"> Creating multiple references After the list of synonyms and paraphrases is extracted from the evaluation bitext, for each reference sentence a string search replaces every eligible word or phrase with its equivalent(s) from the paraphrase list, one at a time, and the resulting string is added to the array of references. The original string is added to the array as well. This process results in a different number of reference sentences for every test sentence, depending on whether there was anything to replace in the reference and how many paraphrases we have available for the original substring. One example of this process is shown in (2).</Paragraph>
      <Paragraph position="13"> (2) Original reference: i admire the answer mrs parly gave this morning but we have turned a blind eye to that  i admire the reply mrs parly gave this morning but we have turned a blind eye to that  i admire the answer mrs parly gave this morning however we have turned a blind eye to that  i admire the answer mrs parly gave this morning but we have turned a blind eye to it Transitivity As mentioned before, an interesting question that arises here is the potential transitivity of our automatically derived synonyms/paraphrases. It could be argued that if the word people is equivalent to public according to one set from our list, and to the word population according to another set, then public can be thought of as equivalent to population. In this case, the equivalence is not controversial. However, consider the following relation: if sure in one of the equivalence sets is synonymous to certain, and certain in a different  set is listed as equivalent to some, then treating sure and some as synonyms is a mistake. In our experiment we do not allow synonym transitivity; we only use the paraphrases from equivalence sets containing the word/phrase we want to replace.</Paragraph>
      <Paragraph position="14"> Multiple simultaneous substitution Note that at the moment the references we are producing do not contain multiple simultaneous substitutions of equivalent expressions; for example, in (2) we currently do not produce the follow- null ing versions: (3) Paraphrase 4: i admire the reply mrs parly gave this morning however we have turned a blind eye to that</Paragraph>
      <Paragraph position="16"> i admire the answer mrs parly gave this morning however we have turned a blind eye to it Paraphrase 6: i admire the reply mrs parly gave this morning but we have turned a blind eye to it This can potentially prevent higher n-grams being successfully matched if two or more equivalent expressions find themselves within the range of n-grams being tested by BLEU and NIST. To avoid combinatorial problems, implementing multiple simultaneous substitutions could be done using a lattice, much like in (Pang et al., 2003).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="89" end_page="90" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> As expected, the use of multiple references produced by our method raises both the BLEU and NIST scores for translations produced by Pharaoh (test set PH) and Logomedia (test set LM). The results are presented in Table 1.</Paragraph>
    <Paragraph position="1">  reference scores for test set PH and test set LM The hypothesis that the multiple-reference scores reflect better human judgment is also confirmed. For 100-sentence subsets (Subset PH and Subset LM) randomly extracted from our test sets PH and LM, we calculated Pearson's correlation between the average accuracy and fluency scores that the translations in this subset received from two human judges (for each subset) and the single-reference and multiple-reference sentence-level BLEU and NIST scores.</Paragraph>
    <Paragraph position="2"> There are two issues that need to be noted at this point. First, BLEU scored many of the sentences as zero, artificially leveling many of the weaker translations.</Paragraph>
    <Paragraph position="3">  This explains the low, although still statistically significant (p value &lt;  ) correlation with BLEU for both single and multiple reference translations. Using a version of BLEU with add-one smoothing we obtain considerably higher correlations. Table 2 shows Pearson's correlation coefficient for BLEU, BLEU with add-one smoothing, NIST, and human judgments for Subsets PH. Multiple paraphrase references produced by our method consistently lead to a higher correlation with human judgment for every metric.</Paragraph>
    <Paragraph position="4">  reference BLEU, smoothed BLEU, and NIST for subset PH (of test set PH) The second issue that requires explanation is the lower general scores Logomedia's translation received on the full set of 2000 sentences, and the extremely low correlation of its automatic evaluation with human judgment, irrespective of the number of references. It has been noticed (Calli null BLEU uses a geometric average while calculating the sentence-level score and will score a sentence as 0 if it does not have at least one 4-gram.</Paragraph>
    <Paragraph position="5">  A critical value for Pearson's correlation coefficient for the sample size between 90 and 100 is 0.267, with p &lt; 0.01.  The significance of the rise in scores was confirmed in a resampling/bootstrapping test, with p &lt; 0.0001.  son-Burch et al., 2006) that BLEU and NIST favour n-gram based MT models such as Pharaoh, so the translation produced by Logomedia scored lower on the automatic evaluation, even though the human judges rated Logomedia output higher than Pharaoh's translation. Both human judges consistently gave very high scores to most sentences in subset LM (Logomedia), and as a consequence there was not enough variation in the scores assigned by them to create a good correlation with the BLEU and NIST scores. The average human scores for the subsets PH and LM and the coefficients of variation are presented in Table 3. It is easy to see that Logomedia's translation received a higher mean score (on a scale 0 to 5) from the human judges and with less variance than Pharaoh.  cients of variation for Subset PH and Subset LM As a result of the consistently high human scores for Logomedia, none of the Pearson's correlations computed for Subset LM is high enough to be significant. The values are lower than the critical value 0.164 corresponding to p &lt; 0.10.</Paragraph>
  </Section>
</Paper>