<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1002">
  <Title>Going Beyond AER: An Extensive Analysis of Word Alignments and Their Impact on MT</Title>
  <Section position="5" start_page="9" end_page="11" type="metho">
    <SectionTitle>
3 Intrinsic Evaluation of Alignments
</SectionTitle>
    <Paragraph position="0"> Our goal is to compare different alignments and to investigate how their characteristics affect the MT systems. We evaluate alignments in terms of precision, recall, alignment error rate (AER), and a new measure called consistent phrase error rate (CPER).</Paragraph>
    <Paragraph position="1"> We focus on 5 different alignments obtained by combining two uni-directional alignments. Each uni-directional alignment is the result of running GIZA++ (Och, 2000b) in one of two directions (source-to-target and vice versa) with default configurations. The combined alignments that are used in this paper are as follows:  1. Union of both directions (SU), 2. Intersection of both directions (SI), 3. A heuristic based combination technique called grow-diag-final (SG), which is the default alignment combination heuristic employed in Pharaoh (Koehn, 2004), 4-5. Two supervised alignment combination techniques (SA and SB) using 2 and 4 in- null put alignments as described in (Ayan et al., 2005).</Paragraph>
    <Paragraph position="2"> This paper examines the impact of alignments according to their orientation toward precision or recall. Among the five alignments above, SU and SG are recall-oriented while the other three are precision-oriented. SB is an improved version of SA which attempts to increase recall without a significant sacrifice in precision.</Paragraph>
    <Paragraph position="3"> Manually aligned data from two language pairs are used in our intrinsic evaluations using the five combinations above. A summary of the training and test data is presented in Table 1.</Paragraph>
    <Paragraph position="4"> Our gold standard for each language pair is a manually aligned corpus. English-Chinese an- null notations distinguish between sure and probable alignment links, but English-Arabic annotations do not. The details of how the annotations are done can be found in (Ayan et al., 2005) and (Ittycheriah and Roukos, 2005).</Paragraph>
    <Section position="1" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
3.1 Precision, Recall and AER
</SectionTitle>
      <Paragraph position="0"> Table 2 presents the precision, recall, and AER for 5 different alignments on 2 language pairs. For each of these metrics, a different system achieves the best score - respectively, these are SI, SU, and SB. SU and SG yield low precision, high recall alignments. In contrast, SI yields very high precision but very low recall. SA and SB attempt to balance these two measures but their precision is still higher than their recall. Both systems have nearly the same precision but SB yields significantly higher recall than SA.</Paragraph>
      <Paragraph position="1"> Align. en-ch en-ar Sys. Pr Rc AER Pr Rc AER</Paragraph>
    </Section>
    <Section position="2" start_page="10" end_page="11" type="sub_section">
      <SectionTitle>
3.2 Consistent Phrase Error Rate
</SectionTitle>
      <Paragraph position="0"> In this section, we present a new method, called consistent phrase error rate (CPER), for evaluating word alignments in the context of phrase-based MT. The idea is to compare phrases consistent with a given alignment against phrases that would be consistent with human alignments.</Paragraph>
      <Paragraph position="1"> CPER is similar to AER but operates at the phrase level instead of at the word level. To compute CPER, we define a link in terms of the position of its start and end words in the phrases. For instance, the phrase link (i1,i2,j1,j2) indicates that the English phrase ei1,...,ei2 and the FL phrase fj1,...,fj2 are consistent with the given alignment. Once we generate the set of phrases PA and PG that are consistent with a given alignment A and a manual alignment G, respectively, we compute precision (Pr), recall (Rc), and CPER as follows:1</Paragraph>
      <Paragraph position="3"> Phrase Lengths of 3 and 7 CPER penalizes incorrect or missing alignment links more severely than AER. While computing AER, an incorrect alignment link reduces the number of correct alignment links by 1, affecting precision and recall slightly. Similarly, if there is a missing link, only the recall is reduced slightly. However, when computing CPER, an incorrect or missing alignment link might result in more than one phrase pair being eliminated from or added to the set of phrases. Thus, the impact is more severe on both precision and recall.</Paragraph>
      <Paragraph position="4">  alignment and an automated alignment: Gray cells show the alignment links, and rectangles show the possible phrases. In Figure 1, the first box represents a manual alignment and the other two represent automated alignments A. In the case of a missing alignmentlink(Figure1b),PA includes9validphrases. For this alignment, AER = 1 [?] (2 x 2/2 x 2/3)/(2/2 + 2/3) = 0.2 and CPER = 1[?](2x 5/9x5/6)/(5/9+5/6) = 0.33. In the case of an incorrect alignment link (Figure 1c), PA includes only 2 valid phrases, which results in a higher</Paragraph>
      <Paragraph position="6"> but a lower AER (1 [?] (2 x 3/4 x 3/3)/(3/4 + 3/3) = 0.14).</Paragraph>
      <Paragraph position="7"> Table 3 presents the CPER values on two different language pairs, using 2 different maximum phraselengths. Forbothmaximumphraselengths, SA and SB yield the lowest CPER. For all 5 alignments--in both languages--CPER increases as the length of the phrase increases. For all alignments except SI, this amount of increase is nearly the same on both languages. Since SI contains very few alignment points, the number of generated phrases dramatically increases, yielding  poor precision and CPER as the maximum phrase length increases.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="11" end_page="15" type="metho">
    <SectionTitle>
4 Evaluating Alignments within MT
</SectionTitle>
    <Paragraph position="0"> We now move from intrinsic measurement to extrinsic measurement using an off-the-shelf phrase-based MT system Pharaoh (Koehn, 2004). Our goal is to identify the characteristics of alignments that change MT behavior and the types of changes induced by these characteristics.</Paragraph>
    <Paragraph position="1"> All MT system components were kept the same in our experiments except for the component that generates a phrase table from a given alignment.</Paragraph>
    <Paragraph position="2"> We used the corpora presented in Table 1 to train the MT system. The phrases were scored using translation probabilities and lexical weights in two directions and a phrase penalty score. We also use a language model, a distortion model and a word penalty feature for MT.</Paragraph>
    <Paragraph position="3"> We measure the impact of different alignments  on Pharaoh using three different settings: 1. Different maximum phrase length, 2. Different sizes of training data, and 3. Different lexical weighting.</Paragraph>
    <Paragraph position="4">  For maximum phrase length, we used 3 (based onwhatwassuggestedby(Koehn etal., 2003)and 7(thedefaultmaximumphraselengthinPharaoh).</Paragraph>
    <Paragraph position="5"> For lexical weighting, we used the original weighting scheme employed in Pharaoh and a modified version. We realized that the publicly-available implementation of Pharaoh computes the lexical weights only for non-NULL alignment links. As a consequence, loose phrases containingNULL-alignedwordsalongtheiredgesreceive null the same lexical weighting as tight phrases without NULL-aligned words along the edges. We therefore adopted a modified weighting scheme following(Koehnetal., 2003), whichincorporates NULL alignments.</Paragraph>
    <Paragraph position="6"> MT output was evaluated using the standard evaluation metric BLEU (Papineni et al., 2002).2 The parameters of the MT System were optimized for BLEU metric on NIST MTEval'2002 test sets using minimum error rate training (Och, 2003), and the systems were tested on NIST MTEval'2003 test sets for both languages.</Paragraph>
    <Paragraph position="7"> 2We used the NIST script (version 11a) for BLEU with its default settings: case-insensitive matching of n-grams up to n = 4, and the shortest reference sentence for the brevity penalty. The words that were not translated during decoding were deleted from the MT output before running the BLEU script.</Paragraph>
    <Paragraph position="8"> The SRI Language Modeling Toolkit was used totrainatrigrammodelwithmodifiedKneser-Ney smoothing on 155M words of English newswire text, mostly from the Xinhua portion of the Gigaword corpus. During decoding, the number of English phrases per FL phrase was limited to 100 and phrase distortion was limited to 4.</Paragraph>
    <Section position="1" start_page="11" end_page="12" type="sub_section">
      <SectionTitle>
4.1 BLEU Score Comparison
</SectionTitle>
      <Paragraph position="0"> on Chinese with five different alignments using different settings for maximum phrase length (3 vs. 7), size of training data (107K vs. 241K), and lexical weighting (original vs. modified).3 The modified lexical weighting yields huge improvements when the alignment leaves several words unaligned: the BLEU score for SA goes from 24.26 to 25.31 and the BLEU score for SB goes from 23.91 to 25.38. In contrast, when the alignments contain a high number of alignment links (e.g., SU and SG), modifying lexical weighting does not bring significant improvements because the number of phrases containing unaligned words is relatively low. Increasing the phrase length increases the BLEU scores for all systems by nearly 0.7 points and increasing the size of the training data increases the BLEU scores by 1.5-2 points for all systems. For all settings, SU yields the lowest BLEU scores while SB clearly outperforms the others.</Paragraph>
      <Paragraph position="1"> Table 5 presents BLEU scores for Pharaoh runs on5differentalignmentsonEnglish-Arabic,using different settings for lexical weighting and maximum phrase lengths.4 Using the original lexical weighting, SA and SB perform better than the others while SU and SI yield the worst results.</Paragraph>
      <Paragraph position="2"> Modifying the lexical weighting leads to slight reductions in BLEU scores for SU and SG, but improves the scores for the other 3 alignments significantly. Finally, increasing the maximum phrase length to 7 leads to additional improvements in BLEU scores, where SG and SU benefit nearly 2 BLEU points. As in English-Chinese, the worst BLEU scores are obtained by SU while the best scores are produced by SB.</Paragraph>
      <Paragraph position="3"> As we see from the tables, the relation between intrinsic alignment measures (AER and CPER)  Lexical Weightings and Maximum Phrase Lengths and the corresponding BLEU scores varies, depending on the language, lexical weighting, maximum phrase length, and training data size. For example,usingamodifiedlexicalweighting,thesys- null tems are ranked according to their BLEU scores as follows: SB, SA, SG, SI, SU--an ordering that differs from that of AER but is identical to that of CPER (with a phrase length of 3) for Chinese. On the other hand, in Arabic, both AER and CPER provide a slightly different ranking from that of BLEU, with SG and SI swapping places.</Paragraph>
    </Section>
    <Section position="2" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
4.2 Tight vs. Loose Phrases
</SectionTitle>
      <Paragraph position="0"> To demonstrate how alignment-related components of the MT system might change the translation quality significantly, we did an additional  experimenttocomparedifferenttechniquesforextracting phrases from a given alignment. Specifically, we are comparing two techniques for phrase extraction:  1. Loose phrases (the original 'consistent phrase extraction' method) 2. Tight phrases (the set of phrases where  thefirst/lastwordsoneachsideareforced to align to some word in the phrase pair) Using tight phrases penalizes alignments with many unaligned words, whereas using loose phrases rewards them. Our goal is to compare the performance of precision-oriented vs. recall-oriented alignments when we allow only tight phrases in the phrase extraction step. To simplify things, we used only 2 alignments: SG, the best recall-oriented alignment, and SB, the best precision-oriented alignment. For this experiment, we used modified lexical weighting and a maximum phrase length of 7.</Paragraph>
    </Section>
    <Section position="3" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
Table 6: BLEU scores for SG and SB with loose vs. tight phrase extraction (Chinese and Arabic)
</SectionTitle>
      <Paragraph position="0"> using two different phrase extraction techniques on English-Chinese and English-Arabic. In both languages, SB outperforms SG significantly when loose phrases are used. However, when we use only tight phrases, the performance of SB gets significantly worse (3.5 to 6.5 BLEU-score reduction in comparison to loose phrases). The performance of SG also gets worse but the degree of BLEU-score reduction is less than that of SB. Overall SG performs better than SB with tight phrases; forEnglish-Arabic,thedifferencebetweenthetwo systems is more than 3 BLEU points. Note that, as before, the relation between the alignment measures and the BLEU scores varies, this time depending on whether loose phrases or tight phrases are used: both CPER and AER track the BLEU rankings for loose (but not for tight) phrases.</Paragraph>
      <Paragraph position="1"> This suggests that changing alignment-related components of the system (i.e., phrase extraction and phrase scoring) influences the overall translation quality significantly for a particular alignment. Therefore, when comparing two alignments in the context of a MT system, it is important to take the alignment characteristics into account. For instance, alignments with many unaligned words are severely penalized when using tight phrases.</Paragraph>
    </Section>
    <Section position="4" start_page="12" end_page="13" type="sub_section">
      <SectionTitle>
4.3 Untranslated Words
</SectionTitle>
      <Paragraph position="0"> We analyzed the percentage of words left untranslated during decoding. Figure 2 shows the percentage of untranslated words in the FL using the Chinese and Arabic NIST MTEval'2003 test sets.</Paragraph>
      <Paragraph position="1"> On English-Chinese data (using all four settings given in Table 4) SU and SG yield the highest percentage of untranslated words while SI produces the lowest percentage of untranslated words. SA and SB leave about 2% of the FL words phrases  number of FL words without translating them. Increasing the training data size reduces the percentage of untranslated words by nearly half with all five alignments. No significant impact on untranslated words is observed from modifying the lexical weights and changing the phrase length.</Paragraph>
      <Paragraph position="2"> On English-Arabic data, all alignments result in higher percentages of untranslated words than English-Chinese, most likely due to data sparsity. As in Chinese-to-English translation, SU is the worst and SB is the best. SI behaves quite differently, leaving nearly 7% of the words untranslated--an indicator of why it produces a higher BLEU score on Chinese but a lower score on Arabic compared to other alignments.</Paragraph>
    </Section>
    <Section position="5" start_page="13" end_page="13" type="sub_section">
      <SectionTitle>
4.4 Analysis of Phrase Tables
</SectionTitle>
      <Paragraph position="0"> This section presents several experiments to analyze how different alignments affect the size of the generated phrase tables, the distribution of the phrases that are used in decoding, and the coverage of the test set with the generated phrase tables. Size of Phrase Tables The major impact of using different alignments in a phrase-based MT system is that each one results in a different phrase table. Table 7 presents the number of phrases that are extracted from five alignments using two different maximum phrase lengths (3 vs. 7) in two languages, after filtering the phrase table for MTEval'2003 test set. The size of the phrase table increases dramatically as the number of links in the initial alignment gets smaller. As a result, for both languages, SU and SG yield a much smaller</Paragraph>
    </Section>
    <Section position="6" start_page="13" end_page="15" type="sub_section">
      <SectionTitle>
Table 7: Number of Extracted Phrases (Chinese and Arabic)
</SectionTitle>
      <Paragraph position="0"> phrase table than the other three alignments. As the maximum phrase length increases, the size of the phrase table gets bigger for all alignments; however, the growth of the table is more significant for precision-oriented alignments due to the high number of unaligned words.</Paragraph>
      <Paragraph position="1"> Distribution of Phrases To investigate how the decoder chooses phrases of different lengths, we analyzed the distribution of the phrases in the filtered phrase table and the phrases that were used to decode Chinese MTEval'2003 test set.5 For the remaining experiments in the paper, we use modified lexical weighting, a maximum phrase length of 7, and 107K sentence pairs for training.</Paragraph>
      <Paragraph position="2"> The top row in Figure 3 shows the distribution of the phrases generated by the five alignments (using a maximum phrase length of 7) according to their length. The &amp;quot;j-i&amp;quot; designators correspond to the phrase pairs with j FL words and i English words. For SU and SG, the majority of the phrases contain only one FL word, and the percentage of the phrases with more than 2 FL words is less than 18%. For the other three alignments, however, the distribution of the phrases is almost inverted. For SI, nearly 62% of the phrases contain more than 3 words on either FL or English side; for SA and SB, this percentage is around 45-50%.</Paragraph>
      <Paragraph position="3"> Given the completely different phrase distribution, the most obvious question is whether the longer phrases generated by SI, SA and SB are actually used in decoding. In order to investigate this, we did an analysis of the phrases used to decode the same test set.</Paragraph>
      <Paragraph position="4"> The bottom row of Figure 3 shows the percentage of phrases used to decode the Chinese MTEval'2003 test set. The distribution of the actual phrases used in decoding is completely the reverse of the distribution of the phrases in the entire filtered table. For all five alignments, the majority of the used phrases is one-to-one (between  majority of the phrase table contains phrases with more than 3 words on one side. It is surprising that the inclusion of phrase pairs with more than 3 words in the search space increases the BLEU score although the majority of the phrases used in decoding is mostly one-to-one.</Paragraph>
      <Paragraph position="5"> Length of the Phrases used in Decoding We also investigated the number and length of phrases that are used to decode the given test set for different alignments. Table 8 presents the average number of English and FL words in the phrases used in decoding Chinese MTEval'2003 test set.</Paragraph>
      <Paragraph position="6"> The decoder uses fewer phrases with SI, SA and SB than for the other two, thus yielding a higher number of FL words per phrase. The number of English words per phrase is also higher for these three systems than the other two.</Paragraph>
      <Paragraph position="7"> Coverage of the Test Set Finally, we examine the coverage of a test set using phrases of a specific length in the phrase table. Table 9 presents  the coverage of the Chinese MTEval'2003 test set (source side) using only phrases of a particular length (from 1 to 7). For this experiment, we assume that a word in the test set is covered if it is part of a phrase pair that exists in the phrase table (if a word is part of multiple phrases, it is counted only once). Not surprisingly, using only phrases with one FL word, more than 90% of the test set can be covered for all 5 alignments. As the length of the phrases increases, the coverage of the test set decreases. For instance, using phrases with 5 FL words results in less than 5% coverage of the test set.</Paragraph>
      <Paragraph position="8">  is higher for precision-oriented alignments than recall-oriented alignments for all different lengths of the phrases. For instance, SI, SA, and SB cover nearly 75% of the corpus using only phrases with 2 FL words, and nearly 36% of the corpus using phraseswith3FLwords. Thissuggeststhatrecallorientedalignmentsfailtocatchasignificantnum- null ber of phrases that would be useful to decode this test set, and precision-oriented alignments would yield potentially more useful phrases.</Paragraph>
      <Paragraph position="9"> Since precision-oriented alignments make a higher number of longer phrases available to the decoder (based on the coverage of phrases presented in Table 9), they are used more during decoding. Consequently, the major difference between the alignments is the coverage of the phrases extracted from different alignments. The more the phrase table covers the test set, the more the longer phrases are used during decoding, and precision-oriented alignments are better at generating high-coverage phrases than recall-oriented alignments.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>