<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1017">
  <Title>Statistical Phrase-Based Translation</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We used the freely available Europarl corpus 2 to carry out experiments. This corpus contains over 20 million words in each of the eleven official languages of the European Union, covering the proceedings of the European Parliament 1996-2001. 1755 sentences of length 5-15 were reserved for testing.</Paragraph>
    <Paragraph position="1"> In all experiments in Section 4.1-4.6 we translate from German to English. We measure performance using the BLEU score [Papineni et al., 2001], which estimates the accuracy of translation output with respect to a reference translation.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Comparison of Core Methods
</SectionTitle>
      <Paragraph position="0"> First, we compared the performance of the three methods for phrase extraction head-on, using the same decoder (Section 2) and the same trigram language model. Figure 1 displays the results.</Paragraph>
      <Paragraph position="1"> In direct comparison, learning all phrases consistent with the word alignment (AP) is superior to the joint model (Joint), although not by much. The restriction to only syntactic phrases (Syn) is harmful. We also included in the figure the performance of an IBM Model 4 word-based translation system (M4), which uses a greedy decoder [Germann et al., 2001]. Its performance is worse than both AP and Joint. These results are consistent over training corpus sizes from 10,000 sentence pairs to 320,000 sentence pairs. All systems improve with more data.</Paragraph>
      <Paragraph position="2"> Table 1 lists the number of distinct phrase translation pairs learned by each method and each corpus. The number grows almost linearly with the training corpus size, due to the large number of singletons. The syntactic restriction eliminates over 80% of all phrase pairs.</Paragraph>
      <Paragraph position="3"> Note that the millions of phrase pairs learned fit easily into the working memory of modern computers. Even the largest models take up only a few hundred megabyte of RAM.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Weighting Syntactic Phrases
</SectionTitle>
      <Paragraph position="0"/>
      <Paragraph position="2"> pairs consistent with a word alignment (AP), phrase pairs from the joint model (Joint), IBM Model 4 (M4), and only syntactic phrases (Syn) One way to check this is to use all phrase pairs and give more weight to syntactic phrase translations. This can be done either during the data collection - say, by counting syntactic phrase pairs twice - or during translation - each time the decoder uses a syntactic phrase pair, it credits a bonus factor to the hypothesis score.</Paragraph>
      <Paragraph position="3"> We found that neither of these methods result in significant improvement of translation performance. Even penalizing the use of syntactic phrase pairs does not harm performance significantly. These results suggest that requiring phrases to be syntactically motivated does not lead to better phrase pairs, but only to fewer phrase pairs, with the loss of a good amount of valuable knowledge.</Paragraph>
      <Paragraph position="4"> One illustration for this is the common German &amp;quot;es gibt&amp;quot;, which literally translates as &amp;quot;it gives&amp;quot;, but really means &amp;quot;there is&amp;quot;. &amp;quot;Es gibt&amp;quot; and &amp;quot;there is&amp;quot; are not syntactic constituents. Note that also constructions such as &amp;quot;with regard to&amp;quot; and &amp;quot;note that&amp;quot; have fairly complex syntactic representations, but often simple one word translations. Allowing to learn phrase translations over such sentence fragments is important for achieving high performance. null</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Maximum Phrase Length
</SectionTitle>
      <Paragraph position="0"> How long do phrases have to be to achieve high performance? Figure 2 displays results from experiments with different maximum phrase lengths. All phrases consistent with the word alignment (AP) are used. Surprisingly, limiting the length to a maximum of only three words</Paragraph>
      <Paragraph position="2"> show that length 3 is enough  maximum phrase length limits per phrase already achieves top performance. Learning longer phrases does not yield much improvement, and occasionally leads to worse results. Reducing the limit to only two, however, is clearly detrimental. Allowing for longer phrases increases the phrase translation table size (see Table 2). The increase is almost linear with the maximum length limit. Still, none of these model sizes cause memory problems.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Lexical Weighting
</SectionTitle>
      <Paragraph position="0"> One way to validate the quality of a phrase translation pair is to check, how well its words translate to each other.</Paragraph>
      <Paragraph position="1"> For this, we need a lexical translation probability distribution a0 a5 a2 a8 a8 a10 . We estimated it by relative frequency from the same word alignments as the phrase model.</Paragraph>
      <Paragraph position="3"> A special English NULL token is added to each English sentence and aligned to each unaligned foreign word.</Paragraph>
      <Paragraph position="4"> Given a phrase pair a1a2 a0  a8 and a word alignment a11 between the foreign word positions a19 a12 a20 a0 a5a6a5a7a5 a0 a46 and the English word positions a6 a12a9a8 a0 a20 a0 a5a6a5a7a5 a0 a45 , we compute the lexical weight a3a11a10 by</Paragraph>
      <Paragraph position="6"/>
      <Paragraph position="8"> an alignment a11 and a lexical translation probability distribution a0 a5 a22 a10 See Figure 3 for an example.</Paragraph>
      <Paragraph position="9"> If there are multiple alignments a11 for a phrase pair  The parameter a66 defines the strength of the lexical weight a3a11a10 . Good values for this parameter are around 0.25.</Paragraph>
      <Paragraph position="10"> Figure 4 shows the impact of lexical weighting on machine translation performance. In our experiments, we achieved improvements of up to 0.01 on the BLEU score scale. Again, all phrases consistent with the word alignment are used (Section 3.1).</Paragraph>
      <Paragraph position="11"> Note that phrase translation with a lexical weight is a special case of the alignment template model [Och et al., 1999] with one word class for each word. Our simplification has the advantage that the lexical weights can be factored into the phrase translation table beforehand, speeding up decoding. In contrast to the beam search decoder for the alignment template model, our decoder is able to search all possible phrase segmentations of the input sentence, instead of choosing one segmentation before decoding. null</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.5 Phrase Extraction Heuristic
</SectionTitle>
      <Paragraph position="0"> Recall from Section 3.1 that we learn phrase pairs from word alignments generated by Giza++. The IBM Models that this toolkit implements only allow at most one English word to be aligned with a foreign word. We remedy this problem with a heuristic approach.</Paragraph>
      <Paragraph position="1">  First, we align a parallel corpus bidirectionally - foreign to English and English to foreign. This gives us two word alignments that we try to reconcile. If we intersect the two alignments, we get a high-precision alignment of high-confidence alignment points. If we take the union of the two alignments, we get a high-recall alignment with additional alignment points.</Paragraph>
      <Paragraph position="2"> We explore the space between intersection and union with expansion heuristics that start with the intersection and add additional alignment points. The decision which points to add may depend on a number of criteria: a0 In which alignment does the potential alignment point exist? Foreign-English or English-foreign? a0 Does the potential point neighbor already established points? a0 Does &amp;quot;neighboring&amp;quot; mean directly adjacent (blockdistance), or also diagonally adjacent? a0 Is the English or the foreign word that the potential point connects unaligned so far? Are both unaligned? null a0 What is the lexical probability for the potential point? The base heuristic [Och et al., 1999] proceeds as follows: We start with intersection of the two word alignments. We only add new alignment points that exist in the union of two word alignments. We also always require that a new alignment point connects at least one previously unaligned word.</Paragraph>
      <Paragraph position="3"> First, we expand to only directly adjacent alignment points. We check for potential points starting from the top right corner of the alignment matrix, checking for alignment points for the first English word, then continue with alignment points for the second English word, and so on. This is done iteratively until no alignment point can be added anymore. In a final step, we add non-adjacent alignment points, with otherwise the same requirements.  Figure 5 shows the performance of this heuristic (base) compared against the two mono-directional alignments (e2f, f2e) and their union (union). The figure also contains two modifications of the base heuristic: In the first (diag) we also permit diagonal neighborhood in the iterative expansion stage. In a variation of this (diag-and), we require in the final step that both words are unaligned. The ranking of these different methods varies for different training corpus sizes. For instance, the alignment f2e starts out second to worst for the 10,000 sentence pair corpus, but ultimately is competitive with the best method at 320,000 sentence pairs. The base heuristic is initially the best, but then drops off.</Paragraph>
      <Paragraph position="4"> The discrepancy between the best and the worst method is quite large, about 0.02 BLEU. For almost all training corpus sizes, the heuristic diag-and performs best, albeit not always significantly.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.6 Simpler Underlying Word-Based Models
</SectionTitle>
      <Paragraph position="0"> The initial word alignment for collecting phrase pairs is generated by symmetrizing IBM Model 4 alignments.</Paragraph>
      <Paragraph position="1"> Model 4 is computationally expensive, and only approximate solutions exist to estimate its parameters. The IBM Models 1-3 are faster and easier to implement. For IBM Model 1 and 2 word alignments can be computed efficiently without relying on approximations. For more information on these models, please refer to Brown et al.</Paragraph>
      <Paragraph position="2"> [1993]. Again, we use the heuristics from the Section 4.5 to reconcile the mono-directional alignments obtained through training parameters using models of increasing complexity.</Paragraph>
      <Paragraph position="3"> How much is performance affected, if we base word alignments on these simpler methods? As Figure 6 indi- null guage pairs (measured with BLEU) cates, not much. While Model 1 clearly results in worse performance, the difference is less striking for Model 2 and 3. Using different expansion heuristics during symmetrizing the word alignments has a bigger effect.</Paragraph>
      <Paragraph position="4"> We can conclude from this, that high quality phrase alignments can be learned with fairly simple means. The simpler and faster Model 2 provides similar performance to the complex Model 4.</Paragraph>
    </Section>
    <Section position="7" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.7 Other Language Pairs
</SectionTitle>
      <Paragraph position="0"> We validated our findings for additional language pairs.</Paragraph>
      <Paragraph position="1"> Table 3 displays some of the results. For all language pairs the phrase model (based on word alignments, Section 3.1) outperforms IBM Model 4. Lexicalization (Lex) always helps as well.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>