<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1066">
  <Title>Maximum Entropy Based Phrase Reordering Model for Statistical Machine Translation</Title>
  <Section position="4" start_page="521" end_page="523" type="metho">
    <SectionTitle>
2 System Overview
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="521" end_page="522" type="sub_section">
      <SectionTitle>
2.1 Model
</SectionTitle>
      <Paragraph position="0"> Under the BTG scheme, translation is more like monolingual parsing through derivations.</Paragraph>
      <Paragraph position="1"> Throughout the translation procedure, three rules are used to derive the translation</Paragraph>
      <Paragraph position="3"> During decoding, the source sentence is segmented into a sequence of phrases as in a standard phrase-based model. Then the lexical rule (3) 2 is 2Currently, we restrict phrases x and y not to be null.</Paragraph>
      <Paragraph position="4"> Therefore neither deletion nor insertion is carried out during decoding. However, these operations are to be considered in our future version of model.</Paragraph>
      <Paragraph position="5"> used to translate source phrase y into target phrase x and generate a block A. Later, the straight rule (1) merges two consecutive blocks into a single larger block in the straight order; while the inverted rule (2) merges them in the inverted order.</Paragraph>
      <Paragraph position="6"> These two merging rules will be used continuously until the whole source sentence is covered. When the translation is finished, a tree indicating the hierarchical segmentation of the source sentence is also produced.</Paragraph>
      <Paragraph position="7"> In the following, we will define the model in a straight way, not in the dynamic programming recursion way used by (Wu, 1996; Zens et al., 2004). We focus on defining the probabilities of different rules by separating different features (including the language model) out from the rule probabilities and organizing them in a log-linear form. This straight way makes it clear how rules are used and what they depend on.</Paragraph>
      <Paragraph position="8"> For the two merging rules straight and inverted, applying them on two consecutive blocks A1 and</Paragraph>
      <Paragraph position="10"> where the Ohm is the reordering score of block A1 and A2, lOhm is its weight, and trianglepLM(A1,A2) is the increment of the language model score of the two blocks according to their final order, lLM is its weight.</Paragraph>
      <Paragraph position="11"> For the lexical rule, applying it is assigned a</Paragraph>
      <Paragraph position="13"> where p(*) are the phrase translation probabilities in both directions, plex(*) are the lexical translation probabilities in both directions, and exp(1) and exp(|x|) are the phrase penalty and word penalty, respectively. These features are very common in state-of-the-art systems (Koehn et al., 2005; Chiang, 2005) and ls are weights of features. null For the reordering model Ohm, we define it on the two consecutive blocks A1 and A2 and their order</Paragraph>
      <Paragraph position="15"> Under this framework, different reordering models can be designed. In fact, we defined four re-ordering models in our experiments. The first one  is NONE, meaning no explicit reordering features at all. We set Ohm to 1 for all different pairs of blocks and their orders. So the phrasal reordering is totally dependent on the language model. This model is obviously different from the monotone search, which does not use the inverted rule at all. The second one is a distortion style reordering model, which is formulated as</Paragraph>
      <Paragraph position="17"> where |Ai |denotes the number of words on the source side of blocks. When lOhm &lt; 0, this design will penalize those non-monotone translations. The third one is a flat reordering model, which assigns probabilities for the straight and inverted order. It is formulated as</Paragraph>
      <Paragraph position="19"> In our experiments on Chinese-English tasks, the probability for the straight order is set at pm = 0.95. This is because word order in Chinese and English is usually similar. The last one is the maximum entropy based reordering model proposed by us, which will be described in the next section.</Paragraph>
      <Paragraph position="20"> We define a derivation D as a sequence of applications of rules (1)[?](3), and let c(D) and e(D) be the Chinese and English yields of D. The probability of a derivation D is</Paragraph>
      <Paragraph position="22"> where Pr(i) is the probability of the ith application of rules. Given an input sentence c, the final translation e[?] is derived from the best derivation</Paragraph>
      <Paragraph position="24"/>
    </Section>
    <Section position="2" start_page="522" end_page="523" type="sub_section">
      <SectionTitle>
2.2 Decoder
</SectionTitle>
      <Paragraph position="0"> We developed a CKY style decoder that employs a beam search algorithm, similar to the one by Chiang (2005). The decoder finds the best derivation that generates the input sentence and its translation. From the best derivation, the best English e[?] is produced.</Paragraph>
      <Paragraph position="1"> Given a source sentence c, firstly we initiate the chart with phrases from phrase translation table by applying the lexical rule. Then for each cell that spans from i to j on the source side, all possible derivations spanning from i to j are generated. Our algorithm guarantees that any sub-cells within (i,j) have been expanded before cell (i,j) is expanded. Therefore the way to generate derivations in cell (i,j) is to merge derivations from any two neighbor sub-cells. This combination is done by applying the straight and inverted rules.</Paragraph>
      <Paragraph position="2"> Each application of these two rules will generate a new derivation covering cell (i,j). The score of the new generated derivation is derived from the scores of its two sub-derivations, reordering model score and the increment of the language model score according to the Equation (4). When the whole input sentence is covered, the decoding is over.</Paragraph>
      <Paragraph position="3"> Pruning of the search space is very important for the decoder. We use three pruning ways. The first one is recombination. When two derivations in the same cell have the same w leftmost/rightmost words on the English yields, where w depends on the order of the language model, they will be recombined by discarding the derivation with lower score. The second one is the threshold pruning which discards derivations that have a score worse than a times the best score in the same cell. The last one is the histogram pruning which only keeps the top n best derivations for each cell. In all our experiments, we set n = 40,a = 0.5 to get a tradeoff between speed and performance in the development set.</Paragraph>
      <Paragraph position="4"> Another feature of our decoder is the k-best list generation. The k-best list is very important for the minimum error rate training (Och, 2003a) which is used for tuning the weights l for our model. We use a very lazy algorithm for the k-best list generation, which runs two phases similarly to the one by Huang et al. (2005). In the first phase, the decoder runs as usual except that it keeps some information of weaker derivations which are to be discarded during recombination. This will generate not only the first-best of final derivation but also a shared forest. In the second phase, the lazy algorithm runs recursively on the shared forest. It finds the second-best of the final derivation, which makes its children to find their secondbest, and children's children's second-best, until the leaf node's second-best. Then it finds the thirdbest, forth-best, and so on. In all our experiments, we set k = 200.</Paragraph>
      <Paragraph position="5">  The decoder is implemented in C++. Using the pruning settings described above, without the k-best list generation, it takes about 6 seconds to translate a sentence of average length 28.3 words on a 2GHz Linux system with 4G RAM memory.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="523" end_page="524" type="metho">
    <SectionTitle>
3 Maximum Entropy Based Reordering Model
</SectionTitle>
    <Paragraph position="0"> Model In this section, we discuss how to create a maximum entropy based reordering model. As described above, we defined the reordering model Ohm on the three factors: order o, block A1 and block A2. The central problem is, given two neighbor blocks A1 and A2, how to predicate their order o [?] {straight,inverted}. This is a typical problem of two-class classification. To be consistent with the whole model, the conditional probability p(o|A1,A2) is calculated. A simple way to compute this probability is to take counts from the training data and then to use the maximum likeli-</Paragraph>
    <Paragraph position="2"> The similar way is used by lexicalized reordering model. However, in our model this way can't work because blocks become larger and larger due to using the merging rules, and finally unseen in the training data. This means we can not use blocks as direct reordering evidences.</Paragraph>
    <Paragraph position="3"> A good way to this problem is to use features of blocks as reordering evidences. Good features can not only capture reorderings, avoid sparseness, but also integrate generalizations. It is very straight to use maximum entropy model to integrate features to predicate reorderings of blocks. Under the MaxEnt model, we have</Paragraph>
    <Paragraph position="5"> where the functions hi [?]{0,1}are model features and the thi are weights of the model features which can be trained by different algorithms (Malouf, 2002).</Paragraph>
    <Section position="1" start_page="523" end_page="524" type="sub_section">
      <SectionTitle>
3.1 Reordering Example Extraction
Algorithm
</SectionTitle>
      <Paragraph position="0"> The input for the algorithm is a bilingual corpus with high-precision word alignments. We obtain the word alignments using the way of Koehn et al.</Paragraph>
      <Paragraph position="1"> (2005). After running GIZA++ (Och and Ney,</Paragraph>
      <Paragraph position="3"> rows from the corners are their links. Corner c1 is shared by block b1 and b2, which in turn are linked by the STRAIGHT links, bottomleft and topright of c1. Similarly, block b3 and b4 are linked by the INVERTED links, topleft and bottomright of c2.</Paragraph>
      <Paragraph position="4"> 2000) in both directions, we apply theogrowdiag-finalprefinement rule on the intersection alignments for each sentence pair.</Paragraph>
      <Paragraph position="5"> Before we introduce this algorithm, we introduce some formal definitions. The first one is block which is a pair of source and target contiguous sequences of words</Paragraph>
      <Paragraph position="7"> This definition is similar to that of bilingual phrase except that there is no length limitation over block.</Paragraph>
      <Paragraph position="8"> A reordering example is a triple of (o,b1,b2) where b1 and b2 are two neighbor blocks and o is the order between them. We define each vertex of block as corner. Each corner has four links in four directions: topright, topleft, bottomright, bottomleft, and each link links a set of blocks which have the corner as their vertex. The topright and bottomleft link blocks with the straight order, so we call them STRAIGHT links. Similarly, we call the topleft and bottomright INVERTED links since they link blocks with the inverted order. For convenience, we use b -arrowhookright L to denote that block b is linked by the link L. Note that the STRAIGHT links can not coexist with the INVERTED links.</Paragraph>
      <Paragraph position="9"> These definitions are illustrated in Figure 1.</Paragraph>
      <Paragraph position="10"> The reordering example extraction algorithm is shown in Figure 2. The basic idea behind this algorithm is to register all neighbor blocks to the associated links of corners which are shared by them. To do this, we keep an array to record link  1: Input: sentence pair (s,t) and their alignment M 2: Rfractur := [?] 3: for each span (i1,i2) [?] s do 4: find block b = (si2i1,tj2j1) that is consistent with M 5: Extend block b on the target boundary with one possible non-aligned word to get blocks E(b) 6: for each block b[?] [?] buniontextE(b) do 7: Register b[?] to the links of four corners of it 8: end for 9: end for 10: for each corner C in the matrix M do 11: if STRAIGHT links exist then 12: Rfractur := Rfracturuniontext{(straight,b1,b2)}, b1 -arrowhookright C.bottomleft,b2 -arrowhookright C.topright 13: else if INVERTED links exist then 14: Rfractur := Rfracturuniontext{(inverted,b1,b2)}, b1 -arrowhookright C.topleft,b2 -arrowhookright C.bottomright 15: end if 16: end for 17: Output: reordering examples Rfractur  information of corners when extracting blocks.</Paragraph>
      <Paragraph position="11"> Line 4 and 5 are similar to the phrase extraction algorithm by Och (2003b). Different from Och, we just extend one word which is aligned to null on the boundary of target side. If we put some length limitation over the extracted blocks and output them, we get bilingual phrases used in standard phrase-based SMT systems and also in our system. Line 7 updates all links associated with the current block. You can attach the current block to each of these links. However this will increase reordering examples greatly, especially those with the straight order. In our Experiments, we just attach the smallest blocks to the STRAIGHT links, and the largest blocks to the INVERTED links.</Paragraph>
      <Paragraph position="12"> This will keep the number of reordering examples acceptable but without performance degradation.</Paragraph>
      <Paragraph position="13"> Line 12 and 14 extract reordering examples.</Paragraph>
    </Section>
    <Section position="2" start_page="524" end_page="524" type="sub_section">
      <SectionTitle>
3.2 Features
</SectionTitle>
      <Paragraph position="0"> With the extracted reordering examples, we can obtain features for our MaxEnt-based reordering model. We design two kinds of features, lexical features and collocation features. For a block b = (s,t), we use s1 to denote the first word of the source s, t1 to denote the first word of the target t.</Paragraph>
      <Paragraph position="1"> Lexical features are defined on the single word s1 or t1. Collocation features are defined on the combination s1 or t1 between two blocks b1 and b2. Three kinds of combinations are used. The first one is source collocation, b1.s1&amp;b2.s1. The second is target collocation, b1.t1&amp;b2.t1. The last one</Paragraph>
      <Paragraph position="3"> plates. The first one is a lexical feature, and the second one is a target collocation feature, where Ei are English words, O [?]{straight,inverted}.</Paragraph>
      <Paragraph position="4"> is block collocation, b1.s1&amp;b1.t1 and b2.s1&amp;b2.t1.</Paragraph>
      <Paragraph position="5"> The templates for the lexical feature and the collocation feature are shown in Figure 3.</Paragraph>
      <Paragraph position="6"> Why do we use the first words as features? These words are nicely at the boundary of blocks.</Paragraph>
      <Paragraph position="7"> One of assumptions of phrase-based SMT is that phrase cohere across two languages (Fox, 2002), which means phrases in one language tend to be moved together during translation. This indicates that boundary words of blocks may keep information for their movements/reorderings. To test this hypothesis, we calculate the information gain ratio (IGR) for boundary words as well as the whole blocks against the order on the reordering examples extracted by the algorithm described above.</Paragraph>
      <Paragraph position="8"> The IGR is the measure used in the decision tree learning to select features (Quinlan, 1993). It represents how precisely the feature predicate the class. For feature f and class c, the IGR(f,c)</Paragraph>
      <Paragraph position="10"> where En(*) is the entropy and En(*|*) is the conditional entropy. To our surprise, the IGR for the four boundary words (IGR(&lt;b1.s1, b2.s1, b1.t1, b2.t1&gt; , order) = 0.2637) is very close to that for the two blocks together (IGR(&lt;b1, b2&gt; , order) = 0.2655).</Paragraph>
      <Paragraph position="11"> Although our reordering examples do not cover all reordering events in the training data, this result shows that boundary words do provide some clues for predicating reorderings.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="524" end_page="526" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We carried out experiments to compare against various reordering models and systems to demonstrate the competitiveness of MaxEnt-based reordering: null  1. Monotone search: the inverted rule is not used.</Paragraph>
    <Paragraph position="1">  2. Reordering variants: the NONE, distortion and flat reordering models described in Section 2.1.</Paragraph>
    <Paragraph position="2"> 3. Pharaoh: A state-of-the-art distortion-based decoder (Koehn, 2004).</Paragraph>
    <Section position="1" start_page="525" end_page="525" type="sub_section">
      <SectionTitle>
4.1 Corpus
</SectionTitle>
      <Paragraph position="0"> Our experiments were made on two Chinese-to-English translation tasks: NIST MT-05 (news domain) and IWSLT-04 (travel dialogue domain).</Paragraph>
      <Paragraph position="1"> NIST MT-05. In this task, the bilingual training data comes from the FBIS corpus with 7.06M Chinese words and 9.15M English words. The tri-gram language model training data consists of English texts mostly derived from the English side of the UN corpus (catalog number LDC2004E12), which totally contains 81M English words. For the efficiency of minimum error rate training, we built our development set using sentences of length at most 50 characters from the NIST MT-02 evaluation test data.</Paragraph>
      <Paragraph position="2"> IWSLT-04. For this task, our experiments were carried out on the small data track. Both the bilingual training data and the trigram language model training data are restricted to the supplied corpus, which contains 20k sentences, 179k Chinese words and 157k English words. We used the CSTAR 2003 test set consisting of 506 sentence pairs as development set.</Paragraph>
    </Section>
    <Section position="2" start_page="525" end_page="525" type="sub_section">
      <SectionTitle>
4.2 Training
</SectionTitle>
      <Paragraph position="0"> We obtained high-precision word alignments using the way described in Section 3.1. Then we ran our reordering example extraction algorithm to output blocks of length at most 7 words on the Chinese side together with their internal alignments.</Paragraph>
      <Paragraph position="1"> We also limited the length ratio between the target and source language (max(|s|,|t|)/min(|s|,|t|)) to 3. After extracting phrases, we calculated the phrase translation probabilities and lexical translation probabilities in both directions for each bilingual phrase.</Paragraph>
      <Paragraph position="2"> For the minimum-error-rate training, we re-implemented Venugopal's trainer 3 (Venugopal et al., 2005) in C++. For all experiments, we ran this trainer with the decoder iteratively to tune the weights ls to maximize the BLEU score on the</Paragraph>
    </Section>
    <Section position="3" start_page="525" end_page="525" type="sub_section">
      <SectionTitle>
Pharaoh
</SectionTitle>
      <Paragraph position="0"> We shared the same phrase translation tables between Pharaoh and our system since the two systems use the same features of phrases. In fact, we extracted more phrases than Pharaoh's trainer with its default settings. And we also used our re-implemented trainer to tune lambdas of Pharaoh to maximize its BLEU score. During decoding, we pruned the phrase table with b = 100 (default 20), pruned the chart with n = 100,a = 10[?]5 (default setting), and limited distortions to 4 (default 0).</Paragraph>
      <Paragraph position="1"> MaxEnt-based Reordering Model We firstly ran our reordering example extraction algorithm on the bilingual training data without any length limitations to obtain reordering examples and then extracted features from these examples. In the task of NIST MT-05, we obtained about 2.7M reordering examples with the straight order, and 367K with the inverted order, from which 112K lexical features and 1.7M collocation features after deleting those with one occurrence were extracted. In the task of IWSLT-04, we obtained 79.5k reordering examples with the straight order, 9.3k with the inverted order, from which 16.9K lexical features and 89.6K collocation features after deleting those with one occurrence were extracted. Finally, we ran the MaxEnt toolkit by Zhang 4 to tune the feature weights. We set iteration number to 100 and Gaussian prior to 1 for avoiding overfitting.</Paragraph>
    </Section>
    <Section position="4" start_page="525" end_page="526" type="sub_section">
      <SectionTitle>
4.3 Results
</SectionTitle>
      <Paragraph position="0"> We dropped unknown words (Koehn et al., 2005) of translations for both tasks before evaluating their BLEU scores. To be consistent with the official evaluation criterions of both tasks, case-sensitive BLEU-4 scores were computed For the NIST MT-05 task and case-insensitive BLEU-4 scores were computed for the IWSLT-04 task 5.</Paragraph>
      <Paragraph position="1"> Experimental results on both tasks are shown in  the difference to the best result (indicated in bold) is not statistically significant. For all scores, we also show the 95% confidence intervals computed using Zhang's significant tester (Zhang et al., 2004) which was modified to conform to NIST's  definition of the BLEU brevity penalty.</Paragraph>
      <Paragraph position="2"> We observe that if phrasal reordering is totally dependent on the language model (NONE) we get the worst performance, even worse than the monotone search. This indicates that our language models were not strong to discriminate between straight orders and inverted orders. The flat and distortion reordering models (Row 3 and 4) show similar performance with Pharaoh. Although they are not dependent on phrases, they really reorder phrases with penalties to wrong orders supported by the language model and therefore outperform the monotone search. In row 6, only lexical features are used for the MaxEnt-based reordering model; while row 7 uses lexical features and collocation features. On both tasks, we observe that various reordering approaches show similar and stable performance ranks in different domains and the MaxEnt-based reordering models achieve the best performance among them. Using all features for the MaxEnt model (lex + col) is marginally better than using only lex features (lex).</Paragraph>
    </Section>
    <Section position="5" start_page="526" end_page="526" type="sub_section">
      <SectionTitle>
4.4 Scaling to Large Bitexts
</SectionTitle>
      <Paragraph position="0"> In the experiments described above, collocation features do not make great contributions to the performance improvement but make the total number of features increase greatly. This is a problem for MaxEnt parameter estimation if it is scaled to large bitexts. Therefore, for the integration of MaxEnt-based phrase reordering model in the system trained on large bitexts, we remove collocation features and only use lexical features from the last words of blocks (similar to those from the first words of blocks with similar performance).</Paragraph>
      <Paragraph position="1"> This time the bilingual training data contain 2.4M sentence pairs (68.1M Chinese words and 73.8M English words) and two trigram language models are used. One is trained on the English side of the bilingual training data. The other is trained on the Xinhua portion of the Gigaword corpus with 181.1M words. We also use some rules to translate numbers, time expressions and Chinese per-son names. The new Bleu score on NIST MT-05 is 0.291 which is very promising.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>