File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/n06-1033_metho.xml

Size: 16,958 bytes

Last Modified: 2025-10-06 14:10:10

<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1033">
  <Title>Synchronous Binarization for Machine Translation</Title>
  <Section position="3" start_page="256" end_page="258" type="metho">
    <SectionTitle>
2 Synchronous Binarization
</SectionTitle>
    <Paragraph position="0"> A synchronous CFG (SCFG) is a context-free rewriting system for generating string pairs. Each rule (synchronous production) rewrites a nonterminal in two dimensions, subject to the constraint that the sequence of nonterminal children on one side is a permutation of the nonterminal sequence on the other side. Each co-indexed child nonterminal pair will be further rewritten as a unit (footnote 2). We define the language L(G) produced by an SCFG G as the pairs of terminal strings produced by rewriting exhaustively from the start symbol.</Paragraph>
    <Paragraph position="1"> As shown in Section 3.2, terminals do not play an important role in binarization, so we now write rules in the following notation: $X \rightarrow X_1^{(1)} \dots X_n^{(n)},\ X_{\pi(1)}^{(\pi(1))} \dots X_{\pi(n)}^{(\pi(n))}$, where each $X_i$ is a variable that ranges over nonterminals in the grammar and $\pi$ is the permutation of the rule. We also define an SCFG rule as n-ary if its permutation has length n, and call an SCFG n-ary if its longest rule is n-ary. Our goal is to produce an equivalent binary SCFG for an input n-ary SCFG.</Paragraph>
    <Paragraph position="2"> However, not every SCFG can be binarized. In fact, the binarizability of an n-ary rule is determined by the structure of its permutation, which can sometimes be resistant to factorization (Aho and Ullman, 1972). So we now rigorously define the binarizability of permutations.
      Footnote 2: In making one nonterminal play dual roles, we follow the definitions in (Aho and Ullman, 1972; Chiang, 2005), originally known as Syntax Directed Translation Schema (SDTS). An alternative definition by Satta and Peserico (2005) allows co-indexed nonterminals to take different symbols in the two dimensions. Formally speaking, we can construct an equivalent SDTS by creating a cross-product of nonterminals from the two sides. See (Satta and Peserico, 2005, Sec. 4) for other details.
      Figure 3 (caption fragment): ... for (2, 3, 5, 4). (c): alignment matrix for the non-binarizable permuted sequence (2, 4, 1, 3).</Paragraph>
    <Section position="1" start_page="258" end_page="258" type="sub_section">
      <SectionTitle>
2.1 Binarizable Permutations
</SectionTitle>
      <Paragraph position="0"> A permuted sequence is a permutation of consecutive integers. For example, (3, 5, 4) is a permuted sequence while (2, 5) is not. As special cases, single numbers are permuted sequences as well.</Paragraph>
      <Paragraph position="1"> A sequence a is said to be binarizable if it is a permuted sequence and either
        1. a is a singleton, i.e. a = (a), or
        2. a can be split into two subsequences, i.e. a = (b;c), where b and c are both binarizable permuted sequences. We call such a division (b;c) a binarizable split of a.</Paragraph>
      <Paragraph position="2"> This is a recursive definition. Each binarizable permuted sequence has at least one hierarchical binarization pattern. For instance, the permuted sequence (2, 3, 5, 4) is binarizable (with two possible binarization patterns) while (2, 4, 1, 3) is not (see Figure 3).</Paragraph>
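      <Paragraph> The recursive definition above can be transcribed almost literally into code. The following is a minimal Python sketch, not the authors' implementation: the function names is_permuted_sequence, count_patterns and is_binarizable are hypothetical, and the approach simply tries every split point, so it is meant for checking small examples rather than for efficiency.

```python
def is_permuted_sequence(seq):
    """True if seq is a permutation of consecutive integers."""
    if len(seq) == 0:
        return False
    low = min(seq)
    return sorted(seq) == list(range(low, low + len(seq)))


def count_patterns(seq):
    """Count the hierarchical binarization patterns of seq (0 if none exist)."""
    seq = tuple(seq)
    if not is_permuted_sequence(seq):
        return 0
    if len(seq) == 1:
        return 1  # a singleton is binarizable by definition
    # A split (b;c) is binarizable when both halves are binarizable.
    return sum(
        count_patterns(seq[:k]) * count_patterns(seq[k:])
        for k in range(1, len(seq))
    )


def is_binarizable(seq):
    """A sequence is binarizable if it has at least one binarization pattern."""
    return count_patterns(seq) > 0


print(is_binarizable((2, 3, 5, 4)), count_patterns((2, 3, 5, 4)))
print(is_binarizable((2, 4, 1, 3)))
```

      On the two sequences discussed in the text, this sketch reports two patterns for (2, 3, 5, 4) and none for (2, 4, 1, 3).</Paragraph>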
    </Section>
    <Section position="2" start_page="258" end_page="258" type="sub_section">
      <SectionTitle>
2.2 Binarizable SCFG
</SectionTitle>
      <Paragraph position="0"> An SCFG is said to be binarizable if the permutation of each synchronous production is binarizable. We denote the class of binarizable SCFGs as bSCFG. This set represents an important subclass of SCFG that is easy to handle (parsable in O(|w|^6)) and covers many interesting longer-than-two rules (footnote 3).
        Footnote 3: Although we factor the SCFG rules individually and define bSCFG accordingly, there are some grammars (the dashed ...
        Figure 4 (caption fragment): ... denotes the direction of synchronous binarization. For clarity, binary SCFG is coded as SCFG-2.</Paragraph>
      <Paragraph position="1"> Theorem 1. For each grammar G in bSCFG, there exists a binary SCFG G', such that L(G') = L(G).
        Proof. Once we decompose the permutation of n in the original rule into binary permutations, all that remains is to decorate the skeleton binary parse with nonterminal symbols and attach terminals to the skeleton appropriately. We explain the technical details in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="258" end_page="260" type="metho">
    <SectionTitle>
3 Binarization Algorithms
</SectionTitle>
    <Paragraph position="0"> We have reduced the problem of binarizing an SCFG rule into the problem of binarizing its permutation.</Paragraph>
    <Paragraph position="1"> This problem can be cast as an instance of synchronous ITG parsing (Wu, 1997). Here the parallel string pair that we are parsing is the integer sequence $(1 \dots n)$ and its permutation $(\pi(1) \dots \pi(n))$. The goal of the ITG parsing is to find a synchronous tree that agrees with the alignment indicated by the permutation. In fact, as demonstrated previously, some permutations may have more than one binarization pattern, of which we need only one. Wu (1997, Sec. 7) introduces a non-ambiguous ITG that prefers left-heavy binary trees, so that for each permutation there is a unique synchronous derivation (binarization pattern).</Paragraph>
    <Paragraph position="2"> However, this problem has more efficient solutions. Shapiro and Stephens (1991, p. 277) informally present an iterative procedure that, in each pass, scans the permuted sequence from left to right and combines two adjacent subsequences whenever possible. This procedure produces a left-heavy binarization tree consistent with the unambiguous ITG and runs in O(n^2) time, since it needs n passes in the worst case. We modify this procedure and improve it into a linear-time shift-reduce algorithm that needs only one pass through the sequence.</Paragraph>
    <Paragraph position="4"> Figure 5 (caption fragment): ... (1, 5, 3, 4, 2). The rightmost column shows the binarization trees generated at each reduction step.</Paragraph>
    <Paragraph position="5"> Footnote 3 (continued): ... circle in Figure 4), which can be binarized only by analyzing interactions between rules. Below is a simple example:</Paragraph>
    <Section position="1" start_page="259" end_page="259" type="sub_section">
      <SectionTitle>
3.1 The linear-time skeleton algorithm
</SectionTitle>
      <Paragraph position="0"> The (unique) binarization tree bi(a) for a binarizable permuted sequence a is recursively defined as follows:
        * if a = (a), then bi(a) = a;
        * otherwise, let a = (b;c) be the rightmost binarizable split of a; then bi(a) is [bi(b), bi(c)] if b and c combine in straight order, and &lt;bi(b), bi(c)&gt; if they combine in inverted order.</Paragraph>
      <Paragraph position="2"> For example, the binarization tree for (2, 3, 5, 4) is [[2, 3], &lt;5, 4&gt;], which corresponds to the binarization pattern in Figure 3(a). We use [] and &lt;&gt; for straight and inverted combinations respectively, following the ITG notation (Wu, 1997). The rightmost split ensures left-heavy binary trees.</Paragraph>
      <Paragraph position="3"> The skeleton binarization algorithm is an instance of the widely used left-to-right shift-reduce algorithm. It maintains a stack for contiguous subsequences discovered so far, like 2-5, 1. In each iteration, it shifts the next number from the input and repeatedly tries to reduce the top two elements on the stack if they are consecutive. See Algorithm 1 for details and Figure 5 for an example.</Paragraph>
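      <Paragraph> The shift-reduce procedure just described can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' code: the function name binarize is hypothetical, each stack entry is assumed to hold a span (lowest value, highest value) together with the tree built so far, and the tuple labels "straight" and "inverted" stand in for the square-bracket and angle-bracket notations.

```python
def binarize(perm):
    """Return a binarization tree for perm, or None if perm is not binarizable.

    Leaves are integers; internal nodes are (label, left, right) tuples,
    where label is "straight" or "inverted".
    """
    stack = []  # entries are (lo, hi, tree) for a contiguous span lo..hi
    for x in perm:
        stack.append((x, x, x))  # shift the next number
        # Keep reducing while the top two spans are consecutive.
        while len(stack) > 1:
            lo1, hi1, t1 = stack[-2]  # earlier span (left in the input)
            lo2, hi2, t2 = stack[-1]  # later span (right in the input)
            if hi1 + 1 == lo2:
                label = "straight"    # left span holds the lower numbers
            elif hi2 + 1 == lo1:
                label = "inverted"    # left span holds the higher numbers
            else:
                break
            del stack[-2:]
            stack.append((min(lo1, lo2), max(hi1, hi2), (label, t1, t2)))
    # Success means everything collapsed into a single span.
    return stack[0][2] if len(stack) == 1 else None


print(binarize((2, 3, 5, 4)))
print(binarize((1, 5, 3, 4, 2)))
print(binarize((2, 4, 1, 3)))
```

      For (2, 3, 5, 4) the sketch returns a straight node over a straight pair (2, 3) and an inverted pair (5, 4), which mirrors the tree given in the text; for (2, 4, 1, 3) no reduction ever applies and the result is None.</Paragraph>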
      <Paragraph position="4"> Theorem 2. Algorithm 1 succeeds if and only if the input permuted sequence a is binarizable, and in case of success, the binarization pattern recovered is the binarization tree of a.</Paragraph>
      <Paragraph position="5"> Proof. (→): it is obvious that if the algorithm succeeds then a is binarizable using the binarization pattern recovered.</Paragraph>
      <Paragraph position="6"> (←): by a complete induction on n, the length of a.</Paragraph>
      <Paragraph position="7"> Base case: n = 1, trivial.</Paragraph>
      <Paragraph position="8"> Assume it holds for all n' &lt; n.</Paragraph>
      <Paragraph position="9"> If a is binarizable, then let a = (b;c) be its right-most binarizable split. By the induction hypothesis, the algorithm succeeds on the partial input b, reducing it to the single element s[0] on the stack and recovering its binarization tree bi(b).</Paragraph>
      <Paragraph position="10"> Let c = (c1;c2). If c1 is binarizable and triggers our binarizer to make a straight combination of (b;c1), based on the property of permutations, it must be true that (c1;c2) is a valid straight concatenation. We claim that c2 must be binarizable in this situation. So, (b,c1;c2) is a binarizable split to the right of the rightmost binarizable split (b;c), which is a contradiction. A similar contradiction will arise if b and c1 can make an inverted concatenation.</Paragraph>
      <Paragraph position="11"> Therefore, the algorithm will scan through the whole c as if from the empty stack. By the induction hypothesis again, it will reduce c into s[1] on the stack and recover its binarization tree bi(c).</Paragraph>
      <Paragraph position="12"> Since b and c are combinable, the algorithm reduces s[0] and s[1] in the last step, forming the binarization tree for a, which is either [bi(b), bi(c)] or &lt;bi(b), bi(c)&gt;.</Paragraph>
      <Paragraph position="13"> The running time of Algorithm 1 is linear in n, the length of the input sequence. This is because there are exactly n shifts and at most n − 1 reductions, and each shift or reduction takes O(1) time.</Paragraph>
    </Section>
    <Section position="2" start_page="259" end_page="260" type="sub_section">
      <SectionTitle>
3.2 Binarizing tree-to-string transducers
</SectionTitle>
      <Paragraph position="0"> Without loss of generality, we have discussed how to binarize synchronous productions involving only nonterminals through binarizing the corresponding skeleton permutations. We still need to tackle a few technical problems in the actual system.</Paragraph>
      <Paragraph position="1"> Algorithm 1 The Linear-time Binarization Algorithm
        function BINARIZABLE(a)
          top ← 0                                ▷ stack top pointer
          PUSH(a_1, a_1)                         ▷ initial shift
          for i ← 2 to |a| do                    ▷ for each remaining element
            PUSH(a_i, a_i)                       ▷ shift
            while top &gt; 1 and CONSECUTIVE(s[top], s[top − 1]) do    ▷ keep reducing if possible
              (p, q) ← COMBINE(s[top], s[top − 1])
              top ← top − 2
              PUSH(p, q)
          return (top = 1)                       ▷ if reduced to a single element then the input is binarizable, otherwise not
        function CONSECUTIVE((a, b), (c, d))
          return (b = c − 1) or (d = a − 1)      ▷ either straight or inverted
        function COMBINE((a, b), (c, d))
          return (min(a, c), max(b, d))

        First, we are dealing with tree-to-string transducer rules. We view each left-hand side subtree as a monolithic nonterminal symbol and factor each transducer rule into two SCFG rules: one from the root nonterminal to the subtree, and the other from the subtree to the leaves. In this way we can uniquely reconstruct the tree-to-string derivation using the two-step SCFG derivation. For example, consider the following tree-to-string rule:  We create a specific nonterminal, say, T859, which is a unique identifier for the left-hand side subtree and generate the following two SCFG rules:</Paragraph>
      <Paragraph position="3"> Second, besides synchronous nonterminals, terminals in the two languages can also be present, as in the above example. It turns out we can attach the terminals to the skeleton parse for the synchronous nonterminal strings quite freely as long as we can uniquely reconstruct the original rule from its binary parse tree. In order to do so we need to keep track of sub-alignments including both aligned nonterminals and neighboring terminals.</Paragraph>
      <Paragraph position="4"> When binarizing the second rule above, we first run the skeleton algorithm to binarize the underlying permutation (1, 3, 2) to its binarization tree [1, &lt;3, 2&gt;]. Then we do a post-order traversal of the skeleton tree, combining Chinese terminals (one at a time) at the leaf nodes and merging English terminals greedily at internal nodes:  A pre-order traversal of the decorated binarization tree gives us the following binary SCFG rules:</Paragraph>
      <Paragraph position="6"> where the virtual nonterminals are:
        V1: V[RB, fuze]
        V2: V&lt;V[PP, de], resp. for the NN&gt;
        V3: V[PP, de]
        Analogous to the dotted rules in Earley parsing for monolingual CFGs, the names we create for the virtual nonterminals reflect the underlying sub-alignments, ensuring intermediate states can be shared across different tree-to-string rules without causing ambiguity.</Paragraph>
      <Paragraph position="7"> The whole binarization algorithm still runs in time linear in the number of symbols in the rule (including both terminals and nonterminals).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="260" end_page="262" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> In this section, we answer two empirical questions.</Paragraph>
    <Paragraph position="1"> Figure 6 (caption fragment): ... dashed-line stairs indicate the percentage of non-binarizable rules in our initial rule set, while the dotted line denotes that percentage among all permutations.</Paragraph>
    <Section position="1" start_page="261" end_page="261" type="sub_section">
      <SectionTitle>
4.1 How many rules are binarizable?
</SectionTitle>
      <Paragraph position="0"> It has been shown by Shapiro and Stephens (1991) and Wu (1997, Sec. 4) that the percentage of binarizable cases over all permutations of length n quickly approaches 0 as n grows (see Figure 6). However, for machine translation, it is more meaningful to compute the ratio of binarizable rules extracted from real text. Our rule set is obtained by first doing word alignment using GIZA++ on a Chinese-English parallel corpus containing 50 million words in English, then parsing the English sentences using a variant of the Collins parser, and finally extracting rules using the graph-theoretic algorithm of Galley et al. (2004).</Paragraph>
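      <Paragraph> The decline in the share of binarizable permutations can be observed by brute-force enumeration for small n. The sketch below is an illustration only, not part of the reported experiments: the names is_binarizable and binarizable_fraction are hypothetical, the test reuses the span-based shift-reduce idea, and enumerating all n! permutations is only practical for very small n.

```python
from itertools import permutations
from math import factorial


def is_binarizable(perm):
    """Shift-reduce test: True if perm collapses into a single span."""
    stack = []  # entries are (lo, hi) spans of consecutive integers
    for x in perm:
        stack.append((x, x))
        while len(stack) > 1:
            (lo1, hi1), (lo2, hi2) = stack[-2], stack[-1]
            if hi1 + 1 == lo2 or hi2 + 1 == lo1:
                del stack[-2:]
                stack.append((min(lo1, lo2), max(hi1, hi2)))
            else:
                break
    return len(stack) == 1


def binarizable_fraction(n):
    """Return (number of binarizable permutations of 1..n, n factorial)."""
    count = sum(is_binarizable(p) for p in permutations(range(1, n + 1)))
    return count, factorial(n)


for n in range(1, 8):
    count, total = binarizable_fraction(n)
    print(n, count, total, f"{100.0 * count / total:.1f}%")
```

      For n = 4, for instance, 22 of the 24 permutations are binarizable; the two exceptions are (2, 4, 1, 3) and (3, 1, 4, 2).</Paragraph>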
      <Paragraph position="1"> We did a spectrum analysis on the resulting rule set with 50,879,242 rules. Figure 6 shows how the rules are distributed against their lengths (number of nonterminals). We can see that the percentage of non-binarizable rules in each bucket of the same length does not exceed 25%. Overall, 99.7% of the rules are binarizable. Even for the 0.3% non-binarizable rules, human evaluations show that the majority of them are due to alignment errors. It is also interesting to know that 86.8% of the rules have monotonic permutations, i.e. either taking identical or totally inverted order.</Paragraph>
    </Section>
    <Section position="2" start_page="261" end_page="262" type="sub_section">
      <SectionTitle>
4.2 Does synchronous binarizer help decoding?
</SectionTitle>
      <Paragraph position="0"> We did experiments on our CKY-based decoder with two binarization methods. It is the responsibility of the binarizer to instruct the decoder how to compute the language model scores from child nonterminals in each rule. The baseline method is monolingual left-to-right binarization. As shown in Section 1, decoding complexity with this method is exponential in the size of the longest rule, and since we postpone all the language model scoring, pruning in this case is also biased.</Paragraph>
      <Paragraph position="1">  To move on to synchronous binarization, we first did an experiment using the above baseline system without the 0.3% non-binarizable rules and did not observe any difference in BLEU scores. So we safely move a step further, focusing on the binarizable rules only.</Paragraph>
      <Paragraph position="2"> The decoder now works on the binary translation rules supplied by an external synchronous binarizer. As shown in Section 1, this results in a simplified decoder with a polynomial time complexity, allowing less aggressive and more effective pruning based on both translation model and language model scores.</Paragraph>
      <Paragraph position="3"> We compare the two binarization schemes in terms of translation quality with various pruning thresholds. The rule set is that of the previous section. The test set has 116 Chinese sentences of no longer than 15 words. Both systems use a trigram model as the integrated language model. Figure 7 demonstrates that decoding accuracy is significantly improved after synchronous binarization. The number of edges proposed during decoding is used as a measure of the size of the search space, or time efficiency. Our system is consistently faster and more accurate than the baseline system.</Paragraph>
      <Paragraph position="4"> We also compare the top result of our synchronous binarization system with the state-of-the-art alignment-template approach (ATS) (Och and Ney, 2004). The results are shown in Table 1. Our system has a promising improvement over the ATS system, which is trained on a larger data-set but tuned independently.
        Figure 7 (caption fragment): ... in terms of translation quality against search effort.</Paragraph>
    </Section>
  </Section>
</Paper>