<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1033">
  <Title>Synchronous Binarization for Machine Translation</Title>
  <Section position="2" start_page="0" end_page="256" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Several recent syntax-based models for machine translation (Chiang, 2005; Galley et al., 2004) can be seen as instances of the general framework of synchronous grammars and tree transducers. In this framework, both alignment (synchronous parsing) and decoding can be thought of as parsing problems, whose complexity is in general exponential in the number of nonterminals on the right-hand side of a grammar rule. To alleviate this problem, we investigate bilingual binarization to factor the synchronous grammar to a smaller branching factor, although it is not guaranteed to succeed for a synchronous rule with an arbitrary permutation. In particular: * We develop a technique called synchronous binarization and devise a fast binarization algorithm such that the resulting rule set allows efficient algorithms for both synchronous parsing and decoding with integrated n-gram language models.</Paragraph>
    <Paragraph position="1"> * We examine the effect of this binarization method on end-to-end machine translation quality, compared to a more typical baseline method.</Paragraph>
    <Paragraph position="2"> * We examine cases of non-binarizable rules in a large, empirically-derived rule set, and we investigate the effect on translation quality when excluding such rules.</Paragraph>
    <Paragraph position="3"> Melamed (2003) discusses binarization of multitext grammars on a theoretical level, showing the importance and difficulty of binarization for efficient synchronous parsing. One way around this difficulty is to stipulate that all rules must be binary from the outset, as in inversion-transduction grammar (ITG) (Wu, 1997) and the binary synchronous context-free grammar (SCFG) employed by the Hiero system (Chiang, 2005) to model the hierarchical phrases. In contrast, the rule extraction method of Galley et al. (2004) aims to incorporate more syntactic information by providing parse trees for the target language and extracting tree transducer rules that apply to the parses. This approach results in rules with many nonterminals, making good binarization techniques critical.</Paragraph>
    <Paragraph position="4"> Suppose we have the following SCFG, where superscripts indicate reorderings (formal definitions of SCFGs can be found in Section 2):</Paragraph>
    <Paragraph position="7"> VP → held a meeting, juxing le huitan; PP → with Sharon, yu Shalong. Decoding can be cast as a (monolingual) parsing problem since we only need to parse the source-language side of the SCFG, as if we were constructing a CFG projected on Chinese out of the SCFG. The only extra work we need to do for decoding is to build corresponding target-language (English) subtrees in parallel. In other words, we build synchronous trees when parsing the source-language input, as shown in Figure 1.</Paragraph>
    <Paragraph position="8"> To efficiently decode with CKY, we need to binarize the projected CFG grammar.[1] Rules can be binarized in different ways. For example, we could binarize the first rule left to right or right to left:</Paragraph>
    <Paragraph position="10"> We call those intermediate symbols (e.g. $V_{\text{PP-VP}}$) virtual nonterminals and corresponding rules virtual rules, whose probabilities are all set to 1.</Paragraph>
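The two binarization directions can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes the source-side projection of the first rule is S → NP PP VP (inferred from the virtual nonterminals discussed in the text), and the helper names `virtual`, `binarize_left_to_right`, and `binarize_right_to_left` are hypothetical. Virtual nonterminals are written as V[...] in the code.

```python
from typing import List, Sequence, Tuple

Rule = Tuple[str, Tuple[str, ...]]


def virtual(symbols: Sequence[str]) -> str:
    """Name a virtual nonterminal after the symbols it covers, e.g. V[PP-VP]."""
    return "V[" + "-".join(symbols) + "]"


def binarize_left_to_right(lhs: str, rhs: Sequence[str]) -> List[Rule]:
    """Binarize lhs -> rhs by grouping the leftmost symbols first.

    Assumes rhs has at least one symbol.
    """
    rhs = list(rhs)
    if len(rhs) in (1, 2):
        return [(lhs, tuple(rhs))]  # already unary or binary
    rules = []
    left = rhs[0]
    covered = [rhs[0]]
    for sym in rhs[1:-1]:
        covered.append(sym)
        new = virtual(covered)
        rules.append((new, (left, sym)))  # virtual rule
        left = new
    rules.append((lhs, (left, rhs[-1])))
    return rules


def binarize_right_to_left(lhs: str, rhs: Sequence[str]) -> List[Rule]:
    """Binarize lhs -> rhs by grouping the rightmost symbols first.

    Assumes rhs has at least one symbol.
    """
    rhs = list(rhs)
    if len(rhs) in (1, 2):
        return [(lhs, tuple(rhs))]
    rules = []
    right = rhs[-1]
    covered = [rhs[-1]]
    for sym in reversed(rhs[1:-1]):
        covered.insert(0, sym)
        new = virtual(covered)
        rules.append((new, (sym, right)))  # virtual rule
        right = new
    rules.append((lhs, (rhs[0], right)))
    return rules


for rule in binarize_left_to_right("S", ["NP", "PP", "VP"]):
    print(rule)
for rule in binarize_right_to_left("S", ["NP", "PP", "VP"]):
    print(rule)
```

Under these assumptions, the left-to-right variant introduces V[NP-PP] and the right-to-left variant introduces V[PP-VP]; both yield the same parses when only the translation model is used.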
    <Paragraph position="11"> These two binarizations are no different in the translation-model-only decoding described above, just as in monolingual parsing. However, in the source-channel approach to machine translation, we need to combine probabilities from the translation model (an SCFG) with the language model (an n-gram), which has been shown to be very important for translation quality (Chiang, 2005). To do bigram-integrated decoding, we need to augment each chart item (X, i, j) with two target-language boundary words u and v to produce a bigram-item like $\binom{u \cdots v}{X_{i,j}}$, following the dynamic programming algorithm of Wu (1996).</Paragraph>
    <Paragraph position="12"> [1] Other parsing strategies like the Earley algorithm use an internal binary representation (e.g. dotted rules) of the original grammar to ensure cubic time complexity.</Paragraph>
    <Paragraph position="13"> Now the two binarizations have very different effects. In the first case, we first combine NP with PP:  where p and q are the scores of antecedent items. This situation is unpleasant because NP and PP are not contiguous in the target language, so we cannot apply language model scoring when we build the $V_{\text{NP-PP}}$ item. Instead, we have to maintain all four boundary words (rather than two) and postpone the language model scoring until the next step, where $V_{\text{NP-PP}}$ is combined with $\binom{\text{held} \cdots \text{meeting}}{\text{VP}_{2,4}}$ to form an S item.</Paragraph>
    <Paragraph position="14"> We call this binarization method monolingual binarization since it works only on the source-language projection of the rule without respecting the constraints from the other side.</Paragraph>
    <Paragraph position="15"> This scheme generalizes to the case where we have n nonterminals in an SCFG rule, and the decoder conservatively assumes nothing can be done on language model scoring (because target-language spans are non-contiguous in general) until the real nonterminal has been recognized. In other words, target-language boundary words from each child nonterminal of the rule will be cached in all virtual nonterminals derived from this rule. In the case of m-gram integrated decoding, we have to maintain $2(m-1)$ boundary words for each child nonterminal, which leads to a prohibitive overall complexity of $O(|w|^{3+2n(m-1)})$, which is exponential in rule size (Huang et al., 2005). Aggressive pruning must be used to make it tractable in practice, which in general introduces many search errors and adversely affects translation quality.</Paragraph>
    <Paragraph position="16"> In the second case, however:</Paragraph>
    <Paragraph position="17"> Because PP and VP are contiguous in the target language, the language model score can be included by adding $\Pr(\text{with} \mid \text{meeting})$, and the resulting item again has two boundary words. Later we add $\Pr(\text{held} \mid \text{Powell})$ when the resulting item is combined with the NP item to form an S item. As illustrated in Figure 2, $V_{\text{PP-VP}}$ has contiguous spans on both source and target sides, so that we can generate a binary-branching SCFG:</Paragraph>
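The combination step for bigram items can be sketched as follows. This is a rough illustration under stated assumptions, not the decoder used in the paper: each item stores only its leftmost and rightmost target-language words, and the class `BigramItem`, the function `combine`, the constant model `toy_bigram`, the span indices, and all probabilities are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BigramItem:
    label: str    # nonterminal, possibly virtual
    i: int        # source span start
    j: int        # source span end
    first: str    # leftmost target-language boundary word
    last: str     # rightmost target-language boundary word
    score: float  # accumulated probability


def combine(label: str, left: BigramItem, right: BigramItem, swap: bool,
            bigram: Callable[[str, str], float]) -> BigramItem:
    """Merge two source-adjacent items whose target spans are also adjacent.

    If swap is True, the target side of `right` precedes that of `left`.
    `bigram(prev, word)` is taken to return Pr(word | prev).
    """
    if left.j != right.i:
        raise ValueError("items are not adjacent on the source side")
    tgt_first, tgt_second = (right, left) if swap else (left, right)
    # Exactly one new bigram appears where the two target spans meet.
    lm = bigram(tgt_first.last, tgt_second.first)
    return BigramItem(label, left.i, right.j, tgt_first.first,
                      tgt_second.last, left.score * right.score * lm)


def toy_bigram(prev: str, word: str) -> float:
    """Hypothetical bigram model: the same probability for every pair."""
    return 0.125


# Hypothetical spans and scores for the PP and VP items of the example.
pp = BigramItem("PP", 1, 3, "with", "Sharon", 0.5)
vp = BigramItem("VP", 3, 6, "held", "meeting", 0.25)
v_pp_vp = combine("V[PP-VP]", pp, vp, swap=True, bigram=toy_bigram)
print(v_pp_vp)
```

Because the two target spans meet at "meeting with", the only bigram scored here is Pr(with | meeting), and the new item keeps just two boundary words, "held" and "Sharon".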
    <Paragraph position="19"> In this case m-gram integrated decoding can be done in $O(|w|^{3+4(m-1)})$ time, a much lower-order polynomial that no longer depends on rule size (Wu, 1996). This allows the search to be much faster and more accurate under pruning, as evidenced by the Hiero system of Chiang (2005), which restricts hierarchical phrases to a binary SCFG. The benefit of binary grammars also extends to synchronous parsing (alignment): Wu (1997) shows that parsing a binary SCFG is in $O(|w|^6)$, while parsing an SCFG is NP-hard in general (Satta and Peserico, 2005).</Paragraph>
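To make the gap between the two bounds concrete, the exponents of |w| can be computed directly from the two formulas in the text. The function names below are hypothetical, and the sample values of n and m are chosen only for illustration.

```python
def monolingual_exponent(n: int, m: int) -> int:
    """Exponent of |w| under monolingual binarization: 3 + 2n(m - 1)."""
    return 3 + 2 * n * (m - 1)


def synchronous_exponent(m: int) -> int:
    """Exponent of |w| under synchronous binarization: 3 + 4(m - 1)."""
    return 3 + 4 * (m - 1)


# Illustrative values only: rules with n nonterminals, m-gram language model.
for n, m in [(3, 2), (5, 3)]:
    print(n, m, monolingual_exponent(n, m), synchronous_exponent(m))
```

The two exponents coincide at n = 2, as expected for a grammar that is already binary; for larger n only the monolingual exponent keeps growing.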
    <Paragraph position="20"> The same reasoning applies to tree transducer rules. Suppose we have the following tree-to-string rules, following Galley et al. (2004):</Paragraph>
    <Paragraph position="22"> → juxing le huitan; PP(TO(with), NP(NNP(Sharon))) → yu Shalong, where the reorderings of nonterminals are denoted by variables $x_i$.</Paragraph>
    <Paragraph position="23"> Notice that the first rule has a multi-level left-hand side subtree. This system can model non-isomorphic transformations on English parse trees to fit another language, for example, learning that the (S (V O)) structure in English should be transformed into a (V S O) structure in Arabic, by looking at two-level tree fragments (Knight and Graehl, 2005). From a synchronous rewriting point of view, this is more akin to synchronous tree substitution grammar (STSG) (Eisner, 2003). This larger locality is linguistically motivated and leads to better parameter estimation. By imagining the left-hand-side trees as special nonterminals, we can virtually create an SCFG with the same generative capacity. The technical details will be explained in Section 3.2.</Paragraph>
    <Paragraph position="24"> In general, if we are given an arbitrary synchronous rule with many nonterminals, what are the good decompositions that lead to a binary grammar? Figure 2 suggests that a binarization is good if every virtual nonterminal has contiguous spans on both sides. We formalize this idea in the next section.</Paragraph>
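The contiguity criterion can be checked mechanically. The sketch below is an assumption-laden illustration rather than the algorithm developed in the paper: a rule is represented by a hypothetical list `perm`, where `perm[k]` is the target-side position of the k-th source-side nonterminal, and the function names are invented for this example. The permutation [1, 3, 0, 2] is supplied here as an illustration and is not taken from the text.

```python
from typing import Sequence


def is_contiguous(positions: Sequence[int]) -> bool:
    """True if the given distinct integers form an unbroken range."""
    return max(positions) - min(positions) + 1 == len(positions)


def virtual_is_good(perm: Sequence[int], start: int, end: int) -> bool:
    """Check a virtual nonterminal covering source children start..end-1.

    The source span is contiguous by construction, so only the
    target-side positions need to be checked.
    """
    return is_contiguous(perm[start:end])


def has_good_adjacent_pair(perm: Sequence[int]) -> bool:
    """True if some pair of source-adjacent children is also target-adjacent."""
    return any(virtual_is_good(perm, k, k + 2) for k in range(len(perm) - 1))


# Source order NP PP VP, target order NP VP PP (inferred from the example).
perm = [0, 2, 1]
print(virtual_is_good(perm, 0, 2))  # V[NP-PP]: target positions 0 and 2
print(virtual_is_good(perm, 1, 3))  # V[PP-VP]: target positions 2 and 1

# A permutation with no source-adjacent pair that is also target-adjacent.
print(has_good_adjacent_pair([1, 3, 0, 2]))
```

A rule whose permutation offers no such pair gives a binarizer nowhere to start, which is one way to see why some rules cannot be binarized under this criterion.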
  </Section>
</Paper>