<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1507">
  <Title>Machine Translation as Lexicalized Parsing with Hooks</Title>
  <Section position="4" start_page="0" end_page="69" type="metho">
    <SectionTitle>
2 Machine Translation using Inversion
Transduction Grammar
</SectionTitle>
    <Paragraph position="0"> The Inversion Transduction Grammar (ITG) of Wu (1997) is a type of context-free grammar (CFG) for generating two languages synchronously. To model the translational equivalence within a sentence pair, ITG employs a synchronous rewriting mechanism to relate two sentences recursively. To deal with the syntactic divergence between two languages, ITG allows the inversion of rewriting order going from one language to another at any recursive level. ITG in Chomsky normal form consists of unary production rules that are responsible for generating word pairs:</Paragraph>
    <Paragraph position="2"> where e is a source language word, f is a foreign language word, and epsilon1 means the null token, and binary production rules in two forms that are responsible for generating syntactic subtree pairs:</Paragraph>
    <Paragraph position="4"/>
    <Paragraph position="6"> The rules with square brackets enclosing the right-hand side expand the left-hand side symbol into the two symbols on the right-hand side in the same order in the two languages, whereas the rules with angled brackets expand the left hand side symbol into the two right-hand side symbols in reverse order in the two languages. The first class of rules is called straight rule. The second class of rules is called inverted rule.</Paragraph>
    <Paragraph position="7"> One special case of 2-normal ITG is the so-called Bracketing Transduction Grammar (BTG) which has only one nonterminal A and two binary rules A - [AA] and A - &lt;AA&gt; By mixing instances of the inverted rule with those of the straight rule hierarchically, BTG can meet the alignment requirements of different language pairs. There exists a more elaborate version of BTG that has 4 nonterminals working together to guarantee the property of one-to-one correspondence between alignments and synchronous parse trees. Table 1 lists the rules of this BTG. In the discussion of this paper, we will consider ITG in 2-normal form.</Paragraph>
    <Paragraph position="8"> By associating probabilities or weights with the bitext production rules, ITG becomes suitable for weighted deduction over bitext. Given a sentence pair, searching for the Viterbi synchronous parse tree, of which the alignment is a byproduct, turns out to be a two-dimensional extension of PCFG parsing, having time complexity of O(n6), where n is the length of the English string and the foreign language string. A more interesting variant of parsing over bi-text space is the asymmetrical case in which only the foreign language string is given so that Viterbi parsing involves finding the English string &amp;quot;on the fly&amp;quot;. The process of finding the source string given its target counterpart is decoding. Using ITG, decoding is a form of parsing.</Paragraph>
    <Section position="1" start_page="65" end_page="68" type="sub_section">
      <SectionTitle>
2.1 ITG Decoding
</SectionTitle>
      <Paragraph position="0"> Wu (1996) presented a polynomial-time algorithm for decoding ITG combined with an m-gram language model. Such language models are commonly used in noisy channel models of translation, which find the best English translation e of a foreign sentence f by finding the sentence e that maximizes the product of the translation model P(f|e) and the language model P(e).</Paragraph>
      <Paragraph position="1"> It is worth noting that since we have specified ITG as a joint model generating both e and f, a language model is not theoretically necessary. Given a foreign sentence f, one can find the best translation e[?]:</Paragraph>
      <Paragraph position="3"> by approximating the sum over parses q with the probability of the Viterbi parse:</Paragraph>
      <Paragraph position="5"> This optimal translation can be computed in using standard CKY parsing over f by initializing the chart with an item for each possible translation of each foreign word in f, and then applying ITG rules from the bottom up.</Paragraph>
      <Paragraph position="6"> However, ITG's independence assumptions are too strong to use the ITG probability alone for machine translation. In particular, the context-free assumption that each foreign word's translation is chosen independently will lead to simply choosing each foreign word's single most probable English translation with no reordering. In practice it is beneficial to combine the probability given by ITG with a local m-gram language model for English:</Paragraph>
      <Paragraph position="8"> with some constant language model weight a. The language model will lead to more fluent output by influencing both the choice of English words and the reordering, through the choice of straight or inverted rules. While the use of a language model complicates the CKY-based algorithm for finding the best translation, a dynamic programming solution is still possible. We extend the algorithm by storing in each chart item the English boundary words that will affect the m-gram probabilities as the item's English string is concatenated with the string from an adjacent item. Due to the locality of m-gram language  model, only m[?]1 boundary words need to be stored to compute the new m-grams produced by combining two substrings. Figure 1 illustrates the combination of two substrings into a larger one in straight order and inverted order.</Paragraph>
      <Paragraph position="9"> 3 Hook Trick for Bilexical Parsing A traditional CFG generates words at the bottom of a parse tree and uses nonterminals as abstract representations of substrings to build higher level tree nodes. Nonterminals can be made more specific to the actual substrings they are covering by associating a representative word from the nonterminal's yield. When the maximum number of lexicalized nonterminals in any rule is two, a CFG is bilexical.</Paragraph>
      <Paragraph position="10"> A typical bilexical CFG in Chomsky normal form has two types of rule templates:</Paragraph>
      <Paragraph position="12"> depending on which child is the head child that agrees with the parent on head word selection.</Paragraph>
      <Paragraph position="13"> Bilexical CFG is at the heart of most modern statistical parsers (Collins, 1997; Charniak, 1997), because the statistics associated with word-specific rules are more informative for disambiguation purposes. If we use A[i, j, h] to represent a lexicalized constituent, b(*) to represent the Viterbi score function applicable to any constituent, and P(*) to represent the rule probability function applicable to any rule, Figure 2 shows the equation for the dynamic programming computation of the Viterbi parse. The two terms of the outermost max operator are symmetric cases for heads coming from left and right. Containing five free variables i,j,k,hprime,h, ranging over 1 to n, the length of input sentence, both terms can be instantiated in n5 possible ways, implying that the complexity of the parsing algorithm is O(n5).</Paragraph>
      <Paragraph position="14"> Eisner and Satta (1999) pointed out we don't have to enumerate k and hprime simultaneously. The trick, shown in mathematical form in Figure 2 (bottom) is very simple. When maximizing over hprime, j is irrelevant. After getting the intermediate result of maximizing over hprime, we have one less free variable than before. Throughout the two steps, the maximum number of interacting variables is 4, implying that the algorithmic complexity is O(n4) after binarizing the factors cleverly. The intermediate result</Paragraph>
      <Paragraph position="16"> can be represented pictorially as  j. The shape of the intermediate results gave rise to the nickname of &amp;quot;hook&amp;quot;. Melamed (2003) discussed the applicability of the hook trick for parsing bilexical multitext grammars. The analysis of the hook trick in this section shows that it is essentially an algebraic manipulation. We will formulate the ITG Viterbi decoding algorithm in a dynamic programming equation in the following section and apply the same algebraic manipulation to produce hooks that are suitable for ITG decoding.</Paragraph>
      <Paragraph position="17"> 4 Hook Trick for ITG Decoding We start from the bigram case, in which each decoding constituent keeps a left boundary word and  and right (v) of each constituent. In (a), two constituents Y and Z spanning substrings s, S and S, t of the input are combined using a straight rule X - [Y Z]. In (b), two constituents are combined using a inverted</Paragraph>
      <Paragraph position="19"> a right boundary word. The dynamic programming equation is shown in Figure 3 (top) where i,j,k range over 1 to n, the length of input foreign sentence, and u,v,v1,u2 (or u,v,v2,u1) range over 1 to V , the size of English vocabulary. Usually we will constrain the vocabulary to be a subset of words that are probable translations of the foreign words in the input sentence. So V is proportional to n. There are seven free variables related to input size for doing the maximization computation. Hence the algorithmic complexity is O(n7).</Paragraph>
      <Paragraph position="20"> The two terms in Figure 3 (top) within the first level of the max operator, corresponding to straight rules and inverted rules, are analogous to the two terms in Equation 1. Figure 3 (bottom) shows how to decompose the first term; the same method applies to the second term. Counting the free variables enclosed in the innermost max operator, we get five: i, k, u, v1, and u2. The decomposition eliminates one free variable, v1. In the outermost level, there are six free variables left. The maximum number of interacting variables is six overall. So, we reduced the complexity of ITG decoding using bigram language model from O(n7) to O(n6).</Paragraph>
      <Paragraph position="21"> The hooks k</Paragraph>
      <Paragraph position="23"> i that we have built for decoding with a bigram language model turn out to be similar to the hooks for bilexical parsing if we focus on the two boundary words v1 and u2 (or v2 and u1)  b(X[i, j, u, v]) = max  that are interacting between two adjacent decoding constituents and relate them with the hprime and h that are interacting in bilexical parsing. In terms of algebraic manipulation, we are also rearranging three factors (ignoring the non-lexical rules), trying to reduce the maximum number of interacting variables in any computation step.</Paragraph>
    </Section>
    <Section position="2" start_page="68" end_page="69" type="sub_section">
      <SectionTitle>
4.1 Generalization to m-gram Cases
</SectionTitle>
      <Paragraph position="0"> In this section, we will demonstrate how to use the hook trick for trigram decoding which leads us to a general hook trick for any m-gram decoding case.</Paragraph>
      <Paragraph position="1"> We will work only on straight rules and use icons of constituents and hooks to make the equations easier to interpret.</Paragraph>
      <Paragraph position="2"> The straightforward dynamic programming equation is:</Paragraph>
      <Paragraph position="4"> By counting the variables that are dependent on input sentence length on the right hand side of the equation, we know that the straightforward algorithm's complexity is O(n11). The maximization computation is over four factors that are dependent on n: b(Y [i, k, u1, u2, v11, v12]), b(Z[k, j, u21, u22, v1, v2]), trigram(v11, v12, u21), and trigram(v12, u21, u22). As before, our goal is to cleverly bracket the factors.</Paragraph>
      <Paragraph position="5"> By bracketing trigram(v11, v12, u21) and b(Y [i, k, u1, u2, v11, v12]) together and maximizing over v11 and Y , we can build the the level-1 hook:</Paragraph>
      <Paragraph position="7"> The complexity is O(n7).</Paragraph>
      <Paragraph position="8"> Grouping the level-1 hook and trigram(v12, u21, u22), maximizing over v12, we can build the level-2 hook:</Paragraph>
      <Paragraph position="10"> The complexity is O(n7). Finally, we can use the level-2 hook to combine with Z[k, j, u21, u22, v1, v2] to build X[i, j, u1, u2, v1, v2]. The complexity is O(n9) after reducing v11 and v12 in the first two steps.</Paragraph>
      <Paragraph position="12"> Using the hook trick, we have reduced the complexity of ITG decoding using bigrams from O(n7) to O(n6), and from O(n11) to O(n9) for trigram  case. We conclude that for m-gram decoding of ITG, the hook trick can change the the time complexity from O(n3+4(m[?]1)) to O(n3+3(m[?]1)). To get an intuition of the reduction, we can compare Equation 3 with Equation 4. The variables v11 and v12 in Equation 3, which are independent of v1 and v2 for maximizing the product have been concealed under the level-2 hook in Equation 4. In general, by building m [?] 1 intermediate hooks, we can reduce m [?] 1 free variables in the final combination step, hence having the reduction from 4(m [?] 1) to  Although we have presented our algorithm as a decoder for the binary-branching case of Inversion Transduction Grammar, the same factorization technique can be applied to more complex synchronous grammars. In this general case, items in the dynamic programming chart may need to represent non-contiguous span in either the input or output language. Because synchronous grammars with increasing numbers of children on the right hand side of each production form an infinite, non-collapsing hierarchy, there is no upper bound on the number of discontinuous spans that may need to be represented (Aho and Ullman, 1972). One can, however, choose to factor the grammar into binary branching rules in one of the two languages, meaning that discontinuous spans will only be necessary in the other language.</Paragraph>
      <Paragraph position="13"> If we assume m is larger than 2, it is likely that the language model combinations dominate computation. In this case, it is advantageous to factor the grammar in order to make it binary in the output language, meaning that the subrules will only need to represent adjacent spans in the output language. Then the hook technique will work in the same way, yielding O(n2(m[?]1)) distinct types of items with respect to language model state, and 3(m[?]1) free indices to enumerate when combining a hook with a complete constituent to build a new item. However, a larger number of indices pointing into the input language will be needed now that items can cover discontinuous spans. If the grammar factorization yields rules with at most R spans in the input language, there may be O(n2R) distinct types of chart items with respect to the input language, because each span has an index for its beginning and ending points in the input sentence.</Paragraph>
      <Paragraph position="14"> Now the upper bound of the number of free indices with respect to the input language is 2R + 1, because otherwise if one rule needs 2R + 2 indices, say i1,*** , i2R+2, then there are R + 1 spans (i1, i2),*** , (i2R+1, i2R+2), which contradicts the above assumption. Thus the time complexity at the input language side is O(n2R+1), yielding a total algorithmic complexity of O(n3(m[?]1)+(2R+1)).</Paragraph>
      <Paragraph position="15"> To be more concrete, we will work through a 4-ary translation rule, using a bigram language model. The standard DP equation is:</Paragraph>
      <Paragraph position="17"> This 4-ary rule is a representative difficult case.</Paragraph>
      <Paragraph position="18"> The underlying alignment pattern for this rule is as follows:  It is a rule that cannot be binarized in the bitext space using ITG rules. We can only binarize it in one dimension and leave the other dimension having discontinuous spans. Without applying binarization and hook trick, decoding parsing with it according to Equation 5 requires time complexity of O(n13). However, we can build the following partial constituents and hooks to do the combination gradually. The first step finishes a hook by consuming one bigram. Its time complexity is O(n5):  The overall complexity has been reduced to O(n8) after using binarization on the output side and using the hook trick all the way to the end. The result is one instance of our general analysis: here R = 2, m = 2, and 3(m [?] 1) + (2R + 1) = 8.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="69" end_page="69" type="metho">
    <SectionTitle>
6 Implementation
</SectionTitle>
    <Paragraph position="0"> The implementation of the hook trick in a practical decoder is complicated by the interaction with pruning. If we build hooks looking for all words in the vocabulary whenever a complete constituent is added to the chart, we will build many hooks that are never used, because partial hypotheses with many of the boundary words specified by the hooks may never be constructed due to pruning. Instead of actively building hooks, which are intermediate results, we can build them only when we need them and then cache them for future use. To make this idea concrete, we sketch the code for bi-gram integrated decoding using ITG as in Algorithm 1. It is worthy of noting that for clarity we are building hooks in shape of  as we have been showing in the previous sections. That is, the probability for the grammar rule is multiplied in when a complete constituent is built, rather than when a hook is created. If we choose the original representation, we would have to create both straight hooks and inverted hooks because the straight rules and inverted rules are to be merged with the &amp;quot;core&amp;quot; hooks, creating more specified hooks.</Paragraph>
  </Section>
class="xml-element"></Paper>