<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1083"> <Title>Statistical Machine Translation by Parsing</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Multitext Grammars and Multitrees </SectionTitle> <Paragraph position="0"> The algorithms in this paper can be adapted for any synchronous grammar formalism. The vehicle for the present guided tour shall be multitext grammar (MTG), which is a generalization of context-free grammar to the synchronous case (Melamed, 2003).</Paragraph> <Paragraph position="1"> We shall limit our attention to MTGs in Generalized Chomsky Normal Form (GCNF) (Melamed et al., 2004). This normal form allows simpler algorithm descriptions than the normal forms used by Wu (1997) and Melamed (2003).</Paragraph> <Paragraph position="2"> In GCNF, every production is either a terminal production or a nonterminal production. A nonterminal production might look like this: [...] There are nonterminals on the left-hand side (LHS) and in parentheses on the right-hand side (RHS). Each row of the production describes rewriting in a different component text of a multitext. In each row, a role template describes the relative order and contiguity of the RHS nonterminals. E.g., in the top row, [1,2] indicates that the first nonterminal (A) precedes the second (B). In the bottom row, [1,2,1] indicates that the first nonterminal both precedes and follows the second, i.e. D is discontinuous. Discontinuous nonterminals are annotated with the number of their contiguous segments, as in D(2). The &quot;join&quot; operator rearranges the nonterminals in each component according to their role template. The nonterminals on the RHS are written in columns called links. Links express translational equivalence. Some nonterminals might have no translation in some components, indicated by (), as in the 2nd row. Terminal productions have exactly one &quot;active&quot; component, in which there is exactly one terminal on the RHS. 
The other components are inactive. The semantics of a production are the usual semantics of rewriting systems, i.e., the expression on the LHS can be rewritten as the expression on the RHS. However, all the nonterminals in the same link must be rewritten simultaneously. In this manner, MTGs generate tuples of parse trees that are isomorphic up to reordering of sibling nodes and deletion. Figure 2 shows two representations of a tree that might be generated by an MTG in GCNF for the imperative sentence pair Wash the dishes / Pasudu moy. The tree exhibits both deletion and inversion in translation. We shall refer to such multidimensional trees as multitrees.</Paragraph> <Paragraph position="3"> The different classes of generalized parsing algorithms in this paper differ only in their grammars and in their logics. They are all compatible with the same parsing semirings and search strategies. Therefore, we shall describe these algorithms in terms of their underlying logics and grammars, abstracting away the semirings and search strategies, in order to elucidate how the different classes of algorithms are related to each other. Logical descriptions of inference algorithms involve inference rules, each of which states that a consequent can be inferred from its antecedents. An item that appears in an inference rule stands for the proposition that the item is in the parse chart. A production rule that appears in an inference rule stands for the proposition that the production is in the grammar. [Figure 2. Above: a multitree in English and (transliterated) Russian. Every internal node is annotated with the linear order of its children, in every component where there are two children. Below: a graphical representation of the same tree. Rectangles are 2D constituents.]</Paragraph> <Paragraph position="4"> Such specifications are nondeterministic: they do not indicate the order in which a parser should attempt inferences. A deterministic parsing strategy can always be chosen later, to suit the application. 
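To make the separation between a logic and a search strategy concrete, the following is a minimal sketch in Python for the ordinary one-dimensional case. It assumes a hypothetical toy grammar in Chomsky Normal Form; the names TERMINAL, NONTERMINAL, scan, compose and deduce are illustrative only and do not come from the paper. The same two inference rules are exhausted once with a queue and once with a stack, and both strategies should arrive at the same chart.

```python
from collections import deque

# Hypothetical toy grammar in Chomsky Normal Form (an assumption of this sketch).
TERMINAL = {("V", "wash"), ("D", "the"), ("N", "dishes")}
NONTERMINAL = {("NP", "D", "N"), ("S", "V", "NP")}


def scan(words):
    """Scan: inferences with no antecedent items, one per matching terminal production."""
    return [(lhs, i, i + 1)
            for i, word in enumerate(words)
            for lhs, terminal in sorted(TERMINAL)
            if terminal == word]


def compose(left, right):
    """Compose: join two adjacent items whose labels form the RHS of a production."""
    (a, i, j), (b, k, m) = left, right
    if j != k:
        return []
    return [(lhs, i, m) for lhs, x, y in sorted(NONTERMINAL) if (x, y) == (a, b)]


def deduce(words, lifo):
    """Exhaust the logic; `lifo` selects the search strategy (stack or queue)."""
    agenda = deque(scan(words))
    chart = set()
    while agenda:
        item = agenda.pop() if lifo else agenda.popleft()
        if item in chart:
            continue
        chart.add(item)
        # Try the new item as the left and as the right antecedent of Compose.
        for other in list(chart):
            agenda.extend(compose(item, other))
            agenda.extend(compose(other, item))
    return chart


words = ["wash", "the", "dishes"]
goal = ("S", 0, len(words))
chart_queue = deduce(words, lifo=False)
chart_stack = deduce(words, lifo=True)
print(goal in chart_queue, chart_queue == chart_stack)  # prints: True True
```

Only the order of inferences differs between the two calls; the set of inferred items is fixed by the logic and the grammar.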
We presume that readers are familiar with declarative descriptions of inference algorithms, as well as with semiring parsing (Goodman, 1999).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 A Synchronous CKY Parser </SectionTitle> <Paragraph position="0"> Figure 3 shows Logic C. Parser C is any parser based on Logic C. As in Melamed (2003)'s Parser A, Parser C's items consist of a $D$-dimensional label vector and a $D$-dimensional d-span vector.</Paragraph> <Paragraph position="2"> The vector notation indicates the range of dimensions of a vector, e.g., a vector spanning dimensions 1 through a given index. See Melamed (2003) for definitions of cardinality, d-span, and the d-span operators.</Paragraph> <Paragraph position="3"> Parser C needs to know all the boundaries of each item, not just the outermost boundaries. Some (but not all) dimensions of an item can be inactive, denoted (), and have an empty d-span ().</Paragraph> <Paragraph position="4"> The input to Parser C is a tuple of $D$ parallel texts.</Paragraph> <Paragraph position="6"> The Goal item must span the input from the left of the first word to the right of the last word in each component $d$, $1 \le d \le D$. Thus, the Goal item must be contiguous in all dimensions.</Paragraph> <Paragraph position="7"> Parser C begins with an empty chart. The only inferences that can fire in this state are those with no antecedent items (though they can have antecedent production rules). In Logic C, the function $G_T$ gives the value that the grammar assigns to a terminal production. The range of this value depends on the semiring used. A Scan inference can fire for the $i$th word in component $d$ for every terminal production in the grammar where that word appears in the $d$th component. 
Each Scan consequent has exactly one active d-span, and that d-span always has the form $(i-1, i)$ for the $i$th word,</Paragraph> <Paragraph position="9"> because such items always span one word, so the distance between the item's boundaries is always one.</Paragraph> <Paragraph position="10"> The Compose inference in Logic C is the same as in Melamed's Parser A, using slightly different notation. In Logic C, the function $G_N$</Paragraph> <Paragraph position="12"> represents the value that the grammar assigns to the nonterminal production being applied.</Paragraph> <Paragraph position="14"> Parser C can compose two items if their labels appear on the RHS of a production rule in the grammar, and if the contiguity and relative order of their intervals are consistent with the role templates of that production rule.</Paragraph> <Paragraph position="16"> These constraints are enforced by the d-span operators.</Paragraph> <Paragraph position="17"> Parser C is conceptually simpler than the synchronous parsers of Wu (1997), Alshawi et al. (2000), and Melamed (2003), because it uses only one kind of item, and it never composes terminals. The inference rules of Logic C are the multidimensional generalizations of inference rules with the same names in ordinary CKY parsers. For example, given a suitable grammar and the input (imperative) sentence pair Wash the dishes / Pasudu moy, Parser C might make the 9 inferences in Figure 4 to infer the multitree in Figure 2. Note that there is one inference per internal node of the multitree.</Paragraph> <Paragraph position="18"> Goodman (1999) shows how a parsing logic can be combined with various semirings to compute different kinds of information about the input. Depending on the chosen semiring, a parsing logic can compute the single most probable derivation and/or its probability, the $k$ most probable derivations and/or their total probability, all possible derivations and/or their total probability, the number of possible derivations, etc. 
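As an illustration of how one logic can be paired with several semirings, here is a minimal Python sketch for ordinary one-dimensional CKY. It assumes a hypothetical ambiguous toy grammar with productions X -> X X (0.5) and X -> x (0.5). The Semiring tuple, its `lift` field (which maps a production probability into the semiring) and the function cky are assumptions of this sketch, not definitions taken from Goodman (1999).

```python
from collections import namedtuple

# plus/times with identities zero/one; lift maps a rule probability into the semiring.
Semiring = namedtuple("Semiring", "plus times zero one lift")

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True, lambda p: p > 0)
COUNTING = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1, lambda p: 1)
INSIDE = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0, lambda p: p)
VITERBI = Semiring(max, lambda a, b: a * b, 0.0, 1.0, lambda p: p)

# Hypothetical toy PCFG (an assumption of this sketch).
TERMINAL = {("X", "x"): 0.5}
NONTERMINAL = {("X", "X", "X"): 0.5}


def cky(words, sr, goal_label="X"):
    """Run the same Scan/Compose logic and return the goal item's value under `sr`."""
    n = len(words)
    chart = {}
    # Scan: one item per word and matching terminal production.
    for i, word in enumerate(words):
        for (lhs, terminal), p in TERMINAL.items():
            if terminal == word:
                key = (lhs, i, i + 1)
                chart[key] = sr.plus(chart.get(key, sr.zero), sr.lift(p))
    # Compose: combine adjacent items, shortest spans first.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (lhs, left, right), p in NONTERMINAL.items():
                    if (left, i, j) in chart and (right, j, k) in chart:
                        pair = sr.times(chart[(left, i, j)], chart[(right, j, k)])
                        value = sr.times(sr.lift(p), pair)
                        key = (lhs, i, k)
                        chart[key] = sr.plus(chart.get(key, sr.zero), value)
    return chart.get((goal_label, 0, n), sr.zero)


words = ["x", "x", "x"]
print(cky(words, BOOLEAN), cky(words, COUNTING), cky(words, INSIDE), cky(words, VITERBI))
# prints: True 2 0.0625 0.03125
```

The three-word input has two derivations, each with probability 0.5 to the fifth power, so the counting, inside and Viterbi values differ only because the semiring operations differ; the inference rules are untouched.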
All the parsing semirings catalogued by Goodman apply the same way to synchronous parsing, and to all the other classes of algorithms discussed in this paper.</Paragraph> <Paragraph position="19"> The class of synchronous parsers includes some algorithms for word alignment. A translation lexicon (weighted or not) can be viewed as a degenerate MTG (not in GCNF) where every production has a link of terminals on the RHS. Under such an MTG, the logic of word alignment is the one in Melamed (2003)'s Parser A, but without Compose inferences.</Paragraph> <Paragraph position="20"> The only other difference is that, instead of a single item, the Goal of word alignment is any set of items that covers all dimensions of the input. This logic can be used with the expectation semiring (Eisner, 2002) to find the maximum likelihood estimates of the parameters of a word-to-word translation model.</Paragraph> <Paragraph position="21"> An important application of Parser C is parameter estimation for probabilistic MTGs (PMTGs). Eisner (2002) has claimed that parsing under an expectation semiring is equivalent to the Inside-Outside algorithm for PCFGs. If so, then there is a straightforward generalization for PMTGs. Parameter estimation is beyond the scope of this paper, however.</Paragraph> <Paragraph position="22"> The next section assumes that we have an MTG, probabilistic or not, as required by the semiring.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Translation </SectionTitle> <Paragraph position="0"> A $D$-MTG can guide a synchronous parser to infer the hidden structure of a $D$-component multitext. 
Now suppose that we have a $D$-MTG and an input multitext with only $I$ components, where $D > I$.</Paragraph> <Paragraph position="2"> [Figure 4: Inferences of Parser C on input Wash the dishes / Pasudu moy.]</Paragraph> <Paragraph position="3"> When some of the component texts are missing, we can ask the parser to infer a $D$-dimensional multitree that includes the missing components.</Paragraph> <Paragraph position="4"> The resulting multitree will cover the $I$ input components/dimensions among its $D$ dimensions.</Paragraph> <Paragraph position="5"> It will also express the $D - I$ output components/dimensions, along with their syntactic structures.</Paragraph> <Paragraph position="7"> The items of Translator CT have a $D$-dimensional label vector, as usual. However, their d-span vectors are only $I$-dimensional, because it is not necessary to constrain absolute word positions in the output dimensions. Instead, we need only constrain the cardinality of the output nonterminals, which is accomplished by the role templates in the</Paragraph> <Paragraph position="9"> $G_N$ term. Translator CT scans only the input components. Terminal productions with active output components are simply loaded from the grammar, and their LHSs are added to the chart without d-span information. Composition proceeds as before, except that there are no constraints on the role templates in the output dimensions.</Paragraph> <Paragraph position="11"> In summary, Logic CT differs from Logic C as follows: • Items store no position information (d-spans) for the output components.</Paragraph> <Paragraph position="12"> • For the output components, the Scan inferences are replaced by Load inferences, which are not constrained by the input.</Paragraph> <Paragraph position="13"> • The Compose inference does not constrain the d-spans of the output components, though it still constrains their cardinality. 
We have constructed a translator from a synchronous parser merely by relaxing some constraints on the output dimensions. Logic C is just Logic CT for the special case where $I = D$. The relationship between the two classes of algorithms is easier to see from their declarative logics than it would be from their procedural pseudocode or equations. Like Parser C, Translator CT can Compose items that have no dimensions in common. If one of the items is active only in the input dimension(s), and the other only in the output dimension(s), then the inference is, de facto, a translation. The possible translations are determined by consulting the grammar. Thus, in addition to its usual function of evaluating syntactic structures, the grammar simultaneously functions as a translation model.</Paragraph> <Paragraph position="14"> Logic CT can be coupled with any parsing semiring. For example, under a boolean semiring, this logic will succeed on an $I$-dimensional input if and only if it can infer a $D$-dimensional multitree whose root is the goal item. Such a tree would contain a</Paragraph> <Paragraph position="16"> $(D - I)$-dimensional translation of the input. Thus, under a boolean semiring, Translator CT can determine whether a translation of the input exists.</Paragraph> <Paragraph position="17"> Under an inside-probability semiring, Translator CT can compute the total probability of all multitrees containing the input and its translations in the $D - I$ output components. All these derivation trees, along with their probabilities, can be efficiently represented as a packed parse forest, rooted at the goal item. Unfortunately, finding the most probable output string still requires summing probabilities over an exponential number of trees. This problem was shown to be NP-hard in the one-dimensional case (Sima'an, 1996). 
We have no reason to believe that it is any easier when $D > 1$.</Paragraph> <Paragraph position="18"> The Viterbi-derivation semiring would be the one most often used with Translator CT in practice. Given a $D$-PMTG, Translator CT can use this semiring to find the single most probable $D$-dimensional multitree that covers the $I$-dimensional input. The multitree inferred by the translator will have the words of both the input and the output components in its leaves. For example, given a suitable grammar and the input Pasudu moy, Translator CT could infer the multitree in Figure 2.</Paragraph> <Paragraph position="19"> The set of inferences would be exactly the same as those listed in Figure 4, except that the items would have no d-spans in the English component.</Paragraph> <Paragraph position="20"> In practice, we usually want the output as a string tuple, rather than as a multitree. Under the various derivation semirings (Goodman, 1999), Translator CT can store the output role templates in</Paragraph> <Paragraph position="22"> each internal node of the tree. The intended ordering of the terminals in each output dimension can be assembled from these templates by a linear-time linearization post-process that traverses the finished multitree in postorder.</Paragraph> <Paragraph position="23"> To the best of our knowledge, Logic CT is the first published translation logic to be compatible with all of the semirings catalogued by Goodman (1999), among others. It is also the first to simultaneously accommodate multiple input components and multiple output components. When a source document is available in multiple languages, a translator can benefit from the disambiguating information in each. Translator CT can take advantage of such information without making the strong independence assumptions of Och &amp; Ney (2001). 
When output is desired in multiple languages, Translator CT offers all the putative benefits of the interlingual approach to MT, including greater efficiency and greater consistency across output components. Indeed, the language of multitrees can be viewed as an interlingua.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Synchronization </SectionTitle> <Paragraph position="0"> We have explored inference of $I$-dimensional multitrees under a $D$-dimensional grammar, where $D \ge I$. Now we generalize along the other axis of Figure 1(a). Multitext synchronization is most often used to infer $I$-dimensional multitrees without the benefit of an $I$-dimensional grammar. One application is inducing a parser in one language from a parser in another (Lü et al., 2002). The application that is most relevant to this paper is bootstrapping an $I$-dimensional grammar. In theory, it is possible to induce a PMTG from multitext in an unsupervised manner. A more reliable way is to start from a corpus of multitrees, i.e., a multitreebank. We are not aware of any multitreebanks at this time. The most straightforward way to create one is to parse some multitext using a synchronous parser, such as Parser C. However, if the goal is to bootstrap an $I$-PMTG, then there is no $I$-PMTG that can evaluate the $G$ terms in the parser's logic. Our solution is to orchestrate lower-dimensional knowledge sources to evaluate the $G$ terms. 
Then, we can use the same parsing logic to synchronize multitext into a multitreebank.</Paragraph> <Paragraph position="1"> To illustrate, we describe a relatively simple synchronizer, using the Viterbi-derivation semiring. Under this semiring, a synchronizer computes the single most probable multitree for a given multitext.</Paragraph> <Paragraph position="2"> [Figure 6: Only one dependency structure (dashed arrows) is compatible with the monolingual structure (solid arrows) and word alignment (shaded cells).]</Paragraph> <Paragraph position="3"> If we have no suitable PMTG, then we can use other criteria to search for trees that have high probability. We shall consider the common synchronization scenario where a lexicalized monolingual grammar is available for at least one component.[5] Also, given a tokenized set of $I$-tuples of parallel sentences, it is always possible to estimate a word-to-word translation model.[6] A word-to-word translation model and a lexicalized monolingual grammar are sufficient to drive a synchronizer. For example, in Figure 6 a monolingual grammar has allowed only one dependency structure on the English side, and a word-to-word translation model has allowed only one word alignment. The syntactic structures of all dimensions of a multitree are isomorphic up to reordering of sibling nodes and deletion. So, given a fixed correspondence between the tree leaves (i.e. words) across components, choosing the optimal structure for one component is tantamount to choosing the optimal synchronous structure for all components.[7] Ignoring the nonterminal labels, only one dependency structure is compatible with these constraints: the one indicated by dashed arrows. 
Bootstrapping a PMTG from a lower-dimensional PMTG and a word-to-word translation model is similar in spirit to the way that regular grammars can help to estimate CFGs (Lari &amp; Young, 1990), and the way that simple translation models can help to bootstrap more sophisticated ones (Brown et al., 1993).</Paragraph> <Paragraph position="4"> [5] Such a grammar can be induced from a treebank, for example. We are currently aware of treebanks for English, Spanish, German, Chinese, Czech, Arabic, and Korean.</Paragraph> <Paragraph position="5"> [6] Although most of the literature discusses word translation models between only two languages, it is possible to combine several 2D models into a higher-dimensional model (Mann &amp; Yarowsky, 2001).</Paragraph> <Paragraph position="6"> [7] Except where the unstructured components have words that are linked to nothing.</Paragraph> <Paragraph position="7"> We need only redefine the $G$ terms in a way that does not rely on an $I$-PMTG. Without loss of generality, we shall assume a $D$-PMTG that ranges over the first $D$ components, where $I > D$. We shall then refer to the $D$ structured components and the $I - D$ unstructured components.</Paragraph> <Paragraph position="8"> We begin with $G_T$. For the structured components $d$, $1 \le d \le D$, we retain the grammar-based definition: [...] where the latter probability can be looked up in our $D$-PMTG. For the unstructured components, there are no useful nonterminal labels. Therefore, we assume that the unstructured components use only one (dummy) nonterminal label, so that [...] and continues by making independence assumptions. The first assumption is that the structured components of the production's RHS are conditionally independent of the unstructured components of [...] otherwise. 
Third, we assume that the word-to-word translation probabilities are independent of anything else: [...] terminal link on the RHS, rather than the second.</Paragraph> <Paragraph position="9"> These probabilities can be obtained from our word-to-word translation model, which would typically be estimated under exactly such an independence assumption. Finally, we assume that the output role templates are independent of each other and uniformly distributed, up to some maximum cardinality [...] and 0 otherwise. We can use these definitions of the grammar terms in the inference rules of Logic C to synchronize multitexts into multitreebanks.</Paragraph> <Paragraph position="10"> More sophisticated synchronization methods are certainly possible. For example, we could project a part-of-speech tagger (Yarowsky &amp; Ngai, 2001) to improve our estimates in Equation 6. Yet, despite their relative simplicity, the above methods for estimating production rule probabilities use all of the available information in a consistent manner, without double-counting. This kind of synchronizer stands in contrast to more ad hoc approaches (e.g., Matsumoto, 1993; Meyers, 1996; Wu, 1998; Hwa et al., 2002). Some of these previous works fix the word alignments first, and then infer compatible parse structures. Others do the opposite. Information about syntactic structure can be inferred more accurately given information about translational equivalence, and vice versa. Commitment to either kind of information without consideration of the other increases the potential for compounded errors.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Multitree-based Statistical MT </SectionTitle> <Paragraph position="0"> Multitree-based statistical machine translation (MTSMT) is an architecture for SMT that revolves around multitrees. 
Figure 7 shows how to build and use a rudimentary MTSMT system, starting from some multitext and one or more monolingual treebanks. The recipe follows: T1. Induce a word-to-word translation model.</Paragraph> <Paragraph position="1"> T2. Induce PCFGs from the relative frequencies of productions in the monolingual treebanks.</Paragraph> <Paragraph position="2"> T3. Synchronize some multitext, e.g. using the approximations in Section 5.</Paragraph> <Paragraph position="3"> T4. Induce an initial PMTG from the relative frequencies of productions in the multitreebank.</Paragraph> <Paragraph position="4"> T5. Re-estimate the PMTG parameters, using a synchronous parser with the expectation semiring. A1. Use the PMTG to infer the most probable multitree covering new input text.</Paragraph> <Paragraph position="5"> A2. Linearize the output dimensions of the multitree. Steps T2, T4 and A2 are trivial. Steps T1, T3, T5, and A1 are instances of the generalized parsers described in this paper.</Paragraph> <Paragraph position="6"> Computational complexity and generalization error stand in the way of the practical implementation of this architecture. Nevertheless, it is satisfying to note that all the non-trivial algorithms in Figure 7 are special cases of Translator CT. It is therefore possible to implement an MTSMT system using just one inference algorithm, parameterized by a grammar, a semiring, and a search strategy. An advantage of building an MT system in this manner is that improvements invented for ordinary parsing algorithms can often be applied to all the main components of the system. For example, Melamed (2003) showed how to reduce the computational complexity of a synchronous parser just by changing the logic. The same optimization can be applied to the inference algorithms in this paper. With proper software design, such optimizations need never be implemented more than once. For simplicity, the algorithms in this paper are based on CKY logic. 
However, the architecture in Figure 7 can also be implemented using generalizations of more sophisticated parsing logics, such as those inherent in Earley or Head-Driven parsers.</Paragraph> </Section> </Paper>