File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/p98-2230_metho.xml
Size: 16,230 bytes
Last Modified: 2025-10-06 14:15:01
Machine Translation with a Stochastic Grammatical Channel
Dekai Wu and Hongsing Wong, HKUST

3 A SITG Channel Model

The translation channel we propose is based on the recently introduced bilingual language modeling approach. The model employs a stochastic version of an inversion transduction grammar, or ITG (Wu, 1995c; Wu, 1995d; Wu, 1997). This formalism was originally developed for the purpose of parallel corpus annotation, with applications for bracketing, alignment, and segmentation. Subsequently, a method was developed to use a special case of the ITG, the aforementioned BTG, for the translation task itself (Wu, 1996). The next few paragraphs briefly review the main properties of ITGs, before we describe the SITG channel.

An ITG consists of context-free productions whose terminal symbols come in couples, for example x/y, where x is an English word and y is a Chinese translation of x, with singletons of the form x/e or e/y representing function words that are used in only one of the languages. Any parse tree thus generates both English and Chinese strings simultaneously. Thus, the tree:

(1) [I/我 [[took/拿了 [a/一 e/本 book/书]NP ]VP [for/给 you/你]PP ]VP ]S

produces, for example, the mutual translations:

(2) a. [我 [[拿了 [一本书]NP ]VP [给你]PP ]VP ]S
    b. [I [[took [a book]NP ]VP [for you]PP ]VP ]S

An additional mechanism accommodates a conservative degree of word-order variation between the two languages. With each production of the grammar is associated either a straight orientation or an inverted orientation, respectively denoted as follows:

VP → [VP PP]
VP → (VP PP)

In the case of a production with straight orientation, the right-hand-side symbols are visited left-to-right for both the English and Chinese streams. But for a production with inverted orientation, the right-hand-side symbols are visited left-to-right for English and right-to-left for Chinese. Thus, the tree:

(3) [I/我 ([took/拿了 [a/一 e/本 book/书]NP ]VP [for/给 you/你]PP )VP ]S

produces translations with different word order:

(4) a. [I [[took [a book]NP ]VP [for you]PP ]VP ]S
    b. [我 [[给你]PP [拿了 [一本书]NP ]VP ]VP ]S

The surprising ability of ITGs to accommodate nearly all word-order variation between fixed-word-order languages² (English and Chinese in particular) has been analyzed mathematically, linguistically, and experimentally (Wu, 1995b; Wu, 1997). Any ITG can be transformed to an equivalent binary-branching normal form.

² With the exception of higher-order phenomena such as neg-raising and wh-movement.

A stochastic ITG associates a probability with each production. It follows that a SITG assigns a probability Pr(e, c, q) to every generable tree q and sentence pair.
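As an illustration of how this probability decomposes, the following is a minimal sketch (not the authors' implementation) of scoring a normal-form SITG parse tree as a product of production and lexical-couple probabilities; the Node class and the rule_prob/lex_prob callables are assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Node:
    """A node of a SITG parse tree in binary normal form."""
    category: str                              # e.g. "VP"
    orientation: Optional[str] = None          # "[]" straight, "()" inverted, None at leaves
    children: Tuple["Node", ...] = ()
    couple: Optional[Tuple[str, str]] = None   # (English, Chinese); "e" marks a singleton side

def tree_log_prob(node: Node, rule_prob, lex_prob) -> float:
    """Return log Pr(e, c, q): the product of the probabilities of every
    production and lexical couple used in the tree q."""
    if node.couple is not None:                # leaf couple: x/y, x/e, or e/y
        return math.log(lex_prob(node.category, node.couple))
    child_cats = tuple(child.category for child in node.children)
    p = rule_prob(node.category, node.orientation, child_cats)
    return math.log(p) + sum(tree_log_prob(child, rule_prob, lex_prob)
                             for child in node.children)
```

Summing such scores over all trees for a sentence pair would give the marginal channel probability; the Viterbi approximation below keeps only the single best tree.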
In principle it can be used as the translation channel model by normalizing with Pr(c) and integrating out Pr(q) to give Pr(e | c) in Equation (1). In practice, a strong language model makes this unnecessary, so we can instead optimize the simpler Viterbi approximation

ĉ = argmax_{c,q} Pr(e, c, q)    (2)

To complete the picture we add a bigram model g(cj | cj-1) for the Chinese language model Pr(c).

This approach was used for the SBTG channel (Wu, 1996), using the language-independent bracketing degenerate case of the SITG:³

A → [A A]
A → (A A)
A → x/y    for all x, y lexical translations (probability b(x/y))
A → x/e    for all x in the language 1 vocabulary (probability b(x/e))
A → e/y    for all y in the language 2 vocabulary (probability b(e/y))

In the proposed model, a structured language-dependent ITG is used instead.

³ Wu (1996) experimented with Chinese-English translation, while this paper experiments with English-Chinese translation.

4 A Grammatical Channel Model

Stated radically, our novel modeling thesis is that a mirrored version of the target language grammar can parse sentences of the source language.

Ideally, an ITG would be tailored for the desired source and target languages, enumerating the transduction patterns specific to that language pair. Constructing such an ITG, however, requires massive manual effort for each language pair. Instead, our approach is to take a more readily acquired monolingual context-free grammar for the target language, and use (or perhaps misuse) it in the SITG channel, by employing the three tactics described below: production mirroring, part-of-speech mapping, and word skipping.

In the following, keep in mind our convention that language 1 is the source (English), while language 2 is the target (Chinese).

4.1 Production Mirroring

The first step is to convert the monolingual Chinese CFG to a bilingual ITG. The production mirroring tactic simply doubles the number of productions, transforming every monolingual production into two bilingual productions,⁴ one straight and one inverted, as for example in Figure 1, where the upper Chinese CFG becomes the lower ITG.

⁴ Except for unary productions, which yield only one bilingual production.

The intent of the mirroring is to add enough flexibility to allow parsing of English sentences using the language 1 side of the ITG. The extra productions accommodate reversed subconstituent order in the source language's constituents, while at the same time restricting the language 2 output sentence to conform to the given target grammar whether straight or inverted productions are used.

The following example illustrates how production mirroring works. Consider input sentence (5), He is the son of Stephen, which can be parsed by the ITG of Figure 1 to yield the corresponding output sentence 他是斯蒂芬的儿子.

If the target CFG is purely binary branching, then the previous theoretical and linguistic analyses (Wu, 1997) suggest that much of the requisite constituent and word order transposition may be accommodated without change to the mirrored ITG. On the other hand, if the target CFG contains productions with long right-hand sides, then merely inverting the subconstituent order will probably be insufficient. In such cases, a more complex transformation heuristic would be needed.
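To make the doubling operation concrete, here is a minimal sketch of production mirroring (illustrative code, not the authors' implementation); the tuple representation of productions is an assumption.

```python
from typing import List, Tuple

# A monolingual CFG production, e.g. ("VP", ("PP", "VP")).
Production = Tuple[str, Tuple[str, ...]]
# A bilingual ITG production additionally carries an orientation flag.
ITGProduction = Tuple[str, Tuple[str, ...], str]   # "straight" or "inverted"

def mirror_grammar(cfg: List[Production]) -> List[ITGProduction]:
    """Double every non-unary production of the target-language CFG.
    The inverted copy has its right-hand side written in reverse: because an
    inverted production reads its Chinese side right-to-left, the Chinese side
    stays identical to the original monolingual production while the English
    side may appear in the opposite order."""
    itg: List[ITGProduction] = []
    for lhs, rhs in cfg:
        itg.append((lhs, rhs, "straight"))                       # A -> [B C ...]
        if len(rhs) > 1:                                         # unary rules are not doubled
            itg.append((lhs, tuple(reversed(rhs)), "inverted"))  # A -> (... C B)
    return itg
```

For example, a Chinese production VP → PP VP yields the straight production VP → [PP VP] and the inverted production VP → (VP PP); the inverted copy lets the English-side constituents be parsed in VP PP order while the Chinese output keeps the grammatical PP VP order.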
Objective 3 (improving grammaticality of the output) can be directly tackled by using a tight target grammar. To see this, consider using a mirrored Chinese CFG to parse English sentences with the language 1 side of the ITG. Any resulting parse tree must be consistent with the original Chinese grammar. This follows from the fact that both the straight and inverted versions of a production have language 2 (Chinese) sides identical to the original monolingual production: inverting production orientation cancels out the mirroring of the right-hand-side symbols. Thus, the output grammaticality depends directly on the tightness of the original Chinese grammar.

In principle, with this approach a single target grammar could be used for translation from any number of other (fixed word-order) source languages, so long as a translation lexicon is available for each source language.

Probabilities on the mirrored ITG cannot be reliably estimated from bilingual data without a very large parallel corpus. A straightforward approximation is to employ EM or Viterbi training on just a monolingual target language (Chinese) corpus.

4.2 Part-of-Speech Mapping

The second problem is that the part-of-speech (PoS) categories used by the target (Chinese) grammar do not correspond to the source (English) words when the source sentence is parsed. It is unlikely that any English lexicon will list Chinese parts of speech.

We employ a simple part-of-speech mapping technique that allows the PoS tag of any corresponding word in the target language (as found in the translation lexicon) to serve as a proxy for the source word's PoS. The word view, for example, may be tagged with the Chinese tags nc and vn, since the translation lexicon holds both a noun translation of view tagged nc and a verb translation of view tagged vn.

Unknown English words must be handled differently, since they cannot be looked up in the translation lexicon. The English PoS tag is first found by tagging the English sentence. A set of possible corresponding Chinese PoS tags is then found by table lookup (using a small hand-constructed mapping table). For example, NN may map to nc, loc, and pref, while VB may map to vi, vn, vp, vv, vs, etc. This method generates many hypotheses and should only be used as a last resort.
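The lookup-then-fallback behaviour can be sketched as follows (illustrative code; the lexicon entries and table contents shown are invented examples, not the actual resources used in the paper).

```python
from typing import Dict, Set, Tuple

# Translation lexicon: English word -> set of (Chinese translation, Chinese PoS tag).
TRANSLATION_LEXICON: Dict[str, Set[Tuple[str, str]]] = {
    "view": {("景色", "nc"), ("看", "vn")},      # invented entries for illustration
}

# Small hand-constructed English PoS -> Chinese PoS mapping table (fallback only).
POS_MAPPING_TABLE: Dict[str, Set[str]] = {
    "NN": {"nc", "loc", "pref"},
    "VB": {"vi", "vn", "vp", "vv", "vs"},
}

def chinese_pos_tags(english_word: str, english_pos: str) -> Set[str]:
    """Prefer the PoS tags of the word's lexicon translations; for unknown
    words fall back to the hand-constructed table, which generates many
    hypotheses and is therefore a last resort."""
    entries = TRANSLATION_LEXICON.get(english_word)
    if entries:
        return {tag for _, tag in entries}
    return POS_MAPPING_TABLE.get(english_pos, set())
```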
4.3 Word Skipping

Regardless of how constituent-order transposition is handled, some function words simply do not occur in both languages, for example Chinese aspect markers. This is the rationale for the singletons mentioned in Section 3.

If we create an explicit singleton hypothesis for every possible input word, the resulting search space will be too large. To recognize singletons, we instead borrow the word-skipping technique from speech recognition and robust parsing. As formalized in the next section, we can do this by modifying the item extension step in our chart-parser-like algorithm. When the dot of an item is in the rightmost position, the item is a completed constituent, a subtree, and can be used to extend other items. In ordinary chart parsing, the valid subtrees that can extend an item are those located immediately to the right of the item's dot position whose category equals the category the item anticipates.

If word skipping is used, the valid subtrees may instead be located a few positions to the right (or, for an item corresponding to an inverted production, to the left) of the item's dot position. In other words, the words between the dot position and the start of the subtree are skipped and treated as singletons.

Consider Sentence (5) again. Word skipping handles the determiner the, which has no Chinese counterpart. At a certain point during translation, we have the item VP → [is/是]V • NP. With word skipping, it can be extended to VP → [is/是]V NP • by the subtree (son/儿子 of/的 Stephen/斯蒂芬)NP, even though the subtree is not adjacent (but within a certain distance; see Section 5) to the item's dot position. The the adjacent to the item's dot position is skipped.

Word skipping gives us the flexibility to parse the source input while skipping possible singletons, whenever doing so allows the source input to be parsed with the highest likelihood and grammatical output to be produced.
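A minimal sketch of the modified item-extension test (illustrative code, not the authors' implementation; the Subtree and Anticipation records are assumptions). Only the straight, rightward case is shown; an item built from an inverted production would look to the left instead.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Subtree:
    category: str   # e.g. "NP"
    start: int      # English token span, 0-based, end exclusive
    end: int

@dataclass
class Anticipation:
    lhs: str                 # e.g. "VP"
    rhs: Tuple[str, ...]     # anticipated right-hand-side categories
    dot: int                 # number of right-hand-side symbols already matched
    start: int               # English span covered so far
    end: int

def can_extend(item: Anticipation, sub: Subtree, k: int) -> bool:
    """A subtree of the anticipated category may extend the item if it starts
    at the dot position or up to k words to its right; the intervening words
    are skipped and treated as singletons."""
    if item.dot >= len(item.rhs) or item.rhs[item.dot] != sub.category:
        return False
    gap = sub.start - item.end      # words between the dot position and the subtree
    return 0 <= gap <= k

def extend(item: Anticipation, sub: Subtree) -> Anticipation:
    """Move the dot over the matched category and absorb the subtree's span."""
    return Anticipation(item.lhs, item.rhs, item.dot + 1, item.start, sub.end)
```

For instance, with k = 1 the item covering is (positions 1-2 of He is the son of Stephen) can absorb an NP subtree starting at position 3, skipping the determiner the at position 2.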
5 Translation Algorithm

The translation search algorithm differs from that of Wu's SBTG model in that it handles arbitrary grammars rather than binary bracketing grammars. As such it is more similar to active chart parsing (Earley, 1970) than to CYK parsing (Kasami, 1965; Younger, 1967). We take the standard notion of items (Aho and Ullman, 1972), and use the term anticipation to mean an item that still has symbols right of its dot. Items that have no symbols right of the dot are called subtrees.

As with Wu's SBTG model, the algorithm maximizes a probabilistic objective function, Equation (2), using dynamic programming similar to that for HMM recognition (Viterbi, 1967). The presence of the bigram model in the objective function necessitates indexes in the recurrence not only on subtrees over the source English string, but also on the delimiting words of the target Chinese substrings.

The dynamic programming exploits a recursive formulation of the objective function as follows. Some notation remarks: es..t denotes the subsequence of English tokens es+1, es+2, ..., et. We use C(s..t) to denote the set of Chinese words that are translations of the English word created by taking all tokens in es..t together. C(s, t) denotes the set of Chinese words that are translations of any of the English words anywhere within es..t. K is the maximum number of consecutive English words that can be skipped. Finally, the argmax operator is generalized to vector notation to accommodate multiple indices.

1. Initialization

2. Recursion
For all r, s, t, u, v such that r is the category of a constituent spanning s to t:

Assuming the number of translations per word is bounded by some constant, the maximum size of C(s, t) is proportional to t - s. The asymptotic time complexity of our algorithm is thus bounded by O(T^7). However, note that in theory the complexity upper bound rises exponentially rather than polynomially with the size of the grammar, just as for context-free parsing (Barton et al., 1987), whereas this is not a problem for Wu's SBTG algorithm. In practice, natural language grammars are usually sufficiently constrained that speed is actually improved over the SBTG algorithm, as discussed later.

The dynamic programming is efficiently implemented by an active-chart-parser-style agenda-based algorithm, sketched as follows:

1. Initialization
For each word in the input sentence, put a subtree with category equal to the PoS of its translation into the agenda.

2. Recursion
Loop while the agenda is not empty:
(a) If the current item is a subtree of category X, extend existing anticipations by calling ANTICIPATIONEXTENSION. For each rule in the grammar of the form Z → X W ... Y, add an initial anticipation of the form Z → X • W ... Y and put it into the agenda. Add subtree X to the chart.
(b) If the current item is an anticipation of the form Z → W ... • X ... Y from s0 to t0, find all subtrees in the chart with category X that start at position t0 and use each subtree to extend this anticipation by calling ANTICIPATIONEXTENSION.

ANTICIPATIONEXTENSION: Assuming the subtree we found is of category X from position s1 to t, for any anticipation of the form Z → W ... • X ... Y from s0 to [s1 - K, s1], extend it to Z → W ... X • ... Y with span from s0 to t and add it to the agenda.

3. Reconstruction
The output string is recursively reconstructed from the highest-likelihood subtree, with category S, that spans the whole input sentence.
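A compact sketch of the agenda loop described above (illustrative only; the probability bookkeeping with Chinese delimiting words, the inverted-orientation items, and duplicate-item merging are all omitted so that only the chart/agenda control flow is visible).

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple

Rule = Tuple[str, Tuple[str, ...]]                           # Z -> W ... Y
Subtree = Tuple[str, int, int]                               # (category, start, end)
Anticipation = Tuple[str, Tuple[str, ...], int, int, int]    # (lhs, rhs, dot, start, end)

def agenda_parse(words: List[str],
                 pos_of: Callable[[str], Iterable[str]],
                 rules: List[Rule],
                 K: int = 1) -> List[Subtree]:
    """Skeleton of the agenda-based search: lexical subtrees seed the agenda,
    subtrees extend anticipations (allowing up to K skipped words), and fully
    matched anticipations become new subtrees."""
    agenda: deque = deque()
    subtrees: List[Subtree] = []              # the chart of completed subtrees
    anticipations: List[Anticipation] = []

    def try_extend(ant: Anticipation, sub: Subtree) -> None:
        lhs, rhs, dot, s0, t0 = ant
        cat, s1, t = sub
        # Word skipping: the subtree may start up to K words right of the dot.
        if dot < len(rhs) and rhs[dot] == cat and 0 <= s1 - t0 <= K:
            agenda.append(("ANT", (lhs, rhs, dot + 1, s0, t)))

    # 1. Initialization: one subtree per word, under the PoS tags of its translations.
    for i, word in enumerate(words):
        for cat in pos_of(word):
            agenda.append(("SUB", (cat, i, i + 1)))

    # 2. Recursion
    while agenda:
        kind, item = agenda.popleft()
        if kind == "SUB":                     # (a) the current item is a subtree
            for ant in anticipations:
                try_extend(ant, item)
            cat, s, t = item
            for lhs, rhs in rules:            # start anticipations Z -> X . W ... Y
                if rhs[0] == cat:
                    agenda.append(("ANT", (lhs, rhs, 1, s, t)))
            subtrees.append(item)
        else:                                 # (b) the current item is an anticipation
            lhs, rhs, dot, s0, t0 = item
            if dot == len(rhs):               # fully matched: it has become a subtree
                agenda.append(("SUB", (lhs, s0, t0)))
                continue
            anticipations.append(item)
            for sub in subtrees:
                try_extend(item, sub)

    # 3. Reconstruction would pick the highest-likelihood S subtree spanning the input;
    #    here we simply return every subtree covering the whole sentence.
    return [st for st in subtrees if st[1] == 0 and st[2] == len(words)]
```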