<?xml version="1.0" standalone="yes"?> <Paper uid="P95-1033"> <Title>An Algorithm for Simultaneously Bracketing Parallel Texts by Aligning Words</Title> <Section position="4" start_page="0" end_page="244" type="intro"> <SectionTitle> 2 Inversion-Invariant Transduction Grammars </SectionTitle> <Paragraph position="0"> A transduction grammar is a bilingual model that generates two output streams, one for each language. The usual view of transducers as having one input stream and one output stream is more appropriate for restricted or deterministic finite-state machines. Although finite-state transducers have been well studied, they are insufficiently powerful for bilingual models. The models we consider here are non-deterministic models in which the two languages play symmetric roles.</Paragraph> <Paragraph position="1"> We begin by generalizing transduction to context-free form. In a context-free transduction grammar, terminal symbols come in pairs that are emitted to separate output streams. It follows that each rewrite rule emits not one but two streams, and that every non-terminal stands for a class of derivable substring pairs. For example, in the rewrite rule A → B x/y C z/ε the terminal symbols x and z are symbols of the language L1 and are emitted on stream 1, while the terminal symbol y is a symbol of the language L2 and is emitted on stream 2. This rule implies that x/y must be a valid entry in the translation lexicon. A matched terminal symbol pair such as x/y is called a couple. As a special case, the null symbol ε in either language means that no output token is generated. We call a symbol pair such as x/ε an L1-singleton, and ε/y an L2-singleton.</Paragraph> <Paragraph position="2"> We can employ context-free transduction grammars in simple attempts at generative models for bilingual sentence pairs. 
For example, pretend for the moment that the simple transduction grammar shown in Figure 1 is a context-free transduction grammar, ignoring the ↔ symbols that are in place of the usual → symbols. This grammar generates the following example pair of English and Chinese sentences in translation: (1) a. [I [[took [a book]NP ]VP [for you]PP ]VP ]S b. [□ [[□ [□]NP ]VP [□]PP ]VP ]S Each instance of a non-terminal here actually derives two substrings, one in each of the sentences; these two substrings are translation counterparts. This suggests writing the parse trees together: (2) [I/□ [[took/□ [a/□ ε/□ book/□]NP ]VP [for/□ you/□]PP ]VP ]S The problem with context-free transduction grammars is that, just as with finite-state transducers, both sentences in a translation pair must share exactly the same grammatical structure (except for optional words that can be handled with lexical singletons). For example, the following sentence pair with a perfectly valid, alternative Chinese translation cannot be generated: (3) a. [I [[took [a book]NP ]VP [for you]PP ]VP ]S b. [□ [[□]PP [□ [□]NP ]VP ]VP ]S We introduce the device of an inversion-invariant transduction grammar (IITG) to get around the inflexibility of context-free transduction grammars. Productions are interpreted as rewrite rules just as with context-free transduction grammars, with one additional proviso: when generating output for stream 2, the constituents on a rule's right-hand side may be emitted either left-to-right (as usual) or right-to-left (in inverted order). We use ↔ instead of → to indicate this. 
Note that inversion is permitted at any level of rule expansion.</Paragraph> <Paragraph position="3"> With this simple proviso, the transduction grammar of Figure 1 straightforwardly generates sentence-pair (3).</Paragraph> <Paragraph position="4"> However, the IITG's weakened ordering constraints now also permit the following sentence pairs, where some constituents have been reversed: (4) a. *[I [[for you]PP [[a book]NP took]VP ]VP ]S b. [□ [[□]PP [□ [□]NP ]VP ]VP ]S (5) a. *[[[you for]PP [[a book]NP took]VP ]VP I]S b. *[□ [[□]PP [[□]NP □]VP ]VP ]S As a bilingual generative linguistic theory, therefore, IITGs are not well-motivated (at least for most natural language pairs), since the majority of constructs do not have freely reversible constituents.</Paragraph> <Paragraph position="5"> We refer to the direction of a production's L2 constituent ordering as an orientation. It is sometimes useful to explicitly designate one of the two possible orientations when writing productions. We do this by distinguishing two varieties of concatenation operators on string-pairs, depending on the orientation. The operator [] performs the &quot;usual&quot; pairwise concatenation, so that [A B] yields the string-pair (C1, C2) where C1 = A1 B1 and C2 = A2 B2. But the operator ⟨⟩ concatenates constituents on output stream 1 while reversing them on stream 2, so that C1 = A1 B1 but C2 = B2 A2. For example, the NP ↔ Det Class NN rule in the transduction grammar above actually expands to two standard rewrite rules:</Paragraph> </Section> </Paper>
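<!--
Editorial sketch, not part of the original paper. The two orientations of string-pair concatenation described in the last paragraph can be illustrated with a few lines of Python. Everything here is an assumption made for illustration: the names straight and inverted, the representation of a string-pair as a tuple of two token lists, and the placeholder tokens (a1, a2, and so on), which are not real words from either language. A null symbol on one stream is modelled as an empty token list.

```python
from typing import List, Tuple

# A string-pair: (tokens emitted on stream 1, tokens emitted on stream 2).
StringPair = Tuple[List[str], List[str]]


def straight(*parts: StringPair) -> StringPair:
    """Straight orientation: concatenate left to right on both streams."""
    s1 = [tok for p in parts for tok in p[0]]
    s2 = [tok for p in parts for tok in p[1]]
    return s1, s2


def inverted(*parts: StringPair) -> StringPair:
    """Inverted orientation: left to right on stream 1, right to left on stream 2."""
    s1 = [tok for p in parts for tok in p[0]]
    s2 = [tok for p in reversed(parts) for tok in p[1]]
    return s1, s2


# Hypothetical constituents A and B; the tokens are placeholders only.
A = (["a1"], ["a2"])
B = (["b1"], ["b2"])
# A hypothetical L2 singleton: nothing on stream 1, one token on stream 2.
C = ([], ["c2"])

print(straight(A, B))     # (['a1', 'b1'], ['a2', 'b2'])
print(inverted(A, B))     # (['a1', 'b1'], ['b2', 'a2'])
print(inverted(A, C, B))  # (['a1', 'b1'], ['b2', 'c2', 'a2'])
```

Only the order of whole constituents on stream 2 changes under inversion; the tokens inside each constituent keep whatever order that constituent's own derivation gave them, which is why inversion can be applied independently at every level of rule expansion.
-->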