<?xml version="1.0" standalone="yes"?> <Paper uid="P03-2041"> <Title>Learning Non-Isomorphic Tree Mappings for Machine Translation</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction: Tree-to-Tree Mappings </SectionTitle> <Paragraph position="0"> Statistical machine translation systems are trained on pairs of sentences that are mutual translations. For example, (beaucoup d'enfants donnent un baiser `a Sam, kids kiss Sam quite often). This translation is somewhat free, as is common in naturally occurring data. The first sentence is literally Lots of'children give a kiss to Sam.</Paragraph> <Paragraph position="1"> This short paper outlines &quot;natural&quot; formalisms and algorithms for training on pairs of trees. Our methods work on either dependency trees (as shown) or phrase-structure trees. Note that the depicted trees are not isomorphic.</Paragraph> <Paragraph position="2"> Our main concern is to develop models that can align and learn from these tree pairs despite the &quot;mismatches&quot; in tree structure. Many &quot;mismatches&quot; are characteristic of a language pair: e.g., preposition insertion (of - epsilon1), multiword locutions (kiss - give a kiss to; misinform - wrongly inform), and head-swapping (float down descend by floating). Such systematic mismatches should be learned by the model, and used during translation.</Paragraph> <Paragraph position="3"> It is even helpful to learn mismatches that merely tend to arise during free translation. Knowing that beaucoup d' is often deleted will help in aligning the rest of the tree. When would learned tree-to-tree mappings be useful? Obviously, in MT, when one has parsers for both the source and target language. Systems for &quot;deep&quot; analysis and generation might wish to learn mappings between deep and surface trees (B&quot;ohmov'a et al., 2001) or between syntax and semantics (Shieber and Schabes, 1990). Systems for summarization or paraphrase could also be trained on tree pairs (Knight and Marcu, 2000).</Paragraph> <Paragraph position="4"> Non-NLP applications might include comparing studentwritten programs to one another or to the correct solution. Our methods can naturally extend to train on pairs of forests (including packed forests obtained by chart parsing). The correct tree is presumed to be an element of the forest. This makes it possible to train even when the correct parse is not fully known, or not known at all.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 A Natural Proposal: Synchronous TSG </SectionTitle> <Paragraph position="0"> We make the quite natural proposal of using a synchronous tree substitution grammar (STSG). An STSG is a collection of (ordered) pairs of aligned elementary trees. These may be combined into a derived pair of trees. Both the elementary tree pairs and the operation to combine them will be formalized in later sections.</Paragraph> <Paragraph position="1"> As an example, the tree pair shown in the introduction might have been derived by &quot;vertically&quot; assembling the 6 elementary tree pairs below. 
The slur-above symbol denotes a frontier node of an elementary tree, which must be replaced by the circled root of another elementary tree.</Paragraph> <Paragraph position="2"> If two frontier nodes are linked by a dashed line labeled with the state X, then they must be replaced by two roots that are also linked by a dashed line labeled with X.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Sam SamNP </SectionTitle> <Paragraph position="0"> The elementary trees represent idiomatic translation &quot;chunks.&quot; The frontier nodes represent unfilled roles in the chunks, and the states are effectively nonterminals that specify the type of filler that is required. Thus, donnent un baiser à (&quot;give a kiss to&quot;) corresponds to kiss, with the French subject matched to the English subject, and the French indirect object matched to the English direct object. The states could be more refined than those shown above: the state for the subject, for example, should probably be not NP but a pair (Npl, NP3s).</Paragraph> <Paragraph position="1"> STSG is simply a version of synchronous tree-adjoining grammar or STAG (Shieber and Schabes, 1990) that lacks the adjunction operation. (It is also equivalent to top-down tree transducers.) What, then, is new here? First, we know of no previous attempt to learn the &quot;chunk-to-chunk&quot; mappings. That is, we do not know at training time how the tree pair of section 1 was derived, or even what it was derived from. Our approach is to reconstruct all possible derivations, using dynamic programming to decompose the tree pair into aligned pairs of elementary trees in all possible ways. This produces a packed forest of derivations, some more probable than others. We use an efficient inside-outside algorithm to do Expectation-Maximization, reestimating the model by training on all derivations in proportion to their probabilities. The runtime is quite low when the training trees are fully specified and elementary trees are bounded in size.1 Second, it is not a priori obvious that one can reasonably use STSG instead of the slower but more powerful STAG. TSG can be parsed as fast as CFG. But without an adjunction operation,2 one cannot break the training trees into linguistically minimal units. An elementary tree pair A = (elle est finalement partie, finally she left) cannot be further decomposed into B = (elle est partie, she left) and C = (finalement, finally). This appears to miss a generalization. Our perspective is that the generalization should be picked up by the statistical model that defines the probability of elementary tree pairs. p(A) can be defined using mainly the same parameters that define p(B) and p(C), with the result that p(A) ≈ p(B)·p(C).</Paragraph> <Paragraph position="2"> The balance between the STSG and the statistical model is summarized in the last paragraph of this paper.</Paragraph> <Paragraph position="3"> Third, our version of the STSG formalism is more flexible than previous versions. We carefully address the case of empty trees, which are needed to handle free-translation &quot;mismatches.&quot; In the example, an STSG cannot replace beaucoup d' (&quot;lots of&quot;) in the NP by quite often in the VP; instead it must delete the former and insert the latter. Thus we have the alignments (beaucoup d', ε) and (ε, quite often). These require innovations.
The tree-internal deletion of beaucoup d' is handled by an empty elementary tree in which the root is itself a frontier node. (The subject frontier node of kiss is replaced with this frontier node, which is then replaced with kids.) The tree-peripheral insertion of quite often requires an English frontier node that is paired with a French null.</Paragraph> <Paragraph position="4"> We also formulate STSGs flexibly enough that they can handle both phrase-structure trees and dependency trees.</Paragraph> <Paragraph position="5"> The latter are small and simple (Alshawi et al., 2000): tree nodes are words, and there need be no other structure to recover or align. Selectional preferences and other interactions can be accommodated by enriching the states.</Paragraph> <Paragraph position="6"> Any STSG has a weakly equivalent SCFG that generates the same string pairs. So STSG (unlike STAG) has no real advantage for modeling string pairs.3 But STSGs can generate a wider variety of tree pairs, e.g., non-isomorphic ones. So when actual trees are provided for training, STSG can be more flexible in aligning them.</Paragraph> <Paragraph position="7"> 1Goodman (2002) presents efficient TSG parsing with unbounded elementary trees. Unfortunately, that clever method does not permit arbitrary models of elementary tree probabilities, nor does it appear to generalize to our synchronous case. (It would need exponentially many nonterminals to keep track of a matching of unboundedly many frontier nodes.)</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Past Work </SectionTitle> <Paragraph position="0"> Most statistical MT derives from IBM-style models (Brown et al., 1993), which ignore syntax and allow arbitrary word-to-word translation. Hence they are able to align any sentence pair, however mismatched. However, they have a tendency to translate long sentences into word salad. Their alignment and translation accuracy improves when they are forced to translate shallow phrases as contiguous, potentially idiomatic units (Och et al., 1999).</Paragraph> <Paragraph position="1"> Several researchers have tried putting &quot;more syntax&quot; into translation models: like us, they use statistical versions of synchronous grammars, which generate source and target sentences in parallel and so describe their correspondence.4 This approach offers four features absent from IBM-style models: (1) recursive phrase-based translation, (2) a syntax-based language model, (3) the ability to condition a word's translation on the translation of syntactically related words, and (4) polynomial-time optimal alignment and decoding (Knight, 1999).</Paragraph> <Paragraph position="2"> Previous work in statistical synchronous grammars has been limited to forms of synchronous context-free grammar (Wu, 1997; Alshawi et al., 2000; Yamada and Knight, 2001). This means that a sentence and its translation must have isomorphic syntax trees, although they may have different numbers of surface words if null words ε are allowed in one or both languages. This rigidity does not fully describe real data.</Paragraph> <Paragraph position="3"> The one exception is the synchronous DOP approach of Poutsma (2000), which obtains an STSG by decomposing aligned training trees in all possible ways (and using &quot;naive&quot; count-based probability estimates).
However, we would like to estimate a model from unaligned data.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 A Probabilistic TSG Formalism </SectionTitle> <Paragraph position="0"> For expository reasons (and to fill a gap in the literature), first we formally present non-synchronous TSG. Let Q be a set of states. Let L be a set of labels that may decorate nodes or edges. Node labels might be words or nonterminals. Edge labels might include grammatical roles such as Subject. In many trees, each node's children have an order, recorded in labels on the node's outgoing edges.</Paragraph> <Paragraph position="1"> An elementary tree is a tuple <V, V i, E, ℓ, q, s> where V is a set of nodes; V i ⊆ V is the set of internal nodes, and we write V f = V ∖ V i for the set of frontier nodes; E ⊆ V i × V is a set of directed edges (thus all frontier nodes are leaves). The graph <V, E> must be connected and acyclic, and there must be exactly one node r ∈ V (the root) that has no incoming edges. The function ℓ : (V i ∪ E) → L labels each internal node or edge; q ∈ Q is the root state, and s : V f → Q assigns a frontier state to each frontier node (perhaps including r).</Paragraph> <Paragraph position="2"> 4The joint probability model can be formulated, if desired, as a language model times a channel model.</Paragraph> <Paragraph position="3"> A TSG is a set of elementary trees. The generation process builds up a derived tree T that has the same form as an elementary tree, and for which V f = ∅. Initially, T is chosen to be any elementary tree whose root state T.q = Start. As long as T has any frontier nodes, T.V f, the process expands each frontier node d ∈ T.V f by substituting at d an elementary tree t whose root state, t.q, equals d's frontier state, T.s(d). This operation replaces T's components with the unions of T's and t's components, minus the substituted node d; the label and state functions are regarded here as sets of <input, output> pairs, and T.E′ is a version of T.E in which d has been replaced by t.r.</Paragraph> <Paragraph position="6"> A probabilistic TSG also includes a function p(t | q), which, for each state q, gives a conditional probability distribution over the elementary trees t with root state q.</Paragraph> <Paragraph position="7"> The generation process uses this distribution to randomly choose which tree t to substitute at a frontier node of T having state q. The initial value of T is chosen from p(t | Start). Thus, the probability of a given derivation is a product of p(t | q) terms, one per chosen elementary tree.</Paragraph> <Paragraph position="8"> There is a natural analogy between (probabilistic) TSGs and (probabilistic) CFGs. An elementary tree t with root state q and frontier states q1 ... qk (for k ≥ 0) is analogous to a CFG rule q → t q1 ... qk. (By including t as a terminal symbol in this rule, we ensure that distinct elementary trees t with the same states correspond to distinct rules.) Indeed, an equivalent definition of the generation process first generates a derivation tree from this derivation CFG, and then combines its terminal nodes t (which are elementary trees) into the derived tree T.</Paragraph> </Section>
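To make the formalism of section 4 concrete, here is a minimal Python sketch of an elementary tree and the top-down generation process. It is an illustration only, not the paper's code; the class, function, and toy-grammar names are hypothetical, and p(t | q) is represented as an explicit table.

import random
from dataclasses import dataclass

# Minimal sketch of a probabilistic TSG (hypothetical names; not the paper's code).
# An elementary tree <V, Vi, E, l, q, s>: nodes, internal nodes, directed edges,
# labels on internal nodes (edge labels omitted here), a root state q, and
# frontier states s.

@dataclass
class ElemTree:
    nodes: list           # V
    internal: set         # V i, a subset of V
    edges: list           # E: (parent, child) pairs with parent in V i
    label: dict           # l restricted to internal nodes
    root: str             # r, the unique node with no incoming edge
    root_state: str       # q
    frontier_state: dict  # s: frontier node -> state

def generate(grammar, state="Start", rng=random):
    """grammar[q] = list of (ElemTree, prob); expand frontier nodes recursively.
    Returns the derived tree as nested (label, children) tuples."""
    trees, probs = zip(*grammar[state])
    t = rng.choices(trees, weights=probs, k=1)[0]   # draw t from p(t | state)
    def build(node):
        if node in t.internal:
            return (t.label[node], [build(c) for (p, c) in t.edges if p == node])
        # frontier node: substitute a tree whose root state equals its frontier state
        return generate(grammar, t.frontier_state[node], rng)
    return build(t.root)

if __name__ == "__main__":
    # Toy grammar: Start -> a "kiss"-like chunk with one NP frontier; NP -> "Sam".
    kiss = ElemTree(["0", "1"], {"0"}, [("0", "1")], {"0": "kiss"},
                    "0", "Start", {"1": "NP"})
    sam = ElemTree(["0"], {"0"}, [], {"0": "Sam"}, "0", "NP", {})
    grammar = {"Start": [(kiss, 1.0)], "NP": [(sam, 1.0)]}
    print(generate(grammar))   # ('kiss', [('Sam', [])])

Under this representation, a derivation's probability is simply the product of the p(t | q) factors drawn along the way, exactly as in the text above.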
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Tree Parsing Algorithms for TSG </SectionTitle> <Paragraph position="0"> Given a grammar G and a derived tree T, we may be interested in constructing the forest of T's possible derivation trees (as defined above). We call this tree parsing, as it finds ways of decomposing T into elementary trees.</Paragraph> <Paragraph position="1"> Given a node c ∈ T.V, we would like to find all the potential elementary subtrees t of T whose root t.r could have contributed c during the derivation of T. Such an elementary tree is said to fit c, in the sense that it is isomorphic to some subgraph of T rooted at c.</Paragraph> <Paragraph position="2"> The following procedure finds an elementary tree t that fits c. Freely choose a connected subgraph U of T such that U is rooted at c (or is empty). Let t.V i be the vertex set of U. Let t.E be the set of outgoing edges from nodes in t.V i to their children, that is, t.E = T.E ∩ (t.V i × T.V). Let t.ℓ be the restriction of T.ℓ to t.V i ∪ t.E, that is, t.ℓ = T.ℓ ∩ ((t.V i ∪ t.E) × L). Let t.V be the set of nodes mentioned in t.E, or put t.V = {c} if t.V i = t.E = ∅. Finally, choose t.q freely from Q, and choose s : t.V f → Q to associate states with the frontier nodes of t; the free choice is because the nodes of the derived tree T do not specify the states used during the derivation.</Paragraph> <Paragraph position="3"> How many elementary trees can we find that fit c? Let us impose an upper bound k on |t.V i| and hence on |U|.</Paragraph> <Paragraph position="4"> Then in an m-ary tree T, the above procedure considers at most (m^k - 1)/(m - 1) connected subgraphs U of order ≤ k rooted at c. For dependency grammars, limiting to m ≤ 6 and k = 3 is quite reasonable, leaving at most 43 subgraphs U rooted at each node c, of which the biggest contain only c, a child c′ of c, and a child or sibling of c′. These will constitute the internal nodes of t, and their remaining children will be t's frontier nodes.</Paragraph> <Paragraph position="5"> However, for each of these 43 subgraphs, we must jointly hypothesize states for all frontier nodes and the root node. For |Q| > 1, there are exponentially many ways to do this. To avoid having exponentially many hypotheses, one may restrict the form of possible elementary trees so that the possible states of each node of t can be determined somehow from the labels on the corresponding nodes in T. As a simple but useful example, a node labeled NP might be required to have state NP. Rich labels on the derived tree essentially provide supervision as to what the states must have been during the derivation.</Paragraph> <Paragraph position="6"> The tree parsing algorithm resembles bottom-up chart parsing under the derivation CFG. But the input is a tree rather than a string, and the chart is indexed by nodes of the input tree rather than spans of the input string:5
1. for each node c of T, in bottom-up order
2.   for each q ∈ Q, let β_c(q) = 0
3.   for each elementary tree t that fits c
4.     increment β_c(t.q) by p(t | t.q) · ∏_{d ∈ t.V f} β_d(t.s(d))
The β values are inside probabilities. After running the algorithm, if r is the root of T, then β_r(Start) is the probability that the grammar generates T.</Paragraph> <Paragraph position="7"> p(t | q) in line 4 may be found by hash lookup if the grammar is stored explicitly, or else by some probabilistic model that analyzes the structure, labels, and states of the elementary tree t to compute its probability.</Paragraph> <Paragraph position="8"> One can mechanically transform this algorithm to compute outside probabilities, the Viterbi parse, the parse forest, and other quantities (Goodman, 1999). One can also apply agenda-based parsing strategies.</Paragraph>
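A minimal Python sketch of the inside (β) pass in lines 1-4 above, assuming the simplest supervised setting mentioned earlier: each fitting elementary tree consists of one internal node of T with its children as frontier, and every node's state equals its label. The function and table names are hypothetical illustrations, not the paper's code.

from collections import defaultdict

# Sketch of the bottom-up inside pass for TSG tree parsing (hypothetical code).
# Restriction for brevity: every elementary tree fitting a node c has c as its only
# internal node and c's children as its frontier; states are read off node labels.

def inside(tree, p):
    """tree: nested (label, children) tuples.
    p[(q, label, frontier_states)] = probability of that elementary tree given q.
    Returns (beta, root_id) with beta[node_id][state] = inside probability."""
    beta = {}
    counter = [0]
    def visit(node):
        label, children = node
        cid = counter[0]; counter[0] += 1
        child_ids = [visit(ch) for ch in children]
        b = defaultdict(float)
        q = label                                        # states supervised by labels
        frontier_states = tuple(ch[0] for ch in children)
        prob = p.get((q, label, frontier_states), 0.0)   # p(t | t.q), as in line 4
        if prob > 0.0:
            for ch, did in zip(children, child_ids):
                prob *= beta[did][ch[0]]                 # product of beta_d(t.s(d))
            b[q] += prob                                 # increment beta_c(t.q)
        beta[cid] = b
        return cid
    root_id = visit(tree)
    return beta, root_id

if __name__ == "__main__":
    t = ("S", [("NP", []), ("VP", [])])
    p = {("S", "S", ("NP", "VP")): 0.5, ("NP", "NP", ()): 1.0, ("VP", "VP", ()): 1.0}
    beta, r = inside(t, p)
    print(beta[r]["S"])   # 0.5: the probability that this toy grammar generates t

The same loop structure extends to larger fits and, via the mechanical transformations mentioned above, to the outside pass needed for the EM reestimation discussed next.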
<Paragraph position="9"> For a fixed grammar, the runtime and space are only O(n) for a tree of n nodes. The grammar constant is the number of possible fits to a node c of a fixed tree. As noted above, there are usually not many of these (unless the states are uncertain) and they are simple to enumerate.</Paragraph> <Paragraph position="10"> As discussed above, an inside-outside algorithm may be used to compute the expected number of times each elementary tree t appeared in the derivation of T. That is the E step of the EM algorithm. In the M step, these expected counts (collected over a corpus of trees) are used to reestimate the parameters θ of p(t | q). One alternates E and M steps till p(corpus | θ)·p(θ) converges to a local maximum. The prior p(θ) can discourage overfitting.</Paragraph> <Paragraph position="11"> 5We gloss over the standard difficulty that the derivation CFG may contain a unary rule cycle. For us, such a cycle is a problem only when it arises solely from single-node trees.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Extending to Synchronous TSG </SectionTitle> <Paragraph position="0"> We are now prepared to discuss the synchronous case.</Paragraph> <Paragraph position="1"> A synchronous TSG consists of a set of elementary tree pairs. An elementary tree pair t is a tuple <t1, t2, q, m, s>.</Paragraph> <Paragraph position="2"> Here t1 and t2 are elementary trees without state labels: we write tj = <Vj, V ij, Ej, ℓj>. q ∈ Q is the root state as before. m ⊆ V f1 × V f2 is a matching between t1's and t2's frontier nodes.6 Let m̄ denote m ∪ {(d1, null) : d1 is unmatched in m} ∪ {(null, d2) : d2 is unmatched in m}. Finally, s : m̄ → Q assigns a state to each frontier node pair or unpaired frontier node.</Paragraph> <Paragraph position="3"> In the figure of section 2, donnent un baiser à has 2 frontier nodes and kiss has 3, yielding 13 possible matchings. Note that at least one English node must remain unmatched; it still generates a full subtree, aligned with null. As before, a derived tree pair T has the same form as an elementary tree pair. The generation process is similar to before. As long as T.m̄ ≠ ∅, the process expands some node pair (d1, d2) ∈ T.m̄. It chooses an elementary tree pair t such that t.q = T.s(d1, d2). Then for each j = 1, 2, it substitutes tj at dj if non-null. (If dj is null, then t.q must guarantee that tj is the special null tree.) In the probabilistic case, we have a distribution p(t | q) just as before, but this time t is an elementary tree pair.</Paragraph> <Paragraph position="4"> Several natural algorithms are now available to us: * Training. Given an unaligned tree pair (T1, T2), we can again find the forest of all possible derivations, with expected inside-outside counts of the elementary tree pairs. This allows EM training of the p(t | q) model.</Paragraph> <Paragraph position="5"> The algorithm is almost as before. The outer loop iterates bottom-up over nodes c1 of T1; an inner loop iterates bottom-up over c2 of T2. Inside probabilities (for example) now have the form β_{c1,c2}(q). Although this brings the complexity up to O(n²), the real complication is that there can be many fits to (c1, c2). There are still not too many elementary trees t1 and t2 rooted at c1 and c2; but each (t1, t2) pair may be used in many elementary tree pairs t, since there are exponentially many matchings of their frontier nodes. Fortunately, most pairs of frontier nodes have low β values that indicate that their subtrees cannot be aligned well; pairing such nodes in a matching would result in poor global probability. This observation can be used to prune the space of matchings greatly.</Paragraph>
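As an illustration of this last point, the following sketch (hypothetical names, not the paper's code) enumerates the matchings between two frontier-node sets and shows how restricting attention to well-aligned pairs, e.g. those with high β values, prunes the space.

from itertools import combinations, permutations

# A matching between A and B is a 1-to-1 correspondence between a subset of A and a
# subset of B (footnote 6); unmatched nodes are later paired with null.

def matchings(A, B, allowed=None):
    """Yield every matching as a set of (a, b) pairs.  'allowed', if given, restricts
    which node pairs may be matched (e.g. pairs whose subtrees have high beta values)."""
    for k in range(min(len(A), len(B)) + 1):
        for subset_a in combinations(A, k):
            for subset_b in permutations(B, k):
                pairs = set(zip(subset_a, subset_b))
                if allowed is None or pairs <= allowed:
                    yield pairs

if __name__ == "__main__":
    A = ["d1", "d2"]          # e.g. the 2 frontier nodes of "donnent un baiser a"
    B = ["e1", "e2", "e3"]    # e.g. the 3 frontier nodes of "kiss"
    print(sum(1 for _ in matchings(A, B)))              # 13, as in the text
    allowed = {("d1", "e1"), ("d2", "e2")}              # pairs surviving beta pruning
    print(sum(1 for _ in matchings(A, B, allowed)))     # 4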
<Paragraph position="6"> * 1-best Alignment (if desired). This is just like training, except that we use the Viterbi algorithm to find the single best derivation of the input tree pair. This derivation can be regarded as the optimal syntactic alignment.7 6A matching between A and B is a 1-to-1 correspondence between a subset of A and a subset of B.</Paragraph> <Paragraph position="7"> 7As free-translation post-processing, one could try to match pairs of stray subtrees that could have aligned well, according to the chart, but were forced to align with null for global reasons. * Decoding. We create a forest of possible synchronous derivations (cf. Langkilde, 2000). We chart-parse T1 much as in section 5, but fitting the left side of an elementary tree pair to each node. Roughly speaking:
1. for c1 = null and then c1 ∈ T1.V, in bottom-up order
2.   for each q ∈ Q, let β_{c1}(q) = -∞
3.   for each probable t = (t1, t2, q, m, s) whose t1 fits c1
4.     max p(t | q) · ∏_{(d1,d2) ∈ m̄} β_{d1}(s(d1, d2)) into β_{c1}(q)
We then extract the max-probability synchronous derivation and return the T2 that it derives. This algorithm is essentially alignment to an unknown tree T2; we do not loop over its nodes c2, but choose t2 freely.</Paragraph> </Section> </Paper>
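As a final illustration, here is a heavily simplified Python sketch of the decoding recurrence of section 6 (hypothetical rule inventory and names, not the paper's code): each rule's left side is a single T1 node with its children as frontier, its right side is an output template, and the max-probability output is built bottom-up.

# Heavily simplified sketch of the decoding recurrence (hypothetical rules and names).
# Each rule's left side is a single T1 node plus its children as frontier; its right
# side is an output template whose slots {0}, {1}, ... are filled by the children's
# translations (possibly reordered, dropped, or surrounded by inserted material).

def decode(tree, rules):
    """tree: nested (label, children); rules[label] = list of (prob, template).
    Returns (best probability, best output string), i.e. the T2 the derivation yields."""
    label, children = tree
    kids = [decode(ch, rules) for ch in children]
    best = (0.0, None)
    for prob, template in rules.get(label, []):
        score = prob
        for kid_prob, _ in kids:
            score *= kid_prob                      # product over matched frontier nodes
        out = template.format(*[kid_out for _, kid_out in kids])
        if score > best[0]:
            best = (score, " ".join(out.split()))  # "max ... into beta_c1(q)"
    return best

if __name__ == "__main__":
    # French-style dependency tree: donnent(enfants(beaucoup d'), baiser(un), a(Sam))
    t = ("donnent", [("enfants", [("beaucoup d'", [])]),
                     ("baiser", [("un", [])]),
                     ("a", [("Sam", [])])])
    rules = {
        "donnent":     [(0.6, "{0} kiss {2} quite often")],  # multiword chunk + insertion
        "enfants":     [(1.0, "kids {0}")],
        "beaucoup d'": [(1.0, "")],                          # tree-internal deletion
        "baiser":      [(1.0, "")],                          # absorbed into "kiss"
        "un":          [(1.0, "")],
        "a":           [(1.0, "{0}")],                       # preposition deleted
        "Sam":         [(1.0, "Sam")],
    }
    print(decode(t, rules))   # (0.6, 'kids kiss Sam quite often')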