<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3104">
  <Title>Quasi-Synchronous Grammars: Alignment by Soft Projection of Syntactic Dependencies</Title>
  <Section position="3" start_page="23" end_page="26" type="metho">
    <SectionTitle>
2 Quasi-Synchronous Grammar
</SectionTitle>
    <Paragraph position="0"> Given an input S1 or its parse T1, a quasi-synchronous grammar (QG) constructs a monolingual grammar for parsing, or generating, the possible translations S2--that is, a grammar for finding appropriate trees T2. What ties this target-language grammar to the source-language input? The grammar provides for target-language words to take on  tion that allowed subtrees of the source tree to be reused in generating a target tree. In order to preserve dynamic programming constraints, theidentityoftheclonedsubtreeischosenindependently of its insertion point. This breakage of monotonic tree alignment moves Gildea's alignment model from synchronous to quasi-synchronous.</Paragraph>
    <Paragraph position="1">  is the target language. Tschernobyl depends on k&amp;quot;onnte even though their English analogues are not in a dependency relationship. Note the parser's error in not attaching etwas to sp&amp;quot;ater. German: Tschernobyl k&amp;quot;onnte dann etwas sp&amp;quot;ater an die Reihe kommen . Literally: Chernobyl could then somewhat later on the queue come. English: Then we could deal with Chernobyl some time later .</Paragraph>
    <Paragraph position="2">  of bekommen instead of the verb itself.</Paragraph>
    <Paragraph position="3"> German: Auf diese Frage habe ich leider keine Antwort bekommen . Literally: To this question have I unfortunately no answer received. English: I did not unfortunately receive an answer to this question .  multiple hidden &amp;quot;senses,&amp;quot; which correspond to (possibly empty sets of) word tokens in S1 or nodes in T1. To take a familiar example, when parsing the English side of a French-English bitext, the word bank might have the sense banque (financial) in one sentence and rive (littoral) in another.</Paragraph>
    <Paragraph position="4"> TheQG4 considersthe&amp;quot;sense&amp;quot;oftheformerbank token to be a pointer to the particular banque token to which it aligns. Thus, a particular assignment of S1 &amp;quot;senses&amp;quot; to word tokens in S2 encodes a word alignment.</Paragraph>
    <Paragraph position="5"> Now, selectional preferences in the monolingual grammar can be influenced by these T1-specific senses. So they can encode preferences for how T2 ought to copy the syntactic structure of T1. For example, if T1 contains the phrase banque nationale, then the QG for generating a corresponding T2 may encourage any T2 English noun whose sense is banque (more precisely, T1's token of banque) to generate an adjectival English modifier with sense nationale. The exact probability of this, as well as the likely identity and position of that English modifier (e.g., national bank), may also be influenced by monolingual facts about English.</Paragraph>
    <Section position="1" start_page="25" end_page="25" type="sub_section">
      <SectionTitle>
2.1 Definition
</SectionTitle>
      <Paragraph position="0"> A quasi-synchronous grammar is a monolingual grammar that generates translations of a source-language sentence. Each state of this monolingual grammar is annotated with a &amp;quot;sense&amp;quot;--a set of zero or more nodes from the source tree or forest.</Paragraph>
      <Paragraph position="1"> For example, consider a quasi-synchronous context-free grammar (QCFG) for generating translations of a source tree T1. The QCFG generates the target sentence using nonterminals from the cross product U x 2V1, where U is the set of monolingual target-language nonterminals such as NP, and V1 is the set of nodes in T1.</Paragraph>
      <Paragraph position="2"> Thus, a binarized QCFG has rules of the form</Paragraph>
      <Paragraph position="4"> where A,B,C [?] U are ordinary target-language nonterminals, a,b,g [?] 2V1 are sets of source tree 4By abuse of terminology, we often use &amp;quot;QG&amp;quot; to refer to the T1-specific monolingual grammar, although the QG is properly a recipe for constructing such a grammar from any input T1.</Paragraph>
      <Paragraph position="5"> nodes to which A,B,C respectively align, and w is a target-language terminal.</Paragraph>
      <Paragraph position="6"> Similarly, a quasi-synchronous tree-substitution grammar (QTSG) annotates the root and frontier nodes of its elementary trees with sets of source nodes from 2V1.</Paragraph>
    </Section>
    <Section position="2" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
2.2 Taming Source Nodes
</SectionTitle>
      <Paragraph position="0"> This simple proposal, however, presents two main difficulties. First, the number of possible senses for each target node is exponential in the number of source nodes. Second, note that the senses are sets of source tree nodes, not word types or absolute sentence positions as in some other translation models.</Paragraph>
      <Paragraph position="1"> Except in the case of identical source trees, source tree nodes will not recur between training and test.</Paragraph>
      <Paragraph position="2">  Toovercomethefirstproblem,wewantfurtherrestrictionsonthesetainaQGstatesuchas&lt;A,a&gt; . It shouldnotbeanarbitrarysetofsourcenodes. Inthe experiments of this paper, we adopt the simplest option of requiring |a |[?] 1. Thus each node in the target tree is aligned to a single node in the source tree, orto[?](thetraditional NULL alignment). Thisallows one-to-many but not many-to-one alignments.</Paragraph>
      <Paragraph position="3"> To allow many-to-many alignments, one could limit |a |to at most 2 or 3 source nodes, perhaps further requiring the 2 or 3 source nodes to fall in a particular configuration within the source tree, such as child-parent or child-parent-grandparent. With that configurational requirement, the number of possible senses a remains small--at most three times the number of source nodes.</Paragraph>
      <Paragraph position="4"> We must also deal with the menagerie of different source tree nodes in different sentences. In other words,howcanwetietheparametersofthedifferent QGs that are used to generate translations of different source sentences? The answer is that the probability or weight of a rule such as (2) should depend on the specific nodes in a, b, and g only through their properties--e.g., their nonterminal labels, their head words, and their grammatical relationship in the source tree. Such properties do recur between training and test.</Paragraph>
      <Paragraph position="5"> For example, suppose for simplicity that |a |=</Paragraph>
      <Paragraph position="7"> and (3) could be log-linearly modeled using features that ask whether the single node in a has two children in the source tree; whether its children in the  source are the nodes in b and g; whether its non-terminal label in the source is A; whether its fringe in the source translates as w; and so on. The model shouldalsoconsidermonolingualfeaturesof(2)and (3), evaluating in particular whether A - BC is likely in the target language.</Paragraph>
      <Paragraph position="8"> Whether rule weights are given by factored generative models or by naive Bayes or log-linear models, we want to score QG productions with a small set of monolingual and bilingual features.</Paragraph>
    </Section>
    <Section position="3" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
2.3 Synchronous Grammars Again
</SectionTitle>
      <Paragraph position="0"> Finally, note that synchronous grammar is a special case of quasi-synchronous grammar. In the context-free case, a synchronous grammar restricts senses to single nodes in the source tree and the NULL node.</Paragraph>
      <Paragraph position="1"> Further, for any k-ary production  &lt;X0,a0&gt; - &lt;X1,a1&gt; ...&lt;Xk,ak&gt; a synchronous context-free grammar requires that 1. ([?]i negationslash= j) ai negationslash= aj unless ai = NULL, 2. ([?]i &gt; 0) ai is a child of a0 in the source tree,  unless ai = NULL.</Paragraph>
      <Paragraph position="2"> Since NULL has no children in the source tree, these rules imply that the children of any node aligned to NULL are themselves aligned to NULL. The construction for synchronous tree-substitution and tree-adjoining grammars goes through similarly but operates on the derivation trees.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="26" end_page="27" type="metho">
    <SectionTitle>
3 Parameterizing a QCFG
</SectionTitle>
    <Paragraph position="0"> Recall that our goal is a conditional model of p(T2,A  |T1). For the remainder of this paper, we adopt a dependency-tree representation of T1 and T2. Each tree node represents a word of the sentence together with a part-of-speech tag. Syntactic dependencies in each tree are represented directly by the parent-child relationships.</Paragraph>
    <Paragraph position="1"> Why this representation? First, it helps us concisely formulate a QG translation model where the source dependencies influence the generation of target dependencies (see figure 3). Second, for evaluation, it is trivial to obtain the word-to-word alignments from the node-to-node alignments. Third, the part-of-speech tags are useful backoff features, and in fact play a special role in our model below.</Paragraph>
    <Paragraph position="2"> When stochastically generating a translation T2,  ourquasi-synchronousgenerativeprocesswillbeinfluenced by both fluency and adequacy. That is, it considers both the local well-formedness of T2 (a monolingual criterion) and T2's local faithfulness to T1 (a bilingual criterion). We combine these in a simple generative model rather than a log-linear model. When generating the children of a node in T2, the process first generates their tags using mono-lingual parameters (fluency), and then fills in in the  wordsusingbilingualparameters(adequacy)thatselect and translate words from T1.5 Concretely, each node in T2 is labeled by a triple (tag, word, aligned word). Given a parent node (p,h,hprime) in T2, we wish to generate sequences of left and right child nodes, of the form (c,a,aprime).</Paragraph>
    <Paragraph position="3"> Our monolingual parameters come from a simple generative model of syntax used for grammar induction: the Dependency Model with Valence (DMV) of Klein and Manning (2004). In scoring dependency attachments, DMV uses tags rather than words. The parameters of the model are:  1. pchoose(c  |p,dir): the probability of generating c as the next child tag in the sequence of dir children, where dir [?] {left,right}.</Paragraph>
    <Paragraph position="4"> 2. pstop(s  |h,dir,adj): the probability of generating no more child tags in the sequence of dir  children. This is conditioned in part on the &amp;quot;adjacency&amp;quot; adj [?] {true,false}, which indicates whether the sequence of dir children is empty so far.</Paragraph>
    <Paragraph position="5"> Our bilingual parameters score word-to-word translation and aligned dependency configurations.</Paragraph>
    <Paragraph position="6"> We thus use the conditional probability ptrans(a | aprime) that source word aprime, which may be NULL, translates as target word a. Finally, when a parent word h aligned to hprime generates a child, we stochastically decide to align the child to a node aprime in T1 with one several possible relations to hprime. A &amp;quot;monotonic&amp;quot; dependency alignment, for example, would have hprime and aprime in a parent-child relationship like their target-tree analogues. In different versions of the model, we allowed various dependency alignment configurations (figure 3). These configurations rep5This division of labor is somewhat artificial, and could be remedied in a log-linear model, Naive Bayes model, or deficient generative model that generates both tags and words conditioned on both monolingual and bilingual context.</Paragraph>
    <Paragraph position="7">  resent cases where the parent-child dependency being generated by the QG in the target language maps onto source-language child-parent, for head swapping; the same source node, for two-to-one alignment; nodes that are siblings or in a c-command relationship, for scrambling and extraposition; or in a grandparent-grandchild relationship, e.g. when a preposition is inserted in the source language. We alsoalloweda&amp;quot;none-of-the-above&amp;quot;configuration,to account for extremely mismatched sentences.</Paragraph>
    <Paragraph position="8"> The probability of the target-language dependency treelet rooted at h is thus:</Paragraph>
    <Paragraph position="10"/>
  </Section>
  <Section position="5" start_page="27" end_page="28" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We claim that for modeling human-translated bitext, itisbettertoprojectsyntaxonlyloosely. Toevaluate this claim, we train quasi-synchronous dependency grammars that allow progressively more divergence from monotonic tree alignment. We evaluate these models on cross-entropy over held-out data and on error rate in a word-alignment task.</Paragraph>
    <Paragraph position="1"> One might doubt the use of dependency trees for alignment, since Gildea (2004) found that constituencytreesalignedbetter. Thatexperiment,however, aligned only the 1-best parse trees. We too will consider only the 1-best source tree T1, but in constrast to Gildea, we will search for the target tree T2 that aligns best with T1. Finding T2 and the alignment is simply a matter of parsing S2 with the QG derived from T1.</Paragraph>
    <Section position="1" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
4.1 Data and Training
</SectionTitle>
      <Paragraph position="0"> We performed our modeling experiments with the German-English portion of the Europarl European Parliament transcripts (Koehn, 2002). We obtained monolingual parse trees from the Stanford German and English parsers (Klein and Manning, 2003).</Paragraph>
      <Paragraph position="1"> Initial estimates of lexical translation probabilities came from the IBM Model 4 translation tables produced by GIZA++ (Brown et al., 1993; Och and Ney, 2003).</Paragraph>
      <Paragraph position="2"> All text was lowercased and numbers of two or more digits were converted to an equal number of hash signs. The bitext was divided into training sets of 1K, 10K, and 100K sentence pairs. We held out one thousand sentences for evaluating the cross-entropy of the various models and hand-aligned 100 sentence pairs to evaluate alignment error rate (AER).</Paragraph>
      <Paragraph position="3"> We trained the model parameters on bitext using the Expectation-Maximization (EM) algorithm. The T1 tree is fully observed, but we parse the target language. Asnoted, theinitiallexicaltranslationprobabilities came from IBM Model 4. We initialized the monolingual DMV parameters in one of two ways: using either simple tag co-occurrences as in (Klein andManning,2004)or&amp;quot;supervised&amp;quot;countsfromthe monolingual target-language parser. This latter initialization simulates the condition when one has a small amount of bitext but a larger amount of target data for language modeling. As with any mono-lingual grammar, we perform EM training with the Inside-Outside algorithm, computing inside probabilities with dynamic programming and outside probabilities through backpropagation.</Paragraph>
      <Paragraph position="4">  Searchingthefullspaceoftarget-languagedependency trees and alignments to the source tree consumed several seconds per sentence. During training, therefore, we constrained alignments to come from the union of GIZA++ Model 4 alignments.</Paragraph>
      <Paragraph position="5"> These constraints were applied only during training and not during evaluation of cross-entropy or AER.</Paragraph>
    </Section>
    <Section position="2" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
4.2 Conditional Cross-Entropy of the Model
</SectionTitle>
      <Paragraph position="0">  TotesttheexplanatorypowerofourQCFG,weevaluated its conditional cross-entropy on held-out data (table 1). In other words, we measured how well a trained QCFG could predict the true translation of novel source sentences by summing over all parses of the target given the source. We trained QCFG models under different conditions of bitext size and parameter initialization. However, the principal independent variable was the set of dependency alignment configurations allowed.</Paragraph>
      <Paragraph position="1"> From these cross-entropy results, it is clear that strictly synchronous grammar is unwise. We ob- null source tree as, among other things, (a) parent-child, (b) child-parent, (c) identical nodes, (d) siblings, (e) grandparent-grandchild, (f) c-commander-c-commandee, (g) none of the above. Here German is the source and English is the target. Case (g), not pictured above, can be seen in figure 1, in English-German order, where the child-parent pair Tschernobyl k&amp;quot;onnte correspond to the words Chernobyl and could, respectively. Since could dominates Chernobyl, they are not in a c-command relationship.  dency configurations (figure 3) allowed, for 1k, 10k, and 100k training sentences. The big error reductions arrive when we allow arbitrary non-local alignments in condition (g). Distinguishingsomecommoncasesofnon-localalignmentsimproves null performance further. For comparison, we show cross-entropy when every target language node is unaligned.</Paragraph>
      <Paragraph position="2"> tain comparatively poor performance if we require  parent-childpairsinthetargettreetoaligntoparentchild pairs in the source (or to parent-NULL or NULL-NULL). Performance improves as we allow and distinguish more alignment configurations.</Paragraph>
    </Section>
    <Section position="3" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
4.3 Word Alignment
</SectionTitle>
      <Paragraph position="0"> We computed standard measures of alignment precision, recall, and error rate on a test set of 100 hand-aligned German sentence pairs with 1300 alignment links. As with many word-alignment evaluations, we do not score links to NULL. Just as for crossentropy, we see that more permissive alignments lead to better performance (table 2).</Paragraph>
      <Paragraph position="1"> Having selected the best system using the cross-entropy measurement, we compare its alignment error rate against the standard GIZA++ Model 4 baselines. As Figure 4 shows, our QCFG for German Englishconsistentlyproducesbetteralignmentsthan null the Model 4 channel model for the same direction, German - English. This comparison is the appropriate one because both of these models are forced to align each English word to at most one German word. 6</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>