<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1123"> <Title>Empirical Lower Bounds on the Complexity of Translational Equivalence [?]</Title> <Section position="4" start_page="977" end_page="978" type="metho"> <SectionTitle> 2 A Measure of Alignment Complexity </SectionTitle> <Paragraph position="0"> Any translation model can memorize a training sentence pair as a unit. For example, given a sentence pair like (he left slowly / slowly he left) with the correct word alignment, a phrase-based translation model can add a single 3-word biphrase to its phrase table. However, this biphrase would not help the model predict translations of the individual words in it. That is why phrase-based models typically decompose such training examples into their sub-biphrases and remember those too. Decomposing the translational equivalence relations in the training data into smaller units of knowledge can improve a model's ability to generalize (Zhang et al., 2006). In the limit, to maximize the chances of covering arbitrary new data, a model should decompose the training data into the smallest possible units, and learn from them.1 (Footnote 1: Many popular models learn from larger units at the same time, but the size of the smallest learnable unit is what matters for our purposes.) For phrase-based models, this stipulation implies phrases of length one. If the model is a synchronous rewriting system, then it should be able to generate every training sentence pair as the yield of a binary-branching synchronous derivation tree, where every word-to-word link is generated by a different derivation step.</Paragraph> <Paragraph position="1"> For example, a model that uses production rules could generate the previous example using the synchronous productions (S, S) → (X Y / Y X); (X, X) → (U V / U V); (Y, Y) → (slowly, slowly); (U, U) → (he, he); and (V, V) → 
(left, left).</Paragraph> <Paragraph position="2"> A problem arises when this kind of decomposition is attempted for the alignment in Figure 1(a). If each link is represented by its own nonterminal, and production rules must be binary-branching, then some of the nonterminals involved in generating this alignment need discontinuities, or gaps. Figure 1(b) illustrates how to generate the sentence pair and its word alignment in this manner.</Paragraph> <Paragraph position="3"> The nonterminals X and Y have one discontinuity each.</Paragraph> <Paragraph position="4"> More generally, for any positive integer k, it is possible to construct a word alignment that cannot be generated using binary production rules whose nonterminals all have fewer than k gaps (Satta and Peserico, 2005). Our study measured the complexity of a word alignment as the minimum number of gaps needed to generate it under the following constraints: 1. Each step of the derivation generates no more than two different nonterminals.</Paragraph> <Paragraph position="5"> 2. Each word-to-word link is generated from a separate nonterminal.2 Our measure of alignment complexity is analogous to what Melamed et al. (2004) call &quot;fanout.&quot;3 The least complex alignments on this measure -- those that can be generated with zero gaps -- are precisely those that can be generated by an ITG. [Table 1 caption: minimum/median/maximum sentence lengths in each bitext. All failure rates reported later have a 95% confidence interval that is no wider than the value shown for each bitext.] For the rest of the paper, we restrict our attention to binary derivations, except where explicitly noted otherwise.</Paragraph> <Paragraph position="6"> To measure the number of gaps needed to generate a given word alignment, we used a bottom-up hierarchical alignment algorithm to infer a binary synchronous parse tree that was consistent with the alignment, using as few gaps as possible. 
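As an aside (this sketch is ours, not part of the paper or its toolkit), the zero-gap case can be tested directly: an alignment that is a permutation of word positions can be generated by binary, gap-free (ITG-style) derivation steps exactly when the permutation can be reduced to a single span by repeatedly merging adjacent spans whose target positions form a contiguous block. A minimal Python sketch:

```python
def binarizable(perm):
    """True iff `perm` (the target positions of source words 0..n-1) can be
    generated by binary, gap-free (ITG-style) derivation steps.

    Stack-based reduction: push each word as a (min, max) target span and
    greedily merge the top two spans whenever together they cover a
    contiguous block of target positions.
    """
    stack = []
    for v in perm:
        stack.append((v, v))
        while len(stack) >= 2:
            lo1, hi1 = stack[-2]
            lo2, hi2 = stack[-1]
            if hi1 + 1 == lo2 or hi2 + 1 == lo1:
                # the two spans are adjacent blocks of target positions
                stack[-2:] = [(min(lo1, lo2), max(hi1, hi2))]
            else:
                break
    return len(stack) == 1
```

On the example above (he left slowly / slowly he left), the source words map to target positions [1, 2, 0], which reduces fully; the (3,1,4,2) pattern of Figure 1, 0-indexed [2, 0, 3, 1], does not reduce, which is why it needs a gap.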
A hierarchical alignment algorithm is a type of synchronous parser where, instead of constraining inferences by the production rules of a grammar, the constraints come from word alignments and possibly other sources (Wu, 1997; Melamed and Wang, 2005). A bottom-up hierarchical aligner begins with word-to-word links as constituents, where some of the links might be to nothing (&quot;NULL&quot;). It then repeatedly composes constituents with other constituents to make larger ones, trying to find a constituent that covers the entire input.</Paragraph> <Paragraph position="7"> One of the important design choices in this kind of study is how to treat multiple links attached to the same word token. Word aligners, both human and automatic, are often inconsistent about whether they intend such sets of links to be disjunctive or conjunctive. In accordance with its focus on lower bounds, the present study treated them as disjunctive, to give the hierarchical alignment algorithm more opportunities to use fewer gaps. This design decision is one of the main differences between our study and that of Fox (2002), who treated links to the same word conjunctively.</Paragraph> <Paragraph position="8"> By treating many-to-one links disjunctively, our measure of complexity ignored a large class of discontinuities. Many types of discontinuous constituents exist in text independently of any translation. Simard et al. (2005) give examples such as English verb-particle constructions, and the French negation ne . . . pas. The disparate elements of such constituents would usually be aligned to the same word in a translation. [Figure 2 caption: a) A hierarchical alignment is possible without gaps. b) With a parse tree constraining the bottom sentence, no such alignment exists.] However, when our hierarchical aligner saw two words linked to one word, it ignored one of the two links. 
Our lower bounds would be higher if they accounted for this kind of discontinuity.</Paragraph> </Section> <Section position="5" start_page="978" end_page="982" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="978" end_page="978" type="sub_section"> <SectionTitle> 3.1 Data </SectionTitle> <Paragraph position="0"> We used two monolingual bitexts and five bilingual bitexts. The Romanian/English and Hindi/English data came from Martin et al. (2005). For Chinese/English and Spanish/English, we used the data from Ayan et al. (2005). The French/English data were those used by Mihalcea and Pedersen (2003). The monolingual bitext labeled &quot;MTEval&quot; in the tables consists of multiple independent translations from Chinese to English (LDC, 2002). The other monolingual bitext, labeled &quot;fiction,&quot; consists of two independent translations from French to English of Jules Verne's novel 20,000 Leagues Under the Sea, sentence-aligned by Barzilay and McKeown (2001).</Paragraph> <Paragraph position="1"> From the monolingual bitexts, we removed all sentence pairs where either sentence was longer than 100 words. Table 1 gives descriptive statistics for the remaining data. The table also shows the upper bound of the 95% confidence intervals for the coverage rates reported later. 
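The paper does not state how these confidence intervals were constructed; for illustration only, a standard normal-approximation interval for a per-corpus rate can be sketched as follows (`proportion_ci` is our name, not the paper's code):

```python
import math

def proportion_ci(count, n, z=1.96):
    """Normal-approximation confidence interval for a rate count/n
    (z = 1.96 for a 95% interval), clipped to [0, 1]."""
    p = count / n
    half = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)
```

For example, a failure rate of 50 out of 1000 sentence pairs gives an interval of roughly 0.05 ± 0.0135 under this construction.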
The results of experiments on different bitexts are not directly comparable, due to the varying genres and sentence lengths.</Paragraph> </Section> <Section position="2" start_page="978" end_page="979" type="sub_section"> <SectionTitle> 3.2 Constraining Parse Trees </SectionTitle> <Paragraph position="0"> One of the main independent variables in our experiments was the number of monolingual parse trees used to constrain the hierarchical alignments.</Paragraph> <Paragraph position="1"> To induce models of translational equivalence, some researchers have tried to use such trees to constrain bilingual constituents: The span of every node in the constraining parse tree must coincide with the relevant monolingual span of some node in the bilingual derivation tree. [Figure 3 caption fragment: the example cannot be hierarchically aligned without gaps in a manner consistent with both parse trees.] These additional constraints can thwart attempts at hierarchical alignment that might have succeeded otherwise. Figure 2a shows a word alignment and a parse tree that can be hierarchically aligned without gaps. George and left can be composed in both sentences into a constituent without crossing any phrase boundaries in the tree, as can on and Friday. These two constituents can then be composed to cover the entire sentence pair. On the other hand, if a constraining tree is applied to the other sentence as shown in Figure 2b, then the word alignment and tree constraint conflict. The projection of the VP is discontinuous in the top sentence, so the links that it covers cannot be composed into a constituent without gaps. However, if a gap is allowed, then the VP can compose as on Friday . . . left in the top sentence, where the ellipsis represents a gap. This VP can then compose with the NP to complete a synchronous parse tree. Some authors have applied constraining parse trees to both sides of the bitext. 
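The span-coincidence requirement amounts to the usual consistency check: the target positions linked to a tree node's span must form a contiguous block, and no link may run from inside that block back to a source word outside the span. A small illustrative sketch (our representation, not the paper's code: links as 0-based (source, target) index pairs):

```python
def node_compatible(src_span, links):
    """True iff a tree node covering source positions src_span = (i, j)
    can become a gap-free bilingual constituent under `links`."""
    i, j = src_span
    tgts = [t for s, t in links if i <= s <= j]
    if not tgts:
        return True  # a fully unaligned span imposes no constraint
    lo, hi = min(tgts), max(tgts)
    # no target position inside the block may link back outside [i, j]
    return all(i <= s <= j for s, t in links if lo <= t <= hi)
```

In a Figure 2b-style configuration (our reconstruction), e.g. a tree-side sentence "George left on Friday" against "on Friday George left" with links [(0, 2), (1, 3), (2, 0), (3, 1)], the VP span (1, 3) fails the check, while the NP span (0, 0) and the PP span (2, 3) pass it.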
The example in Figure 3 can be hierarchically aligned using either one of the two constraining trees, but gaps are necessary to align it with both trees.</Paragraph> </Section> <Section position="3" start_page="979" end_page="980" type="sub_section"> <SectionTitle> 3.3 Methods </SectionTitle> <Paragraph position="0"> We parsed the English side of each bilingual bitext and both sides of each English/English bitext using an off-the-shelf syntactic parser (Bikel, 2004), which was trained on sections 02-21 of the Penn English Treebank (Marcus et al., 1993).</Paragraph> <Paragraph position="1"> Our bilingual bitexts came with manually annotated word alignments. For the monolingual bitexts, we used an automatic word aligner based on a cognate heuristic and a list of 282 function words compiled by hand. The aligner linked two words to each other only if neither of them was on the function word list and their longest common subsequence ratio (Melamed, 1995) was at least 0.75. Words that were not linked to another word in this manner were linked to NULL. For the purposes of this study, a word aligned to NULL is a non-constraint, because it can always be composed without a gap with some constituent that is adjacent to it on just one side of the bitext. The number of automatically induced non-NULL links was lower than what would be drawn by hand.</Paragraph> <Paragraph position="2"> We modified the word alignments in all bitexts to minimize the chances that alignment errors would lead to an overestimate of alignment complexity. All of the modifications involved adding links to NULL. Due to our disjunctive treatment of conflicting links, the addition of a link to NULL can decrease but cannot increase the complexity of an alignment. For example, if we added the links (cela, NULL) and (NULL, that) to the alignment in Figure 1, the hierarchical alignment algorithm could use them instead of the link between cela and that. 
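The cognate-linking rule described above (no function words, longest common subsequence ratio of at least 0.75) can be sketched directly; the function names and the plain dynamic-programming LCS below are ours, not the paper's implementation:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence (classic O(len(a)*len(b)) DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def lcsr(a, b):
    """Longest common subsequence ratio (Melamed, 1995)."""
    return lcs_len(a, b) / max(len(a), len(b))

def cognate_link(w1, w2, function_words, threshold=0.75):
    """Link two words only if neither is a function word and LCSR >= threshold."""
    if w1.lower() in function_words or w2.lower() in function_words:
        return False
    return lcsr(w1.lower(), w2.lower()) >= threshold
```

For instance, "colour" and "color" share a subsequence of length 5, so their LCSR is 5/6 ≈ 0.83 and they would be linked; any word on the function word list is never linked.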
It could thus generate the modified alignment without using a gap. We added NULL links in two situations. First, if a subset of the links in an alignment formed a many-to-many mapping but did not form a bipartite clique (i.e. every word on one side linked to every word on the other side), then we added links from each of these words to NULL. Second, if n words on one side of the bitext aligned to m words on the other side with m > n, then we added NULL links for each of the words on the side with m words.</Paragraph> <Paragraph position="3"> After modifying the alignments and obtaining monolingual parse trees, we measured the alignment complexity of each bitext using a hierarchical alignment algorithm, as described in Section 2. Separate measurements were taken with zero, one, and two constraining parse trees. The synchronous parser in the GenPar toolkit4 can be configured for all of these cases (Burbank et al., 2005).</Paragraph> <Paragraph position="4"> Unlike Fox (2002) and Galley et al. (2004), we measured failure rates per corpus rather than per sentence pair or per node in a constraining tree.</Paragraph> <Paragraph position="5"> This design was motivated by the observation that if a translation model cannot correctly model a certain word alignment, then it is liable to make incorrect inferences about arbitrary parts of that alignment, not just the particular word links involved in a complex pattern. 
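For illustration, the two NULL-link rules above can be sketched over connected groups of links (assumptions, not the paper's code: links are 0-based (source, target) index pairs, NULL is written as None, and the input contains only non-NULL links):

```python
from collections import defaultdict

def components(links):
    """Connected components of the bipartite graph of non-NULL links."""
    adj = defaultdict(set)
    for s, t in links:
        adj[("s", s)].add(("t", t))
        adj[("t", t)].add(("s", s))
    seen, comps = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n])
        seen |= comp
        comps.append(comp)
    return comps

def add_null_links(links):
    """Return `links` plus the NULL links added by the two rules."""
    links = set(links)
    extra = set()
    for comp in components(links):
        srcs = {i for side, i in comp if side == "s"}
        tgts = {i for side, i in comp if side == "t"}
        # Rule 1: a many-to-many group that is not a bipartite clique
        if len(srcs) > 1 and len(tgts) > 1 and \
           not all((s, t) in links for s in srcs for t in tgts):
            extra |= {(s, None) for s in srcs} | {(None, t) for t in tgts}
        # Rule 2: m words on one side aligned to n words with m > n
        if len(srcs) > len(tgts):
            extra |= {(s, None) for s in srcs}
        elif len(tgts) > len(srcs):
            extra |= {(None, t) for t in tgts}
    return links | extra
```

For example, the non-clique group {(0, 0), (0, 1), (1, 0)} gets NULL links on both sides under Rule 1, while {(0, 0), (1, 0)} gets NULL links only for the two source words under Rule 2.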
The failure rates we report represent lower bounds on the fraction of training data that is susceptible to misinterpretation by overconstrained translation models.</Paragraph> <Paragraph position="6"> [Table 3 caption: alignment failure rates for bilingual bitexts under the constraints of a word alignment and a monolingual parse tree on the English side.]</Paragraph> </Section> <Section position="4" start_page="980" end_page="981" type="sub_section"> <SectionTitle> 3.4 Summary Results </SectionTitle> <Paragraph position="0"> Table 2 shows the lower bound on alignment failure rates with and without gaps for five languages paired with English. This table represents the case where the only constraints are from word alignments. Wu (1997) has &quot;been unable to find real examples&quot; of cases where hierarchical alignment would fail under these conditions, at least in &quot;fixed-word-order languages that are lightly inflected, such as English and Chinese.&quot; (p. 385).</Paragraph> <Paragraph position="1"> In contrast, we found examples in all bitexts that could not be hierarchically aligned without gaps, including at least 5% of the Chinese/English sentence pairs. Allowing constituents with a single gap on one side of the bitext decreased the observed failure rate to zero for all five bitexts.</Paragraph> <Paragraph position="2"> Table 3 shows what happened when we used monolingual parse trees to restrict the compositions on the English side. The failure rates were above 35% for four of the five language pairs, and 61% for Chinese/English! Again, the failure rate fell dramatically when one gap was allowed on the unconstrained (non-English) side of the bitext. Allowing two gaps on the non-English side led to almost complete coverage of these word alignments.</Paragraph> <Paragraph position="3"> Table 3 does not specify the number of gaps allowed on the English side, because varying this parameter never changed the outcome. 
The only way that a gap on that side could increase coverage is if there were a node in the constraining parse tree that had at least four children whose translations were in one of the complex permutations. The absence of such cases in the data implies that the failure rates under the constraints of one parse tree would be identical even if we allowed production rules of rank higher than two.</Paragraph> <Paragraph position="5"/> <Paragraph position="7"> [Table 4 and Table 5 captions: alignment failure rates for the MTEval bitext and for the fiction bitext, over varying numbers of gaps and constraining trees (CTs).]</Paragraph> <Paragraph position="8"> Table 4 shows the alignment failure rates for the MTEval bitext. With word alignment constraints only, 3% of the sentence pairs could not be hierarchically aligned without gaps. Allowing a single gap on one side decreased this failure rate to zero. With a parse tree constraining constituents on one side of the bitext and with no gaps, the alignment failure rate rose from 3% to 34%, but allowing a single gap on the side of the bitext that was not constrained by a parse tree brought the failure rate back down to 3%. With two constraining trees the failure rate was 61%, and allowing gaps did not lower it, for the same reasons that allowing gaps on the tree-constrained side made no difference in Table 3.</Paragraph> <Paragraph position="9"> The trends in the fiction bitext (Table 5) were similar to those in the MTEval bitext, but the coverage was always higher, for two reasons. First, the median sentence length was lower in the fiction bitext. Second, the MTEval translators were instructed to translate as literally as possible, but the fiction translators paraphrased to make the fiction more interesting. This freedom in word choice reduced the frequency of cognates and thus imposed fewer constraints on the hierarchical alignment, which resulted in looser estimates of the lower bounds. 
We would expect the opposite effect with hand-aligned data (Galley et al., 2004).</Paragraph> <Paragraph position="10"> To study how sentence length correlates with the complexity of translational equivalence, we took subsets of each bitext while varying the maximum length of the shorter sentence in each pair.5 Figure 4 plots the resulting alignment failure rates with and without constraining parse trees. The lines in these graphs are not comparable to each other because of the variety of genres involved.</Paragraph> </Section> <Section position="5" start_page="981" end_page="982" type="sub_section"> <SectionTitle> 3.5 Detailed Failure Analysis </SectionTitle> <Paragraph position="0"> We examined by hand 30 random sentence pairs from the MTEval bitext in each of three different categories: (1) the set of sentence pairs that could not be hierarchically aligned without gaps, even without constraining parse trees; (2) the set of sentence pairs that could not be hierarchically aligned without gaps with one constraining parse tree, but that did not fall into category 1; and (3) the set of sentence pairs that could not be hierarchically aligned without gaps with two constraining parse trees, but that did not fall into category 1 or 2. Table 6 shows the results of this analysis.</Paragraph> <Paragraph position="1"> In category 1, 60% of the word alignments that could not be hierarchically aligned without gaps were caused by word alignment errors, i.e. pairs of words that should not have been linked.6 [Fragment displaced in extraction: &quot;... the number of non-NULL word alignments.&quot;]</Paragraph> <Paragraph position="2"> Three errors were caused by words like targeted and started, which our word alignment algorithm deemed cognates. 12 of the hierarchical alignment failures in this category were true failures. For example: &quot;... sion of his trip was to organize an assault on Iraq.&quot; The alignment pattern of the words in bold is the familiar (3,1,4,2) permutation, as in Figure 1. 
Most of the 12 true failures were due to movement of prepositional phrases. The freedom of movement for such modifiers would be greater in bitexts that involve languages with less rigid word order than English.</Paragraph> <Paragraph position="3"> Of the 30 sentence pairs in category 2, 16 could not be hierarchically aligned due to parser errors and 4 due to faulty word alignments. 10 were due to valid word reordering. In the following example, a co-referring pronoun causes the word alignment to fail with a constraining tree on the second sentence: &quot;... Washington, he seemed to change his original stance.&quot; 25 of the 30 sentence pairs in category 3 failed to align due to parser error. 5 examples failed because of valid word reordering. 1 of the 5 reorderings was due to a difference between active voice and passive voice, as in Figure 3.</Paragraph> <Paragraph position="4"> The last row of Table 6 takes the various reasons for alignment failure into account. It estimates what the failure rates would be if the monolingual parses and word alignments were perfect, with 95% confidence intervals. These revised rates emphasize the importance of reliable word alignments for this kind of study.</Paragraph> <Paragraph position="5"> (Footnote 6: This sort of error is likely to happen with other word alignment algorithms too, because words and their common translations are likely to be linked even if they're not translationally equivalent in the given sentence.)</Paragraph> </Section> </Section> <Section position="6" start_page="982" end_page="983" type="metho"> <SectionTitle> 4 Discussion </SectionTitle> <Paragraph position="0"> Figure 1 came from a real bilingual bitext, and Example 2 in Section 3.5 came from a real monolingual bitext.7 Neither of these examples can be hierarchically aligned correctly without gaps, even without constraining parse trees. The received wisdom in the literature led us to expect no such examples in bilingual bitexts, let alone in monolingual bitexts. 
See http://nlp.cs.nyu.edu/GenPar/ACL06 for more examples. The English/English lower bounds are very loose, because the automatic word aligner would not link words that were not cognates. Alignment failure rates on a hand-aligned bitext would be higher. We conclude that the ITG formalism cannot account for the &quot;natural&quot; complexity of translational equivalence, even when translation divergences are factored out.</Paragraph> <Paragraph position="1"> Perhaps our most surprising results were those involving one constraining parse tree. These results explain why constraints from independently generated monolingual parse trees have not improved statistical translation models. For example, Koehn et al. (2003) reported that &quot;requiring constituents to be syntactically motivated does not lead to better constituent pairs, but only fewer constituent pairs, with loss of a good amount of valuable knowledge.&quot; This statement is consistent with our findings. However, most of the knowledge loss could be prevented by allowing a gap. With a parse tree constraining constituents on the English side, the coverage failure rate was 61% for the Chinese/English bitext (top row of Table 3), but allowing a gap decreased it to 6%. Zhang and Gildea (2004) found that their alignment method, which did not use external syntactic constraints, outperformed the model of Yamada and Knight (2001). However, Yamada and Knight's model could explain only the data that would pass the no-gap test in our experiments with one constraining tree (first column of Table 3). Zhang and Gildea's conclusions might have been different if Yamada and Knight's model were allowed to use discontinuous constituents. The second row of Table 4 suggests that when constraining parse trees are used without gaps, at least 34% of training sentence pairs are likely to introduce noise into the model, even if systematic syntactic differences between languages are factored out. 
We should not be surprised when such constraints do more harm than good.</Paragraph> <Paragraph position="2"> (Footnote 7: The examples were shortened for the sake of space and clarity.)</Paragraph> <Paragraph position="3"> To increase the chances that a translation model can explain complex word alignments, some authors have proposed various ways of extending a model's domain of locality. For example, Callison-Burch et al. (2005) have advocated longer phrases in finite-state phrase-based translation models. We computed the phrase length that would be necessary to cover the words involved in each (3,1,4,2) permutation in the MTEval bitext. Figure 5 shows the cumulative percentage of these cases that would be covered by phrases up to a certain length. Only 9 of the 171 cases (5.2%) could be covered by phrases of length 10 or less.</Paragraph> <Paragraph position="4"> Analogous techniques for tree-structured translation models involve either allowing each nonterminal to generate both terminals and other nonterminals (Groves et al., 2004; Chiang, 2005) or, given a constraining parse tree, &quot;flattening&quot; it (Fox, 2002; Zens and Ney, 2003; Galley et al., 2004).</Paragraph> <Paragraph position="5"> Both of these approaches can increase coverage of the training data, but, as explained in Section 2, they risk losing generalization ability.</Paragraph> <Paragraph position="6"> Our study suggests that there might be some benefits to an alternative approach using discontinuous constituents, as proposed, e.g., by Melamed et al. (2004) and Simard et al. (2005). The large differences in failure rates between the first and second columns of Table 3 are largely independent of the tightness of our lower bounds. 
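For illustration, the phrase length needed to cover a permuted group of words can be computed with the standard consistent-phrase closure: grow the source span by alternating projections between the two sides until nothing changes. A sketch under the assumption that every position in the region is aligned (not the paper's code):

```python
def minimal_phrase(links, seed_srcs):
    """Smallest consistent phrase pair (source span, target span) that
    covers the source positions in `seed_srcs`, found by alternating
    projections between the two sides until closure."""
    s_lo, s_hi = min(seed_srcs), max(seed_srcs)
    while True:
        # project the source span onto the target side
        tgts = [t for s, t in links if s_lo <= s <= s_hi]
        t_lo, t_hi = min(tgts), max(tgts)
        # project the target span back onto the source side
        srcs = [s for s, t in links if t_lo <= t <= t_hi]
        grown = (min(srcs + [s_lo]), max(srcs + [s_hi]))
        if grown == (s_lo, s_hi):
            return (s_lo, s_hi), (t_lo, t_hi)
        s_lo, s_hi = grown
```

For a (3,1,4,2) group embedded in a longer sentence, the closure returns spans of length at least 4 on both sides, and intervening links can force the span to grow further; counting these minimal lengths over all such groups yields Figure 5-style cumulative coverage curves.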
Synchronous parsing with discontinuities is computationally expensive in the worst case, but recently invented data structures make it feasible for typical inputs, as long as the number of gaps allowed per constituent is fixed at a small maximum (Waxmonsky and Melamed, 2006). More research is needed to investigate the trade-off between these costs and benefits.</Paragraph> </Section> </Paper>