<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0304">
  <Title>Statistical Translation Alignment with Compositionality Constraints</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Statistical Word Alignment
</SectionTitle>
    <Paragraph position="0"> Brown et al. (1993) define a word alignment as a vector a = a1...am that connects each word of a source-language text S = s1...sm to a target-language word in its translation T = t1...tn, with the interpretation that word taj is the translation of word sj in S (aj = 0 is used to denote words of s that do not produce anything in T).</Paragraph>
    <Paragraph position="1"> The Viterbi alignment between source and target sentences S and T is defined as the alignment ^a whose probability is maximal under some translation model:</Paragraph>
    <Paragraph position="3"> where A is the set of all possible alignments between S and T, and PrM(a|S,T) is the estimate of a's probability under model M, which we denote Pr(a|S,T) from hereon. In general, the size of A grows exponentially with the sizes of S and T, and so there is no efficient way of computing ^a efficiently. However, under the independence hypotheses of IBM Model 2, the Viterbi alignment can be obtained by simply picking for each position i in S, the alignment that maximizes t(si|tj)a(j,i,m,n), the product of the model's &amp;quot;lexical&amp;quot; and &amp;quot;alignment&amp;quot; probability estimates. This procedure can trivially be carried out in O(mn) operations. Because of this convenient property, we take the Viterbi-2 WA method (which we later refer to as the V method) as the basis for the rest of this work.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Compositionality
</SectionTitle>
    <Paragraph position="0"> In IBM-style alignments, each SL token is connected to a single (possibly null) TL token, typically the TL token with which it has the most &amp;quot;lexical affinities&amp;quot;, regardless of other existing connections in the alignment and, more importantly, of the relationships it holds with other SL tokens in its vicinity. In practice, this means that some TL tokens can end up being connected to several SL tokens, while other TL tokens are left unconnected.</Paragraph>
    <Paragraph position="1"> This contrasts with alternative alignment models such as those of Melamed (1998) and Wu (1997), which impose a &amp;quot;one-to-one&amp;quot; constraint on alignments. Such a constraint evokes the notion of compositionality in translation: it suggests that each SL token operates independently in the SL sentence to produce a single TL token in the TL sentence, which then depends on no other SL token.</Paragraph>
    <Paragraph position="2"> This view is, of course, extreme, and real-life translations are full of examples that show how this compositionality principle breaks down as we approach the level of word correspondences. Yet, if we can find a way of imposing compositionality constraints on WA's, at least to the level where it applies, then we should obtain more sensible results than with Viterbi alignments.</Paragraph>
    <Paragraph position="3"> For instance, consider a procedure that splits both the SL and TL sentences S and T into two independent parts, in such a way as to maximise the probability of the two resulting Viterbi alignments:</Paragraph>
    <Paragraph position="5"> In the triple &lt;i,j,d&gt; above, i represents a &amp;quot;split point&amp;quot; in the SL sentence S, j is the analog for TL sentence T, and d is the &amp;quot;direction of correspondence&amp;quot;: d = 1 denotes a &amp;quot;parallel correspondence&amp;quot;, i.e. s1...si corresponds to t1...tj and si+1...sm corresponds to tj+1...tn; d = [?]1 denotes a &amp;quot;crossing correspondence&amp;quot;, i.e. s1...si corresponds to tj+1...tn and si+1...sm corresponds to t1...tj.</Paragraph>
    <Paragraph position="6"> The triple &lt;I,J,D&gt; produced by this procedure refers to the most probable alignment between S and T, under the hypothesis that both sentences are made up of two independent parts (s1...sI and sI+1...sm on the one hand, t1...tJ and tJ+1...tn on the other), that correspond to each other two-by-two, following direction D. Such an alignment suggests that translation T was obtained by &amp;quot;composing&amp;quot; the translation of s1...sI with that of sI+1...sm.</Paragraph>
    <Paragraph position="7"> In the above procedure, these &amp;quot;composing parts&amp;quot; of S and T are further assumed to be contiguous sub-sequences of words. Once again, real-life translations are full of examples that contradict this (negations in French and particle verbs in German are two examples that immediately spring to mind when aligning with English).</Paragraph>
    <Paragraph position="8"> Yet, this contiguity assumption turns out to be very convenient, because examining pairings of non-contiguous sequences would quickly become intractable. In contrast, the procedure above can find the optimal partition in polynomial time.</Paragraph>
    <Paragraph position="9"> The &amp;quot;splitting&amp;quot; process described above can be repeated recursively on each pair of matching segments,  down to the point where the SL segment contains a single token. (TL segments can always be split, even when empty, because IBM-style alignments allow connecting SL tokens to the &amp;quot;null&amp;quot; TL token, which is always available.) This recursive procedure actually produces two different outputs: 1. A parallel partition of S and T into m pairs of segments &lt;si,tkj&gt; , where each tkj is a (possibly null) contiguous sub-sequence of T; this partition can of course be viewed as an alignment on the words of S and T.</Paragraph>
    <Paragraph position="10"> 2. an IBM-style alignment, such that each SL and TL token is linked to at most one token in the other lan null guage: this alignment is actually the concatenation of individual Viterbi alignments on the &lt;si,tkj&gt; pairs, which connects each si to (at most) one of the tokens in the corresponding tkj .</Paragraph>
    <Paragraph position="11"> In this procedure, which we call Compositional WA (or C for short), there are at least two problems. First, each SL token finds itself &amp;quot;isolated&amp;quot; in its own partition bin, which makes it impossible to account for multiple SL tokens acting together to produce a TL sequence. Second, the TL tokens that are not connected in the resulting IBM-style alignment do not play any role in the computation of the probability of the optimal alignment; therefore, the pair &lt;si,tkj&gt; in which these &amp;quot;superfluous&amp;quot; tokens end up is more or less random.</Paragraph>
    <Paragraph position="12"> To compensate in part for these, we propose using two IBM-2 models to compute the optimal partition: the &amp;quot;forward&amp;quot; (SL-TL) model, and the &amp;quot;reverse&amp;quot; (TL-SL) model. When examining a particular split &lt;i,j,d&gt; for S and T, we compute both Viterbi alignments, forward and reverse, between all pairs of segments, and score each pair with the product of the two alignments' probabilities. null In this variant, which we call Combined Compositional WA (CC), we can no longer allow &amp;quot;empty&amp;quot; segments in the TL, and so we stop the recursion as soon as either the SL or TL segment contains a single token. The resulting partition therefore consists in a series of 1-to-k or k-to-1 alignments, with k [?] 1.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Implementation
</SectionTitle>
    <Paragraph position="0"> The C and CC WA methods of section 3 were implemented in a program called ralign (Recursive - or RALI - alignment, as you wish). As suggested above, this program takes as input a pair of sentence-aligned texts, and the parameters of two IBM-2 models (forward and reverse), and outputs WA's for the given texts. This program also implements plain Viterbi alignments, using the forward (V) or reverse (RV) models, as well as what we call the Reverse compositional WA (or RC), which is just the C method using the reverse IBM-2 model.</Paragraph>
    <Paragraph position="1"> The output format proposed for the WPT-03 shared task on WA allowed participants to distinguish between &amp;quot;sure&amp;quot; (S) and &amp;quot;probable&amp;quot; (P) WA's. We figured that our alignment procedure implicitly incorporated a way of distinguishing between the two: within each produced pair of segments, we marked as &amp;quot;sure&amp;quot; all WA's that were predicted by both (forward and reverse) Viterbi alignments, and as &amp;quot;probable&amp;quot; all the others.</Paragraph>
    <Paragraph position="2"> The translation models for ralign were trained using the programs of the EGYPT statistical translation toolkit (Al-Onaizan et al., 1999). This training was done using the data provided as part of the WPT-03 shared task on WA (table 1). We thus produced two sets of models, one for English and French (en-fr), and one for Romanian and English (ro-en). All models were trained on both the training and test datasets1. For en-fr, we considered all words that appeared only once in the corpus to be &amp;quot;unknown words&amp;quot; (whittle option -f 2), so as to obtain default values of &amp;quot;real&amp;quot; unknowns in the test corpus2. In the case of ro-en, there was too little training data for this to be beneficial, and so we chose to use all words.</Paragraph>
    <Paragraph position="3">  We trained and tested a number of translation models before settling for this particular setup. All of these  excessively long sentences in the corpus.</Paragraph>
    <Paragraph position="4"> tests were performed using the trial data provided for the WPT-03 shared task.</Paragraph>
  </Section>
class="xml-element"></Paper>