<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0313">
  <Title>Translation Spotting for Translation Memories</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Sub-sentential Translation Memory Systems
</SectionTitle>
    <Paragraph position="0"> A translation memory system is a type of translation support tool whose purpose is to avoid the re-translation of segments of text for which a translation has previously been produced. Typically, these systems are integrated to a word-processing environment. Every sentence that the user translates within this environment is stored in a database (the translation memory - or TM). Whenever the system encounters some new text that matches a sentence in the TM, its translation is retrieved and proposed to the translator for reuse.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Figure 1: TS queries and sentence pairs (SL English, TL French)
</SectionTitle>
      <Paragraph position="0"> 1. Query: and a growing gap. SL: Is this our model of the future, regional disparity and a growing gap between rich and poor? TL: Est-ce là le modèle que nous visons, soit la disparité régionale et un fossé de plus en plus large entre les riches et les pauvres?</Paragraph>
      <Paragraph position="1"> 2. Query: the government's commitment. SL: The government's commitment was laid out in the 1994 white paper. TL: Le gouvernement a exposé ses engagements dans le livre blanc de 1994.</Paragraph>
      <Paragraph position="2"> 3. Query: close to [...] years. SL: I have been fortunate to have been travelling for close to 40 years. TL: J'ai eu la chance de voyager pendant près de 40 ans.</Paragraph>
      <Paragraph position="3"> 4. Query: to the extent that. SL: To the extent that the Canadian government could be open, it has been so. TL: Le gouvernement canadien a été aussi ouvert qu'il le pouvait.</Paragraph>
      <Paragraph position="4">  As suggested in the above paragraph, existing systems essentially operate at the level of sentences: the TM is typically made up of pairs of sentences, and the system's proposals consist in translations of complete sentences. Because the repetition of complete sentences is an extremely rare phenomenon in general language, this level of resolution limits the usability of TM's to very specific application domains - most notably the translation of revised or intrinsically repetitive documents. In light of these limitations, some proposals have recently been made regarding the possibility of building TM systems that operate &amp;quot;below&amp;quot; the sentence level, or sub-sentential translation memories (SSTM) - see for example (Lang'e et al., 1997; McTait et al., 1999).</Paragraph>
      <Paragraph position="5"> Putting together this type of system raises the problem of automatically establishing correspondences between arbitrary sequences of words in the TM, or, in other words, of &amp;quot;spotting translations&amp;quot;. This process (translation spotting) can be viewed as a by-product of wordalignment, i.e. the problem of establishing correspondences between the words of a text and those of its translation: obviously, given a complete alignment between the words of the SL and TL texts, we can extract only that part of the alignment that concerns the TS query; conversely, TS may be seen as a sub-task of the word-alignment problem: a complete word-alignment can be obtained by combining the results of a series of TS operations, covering the entirety of the SL text.</Paragraph>
      <Paragraph position="6"> From the point of view of an SSTM application, the TS mechanism should find the TL segments that are the most likely to be useful to the translator in producing the translation of a given SL sentence. In the end, the final criterion by which a SSTM will be judged is profitability: to what extent do the system's proposals enable the user to save time and/or effort in producing a new translation. From that perspective, the two most important characteristics of the TL answers are relevance, i.e. whether or not the system's TL proposals constitute valid translations for some part of the source sentence; and coherence, i.e. whether the proposed segments are wellformed, at least from a syntactic point of view. As suggested by McTait et al. (1999), &amp;quot;linguistically motivated&amp;quot; sub-sentential entities are more likely than arbitrary sequences of words to lead to useful proposals for the user. Planas (2000) proposes a fairly simple approach for an SSTM: his system would operate on sequences of syntactic chunks, as defined by Abney (1991). Both the contents of the TM and the new text under consideration would be segmented into chunks; sequences of chunks from the new text would then be looked up verbatim in the TM; the translation of the matched sequences would be proposed to the user as partial translations of the current input. Planas's case for using sequences of chunks as the unit of translation for SSTM's is supported by the coherence criterion above: chunks constitute &amp;quot;natural&amp;quot; textual units, which users should find easier to grasp and reuse than arbitrary sequences.</Paragraph>
      <Paragraph position="7"> The coherence criterion also supports the case for contiguous TL proposals, i.e. proposals that take the form of contiguous sequences of tokens from the TM, as opposed to discontiguous sets such as those of examples 2 and 3, in figure 1. This also makes intuitive sense from the more general point of view of profitability: manually &amp;quot;filling holes&amp;quot; within a discontiguous proposal is likely to be time-consuming and counter-productive. On the other hand, filling those holes automatically, as proposed for example by Lang'e et al. and McTait et al., raises numerous problems with regard to syntactic and semantic well-formedness of the TL proposals. In theory, contiguous sequences of token from the TM should not suffer from such ills.</Paragraph>
      <Paragraph position="8"> Finally, and perhaps more importantly, in a SSTM application such as that proposed by Planas, there appears to be statistical argument in favor of contiguous TL proposals: the more frequent a contiguous SL sequences, the more likely it is that its TL equivalent is also contiguous. In other words, there appears to be a natural tendency for frequently-occurring phrases and formulations to correspond to like-structured sequences in other languages. This will be discussed further in section 4. But clearly, a TS mechanism intended for such a SSTM should take advantage of this tendency.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 TS Methods
</SectionTitle>
    <Paragraph position="0"> In this section, we propose various TS methods, specifically adapted to a SSTM application such as that proposed by Planas (2000), i.e. one which takes as translation unit contiguous sequences of syntactic chunks.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Viterbi TS
</SectionTitle>
      <Paragraph position="0"> As mentioned earlier, TS can be seen as a bi-product of word-level alignments. Such alignments have been the focus of much attention in recent years, especially in the field of statistical translation modeling, where they play an important role in the learning process.</Paragraph>
      <Paragraph position="1"> For the purpose of statistical translation modeling, Brown et al. (1993) define an alignment as a vector a = a1...am that connects each word of a source-language text S = s1...sm to a target-language word in its translation T = t1...tn, with the interpretation that word taj is the translation of word sj in S (aj = 0 is used to denote words of s that do not produce anything in T).</Paragraph>
      <Paragraph position="2"> Brown et al. also define the Viterbi alignment between source and target sentences S and T as the alignment ^a whose probability is maximal under some translation model:</Paragraph>
      <Paragraph position="4"> where A is the set of all possible alignments between S and T, and PrM(a|S,T) is the estimate of a's probability under model M, which we denote Pr(a|S,T) from hereon. In general, the size of A grows exponentially with the sizes of S and T, and so there is no efficient way of computing ^a efficiently. However, under Model 2, the probability of an alignment a is given by:</Paragraph>
      <Paragraph position="6"> In this last equation, t(si|tj) is the model's estimate of the &amp;quot;lexical&amp;quot; distribution p(si|tj), while a(j,i,m,n) estimates the &amp;quot;alignment&amp;quot; distribution p(j|i,m,n). Therefore, with this model, the Viterbi alignment can be obtained by simply picking for each position i in S, the alignment that maximizes t(si|tj)a(j,i,m,n). This procedure can trivially be carried out in O(mn) operations.</Paragraph>
      <Paragraph position="7"> Because of this convenient property, we base the rest of this work on this model.</Paragraph>
      <Paragraph position="8"> Adapting this procedure to the TS task is straightforward: given the TS query q, produce as TL answer the corresponding set of TL tokens in the Viterbi alignment: rq(T) = {t^ai1,...,t^ai2} (the SL answer is simply q itself). We call this method Viterbi TS: it corresponds to the most likely alignment between the query q and TL text T, given the probability estimates of the translation model. If q contains I tokens, the Model 2 Viterbi TS can be computed in O(In) operations. Figure 2 shows an example of the result of this process.</Paragraph>
      <Paragraph position="9"> query : the government 's commitment  couple: S = Let us see where the government's commitment is really at in terms of the farm community.</Paragraph>
      <Paragraph position="10"> T = Voyons quel est le v'eritable engagement du gouvernement envers la communaut'e agricole.</Paragraph>
      <Paragraph position="11"> Viterbi alignment on query tokens: the - le government - gouvernement 's - du commitment - engagement TL answer: T = Voyons quel est le v'eritable engagement du gouvernement envers la communaut'e agricole.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Post-processings
</SectionTitle>
      <Paragraph position="0"> The tokens of the TL answer produced by Viterbi TS are not necessarily contiguous in T which, as remarked earlier, is problematic in a TM application. Various a posteriori processings on rq(T) are possible to fix this; we list here only the most obvious: expansion : Take the minimum and maximum values in {^ai1,...,^ai2}, and produce the sequence tminai...tmaxai; in other words, produce as TL answer the smallest contiguous sequence in T that contains all the tokens of rq(T).</Paragraph>
      <Paragraph position="1"> longest-sequence : Produce the subset of rq(T) that constitutes the longest contiguous sequence in T.</Paragraph>
      <Paragraph position="2"> zero-tolerance : If the tokens in rq(T) cannot be arranged in a contiguous sequence of T, then simply discard the whole TL answer.</Paragraph>
      <Paragraph position="3"> Figure 3 illustrates how these three strategies affect the Viterbi TS of figure 2.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Contiguous TS
</SectionTitle>
      <Paragraph position="0"> The various independence assumptions underpinning IBM Model 2 often have negative effects on the resulting Viterbi alignments. In particular, this model assumes rq(T) ={le, engagement, du, gouvernement} post-processing:</Paragraph>
      <Paragraph position="2"> that all connections within an alignment are independent of each other, which leads to numerous aberrations in the alignments. Typically, each SL token gets connected to the TL token with which it has the most &amp;quot;lexical affinities&amp;quot;, regardless of other existing connections in the alignment and, more importantly, of the relationships this token holds with other SL tokens in its vicinity. Conversely, some TL tokens end up being connected to several SL tokens, while other TL tokens are left unconnected. null As mentioned in section 2, in a sub-sentential TM application, contiguous sequences of tokens in the SL tend to translate into contiguous sequences in the TL. This suggests that it might be a good idea to integrate a &amp;quot;contiguity constraint&amp;quot; right into the alignment search procedure. null For example, we can formulate a variant of the Viterbi TS method above, which looks for the alignment that maximizes Pr(a|S,T), under the constraint that the TL tokens aligned with the SL query must be contiguous.</Paragraph>
      <Paragraph position="3"> Consider a procedure that seeks the (possibly null) sequence tj1...tj2 of T, that maximizes:</Paragraph>
      <Paragraph position="5"> Such a procedure actually produces two distinct alignments over S and T: an alignment aq, which connects the query tokens (the sequence si2i1) with a sequence of contiguous tokens in T (the sequence tj2j1), and an alignment a-q, which connects the rest of sentence S (i.e. all the tokens outside the query) with the rest of T. Together, these two alignments constitute the alignment a = aq [?] a-q, whose probability is maximal, under a double constraint:  1. the query tokens si2i1 can only be connected to tokens within a contiguous region of T (the sequences tj2j1); 2. the tokens outside the query (in either one of the two  sequences si1[?]11 and smi2+1) can only get connected to tokens outside tj2j1.</Paragraph>
      <Paragraph position="6"> With such an alignment procedure, we can trivially devise a TS method, which will return the optimal tj2j1 as TL answer. We call this method Contiguous TS. Alignments satisfying the above constraints can be obtained directly, by computing Viterbi alignments aq and a-q for each pair of target positions &lt;j1,j2&gt; . The TS procedure then retains the pair of TL language positions that maximizes the joint probability of alignments aq and a-q. This operation requires the computation of two Viterbi alignments for each pair &lt;j1,j2&gt; , i.e. n(n [?] 1) Viterbi alignments, plus a &amp;quot;null&amp;quot; alignment, corresponding to the situation where tj2j1 = [?]. Overall, using IBM Model 2, the oper-</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Compositional TS
</SectionTitle>
      <Paragraph position="0"> As pointed out in section 3.3, In IBM-style alignments, a single TL token can be connected to several SL tokens, which sometimes leads to aberrations. This contrasts with alternative alignment models such as those of Melamed (1998) and Wu (1997), which impose a &amp;quot;one-to-one&amp;quot; constraint on alignments. Such a constraint evokes the notion of compositionality in translation: it suggests that each SL token operates independently in the SL sentence to produce a single TL token in the TL sentence, which then depends on no other SL token.</Paragraph>
      <Paragraph position="1"> This view is, of course, extreme, and real-life translations are full of examples (idiomatic expressions, terminology, paraphrasing, etc.) that show how this compositionality principle breaks down as we approach the level of word correspondences.</Paragraph>
      <Paragraph position="2"> However, in a TM application, TS usually needs not go down to the level of individual words. Therefore, compositionality can often be assumed to apply, at least to the level of the TS query. The contiguous TS method proposed in the previous section implicitly made such an assumption. Here, we push it a little further.</Paragraph>
      <Paragraph position="3"> Consider a procedure that splits each the source and target sentences S and T into two independent parts, in such a way as to maximise the probability of the two resulting Viterbi alignments:</Paragraph>
      <Paragraph position="5"> In the triple &lt;i,j,d&gt; above, i represents a &amp;quot;split point&amp;quot; in the SL sentence S, j is the analog for TL sentence T, and d is the &amp;quot;direction of correspondence&amp;quot;: d = 1 denotes a &amp;quot;parallel correspondence&amp;quot;, i.e. s1...si corresponds to t1...tj and si+1...sm corresponds to tj+1...tn; d = [?]1 denotes a &amp;quot;crossing correspondence&amp;quot;, i.e. s1...si corresponds to tj+1...tn and si+1...sm corresponds to t1...tj.</Paragraph>
      <Paragraph position="6"> The triple &lt;I,J,D&gt; produced by this procedure refers to the most probable alignment between S and T, under the hypothesis that both sentences are made up of two independent parts (s1...sI and sI+1...sm on the one hand, t1...tJ and tJ+1...tn on the other), that correspond to each other two-by-two, following direction D. Such an alignment suggests that translation T was obtained by &amp;quot;composing&amp;quot; the translation of s1...sI with that of sI+1...sm.</Paragraph>
      <Paragraph position="7"> This &amp;quot;splitting&amp;quot; process can be repeated recursively on each pair of matching segments, down to the point where each SL segment contains a single token. (TL segments can always be split, even when empty, because IBM-style alignments make it possible to connect SL tokens to the &amp;quot;null&amp;quot; TL token, which is always available.) This gives rise to a word-alignment procedure that we call Compositional word alignment.</Paragraph>
      <Paragraph position="8"> This procedure actually produces two different outputs: first, a parallel partition of S and T into m pairs of segments &lt;si,tkj&gt; , where each tkj is a (possibly null) contiguous sub-sequence of T; second, an IBM-style alignment, such that each SL and TL token is linked to at most one token in the other language: this alignment is actually the concatenation of individual Viterbi alignments on the &lt;si,tkj&gt; pairs, which connects each si to (at most) one of the tokens in the corresponding tkj .</Paragraph>
      <Paragraph position="9"> Of course, such alignments face even worst problems than ordinary IBM-style alignments when confronted with non-compositional translations. However, when adapting this procedure to the TS task, we can hypothesize that compositionality applies, at least to the level of the SL query. This adaptation proceeds along the following modifications to the alignment procedure described  above: 1. forbid splittings within the SL query: i1 [?] i [?] i2; 2. at each level of recursion, only consider that pair of segments which contains the SL query; 3. stop the procedure as soon as it is no longer possible  to split the SL segment, i.e. it consists of si1...si2.</Paragraph>
      <Paragraph position="10"> The TL segment matched with si1...si2 when the procedure terminates is the TL answer. We call this procedure Compositional TS. It can be shown that it can be carried out in O(m3n2) operations in the worst case, and O(m2n2 logm) on average. Furthermore, by limiting the search to split points yielding matching segments of comparable sizes, the number of required operations can be cut by one order of magnitude (Simard, 2003).</Paragraph>
      <Paragraph position="11"> Figure 5 shows how this procedure splits the example pair of figure 2 (the query is shown in italics).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>