File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/c00-2123_metho.xml
Size: 21,009 bytes
Last Modified: 2025-10-06 14:07:14
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2123"> <Title>Word Re-ordering and DP-based Search in Statistical Machine Translation</Title> <Section position="4" start_page="0" end_page="853" type="metho"> <SectionTitle> 2 Basic Approach </SectionTitle> <Paragraph position="0"> In this section, we briefly review our translation approach. In Eq. (1), Pr(C~l) is the language model, which is a trigrain language model in this case. For the translation model Pr(fille{), we go on the assunlption that each source word is aligned to exactly one target word. The alignment model uses two kinds of parameters: alignment probabilities p(ajlaj_l, I, J), where the probability of alignment aj for position j deI)ends on tile previous alignment position aj-i (Ney et al., 2000) and lexicon proba.bilities p(\]~le~j). When aligning the words in parallel texts (tbr language pairs like Spanish-English, French-Englisll, Italian-German,...), we typically observe a strong localization effect. In many cases, there is an even stronger restriction: over large portions of tile source string, the alignment is monotone.</Paragraph> <Section position="1" start_page="0" end_page="850" type="sub_section"> <SectionTitle> 2.1 Inverted Alignments </SectionTitle> <Paragraph position="0"> To explicitly handle the word re-ordering between words in source and target language, we use the concept of the so-called inverted aligmnents as given in (Ney et al., 2000). An inverted alignment is defined as follows: inverted alignment: i -+ j = bi.</Paragraph> <Paragraph position="1"> Target positions i are mapped to source positions bi. What is important and is not expressed by the notation is the so-called coverage constraint: each source position j should be 'hit' exactly once by the path of the inverted aligmnent b~ = bL...bi...bi. Using the inverted alignments in the maximum approximation, we obtain as search criterion:</Paragraph> <Paragraph position="3"> where the two products over i have l)een merged into ) i-1 a single product ()vet i. I (cilei_~) is tim trigram language model probability. The inverted alignment probability p(bi\[bi-l, I, .1) and the lexicon probability p(J'~,~ led are obtained by relative fl'equency estimal;es frosll the Viterbi alignment path after the final training iteration. The details are given in (Och art(1 Ney, 2000). The sentence length probability p(J\[1) is omitted without any loss in pertbrmance. For the inverted alignment probability p(bi\[bi-~, I, J), we drop the dependence on the target sentence length I.</Paragraph> </Section> <Section position="2" start_page="850" end_page="850" type="sub_section"> <SectionTitle> 2.2 Word Joining </SectionTitle> <Paragraph position="0"> The baseline alignment model does not pernfit that a source word is aligned to two or more target words, e.g. %r the translation direction from German to \]?,nglish, the German (:Oml)ound IIOlStl 'Zahnarztterrain' causes ira>blares, because it must be translated by the two target words dcntist'.s appoi'ntmcnt. We use a solution to this 1)roblenl similar to the one presented in (()ell el; al., 1999), where target words are joined during training. The word joining is (lotto on the basis of a likelihood criterion. An extended lexicon model is defined, and its likelihood is compared to a baseline lexicon model, which takes only single-word dependencies into aecollnt. E.g. 
</Section> <Section position="2" start_page="850" end_page="850" type="sub_section"> <SectionTitle> 2.2 Word Joining </SectionTitle>
<Paragraph position="0"> The baseline alignment model does not permit that a source word is aligned to two or more target words, e.g. for the translation direction from German to English, the German compound noun 'Zahnarzttermin' causes problems, because it must be translated by the two target words dentist's appointment. We use a solution to this problem similar to the one presented in (Och et al., 1999), where target words are joined during training. The word joining is done on the basis of a likelihood criterion. An extended lexicon model is defined, and its likelihood is compared to a baseline lexicon model, which takes only single-word dependencies into account. E.g. when 'Zahnarzttermin' is aligned to dentist's, the extended lexicon model might learn that 'Zahnarzttermin' actually has to be aligned to both dentist's and appointment. In the following, we assume that this word joining has been carried out.</Paragraph>
</Section> <Section position="3" start_page="850" end_page="851" type="sub_section"> <SectionTitle> 3 DP Algorithm for Statistical Machine Translation </SectionTitle>
<Paragraph position="0"> In order to handle the necessary word re-ordering as an optimization problem within our dynamic programming approach, we describe a solution to the traveling salesman problem (TSP) which is based on dynamic programming (Held, Karp, 1962). The traveling salesman problem is an optimization problem which is defined as follows: given are a set of cities $S = s_1, \ldots, s_n$ and for each pair of cities $s_i, s_j$ the cost $d_{ij} > 0$ for traveling from city $s_i$ to city $s_j$. We are looking for the shortest tour visiting all cities exactly once while starting and ending in city $s_1$. A straightforward way to find the shortest tour is by trying all possible permutations of the $n$ cities. The resulting algorithm has a complexity of $O(n!)$. However, dynamic programming can be used to find the shortest tour in exponential time, namely in $O(n^2 \cdot 2^n)$, using the algorithm by Held and Karp. The approach recursively evaluates a quantity $Q(C, j)$, where $C$ is the set of already visited cities and $s_j$ is the last visited city. Subsets $C$ of increasing cardinality $c$ are processed. The algorithm works due to the fact that not all permutations of cities have to be considered explicitly: for a given partial hypothesis $(C, j)$, the order in which the cities in $C$ have been visited can be ignored (except $j$); only the score for the best path reaching $j$ has to be stored.</Paragraph>
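For concreteness, a minimal generic implementation of the Held-Karp recursion follows; it is a textbook TSP solver written for this illustration, not the paper's search code.

```python
from itertools import combinations

def held_karp(d):
    """Shortest tour starting and ending at city 0, where d[i][j] is the
    travel cost. Runs in O(n^2 * 2^n) instead of the O(n!) of brute force:
    for a partial hypothesis (C, j), only the best path reaching j is kept."""
    n = len(d)
    # Q[(C, j)]: cost of the best path from city 0 through all of C, ending in j
    Q = {(frozenset({j}), j): d[0][j] for j in range(1, n)}
    for c in range(2, n):  # subsets C of increasing cardinality c
        for C in map(frozenset, combinations(range(1, n), c)):
            for j in C:
                Q[(C, j)] = min(Q[(C - {j}, k)] + d[k][j] for k in C - {j})
    cities = frozenset(range(1, n))
    return min(Q[(cities, j)] + d[j][0] for j in range(1, n))

# Example: 4 cities with symmetric costs
d = [[0, 2, 9, 10], [2, 0, 6, 4], [9, 6, 0, 3], [10, 4, 3, 0]]
print(held_karp(d))  # 0 -> 1 -> 3 -> 2 -> 0 with cost 2 + 4 + 3 + 9 = 18
```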
<Paragraph position="1"> This algorithm can be applied to statistical machine translation. Using the concept of inverted alignments, we explicitly take care of the coverage constraint by introducing a coverage set $C$ of source sentence positions that have already been processed. The advantage is that we can recombine search hypotheses by dynamic programming. The cities of the traveling salesman problem correspond to source words $f_j$ in the input string of length $J$. For the final translation, each source position is considered exactly once. Subsets of partial hypotheses with coverage sets $C$ of increasing cardinality $c$ are processed. For a trigram language model, the partial hypotheses are of the form $(e', e, C, j)$: $e', e$ are the last two target words, $C$ is a coverage set for the already covered source positions, and $j$ is the last position visited. Each distance in the traveling salesman problem now corresponds to the negative logarithm of the product of the translation, alignment and language model probabilities. The following auxiliary quantity is defined: $Q_{e', e}(C, j)$ := probability of the best partial hypothesis $(e_1^i, b_1^i)$, where $C = \{b_k \mid k = 1, \ldots, i\}$, $b_i = j$, $e_i = e$ and $e_{i-1} = e'$.</Paragraph>
<Paragraph position="2"> [Table 1 depicts the resulting DP-based search algorithm; its input is the source string $f_1 \ldots f_j \ldots f_J$.] The type of alignment we have considered so far requires the same length for source and target sentence, i.e. $I = J$. Evidently, this is an unrealistic assumption; therefore we extend the concept of inverted alignments as follows: When adding a new position to the coverage set $C$, we might generate either $\delta = 0$ or $\delta = 1$ new target words. For $\delta = 1$, a new target language word is generated using the trigram language model $p(e|e', e'')$. For $\delta = 0$, no new target word is generated, while an additional source sentence position is covered. A modified language model probability $p_\delta(e|e', e'')$ is defined as follows: $$p_\delta(e|e', e'') = \begin{cases} 1 & \delta = 0 \\ p(e|e', e'') & \delta = 1. \end{cases}$$ We associate a distribution $p(\delta)$ with the two cases $\delta = 0$ and $\delta = 1$ and set $p(\delta = 1) = 0.7$.</Paragraph>
<Paragraph position="3"> The above auxiliary quantity satisfies the following recursive DP equation: $$Q_{e', e}(C, j) = p(f_j|e) \cdot \max_{\delta,\, e'',\, j' \in C \setminus \{j\}} \big\{ p(\delta) \cdot p(j|j', J) \cdot p_\delta(e|e', e'') \cdot Q_{e'', e'}(C \setminus \{j\}, j') \big\}. \quad (2)$$ The DP equation is evaluated recursively for each hypothesis $(e', e, C, j)$. The resulting algorithm is depicted in Table 1. The complexity of the algorithm is $O(E^3 \cdot J^2 \cdot 2^J)$, where $E$ is the size of the target language vocabulary.</Paragraph>
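The following sketch spells out this DP over coverage sets in Python. It is a naive, exhaustive rendering of Eq. 2 for illustration: there is no pruning and no re-ordering restriction, only the $\delta = 1$ case (one new target word per covered position) is implemented, and the model callables, the boundary token `$` and the start position `-1` are assumed toy stand-ins.

```python
import math

def dp_translate(src, vocab, p_lex, p_jump, p_lm, p1=0.7):
    """Exhaustive coverage-set DP in the spirit of Eq. 2. A hypothesis is
    (e'', e', C, j): the last two target words, the coverage set and the
    last covered source position. Only delta = 1 is shown; p1 = p(delta=1).
    Model callables are assumed to return strictly positive probabilities."""
    J = len(src)
    Q, bp = {}, {}  # best log-probability and back-pointer per hypothesis
    for j in range(J):  # first covered position, first target word
        for e in vocab:
            h = ("$", e, frozenset({j}), j)
            Q[h] = math.log(p1 * p_jump(j, -1, J) * p_lex(src[j], e)
                            * p_lm(e, "$", "$"))
            bp[h] = None
    for c in range(2, J + 1):  # coverage sets of increasing cardinality c
        for (e2, e1, C, j0), q in [x for x in Q.items() if len(x[0][2]) == c - 1]:
            for j in set(range(J)) - C:  # extend by one uncovered position
                for e in vocab:
                    s = q + math.log(p1 * p_jump(j, j0, J) * p_lex(src[j], e)
                                     * p_lm(e, e2, e1))
                    h = (e1, e, C | {j}, j)
                    if s > Q.get(h, float("-inf")):
                        Q[h], bp[h] = s, (e2, e1, C, j0)
    # best full-coverage hypothesis; follow bp to read off the target string
    return max(((s, h) for h, s in Q.items() if len(h[2]) == J),
               key=lambda t: t[0])
```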
</Section> <Section position="4" start_page="851" end_page="852" type="sub_section"> <SectionTitle> 3.1 Word Re-Ordering with Verbgroup Restrictions: Quasi-monotone Search </SectionTitle>
<Paragraph position="0"> The above search space is still too large to allow the translation of a medium-length input sentence. On the other hand, only very restricted re-orderings are necessary, e.g. for the translation direction from German to English, the monotonicity constraint is violated mainly with respect to the German verbgroup. In German, the verbgroup usually consists of a left and a right verbal brace, whereas in English the words of the verbgroup usually form a sequence of consecutive words. Our new approach, which is called quasi-monotone search, processes the source sentence monotonically, while explicitly taking into account the positions of the German verbgroup.</Paragraph>
<Paragraph position="1"> [Figure 1: Re-ordering for the German verbgroup.] A typical situation is shown in Figure 1. When translating the sentence monotonically from left to right, the translation of the German finite verb 'kann', which is the left verbal brace in this case, is postponed until the German noun phrase 'mein Kollege' is translated, which is the subject of the sentence. Then, the German infinitive 'besuchen' and the negation particle 'nicht' are translated. The translation of one position in the source sentence may be postponed for up to L = 3 source positions, and the translation of up to two source positions may be anticipated for at most R = 10 source positions. To formalize the approach, we introduce four states, which capture the positions of the German verbgroup while the source sentence is processed monotonically, taking account of the already covered positions. While processing the source sentence monotonically, the initial state Z is entered whenever there are no uncovered positions to the left of the rightmost covered position. The sequence of states needed to carry out the word re-ordering example in Fig. 1 is given in Fig. 2. [Figure 2: State sequence for the re-ordering example of Fig. 1; each of the 13 source positions of the sentence 'In diesem Fall kann mein Kollege Sie nicht am vierten Mai besuchen .' is annotated with the order in which it is processed.] The 13 positions of the source sentence are processed in the order shown. A position is represented by the word at that position.</Paragraph>
<Paragraph position="2"> Using these states, we define partial hypothesis extensions of the following type: $(S', C \setminus \{j\}, j') \rightarrow (S, C, j)$. Not only the coverage set $C$ and the positions $j, j'$, but also the verbgroup states $S, S'$ are taken into account. To be brief, we omit the target words $e, e'$ in the formulation of the search hypotheses. There are 13 types of extensions needed to describe the verbgroup re-ordering; the details are given in (Tillmann, 2000). For each extension, a new position is added to the coverage set. When covering the first uncovered position in the source sentence, we use the language model probability $p(e|\$, \$)$. Here, $\$$ is the sentence boundary symbol, which is thought to be at position 0 in the target sentence. The search starts in the hypothesis $(Z, \emptyset, 0)$, where $\emptyset$ denotes the empty coverage set in which no source sentence position is covered.</Paragraph>
<Paragraph position="3"> The following recursive equation is evaluated: $$Q_{e', e}(S, C, j) = p(f_j|e) \cdot \max_{\delta,\, e'',\, (S', j')} \big\{ p(\delta) \cdot p(j|j', J) \cdot p_\delta(e|e', e'') \cdot Q_{e'', e'}(S', C \setminus \{j\}, j') \big\},$$ where the maximization runs over the extensions $(S', C \setminus \{j\}, j') \rightarrow (S, C, j)$ permitted by the verbgroup restrictions. The search ends in the hypotheses $(Z, \{1, \ldots, J\}, j)$, where $\{1, \ldots, J\}$ denotes the coverage set including all positions from the starting position 1 to position $J$, and $j \in \{J - L, \ldots, J\}$. The final score is obtained from: $$\max_{e, e',\, j \in \{J-L, \ldots, J\}} p(\$|e, e') \cdot Q_{e', e}(Z, \{1, \ldots, J\}, j),$$ where $p(\$|e, e')$ denotes the trigram language model, which predicts the sentence boundary $\$$ at the end of the target sentence. The complexity of the quasi-monotone search is $O(E^3 \cdot J \cdot (R^2 + L \cdot R))$. The proof is given in (Tillmann, 2000).</Paragraph>
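The 13 extension types are specified in (Tillmann, 2000) and are not reproduced here. As a rough illustration of the window constraints stated above, the following sketch checks whether a given processing order of source positions respects the postponement window L and the anticipation window R; it is only one plausible reading of the constraints, not the paper's finite-state formulation.

```python
def obeys_windows(order, L=3, R=10, max_anticipated=2):
    """Check a processing order of source positions (a permutation of
    0..J-1): a position may be postponed while at most L later positions
    are covered first, and at most `max_anticipated` positions may run
    ahead of the leftmost uncovered position, by at most R positions.
    An illustrative approximation only."""
    J = len(order)
    covered = set()
    for j in order:
        # postponement: later positions already covered when j is reached
        if len([k for k in covered if k > j]) > L:
            return False
        covered.add(j)
        uncovered = [k for k in range(J) if k not in covered]
        if uncovered:
            u = uncovered[0]  # leftmost uncovered position
            ahead = [k for k in covered if k > u]
            # anticipation: how many positions run ahead, and how far
            if len(ahead) > max_anticipated or max(ahead, default=u) - u > R:
                return False
    return True

# Fig. 2 example (0-based): 'kann' (pos 3) postponed, 'besuchen' (pos 11) anticipated
print(obeys_windows([0, 1, 2, 4, 5, 3, 7, 11, 6, 8, 9, 10, 12]))  # True
```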
</Section> <Section position="5" start_page="852" end_page="853" type="sub_section"> <SectionTitle> 3.2 Re-ordering with IBM Style Restrictions </SectionTitle>
<Paragraph position="0"> We compare our new approach with the word re-ordering used in the IBM translation approach (Berger et al., 1996). A detailed description of the search procedure used is given in this patent. Source sentence words are aligned with hypothesized target sentence words, where the choice of a new source word, which has not been aligned with a target word yet, is restricted.(1) A procedural definition to restrict the number of permutations carried out for the word re-ordering is given. During the search process, a partial hypothesis is extended by choosing a source sentence position which has not been aligned with a target sentence position yet. Only one of the first n positions which are not already aligned in a partial hypothesis may be chosen, where n is set to 4. The restriction can be expressed in terms of the number of uncovered source sentence positions to the left of the rightmost position m in the coverage set: this number must be less than or equal to n - 1. Otherwise, for the predecessor search hypothesis, we would have chosen a position that would not have been among the first n uncovered positions. ((1) In the approach described in (Berger et al., 1996), a morphological analysis is carried out and word morphemes rather than full-form words are used during the search. Here, we process only full-form words within the translation procedure.)</Paragraph>
<Paragraph position="1"> Ignoring the identity of the target language words $e$ and $e'$, the possible partial hypothesis extensions due to the IBM restrictions are shown in Table 2. In general, $m, l, l' \notin \{l_1, l_2, l_3\}$, and in lines 3 and 4, $l'$ must be chosen so as not to violate the above re-ordering restriction. Note that in line 4 the last visited position for the successor hypothesis must be $m$; otherwise, there would be four uncovered positions for the predecessor hypothesis, violating the restriction. A dynamic programming recursion similar to the one in Eq. 2 is evaluated; in this case, we have no finite-state restrictions for the search space. The search starts in the hypothesis $(\emptyset, 0)$ and ends in the hypotheses $(\{1, \ldots, J\}, j)$, with $j \in \{1, \ldots, J\}$. This approach leads to a search procedure with complexity $O(E^3 \cdot J^4)$. The proof is given in (Tillmann, 2000).</Paragraph>
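The IBM-style restriction is easy to state in code. The sketch below shows both of the equivalent formulations described above; the function names and the 0-based indexing are illustrative.

```python
def ibm_extension_allowed(coverage, j, J, n=4):
    """May a partial hypothesis with coverage set `coverage` (0-based
    source positions, sentence length J) be extended by position j?
    Allowed iff j is among the first n uncovered positions."""
    uncovered = [k for k in range(J) if k not in coverage]
    return j in uncovered[:n]

def ibm_state_valid(coverage, n=4):
    """Equivalent invariant: the number of uncovered positions to the left
    of the rightmost covered position m must be at most n - 1."""
    if not coverage:
        return True
    m = max(coverage)
    return sum(1 for k in range(m) if k not in coverage) <= n - 1

print(ibm_extension_allowed({0, 1, 4}, 6, 10))  # True: first 4 uncovered are 2,3,5,6
print(ibm_extension_allowed({0, 1, 4}, 7, 10))  # False: 7 is the 5th uncovered position
```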
</Section> </Section> <Section position="5" start_page="853" end_page="854" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle>
<Section position="1" start_page="853" end_page="853" type="sub_section"> <SectionTitle> 4.1 The Task and the Corpus </SectionTitle>
<Paragraph position="0"> We have tested the translation system on the Verbmobil task (Wahlster, 1993). The Verbmobil task is an appointment scheduling task: two subjects are each given a calendar, and they are asked to schedule a meeting. The translation direction is from German to English. A summary of the corpus used in the experiments is given in Table 3. The perplexity for the trigram language model used is 26.5. Although the ultimate goal of the Verbmobil project is the translation of spoken language, the input used for the translation experiments reported on in this paper is the (more or less) correct orthographic transcription of the spoken sentences. Thus, the effects of spontaneous speech are present in the corpus, e.g. the syntactic structure of the sentences is rather less restricted; however, the effect of speech recognition errors is not covered.</Paragraph>
<Paragraph position="1"> For the experiments, we use a simple preprocessing step: German city names are replaced by category markers. The translation search is carried out with the category markers, and the city names are resubstituted into the target sentence as a postprocessing step.</Paragraph>
</Section> <Section position="2" start_page="853" end_page="853" type="sub_section"> <SectionTitle> 4.2 Performance Measures </SectionTitle>
<Paragraph position="0"> The following two error criteria are used in our experiments: * mWER (multi-reference word error rate): We use the Levenshtein distance between the automatic translation and several reference translations as a measure of the translation errors. On average, 6 reference translations per automatic translation are available. The Levenshtein distance between the automatic translation and each of the reference translations is computed, and the minimum Levenshtein distance is taken (a sketch follows below). This measure has the advantage of being completely automatic.</Paragraph>
<Paragraph position="1"> * SSER (subjective sentence error rate): For a more detailed analysis, the translations are judged by a human test person. For the error counts, a range from 0.0 to 1.0 is used: an error count of 0.0 is assigned to a perfect translation, and an error count of 1.0 is assigned to a semantically and syntactically wrong translation.</Paragraph>
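A minimal sketch of the mWER computation referenced above. The word-level Levenshtein DP is standard; normalizing by the length of the closest reference is an assumption, since the section does not spell out the normalizer.

```python
def levenshtein(hyp, ref):
    """Word-level edit distance (insertions, deletions, substitutions)."""
    m, n = len(hyp), len(ref)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,          # deletion
                          D[i][j - 1] + 1,          # insertion
                          D[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]))
    return D[m][n]

def mwer(hyp, refs):
    """Multi-reference WER: minimum edit distance over all references,
    here normalized by the closest reference's length (an assumption)."""
    d, n = min((levenshtein(hyp, r), len(r)) for r in refs)
    return d / n

hyp = "when would it suit you".split()
refs = ["when would it suit you", "when does it suit you"]
print(mwer(hyp, [r.split() for r in refs]))  # 0.0
```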
</Section> <Section position="3" start_page="853" end_page="854" type="sub_section"> <SectionTitle> 4.3 Translation Experiments </SectionTitle>
<Paragraph position="0"> For the translation experiments, Eq. 2 is recursively evaluated. We apply a beam search concept as in speech recognition; however, there is no global pruning. Search hypotheses are processed separately according to their coverage set $C$. The best scored hypothesis for each coverage set is computed: $$Q_{\mathrm{Beam}}(C) = \max_{e', e, S, j} Q_{e', e}(S, C, j).$$ The hypothesis $(e', e, S, C, j)$ is pruned if $$Q_{e', e}(S, C, j) < t_0 \cdot Q_{\mathrm{Beam}}(C),$$ where $t_0$ is a threshold to control the number of surviving hypotheses. Additionally, for a given coverage set, at most 250 different hypotheses are kept during the search process, and the number of different words to be hypothesized for a source word is limited. For each source word $f$, the list of its possible translations $e$ is sorted according to $p(f|e) \cdot p_{\mathrm{uni}}(e)$, where $p_{\mathrm{uni}}(e)$ is the unigram probability of the English word $e$. It is sufficient to consider only the best 50 words.</Paragraph>
<Paragraph position="1"> We show translation results for three approaches: the monotone search (MonS), where no word re-ordering is allowed (Tillmann, 1997); the quasi-monotone search (QmS) as presented in this paper; and the IBM-style search (IbmS) as described in Section 3.2.</Paragraph>
<Paragraph position="2"> Table 4 shows translation results for the three approaches. The computing time is given in terms of CPU time per sentence (on a 450-MHz Pentium-III PC). Here, the pruning threshold $t_0 = 10.0$ is used. Translation errors are reported in terms of multi-reference word error rate (mWER) and subjective sentence error rate (SSER). The monotone search performs worst in terms of both error rates, mWER and SSER; its computing time is low, since no re-ordering is carried out. The quasi-monotone search performs best in terms of both error rates. Additionally, it works about 3 times as fast as the IBM-style search. For our demonstration system, we typically use the pruning threshold $t_0 = 5.0$ to speed up the search by a factor of 5 while allowing for a small degradation in translation accuracy.</Paragraph>
<Paragraph position="3"> The effect of the pruning threshold $t_0$ is shown in Table 5. The computing time, the number of search errors, and the multi-reference WER (mWER) are shown as a function of $t_0$; the negative logarithm of $t_0$ is reported. The translation scores for the hypotheses generated with different threshold values $t_0$ are compared to the translation scores obtained with a conservatively large threshold $t_0 = 10.0$. For each test series, we count the number of sentences whose score is worse than the corresponding score of the test series with the conservatively large threshold $t_0 = 10.0$, and this number is reported as the number of search errors. Depending on the threshold $t_0$, the search algorithm may miss the globally optimal path, which typically results in additional translation errors. Decreasing the threshold results in a higher mWER due to additional search errors.</Paragraph>
<Paragraph position="4"> Table 6 shows example translations obtained by the three different approaches. Again, the monotone search performs worst. In the second and third translation examples, the IbmS word re-ordering performs worse than the QmS word re-ordering, since it cannot properly take into account the word re-ordering due to the German verbgroup: the German finite verbs 'bin' (second example) and 'könnten' (third example) are too far away from the personal pronouns 'ich' and 'Sie' (6 and 5 source sentence positions, respectively). In the last example, the less restrictive IbmS word re-ordering leads to a better translation, although the QmS translation is still acceptable.</Paragraph>
</Section> </Section> </Paper>