<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1010"> <Title>Greedy Decoding for Statistical Machine Translation in Almost Linear Time</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The IBM Translation Models </SectionTitle> <Paragraph position="0"> Brown et al. (1993) and Berger et al. (1994, 1996) view the problem of translation as that of decoding a message that has been distorted in a noisy channel.</Paragraph> <Paragraph position="1"> Exploiting Bayes' theorem, P(e|f) = P(f|e) P(e) / P(f),</Paragraph> <Paragraph position="3"> they recast the problem of finding the best translation ê for a given input f as ê = argmax_e P(f|e) P(e). P(e) is typically calculated using an n-gram language model. For the sake of simplicity, we assume here and everywhere else in the paper that the ultimate task is to translate from a foreign language into English. The model pictures the conversion from English to a foreign language roughly as follows (cf. Fig. 1; note that because of the noisy channel approach, the modeling is &quot;backwards&quot;).</Paragraph> <Paragraph position="4"> A number m0 of so-called spurious words (words that have no counterpart in the original English) are inserted into the foreign text. The probability of the value of m0 depends on the length l of the original English string.</Paragraph> <Paragraph position="5"> As a result, each foreign word is linked, by virtue of the derivation history, to either nothing (the imaginary NULL word), or exactly one word of the English source sentence. The triple A = (e, f, a), with e = e_1 ... e_l,</Paragraph> <Paragraph position="7"> f = f_1 ... f_m, and a = a_1 ... a_m, is called an alignment; a_j = 0 and a_j = i align the foreign word f_j with the NULL word and with the English word e_i, respectively.</Paragraph> <Paragraph position="8"> Since each of the changes occurs with a certain probability, we can calculate the translation model probability of A as the product of the individual probabilities of each of the changes. 
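As a toy illustration of this product form, the following sketch keeps only word translation and bigram language model terms, with made-up probability tables; real Model 4 also includes fertility, distortion, and NULL-insertion terms, so this is a deliberate simplification, not the model itself:

```python
# Hypothetical probability tables purely for illustration; the real
# tables are estimated during training.
t = {("maison", "house"): 0.6, ("la", "the"): 0.7}   # translation probs t(f|e)
lm = {("BOS", "the"): 0.4, ("the", "house"): 0.3}    # bigram LM probs, BOS = start

def alignment_prob(english, foreign, a):
    """Translation-model part: product of t(f_j | e_{a_j}) over the foreign
    words; alignment probability: that product times the LM probability
    of the English string."""
    tm = 1.0
    for j, f in enumerate(foreign):
        tm *= t[(f, english[a[j]])]
    p_lm = 1.0
    for prev, word in zip(["BOS"] + english[:-1], english):
        p_lm *= lm[(prev, word)]
    return tm * p_lm

p = alignment_prob(["the", "house"], ["la", "maison"], a=[0, 1])
```

Here p is 0.7 * 0.6 * 0.4 * 0.3, i.e. the product of one translation term per foreign word and one LM term per English word.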
The product of the translation model probability and the language model probability of e is called the alignment probability of A.</Paragraph> <Paragraph position="9"> Detailed formulas for the calculation of alignment probabilities according to the various models can be found in Brown et al. (1993). It should be noted here that the calculation of the alignment probability of an entire alignment (E_global) has linear complexity. We will show below that by re-evaluating only fractions of an alignment (E_local), we can reduce the evaluation cost to a constant time factor.</Paragraph> </Section> <Section position="4" start_page="0" end_page="550" type="metho"> <SectionTitle> 3 Decoding </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Decoding Algorithm </SectionTitle> <Paragraph position="0"> The task of the decoder is to revert the process just described. In this subsection we recapitulate the greedy hill-climbing algorithm presented in Germann et al. (2001).</Paragraph> <Paragraph position="1"> In contrast to all other decoders mentioned in Sec. 1, this algorithm does not process the input one word at a time to incrementally build up a full translation hypothesis. Instead, it starts out with a complete gloss of the input sentence, aligning each input word f with the word e that maximizes the inverse (with respect to the noisy channel approach) translation probability p(e|f). (Note that for the calculation of the alignment probability, p(f|e) is used.) The decoder then systematically tries out various types of changes to the alignment: changing the translation of a word, inserting extra words, reordering words, etc. These change operations are described in more detail below. 
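The control loop just described can be sketched as follows. This is a minimal illustration, not the authors' implementation; the neighbor generator and scoring function are hypothetical stand-ins for the change operations (CHANGE, INSERT, ERASE, JOIN, SWAP) and the Model 4 alignment probability:

```python
def greedy_decode(initial_alignment, neighbors, score):
    """Greedy hill climbing (G1-style): repeatedly scan all candidate
    changes of the current alignment and permanently apply the single
    best improvement, until no change raises the score."""
    current, current_score = initial_alignment, score(initial_alignment)
    while True:
        best, best_score = None, current_score
        for cand in neighbors(current):      # one full pass per iteration
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
        if best is None:                     # no improvement found: stop
            return current
        current, current_score = best, best_score

# Toy usage on a 1-D search space: climb from 0 toward the peak at 3.
result = greedy_decode(0, lambda x: [x - 1, x + 1], lambda x: -(x - 3) ** 2)
```

On this toy space the loop applies three improvements and then stops, mirroring the "repeat until no more improvements" cycle described above.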
In each search iteration, the algorithm makes a complete pass over the alignment, evaluating all possible changes.</Paragraph> <Paragraph position="2"> The simpler, &quot;faster&quot; version G1 of the algorithm considers only one operation at a time. A more thorough variant G2 applies up to two word translation changes, or inserts one zero fertility word in addition to a word translation change, before the effect of these changes is evaluated.</Paragraph> <Paragraph position="3"> At the end of the iteration, the decoder permanently applies that change, or, in the case of G2, change combination, that leads to the biggest improvement in alignment probability, and then starts the next iteration. This cycle is repeated until no more improvements can be found.</Paragraph> <Paragraph position="4"> The changes to the alignment that the decoder considers are as follows.</Paragraph> <Paragraph position="5"> CHANGE the translation of a word: For a given foreign word f, change the English word e that is aligned with f. If e has a fertility of 1, replace it with the new word e'; if it has a fertility of more than one, insert the new word e' in the position that optimizes the alignment probability. The list of candidates for e' is derived from the inverse translation table (p(e|f)). Typically, the top ten words on that list are considered; that is, for an input of length n, 10n possible change operations are evaluated during each CHANGE iteration.</Paragraph> <Paragraph position="6"> In theory, a single CHANGE iteration in G1 has a complexity of O(n^2): for each word f, there is a certain probability that changing the word translation of f requires a pass over the complete English hypothesis in order to find the best insertion point. This is the case when f is currently either spurious (that is, aligned with the NULL word), or aligned with a word with a fertility of more than one. 
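The candidate generation for CHANGE can be sketched as follows; the inverse translation table and its contents are hypothetical, and k = 10 mirrors the top-ten heuristic just described:

```python
def change_candidates(foreign, inverse_table, k=10):
    """Enumerate CHANGE operations: for each foreign word f, propose the
    top-k English words by the inverse translation probability p(e|f).
    For an input of length n this yields about k*n candidate changes.
    `inverse_table` maps f to a list of (e, p(e|f)) pairs."""
    for j, f in enumerate(foreign):
        ranked = sorted(inverse_table.get(f, []),
                        key=lambda ep: ep[1], reverse=True)
        for e_new, _ in ranked[:k]:
            yield (j, e_new)   # "re-align f_j with e_new"

# Hypothetical two-entry table, purely for illustration:
table = {"la": [("the", 0.7), ("it", 0.1)],
         "maison": [("house", 0.6), ("home", 0.3)]}
cands = list(change_candidates(["la", "maison"], table, k=1))
```

With k = 1 this proposes exactly one candidate per input word; with the paper's k = 10 the number of candidates per iteration grows as 10n.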
The probability of this happening, however, is fairly small, so that we can assume for all practical purposes that a CHANGE iteration in G1 has a complexity of O(n). Since G2 allows up to two CHANGE operations at a time, the respective complexities for G2 are O(n^4) in theory and O(n^2) in practice. We will argue below that by exploiting the notion of change dependencies, the complexity for CHANGE can be reduced to practically O(n) for G2 decoding as well, albeit with a fairly large coefficient.</Paragraph> <Paragraph position="7"> INSERT a so-called zero fertility word (i.e., an English word that is not aligned to any foreign word) into the English string. Since all possible positions in the English hypothesis have to be considered, S_INSERT is in O(n), assuming a linear correlation between input length and hypothesis length.</Paragraph> <Paragraph position="8"> ERASE a zero fertility word. S_ERASE is in O(n). JOIN two English words. This is an asymmetrical operation: one word, e_stay, stays where it is; the other one,</Paragraph> <Paragraph position="10"> e_joined, is removed from the English hypothesis. All foreign words originally aligned with e_joined are then aligned with e_stay.</Paragraph> <Paragraph position="12"> Even though a JOIN iteration has a complexity of O(n^2),2 empirical data indicates that its actual time consumption is very small (cf. Fig. 6). This is because the chances of success of a join operation can be determined very cheaply without actually performing the operation. Suppose for the sake of simplicity that e_joined is aligned with only one word f. If the translation probability p(f|e_stay) is zero,</Paragraph> <Paragraph position="14"> the resulting alignment probability will be zero. 
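A sketch of this cheap pre-check, with a hypothetical translation table t; only joins that pass the check need to be evaluated in full:

```python
def join_is_promising(f_words, e_stay, t):
    """Cheap pre-check for JOIN: if any foreign word currently aligned
    with e_joined has translation probability t(f|e_stay) == 0, the
    joined alignment's probability is zero, so the full evaluation of
    the operation can be skipped. `t` maps (f, e) pairs to t(f|e)."""
    return all(t.get((f, e_stay), 0.0) > 0.0 for f in f_words)

# Hypothetical one-entry table, purely for illustration:
t = {("maison", "house"): 0.6}
ok = join_is_promising(["maison"], "house", t)   # worth evaluating
skip = join_is_promising(["maison"], "the", t)   # t(f|e_stay) == 0: prune
```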
Therefore, we can safely skip such operations.</Paragraph> <Paragraph position="15"> SWAP any two non-overlapping regions e_i ... e_j and</Paragraph> <Paragraph position="17"> e_k ... e_l of the English hypothesis; without restrictions, this allows on the order of n^4 possible swap operations. However, if we limit the size of the swapped regions to a constant s and their distance to a constant d, we can reduce the number of swaps performed to a linear function of the input length. For each start position (defined as the first word of the first swap region), there are at most d*s^2 swaps that can be performed within these limitations. Therefore, S_SWAP-restricted is in O(n). It is obvious that the baseline version of this algorithm is very inefficient. In the following subsection, we discuss the algorithm's complexity in more detail. In Sec. 4, we show how the decoding complexity can be reduced.</Paragraph> </Section> <Section position="2" start_page="0" end_page="550" type="sub_section"> <SectionTitle> 3.2 Decoding Complexity </SectionTitle> <Paragraph position="0"> The total decoding complexity of the search algorithm is the number of search iterations (I) times the number of search steps per search iteration (S) times the evaluation cost per search step (E): C = I * S * E. We now show that the original implementation of the algorithm has a complexity of (practically) O(n^3) for G1 decoding, and O(n^4) for G2 decoding, if swap operations are restricted. With unrestricted swapping, the complexity is O(n^6). Since our argument is based on some assumptions that cannot be proved formally, we cannot provide a formal complexity proof.</Paragraph> <Paragraph position="1"> E is in O(n). In the original implementation of the algorithm, the entire alignment is evaluated after each search step (global evaluation, or E_global). 
Therefore, the evaluation cost rises linearly with the length of the hypothesized alignment: the evaluation requires two passes over the English hypothesis (n-grams for the language model; fertility probabilities) and two passes over the input string (translation and distortion probabilities). We assume a high correlation between input length and the hypothesis length. Thus, E_global is in O(n). [Fig. 2: decoding time (seconds) over input length for four configurations: global probability recalculations without improvement tiling, local probability calculations without improvement tiling, global probability calculations with improvement tiling, and local probability calculations with improvement tiling. The graph shows the average runtimes (G1) of 10 different sample sentences of the respective length, with swap operations restricted to a maximum swap segment size of 5 and a maximum swap distance of 2.]</Paragraph> <Paragraph position="3"> I is in O(n). This is due to the algorithm's inefficient search strategy: at the end of each iteration, only the single best improvement is executed; all others, even when independent, are discarded. In other words, the algorithm needs one search iteration per improvement. We assume that there is a linear correlation between input length and the number of improvements -- an assumption that is supported by the empirical data in Fig. 4.</Paragraph> <Paragraph position="5"> The number of search steps per iteration is the sum of the number of search steps for CHANGE, SWAP, JOIN, INSERT, and ERASE. The highest order term in this sum is unrestricted SWAP with O(n^4).</Paragraph> <Paragraph position="6"> With restricted swapping, S has a theoretical complexity of O(n^2) (due to JOIN) in G1 decoding, but the contribution of the JOIN operation to overall time consumption is so small that it can be ignored for all practical purposes. 
Therefore, the average complexity of S in practice is O(n), and the total complexity of G1 in practice is I * S * E = O(n) * O(n) * O(n) = O(n^3).</Paragraph> <Paragraph position="8"> In G2 decoding, which combines up to two CHANGE operations or one CHANGE operation and one INSERT operation, S has a practical complexity of O(n^2), so that the total complexity in practice is O(n^4).</Paragraph> <Paragraph position="10"> We discuss below how S can be reduced to practically linear time for G2 decoding as well.</Paragraph> </Section> </Section> <Section position="5" start_page="550" end_page="550" type="metho"> <SectionTitle> 4 Reducing Decoder Complexity </SectionTitle> <Paragraph position="0"> Every change to the alignment affects only a few of the individual probabilities that make up the overall alignment score: the n-gram contexts of those places in the English hypothesis where a change occurs, plus a few translation model probabilities. We call the -- not necessarily contiguous -- area of an alignment that is affected by a change the change's local context.</Paragraph> <Paragraph position="1"> With respect to an efficient implementation of the greedy search, we can exploit the notion of local contexts in two ways. First, we can limit probability recalculations to the local context (that is, to those probabilities that actually are affected by the respective change), and secondly, we can develop the notion of change dependencies: two changes are independent if their local contexts do not overlap. As we will explain below, we can use this notion to devise a scheme of improvement caching and tiling (ICT) that greatly reduces the total number of alignments considered during the search.</Paragraph> <Paragraph position="2"> Our argument is that local probability calculations and ICT each reduce the complexity of the algorithm by practically O(n), that is, from O(n^x) to O(n^(x-1)) with x >= 1. 
Thus, the complexity for G1 decreases from O(n^3) to O(n). If we limit the search space for the second operation (CHANGE or INSERT) in G2 decoding to its local context, G2 decoding, too, has practically linear complexity, even though with a much higher coefficient (cf. Fig. 6).</Paragraph> <Section position="1" start_page="550" end_page="550" type="sub_section"> <SectionTitle> 4.1 Local Probability Calculations </SectionTitle> <Paragraph position="0"> The complexity of calculating the alignment probability globally (that is, over the entire alignment) is O(n). However, since there is a constant upper bound3 on the size of local contexts, E_global needs to be performed only once, for the initial gloss; thereafter, recalculation of only those probabilities affected by each change (E_local, which is in</Paragraph> <Paragraph position="2"> O(1)) suffices. This reduces the overall decoding complexity from O(n^x) to O(n^(x-1)) with x >= 1. Even though profoundly trivial, this improvement significantly reduces translation times, especially when improvements are not tiled (cf. below and Fig. 2).</Paragraph> </Section> <Section position="2" start_page="550" end_page="550" type="sub_section"> <SectionTitle> 4.2 Improvement Caching and Tiling4 (ICT) </SectionTitle> <Paragraph position="0"> Based on the notions of local contexts and change dependencies, we devised the following scheme of improvement caching and tiling (ICT): During the search, we keep track of the best possible change affecting each local context. 
(In practice, we maintain a map that maps from the local context of each change that has been considered to the best change possible that affects exactly this context.) [Footnote 3: In practice, 16 with a trigram language model: a swap of two large segments over a large distance affects four points in the English hypothesis, resulting in 4 * 3 = 12 trigrams, plus four individual distortion probabilities.]</Paragraph> <Paragraph position="1"> [Footnote 4: Thanks to Daniel Marcu for alerting us to this term in this context.]</Paragraph> <Paragraph position="2"> [Fig. 3: a sample greedy search with ICT. Initial gloss: &quot;us localities computer system suffer computer virus attack and refused service attack and there various security loopholes instance everywhere&quot; (alignments checked: 1430; possible improvements: 28; improvements applied: 5). After iteration 1: &quot;u.s. localities computer system opposed computer virus attack and rejecting service attack and there are various security loopholes instance everywhere .&quot; (alignments checked: 1541; possible improvements: 3; improvements applied: 3). After iteration 2: &quot;u.s. citizens computer system opposed the computer virus attack and rejecting service attack and there are various security loopholes publicize everywhere .&quot; (alignments checked: 768; possible improvements: 1; improvements applied: 1). After iteration 3: &quot;u.s. citizens computer system opposed to the computer virus attack and rejecting service attack and there are various security loopholes publicize everywhere .&quot; (alignments checked: 364; possible improvements: 1; improvements applied: 1). After iteration 4: &quot;u.s. citizens computer system is opposed to the computer virus attack and rejecting service attack and there are various security loopholes publicize everywhere .&quot; (alignments checked: 343; possible improvements: 0; improvements applied: 0). Inserting an additional word increases the number of possible swap and insertion operations. Decoding without ICT results in the same translation but requires 11 iterations and checks a total of 17701 alignments, as opposed to 5 iterations with a total of 4464 alignments with caching.]</Paragraph> <Paragraph position="3"> At the end of search iteration i, we apply a very restricted stack search to find a good tiling of non-overlapping changes, all of which are applied. The goal of this stack search is to find a tiling that maximizes the overall gain in alignment probability. Possible improvements that overlap with higher-scoring ones are ignored. In the following search iteration i + 1, we restrict the search to changes that overlap with changes just applied.</Paragraph> <Paragraph position="4"> We can safely assume that there are no improvements to be found that are independent of the changes applied at the end of iteration i: if there were such improvements, they would have been found in and applied after iteration i. Figure 3 illustrates the procedure.</Paragraph> <Paragraph position="5"> We assume that improvements are, on average, evenly distributed over the input text. Therefore, we can expect the number of places where improvements can be applied to grow with the input length at the same rate as the number of improvements. Without ICT, the number of iterations grows linearly with the input length, as shown in Fig. 4. With ICT, we can parallelize the improvement process and thus reduce the number of iterations for each search to a constant upper bound, which will be determined by the average 'improvement density' of the domain. 
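The bookkeeping behind ICT can be sketched as follows. Note that the paper's restricted stack search over tilings is replaced here by a simpler best-first greedy selection, so this illustrates the idea of tiling independent changes rather than the exact procedure:

```python
def tile_improvements(improvements):
    """Given candidate improvements as (gain, local_context) pairs, where
    the local context is a frozenset of affected positions, select a set
    of mutually independent (non-overlapping) changes to apply in one
    iteration. Improvements overlapping a higher-scoring choice are
    ignored, as in the scheme described above."""
    chosen, used = [], set()
    for gain, ctx in sorted(improvements, key=lambda gc: gc[0], reverse=True):
        if used.isdisjoint(ctx):     # independent of everything chosen so far
            chosen.append((gain, ctx))
            used |= ctx
    return chosen

# Hypothetical candidates: contexts {1,2} and {2,3} overlap, {5} is independent.
imps = [(0.9, frozenset({1, 2})), (0.5, frozenset({2, 3})), (0.4, frozenset({5}))]
tiled = tile_improvements(imps)
```

The two selected changes (gains 0.9 and 0.4) can be applied in a single iteration; the overlapping 0.5 change is deferred to the next iteration, which then searches only near the changes just applied.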
One exception to this rule should be noted: since the expected number of spurious words (words with no counterpart in English) in the input is a function of the input length, and since all changes in word translation that involve the NULL word are mutually dependent, we should expect to find a very weak effect of this on the number of search iterations. Indeed, the scatter diagram in Fig. 4 suggests a slight increase in the number of iterations as the input length increases.5 At the same time, however, the number of changes considered during each search iteration eventually decreases, because subsequent search iterations are limited to areas where a change was previously performed. Empirical evidence as plotted on the right in Fig. 4 suggests that this effect &quot;neutralizes&quot; the increase in iterations as a function of the input length: the total number of changes considered indeed appears to grow linearly with the input length. It should be noted that ICT, while it does change the course of the search, primarily avoids redundant search steps -- it does not necessarily search a smaller search space, but searches it only once. 
The total number of improvements found is roughly the same (15,299 with ICT, 14,879 without, for the entire test corpus with a maximum swap distance of 2 and a maximum swap segment size of 5).</Paragraph> <Paragraph position="6"> [Footnote 5: Another possible explanation for this increase, especially at the left end, is that &quot;improvement clusters&quot; occur rarely enough not to occur at all in shorter sentences.]</Paragraph> <Paragraph position="7"> [Fig. 4: number of search iterations, without improvement caching and tiling and with improvement caching and tiling.]</Paragraph> </Section> <Section position="3" start_page="550" end_page="550" type="sub_section"> <SectionTitle> 4.3 Restrictions on Word Reordering </SectionTitle> <Paragraph position="0"> With O(n^4) search steps, unlimited swapping is by far the biggest consumer of processing time during decoding.</Paragraph> <Paragraph position="1"> When translating the Chinese test corpus from the 2002 TIDES MT evaluation6 without any limitations on swapping, swapping operations account for over 98% of the total search steps but for less than 5% of the improvements; the total translation time (with ICT) is about 34 CPU hours. For comparison, translating with a maximum swap segment size of 5 and a maximum swap distance of 2 takes ca. 40 minutes under otherwise unchanged circumstances. It should be mentioned that in practice, it is generally not a good idea to run the decoder without restrictions on swapping. In order to cope with hardware and time limitations, the sentences in the training data are typically limited in length. For example, the models used for the experiments reported here were trained on data with a sentence length limit of 40. Sentence pairs where one of the sentences exceeded this limit were ignored in training. Therefore, any swap that involves a distortion greater than that limit will result in the minimal (smoothed) distortion probability and most likely not lead to an improvement. 
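To see why the segment-size and distance limits make the number of SWAP operations linear in the input length, one can enumerate them directly. This sketch reads the "distance" as the gap between the two swapped regions, which is one plausible reading of the restriction, not necessarily the decoder's exact definition:

```python
def restricted_swaps(n, msss, msd):
    """Enumerate SWAP operations over positions 0..n-1 under a maximum
    swap segment size (msss) and a maximum swap distance (msd, read here
    as the gap between the two regions). Per start position the count is
    bounded by the constant msss * (msd + 1) * msss, so the total number
    of swaps grows only linearly with n."""
    swaps = []
    for i in range(n):                           # start of the first region
        for len1 in range(1, msss + 1):          # size of the first region
            for gap in range(msd + 1):           # distance between regions
                k = i + len1 + gap               # start of the second region
                for len2 in range(1, msss + 1):  # size of the second region
                    if k + len2 > n:             # second region must fit
                        break
                    swaps.append((i, len1, k, len2))
    return swaps

swaps = restricted_swaps(10, msss=1, msd=0)      # only adjacent-word swaps
```

With msss = 1 and msd = 0 this yields exactly the n - 1 adjacent transpositions; growing n only adds a constant number of swaps per extra position.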
The question is: how much swapping is enough? Is there any benefit to it at all? This is an interesting question, since virtually all efficient MT decoders (e.g. Tillmann and Ney, 2000; Berger et al., 1994; Alshawi et al., 2000; Vidal, 1997) impose limits on word reordering.</Paragraph> <Paragraph position="2"> In order to determine the effect of swap restrictions on decoder performance, we translated the Chinese test corpus 101 times with restrictions on the maximum swap [Footnote 6: 100 short news texts; 878 text segments; ca. 25K tokens/words.] [Fig. 5: BLEU scores (G1 decoding) as a function of maximum swap distance and maximum swap segment size.]</Paragraph> <Paragraph position="3"> distance (MSD) and the maximum swap segment size (MSSS) ranging from 0 to 10, and evaluated the translations with the BLEU7 metric (Papineni et al., 2002). The results are plotted in Fig. 5.</Paragraph> <Paragraph position="4"> On the one hand, the plot seems to paint a pretty clear picture on the low end: score improvements are comparatively large initially but level off quickly. Furthermore, the slight slope suggests slow but continuous improvements as swap restrictions are eased. For the Arabic test data from the same evaluation, we obtained a similar shape (although with a roughly level plateau). On the other hand, the 'bumpiness' of the surface raises the question as to which of these differences are statistically significant.</Paragraph> <Paragraph position="5"> We are aware of several ways to determine the statistical significance of BLEU score differences. One is bootstrap resampling (Efron and Tibshirani, 1993)8 to determine confidence intervals; another is splitting the test corpus into a certain number of subcorpora (e.g. 30) and then using the t-test to compare the average scores over these subcorpora (cf. Papineni et al., 2001). Bootstrap resampling for the various system outputs leads to very similar confidence intervals of about 0.006 to 0.007 for a one-sided test at a confidence level of .95. 
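Bootstrap resampling of this kind can be sketched as follows. For illustration the "corpus score" is a plain average of hypothetical per-segment scores; real BLEU is a corpus-level statistic and must be recomputed from the resampled segments' n-gram counts rather than averaged:

```python
import random

def bootstrap_ci(seg_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Bootstrap resampling sketch: draw test segments with replacement,
    recompute the corpus score for each resample, and read off an
    empirical (1 - alpha) confidence interval from the sorted scores."""
    rng = random.Random(seed)        # fixed seed for reproducibility
    stats = []
    for _ in range(n_resamples):
        sample = [rng.choice(seg_scores) for _ in seg_scores]
        stats.append(sum(sample) / len(sample))
    stats.sort()
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Made-up per-segment scores, purely for illustration:
lo, hi = bootstrap_ci([0.12, 0.15, 0.13, 0.18, 0.11, 0.16], n_resamples=2000)
```

A score difference between two systems is then judged significant only if it exceeds the width of such an interval, which is how the 0.006-0.007 figures above are used.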
With the t-score method, differences in score of 0.008 or higher seem to be significant at the same level of confidence.</Paragraph> <Paragraph position="6"> According to these metrics, none of the differences in the plot are significant, although the shape of the plot suggests that moderate swapping probably is a good idea.</Paragraph> <Paragraph position="7"> In addition to limitations of the accuracy of the BLEU method itself, variance in the decoder's performance can blur the picture. A third method to determine a confidence corridor is therefore to perform several randomized searches and compare their performance. Following a suggestion by Franz Josef Och (personal communication), we ran the decoder multiple times from randomized starting glosses for each sentence and then used the highest scoring one as the &quot;official&quot; system output. This gives us a lower bound on the price in performance that we pay for search errors. The results for up to ten searches from randomized starting points in addition to the baseline gloss are given in Tab. 1. Starting points were randomized by randomly picking one of the top 10 translation candidates (instead of the top candidate) for each input word, and performing a (small) random number of SWAP and INSERT operations before the actual search started. In order to ensure consistency across repeated runs, we used a pseudo random function. In our experiments, we did not mix G1 and G2 decoding. The practical reason for this is that G2 decoding takes more than ten times as long as G1 decoding. As the table illustrates, running multiple searches in G1 from randomized starting points is more efficient than running G2 once. [Footnote 8: Thanks to Franz Josef Och for pointing this option out to us.]</Paragraph> <Paragraph position="8"> Choosing the best sentences from all decoder runs results in a BLEU score of 0.157. Interestingly, the decoding time from the default starting point is much lower (G1: ca. 
40 min. vs. ca. 1 hour; G2: ca. 9.5 hours vs. ca. 11.3 hours), and the score, on average, is higher than when searching from a random starting point (G1: 0.143 vs.</Paragraph> <Paragraph position="9"> 0.127 (average); G2: 0.145 vs. 0.139 (average)). This indicates that the default seeding strategy is a good one.</Paragraph> <Paragraph position="10"> From the results of our experiments we conclude the following.</Paragraph> <Paragraph position="11"> First, Tab. 1 suggests that there is a good correlation between IBM Model 4 scores and the BLEU metric. Higher alignment probabilities lead to higher BLEU scores. Even though hardly any of the score differences are statistically significant (see confidence intervals above), there seems to be a trend.</Paragraph> <Paragraph position="12"> Secondly, from the swapping experiment we conclude that except for very local word reorderings, neither the IBM models nor the BLEU metric are able to recognize long distance dependencies (such as, for example, accounting for fundamental word order differences when translating from an SOV language into an SVO language).</Paragraph> <Paragraph position="13"> This is hardly surprising, since both the language model for decoding and the BLEU metric rely exclusively on n-grams. This explains why swapping helps so little. For a different approach that is based on dependency tree transformations, see Alshawi et al. (2000).</Paragraph> <Paragraph position="14"> Thirdly, the results of our experiments with randomized searches show that greedy decoding does not perform as well on longer sentences as one might conclude from the findings in Germann et al. (2001). At the same time, the speed improvements presented in this paper make multiple searches feasible, allowing for an overall faster and better decoder.</Paragraph> </Section> </Section> </Paper>