<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1030">
  <Title>Fast Decoding and Optimal Decoding for Machine Translation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 IBM Model 4
</SectionTitle>
    <Paragraph position="0"> In this paper, we work with IBM Model 4, which revolves around the notion of a word alignment over a pair of sentences (see Figure 1). A word alignment assigns a single home (English string position) to each French word. If two French words align to the same English word, then that it is not clear .</Paragraph>
    <Paragraph position="2"> English word is said to have a fertility of two.</Paragraph>
    <Paragraph position="3"> Likewise, if an English word remains unalignedto, then it has fertility zero. The word alignment in Figure 1 is shorthand for a hypothetical stochastic process by which an English string gets converted into a French string. There are several sets of decisions to be made.</Paragraph>
    <Paragraph position="4"> First, every English word is assigned a fertility. These assignments are made stochastically according to a table n(a10a11a4 ea12 ). We delete from the string any word with fertility zero, we duplicate any word with fertility two, etc. If a word has fertility greater than zero, we call it fertile. If its fertility is greater than one, we call it very fertile.</Paragraph>
    <Paragraph position="5"> After each English word in the new string, we may increment the fertility of an invisible English NULL element with probability pa13 (typically about 0.02). The NULL element ultimately produces &amp;quot;spurious&amp;quot; French words.</Paragraph>
    <Paragraph position="6"> Next, we perform a word-for-word replacement of English words (including NULL) by French words, according to the table t(fa14a15a4 ea12 ). Finally, we permute the French words. In permuting, Model 4 distinguishes between French words that are heads (the leftmost French word generated from a particular English word), non-heads (non-leftmost, generated only by very fertile English words), and NULL-generated.</Paragraph>
    <Paragraph position="7"> Heads. The head of one English word is assigned a French string position based on the position assigned to the previous English word. If an English word ea12a17a16a18a13 translates into something at French position j, then the French head word of ea12 is stochastically placed in French position k with distortion probability da13 (k-j a4 class(ea12a17a16a18a13 ), class(fa19 )), where &amp;quot;class&amp;quot; refers to automatically determined word classes for French and English vocabulary items. This relative offset k-j encourages adjacent English words to translate into adjacent French words. If ea12a17a16a18a13 is infertile, then j is taken from ea12a17a16a21a20 , etc. If ea12a17a16a18a13 is very fertile, then j is the average of the positions of its French translations. null Non-heads. If the head of English word ea12 is placed in French position j, then its first non-head is placed in French position k (a9 j) according to another table da22a18a13 (k-j a4 class(fa19 )). The next non-head is placed at position q with probability da22a18a13 (q-k a4 class(fa23 )), and so forth.</Paragraph>
    <Paragraph position="8"> NULL-generated. After heads and non-heads are placed, NULL-generated words are permuted into the remaining vacant slots randomly. If there are a10a25a24 NULL-generated words, then any placement scheme is chosen with probability 1/a10 a24a27a26 . These stochastic decisions, starting with e, result in different choices of f and an alignment of f with e. We map an e onto a particular a28 a,fa9 pair with probability:</Paragraph>
    <Paragraph position="10"> where the factors separated by a39 symbols denote fertility, translation, head permutation, non-head permutation, null-fertility, and null-translation probabilities.1</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Definition of the Problem
</SectionTitle>
    <Paragraph position="0"> If we observe a new sentence f, then an optimal decoder will search for an e that maximizes P(ea4 f)  (the length of f), ea42 (the ia92a32a93 English word in e), ea83 (the NULL word), a94 a42 (the fertility of ea42 ), a94 a83 (the fertility of the NULL word), a95 a42a97a96 (the ka92a32a93 French word produced by ea42 in a), a98 a42a99a96 (the position of a95</Paragraph>
    <Paragraph position="2"> a42 (the position of the first fertile word to the left of ea42 in a), a101a85a102a104a103 (the ceiling of the average of all a98 a102a104a103 a96 for a100 a42 , or 0 if a100 a42 is undefined).</Paragraph>
    <Paragraph position="4"> over all possible alignments a. Because this sum involves significant computation, we typically avoid it by instead searching for an a28 e,aa9 pair that maximizes P(e,aa4 f) a105 P(e) a7 P(a,fa4 e). We take the language model P(e) to be a smoothed n-gram model of English.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Stack-Based Decoding
</SectionTitle>
    <Paragraph position="0"> The stack (also called A*) decoding algorithm is a kind of best-first search which was first introduced in the domain of speech recognition (Jelinek, 1969). By building solutions incrementally and storing partial solutions, or hypotheses, in a &amp;quot;stack&amp;quot; (in modern terminology, a priority queue), the decoder conducts an ordered search of the solution space. In the ideal case (unlimited stack size and exhaustive search time), a stack decoder is guaranteed to find an optimal solution; our hope is to do almost as well under real-world constraints of limited space and time. The generic stack decoding algorithm follows: a106 Initialize the stack with an empty hypothesis. null a106 Pop h, the best hypothesis, off the stack.</Paragraph>
    <Paragraph position="1"> a106 If h is a complete sentence, output h and terminate.</Paragraph>
    <Paragraph position="2"> a106 For each possible next word w, extend h by adding w and push the resulting hypothesis onto the stack.</Paragraph>
    <Paragraph position="3"> a106 Return to the second step (pop).</Paragraph>
    <Paragraph position="4"> One crucial difference between the decoding process in speech recognition (SR) and machine translation (MT) is that speech is always produced in the same order as its transcription. Consequently, in SR decoding there is always a simple left-to-right correspondence between input and output sequences. By contrast, in MT the left-to-right relation rarely holds even for language pairs as similar as French and English. We address this problem by building the solution from left to right, but allowing the decoder to consume its input in any order. This change makes decoding significantly more complex in MT; instead of knowing the order of the input in advance, we must consider all a107 a26 permutations of an a107 -word input sentence.</Paragraph>
    <Paragraph position="5"> Another important difference between SR and MT decoding is the lack of reliable heuristics in MT. A heuristic is used in A* search to estimate the cost of completing a partial hypothesis. A good heuristic makes it possible to accurately compare the value of different partial hypotheses, and thus to focus the search in the most promising direction. The left-to-right restriction in SR makes it possible to use a simple yet reliable class of heuristics which estimate cost based on the amount of input left to decode. Partly because of the absence of left-to-right correspondence, MT heuristics are significantly more difficult to develop (Wang and Waibel, 1997). Without a heuristic, a classic stack decoder is ineffective because shorter hypotheses will almost always look more attractive than longer ones, since as we add words to a hypothesis, we end up multiplying more and more terms to find the probability. Because of this, longer hypotheses will be pushed off the end of the stack by shorter ones even if they are in reality better decodings. Fortunately, by using more than one stack, we can eliminate this effect.</Paragraph>
    <Paragraph position="6"> In a multistack decoder, we employ more than one stack to force hypotheses to compete fairly.</Paragraph>
    <Paragraph position="7"> More specifically, we have one stack for each sub-set of input words. This way, a hypothesis can only be pruned if there are other, better, hypotheses that represent the same portion of the input. With more than one stack, however, how does a multistack decoder choose which hypothesis to extend during each iteration? We address this issue by simply taking one hypothesis from each stack, but a better solution would be to somehow compare hypotheses from different stacks and extend only the best ones.</Paragraph>
    <Paragraph position="8"> The multistack decoder we describe is closely patterned on the Model 3 decoder described in the (Brown et al., 1995) patent. We build solutions incrementally by applying operations to hypotheses. There are four operations: a106 Add adds a new English word and aligns a single French word to it.</Paragraph>
    <Paragraph position="9"> a106 AddZfert adds two new English words.</Paragraph>
    <Paragraph position="10"> The first has fertility zero, while the second is aligned to a single French word.</Paragraph>
    <Paragraph position="11"> a106 Extend aligns an additional French word to the most recent English word, increasing its fertility.</Paragraph>
    <Paragraph position="12"> a106 AddNull aligns a French word to the English NULL element.</Paragraph>
    <Paragraph position="13"> AddZfert is by far the most expensive operation, as we must consider inserting a zero-fertility English word before each translation of each unaligned French word. With an English vocabulary size of 40,000, AddZfert is 400,000 times more expensive than AddNull! We can reduce the cost of AddZfert in two ways. First, we can consider only certain English words as candidates for zero-fertility, namely words which both occur frequently and have a high probability of being assigned frequency zero. Second, we can only insert a zero-fertility word if it will increase the probability of a hypothesis. According to the definition of the decoding problem, a zero-fertility English word can only make a decoding more likely by increasing P(e) more than it decreases P(a,fa4 e).2 By only considering helpful zero-fertility insertions, we save ourselves significant overhead in the AddZfert operation, in many cases eliminating all possibilities and reducing its cost to less than that of AddNull.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Greedy Decoding
</SectionTitle>
    <Paragraph position="0"> Over the last decade, many instances of NP-complete problems have been shown to be solvable in reasonable/polynomial time using greedy methods (Selman et al., 1992; Monasson et al., 1999). Instead of deeply probing the search space, such greedy methods typically start out with a random, approximate solution and then try to improve it incrementally until a satisfactory solution is reached. In many cases, greedy methods quickly yield surprisingly good solutions.</Paragraph>
    <Paragraph position="1"> We conjectured that such greedy methods may prove to be helpful in the context of MT decoding. The greedy decoder that we describe starts the translation process from an English gloss of the French sentence given as input. The gloss is constructed by aligning each French word fa14 with its most likely English translation ef</Paragraph>
    <Paragraph position="3"> argmaxa111 t(e a4 fa14 )). For example, in translating the French sentence &amp;quot;Bien entendu , il parle de une belle victoire .&amp;quot;, the greedy decoder initially as- null sumes that a good translation of it is &amp;quot;Well heard , it talking a beautiful victory&amp;quot; because the best translation of &amp;quot;bien&amp;quot; is &amp;quot;well&amp;quot;, the best translation of &amp;quot;entendu&amp;quot; is &amp;quot;heard&amp;quot;, and so on. The alignment corresponding to this translation is shown at the top of Figure 2.</Paragraph>
    <Paragraph position="4"> Once the initial alignment is created, the greedy decoder tries to improve it, i.e., tries to find an alignment (and implicitly translation) of higher probability, by applying one of the follow-</Paragraph>
    <Paragraph position="6"> changes the translation of one or two French words, those located at positions a114a75a13 and a114a43a20 ,  the NULL word, the word ea19 is inserted into the translation at the position that yields the alignment of highest probability. If ea115  into a6a75a13 and simulataneously inserts word ea20 at the position that yields the alignment of highest probability. Word a6a118a20 is selected from an automatically derived list of 1024 words with high probability of having fertility 0. When ea115</Paragraph>
    <Paragraph position="8"> amounts to inserting a word of fertility 0 into the alignment.</Paragraph>
    <Paragraph position="10"> alignment from the old one by swapping non-overlapping English word segments a120 a119a51a13a118a66a51a119a74a20a60a121 and a120 a114a75a13a118a66a104a114a118a20a60a121 . During the swap operation, all existing links between English and French words are preserved. The segments can be as small as a word or as long as a4a69a6a122a4a33a55a123a84 words, where a4a69a6a124a4 is the length of the English sentence.</Paragraph>
    <Paragraph position="12"> ment the English word at position a119a51a13 (or a119a104a20 ) and links the French words generated by a6 a12 a116</Paragraph>
    <Paragraph position="14"> entendu, il parle de une belle victoire.&amp;quot; In a stepwise fashion, starting from the initial gloss, the greedy decoder iterates exhaustively over all alignments that are one operation away from the alignment under consideration. At every step, the decoder chooses the alignment of highest probability, until the probability of the current alignment can no longer be improved. When it starts from the gloss of the French sentence &amp;quot;Bien entendu, il parle de une belle victoire.&amp;quot;, for example, the greedy decoder alters the initial alignment incrementally as shown in Figure 2, eventually producing the translation &amp;quot;Quite naturally, he talks about a great victory.&amp;quot;. In the process, the decoder explores a total of 77421 distinct alignments/translations, of which &amp;quot;Quite naturally, he talks about a great victory.&amp;quot; has the highest probability. null We chose the operation types enumerated above for two reasons: (i) they are general enough to enable the decoder escape local maxima and modify in a non-trivial manner a given alignment in order to produce good translations; (ii) they are relatively inexpensive (timewise). The most time consuming operations in the decoder are swapSegments, translateOneOrTwoWords,  a4 is the number of translations we associate with each word (in our implementation, we limit this number to the top 10 translations). TranslateAndInsert iterates over</Paragraph>
    <Paragraph position="16"> size of the list of words with high probability of having fertility 0 (1024 words in our implementation). null</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Integer Programming Decoding
</SectionTitle>
    <Paragraph position="0"> Knight (1999) likens MT decoding to finding optimal tours in the Traveling Salesman Problem (Garey and Johnson, 1979)--choosing a good word order for decoder output is similar to choosing a good TSP tour. Because any TSP problem instance can be transformed into a decoding problem instance, Model 4 decoding is provably NP-complete in the length of f. It is interesting to consider the reverse direction--is it possible to transform a decoding problem instance into a TSP instance? If so, we may take great advantage of previous research into efficient TSP algorithms. We may also take advantage of existing software packages, obtaining a sophisticated decoder with little programming effort. It is difficult to convert decoding into straight TSP, but a wide range of combinatorial optimization problems (including TSP) can be expressed in the more general framework of linear integer programming. A sample integer program (IP) looks like this: minimize objective function:</Paragraph>
    <Paragraph position="2"> A solution to an IP is an assignment of integer values to variables. Solutions are constrained by inequalities involving linear combinations of variables. An optimal solution is one that respects the constraints and minimizes the value of the objective function, which is also a linear combination of variables. We can solve IP instances with generic problem-solving software such as lp solve or CPLEX.3 In this section we explain  tence f = &amp;quot;CE NE EST PAS CLAIR .&amp;quot; There is one city for each word in f. City boundaries are marked with bold lines, and hotels are illustrated with rectangles. A tour of cities is a sequence of hotels (starting at the sentence boundary hotel) that visits each city exactly once before returning to the start.</Paragraph>
    <Paragraph position="3"> how to express MT decoding (Model 4 plus English bigrams) in IP format.</Paragraph>
    <Paragraph position="4"> We first create a salesman graph like the one in Figure 3. To do this, we set up a city for each word in the observed sentence f. City boundaries are shown with bold lines. We populate each city with ten hotels corresponding to ten likely English word translations. Hotels are shown as small rectangles. The owner of a hotel is the English word inside the rectangle. If two cities have hotels with the same owner x, then we build a third xowned hotel on the border of the two cities. More generally, if a107 cities all have hotels owned by x, we build a174a48a175a122a55a78a107a168a55a176a84 new hotels (one for each non-empty, non-singleton subset of the cities) on various city borders and intersections. Finally, we add an extra city representing the sentence boundary. null We define a tour of cities as a sequence and hotels (starting at the sentence boundary hotel) that visits each city exactly once before returning to the start. If a hotel sits on the border between two cities, then staying at that hotel counts as visiting both cities. We can view each tour of cities as corresponding to a potential decoding a28 e,aa9 . The owners of the hotels on the tour give us e, while the hotel locations yield a.</Paragraph>
    <Paragraph position="5"> The next task is to establish real-valued (asymmetric) distances between pairs of hotels, such that the length of any tour is exactly the negative of log(P(e) a7 P(a,fa4 e)). Because log is monotonic, the shortest tour will correspond to the likeliest decoding.</Paragraph>
    <Paragraph position="6"> The distance we assign to each pair of hotels consists of some small piece of the Model 4 formula. The usual case is typified by the large black arrow in Figure 3. Because the destination hotel &amp;quot;not&amp;quot; sits on the border between cities NE and PAS, it corresponds to a partial alignment in which the word &amp;quot;not&amp;quot; has fertility two: ... what not ...</Paragraph>
    <Paragraph position="8"> If we assume that we have already paid the price for visiting the &amp;quot;what&amp;quot; hotel, then our interhotel distance need only account for the partial alignment concerning &amp;quot;not&amp;quot;:  NULL-owned hotels are treated specially. We require that all non-NULL hotels be visited before any NULL hotels, and we further require that at most one NULL hotel visited on a tour. Moreover, the NULL fertility sub-formula is easy to compute if we allow only one NULL hotel to be visited: a10 a24 is simply the number of cities that hotel straddles, and a77 is the number of cities minus one. This case is typified by the large gray arrow shown in Figure 3.</Paragraph>
    <Paragraph position="9"> Between hotels that are located (even partially) in the same city, we assign an infinite distance in both directions, as travel from one to the other can never be part of a tour. For 6-word French sentences, we normally come up with a graph that has about 80 hotels and 3500 finite-cost travel segments. null The next step is to cast tour selection as an integer program. Here we adapt a subtour elimination strategy used in standard TSP. We create a binary (0/1) integer variable a177a21a12 a14 for each pair of hotels a119 and a114 . a177a36a12 a14 a109 a84 if and only if travel from hotel a119 to hotel a114 is on the itinerary. The objective function is straightforward: minimize: a178</Paragraph>
    <Paragraph position="11"> This minimization is subject to three classes of constraints. First, every city must be visited exactly once. That means exactly one tour segment must exit each city:  Second, the segments must be linked to one another, i.e., every hotel has either (a) one tour segment coming in and one going out, or (b) no segments in and none out. To put it another way, every hotel must have an equal number of tour segments going in and out:  There are an exponential number of constraints in this third class.</Paragraph>
    <Paragraph position="12"> Finally, we invoke our IP solver. If we assign mnemonic names to the variables, we can easily extract a28 e,aa9 from the list of variables and their binary values. The shortest tour for the graph in Figure 3 corresponds to this optimal decoding: it is not clear .</Paragraph>
    <Paragraph position="13"> We can obtain the second-best decoding by adding a new constraint to the IP to stop it from</Paragraph>
  </Section>
class="xml-element"></Paper>