<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1091"> <Title>An Algorithmic Framework for the Decoding Problem in Statistical Machine Translation</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Framework for Decoding </SectionTitle> <Paragraph position="0"> We begin with a couple of useful observations about the decoding problem. Although deceptively simple, these observations are crucial for developing our framework: they are the source of the algorithmic handles for breaking the decoding problem into two relatively easier search problems. The first observation concerns solving the problem when we know in advance the mapping between the source and target sentences. It leads to an extremely simple algorithm for decoding when the alignment is known (or can be guessed). Our second observation is on finding a better alignment between the source and target sentences, starting with an initial (possibly suboptimal) alignment. The insights provided by the two observations are employed in building a powerful algorithmic framework.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Handles for Attacking the Decoding Problem </SectionTitle> <Paragraph position="0"> Our goal is to arrive at algorithmic handles for attacking RELAXED DECODING. In this section, we make a couple of useful observations and develop algorithmic handles from the insight they provide. The first of the two observations is:</Paragraph> <Paragraph position="1"> Observation 1 For a given target length l and a given alignment ~a that maps source words to target positions, it is easy to compute the optimal target sentence ^e:</Paragraph> <Paragraph position="2"> ^e = argmax_{e : |e| = l} Pr(f,~a|e) Pr(e) (4) </Paragraph> <Paragraph position="3"> Let us call the search problem specified by Equation 4 FIXED ALIGNMENT DECODING. Observation 1 says that once the target sentence length and the source-to-target mapping are fixed, the optimal target sentence (with the specified target length and alignment) can be computed efficiently. As we will show later, the optimal solution for FIXED ALIGNMENT DECODING can be computed in O(m) time for IBM models 1-5 using dynamic programming. As we can always guess an alignment (as is the case with many decoding algorithms in the literature), this observation provides an algorithmic handle for finding suboptimal solutions for RELAXED DECODING.</Paragraph> <Paragraph position="4"> Our second observation is on computing the optimal alignment between the source sentence and the target sentence.</Paragraph> <Paragraph position="5"> Observation 2 For a given target sentence e, it is easy to compute the optimal alignment ^a that maps the source words to the target words:</Paragraph> <Paragraph position="6"> ^a = argmax_a Pr(f,a|e) </Paragraph> <Paragraph position="7"> It is easy to determine the optimal (Viterbi) alignment between the source sentence and its translation. In fact, for IBM models 1 and 2, the Viterbi alignment can be computed by a straightforward algorithm in O(ml) time.</Paragraph>
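For IBM Model 1, where each source position is aligned independently of the others, the O(ml) computation reduces to a per-word argmax. The following is a minimal sketch of this special case, assuming the translation table is supplied as a nested dictionary (the names viterbi_alignment_model1 and t_table are illustrative, not from the paper):

    def viterbi_alignment_model1(f_words, e_words, t_table):
        """Viterbi alignment under IBM Model 1 in O(ml) time.

        Model 1 scores each source position independently, so the
        globally optimal alignment is the per-word argmax
        a_j = argmax_i t(f_j | e_i), with i = 0 denoting the NULL word.
        """
        targets = ["NULL"] + e_words           # position 0 is the empty word
        alignment = []
        for f in f_words:                      # m source words ...
            best_i, best_p = 0, 0.0
            for i, e in enumerate(targets):    # ... times l+1 target positions
                p = t_table.get(e, {}).get(f, 1e-12)
                if p > best_p:
                    best_i, best_p = i, p
            alignment.append(best_i)
        return alignment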
<Paragraph position="8"> For higher models, an approximate Viterbi alignment can be computed by an iterative procedure called local search: in each iteration, we look in the neighborhood of the current best alignment for a better alignment (Brown et al., 1993). The first iteration can start with an arbitrary alignment (say, the Viterbi alignment of Model 2). One iteration of local search can be implemented in O(ml) time, and in practice the number of iterations is bounded by O(m); local search therefore takes O(m²l) time. Our framework is not strictly dependent on the computation of an optimal alignment: any alignment that is better than the current one is good enough for it to work, and it is straightforward to find one such alignment using restricted swaps and moves in O(m) time.</Paragraph> <Paragraph position="9"> In the remainder of this paper, we use the term Viterbi to denote any linear time algorithm for computing an improved alignment between the source sentence and its translation.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Illustrative Algorithms </SectionTitle> <Paragraph position="0"> In this section, we show how the handles provided by the above two observations can be employed to solve RELAXED DECODING. The two handles are complementary: when the alignment is known, we can efficiently determine the optimal translation with that alignment; conversely, when the translation is known, we can efficiently determine a better alignment. We can therefore use one to improve the other. We begin with the following simple linear time decoding algorithm, which is based on the first observation.</Paragraph> <Paragraph position="1"> Algorithm NaiveDecode
Input: Source language sentence f of length m > 0.
Optional Inputs: Target sentence length l, alignment ~a between the source words and target positions.
Output: Target language sentence ^e of length l.
1. If l is not specified, let l = m.
2. If an alignment is not specified, guess some alignment ~a.
3. Compute the optimal translation ^e by solving FIXED ALIGNMENT DECODING, i.e., ^e = argmax_e Pr(f,~a|e) Pr(e).
4. return ^e.</Paragraph> <Paragraph position="2"> When the length of the translation is not specified, NaiveDecode assumes that the translation is of the same length as the source sentence. If an alignment that maps the source words to target positions is not specified, the algorithm guesses an alignment ~a (~a can be the trivial alignment that maps the source word fj to target position j, that is, ~aj = j, or can be guessed more intelligently). It then computes the optimal translation for the source sentence f, with the length of the target sentence and the alignment between the source and the target sentences kept fixed at l and ~a respectively, by maximizing Pr(f,~a|e)Pr(e). As FIXED ALIGNMENT DECODING can be solved in O(m) time, NaiveDecode takes only O(m) time.</Paragraph> <Paragraph position="3"> The value of NaiveDecode lies not in itself per se, but in its instrumental role in designing superior algorithms.</Paragraph>
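Concretely, NaiveDecode is a thin wrapper that defaults the two optional inputs and delegates to a fixed-alignment solver. A minimal sketch, assuming the O(m) dynamic program of Section 4 is available as a black-box callable (all Python names here are illustrative):

    def naive_decode(f_words, l=None, alignment=None, fixed_alignment_decode=None):
        """NaiveDecode: guess the missing inputs, then decode in O(m)."""
        m = len(f_words)
        if l is None:                    # step 1: default target length
            l = m
        if alignment is None:            # step 2: trivial alignment a_j = j
            alignment = list(range(1, m + 1))
        # step 3: optimal translation with (l, alignment) held fixed
        return fixed_alignment_decode(f_words, l, alignment)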
<Paragraph position="4"> The power of NaiveDecode can be demonstrated with the following optimal algorithm for RELAXED DECODING.</Paragraph> <Paragraph position="5"> Algorithm NaiveOptimalDecode
Input: Source language sentence f of length m > 0.
Output: Target language sentence ^e of length l, m/2 ≤ l ≤ 2m.
1. Let ^e = null and ^a = null.
2. For each l = m/2,...,2m do
3. For each alignment a between the source words and the target positions do
(a) Let e = NaiveDecode(f,l,a).
(b) If Pr(f,e,a) > Pr(f,^e,^a) then i. ^e = e ii. ^a = a.
4. return (^e,^a).</Paragraph> <Paragraph position="6"> NaiveOptimalDecode considers various target lengths and all possible alignments between the source words and the target positions. For each target length l and alignment a, it employs NaiveDecode to find the best solution. There are (l+1)^m candidate alignments for a target length l and O(m) candidate target lengths. Therefore, NaiveOptimalDecode explores Θ(m(l+1)^m) alignments, and for each of these candidate alignments it makes a call to NaiveDecode. The time complexity of NaiveOptimalDecode is, therefore, O(m²(l+1)^m). Although it is an exponential time algorithm, it computes the optimal solution for RELAXED DECODING.</Paragraph>
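The exhaustive enumeration can be written down directly. The sketch below serves only to make the Θ(m(l+1)^m) count concrete and is unusable beyond toy inputs; naive_decode(f, l, a) and score(f, e, a) are assumed helpers (the former is NaiveDecode with the fixed-alignment solver bound in, the latter computes Pr(f,e,a)):

    from itertools import product

    def naive_optimal_decode(f_words, naive_decode, score):
        """Exhaustive decoding: try every length and every alignment."""
        m = len(f_words)
        best_e, best_a, best_p = None, None, float("-inf")
        for l in range(m // 2, 2 * m + 1):            # O(m) candidate lengths
            for a in product(range(l + 1), repeat=m): # (l+1)^m alignments
                e = naive_decode(f_words, l, list(a))
                p = score(f_words, e, list(a))        # Pr(f, e, a)
                if p > best_p:
                    best_e, best_a, best_p = e, list(a), p
        return best_e, best_a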
<Paragraph position="7"> With NaiveDecode and NaiveOptimalDecode we have demonstrated the power of the algorithmic handle provided by Observation 1. It is important to note that these two algorithms are at the two extremities of the spectrum: NaiveDecode is a linear time algorithm that computes a suboptimal solution for RELAXED DECODING, while NaiveOptimalDecode is an exponential time algorithm that computes the optimal solution. What we want are algorithms that are close to NaiveDecode in complexity and to NaiveOptimalDecode in quality. It is possible to reduce the complexity of NaiveOptimalDecode significantly by carefully reducing the number of alignments that are examined. Instead of examining all Θ(m(l+1)^m) alignments, if we examine only a small number of alignments, say g(m), in NaiveOptimalDecode, we can find a solution in O(mg(m)) time. In the next section, we show how to restrict the search to only a small number of promising alignments.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Alternating Maximization </SectionTitle> <Paragraph position="0"> We now show how to use the two algorithmic handles to come up with a fast search paradigm. We alternate between searching for the best translation given an alignment and searching for the best alignment given a translation. Since the two subproblems are complementary, alternating between them lets the solution of each be used to improve the solution computed by the other.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Algorithm AlternatingSearch </SectionTitle> <Paragraph position="0"> Input: Source language sentence f of length m > 0.
Output: Target language sentence e(o) of length l (m/2 ≤ l ≤ 2m).
1. Let e(o) = null and a(o) = null.
2. For each l = m/2,...,2m do
(a) Let e = null and a = null.
(b) While the solution improves do
i. Let e = NaiveDecode(f,l,a).
ii. Let a = Viterbi(f,e).
(c) If Pr(f,e,a) > Pr(f,e(o),a(o)) then i. e(o) = e ii. a(o) = a.
3. return e(o).</Paragraph> <Paragraph position="1"> AlternatingSearch searches for a good translation by varying the length of the target sentence. For a sentence length l, the algorithm finds a translation of length l and then iteratively improves the translation. In each iteration it solves two subproblems: FIXED ALIGNMENT DECODING and VITERBI ALIGNMENT. The inputs to each iteration are the source sentence f, the target sentence length l, and an alignment a between the source and target sentences. AlternatingSearch first finds a better translation e for f by solving FIXED ALIGNMENT DECODING, employing NaiveDecode for this purpose. Having computed e, the algorithm computes a better alignment between e and f by solving VITERBI ALIGNMENT with the Viterbi algorithm. The new alignment is used by the algorithm in the subsequent iteration. At the end of each iteration the algorithm checks whether it has made progress. The algorithm returns the best translation of the source f across a range of target sentence lengths.</Paragraph> <Paragraph position="2"> The analysis of AlternatingSearch is complicated by the fact that the number of iterations (see step 2.b) depends on the input. It is reasonable to assume that the length of the source sentence, m, is an upper bound on the number of iterations; in practice, the number of iterations is typically O(1). There are 3m/2 candidate sentence lengths for the translation (l varies from m/2 to 2m), and both NaiveDecode and Viterbi are O(m). Therefore, the time complexity of AlternatingSearch is O(m²).</Paragraph>
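Stripped of bookkeeping, AlternatingSearch is a fixed-point loop around the two handles. A minimal sketch, again with naive_decode, viterbi (any linear time alignment improver, as above), and score assumed as helpers:

    def alternating_search(f_words, naive_decode, viterbi, score):
        """Alternate fixed-alignment decoding and realignment."""
        m = len(f_words)
        best_e, best_p = None, float("-inf")
        for l in range(m // 2, 2 * m + 1):         # 3m/2 candidate lengths
            e_l, a_l, p_l = None, None, float("-inf")
            while True:                            # step 2(b): iterate until
                e = naive_decode(f_words, l, a_l)  # no further improvement
                a = viterbi(f_words, e)
                p = score(f_words, e, a)
                if p <= p_l:
                    break
                e_l, a_l, p_l = e, a, p
            if p_l > best_p:                       # step 2(c): keep the best
                best_e, best_p = e_l, p_l
        return best_e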
</Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 A Linear Time Algorithm for FIXED ALIGNMENT DECODING </SectionTitle> <Paragraph position="0"> A key component of all our algorithms is a linear time algorithm for the problem FIXED ALIGNMENT DECODING. Recall that in FIXED ALIGNMENT DECODING we are given the target length l and a mapping ~a from source words to target positions; the goal is then to find the optimal translation with ~a as the alignment. In this section, we give a dynamic programming based solution to this problem. Our solution rests on a new formulation of the IBM translation models. We begin our discussion with a few technical definitions.</Paragraph> <Paragraph position="1"> Alignment ~a maps each of the source words fj, j = 1,...,m, to a target position in the range [0,...,l]. Define a mapping ψ from [0,...,l] to subsets of {1,...,m} as follows: ψ(i) = {j : j ∈ {1,...,m} and ~aj = i} for all i = 0,...,l. ψ(i) is the set of source positions which are mapped to the target location i by the alignment ~a, and the fertility of the target position i is φi = |ψ(i)|.</Paragraph> <Paragraph position="2"> We can rewrite each of the IBM models Pr(f,~a|e) as follows:</Paragraph> <Paragraph position="3"> Pr(f,~a|e) = ∏_{i=0}^{l} Ti Di Ni </Paragraph> <Paragraph position="4"> Table 2 shows the decomposition of Pr(f,~a|e) into the constituents Ti, Di, and Ni.</Paragraph> <Paragraph position="5"> As a consequence, we can write Pr(f,~a|e)Pr(e) as:</Paragraph> <Paragraph position="6"> Pr(f,~a|e) Pr(e) = λ (∏_{i=1}^{l} Li) (∏_{i=0}^{l} Ti Di Ni) </Paragraph> <Paragraph position="7"> where Li = trigram(ei|ei-2,ei-1) and λ is the trigram probability of the boundary word.</Paragraph> <Paragraph position="8"> The above reformulation of the objective function of the decoding problem allows us to employ dynamic programming to solve FIXED ALIGNMENT DECODING efficiently. Note that each word ei has only a constant number of candidates in the vocabulary. Therefore, the set of words e1,...,el that maximizes the above objective can be found in O(m) time using the standard dynamic programming algorithm (Cormen et al., 2001).</Paragraph>
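To make the dynamic program concrete, here is a minimal sketch of FIXED ALIGNMENT DECODING under the reformulation above. The callables candidates(i) (the constant-size candidate set for target position i), local_score(e, i) (the factor Ti Di Ni of Table 2), and trigram(e, u, v) are hypothetical stand-ins, and all probabilities are assumed nonzero:

    import math

    def fixed_alignment_decode(l, candidates, local_score, trigram, bos="<s>"):
        """DP over target positions; state = last two target words.

        With constant-size candidate sets the chart at each position has
        constant width, so the whole run takes O(l) = O(m) time.
        """
        charts = [{(bos, bos): (0.0, None)}]       # state -> (logprob, back)
        for i in range(1, l + 1):
            chart = {}
            for (u, v), (lp, _) in charts[i - 1].items():
                for e in candidates(i):
                    s = lp + math.log(trigram(e, u, v) * local_score(e, i))
                    if (v, e) not in chart or s > chart[(v, e)][0]:
                        chart[(v, e)] = (s, (u, v))  # best predecessor state
            charts.append(chart)
        # trace back from the best final state to recover e_1, ..., e_l
        state = max(charts[l], key=lambda st: charts[l][st][0])
        words = []
        for i in range(l, 0, -1):
            words.append(state[1])                 # state is (e_{i-1}, e_i)
            state = charts[i][state][1]
        return list(reversed(words))

</Section> </Paper>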