<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0301"> <Title>Robust Bilingual Word Alignment for Machine Aided Translation</Title> <Section position="3" start_page="0" end_page="5" type="metho"> <SectionTitle> 2 The alignment Algorithm </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="3" type="sub_section"> <SectionTitle> 2.1 Estimation of translation probabilities </SectionTitle> <Paragraph position="0"> The translation probabilities are estimated using a method based on Brown et al.'s Model 2 (1993), which is summarized in the following subsection, 2.1.1. Then, in subsection 2.1.2, we describe modifications that achieve three goals: (1) enable word_align to accept input which may not be aligned by sentence (e.g. char_align's output), (2) reduce the number of parameters that need to be estimated, and (3) prepare the ground for the second step, the search for the best alignment (described in section 2.2).</Paragraph> <Paragraph position="1"> 2.1.1 Brown et al.'s Model In the context of their statistical machine translation project (Brown et al., 1990), Brown et al. estimate Pr(f|e), the probability that f, a sentence in one language (say French), is the translation of e, a sentence in the other language (say English). Pr(f|e) is computed using the concept of alignment, denoted by a, which is a set of connections between each French word in f and the corresponding English word in e. A connection, which we will write as $con_{j,i}^{f,e}$, specifies that position j in f is connected to position i in e. If a French word in f does not correspond to any English word in e, then it is connected to the special word null (position 0 in e). Notice that this model is directional, as each French position is connected to exactly one position in the English sentence (which might be the null word), and accordingly the number of connections in an alignment is equal to the length of the French sentence. 
However, an English word may be connected to several words in the French sentence, or not connected at all.</Paragraph> <Paragraph position="2"> Using alignments, the translation probability for a pair of sentences is expressed as $\Pr(f|e) = \sum_{a \in \mathcal{A}} \Pr(f, a|e)$ (1)</Paragraph> <Paragraph position="4"> where $\mathcal{A}$ is the set of all combinatorially possible alignments for the sentences f and e (calligraphic font will be used to denote sets).</Paragraph> <Paragraph position="5"> In their paper, Brown et al. present a series of 5 models of Pr(f|e). The first two of these 5 models are summarized here.</Paragraph> <Paragraph position="6"> Model 1 Model 1 assumes that Pr(f, a|e) depends primarily on t(f|e), the probability that an occurrence of the English word e is translated as the French word f. That is, $\Pr(f, a|e) = C_{f,e} \prod_{j=1}^{m} t(f_j|e_{a_j})$</Paragraph> <Paragraph position="8"> (2) where $C_{f,e}$, an irrelevant constant, accounts for certain dependencies on sentence lengths, which are not important for our purposes here. Except for $C_{f,e}$, most of the notation is borrowed from Brown et al. The variable, j, is used to refer to a position in a French sentence, and the variable, i, is used to refer to a position in an English sentence. The expression, $f_j$, is used to refer to the French word in position j of a French sentence, and $e_i$ is used to refer to the English word in position i of an English sentence. An alignment, a, is a set of pairs (j, i), each of which connects a position in a French sentence with a corresponding position in an English sentence. The expression, $a_j$, is used to refer to the English position that is connected to the French position j, and the expression, $e_{a_j}$, is used to refer to the English word in position $a_j$.</Paragraph> <Paragraph position="9"> The variable, m, is used to denote the length of the French sentence and the variable, l, is used to denote the length of the English sentence.</Paragraph> <Paragraph position="10"> There are quite a number of constraints that could be used to estimate Pr(f, a|e). 
Model 1 depends primarily on the translation probabilities, t(f|e), and does not make use of constraints involving the positions within an alignment. These constraints will be exploited in Model 2.</Paragraph> <Paragraph position="11"> Brown et al. estimate t(f|e) on the basis of a training set, a set of English and French sentences that have been aligned at the sentence level. Those values of t(f|e) that maximize the probability of the training set are called the maximum likelihood estimates. Brown et al. show that the maximum likelihood estimates satisfy $t(f|e) = \frac{\sum_{con \in \mathcal{CON}_{f,e}} \Pr(con)}{\sum_{con \in \mathcal{CON}_{\cdot,e}} \Pr(con)}$ (3)</Paragraph> <Paragraph position="13"> where $\mathcal{CON}_{f,e}$ and $\mathcal{CON}_{\cdot,e}$ denote sets of connections: the set $\mathcal{CON}_{f,e}$ contains all connections in the training data between f and e, and the set $\mathcal{CON}_{\cdot,e}$ contains all connections between some French word and e. The probability of a connection, $con_{j,i}^{f,e}$, is the sum of the probabilities of all alignments that contain it. Notice that equation 3 satisfies the constraint $\sum_{f} t(f|e) = 1$, for each English word e.</Paragraph> <Paragraph position="14"> It follows from the definition of Model 1 that the probability of a connection satisfies: $\Pr(con_{j,i}^{f,e}) = \frac{t(f_j|e_i)}{\sum_{k=0}^{l} t(f_j|e_k)}$ (4)</Paragraph> <Paragraph position="16"> Recall that $f_j$ refers to the French word in position j of the French sentence f of length m, and that $e_i$ refers to the English word in position i of the English sentence e of length l. Also, remember that position 0 is reserved for the null word.</Paragraph> <Paragraph position="17"> Equations 3 and 4 are used iteratively to estimate t(f|e). That is, we start with an initial guess for t(f|e). We then evaluate the right hand side of equation 4, and compute the probability of the connections in the training set. Then we evaluate equation 3, obtain new estimates for the translation probabilities, and repeat the process until it converges. This iterative process is known as the EM algorithm and has been shown to converge to a stationary point (Baum, 1972; Dempster et al., 1977). 
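
The iteration just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `train_model1`, `pairs` and `NULL` are hypothetical, the initial guess is assumed to be uniform, and the two update steps assume the standard Model 1 forms (a connection's probability is t(f|e) normalised over all English positions of the sentence, including the null word, and the new t(f|e) is a ratio of expected connection counts).

```python
from collections import defaultdict

NULL = "NULL"  # hypothetical token standing in for the null word at position 0


def train_model1(pairs, iterations=10):
    """Estimate t(f|e) from sentence-aligned pairs with a Model 1 style EM loop.

    pairs: list of (french_words, english_words) tuples.
    Returns a dict mapping (f, e) to the estimated t(f|e).
    """
    # Prepend the null word to every English sentence.
    pairs = [(list(fs), [NULL] + list(es)) for fs, es in pairs]
    french_vocab = {f for fs, _ in pairs for f in fs}
    # Assumed initial guess: uniform over the French vocabulary.
    t = defaultdict(lambda: 1.0 / len(french_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected connections between f and e
        total = defaultdict(float)  # expected connections between any f and e
        for fs, es in pairs:
            for f in fs:
                # Connection probability (equation 4 in the text): t(f|e_i)
                # normalised over all English positions in the sentence.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / norm
                    count[(f, e)] += p
                    total[e] += p
        # Re-estimation (equation 3 in the text): ratio of expected counts.
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return dict(t)
```

On a toy corpus such as `[(["la", "maison"], ["the", "house"]), (["la", "fleur"], ["the", "flower"])]`, the estimates for each English word sum to one over the French words, and "maison" becomes the preferred translation of "house" after a few iterations.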
Moreover, Brown et al. show that Model 1 has a unique maximum, and therefore, in this special case, the EM algorithm is guaranteed to converge to the maximum likelihood solution, and does not depend on the initial guess.</Paragraph> <Paragraph position="18"> Model 2 Model 2 improves upon Model 1 by making use of the positions within an alignment. For instance, it is much more likely that the first word of an English sentence will be connected to a word near the beginning of the corresponding French sentence, than to some word near the end of the French sentence. Model 2 enhances Model 1 with the assumption that the probability of a connection, $con_{j,i}^{f,e}$, depends also on j and i (the positions in f and e), as well as on m and l (the lengths of the two sentences). This dependence is expressed through the term a(i|j, m, l), which denotes the probability of connecting position j in a French sentence of length m with position i in an English sentence of length l. Since each French position is connected to exactly one English position, the constraint $\sum_{i=0}^{l} a(i|j, m, l) = 1$ should hold for all j, m and l. In place of equation 2, we now have: $\Pr(f, a|e) = C_{f,e} \prod_{j=1}^{m} t(f_j|e_{a_j}) \cdot a(a_j|j, m, l)$ (5)</Paragraph> <Paragraph position="20"> where $C_{f,e}$ is an irrelevant constant.</Paragraph> <Paragraph position="21"> As in Model 1, equation 3 holds for the maximum likelihood estimates of the translation probabilities. The corresponding equation for the maximum likelihood estimates of a(i|j, m, l) is: $a(i|j, m, l) = \frac{\sum_{con \in \mathcal{CON}_{j,i}^{m,l}} \Pr(con)}{\sum_{con \in \mathcal{CON}_{j,\cdot}^{m,l}} \Pr(con)}$ (6)</Paragraph> <Paragraph position="23"> where $\mathcal{CON}_{j,i}^{m,l}$ denotes the set of connections in the training data between positions j and i in French and English sentences of lengths m and l, respectively. Similarly, $\mathcal{CON}_{j,\cdot}^{m,l}$ denotes the set of connections between position j and some English position, in sentences of these lengths.</Paragraph> <Paragraph position="24"> Instead of equation 4, we obtain the following equation for the probability of a connection: $\Pr(con_{j,i}^{f,e}) = \frac{t(f_j|e_i) \cdot a(i|j, m, l)}{\sum_{k=0}^{l} t(f_j|e_k) \cdot a(k|j, m, l)}$ (7) Notice that Model 1 is a special case of Model 2, where a(i|j, m, l) is held fixed at $\frac{1}{l+1}$. As before, the EM algorithm is used to compute maximum likelihood estimates for t(f|e) and a(i|j, m, l) (using first equation 7, and then equations 3 and 6). However, in this case, Model 2 does not have a unique maximum, and therefore the results depend on the initial guesses. Brown et al. therefore use Model 1 to obtain estimates for t(f|e) which do not depend on the initial guesses.</Paragraph> <Paragraph position="25"> These values are then used as the initial guesses of t(f|e) in Model 2.</Paragraph> <Paragraph position="26"> As mentioned in the introduction, we are interested in aligning corpora that are smaller and noisier than the Hansards. This implies severe practical constraints on the word alignment algorithm. As mentioned earlier, we chose to start with the output of char_align because it is more robust than alternative sentence-based methods. This choice, of course, requires certain modifications to the model of Brown et al. to accommodate as input an initial rough alignment (such as produced by char_align) instead of pairs of aligned sentences. It is also useful to reduce the number of parameters that we are trying to estimate, because we have much less data and much more noise. The paragraphs below describe our modifications, which are intended to meet these somewhat different requirements. 
The two major modifications are: (a) replacing the sentence-by-sentence alignment with a single global alignment for the entire corpus, and (b) replacing the set of probabilities a(i|j, m, l) with a small set of offset probabilities.</Paragraph> <Paragraph position="27"> Word_align starts with an initial rough alignment, I, which maps French positions to English positions (if the mapping is partial, we use linear extrapolation to make it complete). Our goal is to find a global alignment, A, which is more accurate than I. To achieve this goal, we first use I to determine which connections will be considered for A. Let $con_{j,i}$ denote a connection between position j in the French corpus and position i in the English corpus (the superscripts in $con_{j,i}^{f,e}$ are omitted, as there is no notion of sentences). We assume that $con_{j,i}$ is a possible connection only if i falls within a limited window which is centered around I(j), such that: I(j) - w &lt; i &lt; I(j) + w (8) where w is a predetermined parameter specifying the size of the window (we typically set w to 20 words). Connections that fall outside this window are assumed to have a zero probability. This assumption replaces the assumption of Brown et al. that connections which cross boundaries of aligned sentences have a zero probability.</Paragraph> <Paragraph position="28"> In this new framework, equation 3 becomes: $t(f|e) = \frac{\sum_{con \in \mathcal{CON}_{f,e}} \Pr(con)}{\sum_{con \in \mathcal{CON}_{\cdot,e}} \Pr(con)}$ (9)</Paragraph> <Paragraph position="30"> where $\mathcal{CON}_{f,e}$ and $\mathcal{CON}_{\cdot,e}$ are taken from the set of possible connections, as defined by (8).</Paragraph> <Paragraph position="31"> Turning to Model 2, the parameters of the form a(i|j, m, l) are somewhat more problematic. First, since there are no sentence boundaries, there are no direct equivalents for i, j, m and l. Secondly, there are too many parameters to be estimated, given the limited size of our corpora (one parameter for each combination of i, j, m and l). Fortunately, these parameters are highly redundant. 
For example, it is likely that a(i|j, m, l) will be very close to a(i+1|j+1, m, l) and a(i|j, m+1, l+1).</Paragraph> <Paragraph position="32"> In order to deal with these concerns, we replace probabilities of the form a(i|j, m, l) with a small set of offset probabilities. We use k to denote the offset between i, an English position which corresponds to the French position j, and the English position which the input alignment I connects to j: k = i - I(j). An offset probability, o(k), is the probability of having an offset k for some arbitrary connection. According to (8), k ranges between -w and w. Thus, instead of equation 6, we have $o(k) = \frac{\sum_{con \in \mathcal{CON}_k} \Pr(con)}{\sum_{con \in \mathcal{CON}} \Pr(con)}$ (10)</Paragraph> <Paragraph position="34"> where $\mathcal{CON}$ is the set of all connections and $\mathcal{CON}_k$ is the set of all connections with offset k. Instead of equation 7, we have $\Pr(con_{j,i}) = \frac{t(f_j|e_i) \cdot o(i - I(j))}{\sum_{h} t(f_j|e_h) \cdot o(h - I(j))}$ (11), where h ranges over the English positions allowed by the window (8).</Paragraph> <Paragraph position="36"> The last three equations are used in the EM algorithm in an iterative fashion as before to estimate the translation probabilities and the offset probabilities. Table 1 and Figure 2 show some values that were estimated in this way. The input consisted of a pair of Microsoft Windows manuals: one in English (125,000 words) and its equivalent in French (143,000 words). Table 1 shows four French words and the four most likely translations, sorted by t(e|f) [1]. Note that the correct translation(s) are usually near the front of the list, though there is a tendency for the program to be confused by collocates such as &quot;information about&quot;. Figure 2 shows the probability estimates for offsets from the initial alignment I. Note that smaller offsets are more likely than larger ones, as we would expect. Moreover, the distribution is reasonably close to normal, as indicated by the dotted line, which was generated by a Gaussian with a mean of 0 and standard deviation of 10 [2].</Paragraph> <Paragraph position="37"> We have found it useful to make use of three filters to deal with robustness issues. 
Empirically, we found that both high frequency and low frequency words caused difficulties and therefore connections involving these words are filtered out. The thresholds are set to exclude the most frequent function words and punctuation, as well as words with fewer than 3 occurrences. In addition, following a similar filter by Brown et al., small values of t(f|e) are set to 0 after each iteration of the EM algorithm because these small values often correspond to inappropriate translations. Finally, connections to null are ignored. Such connections model French words that are often omitted in the English translation.</Paragraph> <Paragraph position="38"> However, because of OCR errors and other sources of noise, it was decided that this phenomenon was too difficult to model.</Paragraph> <Paragraph position="39"> Some words will not be aligned because of these heuristics. It may not be necessary, however, to align all words in order to meet the goal of helping translators (and lexicographers) with difficult terminology.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 2.2 Finding the most probable alignment </SectionTitle> <Paragraph position="0"> The EM algorithm produces two sets of maximum likelihood probability estimates: translation probabilities, t(f|e), and offset probabilities, o(k).</Paragraph> <Paragraph position="1"> Brown et al. select their preferred alignment simply by choosing the most probable alignment according to the maximum likelihood probabilities, relative to the given sentence alignment.</Paragraph> <Paragraph position="2"> [1] In this example, French is used as the source language and English as the target. [2] The center of the estimated distribution seems flatter than in a normal distribution. This might be explained by a higher tendency for local changes of word order within phrases than for order changes among phrases. This is merely a hypothesis, though, which requires further testing.</Paragraph> <Paragraph position="3"> In the terms of our model, it is necessary to select the alignment A that maximizes: $\prod_{con_{j,i} \in A} t(f_j|e_i) \cdot o(i - I(j))$ (12)</Paragraph> <Paragraph position="5"> Unfortunately, this method does not model the dependence between connections for French words that are near one another. For example, the fact that the French position j was connected to the English position i will not increase the probability that j + 1 will be connected to an English position near i. The absence of such dependence can easily confuse the program, mainly in aligning adjacent occurrences of the same word, which are common in technical texts. Brown et al. introduce such dependence in their Model 4. We have selected a simpler alternative defined in terms of offset probabilities. 2.2.1 Determining the set of relevant connections The first step in finding the most probable alignment is to determine the relevant connections for each French position. Relevant connections are required to be reasonably likely, that is, their translation probability, t(f|e), should exceed some minimal threshold. Moreover, they are required to fall within a window between I(j) - w and I(j) + w in the English corpus, as in the previous step (parameter estimation). We call a French position relevant if it has at least one relevant connection. Each alignment A then consists of exactly one connection for each relevant French position (the irrelevant positions are ignored).</Paragraph> </Section> <Section position="3" start_page="3" end_page="5" type="sub_section"> <SectionTitle> 2.2.2 Determining the most probable alignment </SectionTitle> <Paragraph position="0"> To model the dependency between connections in an alignment, we assume that the offset of a connection is determined relative to the preceding connection in A, instead of relative to the initial alignment, I. 
For this purpose, we define A'(j) as a linear extrapolation from the preceding connection in A: $A'(j) = A(j_{prev}) + (j - j_{prev}) \cdot \frac{N_E}{N_F}$ (13)</Paragraph> <Paragraph position="2"> where $j_{prev}$ is the last French position before j which is aligned by A, and $N_E$ and $N_F$ are the lengths of the English and French corpora. A'(j) thus predicts the connection of j, knowing the connection of $j_{prev}$ and assuming that the two languages have the same word order. Instead of (12), the most probable alignment maximizes $\prod_{con_{j,i} \in A} t(f_j|e_i) \cdot o(i - A'(j))$ (14)</Paragraph> <Paragraph position="4"> We approximate the offset probabilities, o(k), relative to A', using the maximum likelihood estimates which were computed relative to I (as described in Section 2.1.2).</Paragraph> <Paragraph position="5"> We use a dynamic programming algorithm to find the most probable alignment. This enables us to know the value $A(j_{prev})$ when dealing with position j. To avoid connections with very low probability (due to a large offset) we require that $t(f_j|e_i) \cdot o(i - A'(j))$ exceeds a pre-specified threshold T [3]. If the threshold is not exceeded, the connection is dropped from the alignment, and $t(f_j|e_i) \cdot o(i - A'(j))$ for that connection is set to T when computing (14). T can therefore be interpreted as a global setting of the probability that a random position will be connected to the null English word [4]. A similar dynamic programming approach was used by Gale and Church for word alignment (Gale and Church, 1991a), to handle dependency between connections.</Paragraph> <Paragraph position="6"> [3] In fact, the threshold on $t(f_j|e_i)$, which is used to determine the relevant connections (described in the previous subsection), is used just as an efficient early application of the threshold T. This early application is possible when $t(f_j|e_i) \cdot o(k_{max}) &lt; T$, where $k_{max}$ is the value of k with maximal o(k).</Paragraph> </Section> </Section> </Paper>
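
The dynamic programming search of Section 2.2.2 can be illustrated with a small Python sketch. It is written under stated assumptions and is not the authors' program: the function name and all parameter names are hypothetical, the state of the search is taken to be the last connection that was kept, the offset i - A'(j) is rounded to the nearest integer before looking up o(k), and the prediction for a position with no preceding kept connection is supplied by a caller-provided function `initial` (for example the rough alignment I). Scores are accumulated in log space.

```python
import math


def most_probable_alignment(candidates, t, o, initial, ratio, threshold):
    """Sketch of a search for the alignment with the highest product of scores.

    candidates: list of (j, [i, ...]) for relevant French positions, sorted by j.
    t: function (j, i) -> translation probability of the words at j and i.
    o: dict mapping an integer offset k to its offset probability o(k).
    initial: function j -> predicted English position when nothing precedes j.
    ratio: length of the English corpus divided by that of the French corpus.
    threshold: T, the minimal score of a kept connection (must be positive).
    Returns a dict j -> i containing only the connections that were kept.
    """
    log_threshold = math.log(threshold)
    # Key: last kept connection (j_prev, i_prev), or None if nothing kept yet.
    # Value: (log score so far, tuple of kept connections).
    states = {None: (0.0, ())}
    for j, options in candidates:
        new_states = {}

        def offer(state, score, path):
            # Keep only the best-scoring path that ends in each state.
            if state not in new_states or score > new_states[state][0]:
                new_states[state] = (score, path)

        for state, (score, path) in states.items():
            # Dropping position j contributes T and leaves the state unchanged.
            offer(state, score + log_threshold, path)
            if state is None:
                predicted = initial(j)
            else:
                j_prev, i_prev = state
                # Linear extrapolation from the preceding kept connection.
                predicted = i_prev + (j - j_prev) * ratio
            for i in options:
                p = t(j, i) * o.get(round(i - predicted), 0.0)
                if p > threshold:
                    offer((j, i), score + math.log(p), path + ((j, i),))
        states = new_states
    best_score, best_path = max(states.values(), key=lambda v: v[0])
    return dict(best_path)
```

For two adjacent occurrences of the same French word at positions 0 and 1, each with English candidates 10 and 11 and equal translation probabilities, the extrapolated prediction makes the pairing 0 with 10 and 1 with 11 score higher than connecting both to position 10. The number of states grows with the number of kept connections, so a practical version would need pruning; that is outside the scope of this sketch.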