3 The Basic Word-to-Word Model

Our translation model consists of the hidden parameters λ+ and λ-, and likelihood ratios L(u,v). The two hidden parameters are the probabilities of the model generating true and false positives in the data. L(u,v) represents the likelihood that u and v can be mutual translations. For each co-occurring pair of word types u and v, these likelihoods are initially set proportional to their co-occurrence frequency n(u,v) and inversely proportional to their marginal frequencies n(u) and n(v) [1], following (Dunning, 1993) [2]. When the L(u,v) are re-estimated, the model's hidden parameters come into play.

[1] The co-occurrence frequency of a word type pair is simply the number of times the pair co-occurs in the corpus. However, n(u) = Σ_v n(u,v), which is not the same as the frequency of u, because each token of u can co-occur with several different v's.

[2] We could just as easily use other symmetric "association" measures, such as φ² (Gale & Church, 1991) or the Dice coefficient (Smadja, 1992).

After initialization, the model induction algorithm iterates:

1. Find a set of "links" among word tokens in the bitext, using the likelihood ratios and the competitive linking algorithm.

2. Use the links to re-estimate λ+, λ-, and the likelihood ratios.

3. Repeat from Step 1 until the model converges to the desired degree.

The competitive linking algorithm and its one-to-one assumption are detailed in Section 3.1. Section 3.2 explains how to re-estimate the model parameters.
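To make the iteration concrete, here is a minimal Python sketch of the induction loop. It is an illustration under stated assumptions, not the authors' implementation: the callables link_fn and reestimate_fn, the convergence test, and all names are hypothetical stand-ins for the steps detailed in Sections 3.1 and 3.2.

def induce_model(bitext, init_scores, link_fn, reestimate_fn,
                 max_iterations=20, tol=1e-4):
    """Sketch of the induction loop. link_fn implements Step 1
    (competitive linking); reestimate_fn implements Step 2 and returns
    (lambda_plus, lambda_minus, new_scores, log_likelihood)."""
    scores = dict(init_scores)
    lam_plus = lam_minus = None
    prev_ll = float("-inf")
    for _ in range(max_iterations):
        links = link_fn(bitext, scores)                         # Step 1
        lam_plus, lam_minus, scores, ll = reestimate_fn(bitext, links)  # Step 2
        if abs(ll - prev_ll) < tol:                             # Step 3: converged
            break
        prev_ll = ll
    return scores, lam_plus, lam_minus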
3.1 Competitive Linking Algorithm

The competitive linking algorithm is designed to overcome the problem of indirect associations, illustrated in Figure 1. The sequences of u's and v's represent corresponding regions of a bitext. If u_k and v_k co-occur much more often than expected by chance, then any reasonable model will deem them likely to be mutual translations. If u_k and v_k are indeed mutual translations, then their tendency to co-occur is called a direct association. Now, suppose that u_k and u_{k+1} often co-occur within their language. Then v_k and u_{k+1} will also co-occur more often than expected by chance. The arrow connecting v_k and u_{k+1} in Figure 1 represents an indirect association, since the association between v_k and u_{k+1} arises only by virtue of the association between each of them and u_k. Models of translational equivalence that are ignorant of indirect associations have "a tendency ... to be confused by collocates" (Dagan et al., 1993).

[Figure 1: The direct association between u_k and v_k, and the direct association between u_k and u_{k+1}, give rise to an indirect association between v_k and u_{k+1}.]

Fortunately, indirect associations are usually not difficult to identify, because they tend to be weaker than the direct associations on which they are based (Melamed, 1996c). The majority of indirect associations can be filtered out by a simple competition heuristic: whenever several word tokens u_i in one half of the bitext co-occur with a particular word token v in the other half of the bitext, the word that is most likely to be v's translation is the one for which the likelihood L(u,v) of translational equivalence is highest. The competitive linking algorithm implements this heuristic (a code sketch appears at the end of this section):

1. Discard all likelihood scores for word types deemed unlikely to be mutual translations, i.e. all L(u,v) < 1. This step significantly reduces the computational burden of the algorithm. It is analogous to the step in other translation model induction algorithms that sets all probabilities below a certain threshold to negligible values (Brown et al., 1990; Dagan et al., 1993; Chen, 1996). To retain only word type pairs that are at least twice as likely to be mutual translations as not, the threshold can be raised to 2. Conversely, the threshold can be lowered to buy more coverage at the cost of a larger model that will converge more slowly.

2. Sort all remaining likelihood estimates L(u,v) from highest to lowest.

3. Find u and v such that the likelihood ratio L(u,v) is highest. Token pairs of these types would be the winners in any competitions involving u or v.

4. Link all token pairs (u,v) in the bitext.

5. The one-to-one assumption means that linked words cannot be linked again. Therefore, remove all linked word tokens from their respective texts.

6. If there is another co-occurring word token pair (u,v) such that L(u,v) exists, then repeat from Step 3.

The competitive linking algorithm is more greedy than algorithms that try to find a set of link types that are jointly most probable over some segment of the bitext. In practice, our linking algorithm can be implemented so that its worst-case running time is O(lm), where l and m are the lengths of the aligned segments.

The simplicity of the competitive linking algorithm depends on the one-to-one assumption: each word translates to at most one other word. Certainly, there are cases where this assumption is false. We prefer not to model those cases, in order to achieve higher accuracy with less effort on the cases where the assumption is true.
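The following Python sketch is one straightforward realization of the six steps above under the one-to-one assumption. The token representation, data structures, and default threshold are illustrative assumptions, not the paper's implementation.

def competitive_linking(token_pairs, L, threshold=1.0):
    """One greedy pass of competitive linking over an aligned segment pair.

    token_pairs: co-occurring token pairs ((word, pos), (word, pos)),
    where pos makes each token unique; L: dict mapping word-type pairs
    (u, v) to likelihood ratios L(u, v)."""
    # Step 1: discard pairs whose types score below the threshold.
    candidates = [(u, v) for (u, v) in token_pairs
                  if L.get((u[0], v[0]), 0.0) >= threshold]
    # Step 2: sort the survivors from highest to lowest L(u, v).
    candidates.sort(key=lambda uv: L[(uv[0][0], uv[1][0])], reverse=True)
    links, used_u, used_v = [], set(), set()
    # Steps 3-6: link the best remaining pair; under the one-to-one
    # assumption, linked tokens drop out of all later competitions.
    for u, v in candidates:
        if u not in used_u and v not in used_v:
            links.append((u, v))
            used_u.add(u)
            used_v.add(v)
    return links

The sort makes this sketch O(c log c) in the number of candidate pairs; the O(lm) bound cited above requires a more careful implementation.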
3.2 Parameter Estimation

The purpose of the competitive linking algorithm is to help us re-estimate the model parameters. The variables that we use in our estimation are summarized in Figure 2. The linking algorithm produces a set of links between word tokens in the bitext. We define a link token to be an ordered pair of word tokens, one from each half of the bitext. A link type is an ordered pair of word types. Let n(u,v) be the co-occurrence frequency of u and v, and k(u,v) be the number of links between tokens of u and v [3].

[Figure 2: Variables used in parameter estimation.
  n(u,v) = frequency of co-occurrence between word types u and v
  N = Σ_{(u,v)} n(u,v) = total number of co-occurrences in the bitext
  k(u,v) = frequency of links between word types u and v
  K = Σ_{(u,v)} k(u,v) = total number of links in the bitext]

[3] Note that k(u,v) depends on the linking algorithm, but n(u,v) is a constant property of the bitext.

An important property of the competitive linking algorithm is that the ratio k(u,v)/n(u,v) tends to be very high if u and v are mutual translations, and quite low if they are not. The bimodality of this ratio for several values of n(u,v) is illustrated in Figure 3. This figure was plotted after the model's first iteration over 300,000 aligned sentence pairs from the Canadian Hansard bitext. Note that the frequencies are plotted on a log scale; the bimodality is quite sharp.

[Figure 3: Frequency distribution of k(u,v)/n(u,v) for several values of n(u,v), plotted on a log scale.]

The linking algorithm creates all the links of a given type independently of each other, so the number k(u,v) of links connecting word types u and v has a binomial distribution with parameters n(u,v) and p(u,v). If u and v are mutual translations, then p(u,v) tends to a relatively high probability, which we will call λ+. If u and v are not mutual translations, then p(u,v) tends to a very low probability, which we will call λ-. λ+ and λ- correspond to the two peaks in the frequency distribution of k(u,v)/n(u,v) in Figure 3. The two parameters can also be interpreted as the rates of true and false positives. If the translation in the bitext is consistent and the model is accurate, then λ+ should be near 1 and λ- should be near 0.

To find the most probable values of the hidden model parameters λ+ and λ-, we adopt the standard method of maximum likelihood estimation, and find the values that maximize the probability of the link frequency distributions. The one-to-one assumption implies independence between different link types, so that

    Pr(links | λ+, λ-) = Π_{u,v} Pr(k_(u,v) | n_(u,v), λ+, λ-).    (1)

The factors on the right-hand side of Equation (1) can be written explicitly with the help of a mixture coefficient. Let τ be the probability that an arbitrary co-occurring pair of word types are mutual translations. Let B(k | n, p) denote the probability that k links are observed out of n co-occurrences, where k has a binomial distribution with parameters n and p. Then the probability that u and v are linked k(u,v) times out of n(u,v) co-occurrences is a mixture of two binomials:

    Pr(k_(u,v) | n_(u,v), λ+, λ-) = τ B(k_(u,v) | n_(u,v), λ+) + (1 - τ) B(k_(u,v) | n_(u,v), λ-).    (2)

One more variable allows us to express τ in terms of λ+ and λ-: let λ be the probability that an arbitrary co-occurring pair of word tokens will be linked, regardless of whether they are mutual translations. Since τ is constant over all word types, it also represents the probability that an arbitrary co-occurring pair of word tokens are mutual translations. Therefore,

    λ = τ λ+ + (1 - τ) λ-.    (3)
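Equations (1)-(3) translate directly into code. The sketch below evaluates the log of Equation (1) with SciPy's binomial PMF; the counts dictionary, function names, and underflow floor are illustrative assumptions.

import math

from scipy.stats import binom

def log_likelihood(counts, lam_plus, lam_minus, tau):
    """log Pr(links | lambda+, lambda-) per Equations (1) and (2).

    counts: dict mapping word-type pairs (u, v) to (k_uv, n_uv),
    i.e. link and co-occurrence frequencies."""
    total = 0.0
    for k, n in counts.values():
        # Equation (2): mixture of two binomials.
        p = (tau * binom.pmf(k, n, lam_plus)
             + (1.0 - tau) * binom.pmf(k, n, lam_minus))
        total += math.log(max(p, 1e-300))  # floor guards against underflow
    return total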
λ can also be estimated empirically. Let K be the total number of links in the bitext and let N be the total number of co-occurring word token pairs:

    λ = K / N.    (4)

Equating the right-hand sides of Equations (3) and (4) and rearranging the terms, we get

    τ = (K/N - λ-) / (λ+ - λ-).    (5)

Since τ is now a function of λ+ and λ-, only the latter two variables represent degrees of freedom in the model.

The probability function expressed by Equations (1) and (2) has many local maxima. In practice, these local maxima are like pebbles on a mountain, invisible at low resolution. We computed Equation (1) over various combinations of λ+ and λ- after the model's first iteration over 300,000 aligned sentence pairs from the Canadian Hansard bitext. Figure 4 shows that the region of interest in the parameter space, where 1 > λ+ > λ > λ- > 0, has only one clearly visible global maximum. This global maximum can be found by standard hill-climbing methods, as long as the step size is large enough to avoid getting stuck on the pebbles.

[Figure 4: Pr(links | λ+, λ-) over the region of interest; there is only one clearly visible global maximum.]

Given estimates for λ+ and λ-, we can compute B(k_(u,v) | n_(u,v), λ+) and B(k_(u,v) | n_(u,v), λ-). These are the probabilities that k(u,v) links were generated by an algorithm that generates correct links and by an algorithm that generates incorrect links, respectively, out of n(u,v) co-occurrences. The ratio of these probabilities is the likelihood ratio in favor of u and v being mutual translations, for all u and v:

    L(u,v) = B(k_(u,v) | n_(u,v), λ+) / B(k_(u,v) | n_(u,v), λ-).    (6)

The basic model exploits only the link frequencies generated by the competitive linking algorithm. More accurate models can be induced by taking into account various features of the linked tokens. For example, frequent words are translated less consistently than rare words (Melamed, 1997). To account for this difference, we can estimate separate values of λ+ and λ- for different ranges of n(u,v). Similarly, the hidden parameters can be conditioned on the linked parts of speech. Word order can be taken into account by conditioning the hidden parameters on the relative positions of linked word tokens in their respective sentences. Just as easily, we can model links that coincide with entries in a pre-existing translation lexicon separately from those that do not. This method of incorporating dictionary information seems simpler than the method proposed by Brown et al. for their models (Brown et al., 1993b). When the hidden parameters are conditioned on different link classes, the estimation method does not change; it is just repeated for each link class, as in the sketch below.
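As one possible reading of this estimation procedure, the sketch below searches for (λ+, λ-) with a coarse grid over the region of interest (a stand-in for hill climbing with a large step size), derives τ via Equation (5), scores candidates with the log_likelihood function from the previous sketch, computes Equation (6) ratios, and repeats the estimation per link class. The grid step and frequency-class boundaries are invented for illustration.

from scipy.stats import binom

def estimate_lambdas(counts, grid_step=0.05):
    """Coarse grid search for (lambda+, lambda-) maximizing Equation (1);
    the large step plays the role of the large hill-climbing step size."""
    K = sum(k for k, n in counts.values())   # total links
    N = sum(n for k, n in counts.values())   # total co-occurrences
    lam = K / N                              # Equation (4)
    grid = [i * grid_step for i in range(1, int(round(1 / grid_step)))]
    best, best_ll = None, float("-inf")
    for lp in grid:
        for lm in grid:
            if not (lm < lam < lp):          # region of interest only
                continue
            tau = (lam - lm) / (lp - lm)     # Equation (5)
            ll = log_likelihood(counts, lp, lm, tau)  # Equation (1)
            if ll > best_ll:
                best, best_ll = (lp, lm), ll
    return best

def likelihood_ratios(counts, lam_plus, lam_minus):
    """Equation (6): L(u,v) = B(k|n, lambda+) / B(k|n, lambda-)."""
    return {uv: binom.pmf(k, n, lam_plus)
                / max(binom.pmf(k, n, lam_minus), 1e-300)
            for uv, (k, n) in counts.items()}

def estimate_by_link_class(counts, boundaries=(10, 100)):
    """Condition the hidden parameters on ranges of n(u,v), as suggested
    above; the class boundaries here are hypothetical."""
    classes = {}
    for uv, (k, n) in counts.items():
        cls = sum(n > b for b in boundaries)  # 0 = rare ... 2 = frequent
        classes.setdefault(cls, {})[uv] = (k, n)
    return {cls: estimate_lambdas(c) for cls, c in classes.items()}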