File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/91/h91-1025_metho.xml
Size: 19,044 bytes
Last Modified: 2025-10-06 14:12:42
<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1025"> <Title>A Statistical Approach to Sense Disambiguation in Machine Translation</Title> <Section position="3" start_page="0" end_page="146" type="metho"> <SectionTitle> STATISTICAL TRANSLATION </SectionTitle> <Paragraph position="0"> As described by Brown, et al. \[\]\], in the sta.tistica.1 a.l)proa.ch to transla, tion, one chooses for tile tra,nsla,tion of a. French sentence .F, tha.t English sentence E which ha.s the greatest l)robability, Pr(EIF), a.ccordi,g to a, model of th.e tra, ns\]ation process. By Ba.yes' r,,le, Pr(EI ~') = Pr(E) Pr(FIE )/Pr(.F). Since the (lenomina.tor does not del)end on E, the sentence for which Pr(EIF ) is grea, test is also the sentence for which the product Pr(E) Pr(FIE ) is grea~test. The first term in this product is a~ sta, tisticM cha.ra.cterization of the, English \]a.nguage a, nd the second term is a, statistical cha.ra.cteriza,timt of the process by which English sentences are tra.nslated into French. We can compute neither of these probabilities precisely. Rather, in statistical tra.nslat, iou, we employ a. language model P,,od~l(E) which 1)rovide, s a,n estima.te of Pr (E) and a, lrav, slatiov, model which provides a,n estimate of t'r ( Vl/~:).</Paragraph> <Paragraph position="1"> The performance of the system depends on the extent to which these statistical models approximate the actual probabilities. A useful gauge of this is tile cross entropy 1</Paragraph> <Paragraph position="3"> which measures the average uncertainty that the model has about the English translation E of a French sentence F. A better model has less uncertainty and thus a lower cross entropy.</Paragraph> <Paragraph position="4"> A shortcoming of the architecture described above is that it requires the statistical models to deal directly with English and French sentences. Clearly the probability distributions Pr(E) and Pr(FIE ) over sentences are immensely complicated. On the other hand, in practice the statistical models must be relatively simple in order that their parameters can be reliably estimated from a manageable amount of training data. This usually means that they are restricted to the modeling of local linguistic phenonrena. As a.</Paragraph> <Paragraph position="5"> result, the estimates Pmodcz(E) and Pmodd(F I E) will be inaccurate.</Paragraph> <Paragraph position="6"> This difficulty can be addressed by integrating statistical models into the traditional machine translation architecture of analysis-transfer-synthesis. The resulting system employs 1. An analysis component which encodes a French sentence F into an intermediate structure F< 2. A statistical transfer component which translates F t a corresponding intermediate English structure E'. This component incorporates a language model, a translation model, and a decoder as before, but here these components deal with the intermediate structures rather than the sentences directly.</Paragraph> <Paragraph position="7"> 3. A synthesis component which reconstructs an English sentence E from E t.</Paragraph> <Paragraph position="8"> For statistical modeling we require that the synthesis transformation E ~ ~ E be invertible. Typically, analysis and synthesis will involve a sequence of successive transformations in which F p is incrementally tin this equation and in the remainder of the paper, we use bold face letters (e.g. E) for random variables and roman letters (e.g. 
<Paragraph position="10"> The purpose of analysis and synthesis is to facilitate the task of statistical transfer. This will be the case if the probability distribution Pr(E', F') is easier to model than the original distribution Pr(E, F). In practice this means that E' and F' should encode global linguistic facts about E and F in a local form. The utility of the analysis and synthesis transformations can be measured in terms of cross entropy. Thus transformations F → F' and E' → E are useful if we can construct models P'_model(F' | E') and P'_model(E') such that H(E' | F') < H(E | F).</Paragraph>
</Section>
<Section position="4" start_page="146" end_page="147" type="metho">
<SectionTitle> SENSE DISAMBIGUATION </SectionTitle>
<Paragraph position="0"> In this paper we present a statistical method for automatically constructing analysis and synthesis transformations which perform cross-lingual word-sense labeling. The goal of such transformations is to label the words of a French sentence so as to elucidate their English translations, and, conversely, to label the words of an English sentence so as to elucidate their French translations. For example, in some contexts the French verb prendre translates as to take, but in other contexts it translates as to make. A sense disambiguation transformation, by examining the contexts, might label occurrences of prendre that likely mean to take with one label, and other occurrences of prendre with another label. Then the uncertainty in the translation of prendre given the label would be less than the uncertainty in the translation of prendre without the label. Although the label does not provide any information that is not already present in the context, it encodes this information locally. Thus a local statistical model for the transfer of labeled sentences should be more accurate than one for the transfer of unlabeled ones.</Paragraph>
<Paragraph position="2"> While the translation of a word depends on many words in its context, we can often obtain information by looking at only a single word. For example, in the sentence Je vais prendre ma propre décision (I will make my own decision), the verb prendre should be translated as make because its object is décision. If we replace décision by voiture, then prendre should be translated as take: Je vais prendre ma propre voiture (I will take my own car). Thus we can reduce the uncertainty in the translation of prendre by asking a question about its object, which is often the first noun to its right, and we might assign a sense to prendre based upon the answer to this question.</Paragraph>
<Paragraph position="3"> In Il doute que les nôtres gagnent (He doubts that we will win), the word il should be translated as he. On the other hand, if we replace doute by faut, then il should be translated as it: Il faut que les nôtres gagnent (It is necessary that we win). Here, we might assign a sense label to il by asking about the identity of the first verb to its right.</Paragraph>
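The informant idea in these two examples can be sketched in a few lines of code. The noun list, the informant-extraction rule, and the sense names below are illustrative assumptions, not the questions actually learned by the method described later.

```python
# A toy informant-based sense labeler for prendre. NOUNS and the answer set
# {"decision"} are hypothetical; a real system would learn the question.
NOUNS = {"decision", "voiture", "photo"}

def first_noun_to_right(words, i):
    # Informant site "first noun to the right" (a crude stand-in for a tagger).
    for w in words[i + 1:]:
        if w in NOUNS:
            return w
    return None

def sense_of_prendre(sentence):
    words = sentence.lower().split()
    informant = first_noun_to_right(words, words.index("prendre"))
    # A binary question about the informant's value determines the sense label.
    return "prendre_1" if informant == "decision" else "prendre_2"

print(sense_of_prendre("Je vais prendre ma propre decision"))  # prendre_1 (to make)
print(sense_of_prendre("Je vais prendre ma propre voiture"))   # prendre_2 (to take)
```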
<Paragraph position="5"> These examples motivate a sense-labeling scheme in which the label of a word is determined by a question about an informant word in its context. In the first example, the informant of prendre is the first noun to the right; in the second example, the informant of il is the first verb to the right. If we want to assign n senses to a word, then we can consider a question with n answers.</Paragraph>
<Paragraph position="6"> We can fit this scheme into the framework of the previous section as follows:
The intermediate structures. The intermediate structures E' and F' consist of sequences of words labeled by their senses. Thus F' is a sentence over the expanded vocabulary whose 'words' f' are pairs (f, l), where f is a word in the original French vocabulary and l is its sense label. Similarly, E' is a sentence over the expanded vocabulary whose words e' are pairs (e, l), where e is an English word and l is its sense label.</Paragraph>
<Paragraph position="8"> The analysis and synthesis transformations. For each French word and each English word we choose an informant site, such as first noun to the left, and an n-ary question about the value of the informant at that site. The analysis transformation F → F' and the inverse synthesis transformation E → E' map a sentence to the intermediate structure in which each word is labeled by a sense determined by the question about its informant. The synthesis transformation E' → E maps a labeled sentence to a sentence in which the labels have been removed.</Paragraph>
<Paragraph position="9"> The probability models. We use the translation model discussed in [1] both for P_model(F' | E') and for P_model(F | E), and we use a trigram language model [1] for P_model(E) and P_model(E'). In order to construct these transformations we need to choose for each English and French word an informant and a question. As suggested in the previous section, a criterion for doing this is that of minimizing the cross entropy H(E' | F'). In the remainder of the paper we present an algorithm for doing this.</Paragraph>
</Section>
<Section position="5" start_page="147" end_page="148" type="metho">
<SectionTitle> THE TRANSLATION MODEL </SectionTitle>
<Paragraph position="0"> We begin by reviewing our statistical model for the translation of a sentence from one language to another [1]. In a statistical French-to-English translation system, we need to model transformations from English sentences E to French sentences F, or from intermediate English structures E' to intermediate French structures F'. However, it is clarifying to consider transformations from an arbitrary source language to an arbitrary target language.</Paragraph>
<Paragraph position="2"> Review of the Model
The purpose of a translation model is to compute the probability P_model(T | S) of transforming a source sentence S into a target sentence T. For our simple model, we assume that each word of S independently produces zero or more words from the target vocabulary and that these words are then ordered to produce T. We use the term alignment to refer to an association between words in T and words in S. The probability of T given S is a sum over alignments,

    P_model(T | S) = \sum_A P_model(T, A | S),                                          (2)

    P_model(T, A | S) = P_distortion(T | S, A) \prod_{s in S} p(n_A(s) | s) \prod_{t in T} p(t | s_A(t)).   (3)

Here s_A(t) is the word of S aligned with t in the alignment A, and n_A(s) is the number of words of T aligned with s in A. The distortion model P_distortion describes the ordering of the words of T; we will not give it explicitly. The parameters in (3) are
1. the probabilities p(n | s) that a word s in the source language generates n target words;
2. the probabilities p(t | s) that s generates the word t;
3. the parameters of the distortion model.</Paragraph>
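The following sketch evaluates the alignment probability of formula (3) for one toy sentence pair, with the distortion factor ignored. The fertility table p_fert (standing in for p(n | s)), the translation table p_trans (standing in for p(t | s)), and all values are invented for illustration.

```python
import math
from collections import Counter

# Toy parameter tables; a real system estimates these from bilingual data.
p_fert = {("prendre", 1): 0.8, ("ma", 1): 0.9, ("decision", 1): 0.9}
p_trans = {("prendre", "make"): 0.3, ("ma", "my"): 0.9, ("decision", "decision"): 0.7}

def log_p_alignment(source, target, alignment):
    """alignment[j] = index i of the source word that generates target word j."""
    logp = 0.0
    for j, t in enumerate(target):
        s = source[alignment[j]]
        logp += math.log(p_trans.get((s, t), 1e-9))            # product of p(t | s_A(t))
    fertility = Counter(alignment)                              # n_A(s) for each source position
    for i, s in enumerate(source):
        logp += math.log(p_fert.get((s, fertility[i]), 1e-9))   # product of p(n_A(s) | s)
    return logp

source = ["prendre", "ma", "decision"]
target = ["make", "my", "decision"]
print(log_p_alignment(source, target, alignment=[0, 1, 2]))
```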
<Paragraph position="5"> We determine values for these parameters using maximum likelihood training. Thus we collect a large bilingual corpus consisting of pairs of sentences (S, T) which are translations of one another, and we seek parameter values that maximize the likelihood of this training data as computed by the model. This is equivalent to minimizing the cross entropy

    H(T | S) = - \sum_{S,T} P_train(S, T) \log P_model(T | S),

where P_train(S, T) is the empirical distribution obtained by counting the number of times that the pair (S, T) occurs in the training corpus.</Paragraph>
<Paragraph position="8"> The Viterbi Approximation
The sum over alignments in (2) is too expensive to compute directly since the number of alignments increases exponentially with sentence length. It is useful to approximate this sum by the single term corresponding to the alignment, A(S, T), with greatest probability. We refer to this approximation as the Viterbi approximation and to A(S, T) as the Viterbi alignment.</Paragraph>
<Paragraph position="9"> Let c(s, t) be the expected number of times that s is aligned with t in the Viterbi alignment of a pair of sentences drawn at random from the training data, and let c(s, n) be the expected number of times that s is aligned with n words. Then

    c(s, t) = \sum_{S,T} P_train(S, T) c(s, t | A(S, T)),
    c(s, n) = \sum_{S,T} P_train(S, T) c(s, n | A(S, T)),

where c(s, t | A) is the number of times that s is aligned with t in the alignment A, and c(s, n | A) is the number of times that s generates n target words in A. It can be shown [2] that these counts are also averages with respect to the model,

    c(s, t) ≈ \sum_{S,T} P_train(S, T) \sum_A P_model(A | S, T) c(s, t | A),
    c(s, n) ≈ \sum_{S,T} P_train(S, T) \sum_A P_model(A | S, T) c(s, n | A).

By normalizing the counts c(s, t) and c(s, n) we obtain probability distributions p(s, t) and p(s, n):²

    p(s, t) = (1/norm) c(s, t),        p(s, n) = (1/norm) c(s, n).                      (7)

The conditional distributions p(t | s) and p(n | s) are the Viterbi approximation estimates for the parameters of the model. The marginals satisfy

    \sum_t p(s, t) = (1/norm) u(s) \bar{n}(s),        \sum_s p(s, t) = u(t),

where u(s) and u(t) are the unigram distributions of s and t, and \bar{n}(s) = \sum_n p(n | s) n is the average number of target words aligned with s. These formulae reflect the fact that in any alignment each target word is aligned with exactly one source word.</Paragraph>
<Paragraph position="15"> ² In these equations and in the remainder of the paper, we use the generic symbol 1/norm to denote a normalizing factor that converts counts to probabilities. We let the actual value of norm be implicit from the context. Thus, for example, in the left hand equation of (7), the normalizing factor is norm = \sum_{s,t} c(s, t), which equals the average length of target sentences. In the right hand equation of (7), the normalizing factor is the average length of source sentences.</Paragraph>
</Section>
<Section position="6" start_page="148" end_page="149" type="metho">
<SectionTitle> CROSS ENTROPY </SectionTitle>
<Paragraph position="0"> In this section we express the cross entropies H(S | T) and H(S' | T') in terms of the information between source and target words.</Paragraph>
<Paragraph position="1"> In the Viterbi approximation the cross entropy H(T | S) is given by

    H(T | S) = L_T { H(t | s) + H(n | s) },                                             (9)

where L_T is the average length of the target sentences in the training data, and H(t | s) and H(n | s) are the conditional entropies for the probability distributions p(s, t) and p(s, n):

    H(t | s) = - \sum_{s,t} p(s, t) \log p(t | s),        H(n | s) = - \sum_{s,n} p(s, n) \log p(n | s).</Paragraph>
<Paragraph position="3"> We want a similar expression for the cross entropy H(S | T). Since P_model(S, T) = P_model(T | S) P_model(S), this cross entropy depends on both the translation model, P_model(T | S), and the language model, P_model(S). We now show that, with a suitable additional approximation,

    H(S | T) = L_T { H(n | s) - I(s, t) } + H(S),                                       (11)

where H(S) is the cross entropy of P_model(S) and I(s, t) is the mutual information between t and s for the probability distribution p(s, t).</Paragraph>
<Paragraph position="5"> The additional approximation that we require is

    H(T) ≈ L_T H(t) = - L_T \sum_t p(t) \log p(t),

where p(t) is the marginal of p(s, t). This amounts to approximating P_model(T) by the unigram distribution that is closest to it in cross entropy. Granting this, formula (11) is a consequence of (9) and of the identities

    H(S | T) = H(S) + H(T | S) - H(T),        I(s, t) = H(t) - H(t | s).</Paragraph>
<Section position="1" start_page="149" end_page="149" type="sub_section">
<SectionTitle> Target Questions </SectionTitle>
<Paragraph position="0"> For sensing target sentences, a question about an informant is a function ĉ from the target vocabulary into the set of possible senses. If the informant of t is x, then t is assigned the sense ĉ(x). We want to choose the function ĉ to minimize the cross entropy H(S | T'). As verified below, this is equivalent to maximizing the conditional mutual information I(s, t' | t) between s and t',

    I(s, t' | t) = \sum_{s,x} p(s, x | t) \log [ p(s, ĉ(x) | t) / ( p(s | t) p(ĉ(x) | t) ) ],    (15)

where p(s, t, x) is the probability distribution obtained by counting the number of times in the Viterbi alignments that s is aligned with t and the value of the informant of t is x.</Paragraph>
<Paragraph position="1"> Next consider H(S' | T'). Let S → S' and T → T' be sense-labeling transformations of the type discussed in Section 2. Assume that these transformations preserve Viterbi alignments; that is, if the words s and t are aligned in the Viterbi alignment for (S, T), then their sensed versions s' and t' are aligned in the Viterbi alignment for (S', T'). It follows that the word translation probabilities obtained from the Viterbi alignments satisfy p(s, t) = \sum_{t'} p(s, t') = \sum_{s'} p(s', t), where the sums range over the sensed versions t' of t and the sensed versions s' of s.</Paragraph>
<Paragraph position="2"> By applying (11) to the cross entropies H(S | T), H(S | T'), and H(S' | T), it is not hard to verify that

    H(S | T') = H(S | T) - L_T I(s, t' | t),
    H(S' | T) = H(S | T) - L_T { I(t, s' | s) + I(n, s' | s) }.

Here I(s, t' | t) is the conditional mutual information, given a target word t, between its translations s and its sensed versions t'; I(t, s' | s) is the conditional mutual information, given a source word s, between its translations t and its sensed versions s'; and I(n, s' | s) is the conditional mutual information, given s, between n and its sensed versions s'.</Paragraph>
<Paragraph position="5"> An exhaustive search for the best ĉ requires a computation that is exponential in the number of values of x and is not practical. In previous work [3] we found a good ĉ using the flip-flop algorithm [4], which is only applicable if the number of senses is restricted to two. Since then, we have developed a different algorithm that can be used to find ĉ for any number of senses.</Paragraph>
<Paragraph position="7"> The algorithm uses the technique of alternating minimization, and is similar to the k-means algorithm for determining pattern clusters and to the generalized Lloyd algorithm for designing vector quantizers. A discussion of alternating minimization, together with references, can be found in Chou [5].</Paragraph>
</Section>
</Section>
<Section position="7" start_page="149" end_page="150" type="metho">
<SectionTitle> SELECTING QUESTIONS </SectionTitle>
<Paragraph position="0"> We now present an algorithm for finding good informants and questions for sensing.</Paragraph>
<Paragraph position="1"> The algorithm is based on the fact that, up to a constant independent of ĉ, the mutual information I(s, t' | t) can be expressed as an infimum over conditional probability distributions q(s | c),

    I(s, t' | t) = C - \inf_q \sum_x p(x | t) D( p(s | x, t) ; q(s | ĉ(x)) ),           (18)

where C is a constant that does not depend on ĉ and

    D(p ; q) = \sum_s p(s) \log ( p(s) / q(s) )

is the Kullback-Leibler divergence between the distributions p and q. The best value of the information is thus an infimum over both the choice for ĉ and the choice for the q.</Paragraph>
<Paragraph position="4"> This suggests the following iterative procedure for obtaining a good ĉ:
1. For given q, find the best ĉ: ĉ(x) = argmin_c D( p(s | x, t) ; q(s | c) ).
2. For this ĉ, find the best q: q(s | c) = (1/norm) \sum_{x: ĉ(x) = c} p(x | t) p(s | x, t).
3. Iterate steps (1) and (2) until no further increase in I(s, t' | t) results.</Paragraph>
<Section position="1" start_page="150" end_page="150" type="sub_section">
<SectionTitle> Source Questions </SectionTitle>
<Paragraph position="0"> For sensing source sentences, a question about an informant is a function ĉ from the source vocabulary into the set of possible senses. We want to choose ĉ to minimize the entropy H(S' | T). By the formula for H(S' | T) given in the previous section, this is equivalent to maximizing the sum I(t, s' | s) + I(n, s' | s). In analogy to (18), we can again find a good ĉ by alternating minimization.</Paragraph>
</Section>
</Section>
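The iterative procedure above can be made concrete with a small sketch in the spirit of steps (1)-(3). The joint distribution p(s, x) for a single target word, the initialization, and the stopping rule are illustrative assumptions; a real run would use Viterbi-alignment counts over full vocabularies.

```python
import math

# Toy joint distribution p(s, x) for one target word (invented numbers).
p_sx = {("make", "decision"): 0.30, ("make", "photo"): 0.22, ("make", "voiture"): 0.03,
        ("take", "voiture"): 0.27, ("take", "train"): 0.18}
xs = sorted({x for _, x in p_sx})
ss = sorted({s for s, _ in p_sx})
p_x = {x: sum(p_sx.get((s, x), 0.0) for s in ss) for x in xs}
p_s_given_x = {x: {s: p_sx.get((s, x), 1e-12) / p_x[x] for s in ss} for x in xs}

def kl(p, q):
    # Kullback-Leibler divergence D(p ; q).
    return sum(p[s] * math.log(p[s] / q[s]) for s in p if p[s] > 0)

def alternate(n_senses=2, iters=10):
    # Start from an arbitrary assignment of informant values to senses.
    label = {x: i % n_senses for i, x in enumerate(xs)}
    for _ in range(iters):
        # Step 2: best q for the current labeling (weighted centroid of p(s|x)).
        q = {}
        for c in range(n_senses):
            mass = sum(p_x[x] for x in xs if label[x] == c) or 1e-12
            q[c] = {s: sum(p_x[x] * p_s_given_x[x][s] for x in xs if label[x] == c) / mass
                    for s in ss}
            q[c] = {s: max(v, 1e-12) for s, v in q[c].items()}
        # Step 1: best labeling for the current q.
        new_label = {x: min(range(n_senses), key=lambda c: kl(p_s_given_x[x], q[c])) for x in xs}
        if new_label == label:
            break
        label = new_label
    return label

print(alternate())  # {'decision': 0, 'photo': 0, 'train': 1, 'voiture': 1}
```
</Paper>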