Inducing a multilingual dictionary from a parallel multitext in related languages

3 Description of the Problem

Let us assume that we have a group of related languages, $L_1, \ldots, L_n$, and a parallel sentence-aligned multitext $C$, with the corresponding portions in each language denoted $C_1, \ldots, C_n$. Such a multitext exists for virtually all languages in the form of the Bible. Our goal is to create a multilingual dictionary by learning the joint distribution $P(x_1 \ldots x_n)$, $x_i \in L_i$, which is simply the expected frequency of the $n$-tuple of words in a completely word-aligned multitext. We approach the problem by learning pairwise language models, leaving some parameters free, and then combining the models and learning the remaining free parameters to produce the joint model.

Let us therefore assume that we have a set of models $\{P(x, y \mid \theta_{ij})\}_{i \neq j}$, $x \in L_i$, $y \in L_j$, where $\theta_{ij}$ is the parameter vector of the pairwise model for languages $L_i$ and $L_j$. We would like to learn how to combine these models in an optimal way. To solve this problem, let us first consider a simpler and more general setting.

3.1 Combining Models of Hidden Data

Let $X$ be a random variable with distribution $P_{true}(x)$, such that no direct observations of it exist. However, we may have some indirect observations of $X$ and have built several models of $X$'s distribution, $\{P_i(x \mid \theta_i)\}_{i=1}^{n}$, each parameterized by some parameter vector $\theta_i$. Each $P_i$ also depends on some other parameters that are fixed. It is important to note that the space of models obtained by varying $\theta_i$ is only a small subspace of the probability space. Our goal is to find a good estimate of $P_{true}(x)$.

The main idea is that if some $P_i$ and $P_j$ are close (by some measure) to $P_{true}$, they have to be close to each other as well. We therefore make the converse assumption: if some models of $X$ are close to each other (and we have reason to believe they are fair approximations of the true distribution), they are also close to the true distribution. Moreover, we would like to set the parameters $\theta_i$ in such a way that $P_i(x \mid \theta_i)$ is as close to the other models as possible. This leads us to look for an estimate that is as close to all of our models as possible, under the optimal values of the $\theta_i$, or more formally:

$$\hat{P} = \arg\min_{P} \min_{\theta_1 \ldots \theta_n} d(P, P_1, \ldots, P_n; \theta_1, \ldots, \theta_n)$$

where $d$ measures the distance between $\hat{P}$ and all the $P_i$ under the parameter setting $\theta_i$. Since we have no reason to prefer any of the $P_i$, we choose the following symmetric form for $d$:

$$d(P, P_1, \ldots, P_n; \theta_1, \ldots, \theta_n) = \sum_{i=1}^{n} D\bigl(P(X) \,\|\, P_i(X \mid \theta_i)\bigr)$$

where $D$ is a reasonable measure of distance between probability distributions. The most appropriate and most commonly used measure in such cases is the Kullback-Leibler divergence, also known as relative entropy:

$$D(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

It turns out that it is possible to find the optimal $\hat{P}$ under these circumstances. Taking a partial derivative and solving, we obtain:

$$\hat{P}(x) = \frac{\left(\prod_{i=1}^{n} P_i(x \mid \theta_i)\right)^{1/n}}{\sum_{x'} \left(\prod_{i=1}^{n} P_i(x' \mid \theta_i)\right)^{1/n}}$$

Substituting this value into the expression for $d$, we obtain the following distance measure between the $P_i$:

$$d'(P_1, \ldots, P_n; \theta_1, \ldots, \theta_n) = -n \log \sum_{x} \left(\prod_{i=1}^{n} P_i(x \mid \theta_i)\right)^{1/n}$$

This function is a generalization of the well-known Bhattacharyya distance for two distributions,

$$D_B(P_1, P_2) = -\log \sum_{x} \sqrt{P_1(x)\, P_2(x)}$$
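To make the combination step concrete, here is a small illustrative sketch (not from the paper) that computes the normalized geometric mean $\hat{P}$ and the distance $d'$ for component models represented as Python dictionaries mapping outcomes to probabilities; the representation and the function name are assumptions made for this example.

```python
import math

def combine_models(models):
    """Normalized geometric mean of several distributions over the same
    discrete space: the P-hat that minimizes the summed KL divergences."""
    support = set().union(*models)
    n = len(models)
    geo = {}
    for x in support:
        # If any model assigns zero probability to x, the geometric mean is zero.
        p = 1.0
        for m in models:
            p *= m.get(x, 0.0)
        geo[x] = p ** (1.0 / n)
    z = sum(geo.values())
    p_hat = {x: v / z for x, v in geo.items()}
    # d' = -n * log(sum_x (prod_i P_i(x))^(1/n)): the generalized
    # Bhattacharyya distance between the component models.
    d_prime = -n * math.log(z)
    return p_hat, d_prime

# Toy usage with two hypothetical word distributions.
p1 = {"dom": 0.6, "kot": 0.4}
p2 = {"dom": 0.5, "kot": 0.5}
p_hat, d = combine_models([p1, p2])
```

Note that any outcome given zero probability by some component also receives zero probability in $\hat{P}$, a direct consequence of the geometric mean.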
These results suggest the following procedure (Algorithm 1) for optimizing $d$ (and $d'$):

* Set all $\theta_i$ randomly.
* Repeat until the change in $d$ is very small:
  - Compute $\hat{P}$ according to the formula above.
  - For $i$ from 1 to $n$: set $\theta_i$ so as to minimize $D(\hat{P}(X) \,\|\, P_i(X \mid \theta_i))$.
  - Compute $d$ according to the formula above.

Each step of the algorithm can only decrease $d$. It is also easy to see that minimizing $D(\hat{P}(X) \,\|\, P_i(X \mid \theta_i))$ is the same as setting the parameters $\theta_i$ so as to maximize $\prod_{x \in X} P_i(x \mid \theta_i)^{\hat{P}(x)}$, which can be interpreted as maximizing the probability under $P_i$ of a corpus in which each word $x$ appears $\hat{P}(x)$ times. In other words, we are now optimizing $P_i(X)$ given an observed corpus of $X$, which is a much easier problem. For many types of models $P_i$, the Expectation-Maximization algorithm is able to solve this problem.

3.2 Combining Pairwise Models

Following the method outlined in the previous section, we could find an optimal joint probability $P(x_1 \ldots x_n)$, $x_i \in L_i$, if we were given several models $P_j(x_1 \ldots x_n \mid \theta_j)$ of the full joint distribution. Instead, we have a number of pairwise models. Depending on which independence assumptions we make, we can define a joint distribution over all the languages in various ways. For example, for three languages $A$, $B$, and $C$, we can pair each bilingual model with the remaining monolingual one:

$$P_1(A,B,C) = P(A,B \mid \theta_{AB})\, P(C), \quad P_2(A,B,C) = P(B,C \mid \theta_{BC})\, P(A), \quad P_3(A,B,C) = P(A,C \mid \theta_{AC})\, P(B)$$

For this set of models the distance $d$ decomposes as

$$d = -3 H(\hat{P}) + H\bigl(\hat{P}(A,B), P(A,B)\bigr) + H\bigl(\hat{P}(B,C), P(B,C)\bigr) + H\bigl(\hat{P}(A,C), P(A,C)\bigr) + H\bigl(\hat{P}(A), P(A)\bigr) + H\bigl(\hat{P}(B), P(B)\bigr) + H\bigl(\hat{P}(C), P(C)\bigr)$$

where $H(\cdot)$ is entropy, $H(\cdot,\cdot)$ is cross-entropy, and $\hat{P}(A,B)$ denotes $\hat{P}$ marginalized to the variables $A, B$.

The last three cross-entropy terms involve monolingual models, which are not parameterized, and the entropy term does not involve any of the pairwise distributions. Therefore, if $\hat{P}$ is fixed, to minimize $d$ we need to minimize each of the bilingual cross-entropy terms.

This means we can apply the algorithm from the previous section with a small modification (Algorithm 2):

* Set all $\theta_{ij}$ (for each language pair $i, j$) randomly.
* Repeat until the change in $d$ is very small:
  - Compute $P_i$ for $i = 1 \ldots k$, where $k$ is the number of joint models we have chosen.
  - Compute $\hat{P}$ from $\{P_i\}$.
  - For all $i, j$ such that $i \neq j$:
    - Marginalize $\hat{P}$ to $(L_i, L_j)$.
    - Set $\theta_{ij}$ so as to minimize $D(\hat{P}(L_i, L_j) \,\|\, P(L_i, L_j \mid \theta_{ij}))$.
  - Compute $d$ according to the formula above.

Most of the $\theta$ parameters in our models can be set by running EM; the rest are discrete with only a few possible values and can be optimized by trying all combinations of them.
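The outer loop of Algorithm 2 can be sketched as follows. This is a schematic rendering under stated assumptions: `build_joints`, `fit_pairwise`, `marginalize`, `combine_models`, and `distance` are hypothetical callables standing in for the chosen joint constructions, the per-pair re-estimation (e.g., by EM), marginalization, the geometric-mean combination of Section 3.1, and the distance $d$.

```python
def optimize(pair_params, build_joints, fit_pairwise, marginalize,
             combine_models, distance, tol=1e-4, max_iter=100):
    """Alternating optimization over the free pairwise parameters (Algorithm 2).

    pair_params:    dict mapping a language pair (i, j) to its parameters theta_ij
    build_joints:   constructs the chosen joint models P_1..P_k from pairwise models
    fit_pairwise:   re-estimates theta_ij against the marginal of P-hat, e.g. by EM
    marginalize:    restricts P-hat to a single language pair
    combine_models: normalized geometric mean of the joint models (Section 3.1)
    distance:       the distance d between P-hat and the component models
    """
    prev_d = float("inf")
    for _ in range(max_iter):
        joints = build_joints(pair_params)          # P_1 ... P_k
        p_hat, _ = combine_models(joints)           # current estimate of the joint
        for (i, j), theta in pair_params.items():
            target = marginalize(p_hat, (i, j))     # P-hat restricted to (L_i, L_j)
            pair_params[(i, j)] = fit_pairwise(target, theta)
        d = distance(p_hat, build_joints(pair_params))
        if prev_d - d < tol:                        # stop when d barely changes
            break
        prev_d = d
    return pair_params, p_hat
```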
4 Building Pairwise Models

We now know how to combine pairwise translation models that have some free parameters. Let us now discuss how such models might be built.

Our goal at this stage is to take a parallel bitext in related languages $A$ and $B$ and produce a joint probability model $P(x, y)$, where $x \in A$, $y \in B$. Equivalently, since the monolingual models $P_A(x)$ and $P_B(y)$ are easily estimated from the bitext by maximum likelihood, we can estimate $P_{A\to B}(y \mid x)$ or $P_{B\to A}(x \mid y)$. Without loss of generality, we build $P_{A\to B}(y \mid x)$.

The model we are building has a number of free parameters, which are set by the algorithm discussed above. In this section we assume that these parameters are fixed.

Our model is a mixture of several components, each discussed in a separate section below:

$$P_{A\to B}(y \mid x) = \lambda_1 P^{fw}_{A\to B}(y \mid x) + \lambda_2 P^{bw}_{A\to B}(y \mid x) + \lambda_3 P^{char}_{A\to B}(y \mid x) + \lambda_4 P^{pref}_{A\to B}(y \mid x) + \lambda_5 P^{suf}_{A\to B}(y \mid x) + \lambda_6 P^{const}_{A\to B}(y \mid x)$$

where the $\lambda$s sum to one. The $\lambda$s are free parameters, although to avoid over-training we tie the $\lambda$s for $x$'s with similar frequencies. These $\lambda$s form part of the parameter vector $\theta_{ij}$ mentioned previously, with $L_i = A$ and $L_j = B$.

The components represent various constraints that are likely to hold between related languages.

4.1 GIZA (forward)

This component is the GIZA++ software, originally created at the Johns Hopkins University Summer Workshop in 1999 and improved by Och (2000). It can be used to create word alignments for sentence-aligned parallel corpora as well as to induce a probabilistic dictionary for the language pair. The general approach taken by GIZA is as follows. Let $L_A$ and $L_B$ be the portions of the parallel text in languages $A$ and $B$ respectively, and let us look for the word-level translation model that maximizes $P(L_B \mid L_A)$.

GIZA performs this maximization by building a variety of models, mostly described by Brown et al. (1993). GIZA can be tuned in various ways, most importantly by choosing which models to run and for how many iterations. We treat these choices as free parameters, to be set along with the rest at a later stage.

As a side effect of GIZA's optimization, we obtain the conditional distribution $P_{A\to B}(y \mid x)$ that maximizes the above expression. It is quite reasonable to believe that a model of this sort is also a good model for our purposes. This model is what we refer to as $P^{fw}_{A\to B}(y \mid x)$ in the model overview.

GIZA's approach is not, however, perfect. GIZA builds several models, some quite complex, yet it does not use all the information available to it, notably the lexical similarity between the languages. Furthermore, GIZA tries to map words (especially rare ones) onto other words whenever possible, even if the sentence contains no direct translation of the word in question. These problems are addressed by the other models, described in the following sections.

4.2 GIZA (backward)

In the previous section we used GIZA to optimize $P(L_B \mid L_A)$. It is, however, equally reasonable to optimize $P(L_A \mid L_B)$ instead. If we do so, we obtain the model $P^{fw}_{B\to A}(x \mid y)$ that maximizes $P(L_A \mid L_B)$. We, however, need a model of the form $P_{A\to B}(y \mid x)$, which is easily obtained by Bayes' rule:

$$P^{bw}_{A\to B}(y \mid x) = \frac{P^{fw}_{B\to A}(x \mid y)\, P_B(y)}{P_A(x)}$$

which requires us to have $P_B(y)$ and $P_A(x)$. These models can be estimated directly from $L_B$ and $L_A$ by maximum likelihood:

$$P_A(x) = \frac{1}{|L_A|} \sum_{x' \in L_A} \delta(x, x'), \qquad P_B(y) = \frac{1}{|L_B|} \sum_{y' \in L_B} \delta(y, y')$$

where $\delta(x, y)$ is the Kronecker delta function, which is equal to 1 if its arguments are equal and to 0 otherwise.
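As an illustration of the backward component, the following sketch applies Bayes' rule to a forward table trained in the $B \to A$ direction. The nested-dictionary representation (`p_fw_b2a[y][x]` $= P^{fw}_{B\to A}(x \mid y)$) and the helper names are assumptions made for this example, not part of GIZA's interface.

```python
from collections import Counter

def unigram_mle(tokens):
    """Maximum-likelihood unigram model: relative frequencies in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def backward_component(p_fw_b2a, corpus_a, corpus_b):
    """Turn a forward model P_fw_B->A(x|y) into P_bw_A->B(y|x) via Bayes' rule:
    P(y|x) = P(x|y) * P_B(y) / P_A(x)."""
    p_a = unigram_mle(corpus_a)
    p_b = unigram_mle(corpus_b)
    p_bw = {}
    for y, row in p_fw_b2a.items():
        for x, p_x_given_y in row.items():
            if x not in p_a:     # x unseen in L_A: no MLE estimate, skip
                continue
            p_bw.setdefault(x, {})[y] = p_x_given_y * p_b.get(y, 0.0) / p_a[x]
    return p_bw
```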
4.3 Character-based model

This and the following models all rely on having some model of $P_{A\to B}(y \mid x)$ to start from. In practice this means that this component is estimated after the previous components and uses the models they provide as a starting point.

The basic idea behind this model is that in related languages the words themselves are related. If we have a model $P_c$ for translating characters of language $A$ into characters of language $B$, we can define a model for translating entire words.

Let word $x$ in language $A$ consist of characters $x_1$ through $x_n$, and let word $y$ in language $B$ consist of characters $y_1$ through $y_m$. Let us define the (unnormalized) character model

$$P'_{char}(y \mid x) = P(y_1 \ldots y_m \mid x_1 \ldots x_n, m)\; P(m \mid n)$$

i.e., we estimate the length of $y$ first, and $y$ itself afterward. We make the independence assumption that the length of $y$ depends only on the length of $x$, and are thus able to estimate the second term above easily. The first term is harder to estimate.

First, let us consider the case where the lengths of $x$ and $y$ are the same ($m = n$). Then

$$P(y_1 \ldots y_n \mid x_1 \ldots x_n, m = n) = \prod_{i=1}^{n} P_c(y_i \mid x_i)$$

It is easy to see that this is a valid probability model over all sequences of characters. However, $y$ is not a random sequence of characters but a word in language $B$; moreover, it is a word that can serve as a potential translation of word $x$. So, to define a proper distribution over words $y$ given a word $x$ and a set $T(x)$ of possible translations of $x$, we normalize:

$$P_{char}(y \mid x) = \frac{P'_{char}(y \mid x)}{\sum_{y' \in T(x)} P'_{char}(y' \mid x)}$$

This is the complete definition of $P_{char}$, except that we are implicitly relying on the character-mapping model $P_c$, which we still need to obtain. To obtain it, we rely on GIZA again. As we have seen, GIZA can find a good word-mapping model if it has a bitext to work from. Having a word-mapping model $P_{A\to B}$ of some sort is equivalent to having a parallel bitext in which the words $x$ and $y$ are treated as sequences of characters rather than indivisible tokens, and in which each word pair $(x, y)$ occurs $P_{A\to B}(x, y)$ times. GIZA then provides the $P_c$ model we need by optimizing the probability of the language-$B$ side of this corpus given the language-$A$ side.
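A small sketch of the character component for the equal-length case described above; the table formats (`p_c[a][b]` $= P_c(b \mid a)$, `p_len[(n, m)]` $= P(m \mid n)$) and the candidate-set argument are illustrative assumptions.

```python
def p_char_unnorm(y, x, p_c, p_len):
    """Unnormalized character model: per-character translation probabilities
    times the length term (shown here for the equal-length case only)."""
    if len(y) != len(x):
        return 0.0                        # unequal lengths are not sketched here
    p = p_len.get((len(x), len(y)), 0.0)  # P(|y| = m given |x| = n)
    for a, b in zip(x, y):
        p *= p_c.get(a, {}).get(b, 0.0)   # P_c(b | a)
    return p

def p_char(y, x, candidates, p_c, p_len):
    """Proper distribution over the candidate translations T(x) of x."""
    z = sum(p_char_unnorm(y2, x, p_c, p_len) for y2 in candidates)
    return p_char_unnorm(y, x, p_c, p_len) / z if z > 0 else 0.0
```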
4.4 Prefix Model

This model and the two that follow are built on the same principle. Let there be functions $f : A \to C_A$ and $g : B \to C_B$ that group the words of $A$ and $B$ into finite sets of classes. If we have some $P_{A\to B}(y \mid x)$ to start with, we can define a class-level model $P_{fg}$ that measures how likely words of class $f(x)$ are to translate into words of class $g(y)$, by aggregating $P_{A\to B}$ over the members of each class.

For the prefix model, we rely on the following idea: words that share a common prefix often tend to be related, and related words should translate into related words in the other language as well. In other words, we are trying to capture word-level semantic information. We therefore define $f$ and $g$ to truncate words to fixed-length prefixes:

$$f(x) = x_1 \ldots x_{\min(n, |x|)}, \qquad g(y) = y_1 \ldots y_{\min(m, |y|)}$$

where $n$ and $m$ are free parameters whose values will be determined later. We define $P^{pref}_{A\to B}$ as $P_{fg}$ with $f$ and $g$ specified above.

4.5 Suffix Model

Similarly to the prefix model above, it is also useful to have a suffix model. Words that share a suffix are likely to be in the same grammatical case or to share some other morphological feature, which may persist across languages. In either case, if a strong relationship exists between the resulting classes, this provides good evidence for assigning higher likelihood to words belonging to these classes. It is worth noting that this feature (unlike the previous one) is unlikely to be helpful in a setting where the languages are not related.

The functions $f$ and $g$ are defined over sets of suffixes $S_A$ and $S_B$ which are learned automatically: $f(x)$ is the longest suffix of $x$ that appears in $S_A$, and $g$ is defined similarly over $S_B$.

The sets $S_A$ and $S_B$ are built as follows. We start with all one-character suffixes. We then consider two-letter suffixes: a suffix is added to the set if it occurs much more often than would be expected from the frequency of its first letter in the penultimate position times the frequency of its second letter in the final position. We proceed in the same way for three-letter suffixes. The threshold value is a free parameter of this model. (A procedural sketch of one reading of this criterion is given at the end of Section 4.)

4.6 Constituency Model

If we had information about constituent boundaries in either language, it would be useful to build a model favoring alignments that do not cross constituent boundaries. We do not have this information at this point. We can assume, however, that any sequence of three words is a constituent of sorts, and build a model based on that assumption.

As before, let $L_A = (x_i)_{i=1 \ldots n}$ and $L_B = (y_j)_{j=1 \ldots m}$. Let us define $C_A(i)$ as the word triple $(x_{i-1}, x_i, x_{i+1})$ and $C_B(j)$ as the triple $(y_{j-1}, y_j, y_{j+1})$. Given some model $P_{A\to B}$, we define $P^{const}_{A\to B}$ as follows: for each source word $x_i$, we score each target position $j$ by a product of translation probabilities under $P_{A\to B}$ between the words of $C_A(i)$ and the words of $C_B(j)$, and we divide by $C$, the sum over $j$ of these products, which serves to normalize the distribution.
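The suffix-inventory construction of Section 4.5 can be sketched as follows. This is one possible reading of the criterion: the expected count of a suffix is estimated from the independent per-position letter frequencies, and a suffix is kept when its observed count exceeds that expectation by the threshold factor. The function and parameter names are illustrative, and the exact generalization to three-letter suffixes is an assumption.

```python
from collections import Counter

def induce_suffixes(words, max_len=3, threshold=5.0):
    """Build a suffix inventory in the spirit of Section 4.5: a k-letter suffix
    is kept if it occurs more than `threshold` times as often as expected from
    the per-position letter frequencies. `threshold` plays the role of the
    free parameter mentioned in the text."""
    suffixes = {w[-1] for w in words if w}            # start with all 1-char suffixes
    for k in range(2, max_len + 1):
        candidates = Counter(w[-k:] for w in words if len(w) >= k)
        # Frequency of each letter at each of the last k positions.
        pos_freq = [Counter(w[-k + i] for w in words if len(w) >= k)
                    for i in range(k)]
        total = sum(candidates.values())
        for suf, count in candidates.items():
            expected = total
            for i, ch in enumerate(suf):
                expected *= pos_freq[i][ch] / total   # independent-letters estimate
            if count > threshold * expected:
                suffixes.add(suf)
    return suffixes
```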