An Efficient Method for Determining Bilingual Word Classes

1 Introduction

Word classes are often used in language modelling to solve the problem of sparse data. Various clustering techniques have been proposed (Brown et al., 1992; Jardino and Adda, 1993; Martin et al., 1998) that perform automatic word clustering by optimizing a maximum-likelihood criterion with iterative clustering algorithms.

In the field of statistical machine translation we also face the problem of sparse data. Our aim is to use word classes in statistical machine translation to allow for more robust statistical translation models. A naive approach would be to use monolingually optimized word classes in the source and the target language. Unfortunately, we cannot expect these independently optimized classes to correspond to each other. Therefore, monolingually optimized word classes do not seem to be useful for machine translation (see also (Fung and Wu, 1995)). We define bilingual word clustering as the process of forming corresponding word classes, suitable for machine translation purposes, for a pair of languages using a parallel training corpus.

The method described here for determining bilingual word classes is an extension and improvement of the method mentioned in (Och and Weber, 1998). Our approach is simpler and computationally more efficient than that of (Wang et al., 1996).

2 Monolingual Word Clustering

The task of a statistical language model is to estimate the probability Pr(w_1^N) of a sequence of words w_1^N = w_1 ... w_N. A simple approximation of Pr(w_1^N) is to model it as a product of bigram probabilities:

    Pr(w_1^N) \approx \prod_{i=1}^{N} p(w_i \mid w_{i-1})

If we want to estimate the bigram probabilities p(w|w') from a realistic natural-language corpus, we are faced with the problem that most of the bigrams are rarely seen. One possibility to solve this problem is to partition the set of all words into equivalence classes. The function C maps each word w to its class C(w). Rewriting the corpus probability using classes, we arrive at the following probability model p(w_1^N | C):

    p(w_1^N \mid C) = \prod_{i=1}^{N} p(C(w_i) \mid C(w_{i-1})) \cdot p(w_i \mid C(w_i))        (1)

In this model we have two types of probabilities: the transition probability p(C|C') for class C given its predecessor class C', and the membership probability p(w|C) for word w given class C.

To determine the optimal classes C for a given number of classes M, we perform a maximum-likelihood estimation:

    \hat{C} = \arg\max_{C} p(w_1^N \mid C)        (2)

We estimate the probabilities of Eq. (1) by relative frequencies: p(C|C') := n(C|C')/n(C') and p(w|C) := n(w)/n(C). The function n(\cdot) provides the frequency of a unigram or bigram in the training corpus. If we insert these relative frequencies into Eq. (2), take the logarithm, drop the terms that do not depend on C, and change the summation order, we arrive at the following optimization criterion, which is to be maximized over the class mapping C:

    LP_1(C, n) = \sum_{C, C'} h(n(C \mid C')) - 2 \sum_{C} h(n(C))        (3)

The function h(n) is a shortcut for n \cdot \log(n).

It is necessary to fix the number of classes in C in advance, because the optimum is reached when every word forms a class of its own. Because of this, an additional optimization process is needed to determine the number of classes. The use of leaving-one-out in a modified optimization criterion, as in (Kneser and Ney, 1993), could in principle solve this problem.

An efficient optimization algorithm for LP_1 is described in Section 4.
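To make the criterion concrete, the following Python sketch evaluates LP_1 of Eq. (3) for a fixed word-to-class mapping, using the class unigram and bigram counts of a tokenized corpus. It is only an illustration; the function names, data layout, and toy corpus are not taken from the paper.

```python
from collections import defaultdict
from math import log

def h(n):
    # h(n) = n * log(n); by convention h(0) = 0
    return n * log(n) if n > 0 else 0.0

def lp1(corpus, word2class):
    """LP_1(C, n) = sum_{C,C'} h(n(C|C')) - 2 * sum_C h(n(C)) for a fixed
    mapping word2class, computed from the class counts of the token list."""
    bigram = defaultdict(int)   # n(C|C'): count of class C following class C'
    unigram = defaultdict(int)  # n(C)
    prev = None
    for w in corpus:
        c = word2class[w]
        unigram[c] += 1
        if prev is not None:
            bigram[(c, prev)] += 1
        prev = c
    return sum(h(n) for n in bigram.values()) - 2 * sum(h(n) for n in unigram.values())

# Toy comparison of two clusterings of the same corpus.
corpus = "the cat sat on the mat the dog sat on the rug".split()
coarse = {"the": 0, "cat": 1, "dog": 1, "mat": 1, "rug": 1, "sat": 2, "on": 3}
singletons = {w: w for w in corpus}  # every word forms a class of its own
print(lp1(corpus, coarse), lp1(corpus, singletons))
```

With the singleton mapping the criterion reaches its unconstrained optimum, which is why the number of classes has to be fixed in advance.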
3 Bilingual Word Clustering

In bilingual word clustering we are interested in class mappings \mathcal{E} and \mathcal{F} that form partitions of the vocabularies of the two languages. To perform bilingual word clustering we use a maximum-likelihood approach as in the monolingual case: we maximize the joint probability of a bilingual training corpus (e_1^I, f_1^J):

    (\hat{\mathcal{E}}, \hat{\mathcal{F}}) = \arg\max_{\mathcal{E}, \mathcal{F}} \; p(e_1^I \mid \mathcal{E}) \cdot p(f_1^J \mid e_1^I; \mathcal{E}, \mathcal{F})        (6)

To perform the maximization of Eq. (6) we have to model the monolingual a priori probability p(e_1^I | \mathcal{E}) and the translation probability p(f_1^J | e_1^I; \mathcal{E}, \mathcal{F}). For the first we use the class-based bigram probability of Eq. (1).

To model p(f_1^J | e_1^I; \mathcal{E}, \mathcal{F}) we assume the existence of an alignment a_1^J. We assume that every word f_j is produced by the word e_{a_j} at position a_j of the training corpus with the probability p(f_j | e_{a_j}):

    p(f_1^J \mid e_1^I) = \prod_{j=1}^{J} p(f_j \mid e_{a_j})

The word alignment a_1^J is trained automatically using statistical translation models as described in (Brown et al., 1993; Vogel et al., 1996). The idea is to introduce the unknown alignment a_1^J as a hidden variable into a statistical model of the translation probability p(f_1^J | e_1^I). By applying the EM algorithm we obtain the model parameters. The alignment a_1^J that we use is the Viterbi alignment of an HMM alignment model similar to (Vogel et al., 1996).

By rewriting the translation probability using word classes, we obtain (corresponding to Eq. (1)):

    p(f_1^J \mid e_1^I; \mathcal{E}, \mathcal{F}) = \prod_{j=1}^{J} p(\mathcal{F}(f_j) \mid \mathcal{E}(e_{a_j})) \cdot p(f_j \mid \mathcal{F}(f_j))        (8)

Here the variables F and E denote single classes of \mathcal{F} and \mathcal{E}. We use relative frequencies to estimate p(F|E) and p(f|F):

    p(F \mid E) := n_t(F \mid E) / \sum_{F'} n_t(F' \mid E), \qquad p(f \mid F) := n(f) / n(F)

The function n_t(F|E) counts how often the words in class F are aligned to words in class E. If we insert these relative frequencies into Eq. (8) and apply the same transformations as in the monolingual case, we obtain a similar optimization criterion for the translation-probability part of Eq. (6). Thus the full optimization criterion for bilingual word classes is:

    (\hat{\mathcal{E}}, \hat{\mathcal{F}}) = \arg\max_{\mathcal{E}, \mathcal{F}} \Big[ \sum_{E, E'} h(n(E \mid E')) - 2 \sum_{E} h(n(E)) + \sum_{E, F} h(n_t(F \mid E)) - \sum_{E} h(n_t(E)) - \sum_{F} h(n_t(F)) \Big]

where n_t(E) := \sum_F n_t(F \mid E) and n_t(F) := \sum_E n_t(F \mid E) are the marginal alignment counts.

The two count functions n(E|E') and n_t(F|E) can be combined into a single count function n_g(X|Y) := n(X|Y) + n_t(X|Y), since n(f|e) = 0 for all words f and n_t(e|e') = 0 for all words e and e'. Using the function n_g we arrive at the following optimization criterion, which is maximized over the classes in \mathcal{E} and \mathcal{F}:

    LP_2(\mathcal{E}, \mathcal{F}; n_g) = \sum_{X, Y} h(n_g(X \mid Y)) - \sum_{X} h(n_{g,1}(X)) - \sum_{X} h(n_{g,2}(X))        (9)

Here n_{g,1}(X) := \sum_Y n_g(Y \mid X) and n_{g,2}(X) := \sum_Y n_g(X \mid Y) denote the two marginal counts of n_g. In the optimization process it is not allowed that words of different languages occur in the same class. It can be seen that Eq. (3) is a special case of Eq. (9) with n_{g,1} = n_{g,2}.

Another possibility for performing bilingual word clustering is a two-step approach. In a first step we determine classes \hat{\mathcal{E}} by optimizing only the monolingual part of Eq. (6), and in a second step we determine classes \hat{\mathcal{F}} by optimizing the bilingual part (without changing \hat{\mathcal{E}}):

    \hat{\mathcal{E}} = \arg\max_{\mathcal{E}} LP_1(\mathcal{E}, n)
    \hat{\mathcal{F}} = \arg\max_{\mathcal{F}} LP_2(\hat{\mathcal{E}}, \mathcal{F}; n_g)

By using these two optimization processes we enforce that the classes \mathcal{E} are monolingually 'good' classes and that the classes \mathcal{F} correspond to \mathcal{E}. Interestingly enough, this results in a higher translation quality (see Section 5).
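Under the notation introduced above, and following the reconstructed form of Eq. (9), the combined count function n_g and the bilingual criterion can be evaluated along the same lines as LP_1. The identifiers (`combine_counts`, `lp2`, the key conventions) are illustrative, not the paper's.

```python
from collections import defaultdict
from math import log

def h(n):
    return n * log(n) if n > 0 else 0.0

def combine_counts(bigram_e, align_fe):
    """n_g(X|Y) := n(X|Y) + n_t(X|Y).  bigram_e maps (E, E') to the count of
    class E following class E'; align_fe maps (F, E) to the count of words in
    class F aligned to words in class E.  The two tables never share a key,
    since n(f|e) = 0 and n_t(e|e') = 0."""
    ng = defaultdict(int)
    for key, n in bigram_e.items():
        ng[key] += n
    for key, n in align_fe.items():
        ng[key] += n
    return ng

def lp2(ng):
    """LP_2 = sum_{X,Y} h(ng(X|Y)) - sum_X h(ng1(X)) - sum_X h(ng2(X)),
    where ng1 and ng2 are the two marginals of the combined count table."""
    ng1 = defaultdict(int)  # n_{g,1}(X) = sum_Y ng(Y|X): X as conditioning class
    ng2 = defaultdict(int)  # n_{g,2}(X) = sum_Y ng(X|Y): X as predicted class
    for (x, y), n in ng.items():
        ng2[x] += n
        ng1[y] += n
    return (sum(h(n) for n in ng.values())
            - sum(h(n) for n in ng1.values())
            - sum(h(n) for n in ng2.values()))
```

The constraint that words of different languages never share a class is then simply a restriction on which moves the optimization algorithm of Section 4 may propose.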
4 Implementation

An efficient optimization algorithm for LP_1 is the exchange algorithm (Martin et al., 1998). For the optimization of LP_2 we can use the same algorithm with small modifications. Our starting point is a random partition of the training-corpus vocabulary. This initial partition is improved iteratively by moving a single word from one class to another. The algorithm used to determine bilingual classes is depicted in Figure 1.

If only one word w is moved between the partitions C and C', the change LP(C, n_g) - LP(C', n_g) can be computed efficiently by looking only at the classes C for which n_g(w, C) > 0 or n_g(C, w) > 0. We define M_0 to be the average number of seen predecessor and successor word classes. With the notation I for the number of iterations needed for convergence, B for the number of word bigrams, M for the number of classes and V for the vocabulary size, the computational complexity of this algorithm is roughly I \cdot (B \cdot \log_2(B/V) + V \cdot M \cdot M_0). A detailed analysis of the complexity can be found in (Martin et al., 1998).

The algorithm described above provides only a local optimum. The quality of the resulting local optima can be improved if we accept a short-term degradation of the optimization criterion during the optimization process. We do this in our implementation by applying the optimization method threshold accepting (Dueck and Scheuer, 1990), which is an efficient simplification of simulated annealing.
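For concreteness, here is a simplified Python sketch of the exchange algorithm combined with threshold accepting. It is not the implementation evaluated in the paper: it re-evaluates the criterion from scratch for every tentative move, whereas the efficient version described above computes the change incrementally from the counts n_g(w, C) and n_g(C, w); the threshold schedule, sweep count, and parameter names are assumptions.

```python
import random

def exchange_threshold_accepting(words, classes, criterion, assign,
                                 thresholds=(100.0, 10.0, 1.0, 0.0), sweeps=3):
    """Move single words between classes; a move is kept if it does not worsen
    `criterion` (e.g. LP_1 or LP_2 as a function of the word-to-class mapping
    `assign`) by more than the current threshold.  The shrinking threshold
    schedule allows short-term degradations early on, as in threshold
    accepting (Dueck and Scheuer, 1990)."""
    words = list(words)
    current = criterion(assign)
    for threshold in thresholds:
        for _ in range(sweeps):
            random.shuffle(words)
            for w in words:
                old_class = assign[w]
                best_class, best_val = old_class, current - threshold
                for c in classes:
                    if c == old_class:
                        continue
                    assign[w] = c            # tentative move of word w to class c
                    val = criterion(assign)
                    if val > best_val:
                        best_class, best_val = c, val
                assign[w] = best_class        # keep the best (possibly worse) move
                if best_class != old_class:
                    current = best_val
    return assign
```

With `thresholds=(0.0,)` the sketch degenerates to the plain exchange algorithm, which accepts only strict improvements of the criterion.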