<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0905"> <Title>Resolving Translation Ambiguity using Non-parallel Bilingual Corpora</Title>
<Section position="4" start_page="31" end_page="32" type="metho"> <SectionTitle> 3 Unsupervised Word Sense Disambiguation </SectionTitle>
<Paragraph position="0"> We adopted the unsupervised word-sense disambiguation (WSD) algorithm based on distributional clustering (Schuetze, 1997). The underlying idea is that the sense of a word is determined by its co-occurring words. For example, the word &quot;suit&quot; co-occurring with &quot;jacket&quot; and &quot;pants&quot; tends to mean a set of clothes, whereas the same word co-occurring with &quot;file&quot; and &quot;court&quot; means &quot;lawsuit&quot;. As stated in Section 2, the WSD algorithm comprises two parts: distributional clustering and categorization. The former learns the relation between sense and co-occurring words in the following steps: 1. Collecting contexts of the word in the corpus, then 2. Clustering them into small coherent groups (clusters).</Paragraph>
<Paragraph position="1"> Table 1 shows sample contexts surrounding &quot;suit&quot; as extracted by the first step from actual news articles. These contexts are expected to be clustered into two sets, (1,4) and (2,3), by the second step. Since each cluster corresponds to a particular sense (or usage) of the word, it is called a &quot;sense profile&quot;.</Paragraph>
<Paragraph position="2"> The latter part of the WSD algorithm is responsible for choosing the cluster &quot;closest&quot;, or most relevant, to a new context of the same word (in this case, &quot;suit&quot;). The selected cluster is the &quot;sense&quot; in the new context.</Paragraph>
<Section position="1" start_page="31" end_page="32" type="sub_section"> <SectionTitle> 3.1 Distributional Clustering </SectionTitle>
<Paragraph position="0"> The above idea is implemented using the multi-dimensional vector space derived from word co-occurrence statistics in the source language corpus.</Paragraph>
<Paragraph position="1"> We first map each word, w, in the corpus onto a vector, g(w), referred to as the &quot;word vector&quot;, in the following steps: 1. Make a co-occurrence matrix whose (i, j) element corresponds to the occurrence count of the j-th content-bearing word in the context of every occurrence of word-i in the corpus (footnote 1).</Paragraph>
<Paragraph position="2"> 2. For simplicity, we employ the sliding window approach, where the neighboring n words are judged to be the context.</Paragraph>
<Paragraph position="3"> 3. Apply singular value decomposition (SVD) to the matrix to reduce its dimensionality.</Paragraph>
<Paragraph position="4"> 4. The vector representation of word-i is the i-th row vector of the reduced matrix.</Paragraph>
<Paragraph position="5"> Second, the context of each occurrence of the word is also mapped to a multi-dimensional vector, called the &quot;context vector&quot;. The context vector is the sum of every word vector within the context (again, the neighboring n words), each weighted by its idf score. Formally, the context vector cxt of a word set W is defined as cxt(W) = \sum_{w \in W} idf(w) g(w) (2), where idf(w) = \log(N / N_w) (3), N is the number of documents in the collection, and N_w is the number of documents containing w (footnote 2).</Paragraph>
<Paragraph position="7"> Finally, the derived context vectors are clustered by applying a clustering algorithm. We used the group-average agglomerative clustering algorithm called Buckshot (Cutting et al., 1992).</Paragraph>
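<Paragraph> The following sketch makes the construction above concrete: word vectors are rows of an SVD-reduced co-occurrence matrix (steps 1-4), and context vectors are idf-weighted sums of word vectors, eqs. (2)-(3). It is a minimal illustration under assumptions introduced here (a corpus represented as a list of tokenized documents, and the function and parameter names), not the implementation used in the paper.</Paragraph>
<Paragraph>
# A minimal sketch (not the authors' implementation) of Section 3.1.
# `docs` is assumed to be a list of tokenized source-language documents.
import numpy as np

def build_word_vectors(docs, vocab, content_words, window=5, dim=100):
    """Return g, where g[i] is the word vector of word-i (steps 1-4)."""
    w2i = {w: i for i, w in enumerate(vocab)}          # word-i ids (footnote 1)
    c2j = {w: j for j, w in enumerate(content_words)}  # content-bearing word ids
    M = np.zeros((len(vocab), len(content_words)))     # step 1: co-occurrence counts
    for doc in docs:
        for pos, w in enumerate(doc):                  # step 2: sliding window of n words
            if w in w2i:
                left = doc[max(0, pos - window):pos]
                right = doc[pos + 1:pos + 1 + window]
                for c in left + right:
                    if c in c2j:
                        M[w2i[w], c2j[c]] += 1.0
    U, S, _ = np.linalg.svd(M, full_matrices=False)    # step 3: dimensionality reduction
    g = U[:, :dim] * S[:dim]                           # step 4: i-th row is g(word-i)
    return g, w2i

def idf(word, docs):
    """idf(w) = log(N / N_w), eq. (3); a document may be an article, paragraph, etc."""
    N = len(docs)
    Nw = sum(1 for d in docs if word in d)
    return np.log(N / max(Nw, 1))

def context_vector(context_words, g, w2i, docs):
    """cxt(W): idf-weighted sum of the word vectors in the context, eq. (2)."""
    return sum(idf(w, docs) * g[w2i[w]] for w in context_words if w in w2i)
</Paragraph>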
<Paragraph position="8"> In the Buckshot algorithm, the proximity, prox, between two context vectors, a and b, is measured by the cosine of the angle between the two vectors: prox(a, b) = (a \cdot b) / (|a| |b|) (5). (Footnote 1: Every word (type) is assigned a sequential id number. Footnote 2: The document unit may be a paragraph, a text, an article, etc.)</Paragraph>
<Paragraph position="9"> Since we hypothesize that each translation alternative corresponds to at least one usage of the source word, the number of clusters is set to the number of translation alternatives plus some fixed number (e.g., 3).</Paragraph>
</Section>
<Section position="2" start_page="32" end_page="32" type="sub_section"> <SectionTitle> 3.2 Categorization </SectionTitle>
<Paragraph position="0"> The task of this step is to determine which sense profile (i.e., context cluster) is &quot;closest&quot; to the word in a new context. The &quot;closeness&quot; between a sense profile and the new context is measured by the proximity, defined by (5), between the context vector of the new context and the representative vector of the profile, called the sense vector. The sense vector of a sense profile is the centroid of all the context vectors in the cluster.</Paragraph>
<Paragraph position="1"> Unlike the original algorithm, we used only a portion (e.g., 70%) of the context vectors closest to the centroid for computing the sense vector, since these central vectors contain less noise in terms of representing the cluster (Cutting et al., 1992).</Paragraph>
</Section> </Section>
<Section position="5" start_page="32" end_page="33" type="metho"> <SectionTitle> 4 Linking Sense to Translation </SectionTitle>
<Paragraph position="0"> The WSD algorithm introduced in the previous section represents the sense of a given word, w, as a cluster of contexts (i.e., co-occurring words) in the source language. If each cluster is associated with one translation, then the result of the WSD can be mapped directly to the translation.</Paragraph>
<Paragraph position="1"> Our method for associating each cluster with a translation consists of the following two major steps: 1. Extracting characteristic words from the cluster, then 2. Applying the term-list translation (disambiguation) algorithm (Kikui, 1998) to the list of words consisting of these characteristic words and the given word, w.</Paragraph>
<Paragraph position="2"> The term-list translation algorithm employed in the second step chooses, from the possible translation alternatives, the translation that is most relevant to the context formed by the entire input (i.e., the word list). Thus, the second step is expected to translate the given source word w into the target word relevant to the sense represented by the cluster.</Paragraph>
<Section position="1" start_page="33" end_page="33" type="sub_section"> <SectionTitle> 4.1 Extracting Characteristic Words </SectionTitle>
<Paragraph position="0"> We applied IR (footnote 3) techniques to extract characteristic words as follows.</Paragraph>
<Paragraph position="1"> 1. Let S be a sense profile of a source word w. 2. Extract the central elements (i.e., contexts or context vectors) of S in the same way as described in Section 3.2.</Paragraph>
<Paragraph position="2"> 3. Calculate the tf-idf score for each word in the extracted contexts, where the tf-idf score of a word w is the frequency of w (term frequency) multiplied by the idf value defined by (3) in Section 3.1.</Paragraph>
<Paragraph position="3"> 4. Choose the topmost m words (m is typically 10 to 20).</Paragraph>
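<Paragraph> The sketch below illustrates steps 1-4 of this extraction procedure. The data layout of a sense profile (a list of pairs of context words and their context vector), the 70% central ratio, and the helper names are assumptions made here for illustration; they are not taken from the paper.</Paragraph>
<Paragraph>
# A small sketch of characteristic-word extraction (Section 4.1); names are illustrative.
import numpy as np
from collections import Counter

def characteristic_words(sense_profile, docs, idf, m=10, central_ratio=0.7):
    # Step 2: keep only the context vectors closest to the centroid (cf. Section 3.2).
    vectors = np.array([vec for _, vec in sense_profile])
    centroid = vectors.mean(axis=0)
    prox = vectors @ centroid / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid))
    central = np.argsort(-prox)[:int(len(prox) * central_ratio)]
    # Step 3: tf-idf = term frequency in the central contexts times idf, eq. (3).
    tf = Counter()
    for i in central:
        tf.update(sense_profile[i][0])
    ranked = sorted(tf, key=lambda w: tf[w] * idf(w, docs), reverse=True)
    # Step 4: the topmost m words characterize this sense profile.
    return ranked[:m]
</Paragraph>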
<Paragraph position="4"> Table 2 shows the output of the above procedure applied to two sense profiles for &quot;suit&quot;, using the training data described in Section 5.1.</Paragraph>
<Paragraph position="5"> The extracted words for each cluster are combined with the source word to form a term-list. The term-list of length 8 for the first sense in Table 2 is: (suit, wearing, blue, designer, white, dark, shoes, hat).</Paragraph>
<Paragraph position="6"> Each term-list is then sent to the term-list translation module. The resulting translation of the source word is associated with the cluster and stored in the sense-translation table, shown in Figure 1.</Paragraph>
</Section>
<Section position="2" start_page="33" end_page="34" type="sub_section"> <SectionTitle> 4.2 Term-list Translation using Target Language Corpus </SectionTitle>
<Paragraph position="0"> The term-list translation algorithm (Kikui, 1998) aims at translating a list of words that characterize a consistent text or concept. It is an unsupervised algorithm in the sense that it relies only on a monolingual corpus free from manual tagging. (Footnote 3: IR = Information Retrieval.)</Paragraph>
<Paragraph position="1"> The algorithm first retrieves all the translation alternatives of each word from a bilingual dictionary (Dictionary Lookup), then tries to find the most coherent (or semantically relevant) combination of the translation alternatives in the target language corpus (Disambiguation), detailed as follows:</Paragraph>
<Paragraph position="2"> 1. Dictionary Lookup: For each word in the given term-list, all the alternative translations are retrieved from a bilingual dictionary. A combination of one translation for each input word is called a translation candidate. For example, if the input is (book, library), then a translation candidate in French is (livre, bibliothèque).</Paragraph>
<Paragraph position="3"> 2. Disambiguation: In this step, all possible translation candidates are ranked by the 'similarity' score of the candidate. The top-ranked candidate is the output of the entire algorithm.</Paragraph>
<Paragraph position="4"> The similarity score of a translation candidate (i.e., a set of target words) is defined again by using the multi-dimensional vector space introduced in Section 3.1. Each target word in a translation candidate is first mapped to a word vector derived from the target language corpus. The similarity score sim of a set of (target) words W is then the average proximity of the word vectors to their centroid: sim(W) = (1 / |W|) \sum_{w \in W} prox(g(w), \bar{g}(W)), where \bar{g}(W) is the centroid of the word vectors of W.</Paragraph>
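<Paragraph> The following sketch summarizes the two steps above (Dictionary Lookup and Disambiguation). It assumes a bilingual dictionary available as a simple mapping and target-language word vectors built as in Section 3.1; the names bidict, g_t, and w2i_t are hypothetical, and the exhaustive enumeration of candidates is a simplification of the published algorithm.</Paragraph>
<Paragraph>
# A compact, illustrative sketch of term-list translation (not the published code).
import itertools
import numpy as np

def similarity(candidate, g_t, w2i_t):
    """Average proximity of the candidate's target word vectors to their centroid."""
    vecs = np.array([g_t[w2i_t[w]] for w in candidate if w in w2i_t])
    if len(vecs) == 0:
        return 0.0
    centroid = vecs.mean(axis=0)
    prox = vecs @ centroid / (
        np.linalg.norm(vecs, axis=1) * np.linalg.norm(centroid))
    return float(prox.mean())

def translate_termlist(termlist, bidict, g_t, w2i_t):
    # Dictionary Lookup: one translation candidate per combination of alternatives.
    alternatives = [bidict.get(w, [w]) for w in termlist]
    candidates = itertools.product(*alternatives)
    # Disambiguation: rank candidates by similarity and output the top-ranked one.
    return max(candidates, key=lambda c: similarity(c, g_t, w2i_t))
</Paragraph>
<Paragraph> For the (book, library) example above, translate_termlist would score each candidate pair, such as (livre, bibliothèque), and return the combination whose word vectors are most coherent in the target-language corpus.</Paragraph>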
</Section> </Section>
<Section position="6" start_page="33" end_page="34" type="metho"> <SectionTitle> 5 Evaluation and Discussion </SectionTitle>
<Paragraph position="0"> We conducted English-to-Japanese translation experiments using newspaper articles. The results of the proposed algorithm were compared against those of the previous algorithm, which relies solely on target language corpora (Kikui, 1998).</Paragraph>
<Section position="2" start_page="33" end_page="34" type="sub_section"> <SectionTitle> 5.1 Experimental Data </SectionTitle>
<Paragraph position="0"> The bilingual dictionary, from English to Japanese, was an inversion of EDICT (Breen, 1995), a free Japanese-to-English dictionary.</Paragraph>
<Paragraph position="1"> The co-occurrence statistics were extracted from the 1994 New York Times (420 MB) for English and the 1994 Mainichi Shinbun, a Japanese newspaper (90 MB), for Japanese. Note that 100 articles were randomly separated from the former corpus as the test set described below.</Paragraph>
<Paragraph position="2"> Although these two collections of newspaper articles were both written in 1994, their topics and contents differ greatly, because each newspaper publishing company edits its paper primarily for domestic readers. Note that the domains of these texts range from business to sports.</Paragraph>
<Paragraph position="3"> The initial size of each co-occurrence matrix was 50000-by-1000, where the rows and columns correspond to the 50,000 and 1,000 most frequent words in the corpus (footnote 4). Each initial matrix was then reduced to a 50000-by-100 matrix by SVD, using SVDPACKC (Berry et al., 1993).</Paragraph>
<Paragraph position="4"> The test data, a set of word-lists, were automatically generated from the 120 New York Times articles separated from the training set. A word-list was extracted from an article by choosing the topmost n words ranked by their tf-idf scores (Section 4.1). In the following experiments, we set n to 6 since it gave the &quot;best&quot; result for this corpus.</Paragraph>
</Section>
<Section position="3" start_page="34" end_page="34" type="sub_section"> <SectionTitle> 5.2 Results </SectionTitle>
<Paragraph position="0"> In order to calculate success rates, the translation outputs were compared against the &quot;correct data&quot;, which were manually created by removing incorrect alternatives from all the possible alternatives. Words for which all the translation alternatives in the bilingual dictionary were judged to be correct were excluded from the success-rate calculation.</Paragraph>
<Paragraph position="1"> The success rates of the proposed method and the previous algorithm are shown in Table 3.</Paragraph>
</Section>
<Section position="4" start_page="34" end_page="34" type="sub_section"> <SectionTitle> 5.3 Discussion </SectionTitle>
<Paragraph position="0"> Although our method produced higher accuracy than the previous method, we cannot tell whether or not the difference is statistically significant. Further experiments with more data might be required.</Paragraph>
<Paragraph position="1"> From a qualitative viewpoint, the proposed method successfully learned useful knowledge for choosing the correct target word. An example is shown in Table 4.</Paragraph>
<Paragraph position="2"> One advantage of the proposed method is that it is applicable to interactive disambiguation. The acquired disambiguation knowledge gives clues for</Paragraph>
</Section> </Section> </Paper>