<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1015"> <Title>Word Sense Acquisition from Bilingual Comparable Corpora</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Basic Idea </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Clustering of translation equivalents </SectionTitle> <Paragraph position="0"> Most work on automatic extraction of synonyms from text corpora rests on the idea that synonyms have similar distribution patterns (Hindle, 1990; Peraira, et al., 1993; Grefenstette, 1994). This idea is also useful for our task, i.e., extracting sets of synonymous translation equivalents, and we adopt the approach to distributional word clustering.</Paragraph> <Paragraph position="1"> We need to mention that the singularity of our task makes the problem easier. First, we do not have to cluster all words of a language, but we only have to cluster a small number of translation equivalents for each target word, whose senses are to be extracted, separately. As a result, the problem of computational efficiency becomes less serious. Second, even if a translation equivalent itself is polysemous, it is not necessary to consider senses that are irrelevant to the target word. A translation equivalent usually represents one and only one sense of the target word, at least in case the language-pair is those with different origins like English and Japanese. Therefore, a non-overlapping clustering algorithm, which is far simpler than overlapping clustering algorithms, is sufficient. null</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Translingual distributional word clustering </SectionTitle> <Paragraph position="0"> In conventional distributional word clustering, a word is characterized by a vector or weighted set consisting of words in the same language as that of the word itself. In contrast, we propose a translingual distributional word clustering method, whereby a word is characterized by a vector or weighted set consisting of words in another language. It is based on the sense-vs.-clue correlation matrix calculation method we originally developed for unsupervised WSD (Kaji and Morimoto, 2002). That method presupposes that each sense of a target word x is defined with a synonym set consisting of the target word itself and one or more translation equivalents which represent the sense.</Paragraph> <Paragraph position="1"> It calculates correlations between the senses of x and the words statistically related to x, which act as clues for determining the sense of x, on the basis of translingual alignment of pairs of related words. Rows of the resultant correlation matrix are regarded as translingual distribution patterns characterizing translation equivalents.</Paragraph> <Paragraph position="2"> Sense-vs.-clue correlation matrix calculation A description of the wild-card pair of related words, which plays an essential role in recovering alignment failure, has been omitted for simplicity.</Paragraph> <Paragraph position="3"> Let X(x) be the set of clues for determining the sense of a first-language target word x. That is,</Paragraph> <Paragraph position="5"> denotes the collection of pairs of related words extracted from a corpus of the first language.</Paragraph> <Paragraph position="6"> Henceforth, the j-th clue for determining the sense of x will be denoted as x(j). 
<Paragraph position="2"> Sense-vs.-clue correlation matrix calculation. (A description of the wild-card pair of related words, which plays an essential role in recovering from alignment failures, is omitted for simplicity.)</Paragraph> <Paragraph position="3"> 1) Alignment of pairs of related words. Let X(x) be the set of clues for determining the sense of a first-language target word x. That is, X(x) = { x' | (x, x') ∈ R1 }, where R1 denotes the collection of pairs of related words extracted from a corpus of the first language. Henceforth, the j-th clue for determining the sense of x will be denoted as x(j).</Paragraph> <Paragraph position="4"> Furthermore, let Y(x, x(j)) be the set consisting of all second-language counterparts of the first-language pair of related words x and x(j). That is, Y(x, x(j)) = { (y, y') | (y, y') ∈ R2, (x, y) ∈ D, (x(j), y') ∈ D }, where R2 denotes the collection of pairs of related words extracted from a corpus of the second language, and D denotes a bilingual dictionary, i.e., a collection of pairs consisting of a first-language word and a second-language word that are translations of one another.</Paragraph> <Paragraph position="5"> Then, for each alignment, i.e., each pair of (x, x(j)) and (y, y') (∈ Y(x, x(j))), a weighted set of common related words Z((x, x(j)), (y, y')) is constructed. The weight w(x'') of each common related word x'' is determined on the basis of the mutual information MI(y, y'') of y and the second-language counterpart y'' of x''. The coefficient a in the weighting was set to 5 in the experiment described in Section 4.</Paragraph> <Paragraph position="6"> 2) Calculation of correlation between senses and clues. The correlation C(S(x, i), x(j)) between the i-th sense S(x, i) and the j-th clue x(j) is defined in terms of the mutual information MI(x, x(j)) of x and x(j) and the plausibility A((x, x(j)), (y, y'), S(x, i)) of the alignment of (x, x(j)) with (y, y') suggesting S(x, i); the latter is defined as the weighted sum of the correlations between the sense and the common related words, A((x, x(j)), (y, y'), S(x, i)) = Σ_{x'' ∈ Z((x, x(j)), (y, y'))} w(x'') C(S(x, i), x'').</Paragraph> <Paragraph position="7"> The correlations between senses and clues are calculated iteratively, starting from the initial values C(S(x, i), x(j)) = MI(x, x(j)). The number of iterations was set to 6 in the experiment. Figure 1 shows how the correlation values converge.</Paragraph>
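<Paragraph> The following Python sketch (ours; the update rule is simplified, and in particular the normalization over senses and the selection of the best alignment are our assumptions, not a quotation of Kaji and Morimoto's formula) shows the overall shape of the iterative calculation: correlations start from MI(x, x(j)) and are repeatedly redistributed among the senses according to alignment plausibilities: </Paragraph>

```python
# Hypothetical sketch of the iterative sense-vs.-clue correlation update.
# senses: list of sense ids; clues: list of clue words x(j)
# mi[j]: MI(x, x(j)); alignments[j]: list of weighted sets Z of common
# related words for the alignments of (x, x(j)), each {clue_word: weight}.

def correlations(senses, clues, mi, alignments, iterations=6):
    idx = {c: j for j, c in enumerate(clues)}
    # initial values: C(S(x, i), x(j)) = MI(x, x(j)) for every sense
    C = {(i, j): mi[j] for i in senses for j in range(len(clues))}
    for _ in range(iterations):
        new_C = {}
        for j in range(len(clues)):
            plaus = {}
            for i in senses:
                # plausibility A: weighted sum of correlations between the
                # sense and the common related words; best alignment kept
                plaus[i] = max((sum(w * C[(i, idx[z])] for z, w in Z.items()
                                    if z in idx)
                                for Z in alignments[j]), default=0.0)
            total = sum(plaus.values())  # assumed normalization over senses
            for i in senses:
                new_C[(i, j)] = mi[j] * plaus[i] / total if total else 0.0
        C = new_C
    return C
```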
<Paragraph position="8"> Advantages of using translingually aligned distribution patterns. Translingual distributional word clustering has advantages over conventional monolingual distributional word clustering when it is used to cluster the translation equivalents of a target word. First, it prevents clusters from being degraded by polysemous translation equivalents. Let race be the target word. One of its translation equivalents, レース <REESU>, is a polysemous word meaning lace as well as race. Under monolingual distributional word clustering, レース <REESU> is characterized by a mixture of the distribution pattern of レース <REESU> meaning race and that of レース <REESU> meaning lace, which often results in degraded clusters. In contrast, under translingual distributional word clustering, レース <REESU> is characterized by the distribution pattern for the sense of race that means competition.</Paragraph> <Paragraph position="9"> Second, translingual distributional word clustering can exclude from the clusters translation equivalents that are irrelevant to the corpus. For example, a bilingual dictionary renders 特徴 <TOKUCHOU> (feature) as a translation of race, but that sense of race is used infrequently. If that is the case in a given domain, 特徴 <TOKUCHOU> has low correlation with most words related to race and can therefore be excluded from any cluster.</Paragraph> <Paragraph position="10"> We should also mention the data-sparseness problem that hampers distributional word clustering. Generally speaking, the problem becomes more difficult in translingual distributional word clustering, since the sparseness of the data in the two languages is multiplied. However, the sense-vs.-clue correlation matrix calculation method overcomes this difficulty: it calculates the correlations between senses and clues iteratively so as to smooth out the sparse data.</Paragraph> <Paragraph position="11"> Translingual distributional word clustering could also be implemented on the basis of word-for-word alignment of a parallel corpus. However, the availability of large parallel corpora is extremely limited. In contrast, the sense-vs.-clue correlation calculation method accepts comparable corpora, which are available in many domains.</Paragraph> </Section> <Section position="3" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 2.3 Similarity based on subordinate distribution pattern </SectionTitle> <Paragraph position="0"> Naive translingual distributional word clustering based on the sense-vs.-clue correlation matrix calculation method would proceed in the following steps: 1) Define a sense of the target word by using each translation equivalent. 2) Calculate the sense-vs.-clue correlation matrix for the set of senses resulting from step 1). 3) Calculate similarities between senses on the basis of the distribution patterns given by the sense-vs.-clue correlation matrix. 4) Cluster the senses by using a hierarchical agglomerative clustering method, e.g., the group-average method.</Paragraph> <Paragraph position="1"> However, this naive method is not effective, because some senses usually have duplicated definitions in step 1), whereas the sense-vs.-clue correlation matrix calculation algorithm presupposes a set of senses without duplicated definitions. The algorithm is based on the one-sense-per-collocation hypothesis, and it results in each clue having a high correlation with one and only one sense. A clue can never have high correlations with two or more senses, even when they are actually the same sense. Consequently, synonymous translation equivalents do not necessarily have high similarity.</Paragraph> <Paragraph position="2"> Figure 2(a) shows parts of the distribution patterns for {promotion, 宣伝 <SENDEN>}, {promotion, プロモーション <PUROMOUSHON>}, and {promotion, 売り込み <URIKOMI>}, all of which define the sales activity sense of promotion. We see that most clues for selecting that sense have a higher correlation with {promotion, 宣伝 <SENDEN>} than with {promotion, プロモーション <PUROMOUSHON>} or {promotion, 売り込み <URIKOMI>}. This is because 宣伝 <SENDEN> is the most dominant translation equivalent of promotion in the corpus.</Paragraph> <Paragraph position="3"> To resolve this problem, we calculated the sense-vs.-clue correlation matrix not only for the full set of senses but also for the sets of senses each excluding one of those senses, as illustrated in the sketch below. Excluding the definition of a sense that includes the most dominant translation equivalent allows most clues for selecting that sense to have their highest correlations with another definition of the same sense, namely the one that includes the second most dominant translation equivalent.</Paragraph>
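<Paragraph> The effect can be shown with a small hypothetical example (ours; the correlation values are invented). Removing the dominant definition from the competition lets the clues reveal the next most dominant definition of the same sense: </Paragraph>

```python
# Which active sense does each clue select, optionally excluding one sense?
def best_sense_per_clue(C, senses, n_clues, excluded=None):
    pool = [s for s in senses if s != excluded]
    return [max(pool, key=lambda s: C[(s, j)]) for j in range(n_clues)]

# Hypothetical correlations of one clue of "promotion" with three senses.
C = {("SENDEN", 0): 2.0, ("PUROMOUSHON", 0): 1.2, ("URIKOMI", 0): 0.7}
senses = ["SENDEN", "PUROMOUSHON", "URIKOMI"]
print(best_sense_per_clue(C, senses, 1))                     # ['SENDEN']
print(best_sense_per_clue(C, senses, 1, excluded="SENDEN"))  # ['PUROMOUSHON']
```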
<Paragraph position="4"> Figure 2(b) shows parts of the distribution patterns for {promotion, プロモーション <PUROMOUSHON>} and {promotion, 売り込み <URIKOMI>} given by the sense-vs.-clue correlation matrix for the set of senses excluding {promotion, 宣伝 <SENDEN>}. We see that most clues for selecting the sales activity sense have higher correlations with {promotion, プロモーション <PUROMOUSHON>} than with {promotion, 売り込み <URIKOMI>}. This is because プロモーション <PUROMOUSHON> is the second most dominant translation equivalent in the corpus. We also see that the distribution pattern for {promotion, プロモーション <PUROMOUSHON>} in Fig. 2(b) is more similar to that for {promotion, 宣伝 <SENDEN>} in Fig. 2(a) than to that for {promotion, プロモーション <PUROMOUSHON>} in Fig. 2(a).</Paragraph> <Paragraph position="5"> We call the distribution pattern for a sense S', given by the sense-vs.-clue correlation matrix for the set of senses excluding a sense S, the distribution pattern for S' subordinate to S, while we call the distribution pattern for S', given by the sense-vs.-clue correlation matrix for the full set of senses, simply the distribution pattern for S'.</Paragraph> <Paragraph position="6"> Calculating the sense-vs.-clue correlation matrix for sets of senses excluding only one sense is, of course, insufficient, since three or more translation equivalents may represent the same sense of the target word. We should calculate the sense-vs.-clue correlation matrices, both for the full set of senses and for the sets of senses excluding one of the senses, again after merging similar senses into one. Repeating this procedure draws up corpus-relevant but less dominant translation equivalents, while corpus-irrelevant ones are never drawn up. Thus, a hierarchy of corpus-relevant senses, i.e., clusters of corpus-relevant translation equivalents, is produced.</Paragraph> </Section> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Proposed Method </SectionTitle> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 Outline </SectionTitle> <Paragraph position="0"> As shown in Fig. 3, our method repeats the following three steps (a schematic of this loop is sketched after this subsection): 1) Calculate sense-vs.-clue correlation matrices, both for the full set of senses and for the sets of senses obtained by excluding each of these senses. 2) Calculate similarities between senses on the basis of distribution patterns and subordinate distribution patterns. 3) Merge each pair of senses with high similarity into one.</Paragraph> <Paragraph position="1"> The initial set of senses is given as S(x) = {{x, y1}, {x, y2}, ..., {x, yn}}, where y1, y2, ..., yn are the translation equivalents of x in the second language. Translation equivalents that occur less frequently in the second-language corpus can be excluded from the initial set to shorten the processing time. The details of the steps are described in the following sections.</Paragraph> </Section>
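<Paragraph> A Python schematic of this loop (ours; `similarities` is a hypothetical callable standing in for steps 1) and 2), and the mutual-most-similar merging criterion with threshold s anticipates Section 3.4): </Paragraph>

```python
# Outer loop of the proposed method: recompute similarities, merge every
# pair of mutually most-similar senses, and repeat until no merger occurs.
def cluster_senses(senses, similarities, threshold=0.25):
    # senses: list of sets of translation equivalents defining each sense
    history = []  # record of mergers; yields the final dendrogram
    while True:
        sim = similarities(senses)  # sim[(i, j)]: asymmetric similarity
        n = len(senses)
        best = [max((j for j in range(n) if j != i),
                    key=lambda j, i=i: sim[(i, j)]) for i in range(n)]
        merged = [(i, best[i]) for i in range(n)
                  if best[best[i]] == i and i < best[i]
                  and min(sim[(i, best[i])], sim[(best[i], i)]) >= threshold]
        if not merged:
            return senses, history
        flat = {k for pair in merged for k in pair}
        senses = ([s for k, s in enumerate(senses) if k not in flat]
                  + [senses[i] | senses[j] for i, j in merged])
        history.append(merged)
```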
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Calculation of sense-vs.-clue correlation matrices </SectionTitle> <Paragraph position="0"> First, a sense-vs.-clue correlation matrix is calculated for the full set of senses. The resulting correlation matrix is denoted as C; that is, C(i, j) is the correlation between the i-th sense S(x, i) of a target word x and its j-th clue x(j).</Paragraph> <Paragraph position="1"> Then a set of active senses, S_A(x), is determined. A sense is regarded as active if and only if the ratio of clues with which it has the highest correlation exceeds a predetermined threshold th (in the experiment in Section 4, th was set to 0.05). That is, S_A(x) = { S(x, i) | |{ x(j) | C(i, j) = max_k C(k, j) }| / |X(x)| > th }. Active senses are assumed to be relevant to the corpus.</Paragraph> <Paragraph position="2"> Finally, a sense-vs.-clue correlation matrix is calculated for the set of senses excluding each of the active senses. The correlation matrix calculated for the set of senses excluding the k-th sense is denoted as C_{-k}; that is, C_{-k}(i, j) is the correlation between the i-th sense and the j-th clue calculated with the k-th sense excluded. C_{-k}(k, j) (j = 1, 2, ...) are set to zero; this redundant k-th row is included to maintain the same correspondence between rows and senses as in C.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 Calculation of sense similarity matrix </SectionTitle> <Paragraph position="0"> The similarity of the i-th sense S(x, i) to the j-th sense S(x, j), Sim(S(x, i), S(x, j)), is defined as the similarity of the distribution pattern for S(x, i) subordinate to S(x, j) to the distribution pattern for S(x, j). Note that this similarity is asymmetric and reflects which sense is more dominant in the corpus: it is probable that Sim(S(x, i), S(x, j)) is large but Sim(S(x, j), S(x, i)) is not, when S(x, j) is more dominant than S(x, i).</Paragraph> <Paragraph position="1"> According to the sense-vs.-clue correlation matrix, each sense is characterized by a weighted set of clues. Therefore, we used the weighted Jaccard coefficient as the similarity measure. That is, Sim(S(x, i), S(x, j)) = Σ_m min(C_{-j}(i, m), C(j, m)) / Σ_m max(C_{-j}(i, m), C(j, m)).</Paragraph> <Paragraph position="2"> It should be noted that a sense is characterized by a different weighted set of clues depending on the sense against which its similarity is calculated. Note also that inactive senses are neglected, because they are not reliable.</Paragraph> </Section>
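<Paragraph> A small Python sketch of this measure (ours; the weighted Jaccard formula above is reconstructed from the text, and the row pairing follows the definition of subordinate distribution patterns in Section 2.3): </Paragraph>

```python
# Asymmetric sense similarity: compare the pattern of sense i subordinate
# to sense j (row i of C_{-j}) with the ordinary pattern of sense j.
def weighted_jaccard(u, v):
    """u, v: {clue: weight} dicts with non-negative weights."""
    keys = set(u) | set(v)
    num = sum(min(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
    den = sum(max(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0

def sense_similarity(i, j, C, C_excl):
    """C[j]: {clue: weight} row of the full matrix for sense j;
    C_excl[j][i]: row i of the matrix computed with sense j excluded."""
    return weighted_jaccard(C_excl[j][i], C[j])
```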
<Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.4 Merging similar senses </SectionTitle> <Paragraph position="0"> The set of senses is updated by merging every pair of mutually most-similar senses into one. That is, S(x, i) and S(x, j) are merged into S(x, i) ∪ S(x, j) when each is the sense most similar to the other and their similarities are not less than s, a predetermined threshold introduced to avoid noisy pairs of senses being merged. In the experiment in Section 4, s was set to 0.25.</Paragraph> <Paragraph position="1"> If at least one pair of senses is merged, the whole procedure, i.e., from the calculation of the sense-vs.-clue matrices through the merging of similar senses, is repeated for the updated set of senses. Otherwise, the clustering procedure terminates.</Paragraph> <Paragraph position="2"> Agglomerative clustering methods usually suffer from the problem of when to terminate merging. In our method, the similarity of the senses that are merged does not necessarily decrease monotonically, which makes the problem more difficult. At present, we are forced to output a dendrogram that represents the history of mergers and to leave the final decision to humans. The dendrogram consists of the translation equivalents that are included in active senses in the final cycle; the other translation equivalents are rejected as irrelevant to the corpus.</Paragraph> </Section> </Section> <Section position="6" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Experimental Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 Experimental settings </SectionTitle> <Paragraph position="0"> Our method was evaluated through an experiment using a Wall Street Journal corpus (189 Mbytes) and a Nihon Keizai Shimbun corpus (275 Mbytes).</Paragraph> <Paragraph position="1"> First, pairs of related words, which we restricted to nouns and unknown words, were collected from each corpus by extracting pairs of words co-occurring within a window, calculating the mutual information of each pair, and selecting the pairs whose mutual information exceeded a threshold. The size of the window was 25 words, excluding function words, and the threshold for mutual information was set to zero. Second, a bilingual dictionary was prepared by collecting pairs of nouns that were translations of one another from the Japan Electronic Dictionary Research Institute (EDR) English-to-Japanese and Japanese-to-English dictionaries. The resulting dictionary includes 633,000 pairs covering 269,000 English nouns and 276,000 Japanese nouns.</Paragraph> <Paragraph position="2"> Evaluating the performance of word sense acquisition methods is not a trivial task. First, we do not have a gold-standard sense inventory; even if we had one, we would have difficulty mapping acquired senses onto its senses. Second, there is no way to establish the complete set of senses appearing in a large corpus. Therefore, we evaluated our method on a limited number of target words, as follows.</Paragraph> <Paragraph position="3"> We prepared a standard sense inventory by selecting 60 English target words and manually defining an average of 3.4 senses per target word. The senses were rather coarse-grained; i.e., they nearly corresponded to groups of translation equivalents within the entries of everyday English-Japanese dictionaries. We then sampled 100 instances per target word from the Wall Street Journal corpus and sense-tagged them manually. Thus, we estimated the ratio of each sense in the training corpus for each target word.</Paragraph> <Paragraph position="4"> We defined two evaluative measures: recall of senses and accuracy of sense definitions. The recall of senses is the proportion of senses with ratios not less than a threshold that are successfully extracted, and it varies as the threshold changes. We judged that a sense was extracted when it shared at least one translation equivalent with some active sense in the final cycle.</Paragraph> <Paragraph position="5"> To evaluate the accuracy of sense definitions while avoiding mapping acquired senses onto those in the standard sense inventory, we regard a set of senses as a set of pairs of synonymous translation equivalents. Let T be the set consisting of pairs of translation equivalents belonging to the same sense in the standard sense inventory. Likewise, let T(k) be the set consisting of pairs of translation equivalents belonging to the same active sense in the k-th cycle. Further, let U be the set of pairs of translation equivalents that are included in active senses in the final cycle. The recall and precision of pairs of synonymous translation equivalents in the k-th cycle are then defined as R(k) = |T ∩ T(k)| / |T ∩ U| and P(k) = |T ∩ T(k)| / |T(k)|. The F-measure, their harmonic mean, indicates how well the set of active senses coincides with the set of sense definitions in the standard sense inventory. Although the current method cannot determine the optimum cycle, humans can identify the set of appropriate senses from a hierarchy of senses at a glance. Therefore, we define the accuracy of sense definitions as the maximum F-measure over all cycles.</Paragraph> </Section>
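<Paragraph> Before turning to the results, the pair-based measures can be summarized in a short Python sketch (ours; normalizing recall by T ∩ U follows our reading of the definitions above, since U restricts the gold pairs to corpus-relevant translation equivalents): </Paragraph>

```python
# Pair-based evaluation of acquired sense definitions.
def pair_f_measure(T, Tk, U):
    """T: gold pairs of synonymous translation equivalents; Tk: pairs in
    active senses at cycle k; U: pairs over equivalents in the final cycle."""
    recall = len(T & Tk) / len(T & U) if T & U else 0.0
    precision = len(T & Tk) / len(Tk) if Tk else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Accuracy of sense definitions: the maximum F-measure over all cycles.
def accuracy(T, cycles, U):
    return max(pair_f_measure(T, Tk, U) for Tk in cycles)
```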
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.2 Experimental results </SectionTitle> <Paragraph position="0"> To simplify the evaluation procedure, we clustered the translation equivalents that were used to define the senses of each target word in the standard sense inventory, rather than all the translation equivalents rendered by the EDR bilingual dictionary. The recall of senses, over a total of 201 senses of the 60 target words, was 96% for senses with ratios not less than 25%, 87% for senses with ratios not less than 5%, and 78% for senses with ratios not less than 1%. The accuracy of sense definitions, averaged over the 60 target words, was 77%.</Paragraph> <Paragraph position="1"> The computational efficiency of our method proved acceptable: it took 13 minutes per target word on an HP9000 C200 workstation (CPU clock: 200 MHz; memory: 32 MB) to produce a hierarchy of clusters of translation equivalents.</Paragraph> <Paragraph position="2"> Some clustering results are shown in Fig. 4. They demonstrate that the proposed method shows a great deal of promise. At the same time, evaluating the results revealed its deficiencies. The first lies in the crucial role of the bilingual dictionary: obviously, a sense can never be extracted if the translation equivalents representing it are not included in the dictionary. An exhaustive bilingual dictionary is therefore required; from this point of view, the EDR bilingual dictionary is fairly good. The second deficiency is that the method performs badly on low-frequency or non-topical senses. For example, the sense of bar as the legal profession was clearly extracted, but its sense as a piece of solid material was not.</Paragraph> <Paragraph position="3"> We also compared our method with two alternatives: the monolingual distributional clustering mentioned in Section 2.2 and the naive translingual clustering mentioned in Section 2.3. Figures 5(a), (b), and (c) show examples of the clustering obtained by our method, the monolingual method, and the naive translingual method, respectively. Comparing (a) with (b) reveals the superiority of the translingual approach over the monolingual approach, and comparing (a) with (c) reveals the effectiveness of the subordinate distribution patterns introduced in Section 2.3. Note that deleting the corpus-irrelevant translation equivalents from the dendrograms in (b) and (c) would not make them appropriate.</Paragraph> </Section> </Section> </Paper>