<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1192"> <Title>Fine-Grained Word Sense Disambiguation Based on Parallel Corpora, Word Alignment, Word Clustering and Aligned Wordnets</Title> <Section position="2" start_page="0" end_page="2" type="intro"> <SectionTitle> 1 Introduction Word Sense Disambiguation (WSD) is well- </SectionTitle> <Paragraph position="0"> known as one of the more difficult problems in the field of natural language processing, as noted in (Gale et al., 1992; Kilgarriff, 1997; Ide and Veronis, 1998), and others. The difficulties stem from several sources, including the lack of means to formalize the properties of context that characterize the use of an ambiguous word in a given sense, the lack of a standard (and possibly exhaustive) sense inventory, and the subjectivity of the human evaluation of such algorithms. To address the last problem, (Gale et al., 1992) argue for upper and lower bounds of precision when comparing automatically assigned sense labels with those assigned by human judges. The lower bound should not drop below the baseline performance of the algorithm (in which every word to be disambiguated is assigned its most frequent sense), whereas the upper bound should not be too restrictive when the word in question is hard to disambiguate even for human judges (a measure of this difficulty is the agreement rate between human annotators).</Paragraph> <Paragraph position="1"> Identification and formalization of the determining contextual parameters for a word used in a given sense is the focus of WSD work that treats texts in a monolingual setting--that is, a setting where translations of the texts in other languages either do not exist or are not considered. This focus is based on the assumption that for a given word w and two of its senses, the contexts in which each sense occurs have distinguishable properties. A formalized definition of context for a given sense would then enable a WSD system to accurately assign sense labels to occurrences of w in unseen texts. 
Attempts to characterize context for a given sense of a word have addressed a variety of factors:
* Context length: what is the size of the window of text that should be considered to determine context? Should it consist of only a few words, or include much larger portions of text?
* Context content: should all context words be considered, or only selected words (e.g., only words with a certain part of speech or in certain grammatical relations to the target word)? Should they be weighted based on distance from the target or treated as a &quot;bag of words&quot;?
* Context formalization: how can context information be represented to enable the definition of an inter-context equivalence function? Is there a single representation appropriate for all words, or does it vary according to, for example, the word's part of speech?
The use of multi-lingual parallel texts provides a very different approach to the problem of context identification and characterization.</Paragraph> <Paragraph position="2"> &quot;Context&quot; now becomes the word(s) by which the target word (i.e., the word to be disambiguated) is translated in one or more other languages. The assumption here is that different senses of a word are likely to be lexicalized differently in different languages; therefore, the translation can be used to identify the correct sense of a word. Effectively, the translation captures the context as the translator conceived it. The use of parallel translations for sense disambiguation raises a different set of issues, primarily because the assumption that different senses are lexicalized differently in different languages holds only to an extent. For instance, it is well known that many ambiguities are preserved across languages (e.g., the French intérêt and the English interest), especially languages that are relatively closely related. 
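The &quot;bag of words&quot; option above can be made concrete with a small sketch (a hypothetical illustration; the function and parameter names are ours, not part of any system described in this paper):

```python
def context_bag(tokens, target_index, window=5, stopwords=frozenset()):
    """Collect a bag-of-words context: unordered counts of the words
    within `window` positions of the target, ignoring stopwords."""
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    bag = {}
    for i in range(lo, hi):
        if i == target_index:
            continue  # the target word itself is not part of its context
        w = tokens[i].lower()
        if w in stopwords:
            continue
        bag[w] = bag.get(w, 0) + 1
    return bag
```

A weighted variant would replace the unit increment with a weight that decays with distance from the target, corresponding to the alternative raised in the bullet on context content.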
This raises new questions: how many languages, and of which types (e.g., closely related languages, languages from different language families), provide adequate information for this purpose? How do we measure the degree to which different lexicalizations provide evidence for a distinct sense? We have addressed these questions in experiments involving sense clustering based on translation equivalents extracted from parallel corpora (Ide, 1999; Ide et al., 2002). Tufis and Ion (2003) build on this work and further describe a method to accomplish a &quot;neutral&quot; labelling for the sense clusters in Romanian and English that is not bound to any particular sense inventory.</Paragraph> <Paragraph position="3"> Our experiments confirm that the accuracy of word sense clustering based on translation equivalents depends heavily on the number and diversity of the languages in the parallel corpus and on the language register of the parallel text. For example, using six source languages from three language families (Romance, Slavic and Finno-Ugric), sense clustering of English words was approximately 74% accurate; when fewer languages and/or languages from less diverse families are used, accuracy drops dramatically. This drop results from the decreased chance that two or more senses of an ambiguous word in one language will be lexicalized differently in another when fewer, and more closely related, languages are considered.</Paragraph> <Paragraph position="4"> To enhance our results, we have explored the use of additional resources, in particular the aligned wordnets in BalkaNet (Tufis et al., 2004a).</Paragraph> <Paragraph position="5"> BalkaNet is a European project that is developing monolingual wordnets for five Balkan languages (Bulgarian, Greek, Romanian, Serbian, and Turkish) and improving the Czech wordnet developed in the EuroWordNet project. 
The wordnets are aligned to the Princeton WordNet (PWN2.0), taken as an interlingual index, following the principles established by the EuroWordNet consortium. The underlying hypothesis in this experiment exploits the common intuition that reciprocal translations in parallel texts should have the same (or closely related) interlingual meanings (in terms of BalkaNet, interlingual index (ILI) codes).</Paragraph> <Paragraph position="6"> However, this hypothesis is reasonable only if the monolingual wordnets are reliable and correctly linked to the interlingual index (ILI). Quality assurance of the wordnets is a primary concern in the BalkaNet project, and to this end the consortium developed several methods and tools for validation, described in various papers authored by BalkaNet consortium members (see Proceedings of the Global WordNet Conference, Brno, 2004).</Paragraph> <Paragraph position="7"> We previously implemented a language-independent disambiguation program, called WSDtool, which has been extended to serve as a multilingual wordnet checker and specialized editor for error correction. In (Tufis et al., 2004) it was demonstrated that the tool detected several interlingual alignment errors that had escaped human analysis. In this paper, we describe a disambiguation experiment that exploits the ILI information in the corrected wordnets. For each pair of languages for which there are aligned wordnets, the procedure is: (1) extract all pairs of lexical items &lt;W1, W2&gt; that are reciprocal translations in the parallel text; (2) look up the ILI codes for the synsets that contain W1 and W2, respectively, to yield two lists of ILI codes; (3) select the pair whose members are the most similar ILI codes (defined below) among the candidate pairs.</Paragraph> <Paragraph position="13"> The accuracy of step 1 is essential for the success of the validation method. 
A recent shared task evaluation of different word aligners (www.cs.unt.edu/~rada/wpt), organized on the occasion of the NAACL conference, showed that step 1 may be solved quite reliably.</Paragraph> <Paragraph position="14"> Our system (Tufis et al. 2003) produced lexicons relevant for wordnet evaluation, with an aggregated F-measure as high as 84.26%.</Paragraph> <Paragraph position="15"> Meanwhile, the word aligner was further improved, so that current performance on the same data is about 1% better on all word alignment scores and about 2% better on wordnet-relevant dictionaries. The word alignment problem includes cases of null alignment, where words in one part of the bitext are not translated in the other part, and cases of expression alignment, where multiple words in one part of the bitext are translated as one or more words in the other part. Word alignment algorithms typically do not take into account the part of speech (POS) of the words comprising a translation equivalence pair, since cross-POS translations are rather frequent. However, for the aligned wordnet-based word sense disambiguation we discard both translation pairs which do not preserve the POS and null alignments. Multiword expressions included in a wordnet are dealt with by the underlying tokenizer. Therefore, we consider only one-to-one, POS-preserving alignments.</Paragraph> <Paragraph position="16"> Once the translation equivalents are extracted, then, for any translation equivalence pair &lt;W1, W2&gt; and two aligned wordnets,</Paragraph> <Paragraph position="18"> steps 2 and 3 above should ideally identify one ILI concept lexicalized by W1 in language L1 and by W2 in language L2. However, for various reasons, the wordnet alignment might reveal not the same ILI concept, but two concepts which are semantically close enough to license the translation equivalence of W1 and W2.</Paragraph> <Paragraph position="20"> The method can be easily generalized to more than two languages. 
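The filtering just described (discarding null alignments and cross-POS pairs so that only one-to-one, POS-preserving alignments remain) might be sketched as follows; the data shapes are illustrative assumptions, not the actual aligner output format:

```python
def filter_alignments(pairs):
    """Keep only one-to-one, POS-preserving alignment pairs.
    Each pair is ((word1, pos1), (word2, pos2)); a None member
    marks a null alignment (an untranslated word)."""
    kept = []
    for left, right in pairs:
        if left is None or right is None:   # null alignment: discard
            continue
        (w1, p1), (w2, p2) = left, right
        if p1 != p2:                        # cross-POS translation: discard
            continue
        kept.append((w1, w2))
    return kept
```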
Our measure of interlingual concept semantic similarity is based on the PWN2.0 structure. We compute the semantic-similarity score by the formula SS(ILI1, ILI2) = 1/(1+k),</Paragraph> <Paragraph position="22"> where k is the number of links separating ILI1 and ILI2 through their nearest common ancestor. The semantic similarity score is 1 when the two concepts are identical, 0.5 for mother/daughter, whole/part, or concepts related by a single link, and 0.33 for two sister concepts. Based on empirical studies, we decided to set the significance threshold of the semantic similarity score to 0.33. Other approaches to similarity measures are described in (Budanitsky and Hirst, 2001).</Paragraph> <Paragraph position="23"> In order to describe the algorithm for WSD based on aligned wordnets, let us assume we have a parallel corpus containing texts in k+1 languages, one of which is the target language T and the other k of which are the source languages, together with monolingual wordnets for each of the k+1 languages, interlinked via an ILI-like structure. For each source language and for all occurrences of a specific word in the target language T, we build a matrix of translation equivalents as shown in Table 1 (eq(i,j) represents the translation equivalent in the i-th source language of the j-th occurrence of the target word); a missing translation equivalent</Paragraph> <Paragraph position="25"> is represented by the null string.</Paragraph> <Paragraph position="26"> The second step transforms the matrix in Table 1 into a VSA (Validation and Sense Assignment) matrix with the same dimensions. Where the translation equivalent is the null string, VSA(i,j) is undefined; otherwise, it is a set containing 0, 1, or more ILI codes. For undefined VSAs, the algorithm cannot determine the sense number for the corresponding occurrence of the target word. 
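A score consistent with the stated values (1 for identical concepts, 0.5 for concepts one link apart, 0.33 for sisters) is 1/(1+k). A toy sketch, under the simplifying assumption that each ILI code has at most one hypernym:

```python
def similarity(ili1, ili2, hypernym):
    """Semantic similarity 1/(1+k), where k is the number of links
    separating the two ILI codes through their nearest common
    ancestor. `hypernym` maps each code to its parent code."""
    def ancestors(n):
        path = [n]
        while n in hypernym:
            n = hypernym[n]
            path.append(n)
        return path
    p1, p2 = ancestors(ili1), ancestors(ili2)
    common = set(p1) & set(p2)
    if not common:
        return 0.0          # no common ancestor: unrelated concepts
    # k = shortest combined path length up to a shared ancestor
    k = min(p1.index(c) + p2.index(c) for c in common)
    return 1.0 / (1 + k)
```

With this definition, identical concepts give k=0 (score 1), a mother/daughter pair gives k=1 (score 0.5), and sisters give k=2 (score 0.33), matching the values quoted in the text.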
However, it is very unlikely that an entire column in Table 2 is undefined, i.e., that there is no translation equivalent for an occurrence of the target word in any of the source languages.</Paragraph> <Paragraph position="27"> When VSA(i,j) contains a single ILI code, the target occurrence and its translation equivalent are assigned the same sense.</Paragraph> <Paragraph position="28"> When VSA(i,j) is empty--i.e., when none of the senses of the target word corresponds to an ILI code to which a sense of the translation equivalent was linked--the algorithm selects the candidate pair of ILI codes with the highest semantic similarity score.</Paragraph> <Paragraph position="30"> If no candidate pair has a semantic similarity score above the significance threshold, neither the occurrence of the target word nor its translation equivalent can be semantically disambiguated; but once again, it is extremely rare that this happens in all of the source languages.</Paragraph> <Paragraph position="31"> In case of ties, the pair corresponding to the most frequent sense of the target word in the current bitext pair is selected. If this heuristic in turn fails, the choice is made in favor of the pair corresponding to the lowest PWN2.0 sense number for the target word, since PWN senses are ordered by frequency.</Paragraph> <Paragraph position="32"> When the VSA cell contains two or more ILI codes, we have a case of cross-lingual ambiguity, i.e., two or more senses are common to the target word and the corresponding translation equivalent in the i-th language.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.1 Agglomerative clustering </SectionTitle> <Paragraph position="0"> As noted before, when VSA(i,j) is undefined, we may get the information from a VSA corresponding to the same occurrence of the target word in a different language. 
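The tie-breaking heuristics just described can be sketched as follows (names and data shapes are illustrative assumptions, not the WSDtool implementation): among the candidate ILI codes in a VSA cell, prefer the sense most frequent in the current bitext, and fall back on the lowest PWN sense number.

```python
def assign_sense(vsa_cell, freq_in_bitext, pwn_sense_number):
    """Pick a sense from a VSA cell (a set of candidate ILI codes).
    Ties are broken first by the frequency of the sense in the
    current bitext, then by the lowest PWN sense number (PWN senses
    are ordered by frequency)."""
    if not vsa_cell:
        return None   # undefined or empty cell: handled elsewhere
    return min(vsa_cell,
               key=lambda ili: (-freq_in_bitext.get(ili, 0),
                                pwn_sense_number[ili]))
```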
However, this demands that aligned wordnets be available for all languages in the parallel corpus, and that the quality of the interlingual linking be high for all languages concerned. Where these requirements cannot be met, we rely on a &quot;backoff&quot; method involving sense clustering based on translation equivalents, as discussed in (Ide et al., 2002). We apply the clustering method after the wordnet-based method has been applied; therefore, each cluster containing an undisambiguated occurrence of the target word will typically also contain several occurrences that have already been assigned a sense. We can then assign the most frequent sense in the cluster to the previously unlabeled occurrences within the same cluster. The combined approach has two main advantages:
* it eliminates reliance on high-quality aligned wordnets for all languages in the corpus. Indeed, having k+1 languages in our corpus, we need only apply the WSD method to the aligned wordnets for the target language (English in our case) and one source</Paragraph> <Paragraph position="2"> language, and alignment lexicons from the target language to every other language in the corpus. The WSD procedure in the bilingual setting would ensure the sense assignment for most of the non-null translation equivalence pairs, and the clustering algorithm would classify the target words which were not translated (or for which the word alignment algorithm didn't find a translation equivalent)</Paragraph> <Paragraph position="4"> based on their equivalents in the other k-1 source languages.
* it can reinforce or modify the sense assignment decided by the tie heuristics in case of cross-lingual ambiguity.</Paragraph> <Paragraph position="5"> To perform the clustering, we derive a set of m binary vectors</Paragraph> <Paragraph position="7"> VECT(Lp, TWi) for each source language Lp and each target word TWi occurring m times in the corpus. 
To compute the vectors, we first construct a Dictionary Entry List DEL(Lp, TWi) = {w | &lt;TWi, w&gt; is a translation equivalence pair}, comprising the ordered list of all the translation equivalents in the source language Lp of the target word TWi. In this part of the experiment, the translation equivalents are automatically extracted from the parallel corpus using a hypothesis testing algorithm described in previous work. We use a Hierarchical Clustering Algorithm based on Stolcke's Cluster2.9 to classify similar vectors into sense classes. Stolcke's algorithm generates a clustering tree, the root of which corresponds to a baseline clustering (all the occurrences are clustered into one sense class) and the leaves of which are single-element classes, corresponding to each occurrence vector of the target word. An interior cut in the clustering tree produces a specific number (say X) of subtrees, the roots of which stand for X classes, each containing the vectors of its leaves. We call an interior cut a pertinent cut if X is equal to the number of senses in which TWi has been used throughout the entire corpus. One should note that in a clustering tree many pertinent cuts may be possible. The pertinent cut which corresponds to the correct sense clustering of the m occurrences of TWi is called a perfect cut. However, if TWi has Y possible senses, it is possible that only a subset of the Y senses will be used in an arbitrary text. Therefore, a perfect cut in a clustering tree cannot be deterministically computed. Instead of deriving the full clustering tree and guessing at a perfect cut, we stop the clustering algorithm when Z clusters have been created, where Z is the number of senses in which the occurrences of TWi</Paragraph> <Paragraph position="9"> have been used in the text in question.</Paragraph> <Paragraph position="10"> However, the value of Z is specific to each word and depends on the type and size of the text; it cannot therefore be computed a priori. 
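A minimal sketch of the agglomerative step over binary occurrence vectors, assuming average-link Hamming distance; this is an illustration of the general bottom-up scheme, not Stolcke's Cluster2.9. It stops either when a target number of clusters z is reached or when the smallest inter-cluster distance exceeds a threshold:

```python
def agglomerate(vectors, threshold, z=None):
    """Bottom-up clustering of binary occurrence vectors. Repeatedly
    merge the two closest clusters (average Hamming distance between
    members) until only z clusters remain, or until the smallest
    inter-cluster distance exceeds `threshold`."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b)) / len(a)
    def cdist(c1, c2):
        return sum(hamming(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
    clusters = [[v] for v in vectors]
    while len(clusters) > 1 and (z is None or len(clusters) > z):
        # find the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: cdist(clusters[p[0]], clusters[p[1]]))
        if cdist(clusters[i], clusters[j]) > threshold:
            break               # distance-based exit condition
        clusters[i] += clusters.pop(j)
    return clusters
```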
In our previous work (Tufis and Ion, 2003), to approximate Z we imposed an exit condition for the clustering algorithm based on a distance heuristic. In particular, the algorithm stops when the minimal distance between the existing classes increases beyond a given threshold:
dist(k+1) - dist(k) > a (1)
where dist(k) is the minimal distance between two clusters at the k-th iteration step and a is an empirical numerical threshold. Experimentation revealed that reasonable results are achieved with a value of a = 0.12. However, although the threshold is a parameter of the clustering algorithm irrespective of the target words, the number of classes the clustering algorithm generates (Z) is still dependent on the particular target word and the corpus in which it appears.</Paragraph> <Paragraph position="11"> By using the sense information produced by the ILI-similarity approach, the algorithm and its exit condition have been modified as described below:
- the sense label of a cluster is given by the majority sense of its members as assigned by the wordnet-based sense labelling; a cluster containing only non-disambiguated occurrences has a wild-card sense label;
- two joinable clusters (that is, the clusters with the minimal distance, with the exit condition (1) not satisfied) are joined only when their sense labels are the same or one of them has a wild-card sense label; in the latter case the wild-card sense label is replaced by the sense label of the sense-assigned cluster. Otherwise, the next closest pair of clusters is tried;</Paragraph> <Paragraph position="12"> - the algorithm stops when no clusters can be further joined.</Paragraph> </Section> </Section> </Paper>