Pattern Abstraction and Term Similarity for Word Sense Disambiguation: IRST at Senseval-3

2 Semantic Domains and LSA

Domains are common areas of human discussion, such as economics, politics, law, science, etc., which are at the basis of lexical coherence. A substantial part of the lexicon is composed of domain words, which refer to concepts belonging to specific domains.

In (Magnini et al., 2002) it was claimed that domain information provides generalized features at the paradigmatic level that are useful to discriminate among word senses.

The WORDNET DOMAINS lexical resource is an extension of WORDNET which provides such domain labels for all synsets (Magnini and Cavaglià, 2000). About 200 domain labels were selected from a number of dictionaries and then structured in a taxonomy according to the Dewey Decimal Classification (DDC). The annotation methodology was mainly manual and took about two person-years.

WORDNET DOMAINS has proven to be a useful resource for WSD. However, some aspects induced us to explore further developments: (i) it is difficult to find an objective a-priori model for domains; (ii) the annotation procedure followed to develop WORDNET DOMAINS is very expensive, making it hard to replicate the resource for other languages or for domain-specific sub-languages; (iii) the domain distinctions in WORDNET DOMAINS are rigid, while a fuzzier association between domains and concepts is often more appropriate for describing term similarity.

In order to generalize the domain approach and to overcome these issues, we explored unsupervised learning on a large-scale corpus (we used the BNC corpus for all the experiments described in this paper).

In particular, we followed the LSA approach (Deerwester et al., 1990). In LSA, term co-occurrences in the documents of the corpus are captured by means of a dimensionality reduction operated on the term-by-document matrix. The resulting LSA vectors can be exploited to estimate both term and document similarity. Regarding document similarity, Latent Semantic Indexing (LSI) is a technique that allows one to represent a document by an LSA vector. In particular, we used a variation of the pseudo-document methodology described in (Berry, 1992): each document is represented in the LSA space by summing up the normalized LSA vectors of all the terms it contains.

By exploiting LSA vectors for terms, it is possible to estimate domain vectors for the synsets of WORDNET, in order to obtain similarity values between concepts that can be used for synset clustering and WSD. Thus, term and document vectors can be used instead of WORDNET DOMAINS for WSD and for other applications in which term similarity and domain relevance estimation are required.
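The pseudo-document construction described above is easy to sketch. The snippet below is a minimal illustration rather than the authors' implementation; it assumes the LSA term vectors (rows of the reduced term matrix obtained from the SVD of the term-by-document matrix) are already available in a dictionary `lsa_vectors` mapping each term to a NumPy array.

```python
import numpy as np

def pseudo_document_vector(terms, lsa_vectors):
    """Represent a text in the LSA space by summing the normalized
    LSA vectors of the terms it contains (pseudo-document technique)."""
    dims = len(next(iter(lsa_vectors.values())))
    doc_vec = np.zeros(dims)
    for term in terms:
        vec = lsa_vectors.get(term)
        if vec is None:
            continue  # terms unseen in the corpus contribute nothing
        norm = np.linalg.norm(vec)
        if norm > 0:
            doc_vec += vec / norm
    return doc_vec

def cosine(u, v):
    """Cosine similarity between two LSA vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0
```

The same routine, applied to the synonymous terms of a synset, yields the synset vectors used by DDD-LSA in the next section.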
3 All-Words systems: DDD and DDD-LSA

DDD with DRE. DDD assigns the right sense to a word in its context by comparing the domain of the context with the domain of each sense of the word. This methodology exploits WORDNET DOMAINS information to estimate both the domain of the textual context and the domain of each sense of the word to disambiguate.

The basic idea for estimating the domain relevance of a text is to exploit lexical coherence inside texts. A simple heuristic approach to this problem, used in Senseval-2, is to count the occurrences of domain words for every domain inside the text: the higher the percentage of domain words for a certain domain, the more relevant that domain is for the text.

Unfortunately, this simple local frequency count is not a good domain relevance measure, for several reasons. Irrelevant senses of ambiguous words inflate the final score of irrelevant domains, introducing noise. Moreover, the level of noise differs across domains because of their different sizes and possible differences in the ambiguity of their vocabularies. We therefore refined the original Senseval-2 DDD system with the Domain Relevance Estimation (DRE) technique. Given a certain domain, DRE distinguishes between relevant and non-relevant texts by means of a Gaussian mixture model that describes the frequency distribution of domain words inside a large-scale corpus (again the BNC). An Expectation Maximization algorithm then computes the parameters that maximize the likelihood of the model on the empirical data (Gliozzo et al., 2004).

In order to represent domain information we introduced the notion of Domain Vectors (DVs), data structures that collect domain information. These vectors are defined in a multidimensional space in which each domain represents one dimension. We distinguish between two kinds of DVs: (i) synset vectors, which represent the relevance of a synset with respect to each considered domain, and (ii) text vectors, which represent the relevance of a portion of text with respect to each domain in the considered set. The core of the DDD algorithm scores the comparison between these two kinds of vectors. The synset vectors are built from WORDNET DOMAINS, while the scoring also takes into account synset probabilities estimated on SemCor. The system uses a threshold th-cut, ranging in the interval [0,1], that allows us to tune the tradeoff between precision and recall.
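The core comparison step can be pictured with a short sketch. This is not the authors' scoring function: the paper states only that synset and text domain vectors are compared, that SemCor synset probabilities are taken into account, and that a threshold th-cut trades precision for recall, so the particular combination below (cosine similarity multiplied by the sense prior, with abstention below th-cut) is an assumption made for illustration.

```python
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

def ddd_disambiguate(text_dv, senses, th_cut=0.5):
    """Pick the sense whose synset domain vector best matches the text's
    domain vector.  `senses` is a list of (sense_id, synset_dv, prior)
    triples, where `prior` stands in for the synset probability estimated
    on SemCor (an illustrative assumption).  Returns None (no answer) when
    the best score falls below th_cut, trading recall for precision."""
    best_sense, best_score = None, 0.0
    for sense_id, synset_dv, prior in senses:
        score = cosine(text_dv, synset_dv) * prior
        if score > best_score:
            best_sense, best_score = sense_id, score
    return best_sense if best_score >= th_cut else None
```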
DDD-LSA. By exploiting the LSA vectors described in Section 2, it is possible to implement a DDD version that does not use WORDNET DOMAINS but instead exploits LSA term and document vectors to estimate synset vectors and text vectors, leaving the core of the DDD algorithm unchanged. As for text vectors, we used the pseudo-document technique also for building synset vectors: in this case we consider the synonymous terms contained in the synset itself. The system presented at Senseval-3 does not make use of any statistics from SemCor, and consequently it can be considered fully unsupervised. Results are reported in Table 3 and do not differ much from the results obtained by DDD in the same task.

4 Lexical Sample Systems: Pattern abstraction and Kernel Methods

One of the most discriminative features for lexical disambiguation is the lexical/syntactic pattern in which the word appears. A well-known issue in the WSD area is the "one sense per collocation" claim (Yarowsky, 1993), stating that word meanings are strongly associated with the particular collocation in which the word is located. Collocations are sequences of words in the context of the word to disambiguate, and can be associated with word senses by supervised learning.

Another important knowledge source for WSD is the shallow-syntactic pattern in which a word appears. Syntactic patterns, like lexical patterns, can be obtained by exploiting pattern abstraction techniques on POS sequences. In the WSD literature both lexical and syntactic patterns have been used as features in a supervised learning schema, representing each instance by the bigrams and trigrams in the surrounding context of the word to be analyzed.[2]

[2] More recently, deep-syntactic features have also been considered by several systems, for example modifiers of nouns and verbs, the object and subject of the sentence, etc. Obtaining such features requires parsing the data. We decided not to use this information, although we plan to introduce it in the near future.

Representing each instance by a bag of features presents several disadvantages from the point of view of both machine learning and computational linguistics: (1) Sparseness in the learning data: most of the collocations found in the learning data occur just once, reducing the generalization power of the learning algorithm; in addition, most of the collocations found in the test data are unseen in the training data. (2) Low flexibility for pattern abstraction purposes: the bigram and trigram extraction schemata are fixed in advance. (3) Knowledge acquisition bottleneck: the training data is not large enough to cover every possible collocation in the language.

To overcome problems 1 and 2 we investigated pattern abstraction techniques from the area of Information Extraction (IE) and adapted them to WSD. To overcome problem 3 we developed Latent Semantic Kernels, which allow us to integrate external knowledge provided by unsupervised term similarity estimation.
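For concreteness, the fragment below illustrates the kind of local bigram/trigram features just discussed. The exact feature templates and their encoding are not specified in the text, so the window layout and feature names here are illustrative assumptions.

```python
def local_features(lemmas, pos_tags, i, max_n=3):
    """Collect lemma and POS n-grams (n = 2, 3) that include position i,
    the target word.  The offset of the window relative to the target is
    encoded in the feature name so that 'w-1 w0' and 'w0 w+1' stay distinct."""
    feats = []
    for n in range(2, max_n + 1):
        for start in range(i - n + 1, i + 1):
            end = start + n
            if start < 0 or end > len(lemmas):
                continue
            offset = start - i
            feats.append(f"LEM_{n}gram[{offset}]=" + "_".join(lemmas[start:end]))
            feats.append(f"POS_{n}gram[{offset}]=" + "_".join(pos_tags[start:end]))
    return feats

# Target word "bank" at position 3.
lemmas = ["i", "sit", "on", "bank", "of", "the", "river"]
pos    = ["PRP", "VBP", "IN", "NN", "IN", "DT", "NN"]
print(local_features(lemmas, pos, 3))
```

Feature sets built this way are exactly the sparse, rigid representations that points (1) and (2) above criticize, which motivates the pattern abstraction and kernel techniques described next.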
4.1 TIES

Our first experiments were performed using TIES, an environment developed at IRST for IE that induces patterns from the marked entities during the training phase, and then applies those patterns in the test phase in order to assign a category when a pattern is satisfied. For our experiments we used the Boosted Wrapper Induction (BWI) algorithm (Freitag and Kushmerick, 2000) implemented in TIES.

For Senseval-3 we used very few features (lemma and POS). We proposed the system in this configuration as a baseline system for pattern abstraction. Our preliminary experiments with BWI have shown that pattern abstraction is very attractive for WSD, allowing us to achieve very high precision on the restricted set of words for which syntagmatic information is sufficient for disambiguation. However, we still had some restrictions. In particular, the integration of different knowledge sources for classification is not trivial.

4.2 Kernel-WSD

Our choice of kernel methods for WSD was motivated by the observation that pattern-based approaches to disambiguation are complementary to domain-based ones: they require different knowledge sources and different techniques for classification and feature description. Both approaches have to be taken into account simultaneously in order to perform accurate disambiguation. Our aim was to combine them in a common framework.

Kernel methods, e.g. Support Vector Machines (SVMs), are state-of-the-art learning algorithms and have been successfully adopted in many NLP tasks. The idea of SVMs (Cristianini and Shawe-Taylor, 2000) is to map the training data into a higher-dimensional feature space F via a mapping function Φ : X → F (where X is the input space), and to construct a separating hyperplane with maximum margin (the distance between the hyperplane and the closest points) in the new space. Generally, this yields a nonlinear decision boundary in the input space. Since the feature space is high dimensional, performing the transformation explicitly often has a high computational cost. Rather than using the explicit mapping Φ, we can use a kernel function K : X × X → R that corresponds to the inner product in a feature space which is, in general, different from the input space. A kernel function therefore provides a way to compute the separating hyperplane efficiently without explicitly carrying out the mapping into the feature space; this is called the kernel trick. In this way the kernel acts as an interface between the data and the learning algorithm by defining an implicit mapping into the feature space. Intuitively, we can see the kernel as a function that measures the similarity between pairs of objects. The learning algorithm, which compares all pairs of data items, exploits the information encoded in the kernel. An important characteristic of kernels is that they are not limited to vector objects but are applicable to virtually any kind of object representation.

In this work we use kernel methods to combine heterogeneous sources of information that we found relevant for WSD. For each of these aspects it is possible to define a kernel independently. The kernels are then combined by exploiting the property that the sum of two kernels is still a kernel (i.e. k(x, y) = k1(x, y) + k2(x, y)), taking advantage of each single contribution in an intuitive way.[3]

[3] In order to keep the kernel values comparable across kernels and independent of the length of the examples, we considered the normalized version K'(x, y) = K(x, y) / sqrt(K(x, x) K(y, y)).

The Word Sense Disambiguation Kernel is defined as

  KWSD(x, y) = KS(x, y) + KP(x, y)    (1)

where KS is the Syntagmatic Kernel and KP is the Paradigmatic Kernel.
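A minimal sketch of this combination scheme is shown below. The component kernels are placeholders (a toy bag-of-words kernel stands in for both KS and KP), but the normalization of footnote 3 and the kernel sum follow the formulas above.

```python
import math
from collections import Counter

def bow_kernel(x, y):
    """Toy component kernel: inner product of bag-of-words counts."""
    cx, cy = Counter(x.split()), Counter(y.split())
    return float(sum(cx[w] * cy[w] for w in cx))

def normalize(k):
    """K'(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y)), as in footnote 3."""
    def k_norm(x, y):
        denom = math.sqrt(k(x, x) * k(y, y))
        return k(x, y) / denom if denom > 0 else 0.0
    return k_norm

def sum_kernels(*kernels):
    """The sum of kernels is still a kernel: K(x, y) = sum_i Ki(x, y)."""
    return lambda x, y: sum(k(x, y) for k in kernels)

# Stand-in for KWSD = KS + KP, using the same toy kernel for both components.
k_wsd = sum_kernels(normalize(bow_kernel), normalize(bow_kernel))
print(k_wsd("the bank of the river", "the river bank"))
```

A Gram matrix precomputed with such a combined kernel can then be handed to any SVM implementation that accepts user-supplied kernels.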
The Syntagmatic Kernel. The syntagmatic kernel generalizes the word-sequence kernels defined by (Cancedda et al., 2003) to sequences of lemmata and POSs. Word sequence kernels are based on the following idea: two sequences are similar if they have many word sequences in common, in the same order. The similarity between two examples is assessed by the number of (possibly non-contiguous) matching word sequences, and non-contiguous occurrences are penalized according to the number of gaps they contain (see the sketch at the end of this section). For example, the sequence "I go very quickly to school" is less similar to "I go to school" than "I go quickly to school" is. Unlike the bag-of-words approach, word sequence kernels capture word order and allow gaps between words. Word sequence kernels are parametric with respect to the length of the (sparse) sequences they capture.

We defined the syntagmatic kernel as the sum of n distinct word-sequence kernels over lemmata (the Collocation Kernel, KC) and over sequences of POSs (the POS Kernel, KPOS), according to formula (2), where KC,l and KPOS,l denote the kernels restricted to sequences of length l (for our experiments we set n to 2):

  KS(x, y) = Σ_{l=1..n} [ KC,l(x, y) + KPOS,l(x, y) ]    (2)

In the above definition of the syntagmatic kernel, only exact lemma/POS matches contribute to the similarity. One shortcoming of this approach is that (near-)synonyms are never considered similar. We address this problem by allowing soft-matching of words, employing a term similarity measure based on LSA.[4] In particular, we consider two words equivalent if they have the same POS and a similarity value higher than an empirical threshold. For example, if we consider the terms "Ronaldo" and "football player" equivalent, the sequence "The football player scored the first goal" can be considered equivalent to the sentence "Ronaldo scored the first goal". The properties of kernel methods offer a flexible way to plug in additional information, in this case unsupervised (we could also take this information from a semantic network such as WORDNET).

[4] For languages other than English, we did not exploit this soft-matching or the KLSI kernel described below. See the first column of Table 5.

The Paradigmatic Kernel. The paradigmatic kernel takes into account the paradigmatic aspect of sense distinction, i.e. domain aspects (Gliozzo et al., 2004). For example, the word virus can be disambiguated by recognizing the domain of the context in which it is placed (e.g. computer science vs. biology). Usually this aspect is captured by bags of words, in analogy to the Vector Space Model widely used in Text Categorization and Information Retrieval. The main limitation of this model for WSD is the knowledge acquisition bottleneck (i.e. the lack of sense-tagged data): bag-of-words features are very sparse and require a large-scale corpus to be learned. To overcome this limitation, Latent Semantic Indexing (LSI) can provide a solution. Thus we defined the paradigmatic kernel as the sum of a traditional bag-of-words kernel and an LSI kernel (Cristianini et al., 2002), as given by formula (3):

  KP(x, y) = KBoW(x, y) + KLSI(x, y)    (3)

where KBoW computes the inner product between the vector space model representations of the two texts and KLSI computes the cosine between their LSI vectors.

Table 5 displays the performance of Kernel-WSD. As a comparison, we also report the figures on the English task without using LSA. The last column reports the recall of the most-frequent baseline.
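To make the gap-penalty idea behind the syntagmatic kernel concrete, here is a deliberately naive sketch. Real word-sequence kernels use the efficient dynamic-programming formulation of (Cancedda et al., 2003); the brute-force enumeration below, the decay factor `lam`, and the pluggable `match` predicate (where LSA-based soft-matching would be hooked in) are illustrative assumptions.

```python
from itertools import combinations

def gap_weighted_match(s, t, n=2, lam=0.5, match=lambda a, b: a == b):
    """Score the matching (possibly non-contiguous) length-n subsequences of
    token lists s and t, weighting each pair of occurrences by lam raised to
    the total span, so that gapped matches count less than contiguous ones.
    Replacing `match` with an LSA similarity test gives soft-matching."""
    score = 0.0
    for i_idx in combinations(range(len(s)), n):
        for j_idx in combinations(range(len(t)), n):
            if all(match(s[a], t[b]) for a, b in zip(i_idx, j_idx)):
                span = (i_idx[-1] - i_idx[0] + 1) + (j_idx[-1] - j_idx[0] + 1)
                score += lam ** span
    return score

a = "I go very quickly to school".split()
b = "I go to school".split()
c = "I go quickly to school".split()
# c is closer to b than a is, because its matches contain fewer gaps.
print(gap_weighted_match(b, c), gap_weighted_match(b, a))
```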