<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1050"> <Title>Domain Kernels for Word Sense Disambiguation</Title>
<Section position="3" start_page="403" end_page="404" type="metho"> <SectionTitle> 2 Domain Models </SectionTitle>
<Paragraph position="0"> The simplest methodology for estimating the similarity between the topics of two texts is to represent them as vectors in the Vector Space Model (VSM) and to exploit the cosine similarity. More formally, let $C = \{t_1, t_2, \ldots, t_n\}$ be a corpus, let $V = \{w_1, w_2, \ldots, w_k\}$ be its vocabulary, and let $T$ be the $k \times n$ term-by-document matrix representing $C$, such that $t_{i,j}$ is the frequency of word $w_i$ in the text $t_j$. The VSM is a $k$-dimensional space $\mathbb{R}^k$, in which the text $t_j \in C$ is represented by the vector $\vec{t}_j$ whose $i$th component is $t_{i,j}$. The similarity between two texts in the VSM is estimated by computing the cosine between their vectors.</Paragraph>
<Paragraph position="1"> However, this approach does not deal well with lexical variability and ambiguity. For example, the two sentences he is affected by AIDS and HIV is a virus do not have any words in common. In the VSM their similarity is zero because their vectors are orthogonal, even though the concepts they express are closely related. On the other hand, the similarity between the two sentences the laptop has been infected by a virus and HIV is a virus would turn out to be very high, due to the ambiguity of the word virus.</Paragraph>
<Paragraph position="2"> To overcome this problem we introduce the notion of Domain Model (DM), and we show how to use it to define a domain VSM in which texts and terms are represented in a uniform way.</Paragraph>
<Paragraph position="3"> A DM is composed of soft clusters of terms. Each cluster represents a semantic domain, i.e. a set of terms that often co-occur in texts having similar topics. A DM is represented by a $k \times k'$ rectangular matrix $D$, containing the degree of association between terms and domains, as illustrated in Table 1.</Paragraph> </Section>
<Section position="4" start_page="404" end_page="405" type="metho"> <SectionTitle> MEDICINE COMPUTER SCIENCE </SectionTitle>
<Paragraph position="0"> DMs can be used to describe lexical ambiguity and variability. Lexical ambiguity is represented by associating one term with more than one domain, while variability is represented by associating different terms with the same domain. For example, the term virus is associated with both the domain COMPUTER SCIENCE and the domain MEDICINE (ambiguity), while the domain MEDICINE is associated with both the terms AIDS and HIV (variability).</Paragraph>
<Paragraph position="1"> More formally, let $\mathcal{D} = \{D_1, D_2, \ldots, D_{k'}\}$ be a set of domains, such that $k' \ll k$. A DM is fully defined by a $k \times k'$ domain matrix $D$ representing in each cell $d_{i,z}$ the domain relevance of term $w_i$ with respect to the domain $D_z$. The domain matrix $D$ is used to define a function $\mathcal{D}: \mathbb{R}^k \to \mathbb{R}^{k'}$, which maps the vectors $\vec{t}_j$, expressed in the classical VSM, into the vectors $\vec{t}'_j$ in the domain VSM. $\mathcal{D}$ is defined by</Paragraph>
<Paragraph position="2"> $$\mathcal{D}(\vec{t}_j) = \vec{t}_j (I^{IDF} D) = \vec{t}'_j \qquad (1)$$ </Paragraph>
<Paragraph position="3"> where $I^{IDF}$ is a $k \times k$ diagonal matrix such that $I^{IDF}_{i,i} = IDF(w_i)$, $\vec{t}_j$ is represented as a row vector, and $IDF(w_i)$ is the Inverse Document Frequency of $w_i$.</Paragraph>
<Paragraph position="4"> Vectors in the domain VSM are called Domain Vectors (DVs). DVs for texts are estimated by exploiting formula 1, while the DV $\vec{w}'_i$ corresponding to the word $w_i \in V$ is the $i$th row of the domain matrix $D$. To be a valid domain matrix such vectors should be normalized (i.e. $\langle \vec{w}'_i, \vec{w}'_i \rangle = 1$). In the Domain VSM the similarity between DVs is estimated by taking into account second-order relations among terms. For example, the similarity of the two sentences He is affected by AIDS and HIV is a virus is very high, because the terms AIDS, HIV and virus are highly associated with the domain MEDICINE.</Paragraph>
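As a concrete illustration of the mapping in equation 1, the following sketch builds a toy domain matrix over the MEDICINE and COMPUTER SCIENCE domains and compares cosine similarities in the classical VSM and in the Domain VSM. The vocabulary, the association weights and the IDF values are invented for the example; they are not the resources used in the paper.

```python
# Minimal sketch of the domain mapping of equation 1 (illustrative only).
import numpy as np

vocab = ["HIV", "AIDS", "virus", "laptop"]    # k = 4 terms
domains = ["MEDICINE", "COMPUTER SCIENCE"]    # k' = 2 domains

# k x k' domain matrix D: degree of association between terms and domains
D = np.array([
    [1.0, 0.0],   # HIV    -> MEDICINE
    [1.0, 0.0],   # AIDS   -> MEDICINE
    [0.7, 0.7],   # virus  -> ambiguous: both domains
    [0.0, 1.0],   # laptop -> COMPUTER SCIENCE
])

I_idf = np.diag([1.0, 1.0, 0.8, 1.0])         # hypothetical IDF weights

def to_domain_space(t):
    """Equation 1: D(t) = t (I^IDF D), with t a row vector of term frequencies."""
    return t @ I_idf @ D

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "He is affected by AIDS" vs. "HIV is a virus": no words in common, so
# cosine 0 in the classical VSM, but high similarity in the Domain VSM.
t1 = np.array([0, 1, 0, 0], dtype=float)   # AIDS
t2 = np.array([1, 0, 1, 0], dtype=float)   # HIV, virus
print(cosine(t1, t2))                                    # 0.0 in the classical VSM
print(cosine(to_domain_space(t1), to_domain_space(t2)))  # ~0.94 in the Domain VSM
```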
<Paragraph position="5"> A DM can be estimated from hand-made lexical resources such as WORDNET DOMAINS (Magnini and Cavaglià, 2000), or by performing a term clustering process on a large corpus. We think that the second methodology is more attractive, because it allows us to automatically acquire DMs for different languages.</Paragraph>
<Paragraph position="6"> In this work we propose the use of Latent Semantic Analysis (LSA) to induce DMs from corpora. LSA is an unsupervised technique for estimating the similarity among texts and terms in a corpus. LSA is performed by means of a Singular Value Decomposition (SVD) of the term-by-document matrix $T$ describing the corpus. The SVD algorithm can be exploited to acquire a domain matrix $D$ from a large corpus $C$ in a totally unsupervised way. SVD decomposes the term-by-document matrix $T$ into three matrices, $T \simeq V \Sigma_{k'} U^T$, where $\Sigma_{k'}$ is the diagonal $k \times k$ matrix containing the $k' \ll k$ highest eigenvalues of $T$, with all the remaining elements set to 0. The parameter $k'$ is the dimensionality of the Domain VSM and can be fixed in advance. Under this setting we define the domain matrix $D_{LSA}$ as</Paragraph>
<Paragraph position="7"> $$D_{LSA} = I^N V \sqrt{\Sigma_{k'}} \qquad (2)$$ </Paragraph>
<Paragraph position="8"> where $I^N$ is a diagonal matrix such that $I^N_{i,i} = 1 / \sqrt{\langle \vec{w}'_i, \vec{w}'_i \rangle}$, and $\vec{w}'_i$ is the $i$th row of the matrix $V \sqrt{\Sigma_{k'}}$. The Domain VSM obtained in this way is equivalent to a Latent Semantic Space (Deerwester et al., 1990). The only difference in our formulation is that the vectors representing the terms in the Domain VSM are normalized by the matrix $I^N$, and then rescaled, according to their IDF value, by the matrix $I^{IDF}$. Note the analogy with the tf·idf term weighting schema (Salton and McGill, 1983), widely adopted in Information Retrieval.</Paragraph> </Section>
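The following sketch shows one way to obtain a matrix of the form of equation 2 from a term-by-document count matrix using a truncated SVD. It assumes only numpy; the toy count matrix and the choice $k' = 2$ are placeholders, not the corpus or dimensionality used in the experiments.

```python
# Sketch of acquiring the LSA domain matrix of equation 2.
import numpy as np

def lsa_domain_matrix(T, k_prime):
    # SVD: T ~= V Sigma U^T. numpy returns T = U_np S Vt_np, where the
    # columns of U_np play the role of the paper's term matrix V.
    U_np, s, _ = np.linalg.svd(T, full_matrices=False)
    V = U_np[:, :k_prime]                      # k x k' term space
    sqrt_sigma = np.diag(np.sqrt(s[:k_prime]))
    W = V @ sqrt_sigma                         # rows are unnormalized term DVs
    # I^N normalizes every term vector to unit length, as required for a DM
    norms = np.linalg.norm(W, axis=1)
    norms[norms == 0] = 1.0
    I_N = np.diag(1.0 / norms)
    return I_N @ W                             # D_LSA = I^N V sqrt(Sigma_k')

# toy 5-term x 4-document count matrix, k' = 2 domains
T = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [1, 1, 1, 1],
              [0, 0, 2, 1],
              [0, 0, 1, 2]], dtype=float)
D_lsa = lsa_domain_matrix(T, k_prime=2)
print(D_lsa.shape)   # (5, 2): one k'-dimensional DV per term
```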
<Section position="5" start_page="405" end_page="406" type="metho"> <SectionTitle> 3 Kernel Methods for WSD </SectionTitle>
<Paragraph position="0"> In the introduction we discussed two promising directions for improving the performance of a supervised disambiguation system. In this section we show how these requirements can be efficiently implemented in a natural and elegant way by using kernel methods.</Paragraph>
<Paragraph position="1"> The basic idea behind kernel methods is to embed the data into a suitable feature space $F$ via a mapping function $\phi: X \to F$, and then use a linear algorithm for discovering nonlinear patterns. Instead of using the explicit mapping $\phi$, we can use a kernel function $K: X \times X \to \mathbb{R}$, which corresponds to the inner product in a feature space that is, in general, different from the input space.</Paragraph>
<Paragraph position="2"> Kernel methods allow us to build a modular system, as the kernel function acts as an interface between the data and the learning algorithm. Thus the kernel function becomes the only domain-specific module of the system, while the learning algorithm is a general-purpose component. Potentially any kernel function can work with any kernel-based algorithm. In our system we use Support Vector Machines (Cristianini and Shawe-Taylor, 2000).</Paragraph>
<Paragraph position="3"> Exploiting the properties of kernel functions, it is possible to define the kernel combination schema as</Paragraph>
<Paragraph position="4"> $$K_C(x_i, x_j) = \sum_{l=1}^{n} \frac{K_l(x_i, x_j)}{\sqrt{K_l(x_j, x_j)\, K_l(x_i, x_i)}} \qquad (3)$$ </Paragraph>
<Paragraph position="5"> Our WSD system is then defined as a combination of $n$ basic kernels. Each kernel adds some additional dimensions to the feature space. In particular, we have defined two families of kernels: Domain and Syntagmatic kernels. The former is composed of the Domain Kernel ($K_D$) and the Bag-of-Words kernel ($K_{BoW}$), which capture domain aspects (see Section 3.1). The latter captures the syntagmatic aspects of sense distinction and is composed of two kernels: the Collocation Kernel ($K_{Coll}$) and the Part-of-Speech Kernel ($K_{PoS}$) (see Section 3.2). The WSD kernels ($K'_{WSD}$ and $K_{WSD}$) are then defined by combining them (see Section 3.3).</Paragraph>
<Section position="1" start_page="405" end_page="405" type="sub_section"> <SectionTitle> 3.1 Domain Kernels </SectionTitle>
<Paragraph position="0"> In (Magnini et al., 2002), it has been claimed that knowing the domain of the text in which a word occurs is crucial information for WSD. For example, the (domain) polysemy between the COMPUTER SCIENCE and the MEDICINE senses of the word virus can be resolved by simply considering the domain of the context in which it is located.</Paragraph>
<Paragraph position="1"> This assumption can be modeled by defining a kernel that estimates the domain similarity between the contexts of the words to be disambiguated, namely the Domain Kernel. The Domain Kernel estimates the similarity between the topics (domains) of two texts, so as to capture the domain aspects of sense distinction. It is a variation of the Latent Semantic Kernel (Shawe-Taylor and Cristianini, 2004), in which a DM (see Section 2) is exploited to define an explicit mapping $\mathcal{D}: \mathbb{R}^k \to \mathbb{R}^{k'}$ from the classical VSM into the Domain VSM. The Domain Kernel is defined by</Paragraph>
<Paragraph position="2"> $$K_D(t_i, t_j) = \frac{\langle \mathcal{D}(t_i), \mathcal{D}(t_j) \rangle}{\sqrt{\langle \mathcal{D}(t_i), \mathcal{D}(t_i) \rangle \, \langle \mathcal{D}(t_j), \mathcal{D}(t_j) \rangle}} \qquad (4)$$ </Paragraph>
<Paragraph position="3"> where $\mathcal{D}$ is the Domain Mapping defined in equation 1. Thus the Domain Kernel requires a Domain Matrix $D$. For our experiments we acquire the matrix $D_{LSA}$, described in equation 2, from a generic collection of unlabeled documents, as explained in Section 2.</Paragraph>
<Paragraph position="4"> A more traditional approach to detecting topic (domain) similarity is to extract Bag-of-Words (BoW) features from a large window of text around the word to be disambiguated. The BoW kernel, denoted by $K_{BoW}$, is a particular case of the Domain Kernel in which $D = I$, where $I$ is the identity matrix. The BoW kernel does not require a DM, so it can be applied in strictly supervised settings, in which an external knowledge source is not provided.</Paragraph> </Section>
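A minimal sketch of the Domain Kernel of equation 4, with the BoW kernel recovered as the special case $D = I$. The functions below take the kind of toy domain matrix and IDF matrix built in the earlier sketches; they are illustrative, not the authors' implementation.

```python
# Sketch of the Domain Kernel (eq. 4) and of the BoW kernel as its special case.
import numpy as np

def domain_kernel(t_i, t_j, D, I_idf):
    """K_D(t_i, t_j): normalized inner product in the Domain VSM (eq. 4)."""
    di = t_i @ I_idf @ D          # domain mapping of equation 1
    dj = t_j @ I_idf @ D
    return float(di @ dj / np.sqrt((di @ di) * (dj @ dj)))

def bow_kernel(t_i, t_j, I_idf):
    """K_BoW: the Domain Kernel with D = I (no external Domain Model needed)."""
    k = len(t_i)
    return domain_kernel(t_i, t_j, np.eye(k), I_idf)

# e.g. domain_kernel(t1, t2, D, I_idf) with the toy objects from the first sketch
```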
<Section position="2" start_page="405" end_page="406" type="sub_section"> <SectionTitle> 3.2 Syntagmatic kernels </SectionTitle>
<Paragraph position="0"> Kernel functions are not restricted to operating on vectorial objects $\vec{x} \in \mathbb{R}^k$. In principle, kernels can be defined for any kind of object representation, for example sequences and trees. As stated in Section 1, syntagmatic relations hold among words collocated in a particular temporal order, so they can be modeled by analyzing sequences of words.</Paragraph>
<Paragraph position="1"> We identified the string kernel (or word sequence kernel) (Shawe-Taylor and Cristianini, 2004) as a valid instrument to model our assumptions.</Paragraph>
<Paragraph position="2"> The string kernel counts how many times a (non-contiguous) subsequence of symbols $u$ of length $n$ occurs in the input string $s$, and penalizes non-contiguous occurrences according to the number of gaps they contain (gap-weighted subsequence kernel).</Paragraph>
<Paragraph position="3"> Formally, let $V$ be the vocabulary. The feature space associated with the gap-weighted subsequence kernel of length $n$ is indexed by a set $I$ of subsequences over $V$ of length $n$. The (explicit) mapping function is defined by</Paragraph>
<Paragraph position="4"> $$\phi_u(s) = \sum_{\mathbf{i}: u = s(\mathbf{i})} \lambda^{l(\mathbf{i})}, \quad u \in I \qquad (5)$$ </Paragraph>
<Paragraph position="5"> where $u = s(\mathbf{i})$ is a subsequence of $s$ in the positions given by the tuple $\mathbf{i}$, $l(\mathbf{i})$ is the length spanned by $u$, and $\lambda \in (0, 1]$ is the decay factor used to penalize non-contiguous subsequences.</Paragraph>
<Paragraph position="6"> The associated gap-weighted subsequence kernel is defined by</Paragraph>
<Paragraph position="7"> $$K_n(s_i, s_j) = \langle \phi(s_i), \phi(s_j) \rangle = \sum_{u \in I} \phi_u(s_i)\, \phi_u(s_j) \qquad (6)$$ </Paragraph>
<Paragraph position="8"> We modified the generic definition of the string kernel in order to make it able to recognize collocations in a local window of the word to be disambiguated. In particular, we defined two Syntagmatic kernels: the n-gram Collocation Kernel and the n-gram PoS Kernel. The n-gram Collocation Kernel $K^n_{Coll}$ is defined as a gap-weighted subsequence kernel applied to sequences of lemmata around the word $l_0$ to be disambiguated (i.e. $l_{-3}, l_{-2}, l_{-1}, l_0, l_{+1}, l_{+2}, l_{+3}$). This formulation allows us to estimate the number of common (sparse) subsequences of lemmata (i.e. collocations) between two examples, in order to capture syntagmatic similarity. By analogy, we defined the PoS Kernel $K^n_{PoS}$ by setting $s$ to the sequence of PoSs $p_{-3}, p_{-2}, p_{-1}, p_0, p_{+1}, p_{+2}, p_{+3}$, where $p_0$ is the PoS of the word to be disambiguated.</Paragraph>
<Paragraph position="9"> The definition of the gap-weighted subsequence kernel, provided by equation 6, depends on the parameter $n$, which represents the length of the subsequences analyzed when estimating the similarity between sequences. For example, $K^2_{Coll}$ allows us to represent the bigrams around the word to be disambiguated in a more flexible way (i.e. bigrams can be sparse). In WSD, typical features are bigrams and trigrams of lemmata and PoSs around the word to be disambiguated, so we defined the Collocation Kernel and the PoS Kernel by summing the n-gram kernels up to a fixed length $p$, as in equations 7 and 8, respectively:</Paragraph>
<Paragraph position="10"> $$K_{Coll}(s_i, s_j) = \sum_{n=1}^{p} K^n_{Coll}(s_i, s_j) \qquad (7)$$ $$K_{PoS}(s_i, s_j) = \sum_{n=1}^{p} K^n_{PoS}(s_i, s_j) \qquad (8)$$ </Paragraph> </Section>
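Because each example is only a seven-token window, the explicit feature map of equation 5 can be enumerated directly rather than computed with the usual dynamic-programming recursion for string kernels. The sketch below does exactly that and then sums the n-gram kernels as in equation 7; the example windows, the decay factor $\lambda = 0.5$ and the bound $p = 2$ are hypothetical choices, not the parameters reported in the paper.

```python
# Brute-force sketch of the gap-weighted subsequence kernel (eqs. 5 and 6)
# and of the collocation kernel as a sum of n-gram kernels (cf. eq. 7).
from itertools import combinations
from collections import defaultdict

def gap_weighted_features(seq, n, lam=0.5):
    """phi_u(s): for every length-n subsequence u of seq, accumulate lam^l(i),
    where l(i) is the span covered by that occurrence (equation 5)."""
    phi = defaultdict(float)
    for idx in combinations(range(len(seq)), n):
        u = tuple(seq[i] for i in idx)
        span = idx[-1] - idx[0] + 1          # length spanned by the subsequence
        phi[u] += lam ** span
    return phi

def subsequence_kernel(s_i, s_j, n, lam=0.5):
    """K_n(s_i, s_j) = <phi(s_i), phi(s_j)> (equation 6)."""
    phi_i = gap_weighted_features(s_i, n, lam)
    phi_j = gap_weighted_features(s_j, n, lam)
    return sum(v * phi_j[u] for u, v in phi_i.items() if u in phi_j)

def collocation_kernel(lemmas_i, lemmas_j, p=2, lam=0.5):
    """K_Coll: sum of the n-gram collocation kernels for n = 1..p (cf. eq. 7);
    inputs are the lemma windows l_-3 ... l_+3 around the target word."""
    return sum(subsequence_kernel(lemmas_i, lemmas_j, n, lam) for n in range(1, p + 1))

# two hypothetical 7-lemma windows around the ambiguous word "virus"
w1 = ["the", "laptop", "be", "infect", "by", "a", "virus"]
w2 = ["my", "computer", "be", "attack", "by", "a", "virus"]
print(collocation_kernel(w1, w2))
```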
<Section position="3" start_page="406" end_page="406" type="sub_section"> <SectionTitle> 3.3 WSD kernels </SectionTitle>
<Paragraph position="0"> In order to show the impact of using Domain Models in the supervised learning process, we defined two WSD kernels by applying the kernel combination schema described by equation 3. Thus the following WSD kernels are fully specified by the list of the kernels that compose them:</Paragraph>
<Paragraph position="1"> $K_{wsd}$, composed of $K_{Coll}$, $K_{PoS}$ and $K_{BoW}$; $K'_{wsd}$, composed of $K_{Coll}$, $K_{PoS}$, $K_{BoW}$ and $K_D$. The only difference between the two systems is that $K'_{wsd}$ uses the Domain Kernel $K_D$. $K'_{wsd}$ exploits external knowledge, in contrast to $K_{wsd}$, whose only available information is the labeled training data.</Paragraph> </Section> </Section> </Paper>