<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1081">
  <Title>A Kernel PCA Method for Superior Word Sense Disambiguation Dekai WU1 Weifeng SU Marine CARPUAT dekai@cs.ust.hk weifeng@cs.ust.hk marine@cs.ust.hk</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Nonlinear principal components and
WSD
</SectionTitle>
    <Paragraph position="0"> The Kernel Principal Component Analysis technique, or KPCA, is a nonlinear kernel method for extraction of nonlinear principal components from vector sets in which, conceptually, the n-dimensional input vectors are nonlinearly mapped from their original space Rn to a high-dimensional feature space F where linear PCA is performed, yielding a transform by which the input vectors can be mapped nonlinearly to a new set of vectors (Sch&amp;quot;olkopf et al., 1998).</Paragraph>
    <Paragraph position="1"> A major advantage of KPCA is that, unlike other common analysis techniques, as with other kernel methods it inherently takes combinations of predictive features into account when optimizing dimensionality reduction. For natural language problems in general, of course, it is widely recognized that significant accuracy gains can often be achieved by generalizing over relevant feature combinations (e.g., Kudo and Matsumoto (2003)). Another advantage of KPCA for the WSD task is that the dimensionality of the input data is generally very  large, a condition where kernel methods excel.</Paragraph>
    <Paragraph position="2"> Nonlinear principal components (Diamantaras and Kung, 1996) may be defined as follows. Suppose we are given a training set of M pairs (xt;ct) where the observed vectors xt 2 Rn in an n-dimensional input space X represent the context of the target word being disambiguated, and the correct class ct represents the sense of the word, for t = 1;::;M. Suppose is a nonlinear mapping from the input space Rn to the feature space F.</Paragraph>
    <Paragraph position="3"> Without loss of generality we assume the M vectors are centered vectors in the feature space, i.e.,P</Paragraph>
    <Paragraph position="5"> 1998). We wish to diagonalize the covariance matrix in F:</Paragraph>
    <Paragraph position="7"> To do this requires solving the equation v = Cv for eigenvalues 0 and eigenvectors v 2 F. Be-</Paragraph>
    <Paragraph position="9"> and let ^ 1 ^ 2 ::: ^ M denote the eigenvalues of ^K and ^ 1 ,..., ^ M denote the corresponding complete set of normalized eigenvectors, such that ^ t(^ t ^ t) = 1 when ^ t &gt; 0. Then the lth nonlinear principal component of any test vector xt is defined as</Paragraph>
    <Paragraph position="11"> where ^ li is the lth element of ^ l .</Paragraph>
    <Paragraph position="12"> To illustrate the potential of nonlinear principal components for WSD, consider a simplified disambiguation example for the ambiguous target word &amp;quot;art&amp;quot;, with the two senses shown in Table 1. Assume a training corpus of the eight sentences as shown in Table 2, adapted from Senseval-2 English lexical sample corpus. For each sentence, we show the feature set associated with that occurrence of &amp;quot;art&amp;quot; and the correct sense class. These eight occurrences of &amp;quot;art&amp;quot; can be transformed to a binary vector representation containing one dimension for each feature, as shown in Table 3.</Paragraph>
    <Paragraph position="13"> Extracting nonlinear principal components for the vectors in this simple corpus results in nonlinear generalization, reflecting an implicit consideration of combinations of features. Table 3 shows the first three dimensions of the principal component vectors obtained by transforming each of the eight training vectors xt into (a) principal component vectors zt using the linear transform obtained via PCA, and (b) nonlinear principal component vectors yt using the nonlinear transform obtained via KPCA as described below.</Paragraph>
    <Paragraph position="14"> Similarly, for the test vector x9, Table 4 shows the first three dimensions of the principal component vectors obtained by transforming it into (a) a principal component vector z9 using the linear PCA transform obtained from training, and (b) a nonlinear principal component vector y9 using the nonlinear KPCA transform obtained obtained from training.</Paragraph>
    <Paragraph position="15"> The vector similarities in the KPCA-transformed space can be quite different from those in the PCA-transformed space. This causes the KPCA-based model to be able to make the correct class prediction, whereas the PCA-based model makes the  (Kilgarriff 2001), together with a tiny example set of features. The training and testing examples can be represented as a set of binary vectors: each row shows the correct class c for an observed vector x of five dimensions.</Paragraph>
    <Paragraph position="16"> TRAINING design/N media/N the/DT entertainment/N world/N Class x1 He studies art in London. 1 x2 Punch's weekly guide to the world of the arts, entertainment, media and more.</Paragraph>
    <Paragraph position="17">  sign arts particularly, this led to appointments made for political rather than academic reasons.</Paragraph>
    <Paragraph position="18"> 1 1 1 1 wrong class prediction.</Paragraph>
    <Paragraph position="19"> What permits KPCA to apply stronger generalization biases is its implicit consideration of combinations of feature information in the data distribution from the high-dimensional training vectors. In this simplified illustrative example, there are just five input dimensions; the effect is stronger in more realistic high dimensional vector spaces. Since the KPCA transform is computed from unsupervised training vector data, and extracts generalizations that are subsequently utilized during supervised classification, it is quite possible to combine large amounts of unsupervised data with reasonable smaller amounts of supervised data.</Paragraph>
    <Paragraph position="20"> It can be instructive to attempt to interpret this example graphically, as follows, even though the interpretation in three dimensions is severely limiting. Figure 1(a) depicts the eight original observed training vectors xt in the first three of the five dimensions; note that among these eight vectors, there happen to be only four unique points when restricting our view to these three dimensions. Ordinary linear PCA can be straightforwardly seen as projecting the original points onto the principal axis,  principal components as transformed via PCA and KPCA. Observed vectors PCA-transformed vectors KPCA-transformed vectors Class</Paragraph>
    <Paragraph position="22"> as transformed via the trained PCA and KPCA parameters. The PCA-based and KPCA-based sense class predictions disagree.</Paragraph>
    <Paragraph position="24"> as can be seen for the case of the first principal axis in Figure 1(b). Note that in this space, the sense 2 instances are surrounded by sense 1 instances. We can traverse each of the projections onto the principal axis in linear order, simply by visiting each of the first principal components z1t along the principle axis in order of their values, i.e., such that</Paragraph>
    <Paragraph position="26"> It is significantly more difficult to visualize the nonlinear principal components case, however.</Paragraph>
    <Paragraph position="27"> Note that in general, there may not exist any principal axis in X, since an inverse mapping from F may not exist. If we attempt to follow the same procedure to traverse each of the projections onto the first principal axis as in the case of linear PCA, by considering each of the first principal components y1t in order of their value, i.e., such that</Paragraph>
    <Paragraph position="29"> then we must arbitrarily select a &amp;quot;quasi-projection&amp;quot; direction for each y1t since there is no actual principal axis toward which to project. This results in a &amp;quot;quasi-axis&amp;quot; roughly as shown in Figure 1(c) which, though not precisely accurate, provides some idea as to how the nonlinear generalization capability allows the data points to be grouped by principal components reflecting nonlinear patterns in the data distribution, in ways that linear PCA cannot do. Note that in this space, the sense 1 instances are already better separated from sense 2 data points. Moreover, unlike linear PCA, there may be up to M of the &amp;quot;quasi-axes&amp;quot;, which may number far more than five. Such effects can become pronounced in the high dimensional spaces are actually used for real word sense disambiguation tasks.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 A KPCA-based WSD model
</SectionTitle>
    <Paragraph position="0"> To extract nonlinear principal components efficiently, note that in both Equations (5) and (6) the explicit form of (xi) is required only in the form of ( (xi) (xj)), i.e., the dot product of vectors in F. This means that we can calculate the nonlinear principal components by substituting a kernel function k(xi;xj) for ( (xi) (xj )) in Equations (5) and (6) without knowing the mapping explicitly; instead, the mapping is implicitly defined by the kernel function. It is always possible to construct a mapping into a space where k acts as a dot product so long as k is a continuous kernel of a positive  : training example with sense class 1 : training example with sense class 2 : test example with unknown sense class : test example with predicted sense first principal &amp;quot; quasi-axis &amp;quot; class 2 (correct sense class=1) : test example with predicted sense  Thus we train the KPCA model using the follow- null ing algorithm: 1. Compute an M M matrix ^K such that</Paragraph>
    <Paragraph position="2"> 2. Compute the eigenvalues and eigenvectors of matrix ^K and normalize the eigenvectors. Let ^ 1 ^ 2 ::: ^ M denote the eigenvalues  and ^ 1,..., ^ M denote the corresponding complete set of normalized eigenvectors. To obtain the sense predictions for test instances, we need only transform the corresponding vectors using the trained KPCA model and classify the resultant vectors using nearest neighbors. For a given test instance vector x, its lth nonlinear principal</Paragraph>
    <Paragraph position="4"> where ^ li is the ith element of ^ l.</Paragraph>
    <Paragraph position="5"> For our disambiguation experiments we employ a polynomial kernel function of the form k(xi;xj) = (xi xj)d, although other kernel functions such as gaussians could be used as well. Note that the degenerate case of d = 1 yields the dot product kernel k(xi;xj) = (xi xj) which covers linear PCA as a special case, which may explain why KPCA always outperforms PCA.</Paragraph>
  </Section>
class="xml-element"></Paper>