<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0822">
  <Title>Augmenting Ensemble Classification for Word Sense Disambiguation with a Kernel PCA Model</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Experimental setup
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Tasks evaluated
</SectionTitle>
      <Paragraph position="0"> We performed experiments on the following lexical sample tasks from Senseval-3: English (fine). The English lexical sample task includes 57 target words (32 verbs, 20 nouns and 5 adjectives). For each word, training and test instances tagged with WordNet senses are provided. There are an average of 8.5 senses per target word type, ranging from 3 to 23. On average, 138 training instances per target word are available.</Paragraph>
      <Paragraph position="1"> English (coarse). This modified evaluation of the preceding task employs a sense map that groups fine-grained sense distinctions into the same coarse-grained sense.</Paragraph>
      <Paragraph position="2"> Chinese. The Chinese lexical sample task includes 21 target words. For each word, several</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Association for Computational Linguistics
</SectionTitle>
      <Paragraph position="0"> for the Semantic Analysis of Text, Barcelona, Spain, July 2004 SENSEVAL-3: Third International Workshop on the Evaluation of Systems senses are defined using the HowNet knowledge base. There are an average of 3.95 senses per target word type, ranging from 2 to 8. Only about 37 training instances per target word are available.</Paragraph>
      <Paragraph position="1"> Multilingual (t). The Multilingual (t) task is defined similarly to the English lexical sample task, except that the word senses are the translations into Hindi, rather than WordNet senses. The Multilingual (t) task requires finding the Hindi sense for 31 English target word types. There are an average of 7.54 senses per target word type, ranging from 3 to 16. A relatively large training set is provided (more than 260 training instances per word on average).</Paragraph>
      <Paragraph position="2"> Multilingual (ts). The Multilingual (ts) task uses a different data set of 10 target words and provides the correct English sense of the target word for both training and testing. There are an average of 6.2 senses per target word type, ranging from 3 to 11.</Paragraph>
      <Paragraph position="3"> The training set for this subtask was smaller, with about 150 training instances per target word.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Ensemble classification
</SectionTitle>
      <Paragraph position="0"> The WSD models presented here consist of ensembles utilizing various combinations of four voting models, as follows. Some of these component models were also evaluated on other Senseval-3 tasks: the Basque, Catalan, Italian, and Romanian Lexical Sample tasks (Wicentowski et al., 2004), as well as Semantic Role Labeling (Ngai et al., 2004).</Paragraph>
      <Paragraph position="1"> The first voting model, a na&amp;quot;ive Bayes model, was built as Yarowsky and Florian (2002) found this model to be the most accurate classifier in a comparative study on a subset of Senseval-2 English lexical sample data.</Paragraph>
      <Paragraph position="2"> The second voting model, a maximum entropy model (Jaynes, 1978), was built as Klein and Manning (2002) found that it yielded higher accuracy than na&amp;quot;ive Bayes in a subsequent comparison of WSD performance. However, note that a different subset of either Senseval-1 or Senseval-2 English lexical sample data was used.</Paragraph>
      <Paragraph position="3"> The third voting model, a boosting model (Freund and Schapire, 1997), was built as boosting has consistently turned in very competitive scores on related tasks such as named entity classification (Carreras et al., 2002)(Wu et al., 2002). Specifically, we employed an AdaBoost.MH model (Schapire and Singer, 2000), which is a multi-class generalization of the original boosting algorithm, with boosting on top of decision stump classifiers (decision trees of depth one).</Paragraph>
      <Paragraph position="4"> The fourth voting model, the KPCA-based model, is described below.</Paragraph>
      <Paragraph position="5"> All classifier models were selected for their ability to able to handle large numbers of sparse features, many of which may be irrelevant. Moreover, the maximum entropy and boosting models are known to be well suited to handling features that are highly interdependent.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Controlled feature set
</SectionTitle>
      <Paragraph position="0"> In order to facilitate a controlled comparison across the individual voting models, the same feature set was employed for all classifiers. The features are as described by Yarowsky and Florian (2002) in their &amp;quot;feature-enhanced na&amp;quot;ive Bayes model&amp;quot;, with position-sensitive, syntactic, and local collocational features.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 The KPCA-based WSD model
</SectionTitle>
      <Paragraph position="0"> We briefly summarize the KPCA-based model here; for full details including illustrative examples and graphical interpretation, please refer to Wu et al.</Paragraph>
      <Paragraph position="1"> (2004).</Paragraph>
      <Paragraph position="2"> Kernel PCA Kernel Principal Component Analysis is a nonlinear kernel method for extracting non-linear principal components from vector sets where, conceptually, the n-dimensional input vectors are nonlinearly mapped from their original space Rn to a high-dimensional feature space F where linear PCA is performed, yielding a transform by which the input vectors can be mapped nonlinearly to a new set of vectors (Sch&amp;quot;olkopf et al., 1998).</Paragraph>
      <Paragraph position="3"> As with other kernel methods, a major advantage of KPCA over other common analysis techniques is that it can inherently take combinations of predictive features into account when optimizing dimensionality reduction. For WSD and indeed many natural language tasks, significant accuracy gains can often be achieved by generalizing over relevant feature combinations (see, e.g., Kudo and Matsumoto (2003)). A further advantage of KPCA in the context of the WSD problem is that the dimensionality of the input data is generally very large, a condition where kernel methods excel.</Paragraph>
      <Paragraph position="4"> Nonlinear principal components (Diamantaras and Kung, 1996) are defined as follows. Suppose we are given a training set of M pairs (xt;ct) where the observed vectors xt 2 Rn in an n-dimensional input space X represent the context of the target word being disambiguated, and the correct class ct represents the sense of the word, for t = 1;::;M.</Paragraph>
      <Paragraph position="5"> Suppose is a nonlinear mapping from the input space Rn to the feature space F. Without loss of generality we assume the M vectors are centered vectors in the feature space, i.e., PMt=1 (xt) = 0; uncentered vectors can easily be converted to centered vectors (Sch&amp;quot;olkopf et al., 1998). We wish to diagonalize the covariance matrix in F:</Paragraph>
      <Paragraph position="7"> To do this requires solving the equation v =</Paragraph>
      <Paragraph position="9"> and let ^ 1 ^ 2 ::: ^ M denote the eigenvalues of ^K and ^ 1 ,..., ^ M denote the corresponding complete set of normalized eigenvectors, such that ^ t(^ t ^ t) = 1 when ^ t &gt; 0. Then the lth nonlinear principal component of any test vector xt is defined as</Paragraph>
      <Paragraph position="11"> where ^ li is the lth element of ^ l .</Paragraph>
      <Paragraph position="12"> See Wu et al. (2004) for a possible geometric interpretation of the power of the nonlinearity.</Paragraph>
      <Paragraph position="13"> WSD using KPCA In order to extract nonlinear principal components efficiently, first note that in both Equations (5) and (6) the explicit form of (xi) is required only in the form of ( (xi) (xj)), i.e., the dot product of vectors in F. This means that we can calculate the nonlinear principal components by substituting a kernel function k(xi;xj) for ( (xi) (xj )) in Equations (5) and (6) without knowing the mapping explicitly; instead, the mapping is implicitly defined by the kernel function. It is always possible to construct a mapping into a space where k acts as a dot product so long as k is a continuous kernel of a positive integral operator (Sch&amp;quot;olkopf et al., 1998).</Paragraph>
      <Paragraph position="14"> Thus we train the KPCA model using the follow- null ing algorithm: 1. Compute an M M matrix ^K such that</Paragraph>
      <Paragraph position="16"> 2. Compute the eigenvalues and eigenvectors of matrix ^K and normalize the eigenvectors. Let ^ 1 ^ 2 ::: ^ M denote the eigenvalues  and ^ 1,..., ^ M denote the corresponding complete set of normalized eigenvectors. To obtain the sense predictions for test instances, we need only transform the corresponding vectors using the trained KPCA model and classify the resultant vectors using nearest neighbors. For a given test instance vector x, its lth nonlinear principal</Paragraph>
      <Paragraph position="18"> where ^ li is the ith element of ^ l.</Paragraph>
      <Paragraph position="19"> For our disambiguation experiments we employ a polynomial kernel function of the form k(xi;xj) = (xi xj)d, although other kernel functions such as gaussians could be used as well. Note that the degenerate case of d = 1 yields the dot product kernel k(xi;xj) = (xi xj) which covers linear PCA as a special case, which may explain why KPCA always outperforms PCA.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>