<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1006">
  <Title>An Empirical Evaluation of Knowledge Sources and Learning Algorithms for Word Sense Disambiguation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Knowledge Sources
</SectionTitle>
    <Paragraph position="0"> To disambiguate a word occurrence a2 , we consider four knowledge sources listed below. Each training (or test) context of a2 generates one training (or test) feature vector.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Part-of-Speech (POS) of Neighboring
Words
</SectionTitle>
      <Paragraph position="0"> We use 7 features to encode this knowledge source: a3a5a4a7a6a9a8a10a3a11a4a7a12a9a8a10a3a5a4a14a13a15a8a10a3a17a16a9a8a10a3a5a13a15a8a10a3a18a12a9a8a10a3a17a6 , wherea3a19a4a21a20 (a3a22a20 ) is the POS of the a23 th token to the left (right) of a2 , and a3 a16 is the POS of a2 . A token can be a word or a punctuation symbol, and each of these neighboring tokens must be in the same sentence as a2 . We use a sentence segmentation program (Reynar and Ratnaparkhi, 1997) and a POS tagger (Ratnaparkhi, 1996) to segment the tokens surrounding a2 into sentences and assign POS tags to these tokens.</Paragraph>
      <Paragraph position="1"> For example, to disambiguate the word bars in the POS-tagged sentence &amp;quot;Reid/NNP</Paragraph>
      <Paragraph position="3"> the POS tag of a null token.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Single Words in the Surrounding Context
</SectionTitle>
      <Paragraph position="0"> For this knowledge source, we consider all single words (unigrams) in the surrounding context of a2 , and these words can be in a different sentence from a2 . For each training or test example, the SENSEVAL data sets provide up to a few sentences as the surrounding context. In the results reported in this paper, we consider all words in the provided context.</Paragraph>
      <Paragraph position="1"> Specifically, all tokens in the surrounding context of a2 are converted to lower case and replaced by their morphological root forms. Tokens present in a list of stop words or tokens that do not contain at least an alphabet character (such as numbers and punctuation symbols) are removed. All remaining tokens from all training contexts provided for a2 are gathered. Each remaining token a43 contributes one feature. In a training (or test) example, the feature corresponding to a43 is set to 1 iff the context of a2 in that training (or test) example contains a43 .</Paragraph>
      <Paragraph position="2"> We attempted a simple feature selection method to investigate if a learning algorithm performs better with or without feature selection. The feature selection method employed has one parameter: a44</Paragraph>
      <Paragraph position="4"> feature a43 is selected if a43 occurs in some sense of a2  a12 or more times in the training data. This parameter is also used by (Ng and Lee, 1996). We have tried a44 a12a32a45a47a46 and a44 a12a32a45a49a48 (i.e., no feature selection) in the results reported in this paper.</Paragraph>
      <Paragraph position="5"> For example, if a2 is the word bars and the set of selected unigrams is a50 chocolate, iron, beera51 , the feature vector for the sentence &amp;quot;Reid saw me looking at the iron bars .&amp;quot; is a24 0, 1, 0a41 .</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Local Collocations
</SectionTitle>
      <Paragraph position="0"> A local collocation a52 a20a54a53a55 refers to the ordered sequence of tokens in the local, narrow context of a2 .</Paragraph>
      <Paragraph position="1"> Offsets a23 and a56 denote the starting and ending position (relative to a2 ) of the sequence, where a negative (positive) offset refers to a token to its left (right). For example, let a2 be the word bars in the sentence &amp;quot;Reid saw me looking at the iron bars .&amp;quot; Then a52 a4a7a12a57a53a58a4a14a13 is the iron and a52 a4a14a13a59a53a12 is iron . a39 , where a39 denotes a null token. Like POS, a collocation does not cross sentence boundary. To represent this knowledge source of local collocations, we extracted 11 features corresponding to the following collocations: a52 a4a14a13a59a53a58a4a14a13 , a52 a13a59a53a58a13 , a52 a4a7a12a57a53a58a4a7a12 , a52 a12a57a53a12 , a52 a4a7a12a57a53a58a4a14a13 ,</Paragraph>
      <Paragraph position="3"> set of 11 features is the union of the collocation features used in Ng and Lee (1996) and Ng (1997).</Paragraph>
      <Paragraph position="4"> To extract the feature values of the collocation feature a52 a20a54a53a55 , we first collect all possible collocation strings (converted into lower case) corresponding to</Paragraph>
      <Paragraph position="6"> surrounding words, we do not remove stop words, numbers, or punctuation symbols. Each collocation string is a possible feature value. Feature value selection using a44 a12 , analogous to that used to select surrounding words, can be optionally applied. If a training (or test) context of a2 has collocation a60 , and a60 is a selected feature value, then the a52 a20a61a53a55 feature of a2 has value a60 . Otherwise, it has the value a62 , denoting the null string.</Paragraph>
      <Paragraph position="7"> Note that each collocation a52 a20a54a53a55 is represented by one feature that can have many possible feature values (the local collocation strings), whereas each distinct surrounding word is represented by one feature that takes binary values (indicating presence or absence of that word). For example, if a2 is the word bars and suppose the set of selected collocations for</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Syntactic Relations
</SectionTitle>
      <Paragraph position="0"> We first parse the sentence containing a2 with a statistical parser (Charniak, 2000). The constituent tree structure generated by Charniak's parser is then converted into a dependency tree in which every word points to a parent headword. For example, in the sentence &amp;quot;Reid saw me looking at the iron bars .&amp;quot;, the word Reid points to the parent headword saw.</Paragraph>
      <Paragraph position="1"> Similarly, the word me also points to the parent headword saw.</Paragraph>
      <Paragraph position="2"> We use different types of syntactic relations, depending on the POS of a2 . If a2 is a noun, we use four features: its parent headword a63 , the POS of a63 , the voice of a63 (active, passive, ora62 ifa63 is not a verb), and the relative position of a63 from a2 (whether a63 is to the left or right of a2 ). If a2 is a verb, we use six features: the nearest word a64 to the left of a2 such that a2 is the parent headword of a64 , the nearest word a65 to the right of a2 such that a2 is the parent headword of a65 , the POS of a64 , the POS of a65 , the POS of a2 , and the voice of a2 . If a2 is an adjective, we use two features: its parent headword a63 and the POS of a63 . We also investigated the effect of feature selection on syntactic-relation features that are words (i.e., POS, voice, and relative position are excluded).</Paragraph>
      <Paragraph position="3"> Some examples are shown in Table 1. Each POS noun, verb, or adjective is illustrated by one example. For each example, (a) shows a2 and its POS; (b) shows the sentence where a2 occurs; and (c) shows the feature vector corresponding to syntactic relations. null</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Learning Algorithms
</SectionTitle>
    <Paragraph position="0"> We evaluated four supervised learning algorithms: Support Vector Machines (SVM), AdaBoost with decision stumps (AdB), Naive Bayes (NB), and decision trees (DT). All the experimental results reported in this paper are obtained using the implementation of these algorithms in WEKA (Witten and Frank, 2000). All learning parameters use the default values in WEKA unless otherwise stated.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Support Vector Machines
</SectionTitle>
      <Paragraph position="0"> The SVM (Vapnik, 1995) performs optimization to find a hyperplane with the largest margin that separates training examples into two classes. A test example is classified depending on the side of the hyperplane it lies in. Input features can be mapped into high dimensional space before performing the optimization and classification. A kernel function (linear by default) can be used to reduce the computational cost of training and testing in high dimensional space. If the training examples are nonseparable, a regularization parameter a52 (a45 a66 by default) can be used to control the trade-off between achieving a large margin and a low training error.</Paragraph>
      <Paragraph position="1"> In WEKA's implementation of SVM, each nominal feature with a67 possible values is converted into a67 binary (0 or 1) features. If a nominal feature takes the a23 th feature value, then the a23 th binary feature is set to 1 and all the other binary features are set to 0.</Paragraph>
      <Paragraph position="2"> We tried higher order polynomial kernels, but they gave poorer results. Our reported results in this paper used the linear kernel.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 AdaBoost
</SectionTitle>
      <Paragraph position="0"> AdaBoost (Freund and Schapire, 1996) is a method of training an ensemble of weak learners such that the performance of the whole ensemble is higher than its constituents. The basic idea of boosting is to give more weights to misclassified training examples, forcing the new classifier to concentrate on these hard-to-classify examples. A test example is classified by a weighted vote of all trained classifiers. We use the decision stump (decision tree with only the root node) as the weak learner in AdaBoost.</Paragraph>
      <Paragraph position="1"> WEKA implements AdaBoost.M1. We used 100 iterations in AdaBoost as it gives higher accuracy than the default number of iterations in WEKA (10).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Naive Bayes
</SectionTitle>
      <Paragraph position="0"> The Naive Bayes classifier (Duda and Hart, 1973) assumes the features are independent given the class.</Paragraph>
      <Paragraph position="1"> During classification, it chooses the class with the highest posterior probability. The default setting uses Laplace (&amp;quot;add one&amp;quot;) smoothing.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Decision Trees
</SectionTitle>
      <Paragraph position="0"> The decision tree algorithm (Quinlan, 1993) partitions the training examples using the feature with the highest information gain. It repeats this process recursively for each partition until all examples in each partition belong to one class. A test example is classified by traversing the learned decision tree. WEKA implements Quinlan's C4.5 decision tree algorithm, with pruning by default.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Evaluation Data Sets
</SectionTitle>
    <Paragraph position="0"> In the SENSEVAL-2 English lexical sample task, participating systems are required to disambiguate 73 words that have their POS predetermined. There are 8,611 training instances and 4,328 test instances tagged with WORDNET senses. Our evaluation is based on all the official training and test data of SENSEVAL-2.</Paragraph>
    <Paragraph position="1"> For SENSEVAL-1, we used the 36 trainable words for our evaluation. There are 13,845 training instances1 for these trainable words, and 7,446 test instances. For SENSEVAL-1, 4 trainable words belong to the indeterminate category, i.e., the POS is not provided. For these words, we first used a POS tagger (Ratnaparkhi, 1996) to determine the correct POS.</Paragraph>
    <Paragraph position="2"> For a word a2 that may occur in phrasal word form (eg, the verb &amp;quot;turn&amp;quot; and the phrasal form &amp;quot;turn down&amp;quot;), we train a separate classifier for each phrasal word form. During testing, if a2 appears in a phrasal word form, the classifier for that phrasal word form is used. Otherwise, the classifier for a2 is used.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Empirical Results
</SectionTitle>
    <Paragraph position="0"> We ran the different learning algorithms using various knowledge sources. Table 2 (Table 3) shows  each algorithm evaluated and official scores of the top 3 participating systems of SENSEVAL-2 and SENSEVAL-1 the accuracy figures for the different combinations of knowledge sources and learning algorithms for the SENSEVAL-2 (SENSEVAL-1) data set. The nine columns correspond to: (i) using only POS of neighboring words (ii) using only single words in the surrounding context with feature selection (a44 a12a72a45a73a46 ) (iii) same as (ii) but without feature selection (a44 a12 a45a74a48 ) (iv) using only local collocations with feature selection (a44 a12a75a45a76a46 ) (v) same as (iv) but without feature selection (a44 a12a31a45a77a48 ) (vi) using only syntactic relations with feature selection on words (a44 a12a70a45a78a46 ) (vii) same as (vi) but without feature selection (a44 a12a33a45a79a48 ) (viii) combining all four knowledge sources with feature selection (ix) combining all four knowledge sources without feature selection. null SVM is only capable of handling binary class problems. The usual practice to deal with multi-class problems is to build one binary classifier per output class (denoted &amp;quot;1-per-class&amp;quot;). The original AdaBoost, Naive Bayes, and decision tree algo- null algorithm is significantly better.</Paragraph>
    <Paragraph position="1"> rithms can already handle multi-class problems, and we denote runs using the original AdB, NB, and DT algorithms as &amp;quot;normal&amp;quot; in Table 2 and Table 3. Accuracy for each word task a2 can be measured by recall (r) or precision (p), defined by: r a45 no. of test instances correctly labeledno. of test instances in word task  p a45 no. of test instances correctly labeledno. of test instances output in word task  Recall is very close (but not always identical) to precision for the top SENSEVAL participating systems. In this paper, our reported results are based on the official fine-grained scoring method.</Paragraph>
    <Paragraph position="2"> To compute an average recall figure over a set of words, we can either adopt micro-averaging (mi) or macro-averaging (ma), defined by: mi a45 total no. of test instances correctly labeledtotal no. of test instances in all word tasks</Paragraph>
    <Paragraph position="4"> a20a54a103 word tasks recall for word task a23 That is, micro-averaging treats each test instance equally, so that a word task with many test instances will dominate the micro-averaged recall. On the other hand, macro-averaging treats each word task equally.</Paragraph>
    <Paragraph position="5"> As shown in Table 2 and Table 3, the best micro-averaged recall for SENSEVAL-2 (SENSEVAL-1) is 65.4% (79.2%), obtained by combining all knowledge sources (without feature selection) and using SVM as the learning algorithm.</Paragraph>
    <Paragraph position="6"> In Table 4, we tabulate the best micro-averaged recall for each learning algorithm, broken down according to nouns, verbs, adjectives, indeterminates (for SENSEVAL-1), and all words. We also tabulate analogous figures for the top three participating systems for both SENSEVALs. The top three systems for SENSEVAL-2 are: JHU (S1) (Yarowsky et al., 2001), SMUls (S2) (Mihalcea and Moldovan, 2001), and KUNLP (S3) (Seo et al., 2001). The top three systems for SENSEVAL-1 are: hopkins (s1) (Yarowsky, 2000), ets-pu (s2) (Chodorow et al., 2000), and tilburg (s3) (Veenstra et al., 2000). As shown in Table 4, SVM with all four knowledge sources achieves accuracy higher than the best official scores of both SENSEVALs.</Paragraph>
    <Paragraph position="7"> We also conducted paired t test to see if one system is significantly better than another. The t statistic of the difference between each pair of recall figures (between each test instance pair for micro-averaging and between each word task pair for macro-averaging) is computed, giving rise to a p value. A large p value indicates that the two systems are not significantly different from each other. The comparison between our learning algorithms and the top three participating systems is given in Table 5. Note that we can only compare macro-averaged recall for SENSEVAL-1 systems, since the sense of each individual test instance output by the SENSEVAL-1 participating systems is not available. The comparison indicates that our SVM system is better than the best official SENSEVAL-2 and SENSEVAL-1 systems at the level of significance 0.05.</Paragraph>
    <Paragraph position="8"> Note that we are able to obtain state-of-the-art results using a single learning algorithm (SVM), without resorting to combining multiple learning algorithms. Several top SENSEVAL-2 participating systems have attempted the combination of classifiers using different learning algorithms.</Paragraph>
    <Paragraph position="9"> In SENSEVAL-2, JHU used a combination of various learning algorithms (decision lists, cosine-based vector models, and Bayesian models) with various knowledge sources such as surrounding words, local collocations, syntactic relations, and morphological information. SMUls used a k-nearest neighbor algorithm with features such as keywords, collocations, POS, and name entities. KUNLP used Classification Information Model, an entropy-based learning algorithm, with local, topical, and bigram contexts and their POS.</Paragraph>
    <Paragraph position="10"> In SENSEVAL-1, hopkins used hierarchical decision lists with features similar to those used by JHU in SENSEVAL-2. ets-pu used a Naive Bayes classifier with topical and local words and their POS. tilburg used a k-nearest neighbor algorithm with features similar to those used by (Ng and Lee, 1996). tilburg also used dictionary examples as additional training data.</Paragraph>
  </Section>
class="xml-element"></Paper>