<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0323"> <Title>Exemplar-Based Word Sense Disambiguation: Some Recent Improvements</Title>
<Section position="4" start_page="208" end_page="209" type="metho">
<SectionTitle> 3 Improvements to Exemplar-Based WSD </SectionTitle>
<Paragraph position="0"> PEBLS contains a number of parameters that must be set before running the algorithm. These parameters include k (the number of nearest neighbors used to determine the class of a test example), exemplar weights, feature weights, etc. Each of these parameters has a default value in PEBLS, e.g., k = 1, no exemplar weighting, no feature weighting, etc. We used the default values for all parameter settings in our previous work on exemplar-based WSD reported in (Ng and Lee, 1996). However, our preliminary investigation indicates that, among the various learning parameters of PEBLS, the number k of nearest neighbors used has a considerable impact on the accuracy of the induced exemplar-based classifier.</Paragraph>
<Paragraph position="1"> Cross validation is a well-known technique for estimating the expected error rate of a classifier trained on a particular data set. For instance, the C4.5 program (Quinlan, 1993) contains an option for running cross validation to estimate the expected error rate of an induced rule set. Cross validation has also been proposed as a general technique for automatically determining the parameter settings of a given learning algorithm on a particular training data set (Kohavi and John, 1995).</Paragraph>
<Paragraph position="2"> In m-fold cross validation, a training data set is partitioned into m (approximately) equal-sized blocks, and the learning algorithm is run m times.</Paragraph>
<Paragraph position="3"> In each run, one of the m blocks of training data is set aside as test data (the holdout set) and the remaining m - 1 blocks are used as training data. The average error rate of the m runs is a good estimate of the error rate of the induced classifier.</Paragraph>
<Paragraph position="4"> For a particular parameter setting, we can run m-fold cross validation to estimate the expected error rate of that setting. We can then choose the parameter setting that minimizes the expected error rate. Kohavi and John (1995) reported the effectiveness of such a technique in obtaining optimal parameter settings over a large number of machine learning problems.</Paragraph>
<Paragraph position="5"> In our present study, we used 10-fold cross validation to automatically determine the best k (number of nearest neighbors) from the training data. To determine the best k for disambiguating a word on a particular training set, we run 10-fold cross validation with PEBLS 21 times, once for each of k = 1, 5, 10, 15, ..., 95, 100. We compute the error rate for each k, and choose the value of k with the minimum error rate. Note that the automatic determination of the best k through 10-fold cross validation makes use of only the training set, without looking at the test set at all.</Paragraph>
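The parameter-selection procedure can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the PEBLS implementation: it uses a plain nearest-neighbor classifier with a feature-mismatch count in place of PEBLS' modified value difference metric, and all function and variable names are illustrative.

import random
from collections import Counter

def knn_predict(train, x, k):
    # train: list of (feature_tuple, sense) pairs; x: feature tuple to classify.
    # Distance is a simple feature-mismatch count, standing in for PEBLS'
    # modified value difference metric.
    nearest = sorted(train, key=lambda ex: sum(a != b for a, b in zip(ex[0], x)))[:k]
    return Counter(sense for _, sense in nearest).most_common(1)[0][0]

def cv_error_rate(train, k, m=10):
    # m-fold cross validation: each block serves once as the holdout set,
    # and the error rates of the m runs are averaged.
    data = train[:]
    random.shuffle(data)
    folds = [data[i::m] for i in range(m)]
    errors, total = 0, 0
    for i in range(m):
        holdout = folds[i]
        rest = [ex for j in range(m) if j != i for ex in folds[j]]
        errors += sum(knn_predict(rest, x, k) != y for x, y in holdout)
        total += len(holdout)
    return errors / total

def best_k(train, candidates=(1,) + tuple(range(5, 101, 5))):
    # 21 candidate values of k (1, 5, 10, ..., 100); pick the one with the
    # lowest estimated error rate, using the training data only.
    return min(candidates, key=lambda k: cv_error_rate(train, k))

In this setup, best_k would be run once per ambiguous word on that word's training examples, and the selected k would then be fixed before any test example is classified.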
</Section>
<Section position="5" start_page="209" end_page="210" type="metho">
<SectionTitle> 4 Experimental Results </SectionTitle>
<Paragraph position="0"> Mooney (1996) has reported that the Naive-Bayes algorithm gives the best performance on disambiguating six senses of the word &quot;line&quot; among seven state-of-the-art learning algorithms tested. However, his comparative study was done on only one word, using a data set of 2,094 examples. In our present study, we evaluated PEBLS and Naive-Bayes on a much larger corpus containing sense-tagged occurrences of 121 nouns and 70 verbs. This corpus was first reported in (Ng and Lee, 1996), and it contains about 192,800 sense-tagged word occurrences of the 191 most frequently occurring and most ambiguous words of English. 1 These 191 words have been tagged with senses from WORDNET (Miller, 1990), an on-line electronic dictionary that is publicly available. For this set of 191 words, the average number of senses per noun is 7.8, while the average number of senses per verb is 12.0. The sentences in this corpus were drawn from the combined corpus of the 1 million word Brown corpus and the 2.5 million word Wall Street Journal (WSJ) corpus.</Paragraph>
<Paragraph position="1"> We tested both algorithms on two test sets from this corpus. The first test set, named BC50, consists of 7,119 occurrences of the 191 words appearing in 50 text files of the Brown corpus. The second test set, named WSJ6, consists of 14,139 occurrences of the 191 words appearing in 6 text files of the WSJ corpus. Both test sets are identical to the ones reported in (Ng and Lee, 1996).</Paragraph>
<Paragraph position="2"> Since the primary aim of our present study is the comparative evaluation of learning algorithms, not feature representation, we have chosen, for simplicity, to use local collocations as the only features in the example representation. Local collocations have been found to be the single most informative set of features for WSD (Ng and Lee, 1996). That local collocation knowledge provides important clues to WSD has also been pointed out previously by Yarowsky (1993).</Paragraph>
<Paragraph position="3"> Let w be the word to be disambiguated, and let l2 l1 w r1 r2 be the sentence fragment containing w (l2 and l1 being the second and first word to the left of w, and r1 and r2 the first and second word to the right). In the present study, we used seven features in the representation of an example, namely the local collocations formed from these four surrounding words. These seven features are: l2-l1, l1-r1, r1-r2, l1, r1, l2, and r2; a sketch of their extraction is given below. The first three features are concatenations of two words. 2 The experimental results obtained are tabulated in Table 1. The first three rows of accuracy figures are those of (Ng and Lee, 1996). The default strategy of picking the most frequent sense has been advocated as the baseline performance for evaluating WSD programs (Gale et al., 1992b; Miller et al., 1994). There are two instantiations of this strategy in our current evaluation. Since WORDNET orders its senses such that sense 1 is the most frequent sense, one possibility is to always pick sense 1 as the best sense assignment. This assignment method does not even need to look at the training examples. We call this method &quot;Sense 1&quot; in Table 1. Another assignment method is to determine the most frequently occurring sense in the training examples, and to assign this sense to all test examples. We call this method &quot;Most Frequent&quot; in Table 1.</Paragraph>
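The seven collocation features above can be read directly off a tokenized sentence. The following is a minimal Python sketch: the padding token used when w lies near the sentence boundary and the hyphenated string encoding of the two-word collocations are assumptions made here for concreteness, not details specified in the text.

def collocation_features(tokens, i, pad="_NONE_"):
    # tokens: tokenized sentence; i: index of the word w to disambiguate.
    # Returns the seven local-collocation features
    # l2-l1, l1-r1, r1-r2, l1, r1, l2, r2 described above.
    def tok(j):
        return tokens[j] if 0 <= j < len(tokens) else pad
    l2, l1, r1, r2 = tok(i - 2), tok(i - 1), tok(i + 1), tok(i + 2)
    return (l2 + "-" + l1, l1 + "-" + r1, r1 + "-" + r2, l1, r1, l2, r2)

# Example: disambiguating "line" in "he crossed the line of scrimmage"
sent = "he crossed the line of scrimmage".split()
print(collocation_features(sent, sent.index("line")))
# ('crossed-the', 'the-of', 'of-scrimmage', 'the', 'of', 'crossed', 'scrimmage')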
<Paragraph position="4"> The accuracy figures of LEXAS as reported in (Ng and Lee, 1996) are reproduced in the third row of Table 1. These figures were obtained using all features, including part of speech and morphological form, surrounding words, local collocations, and verb-object syntactic relation. However, the feature value pruning method of (Ng and Lee, 1996) selects surrounding words and local collocations as feature values only if they are indicative of some sense class as measured by conditional probability (see (Ng and Lee, 1996) for details).</Paragraph>
<Paragraph position="5"> The next three rows show the accuracy figures of PEBLS using the parameter settings k = 1, k = 20, and 10-fold cross validation for finding the best k, respectively. The last row shows the accuracy figures of the Naive-Bayes algorithm. The accuracy figures of the last four rows are all based on only the seven collocation features described earlier in this section. However, all possible feature values (collocated words) are used, without employing the feature value pruning method of (Ng and Lee, 1996).</Paragraph>
<Paragraph position="6"> Note that the accuracy figures of PEBLS with k = 1 are 1.0% and 1.6% higher than the accuracy figures of (Ng and Lee, 1996) in the third row, also obtained with k = 1. The feature value pruning method of (Ng and Lee, 1996) is intended to keep only feature values deemed important for classification. It seems that the pruning method has filtered out some useful collocation values that improve classification accuracy, and that this unfavorable effect outweighs the benefit of the additional features (part of speech and morphological form, surrounding words, and verb-object syntactic relation) used.</Paragraph>
<Paragraph position="7"> Our results indicate that although Naive-Bayes performs better than PEBLS with k = 1, PEBLS with k = 20 achieves comparable performance. Furthermore, PEBLS with 10-fold cross validation to select the best k yields results slightly better than the Naive-Bayes algorithm.</Paragraph>
</Section>
<Section position="6" start_page="210" end_page="211" type="metho">
<SectionTitle> 5 Discussion </SectionTitle>
<Paragraph position="0"> To understand why larger values of k are needed, we examined the performance of PEBLS when tested on the WSJ6 test set. During the 10-fold cross validation runs on the training set, for each of the 191 words, we compared two error rates: the minimum expected error rate of PEBLS using the best k, and the expected error rate of the most frequent classifier. We found that for 13 of the 191 words, the minimum expected error rate of PEBLS using the best k is still higher than the expected error rate of the most frequent classifier. That is, for these 13 words, PEBLS will produce, on average, lower accuracy than the most frequent classifier.</Paragraph>
<Paragraph position="1"> Importantly, for 11 of these 13 words, the best k found by PEBLS is 85 or above. This indicates that on a training data set where PEBLS has trouble even outperforming the most frequent classifier, it tends to use a large value of k. This is to be expected, since for a large value of k PEBLS tends towards the behavior of the most frequent classifier: it finds the k closest matching training examples and selects the majority class among this large number of examples. Note that in the extreme case when k equals the size of the training set, PEBLS will behave exactly like the most frequent classifier.</Paragraph>
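The limiting behavior just described is easy to see with a toy nearest-neighbor classifier: once k equals the number of training examples, the majority vote is taken over the entire training set, so every test example receives the most frequent sense regardless of its features. The snippet below only illustrates this effect; it is not a measurement made with PEBLS, and the data and names are invented for the example.

from collections import Counter

def knn_predict(train, x, k):
    # Toy nearest-neighbor vote with a feature-mismatch distance.
    nearest = sorted(train, key=lambda ex: sum(a != b for a, b in zip(ex[0], x)))[:k]
    return Counter(s for _, s in nearest).most_common(1)[0][0]

# Six exemplars of sense1, four of sense2.
train = [(("a", "b"), "sense1")] * 6 + [(("c", "d"), "sense2")] * 4

# k = 1 finds the exactly matching exemplars; k = len(train) votes over the
# whole training set and therefore always returns the most frequent sense.
print(knn_predict(train, ("c", "d"), k=1))           # sense2
print(knn_predict(train, ("c", "d"), k=len(train)))  # sense1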
<Paragraph position="2"> Our results indicate that although PEBLS with k = 1 gives lower accuracy than Naive-Bayes, PEBLS with k = 20 performs as well as Naive-Bayes. Furthermore, PEBLS with k automatically selected using 10-fold cross validation gives slightly higher performance than Naive-Bayes. We believe that this result is significant, in light of the fact that Naive-Bayes has been found to give the best performance for WSD among seven state-of-the-art machine learning algorithms (Mooney, 1996). It demonstrates that an exemplar-based learning approach is suitable for the WSD task, achieving high disambiguation accuracy.</Paragraph>
<Paragraph position="3"> One potential drawback of an exemplar-based learning approach is the testing time required: each test example must be compared with every training example, so testing time grows linearly with the size of the training set. However, more sophisticated indexing methods such as the one reported in (Friedman et al., 1977) can reduce this to logarithmic expected time, which would significantly reduce testing time.</Paragraph>
<Paragraph position="4"> In the present study, we have focused on the comparison of learning algorithms, not on the feature representation of examples. Our past work (Ng and Lee, 1996) suggests that multiple sources of knowledge are indeed useful for WSD. Future work will explore the addition of these other features to further improve disambiguation accuracy.</Paragraph>
<Paragraph position="5"> Besides the parameter k, PEBLS also contains other learning parameters such as exemplar weights and feature weights. Exemplar weighting has been found to improve classification performance (Cost and Salzberg, 1993). Also, given the relative importance of the various knowledge sources reported in (Ng and Lee, 1996), it may be possible to improve disambiguation performance by introducing feature weighting. Future work can explore the effect of exemplar weighting and feature weighting on disambiguation accuracy.</Paragraph>
</Section>
</Paper>
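As a closing illustration of the feature-weighting idea raised in the final paragraph, the sketch below adds per-feature weights to a simple nearest-neighbor distance. The weight values and the weighting scheme are placeholders chosen for the example; they are only loosely analogous to, and not a reproduction of, the exemplar and feature weighting of Cost and Salzberg (1993).

from collections import Counter

def weighted_knn_predict(train, x, k, weights):
    # Per-feature weights let more informative collocations (for instance the
    # l1-r1 pair) dominate the distance; a mismatch on feature j costs weights[j].
    def dist(ex):
        return sum(w for w, a, b in zip(weights, ex[0], x) if a != b)
    nearest = sorted(train, key=dist)[:k]
    return Counter(s for _, s in nearest).most_common(1)[0][0]

# Placeholder weights for the seven features l2-l1, l1-r1, r1-r2, l1, r1, l2, r2.
weights = [1.0, 2.0, 1.0, 1.5, 1.5, 0.5, 0.5]

# Toy usage: the query differs from each exemplar on different features,
# and the weighted distance decides which neighbor wins.
train = [(("a", "b", "c", "d", "e", "f", "g"), "sense1"),
         (("a", "x", "c", "d", "e", "f", "y"), "sense2")]
query = ("a", "x", "c", "d", "e", "z", "g")
print(weighted_knn_predict(train, query, k=1, weights=weights))  # sense2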