<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0812"> <Title>Evaluating the Effectiveness of Ensembles of Decision Trees in Disambiguating Senseval Lexical Samples</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experimental Data </SectionTitle> <Paragraph position="0"> The English lexical sample for SENSEVAL-1 is made up of 35 words, six of which are used in multiple parts of speech. The training examples have been manually annotated based on the HECTOR sense inventory. There are 12,465 training examples, and 7,448 test instances. This corresponds to what is known as the trainable lexical sample in the SENSEVAL-1 official results.</Paragraph> <Paragraph position="1"> The English lexical sample for SENSEVAL-2 consists of 73 word types, each of which is associated with a single part of speech. There are 8,611 sense-tagged examples provided for training, where each instance has been manually assigned a WordNet sense. The evaluation data for the English lexical sample consists of 4,328 held-out test instances. The Spanish lexical sample for SENSEVAL-2 consists of 39 word types. There are 4,480 training examples that have been manually tagged with senses from EuroWordNet. The evaluation data consists of 2,225 test instances.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 System Results </SectionTitle> <Paragraph position="0"> This section (and Table 1) summarizes the performance of the top two participating systems in SENSEVAL-1 and SENSEVAL-2, as well as the Duluth3 and Duluth8 systems. Also included are baseline results for a decision stump and a majority classifier. A decision stump is simply a one-node decision tree based on a co-occurrence feature, while the majority classifier assigns the most frequent sense in the training data to every occurrence of that word in the test data.</Paragraph> <Paragraph position="1"> Results are expressed using accuracy, which is computed by dividing the total number of correctly disambiguated test instances by the total number of test instances. Official results from SENSEVAL are reported using precision and recall, so these are converted to accuracy to provide a consistent point of comparison. We utilize fine-grained scoring, where a word is considered correctly disambiguated only if it is assigned exactly the sense indicated in the manually created gold standard.</Paragraph> <Paragraph position="2"> In the English lexical sample task of SENSEVAL-1 the two most accurate systems overall were hopkins-revised (77.1%) and ets-pu-revised (75.6%). The Duluth systems did not participate in this exercise, but have been evaluated using the same data after the fact. The Duluth3 system reaches an accuracy of 70.3%. The simple majority classifier attains an accuracy of 56.4%.</Paragraph> <Paragraph position="3"> In the English lexical sample task of SENSEVAL-2 the two most accurate systems were JHU(R) (64.2%) and SMUls (63.8%). Duluth3 attains an accuracy of 57.3%, while a simple majority classifier attains an accuracy of 47.4%.</Paragraph> <Paragraph position="4"> In the Spanish lexical sample task of SENSEVAL-2 the two most accurate systems were JHU(R) (68.1%) and stanford-cs224n (66.9%). Duluth8 has an accuracy of 61.2%, while a simple majority classifier attains an accuracy of 47.4%.</Paragraph> <Paragraph position="5"> The top two systems from the first and second SENSEVAL exercises represent a wide range of strategies that we can only hint at here.
The SMUls English lexical sample system is perhaps the most distinctive in that it incorporates information from WordNet, the source of the sense distinctions in SENSEVAL-2. The hopkins-revised, JHU(R), and stanford-cs224n systems use supervised algorithms that learn classifiers from a rich combination of syntactic and lexical features. The ets-pu-revised system may be the closest in spirit to our own, since it creates an ensemble of two Naive Bayesian classifiers, where one is based on topical context and the other on local context.</Paragraph> <Paragraph position="6"> More detailed descriptions of the SENSEVAL-1 and SENSEVAL-2 systems and lexical samples can be found in (Kilgarriff and Palmer, 2000) and (Edmonds and Cotton, 2001), respectively.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Decomposition of Ensembles </SectionTitle> <Paragraph position="0"> The three bagged decision trees that make up Duluth38 are evaluated both individually and as pairwise ensembles. In Table 1 and subsequent discussion, we refer to the individual bagged decision trees based on unigrams, bigrams, and co-occurrences as U, B, and C, respectively. We designate ensembles that consist of two or three bagged decision trees by using the relevant combinations of letters. For example, UBC refers to a three-member ensemble consisting of unigram (U), bigram (B), and co-occurrence (C) decision trees, while BC refers to a two-member ensemble of bigram (B) and co-occurrence (C) decision trees. Note of course that UBC is synonymous with Duluth38.</Paragraph> <Paragraph position="1"> Table 1 shows that Duluth38 (UBC) achieves accuracy significantly better than the lower bounds represented by the majority classifier and the decision stump, and comes within seven percentage points of the most accurate systems in each of the three lexical sample tasks. However, UBC does not significantly improve upon all of its member classifiers, suggesting that the ensemble is made up of redundant rather than complementary classifiers.</Paragraph> <Paragraph position="2"> In general, the accuracies of the bigram (B) and co-occurrence (C) decision trees are never significantly different from the accuracy attained by the ensembles of which they are members (UB, BC, UC, and UBC), nor are they significantly different from each other. This is an intriguing result, since the co-occurrences represent a much smaller feature set than bigrams, which are in turn much smaller than the unigram feature set. Thus, the smallest of our feature sets is the most effective. This may be due to the fact that small feature sets are least likely to suffer from fragmentation during decision tree learning. Of the three individual bagged decision trees U, B, and C, the unigram tree (U) is significantly less accurate for all three lexical samples. It is only slightly more accurate than the decision stump for both English lexical samples, and is less accurate than the decision stump in the Spanish task.</Paragraph> <Paragraph position="3"> The relatively poor performance of unigrams can be accounted for by the large number of possible features. Unigram features consist of all words not in the stop-list that occur five or more times in the training examples for a word.
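To make this feature selection concrete, the sketch below (Python; the function name, the list-of-token-lists input format, and the stop-list variable are our own illustrative assumptions, since no code is given here) shows one way such a unigram feature set could be collected for a single target word:

from collections import Counter

def unigram_features(training_contexts, stop_list, min_freq=5):
    # training_contexts: one list of tokens per training instance of the
    # target word (an assumed, illustrative representation).
    counts = Counter(token.lower()
                     for context in training_contexts
                     for token in context)
    # Keep every word outside the stop-list that occurs at least min_freq
    # times in the training examples, as described above.
    return {word for word, count in counts.items()
            if count >= min_freq and word not in stop_list}

Each selected word would then typically serve as a binary feature recording its presence or absence in the context of an instance.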
The decision tree learner must search through a very large feature space, and under such circumstances may fall victim to fragmentation.</Paragraph> <Paragraph position="4"> Despite these results, we are not prepared to dismiss the use of ensembles or unigram decision trees. An ensemble of unigram and co-occurrence decision trees (UC) results in greater accuracy than any other lexical decision tree for the English SENSEVAL-1 lexical sample, and is essentially tied with the most accurate of these approaches (UBC) in the English SENSEVAL-2 lexical sample. In principle, unigrams and co-occurrence features are complementary, since unigrams represent topical context, and co-occurrences represent local context.</Paragraph> <Paragraph position="5"> This follows the line of reasoning developed by (Leacock et al., 1998) in formulating their ensemble of Naive Bayesian classifiers for word sense disambiguation. Adding the bigram decision tree (B) to the ensemble of the unigram and co-occurrence decision trees (UC) to create UBC does not result in significant improvements in accuracy for any of the lexical samples. This reflects the fact that the bigram and co-occurrence feature sets can be redundant. Bigrams are two-word sequences that occur anywhere within the context of the ambiguous word, while co-occurrences are bigrams that include the target word and a word one or two positions away. Thus, any consecutive two-word sequence that includes the word to be disambiguated and has a log-likelihood ratio greater than the specified threshold will be considered both a bigram and a co-occurrence.</Paragraph> <Paragraph position="6"> Despite the partial overlap between bigrams and co-occurrences, we believe that retaining them as separate feature sets is a reasonable idea. We have observed that an ensemble of multiple decision trees, each learned from a representation of the training examples with a small number of features, is more accurate than a single decision tree learned from one large representation of the training examples. For example, we mixed the bigram and co-occurrence features into a single feature set, and then learned a single bagged decision tree from this representation of the training examples. We observed drops in accuracy in both the Spanish and English SENSEVAL-2 lexical sample tasks. For Spanish accuracy falls from 59.4% to 58.2%, and for English it drops from 57.2% to 54.9%. Interestingly enough, this mixed feature set of bigrams and co-occurrences results in a slight increase over an ensemble of the two in the SENSEVAL-1 data, rising from 71.3% to 71.5%.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Agreement Among Systems </SectionTitle> <Paragraph position="0"> The results in Table 1 show that UBC and its member classifiers perform at levels of accuracy significantly higher than the majority classifier and decision stumps, and approach the level of some of the more accurate systems. This poses an intriguing possibility. If UBC is making errors complementary to those of the other systems, then it might be possible to combine these systems to achieve an even higher level of accuracy.
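One way to check for such complementary errors is to tabulate, for a pair of systems, how many test instances both, exactly one, or neither system disambiguates correctly; the number of instances that neither system resolves bounds how much any combination of the two could gain. A minimal sketch of this tabulation (Python; the dictionaries of answers keyed by test instance identifier are a hypothetical format, not one used by the SENSEVAL scorers) is:

def correct_instances(answers, gold):
    # Test instances a system disambiguates correctly under fine-grained scoring.
    return {i for i, sense in answers.items() if sense == gold[i]}

def pairwise_agreement(answers_a, answers_b, gold):
    # Count the test instances that both systems, exactly one system, or
    # neither system gets correct.
    a = correct_instances(answers_a, gold)
    b = correct_instances(answers_b, gold)
    both = a.intersection(b)
    either = a.union(b)
    one = either - both
    neither = set(gold) - either
    return len(both), len(one), len(neither)

The same bookkeeping extends to the three-way and five-way comparisons reported below by intersecting and unioning more than two sets of correctly disambiguated instances.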
The alternative is that the decision trees based on lexical features are largely redundant with these other systems, and that there is a hard core of test instances that are resistant to disambiguation by any of these systems.</Paragraph> <Paragraph position="1"> We performed a series of pairwise comparisons to establish the degree to which these systems agree.</Paragraph> <Paragraph position="2"> We included the two most accurate participating systems from each of the three lexical sample tasks, along with UBC, a decision stump, and a majority classifier.</Paragraph> <Paragraph position="3"> In Table 2 the column labeled &quot;both&quot; shows the percentage and count of test instances where both systems are correct, the column labeled &quot;one&quot; shows the percentage and count where only one of the two systems is correct, and the column labeled &quot;none&quot; shows how many test instances were not correctly disambiguated by either system. We note that in the pairwise comparisons there is a high level of agreement for the instances that both systems were able to disambiguate, regardless of the systems involved. For example, in the SENSEVAL-1 results the three pairwise comparisons among UBC, hopkins-revised, and ets-pu-revised all show that approximately 65% of the test instances are correctly disambiguated by both systems. The same is true for the English and Spanish lexical sample tasks in SENSEVAL-2, where each pairwise comparison results in agreement in approximately half the test instances. Next we extend this study of agreement to a three-way comparison between UBC, hopkins-revised, and ets-pu-revised for the SENSEVAL-1 lexical sample. There are 4,507 test instances where all three systems agree (60.5%), and 973 test instances (13.1%) that none of the three is able to get correct.</Paragraph> <Paragraph position="4"> These are remarkably similar values to the pairwise comparisons, suggesting that there is a fairly consistent number of test instances that all three systems handle in the same way. When making a five-way comparison that includes these three systems and the decision stump and the majority classifier, the number of test instances that no system can disambiguate correctly drops to 888, or 11.93%. This is interesting in that it shows there are nearly 100 test instances that are only disambiguated correctly by the decision stump or the majority classifier, and not by any of the other three systems. This suggests that very simple classifiers are able to resolve some test instances that more complex techniques miss.</Paragraph> <Paragraph position="5"> The agreement when making a three-way comparison between UBC, JHU(R), and SMUls in the English SENSEVAL-2 lexical sample drops somewhat from the pairwise levels. There are 1,791 test instances that all three systems disambiguate correctly (41.4%) and 828 instances that none of these systems get correct (19.1%). When making a five-way comparison between these three systems, the decision stump, and the majority classifier, there are 755 test instances (17.4%) that no system can resolve.</Paragraph> <Paragraph position="6"> This shows that these three systems are performing somewhat differently, and do not agree as much as the SENSEVAL-1 systems.</Paragraph> <Paragraph position="7"> The agreement when making a three-way comparison between UBC, JHU(R), and cs224n in the Spanish lexical sample task of SENSEVAL-2 remains fairly consistent with the pairwise comparisons.
There are 960 test instances that all three systems get correct (43.2%), and 308 test instances where all three systems fail (13.8%). When making a five-way comparison between these three systems and the decision stump and the majority classifier, there are 237 test instances (10.7%) where no system is able to resolve the sense. Here again we see three systems that are handling quite a few test instances in the same way.</Paragraph> <Paragraph position="8"> Finally, the percentage of cases where neither the decision stump nor the majority classifier is correct varies from 33% to 43% across the three lexical samples. This suggests that the optimal combination of a majority classifier and decision stump could attain overall accuracy between 57% and 66%, which is comparable with some of the better results for these lexical samples. Of course, how to achieve such an optimal combination is an open question. This is still an interesting point, since it suggests that there is a relatively large number of test instances that require fairly minimal information to disambiguate successfully.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Duluth38 Background </SectionTitle> <Paragraph position="0"> The origins of Duluth38 can be found in an ensemble approach based on multiple Naive Bayesian classifiers that perform disambiguation via a majority vote (Pedersen, 2000). Each member of the ensemble is based on unigram features that occur in windows of varying size to the left and right of the ambiguous word. The sizes of these windows are 0, 1, 2, 3, 4, 5, 10, 25, and 50 words to the left and to the right, essentially forming bags of words to the left and right. The accuracy of this ensemble in disambiguating the nouns interest (89%) and line (88%) is as high as any previously published results. However, each ensemble consists of 81 Naive Bayesian classifiers, making it difficult to determine which features and classifiers were contributing most significantly to disambiguation.</Paragraph> <Paragraph position="1"> The frustration with models that lack an intuitive interpretation led to the development of decision trees based on bigram features (Pedersen, 2001a).</Paragraph> <Paragraph position="2"> This is quite similar to the bagged decision trees of bigrams (B) presented here, except that the earlier work learns a single decision tree where training examples are represented by the top 100 bigrams, as ranked by the log-likelihood ratio. This earlier approach was evaluated on the SENSEVAL-1 data and achieved an overall accuracy of 64%, whereas the bagged decision tree presented here achieves an accuracy of 68% on that data.</Paragraph> <Paragraph position="3"> Our interest in co-occurrence features is inspired by (Choueka and Lusignan, 1985), who showed that humans determine the meaning of ambiguous words largely based on words that occur within one or two positions to the left and right. Co-occurrence features, generically defined as bigrams where one of the words is the target word and the other occurs within a few positions, have been widely used in computational approaches to word sense disambiguation. When the impact of mixed feature sets on disambiguation is analyzed, co-occurrences usually prove to contribute significantly to overall accuracy. This is certainly our experience, where the co-occurrence decision tree (C) is the most accurate of the individual lexical decision trees.
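As an illustration of how such features might be gathered, the sketch below (Python; the function name and the assumption that each training instance is a token list with a known target position are ours, and the log-likelihood ratio test described in Section 5 is deliberately omitted) collects candidate co-occurrences for a single target word:

from collections import Counter

def cooccurrence_candidates(training_contexts, target_positions, window=2):
    # Count the words that appear within `window` positions to the left or
    # right of the target word across all training instances.
    counts = Counter()
    for tokens, t in zip(training_contexts, target_positions):
        for i in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if i != t:
                counts[tokens[i].lower()] += 1
    # In the Duluth systems candidates like these are then filtered by a
    # log-likelihood ratio threshold before being used as features; that
    # statistical test is not shown here.
    return counts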
Likewise, (Ng and Lee, 1996) report an overall accuracy for the noun interest of 87%, and find that when their feature set consists only of co-occurrence features the accuracy drops only to 80%.</Paragraph> <Paragraph position="4"> Our interest in bigrams was indirectly motivated by (Leacock et al., 1998), who describe an ensemble approach made up of local context and topical context. They suggest that topical context can be represented by words that occur anywhere in a window of context, while local contextual features are words that occur within close proximity to the target word. They show that in disambiguating the adjective hard and the verb serve, the local context is most important, while for the noun line the topical context is most important. We believe that statistically significant bigrams that occur anywhere in the window of context can serve the same role, in that such a two-word sequence is likely to carry heavy semantic (topical) or syntactic (local) weight.</Paragraph> </Section> </Paper>