<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0811"> <Title>Combining Heterogeneous Classifiers for Word-Sense Disambiguation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The System </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Training Procedure </SectionTitle> <Paragraph position="0"> Figure 1 shows the high-level organization of our system. Individual first-order classifiers each map lists of context word tokens to word-sense predictions, and are self-contained WSD systems. The first-order classifiers are combined in a variety of ways with second-order classifiers. Second-order classifiers are selectors, taking a list of first-order out-July 2002, pp. 74-80. Association for Computational Linguistics. Disambiguation: Recent Successes and Future Directions, Philadelphia, Proceedings of the SIGLEX/SENSEVAL Workshop on Word Sense 6 Choose the ensemble Ew;k to be the top k classifiers 7 For each voting method m 8 Train the (k, m) second-order classifier with Ew;k 9 Rank the second-order classifier types (k, m) globally. 10 Rank the second-order classifier instances locally.</Paragraph> <Paragraph position="1"> 11 Choose the top-ranked second-order classifier for each word. 12 Retrain chosen per-word classifiers on entire training data. 13 Run these classifiers on test data, and evaluate results. puts and choosing from among them. An outline of the classifier construction process is given in table 1. First, the training data was split into training and held-out sets for each word. This was done using 5 random bootstrap splits. Each split allocated 75% of the examples to training and 25% to held-out testing.2 Held-out data was used both to select the subsets of first-order classifiers to be combined, and to select the combination methods.</Paragraph> <Paragraph position="2"> For each word and each training split, the 23 first-order classifiers were (independently) trained and tested on held-out data. For each word, the first-order classifiers were ranked by their average performance on the held-out data, with the most accurate classifiers at the top of the rankings. Ties were broken by the classifiers' (weighted) average perfomance across all words.</Paragraph> <Paragraph position="3"> For each word, we then constructed a set of can2Bootstrap splits were used rather than standard n-fold cross-validation for two reasons. First, it allowed us to generate an arbitrary number of training/held-out pairs while still leaving substantial held-out data set sizes. Second, this approach is commonly used in the literature on ensembles. Its well-foundedness and theoretical properties are discussed in Breiman (1996). In retrospect, since we did not take proper advantage of the ability to generate numerous splits, it might have been just as well to use cross-validation.</Paragraph> <Paragraph position="4"> didate second-order classifiers. Second-order classifier types were identified by an ensemble size k and a combination method m. One instance of each second-order type was constructed for each word.</Paragraph> <Paragraph position="5"> We originally considered ensemble sizes k in the range f1; 3; 5; 7; 9; 11; 13; 15g. 
We originally considered ensemble sizes k in the range {1, 3, 5, 7, 9, 11, 13, 15}. For a second-order classifier with ensemble size k, the ensemble members were the top k first-order classifiers according to the local rank described above.

We combined first-order ensembles using one of three methods m:

Majority voting: The sense output by the most first-order classifiers in the ensemble was chosen. Ties were broken by sense frequency, in favor of more frequent senses.

Weighted voting: Each first-order classifier was assigned a voting weight (see below). The sense receiving the greatest total weighted vote was chosen.

Maximum entropy: A maximum entropy classifier was trained (see below) and run on the outputs of the first-order classifiers.

We considered all pairs of k and m, and so for each word there were 24 possible second-order classifiers, though for k = 1 all three values of m are equivalent and were merged. The k = 1 ensemble, as well as the larger ensembles (k ∈ {9, 11, 13, 15}), did not help performance once we had good first-order classifier rankings (see section 3.4).

For m = Majority, there are no parameters to set. For the other two methods, we set the parameters of the (k, m) second-order classifier for a word w using the bootstrap splits of the training data for w.

In the same manner as for the first-order classifiers, we then ranked the second-order classifiers. For each word, there was a local ranking of the second-order classifiers, given by their (average) accuracy on held-out data. Ties in these rankings were broken by the average performance of the classifier type across all words. The top second-order classifier for each word was selected from these tie-broken rankings.

At this point, all first-order ensemble members and chosen second-order combination methods were retrained on the unsplit training data and run on the final test data.

It is important to stress that each target word was considered an entirely separate task, and different first- and second-order choices could be, and were, made for each word (see the discussion of Table 2 below). Aggregate performance across words was only used for tie-breaking.

2.2 Combination Methods

Our second-order classifiers take training instances of the form s̄ = (s, s_1, ..., s_k), where s is the correct sense and each s_i is the sense chosen by classifier i.

All three of the combination schemes which we used can be seen as weighted voting, with different ways of estimating the voting weights λ_i of the first-order voters. In the simplest case, majority voting, we skip any attempt at statistical estimation and simply set each λ_i to 1/k.

For the method we actually call "weighted voting," we view the combination output as a mixture model in which each first-order system is a mixture component:

    P(s \mid s_1, \ldots, s_k) = \sum_{i=1}^{k} \lambda_i \, P(s \mid s_i)

The conditional probabilities P(s | s_i) assign mass one to the sense s_i chosen by classifier i. The mixture weights λ_i were estimated using EM to maximize the likelihood of the second-order training instances. In testing, the sense with the highest weighted vote, and hence the highest posterior probability, is selected.
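Because each component P(s | s_i) puts all of its mass on classifier i's chosen sense, the EM update for the weights λ_i takes a very simple form. The sketch below is illustrative only, not the actual implementation: the instance format is assumed, and instances on which no ensemble member is correct are simply skipped, a choice not specified in the text (as noted below, the weights are initialized uniformly and EM is halted after a single round).

```python
def em_mixture_weights(instances, k, rounds=1):
    """Estimate mixture weights lambda_i for the 'weighted voting' combiner.

    instances: list of (true_sense, [s_1, ..., s_k]) second-order training
               instances (assumed format).
    """
    lam = [1.0 / k] * k                      # uniform initialization
    for _ in range(rounds):
        expected = [0.0] * k
        n_used = 0
        for true_sense, preds in instances:
            # E-step: responsibility of component i is proportional to
            # lambda_i if classifier i chose the true sense, else 0.
            mass = [lam[i] if preds[i] == true_sense else 0.0 for i in range(k)]
            total = sum(mass)
            if total == 0.0:                 # no classifier correct: skip (assumption)
                continue
            n_used += 1
            for i in range(k):
                expected[i] += mass[i] / total
        # M-step: new weights are the normalized expected counts.
        if n_used:
            lam = [e / n_used for e in expected]
    return lam
```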
For the maximum entropy classifier, we have a different model for the chosen sense s. In this case, it is an exponential model of the form:

    P(s \mid s_1, \ldots, s_k) \propto \exp\Big( \sum_x \lambda_x \, f_x(s, s_1, \ldots, s_k) \Big)

The features f_x are functions which are true over some subset of vectors s̄. The original intent was to design features to recognize and exploit "sense expertise" in the individual classifiers. For example, one classifier might be trustworthy when reporting a certain sense but less so for other senses. However, there was not enough data to accurately estimate parameters for such models. [3] In fact, we noticed that, for certain words, simple majority voting performed better than the maximum entropy model.

[3] The number of features was not large: only one for each (classifier, chosen sense, correct sense) triple. However, most senses are rarely chosen and rarely correct, so most features had zero or singleton support.

It also turned out that the most complex features we could get value from were features of the form:

    f_i(s, s_1, \ldots, s_k) = 1 \iff s = s_i

That is, for each first-order classifier, there is a single feature which is true exactly when that classifier is correct. With only these features, the maximum entropy approach also reduces to a weighted vote; the s which maximizes the posterior probability P(s | s_1, ..., s_k) also maximizes the vote:

    v(s) = \sum_i \lambda_i \, \mathbf{1}(s_i = s)

The indicators 1(s_i = s) are true for exactly one sense each, and correspond to the simple f_i defined above. [4] The sense with the largest vote v(s) will be the sense with the highest posterior probability P(s | s_1, ..., s_k) and will be chosen.

[4] If the i-th classifier returns the correct sense s, then 1(s_i = s) is 1; otherwise it is zero.

For the maximum entropy classifier, we estimate the weights by maximizing the likelihood of a held-out set, using the standard IIS algorithm (Berger et al., 1996). For both weighted schemes, we found that stopping the iterative procedures before convergence gave better results: IIS was halted after 50 rounds, while EM was halted after a single round. Both methods were initialized to uniform starting weights.

More important than the exact weight estimates, moving from method to method triggers broad qualitative changes in what kinds of weights are allowed. With majority voting, all classifiers have equal, positive weights. With weighted voting, the weights are no longer required to be equal, but are still non-negative. With maximum entropy weighting, this non-negativity constraint is also relaxed, allowing a classifier's vote to actually reduce the score of the sense that classifier has chosen. Negative weights are in fact assigned quite frequently, and often seem to have the effect of using poor classifiers as "error masks" to cancel out common errors.

As we move from majority voting to weighted voting to maximum entropy, the estimation becomes more sophisticated, but also more prone to overfitting. Since solving the overfitting problem is hard, while choosing between classifiers based on held-out data is relatively easy, this spectrum gives us a way to gracefully handle the range of sparsities in the training corpora for different words.
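At prediction time, all three methods therefore score senses with the same vote v(s); only the origin and the allowed sign of the weights differ. The sketch below illustrates that shared scoring step; it is not the actual implementation, and ties (which majority voting breaks by sense frequency) are left to Python's max.

```python
from collections import defaultdict

def vote(predictions, weights):
    """Score each sense by v(s) = sum_i lambda_i * 1(s_i = s) and return the argmax.

    predictions: senses s_1..s_k chosen by the ensemble members
    weights:     lambda_1..lambda_k, covering the whole spectrum above:
                 all 1/k (majority), non-negative EM weights (weighted voting),
                 or possibly negative maximum entropy weights, in which case a
                 poor classifier's vote lowers the score of the sense it chose.
    """
    totals = defaultdict(float)
    for sense, lam in zip(predictions, weights):
        totals[sense] += lam
    return max(totals, key=totals.get)
```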
2.3 Individual Classifiers

While our first-order classifiers implemented a variety of classification algorithms, the differences in their individual accuracies did not primarily stem from the algorithm chosen. Rather, implementation details led to the largest differences. For example, naive-Bayes classifiers which chose sensible window sizes, or dynamically chose between window sizes, tended to outperform those which chose poor sizes. Generally, the optimal windows were either of size one (for words with strong local syntactic or collocational cues) or very large (which detected more topical cues). Programs with hard-wired window sizes of, say, 5, performed poorly. Ironically, such middle-sized windows were commonly chosen by students, but were rarely useful; either extreme was a better design.

Another implementation choice that dramatically affected the performance of naive-Bayes systems was the amount and type of smoothing. Heavy smoothing, and smoothing which backed off the conditional distributions P(w_j | s_i) to the relevant marginal P(w_j), gave good results, while insufficient smoothing or backing off to uniform marginals gave substantially degraded results. [6]

[6] (fragment) ... where, when one smooths far too little, the chosen sense is the one which has occurred with the most words in the context window. For small training sets of skewed-prior data like the SENSEVAL-2 sets, this is invariably the most common sense, regardless of the context words.

There is one significant way in which our first-order classifiers were likely different from other teams' systems. In the original class project, students were guaranteed that the ambiguous word would appear only in a single orthographic form, and many of the systems depended on the input satisfying this guarantee. Since this was not true of the SENSEVAL-2 data, we mapped the ambiguous words (but not the context words) to a citation form. We suspect that this lost quite a bit of information and negatively affected the system's overall performance, since there is considerable correlation between form and sense, especially for verbs. Nevertheless, we have made no attempt to re-engineer the student systems, and have not thoroughly investigated how big a difference this stemming made.
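One common way to realize the "back off to the relevant marginal" smoothing described above is to interpolate the per-sense relative frequency with the overall word marginal. The sketch below illustrates that idea only; the interpolation weight alpha and the data layout are assumptions, and the student systems' actual smoothing schemes are not described beyond the text above.

```python
from collections import Counter

def backed_off_word_model(training_examples, alpha=0.5):
    """Return a smoothed estimator of P(w | s) for a naive-Bayes WSD system,
    interpolating the per-sense estimate with the marginal P(w).

    training_examples: list of (sense, context_words) pairs (assumed format).
    alpha:             hypothetical interpolation weight toward the per-sense estimate.
    """
    joint = Counter()          # (sense, word) counts
    per_sense = Counter()      # words observed with each sense
    marginal = Counter()       # word counts regardless of sense
    total = 0
    for sense, words in training_examples:
        for w in words:
            joint[(sense, w)] += 1
            per_sense[sense] += 1
            marginal[w] += 1
            total += 1

    def p_w_given_s(w, s):
        p_marg = marginal[w] / total if total else 0.0
        p_cond = joint[(s, w)] / per_sense[s] if per_sense[s] else 0.0
        return alpha * p_cond + (1.0 - alpha) * p_marg

    return p_w_given_s
```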
3 Results and Discussion

3.1 Results

Table 2 shows the results per word, and Table 3 shows results by part of speech and overall, for the SENSEVAL-2 English lexical sample task. Table 2 also shows which second-order classifier was selected for each word. 54.2% of the time, we made an optimal second-order classifier choice. When we chose wrongly, we usually made a mistake in either ensemble size or method, rarely both. A wide range of second-order classifier types were chosen.

As an overview of the benefit of combination: the globally best single classifier scored 61.2%, the locally best single classifier (best on test data) scored 62.2%, the globally best second-order classifier (ME-7, best on test data) scored 63.2%, and our dynamic selection method scored 63.9%. Section 3.3 examines combination effectiveness more closely.

3.2 Changes from SENSEVAL-2

The system we originally submitted to the SENSEVAL-2 competition had an overall accuracy of 61.7%, putting it in 4th place in the revised rankings (among 21 supervised and 28 total systems). Assuming that our first-order classifiers were fixed black boxes, we wanted an idea of how good our combination and selection methods were.

To isolate the effectiveness of our second-order classifier choices, we compared our system to an oracle method (OR-BEST) which chose a word's second-order classifier based on test data (rather than held-out data). The overall accuracy of this oracle method was 65.4% at the time, a jump of 3.7%. This gap was larger than the gap between the various top-scoring teams' systems.

[Table 2 caption (fragment): ... the SENSEVAL-2 English lexical sample task. Lower bound (LB): ALL is how often all of the first-order classifiers chose correctly. Baselines (BL): MFS is the most-frequent-sense baseline; SNG is the best single first-order classifier as chosen on held-out data for that word. Fixed combinations: majority vote (MJ), weighted vote (WT), maximum entropy (ME). Oracle bound (OR): BEST is the best second-order classifier as measured on the test data. Upper bound (UB): SOME is how often at least one first-order classifier produced the correct answer. Methods which are ensemble-size dependent are shown for k = 7. System choices: ACC is the accuracy of the selection the system makes based on held-out data; CL is the second-order classifier selected.]

Therefore, while the test-set performance of our second-order classifier choices was reasonable, this gap suggested that a more sophisticated or better-tuned method of selecting combination models could lead to significant improvement. In fact, changing only the ranking methods, which are discussed further in the next section, raised our system's final accuracy to the current score of 63.9%, which would have placed it 1st in the SENSEVAL-2 preliminary results or 2nd in the revised results. Our final accuracy is thus higher than that of the first draft of the system, and, in particular, the classifier-selection gap between actual performance and the OR-BEST oracle has been substantially decreased.

[Figure 2 caption (fragment): ... as the ensemble size varies. The three combination methods are shown. In addition, the globally best single classifier is the single first-order classifier with the highest overall accuracy on the test data. Chosen combination is our final system's score. These two are both independent of k in this graph.]

In addition, since the top first-order classifiers were more reliably identified, larger ensembles were no longer beneficial in the revised system, for an interesting reason. When the first-order rankings were poorly estimated, large ensembles and weighted methods were important for achieving good accuracy, because the weighting scheme could "rescue" good classifiers which had been incorrectly ranked low. In our current system, however, first-order classifiers were ranked reliably enough that we could restrict our ensemble sizes to k ∈ {1, 3, 5, 7}. Furthermore, since k = 1 was only chosen a few times, usually among ties, we removed that option as well.
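The OR-BEST comparison amounts to selecting each word's second-order classifier by test accuracy instead of held-out accuracy. The sketch below shows that bookkeeping; it is illustrative only, the input dictionaries are assumed structures, and the tie-breaking by classifier-type performance across words (section 2.1) is omitted.

```python
def selection_vs_oracle(held_out_acc, test_acc, n_instances):
    """Compare held-out-based selection with the OR-BEST oracle.

    held_out_acc: {word: {candidate (k, m): held-out accuracy}}  (assumed)
    test_acc:     {word: {candidate (k, m): test accuracy}}      (assumed)
    n_instances:  {word: number of test instances}
    Returns (system accuracy, oracle accuracy) over all test instances.
    """
    system_correct = oracle_correct = total = 0.0
    for word, candidates in test_acc.items():
        n = n_instances[word]
        chosen = max(held_out_acc[word], key=held_out_acc[word].get)  # our choice
        best = max(candidates, key=candidates.get)                    # OR-BEST
        system_correct += candidates[chosen] * n
        oracle_correct += candidates[best] * n
        total += n
    return system_correct / total, oracle_correct / total
```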
3.3 Combination Methods and Ensemble Size

Our system differs from the typical ensemble of classifiers in that the first-order classifiers are not merely perturbations of each other, but are highly varied in both quality and character. This scenario has been investigated before, e.g. by Zhang et al. (1992), but it is not the common case. With such heterogeneity, having more classifiers is not always better. Figure 2 shows how the three combination methods' average scores varied with the number of component classifiers used. Initially, accuracy increases as added classifiers bring value to the ensemble. However, as lower-quality classifiers are added, the better classifiers are steadily drowned out. The weighted-vote and maximum entropy combinations are much less affected by low-quality classifiers than the majority vote, since they can suppress them with low weights. Still, majority vote over small ensembles was effective for some words where weights could not be usefully set by the other methods.

3.4 Ranking Methods

Because of the effects described above, it was necessary to identify which classifiers were worth including for a given word. A global ranking of first-order classifiers, averaged over all words, was not effective because the strengths of the classifiers were so different. In fact, every single first-order classifier was a top-5 performer on at least one word.

On the other hand, the SENSEVAL-2 training sets were often very small and heavily skewed towards the most frequent sense. As a result, accuracy estimates based on a single word's held-out data produced frequent ties. The average size of the largest set of tied first-order classifiers per word was 3.6 (with a maximum of 23 on the word "collaborate", where all 23 classifiers tied). The second-order local rankings also produced many ties. For the top position (the most important one for second-order ranks), 43.1% of the words had local ties.

In our submitted entry, all ties were broken unintelligently (in an arbitrary manner based on the order in which systems were listed in a file). The approach of local ranking with global tie-breaking presented in this paper was much more successful according to two distinct measures. First, it predicted the true ranks more accurately, as measured by the Spearman rank correlation (0.08 for global ranks, 0.63 for globally-broken local ranks). Second, it gave better final accuracy scores (63.5% with global ranks, 63.9% with globally-broken local ranks; significant only at p = 0.1 by a sign test at the word-type level).
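The two ranking schemes compared above differ only in their sort key. A minimal sketch of both is given below, assuming per-classifier accuracies are available as dictionaries; the arbitrary file-order tie-breaking of the submitted entry is not reproduced.

```python
def globally_broken_local_ranking(local_acc, global_acc):
    """Rank classifiers for one word by local (held-out) accuracy, breaking
    ties by average accuracy across all words (the final scheme above).

    local_acc, global_acc: hypothetical {classifier_id: accuracy} dicts.
    """
    return sorted(local_acc,
                  key=lambda c: (local_acc[c], global_acc.get(c, 0.0)),
                  reverse=True)

def global_ranking(local_acc, global_acc):
    """The weaker alternative: ignore the word-specific scores entirely."""
    return sorted(local_acc,
                  key=lambda c: global_acc.get(c, 0.0),
                  reverse=True)
```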
The other ranking that our system attempts to estimate is the per-word ranking of the second-order classifiers. In this case, however, we are only ever concerned with which classifier ends up ranked first, since only that classifier is chosen. Again, globally-broken local ranks were the most effective, choosing a second-order classifier which was actually top-performing on test data for 54% of the words, as opposed to 50% for global selection (and increasing the overall accuracy from 62.8% to 63.9%; significant at p = 0.01 by a sign test).

These results stress that ranking, and effective tie-breaking, are important for a system such as ours, where the classifiers are so divergent in behavior.

3.5 Combination

When combining classifiers, one would like to know when and how the combination will outperform the individuals. One factor (Tumer and Ghosh, 1996) is how complementary the mistakes of the individual classifiers are. If all make the same mistakes, combination can do no good. We can measure this complementarity by averaging, over all pairs of first-order classifiers, the fraction of errors that the pair does not have in common. This gives the average pairwise error independence (api). Another factor is the difficulty of the word being disambiguated. A high most-frequent-sense baseline (BL-MFS) means that there is little room for improvement by combining classifiers.

Figure 3 shows, for the global top 7 first-order classifiers, the absolute gain between their average accuracy (BL-AVG-7) and the accuracy of their majority combination (MJ-7). The quantity on the x-axis is the difference between the pairwise independence and the baseline accuracy. The pattern is loose, but clear: when either the independence or the word's difficulty (as indicated by the BL-MFS baseline) increases, the combination tends to win by a greater amount.

Figure 4 shows how the average pairwise independent error fraction (api) varies as we add classifiers. Here classifiers are added in an order based on their accuracy on the entire test set. For each k, the average is over all pairs of classifiers in the top k and all samples of all words. This graph should be compared to Figure 2. After the third classifier, adding classifiers reduces the api, and the performance of the majority vote begins to drop at exactly this point. However, the weighted methods continue to gain in accuracy, since they have the capacity to downweight classifiers which hurt held-out accuracy.

The drop in api reflects that the newly added systems are no longer bringing many new correct answers to the collection. However, they can still add deciding votes in areas where the ensemble had the right answer but did not choose it. The final gradual rise in api reflects the somewhat patternless new errors that substantially lower-performing systems unfortunately bring to the ensemble.
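No formula for the api statistic is given above, so the sketch below uses one plausible formalization as an assumption: a pair's errors count as shared when both classifiers are wrong on the same instance, each pair contributes one minus the shared fraction of its pooled errors, and the result is averaged over all pairs. Higher values then correspond to more complementary (more independent) errors.

```python
from itertools import combinations

def average_pairwise_error_independence(predictions, gold):
    """Approximate the api measure discussed above (formalization assumed).

    predictions: {classifier_id: [predicted sense for each instance]}
    gold:        [correct sense for each instance]
    """
    def error_indices(preds):
        return {i for i, (p, g) in enumerate(zip(preds, gold)) if p != g}

    errors = {c: error_indices(p) for c, p in predictions.items()}
    scores = []
    for a, b in combinations(errors, 2):
        pooled = errors[a] | errors[b]
        if not pooled:                     # neither classifier erred at all
            scores.append(1.0)
            continue
        shared = errors[a] & errors[b]
        scores.append(1.0 - len(shared) / len(pooled))
    return sum(scores) / len(scores) if scores else 0.0
```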