<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0706">
<Title>A Comparison between Supervised Learning Algorithms for Word Sense Disambiguation*</Title>
<Section position="3" start_page="0" end_page="32" type="metho">
<SectionTitle> 2 Learning Algorithms Tested </SectionTitle>
<Paragraph position="0"> Naive Bayes (NB). Naive Bayes is intended as a simple representative of statistical learning methods. It has been used in its most classical setting (Duda and Hart, 1973). That is, assuming the independence of features, it classifies a new example by assigning the class that maximizes the conditional probability of the class given the observed sequence of features of that example. Model probabilities are estimated during the training process using relative frequencies. To avoid the effect of zero counts, a very simple smoothing technique has been used, which was proposed in (Ng, 1997).</Paragraph>
<Paragraph position="1"> Despite its simplicity, Naive Bayes is claimed to obtain state-of-the-art accuracy on supervised WSD in many papers (Mooney, 1996; Ng, 1997; Leacock et al., 1998).</Paragraph>
<Paragraph position="2"> Exemplar-based Classifier (EB). In exemplar-based, instance-based, or memory-based learning (Aha et al., 1991), no generalization of training examples is performed. Instead, the examples are simply stored in memory and the classification of new examples is based on the most similar stored exemplars. In our implementation, all examples are kept in memory and the classification is based on a k-NN (Nearest Neighbours) algorithm using Hamming distance to measure closeness. For values of k greater than 1, the resulting sense is the weighted majority sense of the k nearest neighbours, where each example votes for its sense with a strength proportional to its closeness to the test example.</Paragraph>
<Paragraph position="3"> Exemplar-based learning has been claimed to be the best option for WSD (Ng, 1997). Other authors (Daelemans et al., 1999) point out that exemplar-based methods tend to be superior in language learning problems because they do not forget exceptions.</Paragraph>
<Paragraph position="4"> The SNoW Architecture (SN). SNoW is a Sparse Network of linear separators which utilizes the Winnow learning algorithm, a linear threshold algorithm with multiplicative weight updating for 2-class problems. In the SNoW architecture there is a Winnow node for each class, which learns to separate that class from all the rest. During training, which is performed in an on-line fashion, each example is considered a positive example for the Winnow node associated with its class and a negative example for all the others. A key point that allows fast learning is that the Winnow nodes are not connected to all features but only to those that are &quot;relevant&quot; for their class.</Paragraph>
<Paragraph position="5"> When classifying a new example, SNoW behaves like a neural network that takes the input features and outputs the class with the highest activation. Our implementation of SNoW for WSD is explained in (Escudero et al., 2000c).</Paragraph>
<Paragraph position="6"> SNoW has been shown to perform very well in high-dimensional NLP problems, where both the training examples and the target function reside very sparsely in the feature space (Roth, 1998), e.g., context-sensitive spelling correction, POS tagging, PP-attachment disambiguation, etc.</Paragraph>
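As an illustration (not part of the original paper), the exemplar-based (k-NN) classifier described above can be sketched in Python as follows; the function names, the default value of k, and the exact vote-weighting scheme are assumptions of this example rather than details taken from the paper.

from collections import defaultdict

def hamming_distance(x, y):
    # Number of positions in which the two discrete feature vectors disagree.
    return sum(1 for a, b in zip(x, y) if a != b)

def knn_classify(example, training_data, k=7):
    # training_data: list of (feature_vector, sense) pairs kept in memory.
    neighbours = sorted(training_data,
                        key=lambda pair: hamming_distance(example, pair[0]))[:k]
    votes = defaultdict(float)
    for features, sense in neighbours:
        # Each neighbour votes for its sense with a strength proportional
        # to its closeness to the test example.
        votes[sense] += 1.0 / (1.0 + hamming_distance(example, features))
    return max(votes, key=votes.get)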
<Paragraph position="7"> Decision Lists (DL). In this setting, a Decision List is a list of features extracted from the training examples and sorted by a log-likelihood measure. This measure estimates how strong a particular feature is as an indicator of a specific sense (Yarowsky, 1994). When testing, the decision list is checked in order and the feature with the highest weight that matches the test example is used to select the winning word sense. Thus, only the single most reliable piece of evidence is used to perform disambiguation. Regarding the details of implementation (smoothing, pruning of the decision list, etc.), we have followed (Agirre and Martinez, 2000). A schematic sketch of the procedure is given at the end of this section.</Paragraph>
<Paragraph position="8"> Decision Lists were one of the most successful systems in the first Senseval competition for WSD (Kilgarriff and Rosenzweig, 2000).</Paragraph>
<Paragraph position="9"> LazyBoosting (LB). The main idea of boosting algorithms is to combine many simple and moderately accurate hypotheses (weak classifiers) into a single, highly accurate classifier. The weak classifiers are trained sequentially and, conceptually, each of them is trained on the examples that were most difficult to classify by the preceding weak classifiers. These weak hypotheses are then linearly combined into a single rule called the combined hypothesis.</Paragraph>
<Paragraph position="10"> Schapire and Singer's real AdaBoost.MH algorithm for multiclass multi-label classification (Schapire and Singer, 1999) has been used. It constructs a combination of very simple weak hypotheses, each of which tests the value of a single boolean predicate and makes a real-valued prediction based on that value. LazyBoosting (Escudero et al., 2000a) is a simple modification of the AdaBoost.MH algorithm, which consists of reducing the feature space that is explored when learning each weak classifier. This modification significantly increases the efficiency of the learning process with no loss in accuracy.</Paragraph>
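The schematic decision-list sketch announced above follows, again as an illustrative Python fragment rather than the authors' implementation: the paper follows (Agirre and Martinez, 2000) for smoothing and pruning, so the add-alpha smoothing and the binary feature encoding used here are assumptions of this example.

import math
from collections import defaultdict

def train_decision_list(examples, alpha=0.1):
    # examples: list of (set_of_features, sense) pairs.
    counts = defaultdict(lambda: defaultdict(float))
    for feats, sense in examples:
        for f in feats:
            counts[f][sense] += 1.0
    rules = []
    for f, per_sense in counts.items():
        total = sum(per_sense.values())
        for sense, c in per_sense.items():
            # Log-likelihood ratio of the sense vs. the rest, given the feature,
            # with simple add-alpha smoothing to avoid zero counts.
            weight = math.log((c + alpha) / (total - c + alpha))
            rules.append((weight, f, sense))
    rules.sort(key=lambda r: r[0], reverse=True)  # strongest evidence first
    return rules

def dl_classify(rules, feats, default_sense):
    # The highest-weighted rule whose feature matches the example decides.
    for weight, f, sense in rules:
        if f in feats:
            return sense
    return default_sense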
</Section>
<Section position="4" start_page="32" end_page="32" type="metho">
<SectionTitle> 3 Setting </SectionTitle>
<Paragraph position="0"> A number of comparative experiments have been carried out on a subset of 21 highly ambiguous words of the DSO corpus, a semantically annotated English corpus collected by Ng and colleagues (Ng and Lee, 1996). Each word is treated as a different classification problem.</Paragraph>
<Paragraph position="1"> The 21 words comprise 13 nouns (age, art, body, car, child, cost, head, interest, line, point, state, thing, work) and 8 verbs (become, fall, grow, lose, set, speak, strike, tell), which frequently appear in the WSD literature. The average number of senses per word is close to 10 and the number of training examples is around 1,000.</Paragraph>
<Paragraph position="2"> The DSO corpus contains sentences from two different corpora, namely the Wall Street Journal (WSJ) and the Brown Corpus (BC). Therefore, it is easy to perform experiments on the portability of systems by training them on the WSJ part (A part, hereinafter) and testing them on the BC part (B part, hereinafter), or vice versa.</Paragraph>
<Paragraph position="3"> Two kinds of information are used to train the classifiers: local and topical context. Let w-3 w-2 w-1 w w+1 w+2 w+3 be the context of consecutive words around the word w to be disambiguated, and let p±i (i = 1, 2, 3) be the part-of-speech tag of the word w±i. The attributes referring to local context are the following 15: the part-of-speech tags p-3, p-2, p-1, p+1, p+2, and p+3, the words w-1 and w+1, and seven collocations of two and three consecutive words around w. The topical context is formed by c1, ..., cm, the unordered set of open-class words appearing in the sentence. Details about how the different algorithms translate this information into features can be found in (Escudero et al., 2000c).</Paragraph>
</Section>
<Section position="5" start_page="32" end_page="34" type="metho">
<SectionTitle> 4 Comparing the five approaches </SectionTitle>
<Paragraph position="0"> The five algorithms, jointly with a naive Most-Frequent-Sense classifier (MFC), have been tested on seven different combinations of training and test sets: A+B-A+B, A+B-A, A+B-B, A-A, B-B, A-B, and B-A. In this notation, the training set is placed on the left-hand side of the symbol &quot;-&quot; and the test set on the right-hand side; for instance, A-B means that the training set is corpus A and the test set is corpus B. The symbol &quot;+&quot; stands for set union. Accuracy figures, micro-averaged over the 21 words and over the ten folds, are reported in table 1. The comparison leads to the following conclusions: As expected, the five algorithms significantly outperform the baseline MFC classifier. Among them, three groups can be observed: NB, DL, and SN perform similarly; LB outperforms all the other algorithms in all experiments; and EB is somewhere in between. The difference between LB and the rest is statistically significant in all cases except when comparing LB to the EB approach in the case marked with an asterisk.</Paragraph>
<Paragraph position="1"> Extremely poor results are observed when testing the portability of the systems. Restricting ourselves to the LB results, the accuracy obtained in A-B is 47.1%, while the accuracy in B-B (which can be considered an upper bound for LB on corpus B) is 59.0%; that is, there is a difference of 12 points. Furthermore, 47.1% is only slightly better than the most frequent sense in corpus B, 45.5%.</Paragraph>
<Paragraph position="2"> Apart from accuracy figures, the comparison between the predictions made by the five methods on the test sets provides interesting information about the relative behaviour of the algorithms. Table 2 shows the agreement rates and the Kappa statistics between all pairs of methods in the A+B-A+B experiment. The Kappa statistic is a measure of inter-annotator agreement which reduces the effect of chance agreement; it has been used for measuring inter-annotator agreement during the construction of semantically annotated corpora (Véronis, 1998; Ng et al., 1999). A Kappa value of 1 indicates perfect agreement, while 0.8 is considered as indicating good agreement. Note that 'DSO' stands for the annotation of the DSO corpus, which is taken as the correct one.</Paragraph>
[Table 2: agreement rates (above the diagonal) and Kappa statistics between all pairs of methods in the A+B-A+B experiment]
<Paragraph position="3"> It can be observed that NB obtains the results most similar to MFC in both agreement and Kappa values. The agreement ratio is 74%, that is, almost 3 out of 4 times it predicts the most frequent sense. At the other extreme, LB obtains the results most similar to DSO in agreement and Kappa values, and the least similar to MFC, suggesting that LB is the algorithm that best learns the behaviour of the DSO examples.</Paragraph>
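As a side note (not from the paper), the Kappa statistic referred to above can be computed as follows, assuming chance agreement is estimated from the two marginal label distributions (Cohen's formulation):

from collections import Counter

def kappa(labels_a, labels_b):
    # labels_a, labels_b: sense tags assigned to the same examples by two methods.
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Expected agreement by chance, from the two marginal distributions.
    expected = sum(ca[s] * cb[s] for s in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1.0 - expected)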
<Paragraph position="4"> In absolute terms, the Kappa values are very low. But, as is suggested in (Véronis, 1998), evaluation measures should be computed relative to the agreement between the human annotators of the corpus and not to a theoretical 100%. It seems pointless to expect more agreement between the system and the reference corpus than between the annotators themselves. Contrary to the intuition that the agreement between human annotators should be very high in the WSD task, some papers report surprisingly low figures. For instance, (Ng et al., 1999) reports an accuracy rate of 56.7% and a Kappa value of 0.317 when comparing the annotations of a subset of the DSO corpus performed by two independent research groups. From this perspective, the Kappa value of 0.44 achieved by LB in A+B-A+B could be considered an excellent result. Unfortunately, the subset of the DSO corpus studied by (Ng et al., 1999) and the one used in this report are not the same and, thus, a direct comparison is not possible.</Paragraph>
<Section position="1" start_page="33" end_page="34" type="sub_section">
<SectionTitle> 4.1 About the tuning to new domains </SectionTitle>
<Paragraph position="0"> This experiment explores the effect of a simple tuning process that consists of adding to the original training set A a relatively small sample of manually sense-tagged examples from the new domain B. The size of this supervised portion varies from 10% to 50% of the available corpus in steps of 10% (the remaining 50% is kept for testing). This experiment will be referred to as A+%B-B. Tuning examples could also be weighted more highly than the original training examples to force the learning algorithm to adapt more quickly to the new corpus; some experiments in this direction revealed that slightly better results can be obtained, though the improvement was not statistically significant. In order to determine to what extent the original training set contributes to accurate disambiguation in the new domain, we also calculate the results for %B-B, that is, using only the tuning corpus for training.</Paragraph>
<Paragraph position="1"> Figure 1 graphically presents the results obtained by all methods. Each plot contains the A+%B-B and %B-B curves, and the straight lines corresponding to the lower bound MFC and to the upper bounds B-B and A+B-B.</Paragraph>
<Paragraph position="2"> As expected, the accuracy of all methods grows (towards the upper bound) as more tuning corpus is added to the training set. However, the relation between A+%B-B and %B-B reveals some interesting facts. In plots (c) and (d), the contribution of the original training corpus is null, while in plots (a) and (b) a degradation of accuracy is observed. Summarizing, these results suggest that for the NB, DL, SN, and EB methods it is not worth keeping the original training examples. Instead, a better (but disappointing) strategy would be simply to use the tuning corpus. However, this is not the situation for LB, plot (d), for which a moderate but consistent improvement of accuracy is observed when the original training set is retained.</Paragraph>
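The split-and-mix protocol of the A+%B-B experiment can be illustrated with the following hypothetical helper (a sketch under assumed data structures, not the authors' code; each corpus is taken to be a list of labelled examples for a given word):

import random

def tuning_splits(corpus_a, corpus_b, fractions=(0.1, 0.2, 0.3, 0.4, 0.5), seed=0):
    # Hold out half of corpus B for testing and add increasing fractions of
    # the other half to the original training set A (the A+%B-B condition),
    # also yielding the %B-B condition that uses the tuning corpus alone.
    rng = random.Random(seed)
    b = list(corpus_b)
    rng.shuffle(b)
    half = len(b) // 2
    b_test, b_tune = b[:half], b[half:]
    for frac in fractions:
        n = int(round(frac * len(corpus_b)))
        yield corpus_a + b_tune[:n], b_test   # A+%B-B
        yield b_tune[:n], b_test              # %B-B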
<Paragraph position="3"> We observed that the poor results obtained are partly explained by two facts: 1) corpora A and B have very different distributions of senses and, therefore, different a-priori biases; and 2) the examples of corpora A and B contain different information and, therefore, the learning algorithms acquire different (and non-interchangeable) classification clues from the two corpora. The study of the rules acquired by LazyBoosting from WSJ and BC helped us understand the differences between the corpora. On the one hand, the types of features used in the rules were significantly different between the corpora and, additionally, there were very few rules that applied to both sets. On the other hand, the sign of the prediction of many of these common rules was somewhat contradictory between the corpora. See (Escudero et al., 2000c) for details.</Paragraph>
</Section>
<Section position="2" start_page="34" end_page="34" type="sub_section">
<SectionTitle> 4.2 About the training data quality </SectionTitle>
<Paragraph position="0"> The observation of the rules acquired by LazyBoosting could also help improve data quality in a semi-supervised fashion. It is known that mislabelled examples resulting from annotation errors tend to be hard to classify correctly and, therefore, tend to have large weights in the final distribution. This observation makes it possible both to identify the noisy examples and to use LazyBoosting as a way of improving the training corpus.</Paragraph>
<Paragraph position="1"> A preliminary experiment has been carried out in this direction by studying the rules acquired by LazyBoosting from the training examples of the word state. The manual revision, by four different people, of the 50 highest-scored rules allowed us to identify 28 noisy training examples. Eleven of them were clear tagging errors, and the remaining 17 were not coherently tagged and were very difficult to judge, since the four annotators systematically disagreed (probably due to the extremely fine-grained sense definitions involved in these examples).</Paragraph>
</Section>
</Section>
</Paper>