File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/w00-1322_metho.xml
Size: 20,078 bytes
Last Modified: 2025-10-06 14:07:26
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1322"> <Title>An Empirical Study of the Domain Dependence of Supervised Word Sense Disambiguation Systems*</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> Keywords: Cross-corpus evaluation of NLP, Supervised Machine Learning </SectionTitle> <Paragraph position="0"/> </Section>
<Section position="4" start_page="0" end_page="172" type="metho"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Word Sense Disambiguation (WSD) is the problem of assigning the appropriate meaning (sense) to a given word in a text or discourse.</Paragraph>
<Paragraph position="1"> Resolving the ambiguity of words is a central problem for large-scale language understanding applications and their associated tasks (Ide and Véronis, 1998), e.g., machine translation, information retrieval, reference resolution, parsing, etc.</Paragraph>
<Paragraph position="2"> WSD is one of the most important open problems in NLP. Despite the wide range of approaches investigated and the large effort devoted to tackling this problem, to date, no large-scale, broad-coverage and highly accurate WSD system has been built --see the main conclusions of the first edition of SensEval (Kilgarriff and Rosenzweig, 2000).</Paragraph>
<Paragraph position="3"> One of the most successful current lines of research is the corpus-based approach, in which statistical or Machine Learning (ML) algorithms are applied to learn statistical models or classifiers from corpora in order to perform WSD. Generally, supervised approaches 1 have obtained better results than unsupervised methods on small sets of selected ambiguous words, or artificial pseudo-words.</Paragraph>
<Paragraph position="4"> * This research has been partially funded by the Spanish Research Department (CICYT's project TIC980423-C06), by the EU Commission (NAMIC IST1999-12392), and by the Catalan Research Department (CIRIT's consolidated research group 1999SGR150 and CIRIT's grant 1999FI 00773).</Paragraph>
<Paragraph position="5"> Many standard ML algorithms for supervised learning have been applied, such as: Decision Lists (Yarowsky, 1994; Agirre and Martinez, 2000), Neural Networks (Towell and Voorhees, 1998), Bayesian learning (Bruce and Wiebe, 1999), Exemplar-Based learning (Ng, 1997a; Fujii et al., 1998), Boosting (Escudero et al., 2000a), etc. Unfortunately, there have been very few direct comparisons between alternative methods for WSD.</Paragraph>
<Paragraph position="6"> In general, supervised learning presumes that the training examples are somehow reflective of the task that will be performed by the trainee on other data. Consequently, the performance of such systems is commonly estimated by testing the algorithm on a separate part of the set of training examples (say 10-20% of them), or by N-fold cross-validation, in which the set of examples is partitioned into N disjoint sets (or folds), and the training-test procedure is repeated N times using all combinations of N-1 folds for training and 1 fold for testing. In both cases, test examples are different from those used for training, but they belong to the same corpus, and, therefore, they are expected to be quite similar.</Paragraph>
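To make the evaluation scheme concrete, the following is a minimal Python sketch of the N-fold cross-validation procedure just described. It is purely illustrative and is not the experimental code used in this paper; the toy majority-sense baseline and the example data are hypothetical placeholders.

import random
from collections import Counter

def n_fold_cross_validation(examples, train_and_test, n=10, seed=0):
    """Partition 'examples' into n disjoint folds, then repeatedly train on
    n-1 folds and test on the held-out fold, returning the average accuracy."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    folds = [examples[i::n] for i in range(n)]
    accuracies = []
    for i, test_fold in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        accuracies.append(train_and_test(train, test_fold))
    return sum(accuracies) / n

def majority_sense_accuracy(train, test):
    """Toy 'classifier': always predict the most frequent sense seen in training."""
    majority = Counter(sense for _, sense in train).most_common(1)[0][0]
    return sum(1 for _, sense in test if sense == majority) / len(test)

if __name__ == "__main__":
    # Hypothetical (features, sense) pairs standing in for annotated examples.
    data = [({"w-1": "the"}, "sense1")] * 60 + [({"w-1": "a"}, "sense2")] * 40
    print(n_fold_cross_validation(data, majority_sense_accuracy, n=10))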
<Paragraph position="7"> Although this methodology could be valid for certain NLP problems, such as English Part-of-Speech tagging, we think that there exists reasonable evidence to say that, in WSD, accuracy results cannot be simply extrapolated to other domains (contrary to the opinion of other authors (Ng, 1997b)): On the one hand, WSD is very dependent on the domain of application (Gale et al., 1992b) --see also (Ng and Lee, 1996; Ng, 1997a), in which quite different accuracy figures are obtained when testing an exemplar-based WSD classifier on two different corpora. On the other hand, it does not seem reasonable to think that the training material is large and representative enough to cover &quot;all&quot; potential types of examples.</Paragraph>
<Paragraph position="8"> 1 Supervised approaches, also known as data-driven or corpus-driven, are those that learn from a previously semantically annotated corpus.</Paragraph>
<Paragraph position="9"> To date, a thorough study of the domain dependence of WSD --in the style of other studies devoted to parsing (Sekine, 1997)-- has not been carried out. We think that such a study is needed to assess the validity of the supervised approach, and to determine to what extent a tuning process is necessary to make real WSD systems portable. In order to corroborate the previous hypotheses, this paper explores the portability and tuning of four different ML algorithms (previously applied to WSD) by training and testing them on different corpora.</Paragraph>
<Paragraph position="10"> Additionally, supervised methods suffer from the &quot;knowledge acquisition bottleneck&quot; (Gale et al., 1992a). (Ng, 1997b) estimates that the manual annotation effort necessary to build a broad-coverage semantically annotated English corpus is about 16 person-years. This overhead for supervision could be much greater if a costly tuning procedure is required before applying any existing system to each new domain.</Paragraph>
<Paragraph position="11"> Due to this fact, recent works have focused on reducing the acquisition cost as well as the need for supervision in corpus-based methods.</Paragraph>
<Paragraph position="12"> It is our belief that the research by (Leacock et al., 1998; Mihalcea and Moldovan, 1999) 2 provides enough evidence towards the &quot;opening&quot; of the bottleneck in the near future. For that reason, it is worth further investigating the robustness and portability of existing supervised ML methods to better resolve the WSD problem.</Paragraph>
<Paragraph position="13"> 2 In the line of using lexical resources and search engines to automatically collect training examples from large text collections or the Internet.</Paragraph>
<Paragraph position="14"> It is important to note that the focus of this work will be on the empirical cross-corpus evaluation of several supervised ML algorithms. Other important issues, such as: selecting the best attribute set, discussing an appropriate definition of senses for the task, etc., are not addressed in this paper.</Paragraph>
<Paragraph position="15"> This paper is organized as follows: Section 2 presents the four ML algorithms compared. In section 3 the setting is presented in detail, including the corpora and the experimental methodology used. Section 4 reports the experiments carried out and the results obtained. Finally, section 5 concludes and outlines some lines for further research.</Paragraph> </Section>
<Section position="5" start_page="172" end_page="173" type="metho"> <SectionTitle> 2 Learning Algorithms Tested 2.1 Naive-Bayes (NB) </SectionTitle>
<Paragraph position="0"> Naive Bayes is intended as a simple representative of statistical learning methods. It has been used in its most classical setting (Duda and Hart, 1973). That is, assuming independence of features, it classifies a new example by assigning the class that maximizes the conditional probability of the class given the observed sequence of features of that example.</Paragraph>
<Paragraph position="1"> Model probabilities are estimated during the training process using relative frequencies. To avoid the effect of zero counts when estimating probabilities, a very simple smoothing technique has been used, which was proposed in (Ng, 1997a). Despite its simplicity, Naive Bayes is claimed to obtain state-of-the-art accuracy on supervised WSD in many papers (Mooney, 1996; Ng, 1997a; Leacock et al., 1998).</Paragraph>
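As an illustration of the classifier just described, here is a minimal Naive Bayes sketch for WSD-style examples (dictionaries of discrete features paired with a sense label). It is not the implementation evaluated in this paper, and the simple add-lambda smoothing below only stands in for the (Ng, 1997a) technique mentioned in the text.

import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Naive Bayes over discrete features: pick the sense maximizing
    P(sense) * product over i of P(feature_i = value_i | sense)."""

    def __init__(self, smoothing=0.1):
        # Add-lambda smoothing; a stand-in for the (Ng, 1997a) technique.
        self.smoothing = smoothing

    def train(self, examples):
        # examples: list of (features_dict, sense) pairs.
        self.sense_counts = Counter(sense for _, sense in examples)
        self.total = len(examples)
        self.value_counts = defaultdict(Counter)   # (sense, feature) -> Counter of values
        self.feature_values = defaultdict(set)     # feature -> set of observed values
        for features, sense in examples:
            for feature, value in features.items():
                self.value_counts[(sense, feature)][value] += 1
                self.feature_values[feature].add(value)

    def classify(self, features):
        best_sense, best_logprob = None, float("-inf")
        for sense, sense_count in self.sense_counts.items():
            logprob = math.log(sense_count / self.total)
            for feature, value in features.items():
                counts = self.value_counts[(sense, feature)]
                n_values = len(self.feature_values[feature]) or 1
                p = (counts[value] + self.smoothing) / (sense_count + self.smoothing * n_values)
                logprob += math.log(p)
            if logprob > best_logprob:
                best_sense, best_logprob = sense, logprob
        return best_sense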
<Section position="1" start_page="172" end_page="173" type="sub_section"> <SectionTitle> 2.2 Exemplar-based Classifier (EB) </SectionTitle>
<Paragraph position="0"> In Exemplar-based learning (Aha et al., 1991) no generalization of training examples is performed. Instead, the examples are stored in memory and the classification of new examples is based on the classes of the most similar stored examples. In our implementation, all examples are kept in memory and the classification of a new example is based on a k-NN (Nearest-Neighbours) algorithm using Hamming distance 3 to measure closeness (in doing so, all examples are examined). For k's greater than 1, the resulting sense is the weighted majority sense of the k nearest neighbours --where each example votes for its sense with a strength proportional to its closeness to the test example.</Paragraph>
<Paragraph position="1"> In the experiments explained in section 4, the EB algorithm is run several times using different numbers of nearest neighbours (1, 3, 5, 7, 10, 15, 20 and 25) and the results corresponding to the best choice are reported 4. Exemplar-based learning is said to be the best option for WSD (Ng, 1997a). Other authors (Daelemans et al., 1999) point out that exemplar-based methods tend to be superior in language learning problems because they do not forget exceptions.</Paragraph>
<Paragraph position="2"> 3 Although the use of the MVDM metric (Cost and Salzberg, 1993) could lead to better results, current implementations have prohibitive computational overheads (Escudero et al., 2000b).</Paragraph>
<Paragraph position="3"> 4 In order to construct a real EB-based system for WSD, the k parameter should be estimated by cross-validation using only the training set (Ng, 1997a); however, in our case, this cross-validation inside the cross-validation involved in the testing process would generate a prohibitive overhead.</Paragraph> </Section>
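A minimal sketch of the exemplar-based scheme described above follows: k-NN with Hamming distance and closeness-weighted voting. The inverse-distance weighting and the tuple encoding of attributes are illustrative assumptions, not the exact implementation compared in this paper.

from collections import defaultdict

def hamming_distance(a, b):
    """Number of attribute positions on which two examples disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)

def knn_classify(stored_examples, new_features, k=7):
    """stored_examples: list of (feature_tuple, sense). All examples are kept
    in memory and examined; the k closest ones vote, each with a strength
    that decreases with its distance (an illustrative weighting)."""
    scored = sorted(
        (hamming_distance(features, new_features), sense)
        for features, sense in stored_examples
    )
    votes = defaultdict(float)
    for distance, sense in scored[:k]:
        votes[sense] += 1.0 / (1.0 + distance)
    return max(votes, key=votes.get)

# Tiny usage example with three hypothetical local-context attributes (p-1, w-1, w+1).
examples = [
    (("NN", "interest", "rate"), "financial"),
    (("NN", "interest", "rates"), "financial"),
    (("IN", "interest", "in"), "attention"),
]
print(knn_classify(examples, ("NN", "interest", "rate"), k=3))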
<Section position="2" start_page="173" end_page="173" type="sub_section"> <SectionTitle> 2.3 Snow: A Winnow-based Classifier </SectionTitle>
<Paragraph position="0"> Snow stands for Sparse Network Of Winnows, and it is intended as a representative of on-line learning algorithms.</Paragraph>
<Paragraph position="1"> The basic component is the Winnow algorithm (Littlestone, 1988). It consists of a linear threshold algorithm with multiplicative weight updating for 2-class problems, which learns very fast in the presence of many binary input features.</Paragraph>
<Paragraph position="2"> In the Snow architecture there is a winnow node for each class, which learns to separate that class from all the rest. During training, each example is considered a positive example for the winnow node associated with its class and a negative example for all the rest. A key point that allows fast learning is that the winnow nodes are not connected to all features but only to those that are &quot;relevant&quot; for their class. When classifying a new example, Snow is similar to a neural network which takes the input features and outputs the class with the highest activation.</Paragraph>
<Paragraph position="3"> Snow is proven to perform very well in high-dimensional domains, where both the training examples and the target function reside very sparsely in the feature space (Roth, 1998), e.g., text categorization, context-sensitive spelling correction, WSD, etc. In this paper, our approach to WSD using Snow follows that of (Escudero et al., 2000c).</Paragraph> </Section>
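The following is a schematic sketch of a single Winnow node with multiplicative, mistake-driven weight updates over sparse binary features, in the spirit of the Snow description above. The threshold, promotion and demotion values and the lazy feature-connection policy are illustrative defaults, not the actual Snow parameters.

class WinnowNode:
    """One linear-threshold unit with multiplicative updates, trained to
    separate its class from all the rest (one node per sense in a Snow-style setup)."""

    def __init__(self, threshold=1.0, promotion=1.5, demotion=0.5):
        self.threshold = threshold
        self.promotion = promotion
        self.demotion = demotion
        self.weights = {}  # only features seen as "relevant" get connected

    def activation(self, active_features):
        return sum(self.weights.get(f, 0.0) for f in active_features)

    def update(self, active_features, is_positive):
        # Lazily connect the node to features occurring in its positive examples.
        if is_positive:
            for f in active_features:
                self.weights.setdefault(f, 1.0)
        predicted_positive = self.activation(active_features) >= self.threshold
        if predicted_positive == is_positive:
            return  # mistake-driven: update only on errors
        factor = self.promotion if is_positive else self.demotion
        for f in active_features:
            if f in self.weights:
                self.weights[f] *= factor

def snow_classify(nodes, active_features):
    """nodes: dict mapping each sense to its WinnowNode.
    Output the sense whose node has the highest activation."""
    return max(nodes, key=lambda sense: nodes[sense].activation(active_features))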
<Section position="3" start_page="173" end_page="173" type="sub_section"> <SectionTitle> 2.4 LazyBoosting (LB) </SectionTitle>
<Paragraph position="0"> The main idea of boosting algorithms is to combine many simple and moderately accurate hypotheses (called weak classifiers) into a single, highly accurate classifier. The weak classifiers are trained sequentially and, conceptually, each of them is trained on the examples which were most difficult to classify by the preceding weak classifiers. These weak hypotheses are then linearly combined into a single rule called the combined hypothesis.</Paragraph>
<Paragraph position="1"> More particularly, Schapire and Singer's real AdaBoost.MH algorithm for multi-class multi-label classification (Schapire and Singer, to appear) has been used. As in that paper, very simple weak hypotheses are used. They test the value of a boolean predicate and make a real-valued prediction based on that value. The predicates used, which are the binarization of the attributes described in section 3.2, are of the form &quot;f = v&quot;, where f is a feature and v is a value (e.g., &quot;previous_word = hospital&quot;). Each weak rule uses a single feature and, therefore, can be seen as a simple decision tree with one internal node (testing the value of a binary feature) and two leaves corresponding to the yes/no answers to that test.</Paragraph>
<Paragraph position="2"> LazyBoosting (Escudero et al., 2000a) is a simple modification of the AdaBoost.MH algorithm, which consists of reducing the feature space that is explored when learning each weak classifier. More specifically, a small proportion p of attributes is randomly selected and the best weak rule is selected only among them. The idea behind this method is that, if the proportion p is not too small, a sufficiently good rule can probably be found at each iteration. Besides, the chance for a good rule to appear in the whole learning process is very high. Another important characteristic is that no attribute needs to be discarded and, thus, the risk of eliminating relevant attributes is avoided. The method seems to work quite well, since no important degradation in performance is observed for values of p greater than or equal to 5% (this may indicate that there are many irrelevant or highly dependent attributes in the WSD domain). Therefore, this modification significantly increases the efficiency of the learning process (empirically, up to 7 times faster) with no loss in accuracy.</Paragraph> </Section> </Section>
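A schematic sketch of the LazyBoosting idea follows. For brevity it uses a plain two-class AdaBoost with one-feature decision stumps instead of the real-valued, multi-class AdaBoost.MH actually employed in this paper; the point being illustrated is only the sampling of a proportion p of the attributes before choosing each weak rule, and all names and parameter values are illustrative.

import math
import random

def train_lazyboosting(examples, n_features, rounds=50, p=0.1, seed=0):
    """examples: list of (binary_feature_tuple, label) with label in {-1, +1}.
    Each round samples a proportion p of the attributes and fits the best
    one-feature stump only among them (the LazyBoosting modification)."""
    rng = random.Random(seed)
    m = len(examples)
    weights = [1.0 / m] * m
    ensemble = []  # list of (feature_index, alpha) pairs

    def weighted_error(f):
        # Error of the stump "predict +1 when feature f is on, -1 otherwise".
        return sum(w for w, (x, y) in zip(weights, examples)
                   if (1 if x[f] else -1) != y)

    for _ in range(rounds):
        candidates = rng.sample(range(n_features), max(1, int(p * n_features)))
        best_f = min(candidates, key=weighted_error)
        err = min(max(weighted_error(best_f), 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((best_f, alpha))
        # Increase the weight of the examples this weak rule got wrong.
        new = [w * math.exp(-alpha * y * (1 if x[best_f] else -1))
               for w, (x, y) in zip(weights, examples)]
        total = sum(new)
        weights = [w / total for w in new]
    return ensemble

def classify(ensemble, x):
    score = sum(alpha * (1 if x[f] else -1) for f, alpha in ensemble)
    return 1 if score >= 0 else -1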
<Section position="6" start_page="173" end_page="175" type="metho"> <SectionTitle> 3 Setting </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="173" end_page="174" type="sub_section"> <SectionTitle> 3.1 The DSO Corpus </SectionTitle>
<Paragraph position="0"> The DSO corpus is a semantically annotated corpus containing 192,800 occurrences of 121 nouns and 70 verbs, corresponding to the most frequent and ambiguous English words. This corpus was collected by Ng and colleagues (Ng and Lee, 1996) and it is available from the Linguistic Data Consortium (LDC). The DSO corpus contains sentences from two different corpora, namely the Wall Street Journal (WSJ) and the Brown Corpus (BC). Therefore, it is easy to perform experiments on the portability of alternative systems by training them on the WSJ part and testing them on the BC part, or vice-versa. Hereinafter, the WSJ part of DSO will be referred to as corpus A, and the BC part as corpus B. At the word level, we force the number of examples of corpora A and B to be the same 6 in order to have symmetry and allow the comparison in both directions.</Paragraph>
<Paragraph position="1"> From these corpora, a group of 21 words which frequently appear in the WSD literature has been selected to perform the comparative experiments (each word is treated as a different classification problem). These words are 13 nouns (age, art, body, car, child, cost, head, interest, line, point, state, thing, work) and 8 verbs (become, fall, grow, lose, set, speak, strike, tell). Table 1 contains information about the number of examples, the number of senses, and the percentage of the most frequent sense (MFS) of these reference words, grouped by nouns, verbs, and all 21 words.</Paragraph> </Section>
<Section position="2" start_page="174" end_page="174" type="sub_section"> <SectionTitle> 3.2 Attributes </SectionTitle>
<Paragraph position="0"> Two kinds of information are used to perform disambiguation: local and topical context. Let &quot;... w-3 w-2 w-1 w w+1 w+2 w+3 ...&quot; be the context of consecutive words around the word w to be disambiguated, and p±i (-3 ≤ i ≤ 3) be the part-of-speech tag of word w±i. The attributes referring to local context are the following 15: p-3, p-2, p-1, p+1, p+2, p+3, w-1, w+1, (w-2, w-1), (w-1, w+1), (w+1, w+2), (w-3, w-2, w-1), (w-2, w-1, w+1), (w-1, w+1, w+2), and (w+1, w+2, w+3), where the last seven correspond to collocations of two and three consecutive words.</Paragraph>
<Paragraph position="1"> The topical context is formed by c1, ..., cm, which stand for the unordered set of open-class words appearing in the sentence 7.</Paragraph>
<Paragraph position="2"> 7 This set of attributes corresponds to those attributes used in (Ng and Lee, 1996), with the exception of the morphology of the target word and the verb-object syntactic relation.</Paragraph>
<Paragraph position="3"> The four methods tested translate this information into features in different ways. Snow and LB algorithms require binary features. Therefore, local-context attributes have to be binarized in a preprocess, while the topical-context attributes remain as binary tests about the presence/absence of a concrete word in the sentence. As a result, the number of attributes is expanded to several thousands (from 1,764 to 9,900 depending on the particular word).</Paragraph>
<Paragraph position="4"> The binary representation of attributes is not appropriate for NB and EB algorithms. Therefore, the 15 local-context attributes are taken straightforwardly. Regarding the binary topical-context attributes, we have used the variants described in (Escudero et al., 2000b). For EB, the topical information is codified as a single set-valued attribute (containing all words appearing in the sentence) and the calculation of closeness is modified so as to handle this type of attribute. For NB, the topical context is conserved as binary features, but when classifying new examples only the information of words appearing in the example (positive information) is taken into account. In that paper, these variants are called positive Exemplar-based (PEB) and positive Naive Bayes (PNB), respectively. PNB and PEB algorithms are empirically proven to perform much better in terms of accuracy and efficiency in the WSD task.</Paragraph> </Section>
<Section position="3" start_page="174" end_page="175" type="sub_section"> <SectionTitle> 3.3 Experimental Methodology </SectionTitle>
<Paragraph position="0"> The comparison of algorithms has been performed in a series of controlled experiments using exactly the same training and test sets. There are 7 combinations of training-test sets called: A+B-A+B, A+B-A, A+B-B, A-A, B-B, A-B, and B-A, respectively. In this notation, the training set is placed at the left-hand side of the symbol &quot;-&quot;, while the test set is at the right-hand side. For instance, A-B means that the training set is corpus A and the test set is corpus B. The symbol &quot;+&quot; stands for set union; therefore, A+B-B means that the training set is A union B and the test set is B.</Paragraph>
<Paragraph position="1"> When comparing the performance of two algorithms, two different statistical tests of significance have been applied depending on the case. A-B and B-A combinations represent a single training-test experiment. In these cases, McNemar's test of significance is used (with a confidence value of χ²(1, 0.95) = 3.842), which is proven to be more robust than a simple test for the difference of two proportions.</Paragraph>
<Paragraph position="2"> In the other combinations, a 10-fold cross-validation was performed in order to prevent testing on the same material used for training. In these cases, the accuracy/error rate figures reported in section 4 are averaged over the results of the 10 folds. The associated statistical test of significance is a paired Student's t-test with a confidence value of t(9, 0.975) = 2.262.</Paragraph>
<Paragraph position="3"> Information about both statistical tests can be found in (Dietterich, 1998).</Paragraph> </Section> </Section> </Paper>