<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1322"> <Title>An Empirical Study of the Domain Dependence of Supervised Word Sense Disambiguation Systems*</Title> <Section position="7" start_page="175" end_page="177" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle>
<Section position="1" start_page="175" end_page="175" type="sub_section"> <SectionTitle> 4.1 First Experiment </SectionTitle>
<Paragraph position="0"> Table 2 shows the accuracy figures of the four methods in all combinations of training and test sets.8 Standard deviation numbers are supplied in all cases involving cross-validation. MFC stands for the Most-Frequent-sense Classifier, that is, a naive classifier that learns the most frequent sense of the training set and uses it to classify all examples of the test set. Averaged results are presented for nouns, verbs, and overall, and the best results for each case are printed in boldface.</Paragraph>
<Paragraph position="2"> The following conclusions can be drawn: * LB outperforms all other methods in all cases. Additionally, this superiority is statistically significant, except when comparing LB to the PEB approach in the cases marked with an asterisk.</Paragraph>
<Paragraph position="3"> * Surprisingly, LB in A+B-A (or A+B-B) does not achieve a substantial improvement over the results of A-A (or B-B); in fact, the first difference is not statistically significant and the second is only slightly significant. That is, the addition of extra examples from another domain does not necessarily improve the results on the original corpus. This effect is also observed in the other methods, especially in some cases (e.g., SNoW in A+B-A vs. A-A) in which joining both training corpora is even counterproductive. * Regarding the portability of the systems, very disappointing results are obtained.</Paragraph>
<Paragraph position="4"> Restricting to the LB results, we observe that the accuracy obtained in A-B is 47.1%, while the accuracy in B-B (which can be considered an upper bound for LB on corpus B) is 59.0%, that is, a drop of 12 points. Furthermore, 47.1% is only slightly better than the most frequent sense in corpus B, 45.5%. The comparison in the reverse direction is even worse: a drop from 71.3% (A-A) to 52.0% (B-A), which is lower than the most frequent sense of corpus A, 55.9%.</Paragraph>
<Paragraph position="5"> 8 The second and third columns correspond to the train and test sets used by (Ng and Lee, 1996; Ng, 1997a).</Paragraph> </Section>
<Section position="2" start_page="175" end_page="176" type="sub_section"> <SectionTitle> 4.2 Second Experiment </SectionTitle>
<Paragraph position="0"> The previous experiment shows that classifiers trained on the A corpus do not work well on the B corpus, and vice versa. Therefore, it seems that some kind of tuning process is necessary to adapt supervised systems to each new domain.</Paragraph>
<Paragraph position="1"> This experiment explores the effect of a simple tuning process consisting of adding to the original training set a relatively small sample of manually sense-tagged examples from the new domain. The size of this supervised portion varies from 10% to 50% of the available corpus in steps of 10% (the remaining 50% is kept for testing). This set of experiments will be referred to as A+%B-B and, conversely, B+%A-A.</Paragraph>
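To make the protocol concrete, here is a minimal Python sketch of the A+%B-B / %B-B splits together with the MFC lower bound. It assumes each corpus is a list of (features, sense) pairs for one target word; the function names, the random shuffling, and the 50/50 hold-out bookkeeping are illustrative assumptions, not the authors' code.

```python
import random
from collections import Counter

def mfc_accuracy(train, test):
    # Most-Frequent-sense Classifier (MFC): learn the most frequent
    # sense in the training set and assign it to every test example.
    top_sense = Counter(sense for _, sense in train).most_common(1)[0][0]
    return sum(1 for _, sense in test if sense == top_sense) / len(test)

def tuning_splits(corpus_a, corpus_b, seed=0):
    # Hold out a fixed 50% of corpus B for testing; the tuning sample
    # grows from 10% to 50% of corpus B in steps of 10%.
    rng = random.Random(seed)
    b = list(corpus_b)
    rng.shuffle(b)
    half = len(b) // 2
    test, pool = b[:half], b[half:]
    for pct in (0.1, 0.2, 0.3, 0.4, 0.5):
        tuning = pool[: int(len(b) * pct)]
        # A+%B-B trains on corpus_a + tuning; %B-B trains on tuning alone.
        yield pct, corpus_a + tuning, tuning, test
```

Under these assumptions, training a classifier on the mixed set traces the X+%Y-Y curve of Figure 1, training on the tuning sample alone traces %Y-Y, and mfc_accuracy gives the lower-bound line.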
<Paragraph position="2"> In order to determine to what extent the original training set contributes to accurate disambiguation in the new domain, we also calculate the results for %A-A (and %B-B), that is, using only the tuning corpus for training.</Paragraph>
<Paragraph position="3"> Figure 1 graphically presents the results obtained by all methods. Each plot contains the X+%Y-Y and %Y-Y curves, and the straight lines corresponding to the lower bound MFC and to the upper bounds Y-Y and X+Y-Y.</Paragraph>
<Paragraph position="4"> As expected, the accuracy of all methods grows (towards the upper bound) as more tuning corpus is added to the training set. However, the relation between X+%Y-Y and %Y-Y reveals some interesting facts. In plots 2a, 3a, and 1b the contribution of the original training corpus is null. Furthermore, in plots 1a, 2b, and 3b a degradation in accuracy is observed. Summarizing, these six plots show that for the Naive Bayes, Exemplar-Based, and SNoW methods it is not worth keeping the original training examples. Instead, a better (but disappointing) strategy would be to simply use the tuning corpus.</Paragraph>
<Paragraph position="5"> However, this is not the case for LazyBoosting (plots 4a and 4b), for which a moderate (but consistent) improvement in accuracy is observed when retaining the original training set. Therefore, LazyBoosting again shows better behaviour than its competitors when moving from one domain to another.</Paragraph> </Section>
<Section position="3" start_page="176" end_page="177" type="sub_section"> <SectionTitle> 4.3 Third Experiment </SectionTitle>
<Paragraph position="0"> The poor portability results could be explained by at least two reasons: 1) corpora A and B have very different sense distributions and, therefore, different a priori biases; 2) examples from corpora A and B contain different information and, therefore, the learning algorithms acquire different (and non-interchangeable) classification cues from the two corpora.</Paragraph>
<Paragraph position="1"> The first hypothesis is confirmed by observing the bar plots of Figure 2, which show the distribution of the four most frequent senses of some sample words in corpora A and B, respectively. In order to check the second hypothesis, two new sense-balanced corpora have been generated from the DSO corpus by equalizing the number of examples of each sense between the A and B parts. In this way, the first difficulty is artificially removed, and the algorithms should be portable if the examples of both parts are sufficiently similar.</Paragraph>
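The paper does not spell out how the sense-balanced corpora were built; a plausible construction, sketched below under the same list-of-(features, sense) assumption as before, downsamples each sense to the minimum of its counts in the two parts, so that both sides end up with identical sense distributions. Dropping senses that occur in only one part is likewise an assumption.

```python
import random
from collections import defaultdict

def balance_senses(part_a, part_b, seed=0):
    # Group examples by sense in each part.
    by_sense_a, by_sense_b = defaultdict(list), defaultdict(list)
    for example in part_a:
        by_sense_a[example[1]].append(example)
    for example in part_b:
        by_sense_b[example[1]].append(example)
    rng = random.Random(seed)
    balanced_a, balanced_b = [], []
    # For every sense present in both parts, keep the same number of
    # examples on each side (the smaller of the two counts).
    for sense in set(by_sense_a) & set(by_sense_b):
        n = min(len(by_sense_a[sense]), len(by_sense_b[sense]))
        balanced_a.extend(rng.sample(by_sense_a[sense], n))
        balanced_b.extend(rng.sample(by_sense_b[sense], n))
    return balanced_a, balanced_b
```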
<Paragraph position="2"> Table 3 shows the results obtained by LazyBoosting on these new corpora.</Paragraph>
<Paragraph position="3"> Regarding portability, we observe a significant accuracy decrease of 7 and 5 points from A-A to B-A, and from B-B to A-B, respectively.9 That is, even when the same distribution of senses is preserved between training and test examples, the portability of the supervised WSD systems is not guaranteed.</Paragraph>
<Paragraph position="4"> These results imply that the examples must be substantially different from one corpus to the other. By studying the weak rules generated by LazyBoosting in both cases, we could corroborate this fact. On the one hand, the types of features used in the rules were significantly different between corpora and, additionally, very few rules applied to both sets; on the other hand, the sign of the prediction of many of these common rules was somewhat contradictory between corpora.</Paragraph>
<Paragraph position="5"> 9 This loss in accuracy is not as important as in the first experiment, due to the simplification provided by the balancing of sense distributions.</Paragraph> </Section> </Section> </Paper>