<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0322"> <Title>Distinguishing Word Senses in Untagged Text</Title> <Section position="8" start_page="201" end_page="203" type="evalu"> <SectionTitle> 6 Experimental Results </SectionTitle> <Paragraph position="0"> Figure 5 shows the average accuracy and standard deviation of disambiguation over 25 random trials for each combination of word, feature set and learning algorithm. Those cases where the average accuracy of one algorithm for a particular feature set is significantly higher than another algorithm, as judged by the t-test (p=.01), are shown in bold face.</Paragraph> <Paragraph position="1"> For each word, the most accurate overall experiment (i.e., algorithm/feature set combination), and those that are not significantly less accurate are underlined. Also included in Figure 5 is the percentage of each sample that is composed of the majority sense.</Paragraph> <Paragraph position="2"> This is the accuracy that can be obtained by a majority classifier; a simple classifier that assigns each ambiguous word to the most frequent sense in a sample. However, bear in mind that in unsupervised experiments the distribution of senses is not generally known.</Paragraph> <Paragraph position="3"> Perhaps the most striking aspect of these results is that, across all experiments, only the nouns are disambiguated with accuracy greater than that of the majority classifier. This is at least partially explained by the fact that, as a class, the nouns have the most uniform distribution of senses. This point will be elaborated on in Section 6.1. While the choice of feature set impacts accuracy, overall it is only to a small degree. We return to this point in Section 6.2. The final result, to be discussed in Section 6.3, is that the differences in the accuracy of these three algorithms are statistically significant both on average and for individual words.</Paragraph> <Section position="1" start_page="201" end_page="201" type="sub_section"> <SectionTitle> 6.1 Distribution of Classes </SectionTitle> <Paragraph position="0"> Extremely skewed distributions pose a challenging learning problem since the sample contains precious little information regarding minority classes. This makes it difficult to learn their distributions without prior knowledge. For unsupervised approaches, this problem is exacerbated by the difficultly in distinguishing the characteristics of the minority classes from noise.</Paragraph> <Paragraph position="1"> In this study, the accuracy of the unsupervised algorithms was less than that of the majority classifier in every case where the percentage of the majority sense exceeded 68%. However, in the cases where the performance of these algorithms was less than that of the majority classifier, they were often still providing high accuracy disambiguation (e.g., 91% accuracy for last). Clearly, the distribution of classes is not the only factor affecting disambiguation accuracy; compare the performance of these algorithms on bill and public which have roughly the same class distributions.</Paragraph> <Paragraph position="2"> It is difficult to quantify the effect of the distribution of classes on a learning algorithm particularly when using naturally occurring data. 
<Paragraph position="3"> Because skewed distributions are common in lexical work (Zipf, 1935), they are an important consideration in formulating disambiguation experiments. In future work, we will investigate feature-selection procedures that are more sensitive to minority classes. Reliance on frequency-based features, as used in this work, means that the more skewed the sample, the more likely it is that the features will be indicative of only the majority class.</Paragraph>
</Section>
<Section position="2" start_page="201" end_page="202" type="sub_section">
<SectionTitle> 6.2 Feature Set </SectionTitle>
<Paragraph position="0"> Despite varying the feature sets, the relative accuracy of the three algorithms remains quite consistent. For 6 of the 13 words, a single algorithm was significantly more accurate than the other two across all feature sets.</Paragraph>
<Paragraph position="1"> The EM algorithm was most accurate for last and line with all three feature sets. McQuitty's method was significantly more accurate for chief, common, public, and help regardless of the feature set.</Paragraph>
<Paragraph position="2"> Despite this consistency, there were some observable trends associated with changes in feature set. For example, McQuitty's method was significantly more accurate overall in combination with Feature Set C, the EM algorithm was more accurate with Feature Set A, and the accuracy of Ward's method was least favorable with Feature Set B.</Paragraph>
<Paragraph position="3"> For the nouns, there was no significant difference between Feature Sets A and B when using the EM algorithm. For the verbs, there was no significant difference among the three feature sets when using McQuitty's method. The adjectives were disambiguated significantly more accurately using McQuitty's method with Feature Set C.</Paragraph>
<Paragraph position="4"> One possible explanation for the consistency of results across feature sets is that the features most indicative of word senses are included in all three sets, owing to the selection methods and the commonality of feature types. These common features may be sufficient for the level of disambiguation achieved here. This explanation seems more plausible for the EM algorithm, where features are weighted, than for McQuitty's and Ward's methods, which use a representation that does not allow feature weighting.</Paragraph>
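To make this representational contrast concrete, the following is a minimal sketch (not the paper's implementation) of agglomerative clustering over unweighted binary feature vectors. The occurrence-by-feature data are hypothetical; the sketch uses the matching (Hamming) dissimilarity, under which every feature contributes equally, and relies on SciPy's "weighted" linkage being WPGMA, which corresponds to McQuitty's method:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: one binary feature vector per occurrence of an
# ambiguous word (e.g., co-occurrence indicators); three senses sought.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 12)).astype(float)
k = 3

# Matching (Hamming) dissimilarity between binary vectors. Every
# feature contributes equally: there is no notion of feature weighting.
D = pdist(X, metric="hamming")

# SciPy's "weighted" linkage is WPGMA, i.e., McQuitty's method.
mcquitty = fcluster(linkage(D, method="weighted"), t=k, criterion="maxclust")

# Ward's method formally assumes Euclidean distances; applying it to a
# Hamming matrix, as sketched here, is a pragmatic approximation.
ward = fcluster(linkage(D, method="ward"), t=k, criterion="maxclust")

print(np.bincount(mcquitty)[1:], np.bincount(ward)[1:])  # cluster sizes
```

The EM algorithm, by contrast, fits per-feature parameters during estimation and can therefore effectively weight the more informative features.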
</Section>
<Section position="3" start_page="202" end_page="203" type="sub_section">
<SectionTitle> 6.3 Disambiguation Algorithm </SectionTitle>
<Paragraph position="0"> Based on the average accuracy over part-of-speech categories, the EM algorithm performs with the highest accuracy for the nouns, while McQuitty's method performs most accurately for the verbs and adjectives.</Paragraph>
<Paragraph position="1"> This is true regardless of the feature set employed.</Paragraph>
<Paragraph position="2"> The standard deviations give an indication of the effect of ties on the clustering algorithms and of the random initialization on the EM algorithm. In few cases is the standard deviation very small. For the clustering algorithms, a high standard deviation indicates that ties are affecting the cluster analysis; this is undesirable and may point to a need to expand the feature set in order to reduce ties. For the EM algorithm, a high standard deviation means that the algorithm is not settling on any particular maximum; results might become more consistent if the number of parameters to be estimated were reduced.</Paragraph>
<Paragraph position="3"> Figures 6, 7, and 8 show the confusion matrices associated with the disambiguation of concern, interest, and help, using Feature Sets A, B, and C, respectively. A confusion matrix shows, along the main diagonal, the number of cases where the sense discovered by the algorithm agrees with the manually assigned sense; disagreements appear in the rest of the matrix.</Paragraph>
<Paragraph position="4"> In general, these matrices reveal that both the EM algorithm and Ward's method are more biased toward balanced distributions of senses than is McQuitty's method. This may explain the better performance of McQuitty's method in disambiguating the words with the most skewed sense distributions, the adjectives and adverbs. It is possible to steer the EM algorithm away from this tendency toward discovering balanced distributions by providing prior knowledge of the expected sense distribution; this will be explored in future work.</Paragraph>
</Section>
</Section>
</Paper>