<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0813">
  <Title>Combining Contextual Features for Word Sense Disambiguation</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> In this section we describe the system performance on the verbs from SENSEVAL-1 and SENSEVAL-2.</Paragraph>
    <Paragraph position="1"> The system was built after SENSEVAL-1 but before</Paragraph>
    <Paragraph position="3"> evaluation format where the participants were provided with hand-annotated training data and test data. The lexical inventory used was the Hector lexicon, developed jointly by DEC and Oxford University Press (Kilgarriff and Rosenzweig, 2000). By allowing for discussion and revision of confusing lexical entries during tagging, before the final test data was tagged, inter-annotator agreement of over 90% was eventually achieved. However, the Hector lexicon was very small and under proprietary constraints, making it an unsuitable candidate for applications requiring a large-scale, publicly-available dictionary.</Paragraph>
    <Paragraph position="4"> SENSEVAL-2 The subsequent SENSEVAL-2 exercise used a pre-release version of WordNet1.7 which is much larger than Hector and is more widely used in NLP applications. The average training set size for verbs was only about half of that provided in SENSEVAL-1, while the average polysemy of each verb was higher3. Smaller training sets and the use of a large-scale, publicly available dictionary arguably make SENSEVAL-2 a more indicative evaluation of WSD systems in the current NLP environment than SENSEVAL-1. The role of sense groups was also explored as a way to address the popular criticism that WordNet senses are too vague and fine-grained. During the data preparation for SENSEVAL-2, previous WordNet groupings of the verbs were carefully re-examined, and specific semantic criteria were manually associated with each group. This occasionally resulted in minor revisions of the original groupings (Fellbaum et al., 2001).</Paragraph>
    <Paragraph position="5"> This manual method of creating a more coarse-grained sense inventory from WordNet contrasts with automatic methods that rely on existing se- null was 11.6 using the Hector dictionary in SENSEVAL-1, and 15.6 using WordNet1.7 in SENSEVAL-2.</Paragraph>
    <Paragraph position="6"> mantic links in WordNet (Mihalcea and Moldovan, 2001), which can produce divergent dictionaries.</Paragraph>
    <Paragraph position="7"> Our system performs competitively with the best performing systems in SENSEVAL-1 and SENSEVAL-2. Measuring accuracy as the recall score (which is equal to precision in our case because the system assigns a tag to every instance), we compare the system's coarse-grained scores using the revised groupings versus random groupings, and demonstrate the coherence and utility of the groupings in reconciling apparent tagging disagreements.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 SENSEVAL-1 Results
</SectionTitle>
      <Paragraph position="0"> The maximum entropy WSD system's performance on the verbs from the evaluation data for SENSEVAL-1 (Kilgarriff and Rosenzweig, 2000) rivaled that of the best-performing systems. Table 1 shows the performance of variants of the system using different subsets of possible features. In addition to experimenting with different combinations of local/topical features, we attempted to undo passivization transformations to recover underlying subjects and objects. This was expected to increase the accuracy with which verb arguments could be identified, helping in cases where selectional restrictions on arguments played an important role in differentiating between senses.</Paragraph>
      <Paragraph position="1"> The best overall variant of the system for verbs did not use WordNet class features, but included topical keywords and passivization transformation, giving an average verb accuracy of 72.3%. This falls between Chodorow, Leacock, and Miller's accuracy of 71.0%, and Yarowsky's 73.4% (74.3% post-workshop). If only the best combination of feature sets for each verb is used, then the maximum entropy models achieve 73.7% accuracy. Even though our system used only the training data provided and none of the information from the dictionary itself, it was still competitive with the top performing systems which also made use of the dictionary to identify multi-word constructions. As we show later, using this additional piece of information improves performance substantially.</Paragraph>
      <Paragraph position="2"> In addition to the SENSEVAL-1 verbs, we ran the system on the SENSEVAL-1 data for shake, which contains both nouns and verbs. The system simply excluded verb complement features whenever the part-of-speech tagger indicated that the word task lex lex+topic lex+trans+topic wn wn+topic wn+trans+topic  formation was used, unless indicated by &amp;quot;+topic,&amp;quot; in which case the topical keyword features were included in the model; &amp;quot;wn&amp;quot; indicates that WordNet class features were used, while &amp;quot;lex&amp;quot; indicates only lexical and named entity tag features were used for the noun complements; &amp;quot;+trans&amp;quot; indicates that an attempt was made to undo passivization transformations.</Paragraph>
      <Paragraph position="3"> to be sense-tagged was not a verb. Even on this mix of nouns and verbs, the system performed well compared with the best system for shake from SENSEVAL-1, which had an accuracy of 76.5% on the same task.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 SENSEVAL-2 Results
</SectionTitle>
      <Paragraph position="0"> We also tested the WSD system on the verbs from the English lexical sample task for SENSEVAL-2.</Paragraph>
      <Paragraph position="1"> In contrast to SENSEVAL-1, senses involving multi-word constructions could be directly identified from the sense tags themselves (through the WordNet sense keys that were used as sense tags), and the head word and satellites of multi-word constructions were explicitly marked in the training and test data. This additional annotation made it much easier for our system to incorporate information about the satellites, without having to look at the dictionary (whose format may vary from one task to another). The best-performing systems on the English verb lexical sample task (including our own) filtered out possible senses based on the marked satellites, and this improved performance.</Paragraph>
      <Paragraph position="2"> Table 2 shows the performance of the system using different subsets of features. While we found little improvement from transforming passivized sentences into a more canonical form to recover underlying arguments, there is a clear improvement in performance as richer linguistic information is incorporated in the model. Adding topical keywords also helped.</Paragraph>
      <Paragraph position="3"> Incorporating topical keywords as well as collocational, syntactic, and semantic local features, our system achieved 59.6% and 69.0% accuracy using fine-grained and coarse-grained scoring, respectively. This is in comparison to the next best-performing system, which had fine- and coarse-grained scores of 57.6% and 67.2% (Palmer et al., 2001). Here we see the benefit from including a filter that only considered phrasal senses whenever there were satellites of multi-word constructions marked in the test data; had we not included this filter, our fine- and coarse-grained scores would have been only 56.9% and 66.1%.</Paragraph>
      <Paragraph position="4"> Table 3 shows a breakdown of the number of senses and groups for each verb, the fine-grained accuracy of the top three official SENSEVAL-2 systems, fine- and coarse-grained accuracy of our maxi- null accuracy of top three competitors (JHU, SMULS, KUNLP) in SENSEVAL-2 English verbs lexical sample task; fine-grained (MX) and coarse-grained accuracy (MX-c) of maximum entropy system; inter-tagger agreement for fine-grained senses (ITA) and sense groups (ITA-c). *No inter-tagger agreement figures were available for &amp;quot;play&amp;quot; and &amp;quot;work&amp;quot;.</Paragraph>
      <Paragraph position="5"> mum entropy system, and human inter-tagger agreement on fine-grained and coarse-grained senses.</Paragraph>
      <Paragraph position="6"> Overall, coarse-grained evaluation using the groups improved the system's score by about 10%. This is consistent with the improvement we found in inter-tagger agreement for groups over fine-grained senses (82% instead of 71%). As a base-line, to ensure that the improvement did not come simply from the lower number of tag choices for each verb, we created random groups. Each verb had the same number of groups, but with the senses distributed randomly. We found that these random groups provided almost no benefit to the inter-annotator agreement figures (74% instead of 71%), confirming the greater coherence of the manual groupings.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Analysis of errors
</SectionTitle>
      <Paragraph position="0"> We found that the grouped senses for call substantially improved performance over evaluating with respect to fine-grained senses; the system achieved 63.6% accuracy with coarse-grained scoring using the groups, as compared to 47.0% accuracy with fine-grained scoring. When evaluated against the fine-grained senses, the system got 35 instances wrong, but 11 of the &amp;quot;incorrect&amp;quot; instances were tagged with senses that were actually in the same group as the correct sense. This group of senses differs from others in the ability to take a small clause as a complement, which is modeled as a feature in our system. Here we see that the system benefits from using syntactic features that are linguistically richer than the features that have been used in the past.</Paragraph>
      <Paragraph position="1"> 29% of errors made by the tagger on develop were due to confusing Sense 1 and Sense 2, which are in the same group. The two senses describe transitive verbs that create new entities, characterized as either &amp;quot;products, or mental or artistic creations: CREATE (Sense 1)&amp;quot; or &amp;quot;a new theory of evolution: CREATE BY MENTAL ACT (Sense 2).&amp;quot; Instances of Sense 1 that were tagged as Sense 2 by the system included: Researchers said they have developed a genetic engineering technique for creating hybrid plants for a number of key crops; William Gates and Paul Allen developed an early language-housekeeper system for PCs. Conversely, the following instances of Sense 2 were tagged as Sense 1 by the tagger: A Purdue University team hopes to develop ways to magnetically induce cardiac muscle contractions; Kobe Steel Ltd. adopted Soviet casting technology used it until it developed its own system. Based on the direct object of develop, the automatic tagger was hardpressed to differentiate between developing a technique/system (Sense 1) and developing a way/system (Sense 2).</Paragraph>
      <Paragraph position="2"> Analysis of inter-annotator disagreement between two human annotators doing double-blind tagging revealed similar confusion between these two senses of develop; 25% of the human annotator disagreements on develop involved determining which of these two senses should be applied to phrases like develop a better way to introduce crystallography techniques. These instances that were difficult for the automatic WSD system, were also difficult for human annotators to differentiate consistently.</Paragraph>
      <Paragraph position="3"> These different senses are clearly related, but the relation is not reflected in their hypernyms, which emphasize the differences in what is being highlighted by each sense, rather than the similarities. Methods of evaluation that automatically back off from synset to hypernyms (Lin, 1997) would fail to credit the system for &amp;quot;mistagging&amp;quot; an instance with a closely related sense. Manually created sense groups, on the other hand, can capture broader, more underspecified senses which are not explicitly listed and which do not participate in any of the WordNet semantic relations.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>