<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1005"> <Title>Learning Semantic Classes for Word Sense Disambiguation</Title>
<Section position="2" start_page="0" end_page="35" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Word Sense Disambiguation (WSD) is the task of determining the meaning of a word in a given context. The task has a long history in natural language processing and is considered an intermediate task whose success is important for other tasks such as Machine Translation, Language Understanding, and Information Retrieval.</Paragraph>
<Paragraph position="1"> Despite a long history of attempts to solve the WSD problem by empirical means, there is no clear consensus on what it takes to build a high-performance WSD system. Algorithms based on supervised learning generally perform better than unsupervised systems, but they suffer from a serious drawback: the difficulty of acquiring considerable amounts of training data, also known as the knowledge acquisition bottleneck. In the typical setting, supervised learning needs training data created for each and every polysemous word; Ng (1997) estimates an effort of 16 person-years for acquiring training data for 3,200 significant words in English. Mihalcea and Chklovski (2003) provide a similar estimate of an 80 person-year effort for creating manually labelled training data for about 20,000 words in a common English dictionary.</Paragraph>
<Paragraph position="2"> Two basic approaches have been tried as solutions to the lack of training data, namely unsupervised systems and semi-supervised bootstrapping techniques. Unsupervised systems mostly rely on knowledge-based techniques, exploiting sense knowledge encoded in machine-readable dictionary entries, taxonomical hierarchies such as WORDNET (Fellbaum, 1998), and so on. Most bootstrapping techniques start from a few 'seed' labelled examples, classify some unlabelled instances using this knowledge, and iteratively expand their knowledge using information available within the newly labelled data. Some others employ hierarchical relatives such as hypernyms and hyponyms.</Paragraph>
<Paragraph position="3"> In this work, we present another practical alternative: we reduce the WSD problem to one of finding the generic semantic class of a given word instance. We show that learning such classes can help relieve the knowledge acquisition bottleneck.</Paragraph>
<Section position="1" start_page="0" end_page="35" type="sub_section"> <SectionTitle> 1.1 Learning senses as concepts </SectionTitle>
<Paragraph position="0"> As the semantic classes to learn, we use the WORDNET lexicographer file identifiers corresponding to the fine-grained senses. By learning these generic classes, we show that training data can be reused, without having to rely on word-specific training data. This is possible because, unlike senses, the semantic classes are shared across words; to learn the properties of a given class, we can draw on data from many different words. For instance, the noun crane falls into the two semantic classes ANIMAL and ARTEFACT. We can expect words such as pelican and eagle (in the bird sense) to have usage patterns similar to those of the ANIMAL sense of crane, and hence to provide common training examples for that class.</Paragraph>
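[Illustrative sketch; not part of the original paper.] The lexicographer file identifiers used as semantic classes above can be listed with NLTK's WordNet interface, which exposes them as 'lexnames'; for the noun crane the senses include ones in noun.animal and noun.artifact, corresponding to the ANIMAL and ARTEFACT classes mentioned above. The helper name below is an assumption made for the illustration.

# Illustrative sketch (not from the paper): reading WORDNET lexicographer
# file identifiers as coarse semantic classes via NLTK's WordNet interface.
# Requires the WordNet data: import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def semantic_classes(lemma, pos=wn.NOUN):
    """Return each fine-grained sense of `lemma` together with its
    lexicographer file identifier (e.g. noun.animal, noun.artifact)."""
    return [(synset.name(), synset.lexname()) for synset in wn.synsets(lemma, pos=pos)]

if __name__ == "__main__":
    # The senses of the noun 'crane' include ones in the ANIMAL
    # (noun.animal) and ARTEFACT (noun.artifact) classes.
    for sense, lexfile in semantic_classes("crane"):
        print(sense, "->", lexfile)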
<Paragraph position="1"> For learning these classes, we can make use of any training example labelled with WORDNET senses for supervised WSD, as we describe in section 3.1.</Paragraph>
<Paragraph position="2"> Once an instance has been classified, the resulting semantic class can be transformed into a finer-grained sense using a heuristic mapping, as we show in the next subsection. This does not guarantee a perfect conversion, because such a mapping can miss some of the finer senses; but, as we show in what follows, this problem in itself does not prevent us from attaining good performance in a practical WSD setting.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="35" type="sub_section"> <SectionTitle> 1.2 Information loss in coarse-grained senses </SectionTitle>
<Paragraph position="0"> As an empirical verification of the hypothesis that we can still build effective fine-grained sense disambiguators despite this loss of information, we analyzed the performance of a hypothetical coarse-grained classifier that performs at 100% accuracy. As the general set of classes, we used the WORDNET unique beginners, of which there are 25 for nouns and 15 for verbs.</Paragraph>
<Paragraph position="1"> To simulate this classifier on the SENSEVAL English all-words task data (Edmonds and Cotton, 2001; Snyder and Palmer, 2004), we mapped the fine-grained senses in the official answer keys to their respective unique beginners. There is an information loss in this mapping, because each unique beginner typically covers more than one sense. To see how this 'classifier' fares in a fine-grained task, we can map the 'answers' back to WORDNET fine-grained senses by picking the sense with the lowest sense number that falls within each unique beginner (see the illustrative sketch following the Figure 1 discussion below). In principle, this is the most likely sense within the class, because WORDNET senses are said to be ordered in descending order of frequency. Since this sense is not necessarily the same as the original sense of the instance, the accuracy of the fine-grained answers will be below 100%.</Paragraph>
<Paragraph position="2"> [Figure 1 caption, partially recovered: "... coarse-grained classifier, output mapped to fine-grained senses, on SENSEVAL English all-words tasks."]</Paragraph>
<Paragraph position="3"> Figure 1 shows the performance of this transformed fine-grained classifier (CG) for nouns and verbs on the SENSEVAL-2 and SENSEVAL-3 English all-words task data (marked S2 and S3 respectively), along with the WORDNET first-sense baseline (BL) and the best-performing classifier at each SENSEVAL exercise (CL), SMUaw (Mihalcea, 2002) and GAMBL-AW (Decadt et al., 2004) respectively.</Paragraph>
<Paragraph position="4"> There is a considerable difference, in terms of improvement over the baseline, between the state-of-the-art systems and the hypothetical optimal coarse-grained system. This shows that there is an improvement in performance to be gained over the state of the art if we can build, with sufficiently high accuracy, a classifier for even a very coarse level of senses.</Paragraph>
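[Illustrative sketch; not part of the original paper.] The back-off mapping described above in this subsection, from a coarse class to the lowest-numbered (most frequent) fine-grained sense within it, can be pictured in a few lines of Python using NLTK's WordNet interface. Lexicographer file names such as noun.animal are used here as a stand-in for the unique beginners; this choice of class inventory and the function name are assumptions made for the illustration.

# Illustrative sketch (not from the paper): map a coarse semantic class back
# to a fine-grained WORDNET sense by taking the lowest-numbered (i.e. most
# frequent) sense of the lemma that falls within that class. Lexicographer
# file names stand in for the unique beginners here.
from nltk.corpus import wordnet as wn

def coarse_to_fine(lemma, pos, semantic_class):
    """Return the most frequent sense of `lemma` whose lexicographer
    file matches `semantic_class`, or None if the class never occurs."""
    for synset in wn.synsets(lemma, pos=pos):  # listed in sense-number order
        if synset.lexname() == semantic_class:
            return synset
    return None

# For example, coarse_to_fine("crane", wn.NOUN, "noun.animal") yields the
# most frequent ANIMAL sense of the noun 'crane'.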
<Paragraph position="5"> We believe that the chances of reaching such high accuracy with a coarse-grained sense classifier are better, for several reasons:
* good performance has previously been reported for coarse-grained systems (Yarowsky, 1992);
* data are more readily available, because data created for different words can be reused. For instance, labelled data for the noun 'crane' is not found in the SEMCOR corpus at all, but there are more than 1000 sample instances for the concept ANIMAL, and more than 9000 for ARTEFACT;
* coarser senses bring higher inter-annotator agreement and lower corpus/genre dependencies in training and testing data.</Paragraph> </Section>
<Section position="3" start_page="35" end_page="35" type="sub_section"> <SectionTitle> 1.3 Overall approach </SectionTitle>
<Paragraph position="0"> Basically, we assume that we can learn the 'concepts', in terms of WORDNET unique beginners, from a set of data labelled with these concepts, regardless of the actual word being labelled. Hence, instead of relying on examples of the very word that is being classified, we can use a generic data set that is large enough, in which various words provide training examples for these concepts.</Paragraph>
<Paragraph position="1"> Unfortunately, simply labelling each instance with its semantic class and then applying standard supervised learning algorithms did not work well. This is probably because the effectiveness of a feature pattern often depends on the actual word being disambiguated and not just its semantic class. For example, the phrase 'run the newspaper' effectively indicates that 'newspaper' belongs to the semantic class GROUP, whereas 'run the tape' indicates that 'tape' belongs to the semantic class ARTEFACT. The collocation 'run the' indicates the GROUP sense only for 'newspaper' and closely related words such as 'department' or 'school'.</Paragraph>
<Paragraph position="2"> In this experiment, we use a k-nearest neighbor classifier. To allow training examples of different words from the same semantic class to provide information for each other, we modify the distance between instances so that instances of similar words become closer. This is described in Section 3.</Paragraph>
<Paragraph position="3"> The rest of the paper is organized as follows: in section 2 we discuss related work; section 3 gives a detailed description of our system; and section 4 discusses the empirical results, showing that our representation can yield state-of-the-art performance.</Paragraph> </Section> </Section> </Paper>
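[Illustrative sketch appended after the excerpt; not part of the original paper.] Section 1.3 only outlines the idea of shrinking the distance between instances of similar words; the paper's actual distance measure is defined in its Section 3. The hypothetical k-nearest-neighbor sketch below shows one simple way such a modification could look; the bag-of-features representation, the Jaccard distance, and the interpolation weight alpha are all assumptions made for illustration.

# Hypothetical sketch only: k-NN over semantic classes where the distance is
# reduced when the two target words are similar, so that examples of related
# words (e.g. 'pelican') can support a test word (e.g. 'crane') in the same
# class. This is NOT the paper's actual distance (see its Section 3).
from collections import Counter

def feature_distance(f1, f2):
    """Jaccard distance between two sets of context features (assumed)."""
    f1, f2 = set(f1), set(f2)
    union = f1 | f2
    return 0.0 if not union else 1.0 - len(f1 & f2) / len(union)

def instance_distance(x, y, word_sim, alpha=0.3):
    """Interpolate feature distance with word dissimilarity (assumed form),
    so that instances of similar target words end up closer together."""
    return (1.0 - alpha) * feature_distance(x["features"], y["features"]) \
        + alpha * (1.0 - word_sim(x["word"], y["word"]))

def knn_semantic_class(test, training, word_sim, k=5):
    """Predict a semantic class by majority vote among the k nearest
    training instances under the modified distance."""
    nearest = sorted(training, key=lambda t: instance_distance(test, t, word_sim))[:k]
    return Counter(t["class"] for t in nearest).most_common(1)[0][0]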