<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0858">
  <Title>Word Sense Disambiguation by Web Mining for Word Co-occurrence Probabilities</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The Senseval-3 English Lexical Sample (ELS) task requires disambiguating 57 words, with an average of roughly 140 training examples and 70 testing examples of each word. Each example is about a paragraph of text, in which the word that is to be disambiguated is marked as the head word. The average head word has around six senses. The training examples are manually classified according to the intended sense of the head word, inferred from the surrounding context. The task is to use the training data and any other relevant information to automatically assign classes to the testing examples.</Paragraph>
    <Paragraph position="1"> This paper presents the National Research Council (NRC) Word Sense Disambiguation (WSD) system, which generated our four entries for the Senseval-3 ELS task (NRC-Fine, NRC-Fine2, NRC-Coarse, and NRC-Coarse2). Our approach to the ELS task is to treat it as a classical supervised machine learning problem. Each example is represented as a feature vector with several hundred features. Each of the 57 ambiguous words is represented with a different set of features. Typically, around half of the features are syntactic and the other half are semantic. After the raw examples are converted to feature vectors, the Weka machine learning software is used to induce a model of the training data and predict the classes of the testing examples (Witten and Frank, 1999).</Paragraph>
    <Paragraph position="2"> The syntactic features are based on part-of-speech tags, assigned by a rule-based tagger (Brill, 1994). The main innovation of the NRC WSD system is the method for generating the semantic features, which are derived from word co-occurrence probabilities. We estimated these probabilities using the Waterloo MultiText System with a corpus of about one terabyte of unlabeled text, collected by a web crawler (Clarke et al., 1995; Clarke and Cormack, 2000; Terra and Clarke, 2003).</Paragraph>
    <Paragraph position="3"> In Section 2, we describe the NRC WSD system.</Paragraph>
    <Paragraph position="4"> Our experimental results are presented in Section 3 and we conclude in Section 4.</Paragraph>
  </Section>
class="xml-element"></Paper>