<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1005">
  <Title>Learning Semantic Classes for Word Sense Disambiguation</Title>
  <Section position="3" start_page="35" end_page="35" type="metho">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> Using generic classes as word senses has been done several times in WSD, in various contexts.</Paragraph>
    <Paragraph position="1"> Resnik (1997) described a method to acquire a set of conceptual classes for word senses, employing selectional preferences, based on the idea that certain linguistic predicates constraint the semantic interpretation of underlying words into certain classes. The method he proposed could acquire these constraints from a raw corpus automatically.</Paragraph>
    <Paragraph position="2"> Classification proposed by Levin (1993) for English verbs remains a matter of interest. Although these classes are based on syntactic properties unlike those in WORDNET, it has been shown that they can be used in automatic classifications (Stevenson and Merlo, 2000). Korhonen (2002) proposed a method for mapping WORDNET entries into Levin classes.</Paragraph>
    <Paragraph position="3"> WSD System presented by Crestan et al. (2001) in SENSEVAL-2 classified words into WORD-NET unique beginners. However, their approach did not use the fact that the primes are common for words, and training data can hence be reused.</Paragraph>
    <Paragraph position="4"> Yarowsky (1992) used Roget's Thesaurus categories as classes for word senses. These classes differ from those mentioned above, by the fact that they are based on topical context rather than syntax or grammar.</Paragraph>
  </Section>
  <Section position="4" start_page="35" end_page="38" type="metho">
    <SectionTitle>
3 Basic Design of the System
</SectionTitle>
    <Paragraph position="0"> The system consists of three classifiers, built using local context, part of speech and syntax-based relationships respectively, and combined with the most-frequent sense classifier by using weighted majority voting. Our experiments (section 4.3) show that building separate classifiers from different subsets of features and combining them works better than building one classifier by concatenating the features together.</Paragraph>
    <Paragraph position="1"> For training and testing, we used publicly available data sets, namely SEMCOR corpus (Miller et al., 1993) and SENSEVAL English all-words task data. In order to evaluate the systems performance in vivo, we mapped the outputs of our classifier to the answers given in the key. Although we face a penalty here due to the loss of granularity, this approach allows a direct comparison of actual usability of our system.</Paragraph>
    <Section position="1" start_page="35" end_page="36" type="sub_section">
      <SectionTitle>
3.1 Data
</SectionTitle>
      <Paragraph position="0"> As training corpus, we used Brown-1 and Brown2 parts of SEMCOR corpus; these parts have all of their open-class words tagged with corresponding WORDNET senses. A part of the training corpus was set aside as the development corpus. This part was selected by randomly selecting a portion of multi- null class words (600 instances for each part of speech) from the training data set. As labels, the semantic class (lexicographic file number) was extracted from the sense key of each instance. Testing data sets from SENSEVAL-2 and SENSEVAL-3 English all-words tasks were used as testing corpora.</Paragraph>
    </Section>
    <Section position="2" start_page="36" end_page="37" type="sub_section">
      <SectionTitle>
3.2 Features
</SectionTitle>
      <Paragraph position="0"> The feature set we selected was fairly simple; As we understood from our initial experiments, widewindow context features and topical context were not of much use for learning semantic classes from a multi-word training data set. Instead of generalizing, wider context windows add to noise, as seen from validation experiments with held-out data.</Paragraph>
      <Paragraph position="1"> Following are the features we used:  This is a window of n words to the left, and n words to the right, where n [?] {1,2,3} is a parameter we selected via cross validation.1 Punctuation marks were removed and all words were converted into lower case. The feature vector was calculated the same way for both nouns and verbs. The window did not exceed the boundaries of a sentence; when there were not enough words to either side of the word within the window, the value NULL was used to fill the remaining positions.</Paragraph>
      <Paragraph position="2"> For instance, for the noun 'companion' in sentence (given with POS tags)</Paragraph>
      <Paragraph position="4"> the local context feature vector is [at, his, drinking, through, bleary, tear-filled], for window size n = 3. Notice that we did not consider the hyphenated words as two words, when the data files had them annotated as a single token.</Paragraph>
      <Paragraph position="5">  This consists of parts of speech for a window of n words to both sides of word (excluding the word 1Validation results showed that a window of two words to both sides yields the best performance for both local context and POS features. n = 2 is the size we used in actual evaluation.  target word is shown inside [brackets] itself), with quotation signs and punctuation marks ignored. For SEMCOR files, existing parts of speech were used; for SENSEVAL data files, parts of speech from the accompanying Penn-Treebank parsed data files were aligned with the XML data files. The value vector is calculated the same way as the local context, with the same constraint on sentence boundaries, replacing vacancies with NULL.</Paragraph>
      <Paragraph position="6"> As an example, for the sentence we used in the previous example, the part-of-speech vector with context size n = 3 for the verb peered is [NULL, NULL, NNP, RB, IN, PRP$].</Paragraph>
      <Paragraph position="7">  The words that hold several kinds of syntactic relations with the word under consideration were selected. We used Link Grammar parser due to Sleator and Temperley (1991) because of the information-rich parse results produced by it.</Paragraph>
      <Paragraph position="8"> Sentences in SEMCOR corpus files and the SENSEVAL files were parsed with Link parser, and words were aligned with links. A given instance of a word can have more than one syntactic features present. Each of these features was considered as a binary feature, and a vector of binary values was constructed, of which each element denoted a unique feature found in the test set of the word.</Paragraph>
      <Paragraph position="9"> Each syntactic pattern feature falls into either of two types collocation or relation: Collocation features Collocation features are such features that connect the word under consideration to another word, with a preposition or an infinitive in between -- for instance, the phrase 'art of change-ringing' for the word art. For these features, the feature value consists of two words, which are connected to the given word either from left or  from right, in a given order. For the above example, the feature value is [[?].of.change-ringing], where [?] denotes the placeholder for word under consideration.</Paragraph>
      <Paragraph position="10"> Relational features Relational features represent more direct grammatical relationships, such as subject-verb or noun-adjective, the word under consideration has with surrounding words. When encoding the feature value, we specified the relation type and the value of the feature in the given instance. For instance, in the phrase 'Henry peered doubtfully', the adverbial modifier feature for the verb 'peered' is encoded as [adverb-mod doubtfully].</Paragraph>
      <Paragraph position="11"> A description of the relations for each part of speech is given in the table 1.</Paragraph>
    </Section>
    <Section position="3" start_page="37" end_page="37" type="sub_section">
      <SectionTitle>
3.3 Classifier and instance weighting
</SectionTitle>
      <Paragraph position="0"> The classifier we used was TiMBL, a memory based learner due to Daelemans et al. (2003). One reason for this choice was that memory based learning has shown to perform well in previous word sense disambiguation tasks, including some best performers in SENSEVAL, such as (Hoste et al., 2001; Decadt et al., 2004; Mihalcea and Faruque, 2004). Another reason is that TiMBL supported exemplar weights, a necessary feature for our system for the reasons we describe in the next section.</Paragraph>
      <Paragraph position="1"> One of the salient features of our system is that it does not consider every example to be equally important. Due to the fact that training instances from different instances can provide confusing examples, as shown in section 1.3, such an approach cannot be trusted to give good performance; we verified this by our own findings through empirical evaluations as shown in section 4.2.</Paragraph>
      <Paragraph position="2"> 3.3.1 Weighting instances with similarity We use a similarity based measure to assign weights to training examples. In the method we use, these weights are used to adjust the distances between the test instance and the example instances.</Paragraph>
      <Paragraph position="3"> The distances are adjusted according to the formula</Paragraph>
      <Paragraph position="5"> where [?]E(X,Y) is the adjusted distance between instance Y and example X, [?](X,Y) is the original distance, ewX is the exemplar weight of instance X.</Paragraph>
      <Paragraph position="6"> The small constant epsilon1 is added to avoid division by zero.</Paragraph>
      <Paragraph position="7"> There are various schemes used to measure inter-sense similarity. Our experiments showed that the measure defined by Jiang and Conrath (1997) (JCn) yields best results. Results for various weighting schemes are discussed in section 4.2.</Paragraph>
      <Paragraph position="8">  The exemplar weights were derived from the fol- null lowing method: 1. pick a labelled example e, and extract its sense se and semantic class ce.</Paragraph>
      <Paragraph position="9"> 2. if the class ce is a candidate for the current test word w, i.e. w has any senses that fall into ce, find out the most frequent sense of w, scew , within ce. We define the most frequent sense within a class as the sense that has the lowest WORDNET sense number within that class. If none of the senses of w fall into ce, we ignore that example.</Paragraph>
      <Paragraph position="10"> 3. calculate the relatedness measure between se and scew , using whatever the similarity metric  being considered. This is the exemplar weight for example e.</Paragraph>
      <Paragraph position="11"> In the implementation, we used freely available WordNet::Similarity package (Pedersen et al., 2004). 2</Paragraph>
    </Section>
    <Section position="4" start_page="37" end_page="38" type="sub_section">
      <SectionTitle>
3.4 Classifier optimization
</SectionTitle>
      <Paragraph position="0"> A part of SEMCOR corpus was used as a validation set (see section 3.1). The rest was used as training data in validation phase. In the preliminary experiments, it was seen that the generally recommended classifier options yield good enough performance, although variations of switches could improve performance slightly in certain cases. Classifier options were selected by a search over the available option space for only three basic classifier parameters, namely, number of nearest neighbors, distance metric and feature weighting scheme.</Paragraph>
      <Paragraph position="1"> 2WordNet::Similarity is a perl package available freely under GNU General Public Licence. http://wnsimilarity.sourceforge.net. null  bined classifiers: recall measures for nouns and verbs combined.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>