<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0839">
  <Title>Complementarity of Lexical and Simple Syntactic Features: The SyntaLex Approach to SENSEVAL-3</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Feature Space
</SectionTitle>
    <Paragraph position="0"> Simple lexical and syntactic features are used to represent the context. The lexical features used are word bigrams. The Part of Speech (PoS) of the target word and its neighbors make up the syntactic features. Bigrams are readily captured from the text, while Part of Speech taggers are widely available for a variety of languages.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Bigrams
</SectionTitle>
      <Paragraph position="0"> A bigram is a pair of words that occur close to each other in text and in a particular order. Consider the sentence: The interest rate is lower in state banks.</Paragraph>
      <Paragraph position="2"> It has the following bigrams: the interest, interest rate, rate is, is lower, lower in, in state and state banks. Note that the bigram interest rate suggests that bank has been used in the financial institution sense and not the river bank sense.</Paragraph>
      <Paragraph position="3"> All features are binary valued. Thus, the bigram feature interest rate has value 1 if it occurs in the context of the target word, and 0 if it does not. The learning algorithm considers only those bigrams that occur at least twice in the training data and have a word association ratio greater than a predetermined threshold. Bigrams that tend to be very common are ignored via a stop list. The Ngram Statistics Package is used to identify statistically significant bigrams in the training corpus for a particular word.</Paragraph>
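      <Paragraph> The selection and encoding of bigram features described above might be sketched as follows. This is a minimal illustration, not the Ngram Statistics Package itself: it assumes that a bigram is a pair of adjacent tokens, that a bigram is dropped only when both of its words are on the stop list, and that the stop list is the tiny hypothetical one shown. The word association ratio test is omitted because the measure and threshold are not given here. The names select_bigrams and binary_features are invented for this sketch.

```python
from collections import Counter

# Assumption: a tiny illustrative stop list; the real list is not given in the text.
STOP_WORDS = {"the", "is", "in", "of", "a", "on"}


def bigrams(tokens):
    """Return the ordered pairs of adjacent tokens in a context."""
    return list(zip(tokens, tokens[1:]))


def select_bigrams(training_contexts, min_count=2):
    """Keep bigrams that occur at least min_count times in the training data
    and are not made up entirely of stop words.
    The association-ratio filter is deliberately left out of this sketch."""
    counts = Counter()
    for tokens in training_contexts:
        counts.update(bigrams(tokens))
    return sorted(
        pair
        for pair, n in counts.items()
        if n >= min_count and not all(word in STOP_WORDS for word in pair)
    )


def binary_features(tokens, selected):
    """Map each selected bigram to 1 if it occurs in the context, else 0."""
    present = set(bigrams(tokens))
    return {pair: int(pair in present) for pair in selected}


training = [
    "the interest rate is lower in state banks".split(),
    "the bank raised the interest rate".split(),
    "he sat on the river bank".split(),
]
selected = select_bigrams(training)
features = binary_features("the interest rate rose".split(), selected)
```

Only bigrams seen at least twice in the toy training data survive selection, and every surviving bigram becomes one binary feature per test context.</Paragraph>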
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Part of Speech Features
</SectionTitle>
      <Paragraph position="0"> The Part of Speech (PoS) of the target word and its surrounding words can be useful indicators of its intended sense. Consider the following sentences, where turn is used in the changing sides/parties sense and the changing course/direction sense, respectively:</Paragraph>
      <Paragraph position="2"/>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, July 2004. Association for Computational Linguistics
</SectionTitle>
      <Paragraph position="0"> Notice that the Parts of Speech of the words following turn in the two sentences are significantly different.</Paragraph>
      <Paragraph position="1"> We believe that words used in different senses may be surrounded by words with different PoS. Therefore, PoS of words at particular positions relative to the target word are used as features to identify the intended sense. The PoS of the target word is denoted by P_0. The Part of Speech of words following it are represented by P_1, P_2 and so on, while that of words to the left of the target word are P_{-1}, P_{-2}, etc. Like bigrams, the Part of Speech features are binary. For example, the feature (P_1 = JJ) has value 1 if the target word is followed by an adjective (JJ), and 0 otherwise.</Paragraph>
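      <Paragraph> As a rough illustration of these position-relative features, the sketch below turns a tagged context into binary PoS features. It assumes the context has already been tagged, uses hand-assigned Penn Treebank style tags for the example, and encodes each feature as a string key such as "P1=JJ". The key format, the window size and the function names are choices made for this sketch and are not taken from the paper.

```python
def pos_features(tags, target_index, window=2):
    """Return the binary PoS features that fire for a context.

    tags is the list of PoS tags for the context, target_index is the
    position of the target word, and window is how many positions to the
    left and right are considered (an assumed setting)."""
    features = {}
    for offset in range(-window, window + 1):
        position = target_index + offset
        # Skip positions that fall outside the context.
        if position in range(len(tags)):
            features[f"P{offset}={tags[position]}"] = 1
    return features


def feature_value(features, name):
    """A feature that did not fire has value 0."""
    return features.get(name, 0)


# Hypothetical tagged context; the tags are assigned by hand for illustration.
tagged = [("they", "PRP"), ("turn", "VBP"), ("sour", "JJ")]
tags = [tag for _, tag in tagged]
features = pos_features(tags, target_index=1)
```

Here feature_value(features, "P1=JJ") is 1 because the word after the target is tagged JJ, while feature_value(features, "P1=NN") is 0.</Paragraph>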
      <Paragraph position="2"> 3 Data and its Pre-processing  The English lexical sample of SENSEVAL-3 has 7,860 sense-tagged training instances and 3,944 test instances. The training data has six pairs of instances with identical context (but different instance IDs). These duplicates are removed so as not to unfairly bias the classifier toward such instances. The test data has one pair of instances with the same context, but no instances were removed from the test data in order to facilitate comparison with other systems. The data also has certain instances with multiple occurrences of a word marked as the target word. We remove all such markings except for the first occurrence of the target word in an instance. Thus, our systems identify the intended sense based solely on how the target word is used in its first occurrence.</Paragraph>
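      <Paragraph> The removal of duplicate training instances could look roughly like the sketch below. It assumes each instance is held as an (instance_id, context, sense) tuple, which is a hypothetical in-memory layout rather than the SENSEVAL file format, and it assumes that the first instance of each identical context is the one kept.

```python
def drop_duplicate_contexts(instances):
    """Keep only the first instance for each distinct context string.

    Each instance is an (instance_id, context, sense) tuple; this layout
    is an assumption made for the sketch."""
    seen = set()
    kept = []
    for instance_id, context, sense in instances:
        if context in seen:
            continue  # identical context already kept under another ID
        seen.add(context)
        kept.append((instance_id, context, sense))
    return kept


training = [
    ("id1", "the bank closed early", "financial"),
    ("id2", "the bank closed early", "financial"),
    ("id3", "the river bank flooded", "river"),
]
deduplicated = drop_duplicate_contexts(training)
```

Only the training set would be passed through such a filter, since the test data is left untouched to keep results comparable with other systems.</Paragraph>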
      <Paragraph position="3"> The sense-tagged training and test data are Part of Speech tagged using the posSenseval package.</Paragraph>
      <Paragraph position="4"> posSenseval PoS-tags any data in SENSEVAL-2 data format (the same as SENSEVAL-3 format) using the Brill Tagger. It represents the PoS tags in appropriate XML tags and outputs the data back in SENSEVAL-2 data format. A simple sentence boundary identifier is used to place one sentence per line, which is a requirement of the Brill Tagger. The mechanism of Guaranteed Pre-tagging (Mohammad and Pedersen, 2003) is used to further enhance the quality of tagging around the target words. The experiments performed on this pre-processed data are described next.</Paragraph>
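      <Paragraph> A simple sentence boundary identifier of the kind mentioned above might be approximated as in the following sketch. It is not the one used by posSenseval: it assumes that a sentence ends at a period, exclamation mark or question mark followed by whitespace, so it will wrongly split after abbreviations such as "Dr." and is meant only to show the one-sentence-per-line output shape.

```python
import re


def one_sentence_per_line(text):
    """Naive splitter: start a new line after '.', '!' or '?' when the
    punctuation mark is followed by whitespace."""
    return re.sub(r"([.!?])\s+", r"\1\n", text.strip())


lines = one_sentence_per_line("Rates fell. Banks reacted!  Why?").split("\n")
```

Each element of lines then holds one sentence, which is the input layout the tagger is described as requiring.</Paragraph>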
    </Section>
  </Section>
</Paper>