<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1097"> <Title>Word Sense Disambiguation using Static and Dynamic Sense Vectors</Title> <Section position="3" start_page="4" end_page="6" type="metho"> <SectionTitle> 2 Word Sense Disambiguation Method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 2.1 Overall System Description </SectionTitle> <Paragraph position="0"> Figure 1 shows the overall system description.</Paragraph> <Paragraph position="1"> The system is composed of a training phase and a test phase. In the training phase, words in the limited context window of training samples, which contain a target word and its sense, are extracted, and the words are weighted with the local density concept (section 2.2). Then, context vectors, which represent each training sample, are indexed, and static sense vectors for each sense are constructed. (A training sample contains a target word, its sense and its context, but the senses of the contextual words are not annotated (SENSEVAL-2, 2001).) A static sense vector is the centroid of the context vectors of the training samples where a target word is used in a certain sense (section 2.3). For example, two sense vectors of 'bank' can be constructed using context vectors of training samples where 'bank' is used as 'business establishment' and those where 'bank' is used as 'artificial embankment'. Each context vector is indexed for 'automatic selective sampling'.</Paragraph> <Paragraph position="2"> In the test phase, contextual words are extracted in the same manner as in the training phase (section 2.5). Then, the 'automatic selective sampling' module retrieves the top-N training samples: cosine similarity between the indexed context vectors of training samples and the context vector of a given test sample identifies the relevant training samples. We can then build another set of sense vectors for each sense using the retrieved context vectors. 
Since the sense vectors produced by the automatic selective sampling method change according to test samples and their context, we call them dynamic sense vectors in this paper (section 2.4). (Note that the sense vectors produced in the training phase do not change according to test samples; thus, we call them static sense vectors.) The similarity between the dynamic sense vectors and the context vector of a test sample, and that between the static sense vectors and the same context vector, are estimated by the cosine measure. The sense with the highest similarity is selected as the relevant word sense.</Paragraph> <Paragraph position="3"> Our proposed method can be summarized as follows. Training Phase: 1) Constructing context vectors using contextual words in the training sense-tagged data.</Paragraph> <Paragraph position="4"> 2) Weighting terms in the context vectors with local density.</Paragraph> <Paragraph position="5"> 3) Creating static sense vectors, which are the centroids of the context vectors.</Paragraph> <Paragraph position="6"> Test Phase: 1) Constructing context vectors using contextual words in the test data.</Paragraph> <Paragraph position="7"> 2) Automatic selective sampling of training vectors for each test case to reduce noise.</Paragraph> <Paragraph position="8"> 3) Creating dynamic sense vectors, which are the centroids of the retrieved training vectors for each sense.</Paragraph> <Paragraph position="9"> 4) Estimating word senses using the static and dynamic sense vectors.</Paragraph> </Section> <Section position="2" start_page="4" end_page="6" type="sub_section"> <SectionTitle> 2.2 Representing Training Samples as a Context Vector with Local Density </SectionTitle> <Paragraph position="0"> In WSD, the context must reflect various contextual characteristics. If the window size of the context is too large, the context cannot contain relevant information consistently (Kilgarriff et al., 2000). 
Words in this context window can be classified into nouns, verbs, and adjectives. The classified words within the context window are assumed to show co-occurring behaviour with the target word; they provide supporting evidence for a certain sense. Contextual words near a target word give more relevant information for deciding its sense than words far from it. The distance from the target word is used for this purpose, and it is calculated under the assumption that the target words in the context window have the same sense (Yarowsky, 1995).</Paragraph> <Paragraph position="1"> Each word in the training samples can be weighted by formula (1). Let W_i(t_jk) represent a weighting function for a term t_k which appears in the j-th training sample for the i-th sense, tf_ijk represent the frequency of the term t_k in the j-th training sample for the i-th sense, and d_ijk represent the distance of t_k from the target word in the j-th training sample for the i-th sense. (POS, collocations, semantic word associations, subcategorization information, semantic roles, selectional preferences and frequency of senses are useful for WSD (Agirre et al., 2001). Since the length of the context window was considered when the SENSEVAL-2 lexical sample data were constructed, we use a training sample itself as the context window.)</Paragraph> <Paragraph position="6"> In formula (1), Z is a normalization factor, which forces all values of W_i(t_jk) to fall between 0 and 1, inclusive (Salton et al., 1983). Formula (1) is a variation of tf-idf: roughly, a term's weight grows with tf_ijk and shrinks with d_ijk, normalized by Z. We regard each training sample as an indexed document that we want to retrieve, and a test sample as a query in an information retrieval system. Because we know the target word in both the training samples and the test samples, we can restrict the search space to the training samples that contain the target word when finding relevant samples. 
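As a rough illustration, the weighting idea behind formula (1) can be sketched in Python. Since the exact formula is not reproduced in this text, the sketch assumes a simple frequency-over-distance form normalized so that weights fall in [0, 1]; the function name and details are hypothetical, not the paper's implementation.

```python
from collections import defaultdict

def local_density_weights(tokens, target_index):
    """Hypothetical sketch of formula (1): weight each contextual word by
    its term frequency divided by its closest distance to the target word,
    then normalize (the role of Z) so all weights fall in [0, 1]."""
    tf = defaultdict(int)
    min_dist = {}
    for pos, tok in enumerate(tokens):
        if pos == target_index:
            continue  # skip the target word itself
        tf[tok] += 1
        d = abs(pos - target_index)
        min_dist[tok] = min(min_dist.get(tok, d), d)
    raw = {t: tf[t] / min_dist[t] for t in tf}
    z = max(raw.values()) if raw else 1.0  # normalization factor Z
    return {t: w / z for t, w in raw.items()}
```

Words nearer the target word receive higher weights, matching the local density intuition described above; the real formula additionally depends on the sense index i.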
We also take into account the distance from the target word.</Paragraph> <Paragraph position="10"> The terms tf_ijk and d_ijk in formula (1) support the local density concept. In this paper, the 'local density' of a target word 'Wt' is defined as the density among the contextual words of 'Wt' in terms of their distance from 'Wt' and their relative frequency. First, the distance factor is an important clue because contextual words surrounding a target word frequently support a certain sense: for example, 'money' in 'money in a bank'.</Paragraph> <Paragraph position="11"> Second, if contextual words frequently co-occur with a target word used in a certain sense, they may be strong evidence for deciding which word sense is correct. Therefore, contextual words that appear more frequently near a target word, and that appear with a certain sense of the target word, have a higher local density.</Paragraph> <Paragraph position="12"> With the local density concept, the context of a training sample can be represented by a vector of contextual words and their weights, such that v_ij = (W_i(t_j1), W_i(t_j2), ..., W_i(t_jn)). (A term t_jk whose weight W_i(t_jk) is much larger than the others is strong evidence for the i-th sense.)</Paragraph> </Section> <Section position="3" start_page="6" end_page="6" type="sub_section"> <SectionTitle> 2.3 Constructing Static Sense Vectors </SectionTitle> <Paragraph position="0"> Now, we can represent each training sample as a context vector of contextual words, where v_ij denotes the context vector of the j-th training sample for the i-th sense.</Paragraph> <Paragraph position="4"> By clustering the context vectors, each sense can be represented as a sense vector. Let N_i represent the number of training samples for the i-th sense and v_ij represent the context vector of the j-th training sample for the i-th sense.</Paragraph> <Paragraph position="8"> The static sense vector for the i-th sense, SV_i, is the centroid of the context vectors of the training samples for the i-th sense, SV_i = (1/N_i) * (v_i1 + v_i2 + ... + v_iN_i) (formula (2)), as shown in figure 2. 
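The centroid computation described for the static sense vectors can be sketched as follows (a minimal illustration; the function name is my own):

```python
import numpy as np

def static_sense_vector(context_vectors):
    """SV_i: the centroid (component-wise mean) of the N_i context
    vectors of the training samples tagged with the i-th sense."""
    return np.mean(np.asarray(context_vectors, dtype=float), axis=0)
```

For example, the centroid of the two context vectors (1, 0, 2) and (3, 0, 0) is (2, 0, 1).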
In figure 2, there are n senses and the context vectors that represent each training sample. We can categorize each context vector according to the sense of its target word. Then, each sense vector is acquired using formula (2). Because these sense vectors do not change according to test samples, we call them static sense vectors in this paper (note that the sense vectors we will describe in section 2.4 change depending on the context of test samples).</Paragraph> </Section> <Section position="4" start_page="6" end_page="6" type="sub_section"> <SectionTitle> 2.4 Automatic Selective Sampling: Dynamic Sense Vectors </SectionTitle> <Paragraph position="0"> It is important to capture effective patterns and features from the training sense-tagged data in WSD. However, noise in the training sense-tagged data makes it difficult to disambiguate word senses effectively. To reduce its negative effects, we use an automatic selective sampling method based on cosine similarity. Figure 3 shows the process of the automatic selective sampling method: the upper side shows the retrieval process, and the lower side shows a graphical representation of dynamic sense vectors.</Paragraph> <Paragraph position="1"> For example, let 'bank' have two senses ('business establishment' and 'artificial embankment'), and suppose there are indexed training samples for the two senses. Then the top-N training samples can be retrieved for a given test sample containing the target word 'bank'. The retrieved training samples can be clustered into dynamic sense vectors according to the sense of their target word. 
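The retrieval step of automatic selective sampling can be sketched as below: rank the indexed training samples by cosine similarity to the test context vector and keep the top-N (function names are hypothetical, not the paper's code):

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity between two vectors."""
    v, w = np.asarray(v, dtype=float), np.asarray(w, dtype=float)
    denom = np.linalg.norm(v) * np.linalg.norm(w)
    return float(v @ w / denom) if denom else 0.0

def top_n_samples(test_vector, training_samples, n):
    """Return the top-N (sense, context_vector) training samples most
    similar to the test context vector."""
    ranked = sorted(training_samples,
                    key=lambda pair: cosine(test_vector, pair[1]),
                    reverse=True)
    return ranked[:n]
```

Grouping the retrieved pairs by sense and taking per-sense centroids of their vectors then yields the dynamic sense vectors.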
Since the sense vectors produced by the automatic selective sampling method change according to the context vector of a test sample, we call them dynamic sense vectors in this paper.</Paragraph> <Paragraph position="3"> Let N'_i represent the number of training samples for the i-th sense in the retrieved top-N, and v_ij represent the context vector of the j-th training sample for the i-th sense in the top-N. The dynamic sense vector for the i-th sense of a target word, DSV_i, is the centroid of the retrieved context vectors of the training samples for the i-th sense, as shown in the lower side of figure 3.</Paragraph> </Section> <Section position="5" start_page="6" end_page="6" type="sub_section"> <SectionTitle> 2.5 Context Vectors of a Test Sample </SectionTitle> <Paragraph position="0"> Contextual words in a test sample are extracted in the same manner as in the training phase. The classified words in the limited window size - nouns, verbs, and adjectives - offer the components of the context vector. When a term t_k appears in the test sample, the value of t_k in the context vector of the test sample is 1; when t_k does not appear in the test sample, the value is 0. For example, let the contextual words of a test sample be 'bank', 'river' and 'water', and the dimensions of the context vector be ('bank', 'commercial', 'money', 'river', 'water'). Then we acquire the context vector CV = (1,0,0,1,1) from the test sample. Henceforth we will denote by CV_i the context vector of the i-th test sample.</Paragraph> </Section> <Section position="6" start_page="6" end_page="6" type="sub_section"> <SectionTitle> 2.6 Estimating a Word Sense: Comparing Similarity </SectionTitle> <Paragraph position="0"> We have described the methods for constructing static sense vectors, dynamic sense vectors and context vectors of a test sample. Next, we describe the method for estimating a word sense using them. 
The similarity in the information retrieval area is a measure of how alike two documents are, or how alike a document and a query are. In a vector space model, this is usually interpreted as how close their corresponding vector representations are to each other. A popular method is to compute the cosine of the angle between the vectors (Salton et al., 1983). Since our method is based on a vector space model, the cosine measure, sim(v, w) = (v . w) / (|v| |w|), where v and w are vectors in the N-dimensional vector space, is used as the similarity measure (formula (4)).</Paragraph> <Paragraph position="1"> By comparing the similarity between SV_i and DSV_i for the i-th sense and the context vector CV_j of the j-th test sample, we can estimate the relevant word sense for the given context vector of the test sample. Formula (5) combines sim(SV_i, CV_j) and sim(DSV_i, CV_j) with a weighting parameter l. Because the value of cosine similarity falls between 0 and 1, the value of Score(s_i) also falls between 0 and 1. A similarity value of 1 means perfect consensus; a similarity value of 0 means there is no agreement at all. Finally, the sense with the maximum similarity by formula (5) is decided as the answer.</Paragraph> </Section> </Section> <Section position="4" start_page="6" end_page="6" type="metho"> <SectionTitle> 3. Experiment </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="6" end_page="6" type="sub_section"> <SectionTitle> 3.1 Experimental Setup </SectionTitle> <Paragraph position="0"> In this paper, we compared six systems as follows.</Paragraph> <Paragraph position="1"> 1) The system that assigns the word sense which appears most frequently in the training samples (Baseline). 2) The system using the Naive Bayesian method (Gale et al., 1992) (A). 3) The systems trained on co-occurrence information directly, with and without local density and automatic selective sampling (B, C, D, E). 
System B, C, D, and E will show the performance of each component of our proposed method. To evaluate performance in the condition 'without local density' (systems B and D), we weight each word with its frequency in the context of the training samples.</Paragraph> <Paragraph position="2"> The test suite used is the English lexical sample data released for SENSEVAL-2 in 2001. This test suite supplies training sense-tagged data and test data for nouns, verbs and adjectives (SENSEVAL-2, 2001).</Paragraph> <Paragraph position="3"> Cross-validation on the training sense-tagged data is used to determine the parameters - l in formula (5) and the top-N in constructing dynamic sense vectors. We divide the training sense-tagged data into ten folds of equal size and choose the parameter values that give the best average result over the ten-fold validation. The values we used are l = 0.2 and N = 50.</Paragraph> <Paragraph position="4"> The results were evaluated by precision rates (Salton et al., 1983). The precision rate is defined as the proportion of correct answers among the generated results.</Paragraph> </Section> <Section position="2" start_page="6" end_page="6" type="sub_section"> <SectionTitle> 3.2 Experimental Results </SectionTitle> <Paragraph position="0"> All systems and the baseline show higher performance on nouns and adjectives than on verbs.</Paragraph> <Paragraph position="1"> This indicates that the disambiguation of verbs is more difficult than that of the other categories in this test suite. In analysing errors, we found that we did not consider information that is important for disambiguating verb senses, such as adverbs that form idioms with the verbs, for example 'carry out', 'pull out' and so on. It is necessary to handle them for more effective WSD.</Paragraph> <Paragraph position="2"> System B, C, D, and E show how effective local density and dynamic vectors are in WSD. 
Performance increased by about 70% with local density (system C) and by about 150% with dynamic vectors (system D), compared with system B, which uses neither local density nor dynamic vectors. This shows that local density is more effective than plain term frequency. It also shows that automatic selective sampling of training samples for each test sample is very important.</Paragraph> <Paragraph position="3"> Combining local density and dynamic vectors (system E), we obtain a performance of about 62%.</Paragraph> <Paragraph position="4"> Our method also shows higher performance than the baseline and system A (the Naive Bayesian method) - about 30% higher than the baseline and about 58% higher than system A.</Paragraph> <Paragraph position="5"> As a result of this experiment, we showed that co-occurrence information, exploited through local density and automatic selective sampling, is more suitable and discriminative for WSD. These techniques lead to a 70% ~ 150% performance improvement in the experiments compared with the system without local density and automatic selective sampling.</Paragraph> </Section> </Section> </Paper>