<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0101">
  <Title>Improving Context Vector Models by Feature Clustering for Automatic Thesaurus Construction</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Conventional approaches for automatic thesaurus construction
</SectionTitle>
    <Paragraph position="0"> matic thesaurus construction The conventional approaches for automatic thesaurus construction include three steps: (1) Acquire contextual behaviors of words from corpora. (2) Calculate the similarity between words. (3) Finding similar words and then organizing into a thesaurus structure.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Acquire word sense knowledge
</SectionTitle>
      <Paragraph position="0"> One can model word meanings by their co-occurrence context. The common ways to extract co-occurrence contextual words include simple window based and syntactic dependent based (You, 2004). Obviously, syntactic dependent relations carry more accurate information than window based. Also, it can bring additional information, such as POS (part of speech) and semantic roles etc. To extract the syntactic de-</Paragraph>
      <Paragraph position="2"> pended relation, a raw text has to be segmented, POS tagged, and parsed. Then the relation extractor identifies the head-modifier relations and/or head-argument relations. Each relation could be defined as a triple (w, r, c), where w is the thesaurus term, c is the co-occurred context word and r is the relation between w and c.</Paragraph>
      <Paragraph position="3"> Then context vector of a word is represented differently by different models, such as: tf, weight-tf, Latent Semantic Indexing (LSI) (Deerwester, S.,et al., 1990) and Probabilistic LSI (Hofmann, 1999). The context vectors of word x can be express by: a) tf model: word x = }tf...,,2tf,1{tf xnxx ,where xitf is the term frequency of the ith context word when given word x.</Paragraph>
      <Paragraph position="4"> b) weight-tf model: assume there are n contextual words and m target words. word x= ,where weighti, we used here, is defined as [logm-entropy(wordi)]/logm )(wordikp is the co-occurrence probability of wordk when given wordi.</Paragraph>
      <Paragraph position="5"> c) LSI or PLSI models: using tf or weighted-tf co-occurrence matrix and by adopting LSI or PLSI to reduce the dimension of the matrix.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Similarity between words
</SectionTitle>
      <Paragraph position="0"> The common similarity functions include a) Adopting simple frequency feature, such as cosine, which computes the angle between two context vectors; b) Represent words by the probabilistic distribution among contexts, such as Kull-Leiber divergence (Cover and Thomas, 1991).</Paragraph>
      <Paragraph position="1"> The first step is to convert the co-occurrence matrix into a probabilistic matrix by simple formula. null</Paragraph>
      <Paragraph position="3"> Then calculate the distance between probabilistic vectors by sums up the all probabilistic difference among each context word so called cross entropy.</Paragraph>
      <Paragraph position="4"> Due to the original KL distance is asymmetric and is not defined when zero frequency occurs.</Paragraph>
      <Paragraph position="5"> Some enhanced KL models were developed to prevent these problems such as Jensen-Shannon (Jianhua, 1991), which introducing a probabilistic variable m, or a -Skew Divergence (Lee, 1999), by adopting adjustable variable a. Research shows that Skew Divergence achieves better performance than other measures. (Lee,</Paragraph>
      <Paragraph position="7"> To convert distance to similarity value, we adopt the formula inspired by Mochihashi, and Matsumoto 2002.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Organize similar words into thesaurus
</SectionTitle>
      <Paragraph position="0"> There are several clustering methods can be used to cluster similar words. For example, by selecting N target words as the entries of a thesaurus, then extract top-n similar words for each entry; adopting HAC(Hierarchical agglomerative clustering, E.M. Voorhees,1986) method to cluster the most similar word pairs in each clustering loop. Eventually, these similar words will be formed into synonyms sets.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="2" type="metho">
    <SectionTitle>
3 Difficulties and Solutions
</SectionTitle>
    <Paragraph position="0"> There are two difficulties of using context vector models. One is the enormous dimensions of con-</Paragraph>
    <Paragraph position="2"> textual words, and the other is data sparseness problem. Conventionally LSI or PLSI methods are used to reduce feature dimensions by mapping literal words into latent semantic classes.</Paragraph>
    <Paragraph position="3"> The researches show that it's a promising method (April Kontostathis, 2003). However the latent semantic classes also smooth the information content of feature vectors. Here we proposed a different approach to cope with the feature reduction and data sparseness problems.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Feature Clustering
</SectionTitle>
      <Paragraph position="0"> Reduced feature dimensions and data sparseness cause the problem of inaccurate contextual information. In general, one has to reduce the feature dimensions for computational feasibility and also to extend the contextual word information to overcome the problem of insufficient context information.</Paragraph>
      <Paragraph position="1"> In our experiments, we took the clustered-feature approaches instead of LSI to cope with these two problems and showed better performances. The idea of clustered-feature approaches is by adopting the classes of clustering result of the frequent words as the new set of features which has less feature dimensions and context words are naturally extend to their class members. We followed the steps described in section 2 to develop the synonyms sets. First, the syntactic dependent relations were extracted to create the context vectors for each word. We adopted the skew divergence as the similarity function, which is reported to be the suitable similarity function (Masato, 2005), to measure the distance between words.</Paragraph>
      <Paragraph position="2"> We used HAC algorithm to develop the synonyms classes, which is a greedy method, simply to cluster the most similar word pairs at each clustering iteration.</Paragraph>
      <Paragraph position="3"> The HAC clustering process: While the similarity of the most similar word pair (wordx, wordy) is greater than a threshold e then cluster wordx, wordy together and replace it with the centroid between wordx and wordy Recalculate the similarity between other words and the centroid</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Clustered-Feature Vectors
</SectionTitle>
      <Paragraph position="0"> We obtain the synonyms sets S from above HAC method. Let the extracted synonyms sets S = { S1, S2,...SR} which contains R synonym classes;</Paragraph>
      <Paragraph position="2"> jS stands for the jth element of the ith synonym class; the ith synonym class Si contains Qi elements. null The feature extension processing transforms the coordination from literal words to synonyms sets. Assume there are N contextual words {C1,C2,...CN}, and the first step is to transform the context vector of of Ci to the distribution vector among S. Then the new feature vector is the summation of the distribution vectors among S of its all contextual words.</Paragraph>
      <Paragraph position="3"> The new feature vector of wordj = [?]= xNi 1 jitf Distribution_Vector_among_S( iC ) ,where jitf is the term frequency of the context word Ci occurs with wordj.</Paragraph>
      <Paragraph position="4">  Due to the transformed coordination no longer stands for either frequency or probability, we use simple cosine function to measure the similarity between these transformed clustered-feature vectors. null</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="2" end_page="3" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> To evaluate the performance of the feature clustering method, we had prepared two sets of testing data with high and low frequency words respectively. We want to see the effects of feature reduction and feature extension for both frequent and infrequent words.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.1 Discrimination Rates
</SectionTitle>
      <Paragraph position="0"> The discrimination rate is used to examine the capability of distinguishing the correlation between words. Given a word pair (wordi,wordj), one has to decide whether the word pair is similar or not. Therefore, we will arrange two different word pair sets, related and unrelated, to estimate the discrimination. By given the formula below ,where Na and Nb are respectively the numbers of synonym word pairs and unrelated word pairs.</Paragraph>
      <Paragraph position="1"> As well as, na and nb are the numbers of correct labeled pairs in synonyms and unrelated words.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.2 Nonlinear interpolated precision
</SectionTitle>
      <Paragraph position="0"> The Nap evaluation is used to measure the performance of restoring words to taxonomy, a similar task of restoring words in WordNet (Dominic Widdows, 2003).</Paragraph>
      <Paragraph position="1"> The way we adopted Nap evaluation is to reconstruct a partial Chinese synonym set, and measure the structure resemblance between original synonyms and the reconstructed one. By doing so, one has to prepare certain number of synonyms sets from Chinese taxonomy, and try to reclassify these words.</Paragraph>
      <Paragraph position="2"> Assume there are n testing words distributed in R synonyms sets. Let i1R stands for the represented word of the ith synonyms set. Then we will compute the similarity ranking between each represented word and the rest n-1 testing words.</Paragraph>
      <Paragraph position="4"> The NAP value means how many percent synonyms can be identified. The maximum value of NAP is 1, means the extracted similar words are exactly match to the synonyms.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>