<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0101">
  <Title>Improving Context Vector Models by Feature Clustering for Automatic Thesaurus Construction</Title>
  <Section position="7" start_page="3" end_page="4" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> The context vectors were derived from a 10 year news corpus from The Central News Agency. It contains nearly 33 million sentences, 234 million word tokens, and we extracted 186 million syntactic relations from this corpus. Due to the low reliability of infrequent data, only the relation triples (w, r, c), which occurs more than 3 times and POS of w and c must be noun or verb, are used. It results that nearly 30,000 high frequent nouns and verbs are used as the contextual features. And with feature clustering2, the contextual dimensions were reduced from 30,988 literal words to 12,032 semantic classes.</Paragraph>
    <Paragraph position="1"> In selecting testing data, we consider the words that occur more than 200 times as high frequent words and the frequencies range from 40 to 200 as low frequent words.</Paragraph>
    <Paragraph position="2"> Discrimination For the discrimination experiments, we randomly extract high frequent word pairs which include 500 synonym pairs and 500 unrelated word pairs from Cilin (Mei et. al, 1983). At the mean time, we also prepare equivalent low frequency data.</Paragraph>
    <Paragraph position="3"> We use a mathematical technique Singular Value Decomposition (SVD) to derive principal components and to implement LSI models with respect to different feature dimensions from 100 to 1000. We compare the performances of different models. The results are shown in the following figures.</Paragraph>
    <Paragraph position="4"> Figure1. Discrimination for high frequent words The result shows that for the high frequent data, although the feature clustering method did not achieve the best performance, it performances better at related data and a balanced performance at unrelated data. The tradeoffs be- null tween related recalls and unrelated recalls are clearly shown. Another observation is that no matter of using LSI or literal word features (tf or weight_tf), the performances are comparable.</Paragraph>
    <Paragraph position="5"> Therefore, we could simply use any method to handle the high frequent words.</Paragraph>
    <Paragraph position="6"> Figure2 Discrimination for low frequent word For the infrequent words experiments, neither LSI nor weighted-tf performs well due to insufficient contextual information. But by introducing feature clustering method, one can gain more 6% accuracy for the related data. It shows feature clustering method could help gather more information for the infrequent words.</Paragraph>
    <Paragraph position="7"> Nonlinear interpolated precision For the Nap evaluation, we prepared two testing data from Cilin and Hownet. In the high frequent words experiments, we extract 1311 words within 352 synonyms sets from Cilin and 2981 words within 570 synonyms sets from Hownet.</Paragraph>
    <Paragraph position="8"> Figure 3. Nap performance for high frequent words In high frequent experiments, the results show that the models retaining literal form perform better than dimension reduction methods. It means in the task of measuring similarity of high frequent words using literal contextual feature vectors is more precise than using dimension reduction feature vectors.</Paragraph>
    <Paragraph position="9"> In the infrequent words experiments, we can only extract 202 words distributed in 62 synonyms sets from Cilin and 1089 words within 222 synonyms sets. Due to fewer testing words, LSI was not applied in this experiment.</Paragraph>
    <Paragraph position="10"> Figure 4. Nap performance for low frequent words It shows with insufficient contextual information, the feature clustering method could not help in recalling synonyms because of dimensional reduction.</Paragraph>
  </Section>
class="xml-element"></Paper>