<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1664">
  <Title>Graph-based Word Clustering using a Web Search Engine</Title>
  <Section position="5" start_page="545" end_page="547" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> This section addresses evaluation. Two sets of word groups are used for the evaluation: one is derived from documents on a web directory; another is from WordNet. We first evaluate the co- null a19 a16 1. Input A set of words is given. The number of words is denoted as n.</Paragraph>
    <Paragraph position="1"> 2. Obtain frequencies Put a query for each pair of words to a search engine, and obtain a co-occurrence matrix. Then calculate the chi-square matrix (alternatively a PMI matrix, or a Jaccard matrix.) 3. Make a graph Set a node for each word, and an edge to a pair of nodes whose kh2 value is above a threshold. The threshold is determined so that the network density (the number of edges divided by nC2) is dthre.</Paragraph>
    <Paragraph position="2"> 4. Apply Newman clustering Initially set each node  as a cluster. Then merge two clusters repeatedly so that Q is maximized. Terminate if Q does not increase anymore, or when a given number of clusters nc is obtained. (Alternatively, apply</Paragraph>
    <Section position="1" start_page="546" end_page="546" type="sub_section">
      <SectionTitle>
4.1 Word Groups from an Open Directory
</SectionTitle>
      <Paragraph position="0"> We collected documents from the Japanese Open Directory (dmoz.org/World/Japanese). The dmoz japanese category contains about 130,000 documents and more than 10,000 classes. We chose 9 categories out of the top 12 categories: art, sports, computer, game, society, family, science, and health. We crawled 1000 documents for each category, i.e., 9000 documents in all.</Paragraph>
      <Paragraph position="1"> For each category, a word group is obtained through the procedure in Fig. 5. We consider that the specific words to a category are relevant to some extent, and that they can therefore be regarded as a word group. Examples are shown in Table 5. In all, 90 word sets are obtained and merged. We call the word set DMOZ-J data.</Paragraph>
      <Paragraph position="2"> Our task is, given 90 words, to cluster the words into the correct nine groups. Here we investigate whether the correct nine words are selected for each word using the co-occurrence measure. We compare pointwise mutual information (PMI), the Jaccard coefficient (Jaccard), and chi-square (kh2). We chose these methods for comparison because PMI performs best in (Terra and Clarke, 2003).</Paragraph>
      <Paragraph position="3"> The Jaccard coefficient is often used in social network mining from the web. Table 7 shows the precision of each method. Experiments are repeated five times. We keep each method that outputs the a19 a16  1. For each category, crawl 1000 documents randomlya null 2. Apply the Japanese morphological analysis system ChaSen (Matsumoto et al., 2000) to the documents. Calculate the score of each word w in category c similarly to TF-IDF:</Paragraph>
      <Paragraph position="5"> where fc denotes the document frequency of word w in category c, Nall denotes the number of all documents, and fall(w) denotes the frequency of word w in all documents.</Paragraph>
      <Paragraph position="6"> 3. For each category, the top 10 words are selected as the word group.</Paragraph>
      <Paragraph position="7"> aWe first get all urls, sort them, and select a sample randomly.</Paragraph>
      <Paragraph position="9"> highest nine words for each word, groups of ten words. Therefore, recall is the same as the precision. From the table, the chi-square performs best.</Paragraph>
      <Paragraph position="10"> PMI is slightly better than the Jaccard coefficient.</Paragraph>
    </Section>
    <Section position="2" start_page="546" end_page="546" type="sub_section">
      <SectionTitle>
4.2 Word Groups from WordNet
</SectionTitle>
      <Paragraph position="0"> Next, we make a comparison using WordNet 7. By extracting 10 words that have the same hypernym (i.e. coordinates), we produce a word group. Examples are shown in Table 6. Nine word groups are merged into one, as with DMOZ-J. The experiments are repeated 10 times. Table 8 shows the result. Again, the chi-square performs best among the methods that were compared.</Paragraph>
      <Paragraph position="1"> Detailed analyses of the results revealed that word groups such as bacteria and diseases are clustered correctly. However, word groups such as computers (in which homepage, server and client are included) are not well clustered: these words tend to be polysemic, which causes difficulty.</Paragraph>
    </Section>
    <Section position="3" start_page="546" end_page="547" type="sub_section">
      <SectionTitle>
4.3 Evaluation of Clustering
</SectionTitle>
      <Paragraph position="0"> We compare two clustering methods: Newman clustering and average-link agglomerative cluster- null ing, which is often used in word clustering. A word co-occurrence graph is created using PMI, Jaccard, and chi-square measures. The threshold is determined so that the network density dthre is 0.3. Then, we apply clustering to obtain nine clusters; nc = 9. Finally, we compare the resultant clusters with the correct categories. Clustering results for DMOZ-J sets are shown in Table 9. Newman clustering produces higher precision and recall. Especially, the combination of chi-square and Newman is the best in our experiments. null</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="547" end_page="548" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> In this paper, the scope of co-occurrence is document-wide. One reason is that major commercial search engines do not support a type of query w1 NEAR w2. Another reason is in (Terra and Clarke, 2003) document-wide co-occurrences perform comparable to other Windows-based cooccurrences. null Many types of co-occurrence exist other than noun-noun. We limit our scope to noun-noun co-occurrences in this paper. Other types of co-occurrence such as verb-noun can be investigated in future studies. Also, co-occurrence for the second-order similarity can be sought. Because web documents are sometimes difficult to analyze, we keep our algorithm as simple as possible. Analyzing semantic relations and applying distributional clustering is another goal for future work. A salient weak point of our algorithm is the number of necessary queries allowed to a search engine. For obtaining a graph of n words, O(n2) queries are required, which discourages us from undertaking large experiments. However some devices are possible: if we analyze the texts of the top retrieved pages by query w, we can guess what words are likely to co-occur with w. This preprocessing seems promising at least in social network extraction: we can eliminate 85% of queries in the 500 nodes case while retaining more than 90% precision (Asada et al., 2005).</Paragraph>
    <Paragraph position="1"> In our evaluation, the chi-square measure performed well. One reason is that the PMI performs worse when a word group contains rare or frequent words, as is generally known for mutual information measure (Manning and Sch&amp;quot;utze, 2002). Another reason is that if we put one word and two words to a search engine, the result might be inconsistent. In an extreme case, the web count of w1 is below the web count of w1ANDw2. This  phenomenon depends on how a search engine processes AND operator, and results in unstable values for the PMI. On the other hand, our method by the chi-square uses a co-occurrence matrix as a contingency table. For that reason, it suffers less from the problem. Other statistical measures such as the likelihood ratio are also applicable.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML