<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1664"> <Title>Graph-based Word Clustering using a Web Search Engine</Title>
<Section position="4" start_page="542" end_page="545" type="intro"> <SectionTitle> 2 Related Works </SectionTitle>
<Paragraph position="0"> A number of studies have explored the use of the web for NLP tasks, e.g., creating multilingual translation lexicons (Cheng et al., 2004), text classification (Huang et al., 2004), and word sense disambiguation (Turney, 2004).</Paragraph>
<Paragraph position="1"> Baroni and Ueyama (2005) summarize three approaches to using the web as a corpus: using web counts as frequency estimates, building corpora through search engine queries, and crawling the web for linguistic purposes. Commercial search engines are optimized for ordinary users.</Paragraph>
<Paragraph position="2"> Therefore, it is desirable to crawl the web and to develop specific search engines for NLP applications (Cafarella and Etzioni, 2005). However, considering the great effort that commercial search engines invest in maintaining the quality of crawling and indexing, especially against spammers, it is still important to pursue the possibility of using current search engines for NLP applications.</Paragraph>
<Paragraph position="3"> Turney (2001) presents an unsupervised learning algorithm for recognizing synonyms by querying a web search engine. The task of recognizing synonyms is, given a target word and a set of alternative words, to choose the word that is most similar in meaning to the target word. The algorithm uses pointwise mutual information (PMI-IR) to measure the similarity of pairs of words. It is evaluated on 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 from the English as a Second Language test (ESL), and obtains a score of 74%, compared with 64% for Latent Semantic Analysis (LSA). Terra and Clarke (2003) provide a comparative investigation of co-occurrence frequency estimation for the performance of synonym tests. They report that PMI (with a certain window size) performs best on average. PMI-IR is also useful for calculating semantic orientation and rating reviews (Turney, 2002).</Paragraph>
<Paragraph position="4"> As described, PMI is one of many measures used to calculate the strength of word similarity or word association (Manning and Schütze, 2002). An important assumption is that similarity between words is a consequence of word co-occurrence, or that the proximity of words in text is indicative of a relationship between them, such as synonymy or antonymy. A commonly used technique to obtain word groups is distributional clustering (Baker and McCallum, 1998). Distributional clustering of words was first proposed by Pereira, Tishby, and Lee (Pereira et al., 1993), who cluster nouns according to their conditional verb distributions.</Paragraph>
<Paragraph position="5"> Graph-based representations of word similarity have also been advanced by several researchers.</Paragraph>
<Paragraph position="6"> Kageura et al. (2000) propose automatic thesaurus generation based on a graph representation. By applying a minimum edge cut, corresponding English terms and Japanese terms are identified as a cluster.
Widdows and Dorow (2002) use a graph model for unsupervised lexical acquisition.</Paragraph>
<Paragraph position="7"> A graph is produced by linking pairs of words which participate in particular syntactic relationships. An incremental cluster-building algorithm achieves 82% accuracy on a lexical acquisition task, evaluated against WordNet classes. Another study builds a co-occurrence graph of terms and decomposes it to identify relevant terms by duplicating nodes and edges (Tanaka-Ishii and Iwasaki, 1996). It focuses on transitivity: if transitivity does not hold between three nodes (e.g., if edges a-b and b-c exist but edge a-c does not), the nodes should be in separate clusters.</Paragraph>
<Paragraph position="8"> A network of words (or named entities) on the web has also been investigated in the context of the Semantic Web (Cimiano et al., 2004; Bekkerman and McCallum, 2005). In particular, social networks of persons have been mined from the web using a search engine (Kautz et al., 1997; Mika, 2005; Matsuo et al., 2006). In these studies, the Jaccard coefficient is often used to measure the co-occurrence of entities. We compare with Jaccard coefficients in our evaluations. In the research field of complex networks, the structures of various networks have been investigated in detail. For example, Motter (2002) targeted a conceptual network built from a thesaurus and demonstrated its small-world structure. Recently, numerous works have identified communities (or densely connected subgraphs) in large networks (Newman, 2004; Girvan and Newman, 2002; Palla et al., 2005), as explained in the next section.</Paragraph>
<Paragraph position="9"> 3 Word Clustering using Web Counts</Paragraph>
<Section position="1" start_page="543" end_page="544" type="sub_section"> <SectionTitle> 3.1 Co-occurrence by a Search Engine </SectionTitle>
<Paragraph position="0"> A typical word clustering task is described as follows: given a set of words (nouns), cluster the words into groups so that similar words are in the same cluster. Let us take an example. Assume we are given a set of seven Japanese words whose English glosses are printer, print, InterLaser, ink, TV, Aquos, and Sharp. Apparently, the first four words are related to a printer, and the last three are related to a TV. In this case, we would like to obtain two word groups: the first four words and the last three. We query a search engine to obtain word counts. Table 1 shows the web count for each word.</Paragraph>
<Paragraph position="1"> Table 2 shows the web counts for pairs of words.</Paragraph>
<Paragraph position="2"> For example, we submit the query printer AND InterLaser to a search engine and obtain 179 documents. Thereby, nC2 (= n(n-1)/2) queries are necessary to obtain the matrix if we have n words. We call Table 2 a co-occurrence matrix.</Paragraph>
<Paragraph position="3"> We can calculate the pointwise mutual information as PMI(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}.</Paragraph>
<Paragraph position="5"> The probability p(w_1) is estimated by f_{w1}/N, where f_{w1} represents the web count of w1 and N represents the number of documents on the web. The probability of co-occurrence p(w_1, w_2) is estimated by f_{w1,w2}/N, where f_{w1,w2} represents the web count of w1 AND w2.</Paragraph>
<Paragraph position="6"> The PMI values are shown in Table 3. We set N = 10^10 according to the number of pages indexed by Google. Some values are inconsistent with our intuition: Aquos is inferred to have high PMI not only with TV and Sharp but also with printer, and none of the words has high PMI with TV. This is because the word counts span a broad range.
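To make the PMI computation above concrete, here is a minimal Python sketch; the single-word counts, pair counts, and N = 10^10 below are illustrative placeholders, not the actual figures of Tables 1-3.

import math

N = 1e10  # assumed number of documents on the web (10^10)

# Hypothetical single-word web counts f_w (placeholders, not the values of Table 1)
count = {"printer": 1.5e7, "TV": 6.0e7, "Sharp": 2.0e7, "Aquos": 3.0e5}

# Hypothetical pair counts f_{w1,w2} (placeholders, not the values of Table 2)
pair_count = {("Aquos", "TV"): 1.2e5, ("Aquos", "Sharp"): 1.0e5, ("Aquos", "printer"): 2.0e4}

def pmi(w1, w2):
    # PMI(w1, w2) = log[ p(w1, w2) / (p(w1) p(w2)) ], probabilities estimated from web counts
    p1 = count[w1] / N
    p2 = count[w2] / N
    p12 = pair_count[(w1, w2)] / N
    return math.log(p12 / (p1 * p2))

for pair in pair_count:
    print(pair, round(pmi(*pair), 2))

Because the single-word probabilities sit in the denominator, a rare word such as Aquos can obtain a fairly large PMI with almost any word it co-occurs with, which is the weakness discussed next.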
Generally, mutual information tends to provide a large value if either word is much rarer than the other. Various statistical measures based on co-occurrence analysis have been proposed for estimating term association: the Dice coefficient, the Jaccard coefficient, the chi-square test, and the log-likelihood ratio (Manning and Schütze, 2002). In our algorithm, we use the chi-square (\chi^2) value instead of PMI. The chi-square value is calculated as follows: we denote the number of pages containing both w1 and w2 as a, and define b = f_{w1} - a, c = f_{w2} - a, and d = N - (a + b + c).</Paragraph>
<Paragraph position="8"> Thereby, the expected frequency of (w1, w2) is (a+c)(a+b)/N. Eventually, chi-square is calculated as follows (Manning and Schütze, 2002): \chi^2 = \frac{N(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}.</Paragraph>
<Paragraph position="10"> However, N is a huge number on the web and it is sometimes difficult to know exactly. Therefore we regard the co-occurrence matrix itself as a contingency table and set N_W = \sum_{w, w' \in W} f_{w, w'}, where W represents a given set of words. Then chi-square (within the word list W) is defined as \chi^2_W(w1, w2) = \frac{(f_{w1,w2} - E_{w1,w2})^2}{E_{w1,w2}}, where E_{w1,w2} = \frac{(\sum_{w' \in W} f_{w1, w'})(\sum_{w' \in W} f_{w', w2})}{N_W} is the expected co-occurrence frequency within W.</Paragraph>
<Paragraph position="14"> We should note that \chi^2_W depends on the word set W: it calculates the relative strength of co-occurrences. Table 4 shows the \chi^2_W values. Aquos has high values only with TV and Sharp, as expected.</Paragraph> </Section>
<Section position="2" start_page="544" end_page="545" type="sub_section"> <SectionTitle> 3.2 Clustering on Co-occurrence Graph </SectionTitle>
<Paragraph position="0"> Recently, a series of effective graph clustering methods has been advanced. Pioneering work that specifically emphasizes edge betweenness was done by Girvan and Newman (2002); we call this method the GN algorithm. The betweenness of an edge is the number of shortest paths between pairs of nodes that run along it. Figure 1 (i) shows two &quot;communities&quot; (in Girvan's terms), {a,b,c} and {d,e,f,g}, which are connected by edge c-d.</Paragraph>
<Paragraph position="1"> Edge c-d has high betweenness because numerous shortest paths (e.g., from a to d, from b to e, ...) traverse it. The graph is likely to be separated into densely connected subgraphs if we cut the high-betweenness edge.</Paragraph>
<Paragraph position="2"> The GN algorithm is different from the minimum edge cut. For (i), the results are identical: by cutting edge c-d, which is a minimum edge cut, we obtain two clusters. However, in case (ii), there are two candidates for the minimum edge cut, whereas the highest-betweenness edge is still edge c-d alone. Girvan and Newman (2002) show that this clustering works well on various networks, from biological to social networks. Numerous studies have been inspired by that work. One prominent effort is a faster variant of the GN algorithm (Newman, 2004), which we call Newman clustering in this paper. In Newman clustering, instead of explicitly calculating high-betweenness edges (which is computationally demanding), an objective function is defined as Q = \sum_i (e_{ii} - a_i^2), where a_i = \sum_j e_{ij}.</Paragraph>
<Paragraph position="4"> We assume that we have separate clusters, and that e_{ij} is the fraction of edges in the network that connect nodes in cluster i to those in cluster j.</Paragraph>
<Paragraph position="5"> The term e_{ii} denotes the fraction of edges within the clusters. The term a_i^2 = (\sum_j e_{ij})^2 represents the expected fraction of edges within the cluster. If a particular division gives no more within-community edges than would be expected by random chance, then we would obtain Q = 0.
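As a minimal sketch of how Q is evaluated for a fixed partition, the following Python fragment uses a toy graph and cluster assignment that are invented for illustration, roughly mirroring Figure 1 (i); the handling of between-cluster edges follows one common convention.

# Toy undirected graph and candidate partition (invented for illustration)
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("f", "g"), ("d", "g"), ("e", "g")]
cluster = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1, "g": 1}

def modularity(edges, cluster):
    # Q = sum_i (e_ii - a_i^2), where e_ij is the fraction of edges between clusters i and j
    m = len(edges)
    ids = set(cluster.values())
    e = {(i, j): 0.0 for i in ids for j in ids}
    for u, v in edges:
        i, j = cluster[u], cluster[v]
        if i == j:
            e[(i, i)] += 1.0 / m
        else:
            # split a between-cluster edge over the two symmetric entries
            e[(i, j)] += 0.5 / m
            e[(j, i)] += 0.5 / m
    return sum(e[(i, i)] - sum(e[(i, j)] for j in ids) ** 2 for i in ids)

print(round(modularity(edges, cluster), 3))

For this toy partition, Q comes out at roughly 0.36.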
In practice, values of Q greater than about 0.3 appear to indicate significant group structure (Newman, 2004).</Paragraph>
<Paragraph position="6"> Newman clustering is agglomerative (although we can intuitively understand that a graph without high-betweenness edges is ultimately obtained). We repeatedly join clusters together in pairs, choosing at each step the join that provides the greatest increase in Q. Currently, Newman clustering is one of the most efficient methods for graph-based clustering.</Paragraph>
<Paragraph position="7"> An illustration of our algorithm is shown in Fig. 2. First, we obtain web counts for a given set of words using a search engine. Then the PMI or chi-square values are calculated. If the value is above a certain threshold, we put an edge between the two nodes. We then apply graph clustering and finally identify groups of words. The illustration shows that the chi-square measure yields the correct clusters.</Paragraph>
<Paragraph position="8"> The algorithm is described in Fig. 4. The parameters are few: a threshold d_thre for building the graph and, optionally, the number of clusters n_c. This enables easy implementation of the algorithm. Figure 3 shows a small network of 88 Japanese words obtained through 3828 search queries. We can see that some parts of the graph are densely connected.</Paragraph> </Section> </Section> </Paper>
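To tie Sections 3.1 and 3.2 together, the following end-to-end Python sketch assumes the pairwise co-occurrence counts have already been collected from a search engine. The words, counts, and threshold d_thre are illustrative placeholders, the \chi^2_W normalization follows the reading given above, the restriction to positively associated pairs is our own simplification, and networkx's greedy_modularity_communities (the Clauset-Newman-Moore greedy modularity method) stands in for the paper's Newman clustering implementation.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Illustrative pairwise web counts f_{w,w'} (placeholders, not the values of Table 2)
words = ["printer", "print", "ink", "TV", "Aquos", "Sharp"]
cooc = {("printer", "print"): 900, ("printer", "ink"): 700, ("print", "ink"): 500,
        ("TV", "Aquos"): 400, ("TV", "Sharp"): 350, ("Aquos", "Sharp"): 300,
        ("printer", "TV"): 40, ("ink", "Sharp"): 20}

def f(w1, w2):
    # symmetric lookup; unlisted pairs are treated as zero co-occurrence
    return cooc.get((w1, w2), cooc.get((w2, w1), 0))

# Row sums and the total N_W are taken over the word list W only (one reading of the paper's N_W)
n = {w: sum(f(w, v) for v in words if v != w) for w in words}
N_W = sum(n.values())

def chi2_W(w1, w2):
    # chi-square of the observed count against its expectation within W
    expected = n[w1] * n[w2] / N_W
    return (f(w1, w2) - expected) ** 2 / expected if expected > 0 else 0.0

# Build the co-occurrence graph: keep positively associated pairs above a threshold
d_thre = 1.0  # assumed illustrative threshold
G = nx.Graph()
G.add_nodes_from(words)
for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        expected = n[w1] * n[w2] / N_W
        if f(w1, w2) > expected and chi2_W(w1, w2) > d_thre:
            G.add_edge(w1, w2)

# Greedy modularity maximization as a fast Newman-style clustering
clusters = greedy_modularity_communities(G)
print([sorted(c) for c in clusters])

With these toy counts, only the positively associated pairs survive the threshold, and the clustering recovers the printer-related and TV-related groups.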