<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1046"> <Title>Scaling Distributional Similarity to Large Corpora</Title> <Section position="5" start_page="361" end_page="361" type="metho"> <SectionTitle> 3 Dimensionality Reduction </SectionTitle> <Paragraph position="0"> Using a cut-off to remove low frequency terms can significantly reduce the value of n. Unfortunately, reducing m by eliminating low frequency contexts has a significant impact on the quality of the results. There are many techniques to reduce dimensionality while avoiding this problem. The simplest methods use feature selection techniques, such as information gain, to remove the attributes that are less informative. Other techniques smooth the data while reducing dimensionality.</Paragraph> <Paragraph position="1"> Latent Semantic Analysis (LSA, Landauer and Dumais, 1997) is a smoothing and dimensionality reduction technique based on the intuition that the true dimensionality of data is latent in the surface dimensionality. Landauer and Dumais admit that, from a pragmatic perspective, the same effect as LSA can be generated by using large volumes of data with very long attribute vectors. Experiments with LSA typically use attribute vectors of a dimensionality of around 1000. Our experiments have a dimensionality of 500,000 to 1,500,000.</Paragraph> <Paragraph position="2"> Decompositions on data this size are computationally difficult. Dimensionality reduction is often applied before LSA to improve its scalability.</Paragraph> <Section position="1" start_page="361" end_page="361" type="sub_section"> <SectionTitle> 3.1 Heuristics </SectionTitle> <Paragraph position="0"> Another technique is to use an initial heuristic comparison to reduce the number of full O(m) vector comparisons that are performed. If the heuristic comparison is sufficiently fast and a sufficient number of full comparisons are avoided, the cost of the additional check is easily absorbed by the savings made.</Paragraph> <Paragraph position="1"> Curran and Moens (2002) introduce a vector of canonical attributes (of bounded length k ≪ m), selected from the full vector, to represent the term. These attributes are the most strongly weighted verb attributes, chosen because they constrain the semantics of the term more and partake in fewer idiomatic collocations. If a pair of terms share at least one canonical attribute then a full similarity comparison is performed; otherwise the terms are not compared. They show an 89% reduction in search time, with only a 3.9% loss in accuracy.</Paragraph> <Paragraph position="2"> There is a significant improvement in the computational complexity. If a maximum of p positive results are returned, our complexity becomes O(n²k + npm). When p ≪ n, the system will be much faster, as far fewer full comparisons are made, but at the cost of accuracy, as some possibly near results are discarded out of hand.</Paragraph> </Section> </Section>
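To make the canonical-attribute heuristic concrete, the following is a minimal Python sketch, not the authors' implementation: terms are sparse attribute-to-weight dictionaries, the hypothetical helper canonical(weights, k) takes the k most strongly weighted attributes (the restriction to verb attributes described above is omitted for brevity), and the full O(m) cosine comparison is only performed when two terms share at least one canonical attribute.

import heapq
import math

def canonical(weights, k):
    # Hypothetical helper: the k most strongly weighted attributes of a term.
    return set(heapq.nlargest(k, weights, key=weights.get))

def cosine(u, v):
    # Full O(m) comparison over sparse attribute -> weight dictionaries.
    dot = sum(u[a] * v[a] for a in set(u) & set(v))
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def heuristic_neighbours(terms, k=30):
    # Compare two terms in full only if their canonical vectors overlap.
    canon = {t: canonical(w, k) for t, w in terms.items()}
    names = list(terms)
    results = {}
    for i, t in enumerate(names):
        for u in names[i + 1:]:
            if canon[t] & canon[u]:                            # cheap O(k) check
                results[(t, u)] = cosine(terms[t], terms[u])   # full comparison
    return results

# Toy example: "dog" and "cat" share a canonical attribute, "idea" shares none.
terms = {
    "dog": {"eat:subj": 2.1, "bark:subj": 3.0, "walk:obj": 1.2},
    "cat": {"eat:subj": 1.9, "purr:subj": 2.5},
    "idea": {"occur:subj": 1.4, "have:obj": 2.2},
}
print(heuristic_neighbours(terms, k=2))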
<Section position="6" start_page="361" end_page="362" type="metho"> <SectionTitle> 4 Randomised Techniques </SectionTitle> <Paragraph position="0"> Conventional dimensionality reduction techniques can be computationally expensive: a more scalable solution is required to handle the volumes of data we propose to use. Randomised techniques provide a possible solution to this.</Paragraph> <Paragraph position="1"> We present two techniques that have been used recently for distributional similarity: Random Indexing (Kanerva et al., 2000) and Locality Sensitive Hashing (LSH, Broder, 1997).</Paragraph> <Section position="1" start_page="362" end_page="362" type="sub_section"> <SectionTitle> 4.1 Random Indexing </SectionTitle> <Paragraph position="0"> Random Indexing (RI) is a hashing technique based on Sparse Distributed Memory (Kanerva, 1993). Karlgren and Sahlgren (2001) showed that RI produces results similar to LSA on the Test of English as a Foreign Language (TOEFL) evaluation. Sahlgren and Karlgren (2005) showed the technique to be successful in generating bilingual lexicons from parallel corpora.</Paragraph> <Paragraph position="1"> In RI, we first allocate a d-length index vector to each unique attribute. The vectors consist of a large number of 0s and a small number (ε) of randomly distributed ±1s. Context vectors, identifying terms, are generated by summing the index vectors of the attributes for each non-unique context in which a term appears. The context vector for a term t appearing in contexts c1 = [1,0,0,−1] and c2 = [0,1,0,−1] would be [1,1,0,−2]. The distance between these context vectors is then measured using the cosine measure: cos(θ(u,v)) = (u · v) / (|u| |v|) (3). This technique allows for incremental sampling, where the index vector for an attribute is only generated when the attribute is first encountered. Construction complexity is O(nmd) and search complexity is O(n²d).</Paragraph> </Section>
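As an illustration of the construction described above, here is a minimal NumPy sketch of Random Indexing, an illustrative reading rather than the authors' code: each attribute lazily receives a sparse d-length index vector containing ε randomly placed ±1s the first time it is encountered, and a term's context vector is the sum of the index vectors of its contexts. The values d = 1000 and ε = 5 follow the settings reported in Section 7.

import numpy as np

rng = np.random.default_rng(0)

class RandomIndexer:
    def __init__(self, d=1000, eps=5):
        self.d, self.eps = d, eps
        self.index = {}          # attribute -> index vector, built incrementally

    def _index_vector(self, attribute):
        # Incremental sampling: the sparse +-1 index vector is only generated
        # the first time the attribute is encountered.
        if attribute not in self.index:
            v = np.zeros(self.d)
            positions = rng.choice(self.d, size=self.eps, replace=False)
            v[positions] = rng.choice([-1, 1], size=self.eps)
            self.index[attribute] = v
        return self.index[attribute]

    def context_vector(self, contexts):
        # Sum the index vectors of every (non-unique) context of a term.
        v = np.zeros(self.d)
        for c in contexts:
            v += self._index_vector(c)
        return v

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

ri = RandomIndexer(d=1000, eps=5)
dog = ri.context_vector(["eat:subj", "bark:subj", "eat:subj"])
cat = ri.context_vector(["eat:subj", "purr:subj"])
print(cosine(dog, cat))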
<Section position="2" start_page="362" end_page="362" type="sub_section"> <SectionTitle> 4.2 Locality Sensitive Hashing </SectionTitle> <Paragraph position="0"> LSH is a probabilistic technique that allows the approximation of a similarity function. Broder (1997) proposed an approximation of the Jaccard similarity function using min-wise independent functions. Charikar (2002) proposed an approximation of the cosine measure using random hyperplanes. Ravichandran et al. (2005) used this cosine variant and showed it to produce over 70% accuracy in extracting synonyms when compared against Pantel and Lin (2002).</Paragraph> <Paragraph position="1"> Given we have n terms in an m′-dimensional space, we create d ≪ m′ unit random vectors, also of m′ dimensions, labelled {r1, r2, ..., rd}. Each vector is created by sampling a Gaussian function m′ times, with a mean of 0 and a variance of 1.</Paragraph> <Paragraph position="2"> For each term w we construct its bit signature using the function h_r(w) = 1 if r · w ≥ 0, and 0 otherwise (4), where r is a spherically symmetric random vector of length d. The signature, w̄, is the d-length bit vector w̄ = {h_r1(w), h_r2(w), ..., h_rd(w)}. The cost to build all n signatures is O(nm′d).</Paragraph> <Paragraph position="3"> For terms u and v, Goemans and Williamson (1995) approximate the angular similarity by P[h_r(u) = h_r(v)] = 1 − θ(u,v)/π (5), where θ(u,v) is the angle between u and v. The angular similarity gives the cosine by cos(θ(u,v)) = cos((1 − P[h_r(u) = h_r(v)]) π). The probability can be derived from the Hamming distance: P[h_r(u) = h_r(v)] = 1 − hamming(ū, v̄)/d (6). By combining equations 5 and 6 we get the following approximation of the cosine distance: cos(θ(u,v)) ≈ cos((hamming(ū, v̄)/d) π) (7). That is, the cosine of two context vectors is approximated by the cosine of the Hamming distance between their two signatures, normalised by the size of the signatures. Search is performed using the PLEB data structure described in Section 5.2.</Paragraph> </Section> </Section>
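Below is a minimal NumPy sketch of the random-hyperplane signatures and the Hamming-distance approximation of the cosine described above, a simplified illustration rather than the authors' implementation: d Gaussian random vectors define a d-bit signature for each context vector, and the cosine of two vectors is estimated from the Hamming distance between their signatures as in equation 7.

import numpy as np

rng = np.random.default_rng(0)

def random_hyperplanes(d, m):
    # d unit random vectors of m dimensions, each sampled from N(0, 1).
    r = rng.normal(0.0, 1.0, size=(d, m))
    return r / np.linalg.norm(r, axis=1, keepdims=True)

def signature(w, planes):
    # Bit signature: h_r(w) = 1 if r . w >= 0 and 0 otherwise, for each r.
    return (planes @ w >= 0).astype(np.uint8)

def approx_cosine(sig_u, sig_v):
    # cos(theta(u, v)) ~ cos(pi * hamming(sig_u, sig_v) / d)
    d = len(sig_u)
    hamming = np.count_nonzero(sig_u != sig_v)
    return np.cos(np.pi * hamming / d)

# Toy comparison of the approximation against the exact cosine.
m, d = 500, 3000
u = rng.normal(size=m)
v = u + 0.3 * rng.normal(size=m)      # a vector similar to u
planes = random_hyperplanes(d, m)
exact = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(exact, approx_cosine(signature(u, planes), signature(v, planes)))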
<Section position="7" start_page="362" end_page="364" type="metho"> <SectionTitle> 5 Data Structures </SectionTitle> <Paragraph position="0"> The methods presented above fail to address the n² component of the search complexity. Many data structures have been proposed that can be used to address this problem in similarity searching. We present three data structures: the vantage point tree (VPT, Yianilos, 1993), which indexes points in a metric space, Point Location in Equal Balls (PLEB, Indyk and Motwani, 1998), a probabilistic structure that uses the bit signatures generated by LSH, and the Spatial Approximation Sample Hierarchy (SASH, Houle and Sakuma, 2005), which approximates a k-NN search.</Paragraph> <Paragraph position="1"> Another option, inspired by IR, is attribute indexing (INDEX). In this technique, in addition to each term having a reference to its attributes, each attribute has a reference to the terms referencing it. Each term is then only compared with the terms with which it shares attributes. We will give a theoretical comparison against the other techniques.</Paragraph> <Section position="1" start_page="363" end_page="363" type="sub_section"> <SectionTitle> 5.1 Vantage Point Tree </SectionTitle> <Paragraph position="0"> Metric space data structures provide a solution to near-neighbour searches in very high dimensions.</Paragraph> <Paragraph position="1"> These rely solely on the existence of a comparison function that satisfies the conditions of metricality: non-negativity, equality, symmetry and the triangle inequality.</Paragraph> <Paragraph position="2"> The VPT is typical of these structures and has been used successfully in many applications. The VPT is a binary tree designed for range searches, that is, searches limited to some distance from the target term, but it can be modified for k-NN search.</Paragraph> <Paragraph position="3"> The VPT is constructed recursively. Beginning with a set of terms U, we take any term to be our vantage point p. This becomes our root. We now find the median distance m_p of all other terms to p: m_p = median{dist(p,u) | u ∈ U}. Those terms u such that dist(p,u) ≤ m_p are inserted into the left sub-tree, and the remainder into the right sub-tree. Each sub-tree is then constructed as a new VPT, choosing a new vantage point from within its terms, until all terms are exhausted.</Paragraph> <Paragraph position="4"> Searching a VPT is also recursive. Given a term q and radius r, we begin by measuring the distance to the root term p. If dist(q,p) ≤ r we enter p into our list of near terms. If dist(q,p) − r ≤ m_p we enter the left sub-tree, and if dist(q,p) + r > m_p we enter the right sub-tree. Both sub-trees may be entered. The process is repeated for each entered sub-tree, taking the vantage point of the sub-tree to be the new root term.</Paragraph> <Paragraph position="5"> To perform a k-NN search we use a back-tracking decreasing-radius search (Burkhard and Keller, 1973). The search begins with r = ∞, and terms are added to a list of the closest k terms. When the kth closest term is found, the radius is set to the distance between this term and the target. Each time a new, closer element is added to the list, the radius is updated to the distance from the target to the new kth closest term.</Paragraph> <Paragraph position="6"> Construction complexity is O(n log n). Search complexity is claimed to be O(log n) for small-radius searches. This does not hold for our decreasing-radius search, whose worst-case complexity is O(n).</Paragraph> </Section>
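As a concrete illustration of the vantage point tree, here is a compact Python sketch of the recursive construction and the back-tracking decreasing-radius k-NN search described above, a simplified reading of the data structure rather than the authors' code; dist can be any metric, and plain Euclidean distance on tuples stands in for the similarity measure in the toy example.

import math
import statistics

def dist(a, b):
    # Any metric works; Euclidean distance is used for the sketch.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class VPNode:
    def __init__(self, points):
        self.point = points[0]                       # the vantage point p
        rest = points[1:]
        self.mu = self.left = self.right = None
        if rest:
            self.mu = statistics.median(dist(self.point, u) for u in rest)
            inner = [u for u in rest if dist(self.point, u) <= self.mu]
            outer = [u for u in rest if dist(self.point, u) > self.mu]
            self.left = VPNode(inner) if inner else None
            self.right = VPNode(outer) if outer else None

def knn(root, q, k):
    # Back-tracking decreasing-radius search: r starts unbounded and shrinks
    # to the distance of the current k-th closest term.
    best = []                                        # list of (distance, point)
    radius = math.inf

    def search(node):
        nonlocal radius
        if node is None:
            return
        d = dist(q, node.point)
        if d <= radius:
            best.append((d, node.point))
            best.sort(key=lambda x: x[0])
            del best[k:]
            if len(best) == k:
                radius = best[-1][0]                 # shrink the search radius
        if node.mu is None:
            return
        if d - radius <= node.mu:
            search(node.left)
        if d + radius > node.mu:
            search(node.right)

    search(root)
    return [p for _, p in best]

points = [(1, 1), (2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
print(knn(VPNode(points), (6, 3), k=2))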
<Section position="2" start_page="363" end_page="363" type="sub_section"> <SectionTitle> 5.2 Point Location in Equal Balls </SectionTitle> <Paragraph position="0"> PLEB is a randomised structure that uses the bit signatures generated by LSH. It was used by Ravichandran et al. (2005) to improve the efficiency of distributional similarity calculations. Having generated our d-length bit signatures for each of our n terms, we take these signatures and randomly permute the bits. Each vector has the same permutation applied. This is equivalent to a column reordering in a matrix where the rows are the terms and the columns the bits. After applying the permutation, the list of terms is sorted lexicographically based on the bit signatures. The list is scanned sequentially, and each term is compared to its B nearest neighbours in the list. The choice of B will affect the accuracy/efficiency trade-off, and need not be related to the choice of k. This is performed q times, using a different random permutation function each time. After each iteration, the current closest k terms are stored.</Paragraph> <Paragraph position="1"> For a fixed d, the complexity of the permutation step is O(qn), the sorting O(qn log n) and the search O(qBn).</Paragraph> </Section> <Section position="3" start_page="363" end_page="364" type="sub_section"> <SectionTitle> 5.3 Spatial Approximation Sample Hierarchy </SectionTitle> <Paragraph position="0"> SASH approximates a k-NN search by precomputing some near neighbours for each node (terms in our case). This produces multiple paths between terms, allowing SASH to shape itself to the data set (Houle, 2003). The following description is adapted from Houle and Sakuma (2005).</Paragraph> <Paragraph position="1"> The SASH is a directed, edge-weighted graph with the following properties (see Figure 1): Each term corresponds to a unique node.</Paragraph> <Paragraph position="2"> The nodes are arranged into a hierarchy of levels, with the bottom level containing n/2 nodes and the top containing a single root node. Each level, except the top, will contain half as many nodes as the level below.</Paragraph> <Paragraph position="3"> Edges between nodes only link consecutive levels. Each node will have at most p parent nodes in the level above, and c child nodes in the level below.</Paragraph> <Paragraph position="4"> Every node must have at least one parent so that all nodes are reachable from the root.</Paragraph> <Paragraph position="5"> Construction begins with the nodes being randomly distributed between the levels. The SASH is then constructed iteratively, with each node finding its closest p parents in the level above. The parent will keep the closest c of these children, forming edges in the graph, and reject the rest. Any nodes left without parents after being rejected are then assigned as children of the nearest node in the previous level with fewer than c children.</Paragraph> <Paragraph position="6"> Searching is performed by finding the k nearest nodes at each level, which are added to a set of near nodes. To limit the search, only those nodes whose parents were found to be nearest at the previous level are searched. The k closest nodes from the set of near nodes are then returned. The search complexity is O(ck log n).</Paragraph> <Paragraph position="7"> In Figure 1, the filled nodes demonstrate a search for the near-neighbours of some node q, using k = 2. Our search begins with the root node A. As we are using k = 2, we must find the two nearest children of A using our similarity measure. In this case, C and D are closer than B. We now find the closest two children of C and D. E is not checked as it is only a child of B. All other nodes are checked, including F and G, which are shared as children by B and C. From this level we choose G and H. The final levels are considered similarly. At this point we have the list of near nodes A, C, D, G, H, I, J, K and L. From this we choose the two nodes nearest q, H and I, marked in black, which are then returned.</Paragraph> <Paragraph position="8"> k can be varied at each level to force a larger number of elements to be tested at the base of the SASH, using, for instance, a geometric function of the level; this changes our search complexity accordingly. We use this geometric function in our experiments.</Paragraph> <Paragraph position="13"> Gorman and Curran (2005a; 2005b) found the performance of SASH for distributional similarity could be improved by replacing the initial random ordering with a frequency-based ordering. In accordance with Zipf's law, the majority of terms have low frequencies. Comparisons made with these low-frequency terms are unreliable (Curran and Moens, 2002). Creating the SASH with high-frequency terms near the root produces more reliable initial paths, but comparisons against these terms are more expensive.</Paragraph> <Paragraph position="14"> The best accuracy/efficiency trade-off was found when using more reliable initial paths rather than the most reliable. This is done by folding the data around some mean number of relations. For each term, if its number of relations m_i is greater than some chosen number of relations M, it is given a new ranking based on the score M²/m_i. Otherwise its ranking is based on its number of relations. This has the effect of pushing very high and very low frequency terms away from the root.</Paragraph> </Section> </Section>
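Returning to the PLEB search of Section 5.2, here is a minimal Python/NumPy sketch, an illustrative reading rather than the authors' implementation: the LSH bit signatures are column-permuted with q random permutations, sorted lexicographically, and each term is compared only with its B neighbours in each sorted order, with the k closest matches seen overall kept at the end.

import numpy as np

rng = np.random.default_rng(0)

def pleb_neighbours(signatures, terms, k=5, q=10, B=20):
    # signatures: (n, d) array of 0/1 LSH bit signatures, one row per term.
    n, d = signatures.shape
    best = {t: {} for t in terms}                    # term -> {other: hamming}
    for _ in range(q):
        perm = rng.permutation(d)                    # the same column reordering
        permuted = signatures[:, perm]               # is applied to every term
        # Sort the terms lexicographically by their permuted signatures.
        order = np.lexsort(permuted.T[::-1])
        for pos, i in enumerate(order):
            for j in order[max(0, pos - B): pos + B + 1]:
                if i == j:
                    continue
                # Hamming distance is unchanged by the permutation.
                h = int(np.count_nonzero(signatures[i] != signatures[j]))
                best[terms[i]][terms[j]] = h
    # Keep the k closest (smallest Hamming distance) terms for each term.
    return {t: sorted(near, key=near.get)[:k] for t, near in best.items()}

# Toy run on random signatures for six "terms".
sigs = rng.integers(0, 2, size=(6, 64), dtype=np.uint8)
print(pleb_neighbours(sigs, ["dog", "cat", "car", "truck", "idea", "plan"], k=2))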
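Similarly, the SASH construction and level-wise search of Section 5.3 can be sketched as follows, a simplified illustration rather than the authors' code: the child capacity c, the re-attachment of orphaned nodes, the frequency-based ordering and the geometric variation of k are all omitted, and plain Euclidean distance stands in for the similarity measure.

import math
import random

random.seed(0)

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_sash(points, p=4):
    # Simplified construction: nodes are shuffled and split into levels that
    # halve in size towards the root, and every node is linked as a child of
    # its p nearest nodes in the level above.
    nodes = points[:]
    random.shuffle(nodes)
    levels, remaining = [], nodes
    while len(remaining) > 1:
        half = len(remaining) // 2
        levels.append(remaining[:half])              # built bottom-up
        remaining = remaining[half:]
    levels.append(remaining)                         # the single root node
    levels.reverse()                                 # root level first
    children = {}
    for upper, lower in zip(levels, levels[1:]):
        for node in lower:
            for parent in sorted(upper, key=lambda u: dist(node, u))[:p]:
                children.setdefault(parent, []).append(node)
    return levels, children

def sash_search(levels, children, q, k=3):
    # Level-wise search: only the children of the k nodes kept at the previous
    # level are examined; the k nearest of everything seen are returned.
    frontier = list(levels[0])
    near = set(frontier)
    for _ in levels[1:]:
        candidates = {c for node in frontier for c in children.get(node, ())}
        frontier = sorted(candidates, key=lambda c: dist(q, c))[:k]
        near.update(frontier)
    return sorted(near, key=lambda c: dist(q, c))[:k]

points = [(random.random(), random.random()) for _ in range(64)]
levels, children = build_sash(points, p=4)
print(sash_search(levels, children, (0.5, 0.5), k=3))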
<Section position="8" start_page="364" end_page="365" type="metho"> <SectionTitle> 6 Evaluation Measures </SectionTitle> <Paragraph position="0"> The simplest method for evaluation is the direct comparison of extracted synonyms with a manually created gold standard (Grefenstette, 1994). To reduce the problem of limited coverage, our evaluation combines three electronic thesauri: the Macquarie, Roget's and Moby thesauri.</Paragraph> <Paragraph position="1"> We follow Curran (2004) and use two performance measures: direct matches (DIRECT) and inverse rank (INVR). DIRECT is the percentage of returned synonyms found in the gold standard.</Paragraph> <Paragraph position="2"> INVR is the sum of the inverse rank of each matching synonym, e.g. matches at ranks 3, 5 and 28 give an inverse rank score of 1/3 + 1/5 + 1/28. With at most 100 matching synonyms, the maximum INVR is 5.187. This is more fine-grained as it incorporates both the number of matches and their ranking. The same 300 single-word nouns were used for evaluation as were used by Curran (2004) for his large-scale evaluation. These were chosen randomly from WordNet such that they covered a range over the following properties: frequency, number of senses, specificity and concreteness.</Paragraph> <Paragraph position="3"> For each of these terms, the closest 100 terms and their similarity scores were extracted.</Paragraph> </Section> <Section position="9" start_page="365" end_page="365" type="metho"> <SectionTitle> 7 Experiments </SectionTitle> <Paragraph position="0"> We use two corpora in our experiments: the smaller is the non-speech portion of the British National Corpus (BNC), 90 million words covering a wide range of domains and formats; the larger consists of the BNC, the Reuters Corpus Volume 1 and most of the English news holdings of the LDC in 2003, representing over 2 billion words of text (LARGE, Curran, 2004).</Paragraph> <Paragraph position="1"> The semantic similarity system implemented by Curran (2004) provides our baseline. This performs a brute-force k-NN search (NAIVE). We present results for the canonical attribute heuristic (HEURISTIC), RI, LSH, PLEB, VPT and SASH.</Paragraph> <Paragraph position="2"> We take the optimal canonical attribute vector length of 30 for HEURISTIC from Curran (2004).</Paragraph> <Paragraph position="3"> For SASH we take the optimal values of p = 4 and c = 16 and use the folded ordering with M = 1000 from Gorman and Curran (2005b).</Paragraph> <Paragraph position="4"> For RI, LSH and PLEB we found optimal values experimentally using the BNC. For LSH we chose d = 3,000 (LSH_3,000) and 10,000 (LSH_10,000), showing the effect of changing the dimensionality. The frequency statistics were weighted using mutual information, as in Ravichandran et al. (2005). The initial experiments on RI produced quite poor results. The intuition was that this was caused by the lack of smoothing in the algorithm. Experiments were performed using the weights given in Curran (2004). Of these, mutual information (10), evaluated with an extra log₂(f(w, r, w′) + 1) factor and limited to positive values, produced the best results (RI_MI). The values d = 1000 and ε = 5 were found to produce the best results.</Paragraph> <Paragraph position="5"> All experiments were performed on 3.2GHz Xeon P4 machines with 4GB of RAM.</Paragraph> </Section> </Paper>