<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1077">
  <Title>Randomized Algorithms and NLP: Using Locality Sensitive Hash Function for High Speed Noun Clustering</Title>
  <Section position="4" start_page="622" end_page="623" type="intro">
    <SectionTitle>
2 Theory
</SectionTitle>
    <Paragraph position="0"> The core theory behind the implementation of fast cosine similarity calculation can be divided into two parts: 1. Developing LSH functions to create signatures; 2. Using fast search algorithm to find nearest neighbors. We describe these two components in greater detail in the next subsections.</Paragraph>
    <Section position="1" start_page="622" end_page="623" type="sub_section">
      <SectionTitle>
2.1 LSH Function Preserving Cosine Similarity
</SectionTitle>
      <Paragraph position="0"> We first begin with the formal definition of cosine similarity.</Paragraph>
      <Paragraph position="1"> Definition: Let u and v be two vectors in a k dimensional hyperplane. Cosine similarity is defined as the cosine of the angle between them: cos(th(u,v)). We can calculate cos(th(u,v)) by the following formula:</Paragraph>
      <Paragraph position="3"> Here th(u,v) is the angle between the vectors u and v measured in radians. |u.v |is the scalar (dot) product of u and v, and |u |and |v |represent the length of vectors u and v respectively.</Paragraph>
      <Paragraph position="4"> The LSH function for cosine similarity as proposed by Charikar (2002) is given by the following theorem: Theorem: Suppose we are given a collection of vectors in a k dimensional vector space (as written as Rk). Choose a family of hash functions as follows: Generate a spherically symmetric random vector r of unit length from this k dimensional space. We define a hash function, hr, as:</Paragraph>
      <Paragraph position="6"> Proof of the above theorem is given by Goemans and Williamson (1995). We rewrite the proof here for clarity. The above theorem states that the probability that a random hyperplane separates two vectors is directly proportional to the angle between the two vectors (i,e., th(u,v)). By symmetry, we have</Paragraph>
      <Paragraph position="8"> corresponds to the intersection of two half spaces, the dihedral angle between which is th. Thus, we have Pr[u.r [?] 0,v.r &lt; 0] = th(u,v)/2pi. Proceeding we have Pr[hr(u) negationslash= hr(v)] = th(u,v)/pi and</Paragraph>
      <Paragraph position="10"> pletes the proof.</Paragraph>
      <Paragraph position="11"> Hence from equation 3 we have,</Paragraph>
      <Paragraph position="13"> This equation gives us an alternate method for finding cosine similarity. Note that the above equation is probabilistic in nature. Hence, we generate a large (d) number of random vectors to achieve the process. Having calculated hr(u) with d random vectors for each of the vectors u, we apply equation 4 to find the cosine distance between two vectors.</Paragraph>
      <Paragraph position="14"> As we generate more number of random vectors, we can estimate the cosine similarity between two vectors more accurately. However, in practice, the number (d) of random vectors required is highly domain dependent, i.e., it depends on the value of the total number of vectors (n), features (k) and the way the vectors are distributed. Using d random vectors, we  can represent each vector by a bit stream of length d.</Paragraph>
      <Paragraph position="15"> Carefully looking at equation 4, we can observe that Pr[hr(u) = hr(v)] = 1 [?] (hamming distance)/d1 . Thus, the above theorem, converts the problem of finding cosine distance between two vectors to the problem of finding hamming distance between their bit streams (as given by equation 4). Finding hamming distance between two bit streams is faster and highly memory efficient.</Paragraph>
      <Paragraph position="16"> Also worth noting is that this step could be considered as dimensionality reduction wherein we reduce a vector in k dimensions to that of d bits while still preserving the cosine distance between them.</Paragraph>
    </Section>
    <Section position="2" start_page="623" end_page="623" type="sub_section">
      <SectionTitle>
2.2 Fast Search Algorithm
</SectionTitle>
      <Paragraph position="0"> To calculate the fast hamming distance, we use the search algorithm PLEB (Point Location in Equal Balls) first proposed by Indyk and Motwani (1998).</Paragraph>
      <Paragraph position="1"> This algorithm was further improved by Charikar (2002). This algorithm involves random permutations of the bit streams and their sorting to find the vector with the closest hamming distance. The algorithm given in Charikar (2002) is described to find the nearest neighbor for a given vector. We modify it so that we are able to find the top B closest neighbor for each vector. We omit the math of this algorithm but we sketch its procedural details in the next section. Interested readers are further encouraged to read Theorem 2 from Charikar (2002) and Section 3 from Indyk and Motwani (1998).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>