<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1011">
  <Title>Approximate Searching for Distributional Similarity</Title>
  <Section position="4" start_page="97" end_page="97" type="metho">
    <SectionTitle>
3 Nearest-neighbour search
</SectionTitle>
    <Paragraph position="0"> The simplest algorithm for finding synonyms is nearest-neighbour search, which involves pairwise vector comparison of the target term with every term in the vocabulary. Given an n term vocabulary and up to m attributes for each term, the asymptotic time complexity of nearest-neighbour search is O(n2m).</Paragraph>
    <Paragraph position="1"> This is very expensive with even a moderate vocabulary and small attribute vectors making the use of huge datasets infeasible.</Paragraph>
    <Section position="1" start_page="97" end_page="97" type="sub_section">
      <SectionTitle>
3.1 Heuristic
</SectionTitle>
      <Paragraph position="0"> Using cutoff to remove low frequency terms can significantly reduce the value of n. In these experiments, we used a cutoff of 5. However, a solution is still needed to reduce the factor m. Unfortunately, reducing m by eliminating low frequency contexts has a significant impact on the quality of the results.</Paragraph>
      <Paragraph position="1"> Curran and Moens (2002a) propose an initial heuristic comparison to reduce the number of full O(m) vector comparisons. They introduce a bounded vector (length k) of canonical attributes, selected from the full vector, to represent the term. The selected attributes are the most strongly weighted verb attributes: Curran and Moens chose these relations as they generally constrain the semantics of the term more and partake in fewer idiomatic collocations.</Paragraph>
      <Paragraph position="2"> If a pair of terms share at least one canonical attribute then a full similarity comparison is performed, otherwise the terms are not considered similar. If a maximum of p positive results are returned, our complexity becomes O(n2k+npm), which, since k is constant, is O(n2 + npm).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="97" end_page="99" type="metho">
    <SectionTitle>
4 The SASH
</SectionTitle>
    <Paragraph position="0"> The SASH approximates a nearest-neighbour search by pre-computing some of the near-neighbours of each node (terms in our case). It is arranged as a multi-leveled pyramid, where each node is linked to its (approximate) near-neighbours on the levels above and below. This produces multiple paths between nodes, allowing the SASH to shape itself to the data set (Houle, 2003a). This graph is searched by finding the near-neighbours of the target node at each level. The following description is adapted from Houle (2003b).</Paragraph>
    <Section position="1" start_page="98" end_page="98" type="sub_section">
      <SectionTitle>
4.1 Metric Spaces
</SectionTitle>
      <Paragraph position="0"> The SASH organises nodes that can be measured in metric space. Although it is not necessary for the SASH to work, only in this space can performance be guaranteed. Our meaures produce a metric-like space for the terms derived from large datasets.</Paragraph>
      <Paragraph position="1"> A domain D is a metric space if there exists a  function dist : D x D - R[?]0 such that: 1. dist(p, q) [?] 0 [?] p, q [?] D (non-negativity) 2. dist(p, q) = 0 iff p = q [?] p, q [?] D (equality) 3. dist(p, q) = dist(q, p) [?] p, q [?] D (symmetry) 4. dist(p, q) + dist(q, r) [?] dist(p, r) [?] p, q, r [?] D (triangle inequality)  We invert the similarity measure to produce a distance, resulting in condition 2 not being satisfied since dist(p, p) = x, x &gt; 0. For most measures x is constant, so dist(p, q) &gt; dist(p, p) if p nequal q and p and q do not occur in exactly the same contexts. For some measures, e.g. DICE, dist(p, p) &gt; dist(p, q), that is, p is closer to q than it is to itself. These do not preserve metric space in any way, so cannot be used with the SASH.</Paragraph>
      <Paragraph position="2"> Ch'avez et al. (2001) divides condition 2 into:</Paragraph>
      <Paragraph position="4"> If strict positiveness is not satisfied the space is called pseudometric. In theory, our measures do not satisfy this condition, however in practice most large datasets will satisfy this condition.</Paragraph>
    </Section>
    <Section position="2" start_page="98" end_page="98" type="sub_section">
      <SectionTitle>
4.2 Structure
</SectionTitle>
      <Paragraph position="0"> The SASH is a directed, edge-weighted graph with the following properties: * Each term corresponds to a unique node.</Paragraph>
      <Paragraph position="1"> * The nodes are arranged into a hierarchy of levels, with the bottom level containing n2 nodes and the top containing a single root node. Each level, except the top, will contain half as many nodes as the level below. These are numbered from 1 (top) to h.</Paragraph>
      <Paragraph position="2"> * Edges between nodes are linked from consecutive levels. Each node will have at most p parent nodes in the level above, and c child nodes in the level below.</Paragraph>
      <Paragraph position="3"> * Every node must have at least one parent so that all nodes are reachable from the root.</Paragraph>
      <Paragraph position="4"> Figure 1 shows a SASH which will be used below.</Paragraph>
    </Section>
    <Section position="3" start_page="98" end_page="99" type="sub_section">
      <SectionTitle>
4.3 Construction
</SectionTitle>
      <Paragraph position="0"> The SASH is constructed iteratively by finding the nearest parents in the level above. The nodes are first randomly distributed to reduce any clustering effects. They are then split into the levels described above, with level h having n2 nodes, level 2 at most c nodes and level 1 having a single root node.</Paragraph>
      <Paragraph position="1"> The root node has all nodes at level 2 as children and each node at level 2 has the root as its sole parent. Then for each node in each level i from 3 to h, we find the set of p nearest parent nodes in level (i [?] 1). The node then asks that parent if it can be a child. As only the closest c nodes can be children of a node, it may be the case that a requested parent rejects a child.</Paragraph>
      <Paragraph position="2">  If a child is left without any parents it is said to be orphaned. Any orphaned nodes must now find the closest node in the above level that has fewer than c children. Once all nodes have at least one parent, we move to the next level. This proceeds iteratively through the levels.</Paragraph>
    </Section>
    <Section position="4" start_page="99" end_page="99" type="sub_section">
      <SectionTitle>
4.4 Search
</SectionTitle>
      <Paragraph position="0"> Searching the SASH is also performed iteratively. To find the k nearest neighbours of a node q, we first find the k nearest neighbours at each level. At level 1 we take the single root node to be nearest. Then, for each level after, we find the k nearest unique children of the nodes found in the level above. When the last level has been searched, we return the closest k nodes from all the sets of near neighbours returned.</Paragraph>
      <Paragraph position="1"> In Figure 1, the filled nodes demonstrate a search for the near-neighbours of some node q, using k = 2.</Paragraph>
      <Paragraph position="2"> Our search begins with the root node A. As we are using k = 2, we must find the two nearest children of A using our similarity measure. In this case, C and D are closer than B. We now find the closest two children of C and D. E is not checked as it is only a child of B. All other nodes are checked, including F and G, which are shared as children by B and C.</Paragraph>
      <Paragraph position="3"> From this level we chose G and H. We then consider the fourth and fifth levels similarly.</Paragraph>
      <Paragraph position="4"> At this point we now have the list of near nodes A, C, D, G, H, I, J, K and L. From this we chose the two nodes closest to q: H and I marked in black.</Paragraph>
      <Paragraph position="5"> These are returned as the near-neighbours of q.</Paragraph>
      <Paragraph position="6"> k can also be varied at each level to force a larger number of elements to be tested at the base of the SASH using, for instance, the equation:</Paragraph>
      <Paragraph position="8"> We use this geometric function in our experiments.</Paragraph>
    </Section>
    <Section position="5" start_page="99" end_page="99" type="sub_section">
      <SectionTitle>
4.5 Complexity
</SectionTitle>
      <Paragraph position="0"> When measuring the time complexity, we consider the number of distance measurements as these dominate the computation. If we do not consider the problem of assigning parents to orphans, for n nodes, p parents per child, at most c children per parent and a search returning k elements, the loose upper bounds are:</Paragraph>
      <Paragraph position="2"> Since the average number of children per node is approximately 2p, practical complexities can be derived using c = 2p.</Paragraph>
      <Paragraph position="3"> In Houle's experiments, typically less than 5% of computation time was spent assigning parents to orphans, even for relatively small c. In some of our experiments we found that low values of c produced significantly worse load times that for higher values, but this was highly dependant on the distribution of nodes. Table 1 shows this with respect to several distributions and values of c.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="99" end_page="100" type="metho">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> The simplest method of evaluation is direct comparison of the extracted synonyms with a manually-created gold standard (Grefenstette, 1994). However, on small corpora, rare direct matches provide limited information for evaluation, and thesaurus coverage is a problem. Our evaluation uses a combination of three electronic thesauri: the Macquarie (Bernard, 1990), Roget's (Roget, 1911) and Moby (Ward, 1996) thesauri.</Paragraph>
    <Paragraph position="1">  With this gold standard in place, it is possible to use precision and recall measures to evaluate the quality of the extracted thesaurus. To help overcome the problems of direct comparisons we use several measures of system performance: direct matches (DIRECT), inverse rank (INVR), and precision of the top n synonyms (P(n)), for n = 1, 5 and 10.</Paragraph>
    <Paragraph position="2"> INVR is the sum of the inverse rank of each matching synonym, e.g. matching synonyms at ranks 3, 5 and 28 give an inverse rank score of  28, and with at most 100 synonyms, the max-imum I NVR score is 5.187. Precision of the top n is the percentage of matching synonyms in the top n extracted synonyms.</Paragraph>
    <Paragraph position="3"> The same 70 single-word nouns were used for the evaluation as in Curran and Moens (2002a). These were chosen randomly from WordNet such that they covered a range over the following properties: frequency Penn Treebank and BNC frequencies; number of senses WordNet and Macquarie senses; specificity depth in the WordNet hierarchy; concreteness distribution across WordNet subtrees. For each of these terms, the closest 100 terms and their similarity score were extracted.</Paragraph>
  </Section>
  <Section position="7" start_page="100" end_page="101" type="metho">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"> The contexts were extracted from the non-speech portion of the British National Corpus (Burnard, 1995). All experiments used the JACCARD measure function, the TTEST weight function and a cutoff frequency of 5. The SASH was constructed using the geometric equation for ki described in Section 4.4.</Paragraph>
    <Paragraph position="1"> When the heuristic was applied, the TTESTLOG weight function was used with a canonical set size of 100 and a maximum frequency cutoff of 10,000.</Paragraph>
    <Paragraph position="2"> The values 4-16, 8-32, 16-64, and 32-128 were chosen for p and c. This gives a range of branching factors to test the balance between sparseness, where there is potential for erroneous fragmentation of large clusters, and bushiness, where more tests must be made to find near children. The c = 4p relationship is derived from the simple hashing rule of thumb that says that a hash table should have roughly twice the size required to store all its ele- null Our initial experiments showed that the random distribution of nodes (RANDOM) in SASH construction caused the nearest-neighbour approximation to be very inaccurate for distributional similarity. Although the speed was improved by two orders of magnitude when c = 16, it achieved only 13% of the INVR of the na&amp;quot;ive implementation. The best RANDOM result was less than three times faster then the na&amp;quot;ive solution and only 60% INVR.</Paragraph>
    <Paragraph position="3"> In accordance with Zipf's law the majority of terms have very low frequencies. Similarity measurements made against these low frequency terms are less reliable, as accuracy increases with the number of relations and their frequencies (Curran and Moens, 2002b). This led to the idea that ordering the nodes by frequency before generating the SASH would improve accuracy.</Paragraph>
    <Paragraph position="4"> The SASH was then generated with the highest frequency terms were near the root so that the initial search paths would be more accurate. This has the unfortunate side-effect of slowing search by up to four times because comparisons with high frequency terms take longer than with low frequency terms as they have a larger number of relations.</Paragraph>
    <Paragraph position="5">  This led to updating our original frequency ordering idea by recognising that we did not need the most accurately comparable terms at the top of the SASH, only more accurately comparable terms than those randomly selected.</Paragraph>
    <Paragraph position="6"> As a first attempt, we constructed SASHs with frequency orderings that were folded about a chosen number of relations M. For each term, if its number of relations mi was greater than M, it was given a new ranking based on the score M2mi . In this way, very high and very low frequency terms were pushed away from the root. The folding points this was tested for were 500, 1000 and 1500. There are many other node organising schemes we are yet to explore.</Paragraph>
    <Paragraph position="7"> The frequency distributions over the top three levels for each ordering scheme are shown in Table 2. Zipf's law results in a large difference between the mean and median frequency values in the RANDOM results: most of the nodes have low frequency, but some high frequency results push the mean up. The four-fold reduction in efficiency for SORT (see Table 3) is a result of the mean number of relations being over 65 times that of RANDOM.</Paragraph>
    <Paragraph position="8"> Experiments covering the full set of permutations of these parameters were run, with and without the heuristic applied. In the cases where the heuristic rejected pairs of terms, the SASH treated the rejected pairs as being as infinitely far apart. In addition, the brute force solutions were generated with (NAIVE HEURISTIC) and without (NAIVE) the heuristic.</Paragraph>
    <Paragraph position="9"> We have assumed that all weights and measures introduce similar distribution properties into the SASH, so that the best weight and measure when performing a brute-force search will also produce the best results when combined with the SASH. Future experiments will explore SASH behaviour with other similarity measures.</Paragraph>
  </Section>
class="xml-element"></Paper>