<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1046">
  <Title>Scaling Distributional Similarity to Large Corpora</Title>
  <Section position="10" start_page="365" end_page="367" type="evalu">
    <SectionTitle>
8 Results
</SectionTitle>
    <Paragraph position="0"> As the accuracy of comparisons between terms increases with frequency (Curran, 2004), applying a frequency cut-off will both reduce the size of the vocabulary (n) and increase the average accuracy of comparisons. Table 1 shows the reduction in vocabulary and increase in average context relations per term as cut-off increases. For LARGE, the initial 541,722 word vocabulary is reduced by 66% when a cut-off of 5 is applied and by 86% when the cut-off is increased to 100. The average number of relations increases from 97 to 1400.</Paragraph>
    <Paragraph position="1"> The work by Curran (2004) largely uses a frequency cut-off of 5. When this cut-off was used with the randomised techniques RI and LSH, it produced quite poor results. When the cut-off was increased to 100, as used by Ravichandran et al.</Paragraph>
    <Paragraph position="2"> (2005), the results improved significantly. Table 2 shows the INVR scores for our various techniques using the BNC with cut-offs of 5 and 100.</Paragraph>
    <Paragraph position="3"> Table 3 shows the results of a full thesaurus extraction using the BNC and LARGE corpora using a cut-off of 100. The average DIRECT score and INVR are from the 300 test words. The total execution time is extrapolated from the average search time of these test words and includes the setup time. For LARGE, extraction using NAIVE takes 444 hours: over 18 days. If the 184,494 word vocabulary were used, it would take over 7000 hours, or nearly 300 days. This gives some indication of  the scale of the problem.</Paragraph>
    <Paragraph position="4"> The only technique to become less accurate when the corpus size is increased is RI; it is likely that RI is sensitive to high frequency, low information contexts that are more prevalent in LARGE. Weighting reduces this effect, improving accuracy.</Paragraph>
    <Paragraph position="5"> The importance of the choice of d can be seen in the results for LSH. While much slower, LSH10,000 is also much more accurate than LSH3,000, while still being much faster than NAIVE. Introducing the PLEB data structure does not improve the efficiency while incurring a small cost on accuracy. We are not using large enough datasets to show the improved time complexity using PLEB.</Paragraph>
    <Paragraph position="6"> VPT is only slightly faster slightly faster than NAIVE. This is not surprising in light of the original design of the data structure: decreasing radius search does not guarantee search efficiency.</Paragraph>
    <Paragraph position="7"> A significant influence in the speed of the randomised techniques, RI and LSH, is the fixed dimensionality. The randomised techniques use a fixed length vector which is not influenced by the size of m. The drawback of this is that the size of the vector needs to be tuned to the dataset.</Paragraph>
    <Paragraph position="8"> It would seem at first glance that HEURISTIC and SASH provide very similar results, with HEURISTIC slightly slower, but more accurate.</Paragraph>
    <Paragraph position="9"> This misses the difference in time complexity between the methods: HEURISTIC is n2 and SASH nlog n. The improvement in execution time over NAIVE decreases as corpus size increases and this would be expected to continue. Further tuning of SASH parameters may improve its accuracy.</Paragraph>
    <Paragraph position="10"> RIMI produces similar result using LARGE to SASH using BNC. This does not include the cost of extracting context relations from the raw text, so the true comparison is much worse. SASH allows the free use of weight and measure functions, but RI is constrained by having to transform any context space into a RI space. This is important when  considering that different tasks may require different weights and measures (Weeds and Weir, 2005). RI also suffers n2 complexity, where as SASH is nlog n. Taking these into account, and that the improvements are barely significant, SASH is a better choice.</Paragraph>
    <Paragraph position="11"> The results for LSH are disappointing. It performs consistently worse than the other methods except VPT. This could be improved by using larger bit vectors, but there is a limit to the size of these as they represent a significant memory overhead, particularly as the vocabulary increases.</Paragraph>
    <Paragraph position="12"> Table 4 presents the theoretical analysis of attribute indexing. The average number of comparisons made for various cut-offs of LARGE are shown. NAIVE and INDEX are the actual values for those techniques. The values for SASH are worst case, where the maximum number of terms are compared at each level. The actual number of comparisons made will be much less. The efficiency of INDEX is sensitive to the density of attributes and increasing the cut-off increases the density. This is seen in the dramatic drop in performance as the cut-off increases. This problem of density will increase as volume of raw input data increases, further reducing its effectiveness. SASH is only dependent on the number of terms, not the density.</Paragraph>
    <Paragraph position="13"> Where the need for computationally efficiency out-weighs the need for accuracy, RIMI provides better results. SASH is the most balanced of the techniques tested and provides the most scalable, high quality results.</Paragraph>
  </Section>
class="xml-element"></Paper>