<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1011">
  <Title>Approximate Searching for Distributional Similarity</Title>
  <Section position="8" start_page="101" end_page="102" type="evalu">
    <SectionTitle>
7 Results
</SectionTitle>
    <Paragraph position="0"> Table 3 presents the results for the initial experiments. SORT was consistently more accurate than RANDOM, and when c = 128, performed as well as NAIVE for all evaluation measures except for direct matches. Both SASH solutions outperformed NAIVE in efficiency.</Paragraph>
    <Paragraph position="1"> The trade-off between efficiency and approximation accuracy is evident in these results. The most efficient result is 100 times faster than NAIVE, but only 13% accurate on INVR, with 6% of direct matches. The most accurate result is 100% accurate on INVR, with 99% of direct matches, but is less than twice as fast.</Paragraph>
    <Paragraph position="2"> Table 4 shows the trade-off for folded distributions. The least accurate FOLD500 result is 30% accurate but 50 times faster than NAIVE, while the most accurate is 87% but less than two times faster. The least accurate FOLD1500 result is 43% accurate but 71 times faster than NAIVE, while the most accurate is 101% and two and a half times faster. These results show the impact of moving high-frequency terms away from the root.</Paragraph>
    <Paragraph position="3"> Figure 2 plots the trade-off using search time and INVR at c = 16, 32, 64 and 128. For c = 16 every SASH has very poor accuracy. By c = 64 their accuracy has improved dramatically, but their search time also increased somewhat. At c = 128, there is only a small improvement in accuracy, coinciding with a large increase in search time. The best trade-off between efficiency and approximation accuracy occurs at the knee of the curve where c = 64.</Paragraph>
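One simple way to locate such a knee can be sketched as follows. The numbers are hypothetical placeholders chosen only to mimic the shape described above (they are not taken from Figure 2), and `knee_index` is an illustrative helper, not the procedure used in the paper: it normalises both axes and picks the point lying furthest above the straight line joining the first and last settings.

```python
def knee_index(times, scores):
    """Index of the point furthest above the chord joining the first and last points."""
    t0, t1 = times[0], times[-1]
    s0, s1 = scores[0], scores[-1]
    # Normalise both axes to [0, 1] so that neither dominates the comparison.
    xs = [(t - t0) / (t1 - t0) for t in times]
    ys = [(s - s0) / (s1 - s0) for s in scores]
    # The chord now runs from (0, 0) to (1, 1), so the gap y - x is
    # proportional to the distance of each point from it.
    gaps = [y - x for x, y in zip(xs, ys)]
    return max(range(len(gaps)), key=gaps.__getitem__)

# Hypothetical values, for illustration only.
c_values = [16, 32, 64, 128]
times = [1.0, 2.0, 4.0, 12.0]      # assumed search times, arbitrary units
scores = [0.10, 0.50, 0.90, 1.00]  # assumed accuracy relative to NAIVE

best_c = c_values[knee_index(times, scores)]
print(best_c)
```

With these placeholder values the largest gap falls at the third setting, matching the qualitative reading that accuracy gains flatten while search time keeps growing.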
    <Paragraph position="4"> When c = 128 both SORT and FOLD1500 perform as well as, or slightly outperform, NAIVE on some evaluation measures. These evaluation measures involve the rank of correct synonyms, so if the SASH approximation were to fail to find some incorrectly proposed synonyms ranked above some other correct synonyms, those correct synonyms would have their ranking pushed up. In this way, the approximation can potentially outperform the original nearest-neighbour algorithm.</Paragraph>
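A small worked sketch of this effect follows. It assumes that INVR is the sum of the reciprocal ranks of the correct synonyms in the returned list; this section does not define the measure, so that definition, the helper name `invr`, and the toy word lists are all illustrative assumptions rather than data from the paper.

```python
def invr(ranked, gold):
    """Sum of 1/rank over returned terms that are in the gold set (assumed definition)."""
    return sum(1.0 / rank for rank, term in enumerate(ranked, start=1) if term in gold)

# Hypothetical gold-standard synonyms and result lists, made up for illustration.
gold = {"b", "d"}
exact = ["b", "a", "c", "d"]   # full nearest-neighbour ranking; "a" and "c" are incorrect
approx = ["b", "a", "d"]       # approximate search that failed to retrieve "c"

exact_score = invr(exact, gold)    # 1/1 + 1/4
approx_score = invr(approx, gold)  # 1/1 + 1/3
relative = approx_score / exact_score
print(exact_score, approx_score, relative)
```

Because the missed term was an incorrect one ranked above a correct synonym, the correct synonym moves from rank 4 to rank 3, and the approximate score relative to the exact score rises above 100%.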
    <Paragraph position="5"> From Tables 3 and 4 we also see that as the value of c increases, so does the accuracy across all of the experiments. This is because as c increases the number of paths between nodes increases and we have a solution closer to a true nearest-neighbour search, that is, there are more ways of finding the true nearest-neighbour nodes.</Paragraph>
    <Paragraph position="6"> Table 5 presents the results of combining the canonical attributes heuristic (see Section 3.1) with the SASH approximation. This NAIVE HEURISTIC is 14 times faster than NAIVE and 97% accurate, with 96% of direct matches. The combination has comparable accuracy and is much more efficient than the best of the SASH solutions. The best heuristic SASH results used the SORT ordering with c = 16, which was 37 times faster than NAIVE and 2.5 times faster than NAIVE HEURISTIC. Its performance was statistically indistinguishable from NAIVE HEURISTIC. Using the heuristic changes the impact of the number of children c on the SASH performance characteristics. It seems that beyond c = 16 the only significant effect is to reduce the efficiency (often to slower than NAIVE HEURISTIC).</Paragraph>
    <Paragraph position="7"> The heuristic interacts in an interesting way with the ordering of the nodes in the SASH. This is most obvious with the RANDOM results. The RANDOM heuristic INVR results are eight times better than the full RANDOM results. Similar, though less dramatic, results are seen with other orderings. It appears that using the heuristic changes the clustering of nearest-neighbours within the SASH so that better matching paths are chosen and more noisy matches are eliminated entirely by the heuristic.</Paragraph>
    <Paragraph position="8"> It may seem that there are no major advantages to using the SASH with the already efficient heuristic matching method. However, our experiments have used small canonical attribute vectors (maximum length 100). Increasing the canonical vector size allows us to increase the accuracy of heuristic solutions at the cost of efficiency. Using a SASH solution would offset some of this efficiency penalty. This has the potential for a solution that is more than an order of magnitude faster than NAIVE and is almost as accurate.</Paragraph>
  </Section>
</Paper>