<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0908"> <Title>Improvements in Automatic Thesaurus Extraction</Title> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> For computational practicality, we assume that the performance behaviour of the measure and weight functions is independent of each other. Therefore, we have evaluated the weight functions using the JACCARD measure, and evaluated the measure functions using the TTEST weight, because they produced the best results in our previous experiments.</Paragraph> <Paragraph position="1"> Table 4 presents the results of evaluating the measure functions. The best performance across all measures was shared by JACCARD and DICE, which produced identical results for the 70 words. DICE is easier to compute and is thus the preferred measure function.</Paragraph> <Paragraph position="2"> Table 5 presents the results of evaluating the weight functions. Here TTEST significantly outperformed the other weight functions, which supports our intuition that good context descriptors are also strong collocates of the term. Surprisingly, the other collocation discovery functions did not perform as well, even though TTEST is not the most favoured for collocation discovery because of its behaviour at low frequency counts.</Paragraph> <Paragraph position="3"> One difficulty with weight functions involving logarithms or differences is that they can be negative. The results in Table 6 show that weight functions that are not bounded below by zero do not perform as well on thesaurus extraction. However, unbounded weights do produce interesting and unexpected results: they tend to return misspellings of the term and its synonyms, abbreviations, and lower frequency synonyms. For instance, the unbounded variant of TTEST returned Co, Co. and PLC for company, but these do not appear in the synonyms extracted with the zero-bounded TTEST.
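As a concrete illustration of the measure functions compared in Table 4, here is a minimal sketch of weighted Jaccard and Dice similarity over attribute vectors, represented as dicts mapping attributes to weights. This formulation is an assumption for illustration; the paper's exact definitions may differ. It shows why DICE is cheaper: its denominator is just the two weight sums, with no per-attribute max over the union.

```python
def weighted_jaccard(u, v):
    """Weighted Jaccard over attribute -> weight dicts: needs a
    per-attribute min/max over the union of both attribute sets."""
    attrs = set(u) | set(v)
    num = sum(min(u.get(a, 0.0), v.get(a, 0.0)) for a in attrs)
    den = sum(max(u.get(a, 0.0), v.get(a, 0.0)) for a in attrs)
    return num / den if den else 0.0


def weighted_dice(u, v):
    """Weighted Dice: only shared attributes enter the numerator, and
    the denominator is the sum of all weights, so it is cheaper to compute."""
    num = 2.0 * sum(min(u[a], v[a]) for a in set(u) & set(v))
    den = sum(u.values()) + sum(v.values())
    return num / den if den else 0.0
```

Both measures return 1.0 for identical vectors and 0.0 for vectors with no shared attributes, which is consistent with the two producing very similar rankings in practice.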
The unbounded weights also extracted more hyponyms, such as corporation names for company, including Kodak and Exxon. Finally, unbounded weights tended to promote the rankings of synonyms from minority senses, because the frequent senses are demoted by negative weights. For example, the unbounded variant of TTEST returned writings, painting, fieldwork, essay and masterpiece as the best synonyms for work, whereas TTEST returned study, research, job, activity and life.</Paragraph> <Paragraph position="4"> Introducing a minimum cutoff that ignores low frequency potential synonyms can eliminate many unnecessary comparisons. Figure 2 presents both the performance of the system using direct match evaluation (left axis) and execution times (right axis) for increasing cutoffs. This test was performed using JACCARD and the TTEST and LIN98A weight functions. The first feature of note is that as we increase the minimum cutoff to 30, the direct match results improve for TTEST, which is probably a result of TTEST's weakness on low frequency counts. Initially, the execution time is rapidly reduced by small increments of the minimum cutoff. This is because Zipf's law applies to relations, so small increments of the cutoff eliminate many terms from the tail of the distribution. There are only 29,737 terms when the cutoff is 30; 88,926 terms when the cutoff is 5; and 246,067 without a cutoff. Because the extraction algorithm is O(n²m), this results in significant efficiency gains.
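The effect of the minimum cutoff can be sketched with a toy Zipfian vocabulary (names and numbers here are illustrative assumptions, not the paper's corpus): because frequency falls off roughly as 1/rank, most terms sit in the low-frequency tail, so even a small cutoff removes a large fraction of the n² candidate comparisons.

```python
def apply_min_cutoff(term_freq, cutoff):
    """Keep only candidate terms whose corpus frequency meets the cutoff."""
    return {t: f for t, f in term_freq.items() if f >= cutoff}


# Toy Zipfian vocabulary: frequency of rank r is proportional to 1/r.
vocab = {f"term{r}": max(1, 10000 // r) for r in range(1, 10001)}

print(len(vocab))                         # 10,000 terms without a cutoff
print(len(apply_min_cutoff(vocab, 5)))    # far fewer survive a cutoff of 5
print(len(apply_min_cutoff(vocab, 30)))   # fewer still at 30
```

Since the extraction algorithm is quadratic in the number of surviving terms, shrinking the vocabulary this way yields much more than a linear speed-up.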
Since extracting only 70 thesaurus terms takes about 43 minutes with a minimum cutoff of 5, the efficiency/performance trade-off is particularly important from the perspective of implementing a practical extraction system.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Efficiency </SectionTitle> <Paragraph position="0"> Even with a minimum cutoff of 30 as a reasonable compromise between speed and accuracy, extracting a thesaurus for 70 terms takes approximately 20 minutes. If we want to extract a complete thesaurus for the 29,737 terms left after the cutoff has been applied, it would take approximately one full week of processing. Given that the size of the training corpus could be much larger (cf. Curran and Moens (2002)), which would increase both the number of attributes for each term and the total number of terms above the minimum cutoff, this is not nearly fast enough.</Paragraph> <Paragraph position="1"> The problem is that the time complexity of thesaurus extraction is not practically scalable to significantly larger corpora.</Paragraph> <Paragraph position="2"> Although the minimum cutoff helps by reducing n to a reasonably small value, it does not constrain m in any way. In fact, using a cutoff increases the average value of m across the terms, because it removes low frequency terms with few attributes. For instance, the frequent term company appears in 11,360 grammatical relations, with a total frequency of 69,240 occurrences, whereas the infrequent pants appears in only 401 relations, with a total frequency of 655 occurrences.</Paragraph> <Paragraph position="3"> The problem is that for every comparison, the algorithm must examine the length of both attribute vectors. Grefenstette (1994) uses bit signatures to test for shared attributes, but because of the high frequency of the most common attributes, this does not skip many comparisons.
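The bit-signature test attributed to Grefenstette (1994) can be sketched as a Bloom-filter-style check (this implementation, including the `signature` helper and the 128-bit width, is an illustrative assumption): each attribute is hashed to one bit, and if two signatures share no bits the vectors provably share no attributes, so the full O(m) comparison can be skipped.

```python
def signature(attributes, bits=128):
    """OR together one hashed bit per attribute to form a compact signature."""
    sig = 0
    for a in attributes:
        sig |= 1 << (hash(a) % bits)
    return sig


def may_share_attributes(sig1, sig2):
    """A zero intersection proves the vectors share no attributes, so the
    full comparison can be skipped. A nonzero intersection may be a hash
    collision (false positive), so it only licenses doing the full check."""
    return (sig1 & sig2) != 0
```

As the paper notes, the most common attributes appear in nearly every vector, so nearly every pair of signatures overlaps and few comparisons are actually skipped.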
Our system keeps track of the sum of the remaining vector, which is a significant optimisation, but comes at the cost of increased representation size. However, what is needed is some algorithmic reduction that bounds the number of full O(m) vector comparisons performed.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Approximation Algorithm </SectionTitle> <Paragraph position="0"> One way of bounding the complexity is to perform an approximate comparison first. If the approximation returns a positive result, then the algorithm performs the full comparison. We can do this by introducing another, much shorter vector of canonical attributes, with a bounded length k. If our approximate comparison returns at most p positive results for each term, then the time complexity becomes O(n²k + npm), which, since k is constant, is O(n² + npm). So as long as we find an approximation function and vector such that p ≪ n, the system will run much faster and be much more scalable in m, the number of attributes. However, p ≪ n implies that we are discarding a very large number of potential matches, and so there will be a performance penalty. This trade-off is governed by the number of canonical attributes and how representative they are of the full attribute vector, and thus the term itself. It is also dependent on the functions used to compare the canonical attribute vectors.</Paragraph> <Paragraph position="1"> The canonical vector must contain attributes that best describe the thesaurus term in a bounded number of entries. The obvious first choice is the most strongly weighted attributes from the full vector. Figure 3 shows some of the most strongly weighted attributes for pants, with their frequencies and weights. However, these attributes, although strongly correlated with pants, are in fact too specific and idiomatic to be a good summary, because there are very few other words with similar canonical attributes.
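The two-stage scheme described above can be sketched as follows (a minimal illustration: the function name, the dict-based representations, and the use of shared-attribute overlap as the approximate test are assumptions; the paper leaves the approximate comparison function open). The O(k) intersection of short canonical attribute sets gates the expensive O(m) measure computation on full vectors.

```python
def rank_synonyms(target, canonical, full, measure):
    """Score candidate synonyms of `target`, running the full measure only
    on terms that pass the cheap canonical-attribute overlap test.

    canonical: term -> small set of canonical attributes (length <= k)
    full:      term -> full attribute -> weight dict (length m)
    measure:   similarity function over two full attribute dicts
    """
    scored = []
    for term in full:
        if term == target:
            continue
        # Approximate O(k) test: any shared canonical attribute.
        if canonical[target] & canonical[term]:
            # Full O(m) comparison, performed for at most p candidates.
            scored.append((measure(full[target], full[term]), term))
    return sorted(scored, reverse=True)
```

With p ≪ n candidates surviving the gate, the full comparisons no longer dominate, at the cost of missing any true synonym that shares no canonical attribute with the target.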
For example, (adjective, smarty) only appears with two other terms (bun and number) in the entire corpus. The heuristic is so aggressive that too few positive approximate matches result.</Paragraph> <Paragraph position="2"> To alleviate this problem we filter the attributes so that only strongly weighted subject, direct-obj and indirect-obj relations are included in the canonical vectors. This is because, in general, they constrain the terms more and partake in fewer idiomatic collocations with the terms. So the general principle is that the most descriptive verb relations constrain the search for possible synonyms, and the other modifiers provide finer-grained distinctions used to rank possible synonyms. Figure 4 shows the 5 canonical attributes for pants. This canonical vector is a better general description of the term pants, since similar terms are likely to appear as the direct object of wear, even though it still contains the idiomatic attributes (direct-obj, wet) and (direct-obj, scare).</Paragraph> <Paragraph position="3"> One final difficulty this example shows is that attributes like (direct-obj, get) are not informative. We know this because (direct-obj, get) appears with 8,769 different terms, which means the algorithm may perform a large number of unnecessary full comparisons, since (direct-obj, get) could be a canonical attribute for many terms. To avoid this problem, we apply a maximum cutoff on the number of terms the attribute appears with.</Paragraph> <Paragraph position="4"> With limited experimentation, we have found that TTESTLOG is the best weight function for selecting canonical attributes. This may be because the extra log2(f(w, r, w′) + 1) factor encodes the desired bias towards relatively frequent canonical attributes.
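Canonical attribute selection, combining the relation filter and the maximum cutoff described above, might look like this (the function name, the (relation, word) tuple encoding, and the pre-computed weights are illustrative assumptions; the weights would come from TTESTLOG in the actual system):

```python
def select_canonical(weighted_attrs, attr_term_counts, k=100,
                     max_cutoff=10000,
                     relations=("subject", "direct-obj", "indirect-obj")):
    """Pick the k most strongly weighted attributes, restricted to the verb
    relations that constrain terms most, and dropping attributes that
    co-occur with too many terms to be informative.

    weighted_attrs:   (relation, word) -> weight, for one thesaurus term
    attr_term_counts: (relation, word) -> number of terms it appears with
    """
    candidates = [
        (weight, attr) for attr, weight in weighted_attrs.items()
        if attr[0] in relations
        and attr_term_counts.get(attr, 0) <= max_cutoff
    ]
    candidates.sort(reverse=True)
    return {attr for _, attr in candidates[:k]}
```

In the pants example, this filter would drop (adjective, smarty) by relation type and (direct-obj, get) by the maximum cutoff, while keeping (direct-obj, wear).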
If a canonical attribute is shared by the two terms, then our algorithm performs the full comparison.</Paragraph> <Paragraph position="5"> Figure 5 shows system performance and speed as the canonical vector size is increased, with the maximum cutoff at 4,000, 8,000, and 10,000. As an example, with a maximum cutoff of 10,000 and a canonical vector size of 70, the total DIRECT score of 1841 represents a 3.9% performance penalty over full extraction, for an 89% reduction in execution time. Table 7 presents the example term results using the techniques we have described: the JACCARD measure and TTEST weight functions; a minimum cutoff of 30; and the approximation algorithm with a canonical vector size of 100 and TTESTLOG weighting. The BIG columns show the previous measure results if we returned 10,000 synonyms, and MAX gives the results for a comparison of the gold standard against itself.</Paragraph> </Section> </Paper>