<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1090">
<Title>Taxonomy learning - factoring the structure of a taxonomy into a semantic classification decision</Title>
<Section position="5" start_page="3" end_page="3" type="evalu">
<SectionTitle> 6. Results </SectionTitle>
<Paragraph position="0"> We first conducted experiments evaluating the performance of the three standard classifiers. To determine the best version of each classifier, we varied only those parameters that, as described above, we deemed critical in the thesaurus augmentation setting.</Paragraph>
<Paragraph position="1"> To examine how the accuracy of the algorithms relates to the amount of distributional data available for the target word, all words of the thesaurus were divided into three groups depending on the amount of corpus data available for them (see Table 1). The amount of distributional data for a word (the &quot;frequency&quot; in the left column) is the sum of the frequencies of its context words.</Paragraph>
<Paragraph position="2"> The results of the evaluation of the methods are summarized in the tables below. Rows specify the measures used to determine distributional similarity (JC for Jaccard's coefficient, L1 for the L1 distance, and SD for the skew divergence), and columns specify frequency ranges. Each cell reports the average of direct+near hits / the average of direct hits, computed over the words of a particular frequency range and over all words of the thesaurus. The statistical significance of the results was measured with the one-tailed chi-square test.</Paragraph>
<Paragraph position="3"> kNN. The method was evaluated with k=1, 3, 5, 7, 10, 15, 20, 25, and 30. Classification accuracy increased with k; however, beyond k=15 further increases of k yielded only insignificant improvement. Table 2 describes the results of evaluating kNN with 30 nearest neighbors, which was found to be the best version of kNN.</Paragraph>
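The paper gives no pseudocode for the similarity measures or for the kNN classifier, so the following is only a minimal sketch of how the three distributional similarity measures (JC, L1, SD) and a kNN vote over context-word frequency vectors might be implemented. The dict-based vector representation, the function names, and the alpha=0.99 skew parameter are illustrative assumptions, not details taken from the paper.

# Minimal sketch (not the authors' implementation): the three distributional
# similarity measures used in the evaluation and a kNN vote over the most
# similar labelled thesaurus words. Context vectors are assumed to be dicts
# mapping context words to their co-occurrence frequencies.
from collections import Counter
from math import log

def normalize(v):
    """Turn a frequency vector into a probability distribution."""
    total = sum(v.values())
    return {t: f / total for t, f in v.items()} if total else {}

def jaccard(v, w):
    """Weighted Jaccard coefficient (JC): larger means more similar."""
    keys = set(v) | set(w)
    num = sum(min(v.get(t, 0.0), w.get(t, 0.0)) for t in keys)
    den = sum(max(v.get(t, 0.0), w.get(t, 0.0)) for t in keys)
    return num / den if den else 0.0

def l1_distance(v, w):
    """L1 distance between the normalized vectors: smaller means more similar."""
    p, q = normalize(v), normalize(w)
    return sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in set(p) | set(q))

def skew_divergence(v, w, alpha=0.99):
    """Skew divergence: KL(p || alpha*q + (1-alpha)*p); smaller means more
    similar. alpha=0.99 is an assumed, commonly used setting."""
    p, q = normalize(v), normalize(w)
    return sum(f * log(f / (alpha * q.get(t, 0.0) + (1.0 - alpha) * f))
               for t, f in p.items())

def knn_classify(target_vec, labelled, k=30, sim=jaccard, larger_is_closer=True):
    """Assign the target word the thesaurus class receiving most votes among
    its k distributionally nearest labelled words."""
    ranked = sorted(labelled, key=lambda item: sim(target_vec, item[0]),
                    reverse=larger_is_closer)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

For the distance-like measures the neighbor ranking is reversed, e.g. knn_classify(vec, labelled, k=30, sim=l1_distance, larger_is_closer=False); for JC the default (larger similarity = closer neighbor) applies.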
<Paragraph position="4"> Category-based method. To determine the best version of this method, we experimented with the number of levels of hyponyms below a concept that were used to build a class vector. The best results were achieved when a class was represented by data from its hyponyms at most three levels below it (Table 3).</Paragraph>
<Paragraph position="5"> Centroid-based method. As with the category-based method, we varied the number of levels of hyponyms below the candidate concept. Table 4 details the results of evaluating the best version of this method (a class represented by three levels of its hyponyms).</Paragraph>
<Paragraph position="6"> Comparing the three algorithms, we see that overall kNN and the category-based method exhibit comparable performance, except when similarity is measured by the L1 distance, where the category-based method outperforms kNN by a margin of about 5 points (statistically significant, p&lt;0.001). However, their performance differs across frequency ranges: for lower frequencies kNN is more accurate (e.g., for the L1 distance, p&lt;0.001), whereas for higher frequencies the category-based method improves on kNN (L1, p&lt;0.001). The centroid-based method performed worse than both kNN and the category-based method.</Paragraph>
<Paragraph position="7"> Tree descending algorithm. In experiments with this algorithm, candidate classes were represented as in the category-based method, using three levels of hyponyms, which proved to be the best generalized representation of a class in the previous experiments. Table 5 specifies the results of its evaluation.</Paragraph>
<Paragraph position="8"> Its performance turns out to be much worse than that of the standard methods. Both the direct+near and direct hits scores are surprisingly low; for the 0-40 and 40-500 ranges they are much lower than chance. This can be explained by the fact that some of the top concepts in the tree are represented by much less distributional data than others. For example, fewer than 10 words lexicalize the top concepts MASS_CONCEPT and MATHEMATICAL_CONCEPT together with all of their hyponyms (compared to more than 150 words lexicalizing THING and its hyponyms up to three levels below it). As a result, at the very beginning of the search down the tree, a very large portion of the test words was found to be similar to such concepts.</Paragraph>
<Paragraph position="9"> Tree ascending algorithm. The experiments were conducted with the same numbers of nearest neighbors as for kNN. Table 6 describes the results of evaluating the best version (formula 3, k=15).</Paragraph>
<Paragraph position="10"> Table 6. Tree ascending algorithm, votes totaled according to (3), k=15.</Paragraph>
<Paragraph position="11"> There is no statistically significant improvement on kNN overall or in any of the frequency ranges. The algorithm favored more general (upper) concepts and thus produced about half as many direct hits as kNN. At the same time, its direct+near hits score was on par with that of kNN. The algorithm thus produced many more near hits than kNN, which can be interpreted as a better ability to choose a superconcept of the correct class. Based on this observation, we combined the best version of the tree ascending algorithm with kNN into one algorithm in the following manner. First, the tree ascending algorithm was used to determine a superconcept of the class of the new word and thus to narrow down the search space. Then, kNN was applied to pick a likely class from among the hyponyms of the concept determined by the tree ascending method. Table 7 specifies the results of evaluating the proposed algorithm.</Paragraph>
<Paragraph position="12"> The combined algorithm improved on both kNN and the tree ascending method by 1 to 3 points of direct+near hits in every frequency range and overall (except for the 40-500 range with L1). The improvement was statistically significant only for L1, &quot;>500&quot; (p=0.05), and for L1 overall (p=0.011). For the other similarity measures and frequency ranges it was insignificant (e.g., for JC overall, p=0.374; for SD overall, p=0.441). The algorithm did not improve on kNN in terms of direct hits. The hits scores set in bold in Table 7 are those which are higher than the corresponding kNN scores for the same frequency range and similarity measure.</Paragraph>
</Section>
</Paper>