WordNet-based Text Document Clustering

5 Results and Evaluation

The results are presented as one graph per corpus, showing the entropy, purity and overall similarity values for each of the configurations listed in Table 2. The different configuration settings are listed on the X-axis. On the right-hand side, hype refers to the hypernym depth, syn to whether synonyms were included, pos to the presence or absence of PoS tags, and clusters to the number of clusters created. For improved readability, lines split each graph into three sections, one for each number of clusters. For the experiments on the corpora 'reut-max20' and 'reut-max50', the values in the graphs are the average of three test runs, whereas for the corpora 'reut-min15-max20' and 'reut-max100', the values are those obtained from a single test run.

The Y-axis indicates the numerical values for each of the measures. Note that the values for purity and similarity are proportions and thus lie between 0 and 1. For those two measures, higher values indicate better quality; high entropy values, on the other hand, indicate lower quality. Entropy values are always greater than 0 and, for the particular experiments carried out, never exceed 1.3.

In analysing the test results, the main focus is on the data for the corpora 'reut-max20' and 'reut-max50', shown in Figure 3 and Figure 4, respectively. This data is more reliable because it is the average of repeated test runs. Figures 6-7 show the test data obtained from clustering the corpora 'reut-min15-max20' and 'reut-max100', respectively.

The fact that the purity and similarity values are far from 100 percent is not unusual. In many cases, not even human annotators agree on how to categorise a particular document (Hotho et al., 2003a). More importantly, the number of categories is not adjusted to the number of labels present in a corpus, which makes complete agreement impossible.

All three measures indicate that quality increases with the number of clusters. The graph in Figure 5 illustrates this for the entropy in 'reut-max50'. For any given configuration, the decrease in entropy appears almost constant as the number of clusters increases. This is easily explained by the average cluster sizes, which shrink as the number of clusters grows: when clusters are smaller, the probability of having a high percentage of documents with the same label in a cluster increases. This becomes obvious for very small clusters. For instance, the minimum purity value for a cluster containing three documents is 33 percent, for two documents it is 50 percent, and, in the extreme case of a single document per cluster, purity is always 100 percent.
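The paper does not spell out its exact formulas for these measures. The sketch below uses the standard definitions (cluster purity as the majority-label fraction, cluster entropy over the within-cluster label distribution, both averaged with cluster-size weights), which are consistent with the behaviour described above; the toy labels are invented for illustration.

```python
import math
from collections import Counter

def cluster_purity(labels):
    """Fraction of documents in the cluster that carry its majority label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def cluster_entropy(labels):
    """Entropy of the label distribution inside one cluster.

    Natural log is assumed here; a different base only rescales the values."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def weighted_average(clusters, measure):
    """Average a per-cluster measure, weighting each cluster by its size."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * measure(c) for c in clusters)

# Toy example with Reuters-style labels: a three-document cluster can do
# no worse than 1/3 purity, and a singleton cluster is always 100% pure.
clusters = [["acq", "acq", "earn"], ["earn", "earn"], ["crude"]]
print(weighted_average(clusters, cluster_purity))   # 0.833...
print(weighted_average(clusters, cluster_entropy))  # ~0.318
```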
The PoS Only experiment yields performance very similar to the Baseline, sometimes slightly better and sometimes slightly worse. This is expected; the experiment is included to allow a more accurate interpretation of the subsequent experiments using synonyms and hypernyms.

A more interesting observation is that the purity and entropy values indicate better clusters for Baseline than for any of the configurations using background knowledge from WordNet (i.e., Syns, Hyper 5 and Hyper All). One possible conclusion is that adding background knowledge is not helpful at all. However, the relatively poor performance could also be due to the way the experiments are set up. A possible explanation for these results is that the benefit of the extra overlap between documents, which the added synonyms and hypernyms should provide, is outweighed by the additional noise they create. WordNet often provides five or more senses for a word, which means that for each correct sense a number of incorrect senses are added, even if the PoS tags eliminate some of them.

The overall similarity measure gives a different indication: its values appear to increase when background knowledge is included, especially when hypernyms are added. Overall similarity is the weighted average of the intra-cluster similarities of all clusters, so intra-cluster similarity actually increases with the added information. As similarity grows with additional overlap, the overall similarity measure shows that additional overlap is indeed achieved. The main problem with adding all synonyms and all hypernyms to the document vectors seems to be the added noise. The expectation that tf-idf weighting would take care of these quasi-random new concepts is not met, but the results also indicate possible improvements to this approach.

If a word sense disambiguation strategy were used, the correct sense of a word could be chosen and only the hypernyms for that sense taken into account. This should drastically reduce the noise, and the benefit of the added 'correct' concepts would then probably improve cluster quality. Hotho et al. (2003a) experimented successfully with simple disambiguation strategies, e.g., using only the first sense provided by WordNet.

As an alternative to word-by-word disambiguation, a strategy to disambiguate based on document vectors could be devised: after adding all alternative senses of the terms, the least frequent ones could be removed. This is similar to pruning, but would be done on a document-by-document basis rather than globally on the whole corpus. The idea behind it is that only concepts appearing repeatedly in a document contribute significantly to its meaning. It is important that this is done before hypernyms are added, especially when all levels of hypernyms are added, because the most general terms are bound to appear more often than the more specific ones; otherwise, the result would be many very similar but meaningless bags of words or bags of concepts.

Comparing Syns, Hyper 5 and Hyper All with each other, Hyper 5 gives the best results in many cases. A possible explanation is again the equilibrium between the valuable information and the noise added to the vector representations. These results suggest that there is a point where the amount of added information reaches its maximum benefit; adding more knowledge beyond it decreases cluster quality again. It should be noted that a fixed threshold for the number of hypernym levels used is unlikely to be optimal for all words. Instead, a more refined approach could set this threshold as a function of the semantic distance (Resnik and Yarowsky, 2000; Stetina, 1997) between the word and its hypernyms.
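The paper does not include its expansion code, but the Syns / Hyper 5 / Hyper All configurations can be read roughly as in the following sketch, which assumes NLTK's WordNet interface. Because no disambiguation is performed, every sense of a term contributes lemmas, which is the suspected source of noise discussed above.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def expand_term(word, pos=wn.NOUN, max_levels=5):
    """Collect synonyms plus hypernyms up to `max_levels` for every sense
    of `word`. max_levels=0 mimics Syns, 5 mimics Hyper 5, and a value
    large enough to reach the root mimics Hyper All."""
    concepts = set()
    for synset in wn.synsets(word, pos=pos):  # all senses: no disambiguation
        concepts.update(l.name() for l in synset.lemmas())
        frontier = [synset]
        for _ in range(max_levels):
            frontier = [h for s in frontier for h in s.hypernyms()]
            if not frontier:
                break
            for s in frontier:
                concepts.update(l.name() for l in s.lemmas())
    return concepts

# 'bank' has many noun senses; the expansion mixes concepts from all of them.
print(sorted(expand_term("bank", max_levels=5)))
```

A first-sense strategy in the spirit of Hotho et al. (2003a) would simply restrict the loop to the first synset returned by wn.synsets.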
The maximised benefit is most evident in the 'reut-max100' corpus (Figure 7). However, it needs to be kept in mind that for the last two data points, Hyper 5 and Hyper All, the pruning threshold is 200; the comparison with Syns therefore needs to be made with care. This is not much of a problem, because the other graphs consistently show that the performance for Syns is worse than for Hyper 5. Hyper 5 and Hyper All in 'reut-max100' can be compared directly, though, because the pruning threshold of 200 is used for both configurations. Surprisingly, there is a sharp drop in overall similarity from Hyper 5 to Hyper All, much more pronounced than in the other three corpora. One possible explanation could be the different structure of the corpus. It seems more probable, however, that the high pruning threshold is again the cause. Assuming that Hyper 5 seldom includes the most general concepts, whereas Hyper All always includes them, the frequency of these general concepts in Hyper All becomes so high that the frequencies of all other terms are very low in comparison. The document vectors in the Hyper All case then end up containing mostly meaningless concepts, because most of the others are pruned; cluster quality decreases because the general concepts have little discriminating power. In the corresponding experiments on the other corpora, more of the specific concepts are retained, so a better balance between general and specific concepts is maintained, keeping cluster quality higher than for corpus 'reut-max100'.

PoS Only performs similarly to Baseline, although usually a slight decrease in quality can be observed. Despite the assumption that the disambiguation achieved by the PoS tags should improve clustering results, this is clearly not the case. PoS tags only disambiguate cases where different word classes are represented by the same stem, e.g., the noun 'run' and the verb 'run'. The meanings of such pairs are in most cases related, so distinguishing between them reduces the weight of their common concept by splitting it between two features. In the worst case, the two separate features are pruned, whereas the joint concept would have contributed significantly to the document vector.
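This weight-splitting effect is easy to see on raw term counts. The toy token stream and threshold below are invented for illustration: a joint feature can clear a frequency-based pruning threshold that each PoS-tagged variant misses on its own.

```python
from collections import Counter

# Hypothetical tagged tokens from one document: 'run' occurs as both
# a noun and a verb, 'bank' only as a noun.
tokens = [("run", "VB"), ("run", "NN"), ("run", "VB"), ("bank", "NN")]

joint = Counter(word for word, tag in tokens)  # Baseline: one feature per stem
split = Counter(tokens)                        # PoS Only: (stem, tag) features

threshold = 3                                  # illustrative pruning threshold
print(joint["run"] >= threshold)               # True  -> joint 'run' is kept
print(split[("run", "VB")] >= threshold,       # False -> each tagged variant
      split[("run", "NN")] >= threshold)       # False    is pruned separately
```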