<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2013"> <Title>WordNet-based Text Document Clustering</Title> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 The Corpus </SectionTitle> <Paragraph position="0"> Here we look at what kind of corpus is required to assess the quality of clusters, and present our choice, the Reuters-21578 test collection.</Paragraph> <Paragraph position="1"> This is followed by a discussion of the ways sub-corpora can be extracted from the whole corpus in order to address some of the problems of the Reuters corpus.</Paragraph> <Paragraph position="2"> A corpus useful for evaluating text document clustering needs to be annotated with class or category labels. This is not a straightforward task, as even human annotators sometimes disagree on which label to assign to a specific document. All results therefore depend on the quality of annotation. It is unrealistic to aim at high rates of agreement with regard to the corpus, and any evaluation should rather focus on the relative comparison of the results achieved by different experiment setups and configurations.</Paragraph> <Paragraph position="3"> Due to the aforementioned difficulty of agreeing on a categorisation and the lack of a definition of 'correct' classification, no standard corpora for evaluation of clustering techniques exist. Still, although not standardised, a number of pre-categorised corpora are available. Apart from various domain-specific corpora with class annotations, there is the Reuters-21578 test collection (Lewis, 1997b), which consists of 21578 newswire articles from 1987.</Paragraph> <Paragraph position="4"> The Reuters corpus is chosen for use in the experiments of this project for four reasons.</Paragraph> <Paragraph position="5"> 1. Its domain is not specific, so it can be understood by a non-expert.</Paragraph> <Paragraph position="6"> 2. 
WordNet, an ontology that is not tailored to a specific domain, would not be effective for domains with a very specific vocabulary.</Paragraph> <Paragraph position="7"> 3. It is freely available for download.</Paragraph> <Paragraph position="8"> 4. It has been used in comparable studies before (Hotho et al., 2003b).</Paragraph> <Paragraph position="9"> On closer inspection of the corpus, there remain some problems to solve. First of all, only about half of the documents are annotated with category labels. On the other hand, some documents are attributed to multiple categories, meaning that categories overlap. Some confusion seems to have been caused in the research community by the fact that there is a TOPICS attribute in the SGML, the value of which is set to either YES or NO (or BYPASS). However, this does not correspond to the values observed within the TOPICS tag; sometimes categories can be found even if the TOPICS attribute is set to NO, and sometimes there are no categories assigned even if the attribute indicates YES.</Paragraph> <Paragraph position="10"> Lewis explains that this is not an error in the corpus, but has to do with the evolution of the corpus and is kept for historical reasons (Lewis, 1997a).</Paragraph> <Paragraph position="11"> Therefore, to prepare a base corpus, the TOPICS attribute is ignored and all documents that have precisely one category assigned to them are selected. Additionally, all documents with an empty document body are discarded. This results in the corpus 'reut-base' containing 9446 documents. The distribution of category sizes in 'reut-base' is shown in Figure 1. It illustrates that there are a few categories occurring extremely often; in fact, the two biggest categories contain about two thirds of all documents in the corpus. This unbalanced distribution would blur test results, because even 'random clustering' would potentially obtain purity values of 30% and more only due to the contribution of the two main categories. 
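As a rough illustration of the purity argument, the following Python sketch computes purity for a random assignment over a skewed label distribution. The category names, the category sizes, the number of clusters and the `purity` helper are hypothetical and are not taken from the Reuters data or from the original experiments; they only mimic a corpus in which the two biggest categories hold about two thirds of all documents.

```python
from collections import Counter
import random


def purity(labels, clusters):
    """Fraction of documents that carry the majority label of their cluster."""
    by_cluster = {}
    for label, cluster in zip(labels, clusters):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


# Hypothetical skewed corpus (assumption): 1000 documents, two large
# categories holding two thirds of them, and 20 small categories.
labels = (
    ["big_a"] * 400
    + ["big_b"] * 267
    + ["small_%d" % (i % 20) for i in range(333)]
)

# Random clustering into 16 clusters, seeded for repeatability.
rng = random.Random(0)
random_clusters = [rng.randrange(16) for _ in labels]

print("purity of random clustering:", round(purity(labels, random_clusters), 2))
print("share of largest category:", 400 / len(labels))
```

Because the majority label of every cluster counts as correct, purity can never fall below the share of the largest category, so a skewed corpus yields a substantial score even for arbitrary clusters.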
</Paragraph> <Paragraph position="13"> 'reut-base' (only selected categories are listed).</Paragraph> <Paragraph position="14"> Similar to Hotho et al. (2003b), we get around this problem by deriving new corpora from the base corpus. Their maximum category size is reduced to 20, 50 and 100 documents respectively, giving the corpora 'reut-max20', 'reut-max50' and 'reut-max100'. Categories containing more documents are not excluded; instead, they are reduced in size to comply with the defined maximum, i.e., all documents in excess of the maximum are removed.</Paragraph> <Paragraph position="15"> Creating derived corpora has the further advantage of reducing the size of the corpora and thus the computational requirements for the test runs.</Paragraph> <Paragraph position="16"> Also, tests can be run on more and less homogeneous corpora (homogeneous with regard to cluster size, that is), which can give an idea of how the method performs under different conditions. For this purpose in particular, a fourth, extremely homogeneous test corpus, 'reut-min15-max20', is derived. It is like the 'reut-max20' corpus, but all categories containing fewer than 15 documents are entirely removed. The 'reut-min15-max20' corpus is thus the most homogeneous test corpus, with a standard deviation in cluster size of only 0.7 documents.</Paragraph> <Paragraph position="17"> A summary of the derived test corpora is shown in Table 1, including the number of documents they contain, i.e. their size, the average category size and the standard deviation. Figure 2 shows the distribution of categories within the derived corpora graphically.</Paragraph> <Paragraph position="18"> Base: stopword removal, stemming, pruning and tfidf weighting are performed; PoS tags are stripped.</Paragraph> <Paragraph position="19"> PoS: PoS tags are kept attached to the words.</Paragraph> <Paragraph position="20"> Syns: all senses of a word are included using synset offsets. Hyper: hypernyms to the specified depth are included. 
Empty fields indicate 'no' or '0'.</Paragraph> <Paragraph position="21"> Baseline: The first configuration setting is used to get a baseline for comparison. All basic preprocessing techniques are used, i.e.</Paragraph> <Paragraph position="22"> stopword removal, stemming, pruning and tfidf weighting.</Paragraph> <Paragraph position="24"> PoS Only: Identical to the baseline, but the PoS tags are not removed.</Paragraph> <Paragraph position="25"> Syns: In addition to the previous configuration, all WordNet senses (synset IDs) of each PoS-tagged token are included.</Paragraph> <Paragraph position="26"> Hyper 5: Here five levels of hypernyms are included in addition to the synset IDs. Hyper All: Same as above, but all hypernyms for each word token are included.</Paragraph> <Paragraph position="27"> Each of the configurations is used to create 16, 32 and 64 clusters from each of the four test corpora. Due to the random choice of initial cluster centroids in the bisecting k-means algorithm, the mean of three test runs with the same configuration is calculated for the corpora 'reut-max20' and 'reut-max50'. Project time constraints permitted only one test run each for 'reut-max100' and 'reut-min15-max20', which nevertheless provides some additional insight. This results in 120 experiments in total: five configurations times three cluster counts times eight corpus runs (three each for 'reut-max20' and 'reut-max50', one each for 'reut-max100' and 'reut-min15-max20').</Paragraph> <Paragraph position="28"> All configurations use tfidf weighting and pruning. The pruning thresholds vary. For all experiments using the 'reut-max20' corpus, all terms occurring fewer than 20 times are pruned.</Paragraph> <Paragraph position="29"> The experiments on the corpora 'reut-max50' and 'reut-min15-max20' are carried out with a pruning threshold of 50. For the corpus 'reut-max100', the pruning threshold is set to 50 when the configurations Baseline, PoS Only or Syns are used and to 200 otherwise. This relatively high threshold is chosen in order to reduce memory requirements. 
To ensure that this inconsistency does not distort the conclusions drawn from the test data, the results of these tests are considered with great care and are explicitly referred to when used.</Paragraph> <Paragraph position="30"> Further details of this research are described in an unpublished report (Sedding, 2004).</Paragraph> </Section> </Paper>