<?xml version="1.0" standalone="yes"?> <Paper uid="W98-0706"> <Title>Text Classification Using WordNet Hypernyms</Title> <Section position="2" start_page="0" end_page="46" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> The task of Supervised Machine Learning can be stated as follows: given a set of classification labels C and a set of training examples E, each of which has been assigned one of the class labels from C, the system must use E to form a hypothesis that can be used to predict the class labels of previously unseen examples of the same type \[Mitchell 97\]. In machine learning systems that classify text, E is a set of labeled documents from a corpus such as Reuters-21578. The labels can signify topic headings, writing styles, or judgements as to the documents' relevance.</Paragraph> <Paragraph position="1"> Text classification systems are used in a variety of contexts, including e-mail and news filtering, personal information agents and assistants, information retrieval, and automatic indexing.</Paragraph> <Paragraph position="2"> Before a set of documents can be presented to a machine learning system, each document must be transformed into a feature vector. Typically, each element of a feature vector represents a word from the corpus. The feature values may be binary, indicating the presence or absence of the word in the document, or they may be integers or real numbers indicating some measure of the frequency of the word's appearance in the text. This text representation, referred to as the bag-of-words, is used in most typical approaches to text classification (for recent work see \[Lang 95\], \[Joachims 97\], and \[Koller & Sahami 97\]). 
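The bag-of-words transformation described above can be sketched as follows. This is a minimal illustration, not code from the paper; the function name and the binary/count switch are chosen here for clarity.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary, binary=False):
    """Map a tokenized document to a feature vector over a fixed vocabulary.

    Each vector element corresponds to one vocabulary word; values are
    presence/absence flags (binary=True) or raw term counts.
    """
    counts = Counter(tokens)
    if binary:
        return [1 if counts[w] > 0 else 0 for w in vocabulary]
    return [counts[w] for w in vocabulary]

vocab = ["wheat", "price", "export"]
doc = "wheat price rose as wheat export fell".split()
print(bag_of_words(doc, vocab))               # → [2, 1, 1]
print(bag_of_words(doc, vocab, binary=True))  # → [1, 1, 1]
```

In practice the vocabulary is the full corpus word list (minus a stop list), so these vectors have tens of thousands of dimensions — the high dimensionality the paper refers to.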
In these approaches, no linguistic processing (other than a stop list of the most frequent words) is applied to the original text.</Paragraph> <Paragraph position="3"> This paper explores the hypothesis that incorporating linguistic knowledge into the text representation can lead to improvements in classification accuracy.</Paragraph> <Paragraph position="4"> Specifically, we use part-of-speech information from the Brill tagger \[Brill 92\] and the synonymy and hypernymy relations from WordNet \[Miller 90\] to change the representation of the text from bag-of-words to hypernym density. We report results from an ongoing study in which the hypernym density representation at different heights of generalization is compared to the standard bag-of-words model. We focus on using the new representation with a particular machine learning algorithm (Ripper) that was designed with the high dimensionality of text classification tasks in mind. The question of whether our results generalize to other machine learning systems is left as future work.</Paragraph> <Paragraph position="5"> The only published study comparable to this one is \[Rodríguez et al. 97\]. Their study used WordNet to enhance neural network learning algorithms, achieving significant improvements in classification accuracy on the Reuters-21578 corpus. However, their approach made use only of synonymy and involved a manual word sense disambiguation step, whereas our approach uses both synonymy and hypernymy and is completely automatic. Furthermore, their approach took advantage of the fact that the Reuters topic headings are themselves good indicators for classification, whereas our approach makes no such assumptions. Finally, their approach to using WordNet focused on improving the specific algorithms used by neural networks while retaining the bag-of-words representation of text. 
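The hypernym-based representation mentioned above can be illustrated with a toy stand-in for WordNet. The hierarchy, the concept names, and the exact density formula below are illustrative assumptions — the paper's own formulation is given in its section 3 — but the idea is the same: score each document by how often its words fall under a given hypernym.

```python
# Toy hypernym hierarchy (child -> parent) standing in for WordNet.
# These names are illustrative, not real WordNet synsets.
HYPERNYMS = {
    "dog": "canine", "wolf": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore", "carnivore": "animal",
    "sparrow": "bird", "bird": "animal",
}

def hypernym_chain(word):
    """All ancestors of a word in the hierarchy, nearest first."""
    chain = []
    while word in HYPERNYMS:
        word = HYPERNYMS[word]
        chain.append(word)
    return chain

def hypernym_density(tokens, concepts):
    """Fraction of document tokens subsumed by each concept."""
    n = len(tokens)
    return {
        c: sum(1 for t in tokens if t == c or c in hypernym_chain(t)) / n
        for c in concepts
    }

doc = ["dog", "wolf", "sparrow", "ran"]
print(hypernym_density(doc, ["canine", "animal"]))
# → {'canine': 0.5, 'animal': 0.75}
```

Generalizing to a greater "height" means scoring more abstract ancestors (here, "animal" rather than "canine"), which trades specificity for coverage.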
Our approach looks at using WordNet to change the representation of the text itself, and thus may be applicable to a wider variety of machine learning systems.</Paragraph> <Paragraph position="7"> The paper proceeds as follows. In section 2 we present the data sets that we work with, the classification tasks defined on this data, and some initial experiments with the Ripper learning system. Section 3 discusses the new hypernym density representation. Section 4 presents experimental results using both bag-of-words and hypernym density and discusses the accuracy and comprehensibility of the rules learned by Ripper. Finally, section 5 presents the conclusion and future work.</Paragraph> <Section position="1" start_page="45" end_page="46" type="sub_section"> <SectionTitle> 2. Preliminaries: the Corpora, Classification Tasks, and Learning Algorithm </SectionTitle> <Paragraph position="0"> The classification tasks used in this study are drawn from three different corpora: Reuters-21578, USENET, and the Digital Tradition (DigiTrad). Both Reuters and USENET have been the subject of previous studies in machine learning (see \[Koller & Sahami 97\] for a study of Reuters and \[Weiss et al. 96\] for a study of USENET).</Paragraph> <Paragraph position="1"> In keeping with previous studies, we used topic headings as the basis for the Reuters classification tasks and newsgroup names as the basis for the USENET tasks. The third corpus, DigiTrad, is a public-domain collection of 6500 folk song lyrics \[Greenhaus 96\]. To aid searching, the owners of DigiTrad have assigned to each song one or more key words from a fixed list.</Paragraph> <Paragraph position="2"> Some of these key words capture information on the origin or style of the songs (e.g. &quot;Irish&quot; or &quot;British&quot;) while others relate to subject matter (e.g. &quot;murder&quot; or &quot;marriage&quot;). 
The latter type of key words served as the basis for the classification tasks in this study.</Paragraph> <Paragraph position="3"> Not all types of text are equally difficult to classify.</Paragraph> <Paragraph position="4"> Reuters consists of articles written purely as a source of factual information. The writing style tends to be direct and to the point, and uses a restricted vocabulary to aid quick comprehension. It has been observed that the topic headings in Reuters tend to consist of words that appear frequently in the text, and this observation has been exploited to help improve classification accuracy \[Rodríguez et al. 97\].</Paragraph> <Paragraph position="5"> DigiTrad and USENET are good examples of the opposite extreme. The texts in DigiTrad make heavy use of metaphoric, rhyming, unusual and archaic language. Often the lyrics do not explicitly state what a song is about. Contributors to USENET often vary in their use of terminology, stray from the topic, or use unusual language. All of these qualities tend to make subject-based classification tasks from DigiTrad and USENET more difficult than those of a comparable size from Reuters.</Paragraph> <Paragraph position="6"> From the three corpora described above, six binary classification tasks were defined, as shown in Table 1. The tasks were chosen to be roughly the same size, and to cover cases in which the classes seemed to be semantically related (REUTER2 and USENET2) as well as cases in which the classes seemed unrelated (REUTER1 and USENET1). In all cases the classes were made completely disjoint by removing any overlapping examples.1 The machine learning algorithm chosen for this study was Ripper, a rule-based learner developed by William Cohen \[Cohen 95\]. Ripper was specifically designed to handle the high dimensionality of bag-of-words text classification by being fast and by using set-valued features \[Cohen 96\]. 
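The error rates reported next are estimated by n-fold cross-validation, defined in footnote 2. A generic sketch of that procedure follows; `train_fn` and the majority-class learner are illustrative stand-ins, not Ripper.

```python
import random

def cross_validation_error(examples, labels, train_fn, n=10, seed=0):
    """Estimate error rate by n-fold cross-validation: split the data into
    n partitions; on run k, train on all partitions except k and test on
    partition k; return the mean test error rate over the n runs."""
    data = list(zip(examples, labels))
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]
    errors = []
    for k in range(n):
        test = folds[k]
        train = [ex for j, fold in enumerate(folds) if j != k for ex in fold]
        classify = train_fn(train)  # train_fn returns a predict function
        wrong = sum(1 for x, y in test if classify(x) != y)
        errors.append(wrong / len(test))
    return sum(errors) / n

# A trivial learner that always predicts the majority training label:
def majority_learner(train):
    train_labels = [y for _, y in train]
    majority = max(set(train_labels), key=train_labels.count)
    return lambda x: majority
```

On a single-class data set the estimate is 0.0, as expected; with a real learner the mean over the n held-out partitions approximates the true error rate, as footnote 2 explains.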
Table 1 shows that our intuitions about the difficulty of the three corpora for bag-of-words classification hold in the case of the Ripper algorithm. Error rates over 10-fold cross-validation2 for the Reuters tasks were under 5%, while error rates for the other tasks ranged from approximately 19% to 38%. We believe that, with the growing applications of text classification on the Internet, the kinds of texts to be automatically classified will likely share many features with the kinds of texts that are difficult for the bag-of-words approach.</Paragraph> <Paragraph position="7"> It is worth noting that classification tasks that are difficult for Ripper are not necessarily difficult for humans. We classified 200 examples from each of the SONG1 and SONG2 tasks by hand (with no special training phase) and compared our classifications to those from DigiTrad.</Paragraph> <Paragraph position="8"> 1 USENET articles that were cross-posted or tagged as follow-ups were excluded so that the remaining articles reflected a wide variety of attempts to launch discussions within the given topics. Non-text objects such as uuencoded bitmaps were also removed from the postings.</Paragraph> <Paragraph position="9"> 2 In n-fold cross-validation the articles in the corpus are split into n partitions. The learning algorithm is then executed n times. On the k-th run, partition k is used as the testing set and all the other partitions make up the training set. The mean error rate (percentage of the testing set wrongly classified) over the n runs is taken as an approximate measure of the real error rate of the system on the given corpus.</Paragraph> <Paragraph position="10"> [Table 1 itself is not reproduced here.] Table 1: the classification tasks discussed in this paper. &quot;Size&quot; refers to the total number of texts in each task. &quot;Balance&quot; shows the number of examples in each class. &quot;Words&quot; shows the average length of the documents in each task. 
&quot;Error&quot; shows the average percentage error rate for each task using Ripper with bag-of-words and 10-fold cross-validation. The error rates were approximately 1% for SONG1 and 4% for SONG2. Clearly, the background knowledge and linguistic competence that humans bring to a classification task enable us to overcome the difficulties posed by the text itself.</Paragraph> </Section> </Section> </Paper>