<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2604"> <Title>Basque Country ccpzejaa@si.ehu.es Iñaki Alegria UPV-EHU Basque Country acpalloi@si.ehu.es Olatz Arregi UPV-EHU Basque Country acparuro@si.ehu.es</Title> <Section position="3" start_page="25" end_page="26" type="relat"> <SectionTitle> 2 Related Work </SectionTitle>
<Paragraph position="0"> As mentioned in the introduction, text categorization consists of assigning predefined categories to text documents. In the past two decades, document categorization has received much attention, and a considerable number of machine-learning-based approaches have been proposed. A good tutorial on the state of the art of document categorization techniques can be found in (Sebastiani, 2002).</Paragraph>
<Paragraph position="1"> In the document categorization task we can distinguish two cases: (1) the multilabel case, in which categories are not mutually exclusive because the same document may be relevant to more than one category (1 to m category labels may be assigned to the same document, where m is the total number of predefined categories), and (2) the single-label case, in which exactly one category is assigned to each document. While most machine learning systems are designed to handle multi-class data (i.e., categorization problems with more than two possible categories), systems that can handle multilabel data are much less common.</Paragraph>
<Paragraph position="2"> For experimentation purposes, there are standard document collections available in the public domain that can be used for document categorization. The most widely used is the Reuters-21578 collection, which is a multiclass (135 categories) and multilabel (the mean number of categories assigned to a document is 1.2) dataset. Many experiments have been carried out on the Reuters collection, but under different experimental conditions, which makes their results difficult to compare. In fact, effectiveness results can only be compared between studies that use the same training and testing sets.</Paragraph>
<Paragraph position="3"> In order to guide researchers toward using the same training/testing divisions, the Reuters documents have been specifically tagged, and researchers are encouraged to use one of those divisions. In our experiment we use the &quot;ModApte&quot; split (Lewis, 2004).</Paragraph>
<Paragraph position="4"> In this section, we analyze the category subsets, evaluation measures and results obtained in past and recent years for the Reuters-21578 ModApte split.</Paragraph>
<Section position="1" start_page="25" end_page="25" type="sub_section"> <SectionTitle> 2.1 Category subsets </SectionTitle>
<Paragraph position="0"> Concerning the evaluation of the classification system, we restrict our attention to the TOPICS group of categories that labels the Reuters dataset, which contains 135 categories. However, many categories appear in no document and, because inductive learning classifiers learn from training examples, these categories are consequently not usually considered at evaluation time.</Paragraph>
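As a concrete illustration of this filtering step, the following minimal sketch counts training and test documents per TOPICS category and keeps only the categories that actually occur in the training set. It is not the authors' code: it assumes NLTK's packaged reuters corpus (an ApteMod/ModApte-style distribution that already prunes part of the category set, so the counts will not match the full 135-category TOPICS group), and the variable names are our own.

```python
from collections import Counter

from nltk.corpus import reuters  # ApteMod/ModApte-style split; requires nltk.download("reuters")

# Count how many training and test documents carry each TOPICS label.
train_counts, test_counts = Counter(), Counter()
for fileid in reuters.fileids():
    counts = train_counts if fileid.startswith("training/") else test_counts
    for category in reuters.categories(fileid):  # a document may carry several labels
        counts[category] += 1

# Categories without training documents cannot be learned by an inductive
# classifier, so they are dropped before evaluation.
trainable = [c for c in reuters.categories() if train_counts[c] > 0]
print(f"{len(trainable)} of {len(reuters.categories())} categories have training documents")
```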
<Paragraph position="1"> The most widely used subsets are the following: * Top-10: the set of the 10 categories with the highest number of documents in the training set.</Paragraph>
<Paragraph position="2"> * R(90): the set of the 90 categories with at least one document in the training set and one in the testing set.</Paragraph>
<Paragraph position="3"> * R(115): the set of the 115 categories with at least one document in the training set. In order to analyze the relative hardness of these three category subsets, Debole and Sebastiani (2005) have recently carried out a systematic, comparative experimental study. The results of the classification system we propose are evaluated according to these three category subsets.</Paragraph> </Section>
<Section position="2" start_page="25" end_page="26" type="sub_section"> <SectionTitle> 2.2 Evaluation measures </SectionTitle>
<Paragraph position="0"> The evaluation of a text categorization system is usually done experimentally, by measuring its effectiveness, i.e. the average correctness of the categorization. In binary text categorization, two well-known statistics are widely used to measure this effectiveness: precision and recall. Precision (Prec) is the percentage of documents classified into a given category that indeed belong to it, and recall (Rec) is the percentage of documents belonging to a given category that are indeed classified into it.</Paragraph>
<Paragraph position="1"> In general, there is a trade-off between precision and recall. Thus, a classifier is usually evaluated by means of a measure that combines precision and recall. Various such measures have been proposed. The breakeven point, the value at which precision equals recall, was frequently used during the past decade; however, it has recently been criticized by its own proposer (Sebastiani, 2002, footnote 19). Nowadays, the F1 score is more frequently used. The F1 score combines recall and precision with equal weight in the following way:</Paragraph>
<Paragraph position="2"> F1 = (2 · Prec · Rec) / (Prec + Rec)</Paragraph>
<Paragraph position="3"> Since precision and recall are defined only for binary classification tasks, for multiclass problems the results need to be averaged to obtain a single performance value. This will be done using microaveraging and macroaveraging. In microaveraging, which is calculated by globally summing over all individual cases, categories count proportionally to the number of their positive testing examples.</Paragraph>
<Paragraph position="4"> In macroaveraging, which is calculated by averaging over the results of the different categories, all categories count the same. See (Debole and Sebastiani, 2005; Yang, 1999) for a more detailed explanation of the evaluation measures mentioned above.</Paragraph> </Section>
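To make the two averaging schemes concrete, here is a minimal, self-contained sketch (not the authors' code; the function and variable names are hypothetical) that computes micro- and macroaveraged F1 from per-category contingency counts:

```python
def f1(tp, fp, fn):
    """F1 for one contingency table; defined as 0 when a denominator vanishes."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0


def micro_macro_f1(tables):
    """tables maps each category to its (tp, fp, fn) counts over the test set."""
    # Microaveraging: sum the contingency tables globally and compute F1 once,
    # so categories with many positive test examples dominate the score.
    tp = sum(t[0] for t in tables.values())
    fp = sum(t[1] for t in tables.values())
    fn = sum(t[2] for t in tables.values())
    micro = f1(tp, fp, fn)
    # Macroaveraging: compute F1 per category and take the unweighted mean,
    # so every category counts the same regardless of its frequency.
    macro = sum(f1(*t) for t in tables.values()) / len(tables)
    return micro, macro


# Toy example: one frequent category and one rare one.
micro, macro = micro_macro_f1({"earn": (900, 50, 60), "corn": (5, 10, 15)})
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy example the frequent category dominates the microaveraged score, while the macroaveraged score is pulled down by the rare category, which is exactly the contrast the two averaging schemes are meant to expose.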
<Section position="3" start_page="26" end_page="26" type="sub_section"> <SectionTitle> 2.3 Comparative Results </SectionTitle>
<Paragraph position="0"> Sebastiani (2002) presents a table listing results of experiments for various training/testing divisions of Reuters. Although the results listed are microaveraged breakeven point measures, and are consequently not directly comparable to the F1 results we present in this paper, we want to highlight some of them.</Paragraph>
<Paragraph position="1"> In Table 1 we summarize the best results reported for the ModApte split listed by Sebastiani.</Paragraph>
<Paragraph position="2"> Table 1: Best results reported by Sebastiani for the Reuters-21578 ModApte split.</Paragraph>
<Paragraph position="3"> In Table 2 we include some more recent results, evaluated according to the microaveraged F1 score. For R(115) there is also a good result, F1 = 87.2, obtained by Zhang and Oles (2001).</Paragraph> </Section> </Section> </Paper>