<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0806">
  <Title>Integrating a Lexical Database and a Training Collection for Text Categorization</Title>
  <Section position="3" start_page="0" end_page="39" type="intro">
    <SectionTitle>
2 Task Description
</SectionTitle>
    <Paragraph position="0"> Given a set of documents and a set of categories, the goal of a categorization system is to decide, for each document and each category, whether the document belongs to the category. The system uses the information contained in a document to compute a degree of pertinence of the document to each category. Categories are usually subject labels like art or military, but other categories like text genres are also interesting \[Karlgren and Cutting, 1994\]. Documents can be news stories, email messages, reports, and so forth.</Paragraph>
    <Paragraph position="1"> The most widely used resource for text categorization (TC) is the training collection. A training collection is a set of manually classified documents that allows the system to infer clues on how to classify new, unseen documents.</Paragraph>
    <Paragraph position="2"> There are currently several TC test collections, from which a training subset and a test subset can be obtained. For instance, the huge TREC collection \[Harman, 1996\], OHSUMED \[Hersh et al., 1994\] and Reuters-22173 \[Lewis, 1992\] have been collected for this task. We have selected Reuters because it has been used in other work, which facilitates the comparison of results. Lexical databases have rarely been employed in TC, but several approaches have demonstrated their usefulness for term classification operations like word sense disambiguation \[Resnik, 1995; Agirre and Rigau, 1996\]. A lexical database is a reference system that accumulates information on the lexical items of one or several languages. In this view, machine readable dictionaries can also be regarded as primitive lexical databases. Current lexical databases include WordNet \[Miller, 1995\], EDR \[Yokoi, 1995\] and Roget's Thesaurus. WordNet's large coverage and frequent use have led us to use it for our experiments.</Paragraph>
    <Paragraph position="3"> We organize our work according to the kind and number of resources involved. First, we have tested a direct approach in which the categories themselves are the only terms used in the representation. Secondly, WordNet by itself has been used to increase the number of terms and thus the amount of predictive information. Thirdly, we have made use of the training subset of Reuters to obtain the category representatives. Finally, we have employed both WordNet and Reuters to get a better representation of undertrained</Paragraph>
  </Section>
</Paper>