<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0806">
  <Title>Integrating a Lexical Database and a Training Collection for Text Categorization</Title>
  <Section position="8" start_page="42" end_page="42" type="relat">
    <SectionTitle>
5 Related Work
</SectionTitle>
    <Paragraph position="0"> Text categorization has emerged as a very active field of research in recent years. Many studies have been conducted to test the accuracy of training methods, although much less work has been done on lexical database methods. However, lexical databases, and especially WordNet, have often been used for other text classification tasks, such as word sense disambiguation. Many different algorithms making use of a training collection have been used for TC, including k-nearest-neighbor algorithms \[Masand et al., 1992\], Bayesian classifiers \[Lewis, 1992\], learning algorithms based on relevance feedback \[Lewis et al., 1996\] or on decision trees \[Apte et al., 1994\], and neural networks \[Wiener et al., 1995\]. Apart from Lewis \[1992\], the closest approach to ours is that of Larkey and Croft \[1996\], who combine k-nearest-neighbor, Bayesian independent and relevance feedback classifiers, showing improvements over the separate approaches.</Paragraph>
    <Paragraph position="1"> Although they do not make use of several resources, their approach tends to increase the information available to the system, in the spirit of our hypothesis.</Paragraph>
    <Paragraph position="2"> To our knowledge, lexical databases have been used only once in TC. Hearst \[1994\] adapted a disambiguation algorithm by Yarowsky using WordNet to recognize category occurrences. In that work, categories are made of WordNet terms, which is not generally the case for standard or user-defined categories. It is hard to adapt WordNet subsets to pre-existing categories, especially when they are domain dependent. Hearst's approach shows promising results, confirmed by the fact that our WordNet-based approach performs at least as well as a simple training approach.</Paragraph>
    <Paragraph position="3"> Lexical databases have recently been employed in word sense disambiguation. For example, Agirre and Rigau \[1996\] make use of a semantic distance that takes into account structural factors in WordNet to achieve good results on this task. Additionally, Resnik \[1995\] combines WordNet and a text collection to define a distance for disambiguating noun groupings. Although the text collection is not a training collection (in the sense of a collection of manually labeled texts for a pre-defined text processing task), his approach can be regarded as the most similar to ours in the disambiguation task. Finally, Ng and Lee \[1996\] make use of several sources of information inside a training collection (neighborhood, part of speech, morphological form, etc.) to get good results in disambiguating unrestricted text.</Paragraph>
    <Paragraph position="4"> We can see, then, that combining resources in TC is a new and promising approach supported by previous research in this and other text classification operations. With more information extracted from WordNet and better training algorithms, automatic TC integrating several resources could compete with manual indexing in quality, and beat it in cost and efficiency.</Paragraph>
  </Section>
</Paper>