<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1041"> <Title>Feature Selection and Feature Extraction for Text Categorization</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Text categorization, the automated assigning of natural language texts to predefined categories based on their content, is a task of increasing importance. Its applications include indexing texts to support document retrieval [1], extracting data from texts [2], and aiding humans in these tasks.</Paragraph> <Paragraph position="1"> The indexing language used to represent texts influences how easily and effectively a text categorization system can be built, whether the system is built by human engineering, statistical training, or a combination of the two. The simplest indexing languages are formed by treating each word as a feature. However, words have properties, such as synonymy and polysemy, that make them a less than ideal indexing language. These properties have motivated attempts to use more complex feature extraction methods in text retrieval and text categorization tasks.</Paragraph> <Paragraph position="2"> If a syntactic parse of text is available, then features can be defined by the presence of two or more words in particular syntactic relationships. We call such a feature a syntactic indexing phrase. Another strategy is to use cluster analysis or other statistical methods to detect closely related features. Groups of such features can then, for instance, be replaced by a single feature corresponding to their logical or numeric sum. This strategy is referred to as term clustering.</Paragraph> <Paragraph position="3"> Syntactic phrase indexing and term clustering have opposite effects on the properties of a text representation, which led us to investigate combining the two techniques [3]. 
However, the small size of standard text retrieval test collections, and the variety of approaches available for query interpretation, made it difficult to study purely representational issues in text retrieval experiments. In this paper we examine indexing language properties using two text categorization data sets. We obtain much clearer results and also produce a new text categorization method capable of handling multiple, overlapping categories.</Paragraph> </Section> </Paper>