<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1104">
  <Title>A Differential LSI Method for Document Classification</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper introduces a new efficient supervised document classification procedure, whereby given a number of labeled documents preclassified into a finite number of appropriate clusters in the database, the classifier developed will select and classify any of new documents introduced into an appropriate cluster within the learning stage.</Paragraph>
    <Paragraph position="1"> The vector space model is widely used in document classification, where each document is represented as a vector of terms. To represent a document by a document vector, we assign weights to its components usually evaluating the frequency of occurrences of the corresponding terms. Then the standard pattern recognition and machine learning methods are employed for document classification(Li et al., 1991; Farkas, 1994; Svingen, 1997; Hyotyniemi, 1996; Merkl, 1998; Benkhalifa et al., 1999; Iwayama and Tokunaga, 1995; Lam and Low, 1997; Nigam et al., 2000).</Paragraph>
    <Paragraph position="2"> In view of the inherent flexibility imbedded within any natural language, a staggering number of dimensions seem required to represent the featuring space of any practical document comprising the huge number of terms used. If a speedy classification algorithm can be developed (Sch&amp;quot;utze and Silverstein, 1997), the first problem to be resolved is the dimensionality reduction scheme enabling the documents' term projection onto a smaller subspace.</Paragraph>
    <Paragraph position="3"> Like an eigen-decomposition method extensively used in image processing and image recognition (Sirovich and Kirby, 1987; Turk and Pentland, 1991), the Latent Semantic Indexing (LSI) method has proved to be a most efficient method for the dimensionality reduction scheme in document analysis and extraction, providing a powerful tool for the classifier (Sch&amp;quot;utze and Silverstein, 1997) when introduced into document retrieval with a good performance confirmed by empirical studies (Deerwester et al., 1990; Berry et al., 1999; Berry et al., 1995).The LSI method has also demonstrated its efficiency for automated cross-language document retrieval in which no query translation is required (Littman et al., 1998).</Paragraph>
    <Paragraph position="4"> In this paper, we will show that exploiting both of the distances to, and the projections onto, the LSI space improves the performance as well as the robustness of the document classifier. To do this, we introduce, as the major vector space, the differential LSI (or DLSI) space which is formed from the differences between normalized intra- and extra-document vectors and normalized centroid vectors of clusters where the intra- and extra-document refers to the documents included within or outside of the given cluster respectively. The new classifier sets up a Baysian posteriori probability function for the differential document vectors based on their projections on DLSI space and their distances to the DLSI space, the document category with a highest probability is then selected. A similar approach is taken by Moghaddam and Pentland for image recognition (Moghaddam and Pentland, 1997; Moghaddam et al., 1998).</Paragraph>
    <Paragraph position="5"> We may summarize the specific features introduced into the new document classification scheme based on the concept of the differential document vector and the DLSI vectors: 1. Exploiting the characteristic distance of the differential document vector to the DLSI space and the projection of the differential document onto the DLSI space, which we believe to denote the differences in word usage between the document and a cluster's centroid vector, the differential document vector is capable of capturing the relation between the particular docu- null ment and the cluster.</Paragraph>
    <Paragraph position="6"> 2. A major problem of context sensitive semantic grammar of natural language related to synonymy and polysemy can be dampened by the major space projection method endowed in the LSIs used.</Paragraph>
    <Paragraph position="7"> 3. A maximum for the posteriori likelihood func null tion making use of the projection of differential document vector onto the DLSI space and the distance to the DLSI space provides a consistent computational scheme in evaluating the degree of reliability of the document belonging to the cluster.</Paragraph>
    <Paragraph position="8"> The rest of the paper is arranged as follows: Section 2 will describe the main algorithm for setting up the DLSI-based classifier. A simple example is computed for comparison with the results by the standard LSI based classifier in Section 3. The conclusion is given in Section 4.</Paragraph>
  </Section>
class="xml-element"></Paper>