<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2172">
  <Title>DOCUMENT CLASSIFICATION BY MACHINE: Theory and Practice</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
TOPICAL PAPER,
Subject Area: TEXT PROCESSING
1 Introduction
</SectionTitle>
    <Paragraph position="0"> A problem of considerable interest in Computational Linguistics is that of classifying documents via computer processing [Hayes, 1992; Lewis, 1992; Walker and Amsler, 1986]. Simply put, it is this: a document is one of several types, and machine processing of the document is to determine which type.</Paragraph>
    <Paragraph position="1"> In this note, we present results concerning the theory and practice of classification schemes based on word frequencies.</Paragraph>
    <Paragraph position="2"> The theoretical results are about mathematical models of classification schemes, and apply to any document classification problem to the extent that the model faithfully represents that problem. One must choose a model that not only provides a mathematical description of the problem at hand but is also one in which the desired calculations can be made. For example, in document classification, it would be nice to be able to calculate the probability that a document on subject i will be classified as on subject i. Further, it would be comforting to know that there is no better scheme than the one being used. Our models have these characteristics. They are simple, the calculations of probabilities of correct document classification are straightforward, and we have proved that there are no schemes using the same information that have better success rates. In an experiment, the scheme was used to classify two types of documents, and was found to work very well indeed.</Paragraph>
  </Section>
</Paper>