<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2138">
  <Title>Content-Oriented Categorization of Document Images</Title>
  <Section position="2" start_page="0" end_page="818" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The number of documents available on the network is increasing with the development of the computational infrastructure. Accordingly, information retrieval has become one of the most important research topics in natural language processing (NLP). In the digital network world, documents are usually distributed either as text files or as images, where the former is a sequence of character codes (e.g., ASCII) and the latter is a bitmap. Although only text files are machine-readable and convenient from the viewpoint of information retrieval, many documents are available as images alone. Such images are easily generated by scanning hard-copy documents, which remain in widespread use.</Paragraph>
    <Paragraph position="1"> While most information retrieval systems have been designed for text files, some systems have been proposed for images. They convert images into text files using optical character recognition (OCR) so that existing NLP techniques can be applied. Although even state-of-the-art OCR produces noisy output with recognition errors (Rice, et al., 1995), prior work has shown that OCR output is satisfactory for retrieval purposes (Ittner, et al., 1995; Mittendorf, et al., 1995; Myers and Mulgaonkar, 1995; Wenzel and Hoch, 1995). The inaccuracy of OCR can thus be largely mitigated. However, little attention has been paid to reducing the computational expense of OCR, which is a major speed bottleneck for information retrieval systems. For example, Myers and Mulgaonkar reported that the total processing time of their OCR-based information extraction system was dominated by the character and word recognition processes (Myers and Mulgaonkar, 1995). This suggests an important question: &quot;how much NLP can be done without character recognition (Church, et al., 1994)?&quot; An alternative to OCR is word shape token processing, which converts images into a shape-based representation. It recognizes coarse character shape classes (character shape codes) rather than character codes. Because the number of character shape codes is small and they are defined by simple graphical features, recognizing them from images is inexpensive. Word shape token processing has proven useful for European language identification (Nakayama and Spitz, 1993; Sibun and Spitz, 1994). Its feasibility for content characterization has also been discussed using controlled (noise-free) on-line data sets (Nakayama, 1994; Nakayama, 1995; Sibun and Farrar, 1994). However, no analysis has been done with real document images, which are usually degraded in quality.
In addition, a comparative evaluation of the word shape token-based approach and the OCR-based approach is needed.</Paragraph>
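The idea of replacing character codes with coarse shape classes can be sketched in Python on plain text, in the spirit of the noise-free on-line experiments cited above. The class inventory used here (A, x, g, i, j) and the letter groupings are illustrative assumptions, not the code set defined later in the paper, and the function names are hypothetical.

```python
# Hypothetical shape classes for illustration; not the paper's actual code set.
ASCENDERS = set("bdfhklt")   # lowercase letters that rise above the x-height
DESCENDERS = set("gpqy")     # lowercase letters that drop below the baseline


def shape_code(ch):
    """Map one character to a coarse shape class (assumed classes)."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"   # tall glyph
    if ch in DESCENDERS:
        return "g"   # glyph with a descender
    if ch == "i":
        return "i"   # x-height glyph with a dot
    if ch == "j":
        return "j"   # descender with a dot
    if ch.islower():
        return "x"   # plain x-height glyph
    return ch        # punctuation and other characters pass through


def word_shape_token(word):
    """Concatenate the shape classes of a word's characters."""
    return "".join(shape_code(c) for c in word)


def tokenize(text):
    """Split on whitespace and convert each word to its shape token."""
    return [word_shape_token(w) for w in text.split()]
```

Under these assumed classes, distinct words such as "bed" and "had" collapse into the same token, which is the loss of information that the cheaper recognition is traded against.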
    <Paragraph position="2"> We have developed a technique which automatically categorizes document images into pre-defined classes based on their content. It employs a vector space classifier, one of the many robust statistical techniques developed in information retrieval (see Salton, 1991).</Paragraph>
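A vector space classifier of the general kind mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions: raw token counts as weights, one summed profile vector per class, and cosine similarity as the matching function. None of these choices is taken from the paper, and the names cosine, build_profiles and classify are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0   # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def build_profiles(labeled_docs):
    """Sum token counts per class; labeled_docs yields (label, tokens) pairs."""
    profiles = {}
    for label, tokens in labeled_docs:
        profiles.setdefault(label, Counter()).update(tokens)
    return profiles


def classify(tokens, profiles):
    """Return the class whose profile is most similar to the document."""
    doc = Counter(tokens)
    return max(profiles, key=lambda label: cosine(doc, profiles[label]))
```

The tokens can be words from OCR output or word shape tokens; the classifier itself does not depend on which representation is used, which is what makes a like-for-like comparison of the two approaches possible.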
    <Paragraph position="3"> We show in this paper that our technique categorizes documents as accurately as the conventional OCR-based approach while processing them much faster.</Paragraph>
    <Paragraph position="4"> In the next section, we define character shape codes and word shape tokens and describe how they are generated from document images. In section 3, we outline the automated categorization system which we developed. In section 4, using a topic-tagged document image database, we show that the word shape token-based approach is adequate for content-oriented categorization in comparison with a conventional OCR-based system. In section 5, we discuss the experimental results and future work.</Paragraph>
  </Section>
</Paper>