<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2138">
  <Title>Content-Oriented Categorization of Document Images</Title>
  <Section position="3" start_page="818" end_page="819" type="metho">
    <SectionTitle>
2 Character Shape Code and Word Shape Token
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="818" end_page="819" type="sub_section">
      <Paragraph position="0"> A character shape code is a machine-readable code which represents a set of graphically similar characters. A word shape token is a sequence of one o1&amp;quot; more character shape codes which represents a word.</Paragraph>
      <Paragraph position="1"> Character shape codes are defined differently by the selection of graphical features. In this paper, we consider the number of connected components, vertical location, and deep concavity as graphical features to classify characters. First, we identify the positions of the text lines as shown in figure 1. Second, we identify the character cells, and count the number of connected components in each character cell. Third, we note their position with respect to the text lines.</Paragraph>
      <Paragraph position="2"> Finally, we identify the presence of a deep eastward/ southward concavity. In figure 1, vertical location classifies characters into three groups--{&amp;quot;l&amp;quot;} {&amp;quot;g&amp;quot;} {&amp;quot;a&amp;quot;, &amp;quot;n&amp;quot;, &amp;quot;u&amp;quot;, &amp;quot;e&amp;quot;}; characters that occupy the space between the top and the baseline, characters that occupy the space between the x-height line and the bottom, and characters that occupy the space between the x-height line and the baseline, respectively. The last one is further classified by presence or absence of a deep eastward/southward concavity. Resultant groups are {&amp;quot;a&amp;quot;, &amp;quot;u&amp;quot;} {&amp;quot;e&amp;quot;} {&amp;quot;n&amp;quot;}. The defined character classes and the members for the ASCII character set are shown in Table 1. Once classification has been performed, the resulting character shape codes are grouped by word boundary and used as word shape tokens for the downstream processing. Figure 2 gives an example of generated word shape token representation with its original document image.</Paragraph>
      <Paragraph position="3"> x-.eig.,,,dege Too Figure 1 : text line parameter positions (above) and comlected components (below) 'Fable 1: character shape code membership character menlbers shape code A A-Zbdfhklt0-9#$&amp;@</Paragraph>
      <Paragraph position="5"> There are many different languages in common use around the world and many different scripts in which these languages are typeset.</Paragraph>
      <Paragraph position="6"> AAexe xxe xxng AIAAexenA Axngxxgex In exxxxn xxe xxxxnA AAe xxxAA xnA xxAg AIAAexenA xexigAx In xAleA AAexe Axngxxgex xxe AggexeA.</Paragraph>
      <Paragraph position="7"> Figure 2: document image (above) and generated word shape tokens (below) note: there is all error (many - xxAg) in the second line due to a small ink drop Our character shape code recognition doesn't require a complicated image analysis. For example, distinguishing &amp;quot;c&amp;quot; from &amp;quot;e&amp;quot; is a difficult task for OCR that requires a considerable computational expense (Ho and Baird, 1994), whereas they are in the same class in our representation (Table 1). Also, our process is free from font identification which is mandatory for OCR (for font identification complexity, see Zramdini and Ingold, 1993). As a result, the process of word shape token generation from images is much faster than current OCR technology.</Paragraph>
      <Paragraph position="8"> While we save a computational expense, we lose some information which original document images have. Table 1 shows that the mapping between character shape codes and original characte~ is one-tomany--we use only seven character shape codes {A x e n i g j }1 to represent all alphabetical characters.  This would seem to be very ambiguous. However, when used for mapping between word shape tokens and original words, the ambiguity is much reduced.</Paragraph>
      <Paragraph position="9"> We show this using a lexicon of 122,545 distinct word (surface-form) entries. When we transformed the lexicon into word shape token representation, the number of distinct entries was reduced to 89,065. This means one word shape token mapped to 1.38 words on average. Next, we extracted nouns, which are important content-representing words for information retrieval, from the lexicon. We were then left with 75,043 distinct word entries. Similarly, we obtained 57,049 distinct word shape tokens from them. This time, one word shape token mapped to 1.32 words. More importantly, most of them--49,953 of 57,049 word shape tokens (87.6%)--mapped to a single word.</Paragraph>
      <Paragraph position="10"> of topic-tagged document images. The system uses the cosine measure to compute the similarity:</Paragraph>
      <Paragraph position="12"> The greater the value of sim(Di, Dj), the more the similarity between D i and Dj. For each prepared category profile, the system computes the similarity to assign the test document to the most similar category 1 .</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="819" end_page="819" type="metho">
    <SectionTitle>
3 Categorization System
</SectionTitle>
    <Paragraph position="0"> We implemented a content-oriented categorization system to evaluate the word shape token-based approach in comparison with the OCR-based approach. The system, which uses the vector space classifier, consists of three main processes as shown in figure 3.</Paragraph>
    <Paragraph position="1"> First, the system transforms the test document image into a sequence of word shape tokens as described in the previous section, where conventional systems perform OCR to generate a sequence of ASCII encoded words.</Paragraph>
    <Paragraph position="2"> Next, it generates a document profile through the following stages: Stage 1. The system removes punctuation marks. Note that they are distinguishable from alphabetical characters in the character shape code representation (Table 1).</Paragraph>
    <Paragraph position="3"> Stage 2. The system removes word shape tokens corresponding to stop-words. In this process, it may also remove some non stop-words because of the one-to-many mapping between word shape tokens and words. In the OCR-based approach, it removes stop-words.</Paragraph>
    <Paragraph position="4"> Stage3. The system computes frequencies of word shape tokens to generate a document profile. The document profile D i is represented as a vector of numeric weights, De =(Wil, Wi2 ..... Wik ..... wit ) ,where Wik is the weight given the kth word shape token in the ith document, and t is the number of distinct word shape tokens of the ith document. We use the relative frequency between 0 and 1 as the weight. As for the OCR-based approach, read word shape token as word.</Paragraph>
    <Paragraph position="5"> Finally, the system measures the degree of similarity between the document profile and a category profile. The category profile Dj is also represented as a vector derived in the same manner from a collection</Paragraph>
  </Section>
  <Section position="5" start_page="819" end_page="821" type="metho">
    <SectionTitle>
4 Performance Assessment
</SectionTitle>
    <Paragraph position="0"> We have constructed a document image database to compare our categorization approach with the conventional OCR-based approach. First, we carefully chose ten topic categories with strong boundaries. In general, the accuracy of an automated categorization system is evaluated by contrast with the expert judgements. However, experts don't always agree on the judgements. For an unbiased comparative experiments between the two approaches, we chose rela- null tively specific topics. Resultant topic categories are affirmative action, Internet, stock market, local traffic, 1. In this paper, documents are always assigned to a single category.</Paragraph>
    <Paragraph position="1">  Presidential race, Athletics (MLB), Giants (MLB), PGA golf, Tokyo subway attack, and food recipe.</Paragraph>
    <Paragraph position="2"> Second, we manually collected the body potion of 50 newswire articles for each category; 500 documents in total. They were clearly relevant to a single category and much less relevant to the other categories. Third, we printed them using a 300-dpi laser printer, and made nth generation photo-copies from them to degrade images by quality. In the photo-copy process, documents were degraded due to spreading toner, low print contrast, paper placement angle, paper flaws, and so on. Finally, we scanned the hard-copy documents of the first, the third, and the fifth generation with a 300-dpi scanner. As a result, we obtained 500 topic-tagged document images for each nth generation photo-copies (n = 1, 3, 5). Figure 4 shows scanned image samples. The average size of the original documents was 647, and ranged from 63 to 2,860 words. The standard deviation was 377.</Paragraph>
    <Paragraph position="3"> n=l There are many different languages in common t  generation photo-copy We transformed the document images into word shape tokens and ASCII encoded words, where we randomly took 30 inlages for each category (300 in total) as training data to generate category profiles, and tested the remaining 20 images (200 in total). We used ScanWorX OCR (Xerox hnaging Systems) 1 for the ASCII encoding.</Paragraph>
    <Paragraph position="4"> 'Fable 2 shows the processing thne for the u'ansformarion of all images on a SPARCstation 10 (Sun Microsystems). Although it had not been optimized, word shape token generation was 8 to 52 times faster than OCR. The difference increased with progression of n (n = 1, 3, 5). The OCR speed was highly dependent on image quality. Also, its word recognition accuracy was affected by image quality--96.3%, 92.8%, and 80.7% for the first, the third, and the fifth generation copies, respectively. It is well understood that OCR is slower and generates numerous elxors for lower quality images (Taghva, et al., 1994). O11 the  1. This is one of the state-of-the-art OCRs in terms of speed and accuracy, see Rice, et al., 1995.</Paragraph>
    <Paragraph position="5"> other hand, word shape token generation was a little faster for lower quality images. This mffavorable result was mainly caused by the lack of character segmentation function. Some characters touched each other in lower quality images, and were treated as a single character in the process of word shape token generation. Consequently, the number of characters to process became small.</Paragraph>
    <Paragraph position="6"> 'Fable 2: processing time (second) Ior word shape token (WST) generation and OCR  Our system categorized the test documents in word shape token and ASCII format as described in the previous section. As shown in Table 3, the accuracy of the word shape token-based approach for higher quality images (n = 1, 3) was nearly equal to that of the OCR-based approach. For lower quality images (n = 5), the former was significantly lower than the latter. Table 4 and 5 show the accuracy of the two approaches as a function of the size of test documents. When images were in higher quality (n = 1, 3), there was little correlation between the accuracy and the size. When they were in lower quality (n = 5), the OCR-based approach had stronger correlation between the accuracy and the size than the word shape token-based approach. This can be explained as follows: In the statistical categorization, it is generally difficult to get good accuracy when the size of the test document is small. In the OCR-based approach with the first and the third generation copies (n = 1, 3), the test documents were large enough for this categorization task. When the OCR encountered the fifth generation copies (n = 5), it garbled ninny words.</Paragraph>
    <Paragraph position="7"> Most of them were transformed into ill-formed (unl~lown) words 2 rather than mistaken for other words. These ill-formed words were ignored in our sinfilarity measurement. Thus, they didn't act as a negative factor, but virtually made the size of the test document smaller. On the other hand, in the word shape token-based approach with the first and the third generation copies (n = 1, 3), the test documents were similarly large enough. When it encountered the fifth generation copies (n = 5), it also garbled many words. But, this time, they were mistaken for other word shape tokens (e.g., many - xxAg in Fig. 2), and acted as a negative factor to reduce the accuracy.</Paragraph>
    <Paragraph position="8"> 2. ScanWorX outputs a word with a reject mark when it is unable to recognize or is unsure in recognition (e.g., meterii~g).</Paragraph>
    <Paragraph position="10"> categorization as a function of the size of test docnments</Paragraph>
    <Paragraph position="12"/>
  </Section>
  <Section position="6" start_page="821" end_page="821" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> From the experimental results in the previous section, our hypothesis that word shape token-based approach is quite adequate for content-oriented categorization was strongly supported at least for the document images from first and third generation photo-copies.</Paragraph>
    <Paragraph position="1"> This means that the mapping ambiguity between word shape tokens and original words was acceptable for the categorization purpose. The accuracy drop observed with the fifth generation photo-copies was not due to the mapping ambiguity but was caused by recognition errors. Unlike OCR which attempts to correctly recognize each word using lexical information, word shape token generation is only faithful to the original image, Thus, it makes many errors with low quality images, whereas OCR indicates illegible characters. Indicating diffidence is better than incorrect recognition for categorization. It would be possible to utilize lexical information in word shape token representation for reducing errors. However, we must pay attention to its computational expense.</Paragraph>
    <Paragraph position="2"> Although it is arguable whether word stemming algorithms contribute to improving the categorization accuracy (Riloff, 1995), we desire to develop an algorithm for word shape token representation. It would be of use for other information retrieval applications such as word-spotting. We feel the word shape token representation is sufficient for locating some suffixes with accuracy. For example, 1,651 words were with suffix &amp;quot;-tion&amp;quot; in the lexicon of 122,545 distinct word entries. We obtained a set of word shape tokens from them. The set mapped to only 25 words without the suffix 1. Similarly, word shape tokens from all 8,077 words with suffix &amp;quot;-ing&amp;quot; mapped to only 20 words without the suffix 2.</Paragraph>
    <Paragraph position="3"> Because all capital letters map to A (Table 1), it is difficult to identify words with only capital letters, which are sometimes important content-representing words (e.g., acronyms). We need to find a graphical feature to distinguish some capital letters from others, considering the complexity of image analysis.</Paragraph>
    <Paragraph position="4"> When we extend the word shape token processing to other applications, it is important to note that the word shape token representation is only meaningful for the computer and hardly human-friendly. Thus, it should be used in unsupervised systems with no human interaction required. Our technique would be useful for an automated incoming fax sorting by the content. Also, it would be used as an automated dictionary selector for the OCR which uses domain-specific dictionaries.</Paragraph>
  </Section>
class="xml-element"></Paper>