<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1004">
  <Title>Modeling Content Identification from Document Images</Title>
  <Section position="4" start_page="26" end_page="26" type="concl">
    <SectionTitle>
4 Conclusions and further directions
</SectionTitle>
    <Paragraph position="0"> Generating word shape tokens from images is inexpensive and robust for real-world documents. Word shape tokens do not carry alphabetical information, but they are potentially usable for content identification by locating content-representing word images. Our method uses a word shape token stop list and analyzes the distribution of tokens. This technique depends on the observation that, in English, the characteristics of word shape differ between function and content words, and between frequent and infrequent words.</Paragraph>
    <Paragraph position="1"> We expect to be able to extend the technique to many other European languages that have similar characteristics. For example, German function words tend to be shorter than nouns, which are always capitalized.</Paragraph>
    <Paragraph position="2"> In addition, by drawing on our language determination technique, which uses the same word shape tokens (Nakayama and Spitz, 1993; Sibun and Spitz, this volume), we could enhance the technique described here for multilingual sources.</Paragraph>
    <Paragraph position="3"> Other future work involves examining automatic document categorization, in which an input document image is assigned to some pre-existing subject category (Cavnar and Trenkle, 1994). With reliable training data, we feel we can identify the configuration of word shape tokens across subjects. Using a statistical method to compute the distance between the input and the configurations of the categories would be a good approach. This might be useful for a document sorting service for fax machines.</Paragraph>
  </Section>
</Paper>