<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1003">
  <Title>Language Determination: Natural Language Processing from Scanned Document Images</Title>
  <Section position="6" start_page="19" end_page="20" type="concl">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> We have described our method for generating word shape tokens from images and have shown how this shape-level representation of the text can be used for important tasks such as determining the language or languages of a document. We have shown that the method can discriminate among 23 languages with high accuracy.</Paragraph>
    <Paragraph position="1"> Since our approach is statistical, the more text our system sees in a document image, the more reliably it can determine the document's language. So far, we have not tried to determine the language of a document shorter than 27 words, and most of the documents we work with are a few hundred words long (2000-3000 characters).</Paragraph>
    <Paragraph position="2"> We are investigating the lower bound on the length of texts whose language we can reliably determine. In the ideal case we would be able to detect the presence of just a few words of a secondary language interpolated into a document written predominantly in another language.</Paragraph>
    <Paragraph position="3"> In other work, we are using the shape-level representation as input to higher-level natural language processing systems for rudimentary content analysis.</Paragraph>
    <Paragraph position="4"> However, many sorts of information, particularly style characteristics, can be derived from the shape-level representation directly. For instance, since the number of character shape codes extracted from a document is comparable to the number of characters, characterizations of word length in a shape-level representation apply as well to the character-coded version of the document.</Paragraph>
    <Paragraph position="5"> This word length characterization is not perfect: ligatures introduce some uncertainty. Additionally, braces, brackets, and parentheses, which are typically set contiguous with words, are currently mapped to A; this will affect word length counts. We are refining the mapping to account for these delimiting characters.</Paragraph>
  </Section>
</Paper>