<?xml version="1.0" standalone="yes"?> <Paper uid="A94-1003"> <Title>Language Determination: Natural Language Processing from Scanned Document Images</Title> <Section position="2" start_page="0" end_page="15" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Computational linguists work with texts. Computational linguistic applications range from natural language understanding to information retrieval to machine translation. Such systems usually assume the language of the text that is being processed. However, as corpora become larger and more diverse, this assumption becomes less warranted. Attention is now turning to the issue of determining the language or languages of a text before further processing is done. Several sources of information for language determination have been tried: short words (Kulikowski 1991, Ingle 1976); n-grams of words (Batchelder 1992); n-grams of characters (Cavnar & Trenkle 1994); diacritics and special characters (Beesley 1988, Newman 1987); syllable characteristics (Mustonen 1965); morphology and syntax (Ziegler 1991). Each of these approaches is promising, although none is completely accurate. More fundamentally, many rely on relatively large amounts of text data, and all rely on data in the form of character codes (e.g., ASCII).</Paragraph> <Paragraph position="1"> In today's world of text-based information, however, not all sources of text will be character coded.</Paragraph> <Paragraph position="2"> Many documents such as incoming faxes, patent applications, and office memos are only accessible on paper. Processes such as Optical Character Recognition (OCR) have been developed for mapping paper documents into character-coded text.</Paragraph> <Paragraph position="3"> However, for applications like OCR, it is desirable to know the language a document is in before trying to decode its characters. There appears to be a fundamental Catch-22: natural language processing systems want to be able to work automatically with arbitrary documents, many of which may be available only on paper, and in the process, they minimally need to know which language or languages are present. The algorithms cited above can determine a document's language, but they require a character-coded representation of the text.</Paragraph> <Paragraph position="4"> OCR can produce such a representation, but OCR does not work well unless the language(s) of the document are known. So how can the language of a paper document be determined? We have developed a method which reliably determines the language or languages of a document image. In this paper, we discuss Roman-alphabet languages such as English, Polish, and Swahili; see Spitz (1994) for a discussion of the determination of Asian-script languages. Our method finesses the problems inherent in mapping from an image to a character-coded representation: we map instead from the image to a shape-based representation. The basal representation is the character shape code, of which there are a small number. These shape codes are aggregated into word shape tokens, which are delimited by white space. From examining these word shape tokens we can determine the language of the document. An example of the transformation from character codes to character shape codes is shown in figure 1.</Paragraph> <Paragraph position="5"> [Figure 1: The sentence "Confidence in the international monetary system was shaky enough before last week's action." in its character code and its character shape code representations.]</Paragraph>
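To make the transformation in figure 1 concrete, here is a minimal Python sketch of the mapping from character codes to character shape codes and of their aggregation into whitespace-delimited word shape tokens. The code alphabet loosely follows Spitz (1993), but the per-character table is an illustrative assumption, not the authors' published mapping.

```python
# A minimal sketch, assuming a simplified code alphabet loosely based on
# Spitz (1993): 'A' ascender, 'x' x-height, 'g' descender, 'i' dotted
# x-height, 'j' dotted descender, 'U' character with a diacritic above.
# The character table is illustrative, not the authors' published mapping.

ASCENDERS = set("bdfhklt") | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
DESCENDERS = set("gpqy")
DIACRITICS = set("äöüÄÖÜéèêëàâîïôûù")  # assumed to be coded 'U' here

def shape_code(ch: str) -> str:
    """Map one character code to a character shape code."""
    if ch == "i":
        return "i"
    if ch == "j":
        return "j"
    if ch in DIACRITICS:
        return "U"
    if ch in ASCENDERS:
        return "A"
    if ch in DESCENDERS:
        return "g"
    if ch.isalpha():
        return "x"   # confined to the x zone
    return ch        # punctuation etc. left unchanged

def word_shape_tokens(text: str) -> list[str]:
    """Word shape tokens are runs of shape codes delimited by white space."""
    return ["".join(shape_code(c) for c in word) for word in text.split()]

print(word_shape_tokens("Confidence in the system was shaky"))
# ['AxxAiAxxxx', 'ix', 'AAx', 'xgxAxx', 'xxx', 'xAxAg']
```

Even under this toy table, distinct character strings collapse onto a small set of tokens, which is what makes the representation compact enough to support language-level statistics.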
<Paragraph position="6"> The shape-based representation of a document is proving to be a remarkably rich source of information. While our initial goal has been to use it for language identification, in support of downstream OCR processes, we are finding that this representation may itself be sufficient for natural language applications such as document indexing and content characterization (see Nakayama (this volume), Sibun & Farrar 1994). We find these indications exciting because OCR is an expensive, slow, and often inaccurate process, especially in the presence of printing and scanning artifacts such as broken or touching characters or skew or curvature of text lines. Thus, if our technique allows natural language processing systems to apply OCR selectively or to side-step OCR entirely, such systems will become faster, less expensive, and more robust.</Paragraph> <Paragraph position="7"> In this paper, we first explain the background of our system that constructs character shape codes and word shape tokens from a document image. We next describe our method for language determination from this shape-based representation, and demonstrate our approach using only the three languages English, French, and German. We then describe an automated version of this process that allows us to apply our techniques to an arbitrary set of languages and show its performance on 23 Roman-alphabet languages.</Paragraph> <SectionTitle> 2 Character shape codes and word shape tokens </SectionTitle> <Paragraph position="8"> Our determinations about document characteristics are made neither on the raw image nor on the character codes by which the document can be represented. The determinations are made on a shape-based representation built of a novel component, the character shape code (Spitz 1993).</Paragraph> <Paragraph position="9"> Four horizontal lines define the boundaries of three significant zones on each text line (see figure 2). The area between the bottom and the baseline is the descender zone; the area between the baseline and the top of such characters as x is the x zone; and the area between the x-height level and the top is the ascender zone.</Paragraph> <Paragraph position="10"> [Figure 2: A text line with its four significant horizontal positions: Top, x-height, Baseline and Bottom.]</Paragraph> <Paragraph position="11"> Characterizations of the number of connected components in a character cell and, in some instances, their aspect ratios, contribute to the coding. Thus most characters can be readily mapped from their positions relative to the baseline and x-height to a small number of distinct codes (see figure 3).</Paragraph>
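As a sketch of how this zone-based classification might look on the image side, the following hypothetical function assigns a code from one character cell's vertical ink extents relative to the detected x-height and baseline, using the connected-component count to separate dotted characters. The measurement interface (parameter names, thresholds) is assumed for illustration, and the aspect-ratio tests mentioned above are omitted; this is an approximation, not the authors' algorithm.

```python
# Hypothetical sketch of zone-based shape coding from image measurements.
# y coordinates grow downward: 'top'/'bottom' are the ink extents of one
# character cell; 'x_height'/'baseline' are the detected line positions.
def shape_code_from_cell(top: float, bottom: float,
                         x_height: float, baseline: float,
                         n_components: int) -> str:
    ascends = top < x_height        # ink reaches into the ascender zone
    descends = bottom > baseline    # ink reaches into the descender zone
    if n_components == 2:           # body plus a detached dot: i or j
        return "j" if descends else "i"
    # Characters with diacritics (e.g. u-umlaut, coded U) would need further
    # component counts and aspect-ratio tests, as the text notes; omitted.
    if ascends and not descends:
        return "A"
    if descends and not ascends:
        return "g"
    return "x"                      # confined to the x zone
```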
<Section position="1" start_page="15" end_page="15" type="sub_section"> <SectionTitle> 2.1 Typesetting effects </SectionTitle> <Paragraph position="0"> Typesetters use different conventions. For example, in German text ü may be set as ue and ß may be set as ss.</Paragraph> <Paragraph position="1"> Therefore, there may be several-to-one mappings of typeset information to character shape codes, since ü maps to U and ue to xx.</Paragraph> <Paragraph position="2"> If this shape mapping can be done from document images, it can more trivially be accomplished from character-coded documents (e.g., ASCII, ISO-Latin-1, JIS, Unicode), provided, of course, that the method of encoding is known.</Paragraph> </Section> <Section position="2" start_page="15" end_page="15" type="sub_section"> <SectionTitle> 2.2 Computational complexity </SectionTitle> <Paragraph position="0"> Our approach takes on a much less difficult problem than does OCR. There is no need to investigate the fine structure of character images, the number of classes is small, and measurements are largely independent of font or typeface. As a result, the process of classifying text into character shape codes and aggregating those codes into word shape tokens is two to three orders of magnitude faster than current OCR technology.</Paragraph> </Section> </Section> </Paper>