<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1003">
  <Title>Language Determination: Natural Language Processing from Scanned Document Images</Title>
  <Section position="3" start_page="15" end_page="16" type="metho">
    <SectionTitle>
3 Language determination
</SectionTitle>
    <Paragraph position="0"> We have found that we can readily distinguish the language of a document for 23 Roman-alphabet (mostly European) languages from a relatively small text. This technique exploits the high frequency of short words in such languages and the diversity of their word shape token representations.</Paragraph>
    <Paragraph position="1"> In this section, we describe our method for determining a document's language from the shape-based representation derived from the image (some of this 1. Document images may be obtained by scanning of paper documents, by retrieval from a document image database, or by digital rendering of a high level representation of the document.</Paragraph>
    <Paragraph position="2"> 2. This paper adepts the following conventions: monospaced to represent input characters, boldface to represent the character shape codes (A, x, i, g, J, U), and sans-serif to represent typographic conventions.</Paragraph>
    <Paragraph position="3">  work has been reported in Nakayama &amp; Spitz 1993). Our system learns how to discriminate a set of languages; then, for any input document, the system determines to which language it belongs. Our method uses the statistical technique of Linear Discriminate Analysis (LDA). First, we demonstrate the method using a hand-selected set of distinguishing features for a small set of lanPSuages. In section 4, we describe our process for automating the selection of distinguishing features across an arbitrary number of lanPSuages, and show the results on a corpus that includes documents from 23 languages.</Paragraph>
    <Paragraph position="4"> Our initial set of discriminable languages comprised English, French, and German. To ascertain the set of discriminating features, we built a training corpus of approximately 15 scanned images of one-page documents for each language. We tokenized these images following the procedure described in section 2. This resulted in 7621 tokens from l~.,glisla, 6826 tokens from French, and 5472 tokens from German. We then ranked the frequency of word shape tokens across each corpus and noted the ten most frequent tokens. By comparing these top ten word shape tokens for each of the languages, we were able to select one per language that was both frequent in that lanouage and less frequent in the other languages. Intuitively, each of these tokens is characteristic of its language; therefore, we call these characteristic tokens (see figure 4). The characteristic token for English is AAx; AAx constitutes 7% of the tokens in the  French and German: the top five for each language are shown; rankings of these are shown for the other languages when they fall in the top ten; shading indicates the characteristic token for each language; and common words that map to the top five tokens for each language are shown.</Paragraph>
    <Paragraph position="5"> Axx constitutes 6%. However, of the five, only Aix is rare in the other languages. While Ax is frequent in all three corpora, it is overwhelmingly frequent in French, where it makes up 11% of the tokens (vs. 4% for English and 2% for German). These differences in the distribution of the characteristic tokens in the three corpora are sufficient for LDA to correctly identify each language almost every time (see figure 5). 3 The documents are from the training corpus: by a process called cross-validation, each was removed from the training corpus one at a time and classified based on the discriminating results from training on the rest of the corpus.</Paragraph>
    <Paragraph position="6">  It may be noted that each of the top five word shape tokens in each of the English, French, and German corpora is a mapping of dosed class words such as determiners, conjunctions, and pronouns. This is not surprising, since dosed class words are frequent in European languages. Of course, other words map to these word shape tokens too. For example, in English, the word flu maps to AAx. However the overwhelming proportion of AAx tokens in the English corpus are mappings of the. Since the is such a common word in English, we can expect AAx to be characteristic of any shape-level representation of an F.nglish document. Similar situations obtain in the other languages.</Paragraph>
    <Paragraph position="7"> While it may seem fortuitous that in English AAx is virtually always a mapping of the, unique word shape tokens are more common in Roman-alphabet languages than one might suppose. We mapped an English lexicon of surface ft~ms into word shape tokens and discovered that 20% of the resulting word ~ape tokens were unique; examples include the surface forms apple and apples.</Paragraph>
  </Section>
  <Section position="4" start_page="16" end_page="18" type="metho">
    <SectionTitle>
4 Automated language determination
</SectionTitle>
    <Paragraph position="0"> tion In the previous section, we discussed the selection of discriminating word shape tokens by hand. We now describe our method for automating this process. We have been able to use this technique to discover a discriminating set of tokens for a large fraction of the languages written in the Roman alphabet. We initially tested this automated technique by recapitulating our English corpus and is quite rare in the others. In the German corpus, Aix is not the most frequent token: xx, xxA, Aix, and xxx each make up about 3% of the corpus while 3. In the case of the German document that was misclassified, examination of the image reveals that, due to printing and scanning artifacts, many characters axe artifactually touching each other.</Paragraph>
    <Paragraph position="1">  work done by hand in discriminating English, French, and German. We then applied the technique to a 755document corpus comprising 23 languages.</Paragraph>
    <Section position="1" start_page="17" end_page="17" type="sub_section">
      <SectionTitle>
4.1 The automated method
</SectionTitle>
      <Paragraph position="0"> While it is easy to hand-select a single discriminating token for each of a few languages, the task becomes more complex as the number of languages grows. Further, a single feature per language may no longer be sufficient; a profile, or vector of features, for each language would be more robust.</Paragraph>
      <Paragraph position="1"> For the automated method, a corpus for each of the languages is scanned and tokeuized, and the tokens are sorted by frequency. The n most frequent tokens for each corpus are selected. We apply stepwise discriminant analysis, a variant of LDA, to this token set: variables are selected one by one according to their ability to discriminate between languages. The optimal value of n has not yet been determined. We need to gather enough discriminating tokens to characterize the languages as completely as possible. However, if we use too many, the accuracy of the classification may actually be degraded; further, relatively uncommon tokens may improve performance on test data but may not work well in general.</Paragraph>
      <Paragraph position="2"> As we discuss below, n = 5 suffices for three languages, but may not be optimal for 23.</Paragraph>
      <Paragraph position="3"> There are several considerations for ensuring that this process is robust. The size of the corpus for each language must be sufficiently large in terms of both the number of documents and the total number of word shape tokens. The number of documents must be large enough to enable the LDA testing procedure to systematically eliminate some of them for cross validation without skewing the overall characteristics of the corpus. The number of word shape tokens must be large enough to be reflective of the language in which the documents are written to allow for accurate comparison between languages. A further consideration is that the number of discriminating tokens used by the LDA system should be considerably smaller than the number of documents.</Paragraph>
      <Paragraph position="4"> For our initial test we selected the five most frequent word shape tokens from each of English, F~nch, and German; this fo~aied a set of ten tokens (because of overlap between corpora). Using stepwise discriminant analysis, the system fonnd the best way to use the tokens by selecting the single token that was most discriminating and then for each of the remaining tokens adding the next most discriminating tokens given the ones that had already been selected. This resulted in a ranking of nine discriminating tokens (Ax, xA, ix, AIX, Axx, xx, AAx, xxA, xxx). The tenth was not found to improve the reliability of the discrimination; in fact accuracy peaked at four tokens.</Paragraph>
      <Paragraph position="5"> We compared the performance of the automated system with that using the hand-selected tokens. When the top three automatically-selected tokens were used, performance was comparable to that of the three hand-selected tokens. Interestingly, there is no overlap in the misclassification of documents. Using four automatically-selected tokens, the system classified all but one document correctly (see figure 6).</Paragraph>
    </Section>
    <Section position="2" start_page="17" end_page="18" type="sub_section">
      <SectionTitle>
4.2 Automated determination for many
languages
</SectionTitle>
      <Paragraph position="0"> We have constructed a database of 755 one-page documents in 23 languages including virtually every European language written in the Roman alphabet. There are 18 Indo-European languages: Afrikaans, Croatian, Czech/Slovak 4, Danish, Dutch, English, French, Gaelic, German, Icelandic, Italian, Norwegian, Polish, Portuguese, Rumanian, Spanish, Swedish, and Welsh. There are two Uralic languages: Finni.~h and Hungarian.</Paragraph>
      <Paragraph position="1"> Finally, we include three languages from disparate families: Turkish, Swahili~ and Viemamese.</Paragraph>
      <Paragraph position="2"> To construct a set of discriminating features, we selected the five most frequent word shape tokens from each language. Because of overlap, this resulted in 23 tokens. Some of these discriminating tokens have a high frequency across languages; in fact, xx appears in the top five of 22 of the languages we examined. However, even when we consider 23 languages, there are eight tokens appearing in the top five of one language which do not appear in the top five of any others. (This does not mean of course, that these tokens do not appear in other languages at all, but simply that they are relatively much less frequent.) The 23 tokens comprise the set (x, xx, xxx, xxxx, i, ix, xi, xix, A, AAx, Ax, AxA, AxAx, Axx, Axxx, xA, xxA, Ai, AIX, g, gx, xg, xxg, jx).</Paragraph>
      <Paragraph position="3"> As before, we used LDA to build a statistical model of the language categorizations, and by cross validation tested the accuracy of the model (see figure 7). Our over-all accuracy is better than 90%, while the accuracy for individual languages varies between 100% and 75%, with an outlier of 44% for Czech/Slovak. Examination of misclassifications proves somewhat instructive, as can be seen in the confusion matrix in figure 8. For example, Dutch and Afrikaans are closely related languages, and the only error in either language is the categorization of one Afrikaans document as Dutch. Among the five 4. We initially considered Czech and Slovak as separate languages, but this yielded worse results than combining them. We feel our decision was legitimate because ~Slovak is similar enough to Czech to be considered by some as merely a dialect&amp;quot; despite &amp;quot;the existence of slightly different alphabets, as well as distinct litoratures ~ (Katzner 1986, p 91).  Romance languages - French, Italian, Spanish, Portuguese, Rumaulan - nine of the ten classification errors are within that language family. For the Scandinavian language family - Danish, Norwegian, Swedish, and Icelandic - the pattern is less clear. Two Norwegian documents are classified as Icelandic, but the three other errors in that family are classifications outside of the family.</Paragraph>
      <Paragraph position="4">  viations shown are used as indices in figure 8.</Paragraph>
      <Paragraph position="5"> Croatian, Czech/Slovak, and Polish are all Slavic languages; Hungarian and Finnish are related to each other but not to any other European languages. However, there is a large cluster of errors within the set of these five languages. Most of these errors are for Czech/Sloyak documents; in fact, Czech/Slovak was recognized far less accurately than any other language and it is unclear why. It may be the case that many of these doeuments are of poor quality. Seventeen of the 69 errors seem to be random; while we are working to reduce such errors, it is unlikely that we can eliminate them entirely. It is possible that 23 discriminating tokens is not sufficient; since the accuracy has been improved by the addition of each new token, adding several more may continue the improvement.</Paragraph>
    </Section>
    <Section position="3" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
4.3 Discussion of methodology
</SectionTitle>
      <Paragraph position="0"> While LDA has proved adequate, there are some drawbacks to this technique. We are somewhat disappointed by the system's accuracy. Examination of token frequencies suggests that the profiles for each language are distinct enough that 90% should be a lower bound on classification accuracy. However, for several languages the accuracy was much lower, and for many more it was not much better than 90%. A more troubling problem is the instability of the model. When we add or delete languages, overall accuracy fluctuates between 80% and 93%. This suggests that removing a l~_nguage affects the typical distribution across all lanPSuages, which should not be the case. It is difficult to identify the underlying causes of both of these observations. Finally, the results of LDA are difficult to interpret. All these considerations suggest that LDA may not be the best technique to use.</Paragraph>
      <Paragraph position="1"> Therefore, we are exploring alternative statistical models, such as classification trees, to fred an approach that is more robust for our task.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="18" end_page="19" type="metho">
    <SectionTitle>
5 Comparison with other methods
</SectionTitle>
    <Paragraph position="0"> It is difficult for us to compare our approach to other methods of language determination. Most sources we have found are simply guides for librarians or translators.</Paragraph>
    <Paragraph position="1"> For example, Ingle (1976) found that the presence or absence of specific one- or two-character words suff~,es to distinguish among 17 Roman-alphabet languages.</Paragraph>
    <Paragraph position="2"> There are several implemented systems, some of which report on their accuracy, but none is addressing exactly the same problem as ours: all work from character-coded text. However, it is useful to get a ballpark estimate of the accuracy to be expected of character-based systems.</Paragraph>
    <Paragraph position="3"> Batchelder (1992) trained neural networks to recognize 3-6 character words from 10 languages. While her networks had high accuracy in recognizing words from the training set, their best-case performance on untrained words was 53%, thus making accurate determination of a document's language highly onlikely.</Paragraph>
    <Paragraph position="4"> Cavner and Trenlde (1994) used n-grams of characters for n = 1 to 5. Their task was not language determination per se, but determining to which country's newsgroup (in the netnews soc.culture hierarchy) a document belonged. In each newsgroup, the documents were written in either English or other language(s). For documents longer than 300 characters, the system determined the correct newsgroup with 97% accuracy when using the 100 most frequent n-grams. These results are good, but the technique should be tested on a set of documents for which the l~nguages are known and the topics are varied.</Paragraph>
    <Paragraph position="5"> Kulikowski (1991) used a semi-automatic method to determine a profile of frequent 2-3 character words for nine languages. He claims at least 95% accuracy for determining that a single-language document is in one of the nine languages or in none of them. Unforttmately he does not expand on this claimdeg Henfich (1989) used criteria such as language-specific word-boundary character sequences and common short words to determine the language of sentences in English, French, or German.</Paragraph>
    <Paragraph position="6"> Mustonen (1965) used discriminant analysis to distinguish English, Swedish, and Finnish words. His system, which used 43 discriminating features, such as particular letters and syllable types, performed with 76% accuracy. This relatively poor performance is probably due to the data being isolated words rather than documents, though it may also be due to overfitting of the test data by too many features (see section 4.1).</Paragraph>
    <Paragraph position="7"> We would like to emphasize that our statistics on word shape token distribution across the various lan- null diagonal indicate the number of correct classificaticms for each language. Numbers off the diagonal indicate classification errors.</Paragraph>
    <Paragraph position="8"> guages are generated entirely from scanned images of text. We feel this is important because the text whose language we are trying to identify should not be systematically different in any way from the texts from which the discriminate analysis was generated. For example, typographic conventions such as a ligature between a vowel and an acute accent (as in characters like ~t) cause the character shape code recognizer to classify these characters as A. However, if we were working from encoded on-line corpora we would &amp;quot;know&amp;quot; that such a character should be classified as i.</Paragraph>
  </Section>
class="xml-element"></Paper>