<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2152">
  <Title>Japanese OCR Error Correction using Character Shape Similarity and Statistical Language Model</Title>
  <Section position="3" start_page="0" end_page="922" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> As our society becomes more computerized, people are eager to enter everything into computers. Contrary to expectation, then, the need for OCR in areas such as office automation and information retrieval keeps growing.</Paragraph>
    <Paragraph position="1"> In Japanese, although the accuracy of printed character OCR is about 98%, sources such as old books, poor quality photocopies, and faxes are still difficult to process and cause many errors. The accuracy of handwritten OCR is still about 90% (Hildebrandt and Liu, 1993), and it worsens dramatically when the input quality is poor. If NLP techniques could be used to boost the accuracy of OCR on handwriting and poor quality documents, a very large market for OCR related applications would open up.</Paragraph>
    <Paragraph position="2"> OCR error correction can be thought of as a spelling correction problem. Although spelling correction has been studied for several decades (Kukich, 1992), the traditional techniques are implicitly based on English and cannot be applied to Asian languages such as Japanese and Chinese.</Paragraph>
    <Paragraph position="3"> The traditional strategy for English spelling correction is called isolated word error correction: word boundaries are marked by white space, and a tokenized string that is not in the dictionary is a non-word. For a non-word, correction candidates are retrieved from the dictionary by approximate string match techniques using context-independent word distance measures such as edit distance (Wagner and Fischer, 1974) and n-gram distance (Angell et al., 1983).</Paragraph>
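The edit-distance-based candidate retrieval described above can be sketched with the classic Wagner-Fischer dynamic program (the toy dictionary and misspelling here are purely illustrative):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein edit distance via the Wagner-Fischer dynamic program."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # row for a[:0] against all prefixes of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

# A non-word is matched against the dictionary, and the closest entries
# are proposed as correction candidates.
dictionary = ["spelling", "spilling", "swelling"]
candidates = sorted(dictionary, key=lambda w: edit_distance("speling", w))
```

For isolated word error correction this ranking is context-independent: only the string shapes of the non-word and the dictionary entries matter.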
    <Paragraph position="4"> Recently, statistical language models and feature-based methods have been used for context-sensitive spelling correction, where errors are corrected considering the context in which the error occurs (Church and Gale, 1991; Mays et al., 1991; Golding and Schabes, 1996). Similar techniques are used for correcting the output of English OCRs (Tong and Evans, 1996) and English speech recognizers (Ringger and Allen, 1996).</Paragraph>
    <Paragraph position="5"> There are two problems in Japanese (and Chinese) spelling correction. The first is the word boundary problem: isolated word error correction techniques cannot be used because there are no delimiters between words. The second is the short word problem: word distance measures are of little use because the average word length is short (&lt; 2 characters) and the character set is large (&gt; 3,000 characters), so a word has far more neighbors within one edit distance than an English word does.</Paragraph>
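The short word problem can be made concrete by counting the strings within edit distance 1 of a word: for word length L over a character set of size C there are L(C-1) substitutions, L deletions, and (L+1)C insertions. A quick sketch (the English word length of 7 is an assumed rough average, not a figure from the paper):

```python
def one_edit_neighbors(word_len: int, charset_size: int) -> int:
    """Count strings within edit distance 1 of a word (not all are real words)."""
    substitutions = word_len * (charset_size - 1)
    deletions = word_len
    insertions = (word_len + 1) * charset_size
    return substitutions + deletions + insertions

# English: an average-length word over a 26-letter alphabet.
english = one_edit_neighbors(7, 26)       # 390
# Japanese: a two-character word over a ~3,000-character set.
japanese = one_edit_neighbors(2, 3000)    # 15000
```

With tens of thousands of one-edit neighbors for a typical two-character Japanese word, edit distance alone barely discriminates among candidates.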
    <Paragraph position="6"> Recently, the first problem was solved by selecting the most likely word sequence from all combinations of exactly and approximately matched words, using a Viterbi-like word segmentation algorithm and a statistical language model that accounts for unknown words and non-words (Nagata, 1996). However, the second problem has not yet been solved, at least not elegantly. The solution presented in (Nagata, 1996), which ranks the list of words within one edit distance by the context in which each would be placed, is inaccurate because the context itself may contain errors.</Paragraph>
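The Viterbi-like segmentation can be sketched as follows. This is a heavily simplified illustration, not Nagata's model: it uses an invented word-unigram vocabulary, considers only exactly matched words, and ignores unknown words and non-words.

```python
import math

# Toy word-unigram model; a real system uses a large vocabulary and
# handles unknown words and OCR non-words (Nagata, 1996).
word_prob = {"東京": 0.004, "都": 0.002, "東": 0.001, "京都": 0.003}

def segment(text: str):
    """Viterbi search for the highest-probability word segmentation."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # (log prob, backpointer) per position
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for j in range(max(0, i - 4), i):  # candidate words up to 4 characters
            w = text[j:i]
            if w in word_prob and best[j][0] > -math.inf:
                score = best[j][0] + math.log(word_prob[w])
                if score > best[i][0]:
                    best[i] = (score, j)
    words, i = [], n          # trace back the best path
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))
```

Because every boundary hypothesis competes in one dynamic program, the word boundary problem and word selection are resolved jointly rather than in separate tokenization and correction passes.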
    <Paragraph position="7"> In this paper, we present a context-independent approximate word match method using character shape similarity. This is suitable for languages with large character sets, such as Japanese and Chinese.</Paragraph>
    <Paragraph position="8"> We also present a method to build a statistical OCR model by smoothing the character confusion probability using character shape similarity.</Paragraph>
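The introduction does not give the smoothing formula, but the idea of backing off from sparse confusion counts to character shape similarity might be sketched as a linear interpolation. The counts, the `shape_similarity` stand-in, and the interpolation weight below are all hypothetical:

```python
# Hypothetical (intended, observed) confusion counts from OCR output;
# real confusion matrices are large and sparse.
confusion_counts = {("大", "犬"): 8, ("大", "大"): 190}

def shape_similarity(a: str, b: str) -> float:
    """Stand-in for a similarity score derived from character feature vectors."""
    return 1.0 if a == b else 0.1  # invented values for illustration

def confusion_prob(intended: str, observed: str, lam: float = 0.8) -> float:
    """Smooth the sparse confusion MLE with a shape-similarity estimate."""
    total = sum(c for (i, _), c in confusion_counts.items() if i == intended)
    mle = confusion_counts.get((intended, observed), 0) / total if total else 0.0
    return lam * mle + (1 - lam) * shape_similarity(intended, observed)
```

The point of the smoothing is that a confusion pair never seen in training (mle = 0) still receives a nonzero probability proportional to how visually similar the two characters are.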
    <Paragraph position="9"> It seems that previous NLP researchers have been reluctant to use resources such as the character confusion matrix and the feature vectors of the characters, trying instead to solve the problem with linguistic devices alone.</Paragraph>
    <Paragraph position="10"> We found that, by using character shape similarity, the resulting OCR error corrector is robust and accurate enough to correct unrestricted texts with a wide range of recognition accuracies.</Paragraph>
  </Section>
</Paper>