<?xml version="1.0" standalone="yes"?> <Paper uid="A94-1039"> <Title>Fukui-shi,Japan</Title> <Section position="2" start_page="0" end_page="198" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In order to improve the man-machine interfaces of computers, input devices such as Optical Character Readers (OCR) and speech recognition devices have been developed. However, text input through an OCR or a speech recognition device usually contains erroneous character strings.</Paragraph> <Paragraph position="1"> The erroneous characters can be classified into three types. The first is characters that have been recognized incorrectly, that is, taken to be characters other than the correct ones. The second and the third are extra characters wrongly inserted and characters wrongly deleted (skipped). Markov chain models have been used to find and correct the first type of error.</Paragraph> <Paragraph position="2"> Recently, the Selective Error Correction Method, which judges the three types of errors and corrects them using an m-th order Markov chain model for Japanese 'kanji-kana' characters, has been proposed (Araki et al., 1994).</Paragraph> <Paragraph position="3"> In this paper, the Selective Error Correction Method is applied to detect and correct erroneous characters in Japanese text input through an OCR.</Paragraph> <Paragraph position="4"> (1) The number of phrases used for statistics: 70 issues of a daily Japanese newspaper containing 283,963 phrases.</Paragraph> <Paragraph position="5"> (2) The number of phrases input through the OCR: 1000 phrases. (a) The average length of a phrase (in 'kanji-kana' characters): 6. (b) The size of character fonts: 8 point. (c) The input method to the OCR: Fax. A Japanese sentence can be separated into syntactic units called phrases (usually called &quot;bunsetsu&quot;). Japanese phrases in a text can be divided into two types: correct phrases and erroneous phrases. 
The set of correct Japanese phrases is represented by Fc. The set of erroneous phrases is denoted by FE, and it is further divided into three types. The first is erroneous phrases that contain wrongly substituted characters, denoted by Fs. The second and the third are erroneous phrases that have characters omitted from them (denoted by FD) or wrongly inserted in them (denoted by FI). The accuracy of a method in detecting and correcting the errors is evaluated by the &quot;Relevance Factor&quot; P and the &quot;Recall Factor&quot; R. Here, P denotes the ratio of the errors detected or corrected by a method to the whole of FE, and R denotes the ratio of the elements of FE that can be detected or corrected by a method.</Paragraph> <Paragraph position="6"> Next, we introduce the following assumption based on previous experiments: &quot;Each Markov probability for erroneous chains of syllables and 'kanji-kana' characters is small compared to that of correct chains&quot;.</Paragraph> <Paragraph position="7"> According to this assumption, the procedure for detecting the location i and the length k of error chains is defined as follows. Namely, the procedure detects that k characters are wrongly substituted or inserted at the location i if the m-th order Markov probability for a chain remains smaller than the critical value T exactly (k + m) times, from the location i to i+k+m-1.</Paragraph> <Paragraph position="8"> Similarly, the method of detecting the location of a chain wrongly deleted (in FD) and the methods of correcting the chains with wrongly substituted, inserted or deleted characters are described in (Araki et al., 1994).</Paragraph> </Section> </Paper>