<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0108"> <Title>A Statistical Approach to Automatic OCR Error Correction in Context</Title> <Section position="3" start_page="88" end_page="94" type="metho"> <SectionTitle> 2 The Approach </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="88" end_page="90" type="sub_section"> <SectionTitle> 2.1 Problem Statement </SectionTitle> <Paragraph position="0"> The problem of context-based OCR word-error correction can be stated as follows: Let $L = \{w_1, w_2, \ldots, w_m\}$ be the set of all the words in a given lexicon. For an input sentence, $S = s_1, \ldots, s_n$, produced as the output of an OCR device, where $s_1, \ldots, s_n$ are character strings separated by spaces, find the best word sequence, $\hat{W} = w_1, w_2, \ldots, w_n$, for $w_i \in L$, that maximizes the probability $pr(W \mid S)$:</Paragraph> <Paragraph position="2"> &quot;Ruel&quot; is an obscure French-derivative word meaning the space between a bed and the wall.</Paragraph> <Paragraph position="3"> Using Bayes' formula, we can rewrite Formula 1 as:</Paragraph> <Paragraph position="5"> The probability $pr(W)$ is given by the language model and can be decomposed as:</Paragraph> <Paragraph position="7"> where $pr(w_i \mid w_1, \ldots, w_{i-1})$ is the probability that the word $w_i$ appears given that $w_1, w_2, \ldots, w_{i-1}$ appeared previously.</Paragraph> <Paragraph position="8"> In a word-bigram language model, we assume that the probability that a word $w_i$ will appear is affected only by the immediately preceding word. Thus,</Paragraph> <Paragraph position="10"> The conditional probability, $pr(S \mid W)$, reflects the channel (processing) characteristics of the OCR environment. If we assume that strings produced under OCR are independent of one another, we have the following formula:</Paragraph> <Paragraph position="12"> Thus, the problem of calculating $\hat{W}$ is reduced to estimating the word-bigram probability, $pr(w_i \mid w_{i-1})$, and the word confusion probability, $pr(s_i \mid w_i)$. 
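As a compact summary, the steps above can be written as the following chain. This is a sketch assembled from the definitions in this subsection rather than a quotation of the numbered formulas; it assumes that $pr(S)$ can be dropped because it does not depend on $W$, and that $w_0$ denotes a hypothetical sentence-start marker.

```latex
\begin{gather*}
\hat{W} = \arg\max_{W} \, pr(W \mid S) \\
\hat{W} = \arg\max_{W} \, \frac{pr(W)\, pr(S \mid W)}{pr(S)}
        = \arg\max_{W} \, pr(W)\, pr(S \mid W) \\
pr(W) = \prod_{i=1}^{n} pr(w_i \mid w_1, \ldots, w_{i-1})
      \approx \prod_{i=1}^{n} pr(w_i \mid w_{i-1}) \\
pr(S \mid W) = \prod_{i=1}^{n} pr(s_i \mid w_i) \\
\hat{W} = \arg\max_{w_1, \ldots, w_n} \, \prod_{i=1}^{n} pr(w_i \mid w_{i-1})\, pr(s_i \mid w_i)
\end{gather*}
```

The last line makes explicit why only two quantities need to be estimated: one bigram term and one channel term per position.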
The word-bigram probability, $pr(w_i \mid w_{i-1})$, can be estimated by a maximum likelihood estimator (MLE): $pr_{ML}(w_i \mid w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})$, where $c(w_{i-1})$ is the number of times that $w_{i-1}$ occurs in the text and $c(w_{i-1}, w_i)$ is the number of times that the word bigram $(w_{i-1}, w_i)$ occurs in the text.</Paragraph> <Paragraph position="13"> However, the MLE assigns zero probability to any bigram that does not occur in the training text, so the estimation of unseen bigrams is a problem. We use a back-off model similar to that described in [Dagan &amp; Pereira 1994] to estimate the word-bigram probabilities in our system. Once we have estimates of the probabilities $pr(w_i \mid w_{i-1})$ and $pr(s_i \mid w_i)$, the Viterbi algorithm [Charniak 1993] can be used to determine the best word sequence for the given sentence. Details of the back-off model and Viterbi algorithm can be found in [Dagan &amp; Pereira 1994] and [Charniak 1993].</Paragraph> </Section> <Section position="2" start_page="90" end_page="91" type="sub_section"> <SectionTitle> 2.2 Estimate of Channel Probabilities and Learning of Character Confusion Table </SectionTitle> <Paragraph position="0"> The probability $pr(s \mid w)$ (the conditional probability that, given a word $w$, it is recognized by the OCR software as the string $s$) can be estimated by the confusion probabilities of the characters in $s$ if we assume that character recognition in OCR is an independent process.</Paragraph> <Paragraph position="1"> We assume that an OCR string is generated from the original word by one or more of the following operations: (a) delete a character; (b) insert a character; or (c) substitute one character for another. Under such circumstances, a dynamic programming method can be used to determine the operations that maximize the conditional probability when transforming the original word to the OCR string, given a character confusion probability table.</Paragraph> <Paragraph position="2"> Let $t_1, t_2, \ldots, t_i$ be the first $i$ characters of the string that is produced by the OCR process for a source word $s$ and let $s_1, s_2, \ldots, s_j$ be the first $j$ actual characters of $s$. Define $pr(i \mid j)$ to be the conditional probability that the substring $s_{1,j}$ is recognized as substring $t_{1,i}$ by the OCR process, i.e., $pr(t_{1,i} \mid s_{1,j})$. The dynamic programming recurrence is given as follows:</Paragraph> <Paragraph position="4"> where $pr(ins(y))$ is the probability that letter $y$ is inserted.</Paragraph> <Paragraph position="5"> $pr(del(y) \mid y)$ is the probability that letter $y$ is deleted.</Paragraph> <Paragraph position="6"> $pr(x \mid y)$ is the probability that letter $y$ is replaced by letter $x$.</Paragraph> <Paragraph position="7"> For example, suppose that source word &quot;flag&quot; is recognized as &quot;flo&quot; by an OCR device. Formula 8 may determine that a sequence of four operations--(1) substitute &quot;f&quot; for &quot;f&quot;; (2) substitute &quot;l&quot; for &quot;l&quot;; (3) substitute &quot;o&quot; for &quot;a&quot;; and (4) delete &quot;g&quot;--maximizes the conditional probability pr(&quot;flo&quot; | &quot;flag&quot;). Then the probability of &quot;flag&quot; being rendered as &quot;flo&quot; can be estimated as: pr(&quot;flo&quot; | &quot;flag&quot;) = pr(&quot;f&quot; | &quot;f&quot;) * pr(&quot;l&quot; | &quot;l&quot;) * pr(&quot;o&quot; | &quot;a&quot;) * pr(del(&quot;g&quot;) | &quot;g&quot;). This method is similar to what was described in [Wagner 1974], where the minimum edit distance between two strings was computed. The minimum edit distance is the minimum number of operations that transform the source string to the target string. 
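A minimal Python sketch of such a dynamic program follows. It assumes the recurrence takes the usual three-way form (insertion, deletion, substitution) with the maximum taken over the three predecessor cells; the function names, the argument order, and the toy probabilities are illustrative assumptions, not values from the system or its experiments.

```python
from typing import Callable


def channel_probability(
    source: str,
    ocr: str,
    p_sub: Callable[[str, str], float],
    p_ins: Callable[[str], float],
    p_del: Callable[[str], float],
) -> float:
    """Probability of the best edit sequence turning `source` into `ocr`."""
    rows, cols = len(ocr) + 1, len(source) + 1
    # table[i][j]: best probability that source[:j] is recognized as ocr[:i]
    table = [[0.0] * cols for _ in range(rows)]
    table[0][0] = 1.0
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            best = 0.0
            if i >= 1:  # ocr[i-1] was inserted by the OCR process
                best = max(best, table[i - 1][j] * p_ins(ocr[i - 1]))
            if j >= 1:  # source[j-1] was deleted by the OCR process
                best = max(best, table[i][j - 1] * p_del(source[j - 1]))
            if i >= 1 and j >= 1:  # source[j-1] was read as ocr[i-1]
                best = max(best, table[i - 1][j - 1] * p_sub(ocr[i - 1], source[j - 1]))
            table[i][j] = best
    return table[-1][-1]


# Toy confusion probabilities, made up for illustration only.
def toy_sub(x: str, y: str) -> float:
    """Hypothetical pr(x | y): source letter y is read as x."""
    if x == y:
        return 0.9
    return 0.05 if (x, y) == ("o", "a") else 0.01


def toy_ins(x: str) -> float:
    return 0.001


def toy_del(y: str) -> float:
    return 0.002


flag_as_flo = channel_probability("flag", "flo", toy_sub, toy_ins, toy_del)
```

With these toy values the best alignment is the one described in the example: two correct recognitions, one substitution of "o" for "a", and one deletion of "g".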
Note that to effect spelling correction, we could include character transposition probabilities.</Paragraph> <Paragraph position="8"> If we have no information about the character confusion probabilities, we can estimate them as:</Paragraph> <Paragraph position="10"> where $N$ is the total number of printable characters.</Paragraph> <Paragraph position="11"> The estimator $\alpha$ can be regarded as the probability that a given character is correctly recognized. Our experiments show that system performance is very sensitive to the value of $\alpha$, especially for real-word error correction. For example, if $\alpha$ is very high, then the probability $pr(s \mid s)$ will be so high that subsequent processing cannot override it, and the string $s$ will not be changed. On the other hand, if $\alpha$ is very low, some correct words may be detected as real-word errors and will be changed.</Paragraph> <Paragraph position="12"> If we have both the original text and the corresponding OCR output and if we assume that the errors made by a particular OCR system are not random (but semi-deterministic), we can count the cases of substitution, deletion, and insertion using a method similar to computing the minimum edit distance between strings [Wagner 1974] and we can estimate the probabilities using formulas similar to those in [Church &amp; Gale 1991]:</Paragraph> <Paragraph position="14"> Obviously, in practice, we typically do not have the original text to compare to the OCR text or to use for correction. Moreover, as noted in [Liu et al. 1991], the character confusion characteristics are heavily dependent on the OCR environment, encompassing everything from the performance biases of the specific OCR software to the size of characters in the source text, fonts used, individual character types, and print quality of the text being processed. 
It is not feasible to train on texts to acquire character confusion probabilities for each OCR environment.</Paragraph> <Paragraph position="15"> The current system employs an iterative learning-from-correcting technique that treats the corrected OCR text as an approximation of the original text. The system starts by assuming all characters are equally likely to be misrecognized (with some uniform, small probability) and learns the character confusion probabilities by comparing the OCR text to the corrected OCR text after each pass. The learned character confusion probabilities are then used in the next pass (feedback processing). This method proves to be quite effective in improving system performance.</Paragraph> </Section> <Section position="3" start_page="91" end_page="92" type="sub_section"> <SectionTitle> 2.3 Generation of Word Candidates for a Given String </SectionTitle> <Paragraph position="0"> Ideally, each word, $w$, in the lexicon should be compared to a given OCR string, $s$, to compute the conditional probability, $pr(w \mid s)$. However, this approach would be computationally too expensive.</Paragraph> <Paragraph position="1"> Instead, the system operates in two steps: it first generates the candidates and then limits them to a maximal number of candidates, N, to be considered for the correction of an OCR string.</Paragraph> <Paragraph position="2"> In step 1, the system retrieves a large list of word candidates for a given string. To nominate candidates, we use a vector space information retrieval technique [Salton 1989]: all the words in the lexicon are indexed by letter n-grams and the (OCR) string, also parsed into letter n-grams, is treated as a query over the database of lexicon entries. In particular, all words (or OCR strings) are indexed by their letter trigrams, including the 'beginning' and 'end' spaces surrounding the string. Words of four or fewer characters are also indexed by their letter bigrams. 
For example: &quot;the&quot; → {#th, the, he#, #t, th, he, e#}; &quot;example&quot; → {#ex, exa, xam, amp, mpl, ple, le#}. A given OCR string to be corrected is represented by a vector containing its letter n-grams.</Paragraph> <Paragraph position="3"> Using the vector as the query, the lexicon words that are similar to the word error are retrieved, giving a large list of candidate correct forms. Candidates must share at least some features with the input string (query). A ranked list can be generated by scoring matches using a simple term frequency (TF) count, that is, the number of matches between the query vector and the n-gram vector of a candidate word. For example, given the string: &quot;exanple&quot; → {#exanple#} → {#ex, exa, xan, anp, npl, ple, le#}, the word &quot;example&quot; is a candidate: &quot;example&quot; → {#example#} → {#ex, exa, xam, amp, mpl, ple, le#}. Since the two items share four letter n-grams (&quot;#ex&quot;, &quot;exa&quot;, &quot;ple&quot;, and &quot;le#&quot;), the TF score of the candidate word &quot;example&quot; for the input string &quot;exanple&quot; is four. Note also that the TF score can be used to establish a threshold or cutoff score to limit the number of candidates to consider. In step 2, the system re-ranks the words in the candidate list using channel probabilities as described above.</Paragraph> <Paragraph position="4"> On average, the system generates several hundred candidates for a given string. Only the first N candidates are retained for context-based word-error correction.</Paragraph> </Section> <Section position="4" start_page="92" end_page="94" type="sub_section"> <SectionTitle> 2.4 The Word Correction System for OCR Post-Processing </SectionTitle> <Paragraph position="0"> The architecture of the word correction system for OCR post-processing is given in Figure 1.</Paragraph> <Paragraph position="1"> The lexicon is generated from the training text; it includes all the words in the training set with frequency greater than the preset threshold. 
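The letter n-gram indexing and TF scoring of Section 2.3 can be sketched in Python as follows. This is an illustrative sketch, not the system's implementation: the inverted-index layout, the function names, the tie-breaking rule, and the four-word toy lexicon are all assumptions made for the example.

```python
from collections import Counter, defaultdict

SHORT_WORD_MAX = 4  # words this short are also indexed by letter bigrams


def letter_ngrams(word: str) -> set[str]:
    """Letter trigrams of '#word#', plus letter bigrams for short words."""
    padded = f"#{word}#"
    grams = {padded[k:k + 3] for k in range(len(padded) - 2)}
    if SHORT_WORD_MAX >= len(word):
        grams |= {padded[k:k + 2] for k in range(len(padded) - 1)}
    return grams


def build_index(lexicon: list[str]) -> dict[str, set[str]]:
    """Inverted index from each letter n-gram to the words containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for word in lexicon:
        for gram in letter_ngrams(word):
            index[gram].add(word)
    return index


def retrieve(ocr_string: str, index: dict[str, set[str]], limit: int) -> list[tuple[str, int]]:
    """Rank lexicon words by TF score: the number of n-grams shared with the query."""
    scores = Counter()
    for gram in letter_ngrams(ocr_string):
        for word in index.get(gram, ()):
            scores[word] += 1
    # Highest TF first; ties broken alphabetically (an arbitrary choice).
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


toy_index = build_index(["example", "exam", "simple", "the"])
top = retrieve("exanple", toy_index, limit=3)
```

Here `top` starts with `("example", 4)`, matching the TF score worked out in Section 2.3; a word such as "the", which shares no n-gram with the query, is never scored at all.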
The words in the lexicon are indexed by letter n-grams as described in the previous section.</Paragraph> <Paragraph position="2"> The overall process for correcting a sentence is as follows: 1. Read a sentence from the input OCR text.</Paragraph> <Paragraph position="3"> 2. Retrieve up to M candidates from the lexicon for each possible error. Rerank the M candidates by their channel probabilities with respect to the error string. Keep only the top N candidates for the next processing step. (In the current system, M is 10,000 and N is 10.) 3. Use the Viterbi algorithm to get the best word sequence for the strings in the sentence. Figure 2 illustrates the alternative choices and the optimal path found during the processing (correcting) of the sentence &quot;john fornd he man&quot;.</Paragraph> <Paragraph position="4"> The system requires several passes to correct an OCR text. In the first pass, the system has no information on the character confusion probabilities, so it assumes a prior belief $\alpha$ as the probability that a character is correctly recognized. The system distributes the rest of the probability uniformly among other events. (Cf. Formula 9.) In each feedback step, the system first generates a character confusion probability table by comparing the OCR text to the corrected OCR text from the last pass. It uses the new confusion table for the next-pass correction of the OCR text. (Footnote to step 2: In its non-word error mode of operation, the system treats every word that does not match a lexicon entry as a possible error. In its non-word and real-word error mode, the system treats every word as though it were a possible error.)</Paragraph> </Section> </Section> <Section position="4" start_page="94" end_page="95" type="metho"> <SectionTitle> 3 Experiments and Results </SectionTitle> <Paragraph position="0"> To test our OCR-error-correction process, we used a set of electronic documents from the Ziff-Davis (ZIFF) news wire (see footnote 3). 
The documents in the corpus are business articles in the domain of computer science and computer engineering. We used 90% of the collection for training and the remaining 10% for testing.</Paragraph> <Paragraph position="1"> The system created a lexicon and collected word-bigram sequences and statistics from the training data. Words or word-bigrams with frequency less than three were discarded. The resulting lexicon contained about 100,000 words; these were indexed using 34,847 letter n-grams. The resulting word-bigram table had about 1,000,000 entries.</Paragraph> <Paragraph position="2"> Seventy pages of ZIFF data in the test set were printed in 7-point Times font. We degraded the print quality of the documents by photocopying them on a &quot;light&quot; setting. The photocopies were then scanned by a Fujitsu 3097E scanner and the resulting images were processed by Xerox Textbridge OCR software.</Paragraph> <Paragraph position="3"> The set of documents contained 55,699 strings and the overall word error rate after OCR processing was 22.9% (12,760). For literal words in the source (only letter sequences, not alpha-numeric ones), the error rate was lower, 14.7% (8,198). Table 1 gives the number of real-word and non-word errors for literal words in the OCR data.</Paragraph> <Paragraph position="4"> We conducted three experiments: 1. Isolated-Word Error Correction: The system used only channel probabilities without considering context information, i.e., it always selected the candidate with the highest rank in the candidates list to correct a given OCR string.</Paragraph> <Paragraph position="5"> 2. Context-Dependent Non-Word Error Correction: The system used context to correct strings that did not match valid lexicon words.</Paragraph> <Paragraph position="6"> 3. 
Context-Dependent Non- and Real-Word Error Correction: The system treated all input strings as possible errors and tried to correct them by taking into account the contexts in which the strings appeared.</Paragraph> <Paragraph position="7"> In each experiment, the system conducted four correction passes: one initial pass with prior probability $\alpha = 0.99$ and three feedback passes.</Paragraph> <Paragraph position="8"> Results are given in Tables 2, 3, and 4. In all cases, we considered only those strings whose correct forms are literal words (not alpha-numerics). Note that errors can be introduced by the system when it incorrectly changes a correct word in the OCR text into another word. In fact, we distinguish two types of errors introduced by the system: errors caused by changing correct unknown words and errors caused by changing correct lexicon words. (Footnote 3: The ZIFF collection is distributed as part of the data used in the Text Retrieval Conference (TREC) evaluations. The corpus contains about 33 million words.)</Paragraph> <Paragraph position="9"> The error reduction rate was calculated by subtracting total errors from 8,198 and dividing by 8,198.</Paragraph> <Paragraph position="10"> The system, running unoptimized code on a 128MHz DEC Alpha processor, processed the test corpus at a rate of about 200 words (strings) per second for experiments 1 and 2, and 30 words (strings) per second for experiment 3.</Paragraph> </Section> <Section position="5" start_page="95" end_page="95" type="metho"> <SectionTitle> 4 Analysis </SectionTitle> <Paragraph position="0"> Based on the results, we can see that the predominant, positive effect in correction occurs in the first pass. Performance also improves significantly in the first feedback process, as the system learns the character confusion probabilities by correcting the OCR text. The second and third feedback steps have only slight effect on the error reduction rates. 
Indeed, in experiment 3, the result from the third feedback pass is actually worse than that from the second feedback pass. These results indicate that an initial pass followed by two feedback passes may optimize the method. In the following discussion, we compare the three experiments using the results obtained from the second feedback step (Feedback-2).</Paragraph> <Paragraph position="1"> As we might expect, the results from the context-based experiments are much better than those from the isolated-word experiment. The error reduction rates in experiments 2 and 3 are, respectively, 10.3 and 17.1 percentage points higher than the rate in experiment 1. This indicates that even a modest (e.g., bigram-based) representation of context is useful in selecting the best candidates for word-error correction.</Paragraph> <Paragraph position="2"> In all three experiments, the system introduced 182 new errors due to false corrections of words that were not in the lexicon. (Recall that the system lexicon is based on the words derived from the training corpus; some words may be present in the test corpus that are not in the training corpus.) Whenever the system encounters an unknown word, it treats it as a non-word error and attempts to correct it. In such cases, the system replaces the presumed non-word error with a word from its lexicon. Thus, for example, if the system encounters the word &quot;MobileData&quot; (a correct name) in the OCR output, but does not have &quot;MobileData&quot; in its lexicon, it might change &quot;MobileData&quot; to &quot;MobileComm&quot; (a word that does exist in the training corpus lexicon). Of course, such problems in processing unknown words are not unique to OCR error correction; they represent a general problem for all natural-language processing tasks.</Paragraph> <Paragraph position="3"> As shown by experiment 3, when the system uses context-based non- and real-word error correction, it achieves a total error reduction rate of 60.2%. 
This is 6.8 percentage points higher than the rate achieved in the context-based non-word experiment. The improvement in performance is gained principally from the reduction of the real-word errors. Although the system introduces additional errors (since all the strings in the OCR text are treated as possible errors and subject to change), the number of corrected real-word errors far exceeds the number of real-word errors introduced. In the second feedback pass, for example, the system introduced 141 new errors by changing correct lexicon words into other lexicon words. On the other hand, the system properly corrected 684 real-word errors, 32.1% of all the real-word errors. The corrected OCR text, therefore, has 543 fewer real-word errors than the original OCR text.</Paragraph> <Paragraph position="4"> Certain types of errors in the source or OCR-output text present systematic problems for our approach, highlighting the limitations of the system. In particular, because the process is based on the structural definition of a word (viz., a character sequence 'between white space'), not a morphological one, any errors that obscure word boundaries will defy correction. For example, run-on errors (e.g., &quot;of the&quot;/&quot;ofthe&quot;) and split-word errors (&quot;training&quot;/&quot;train ng&quot;) cannot be corrected. In addition, the use of vector-space querying to find candidate lexical entries, including our special approach to word decomposition and scoring, can present problems when processing some OCR errors, especially short strings. For example, if &quot;both&quot; (in the source) is rendered as &quot;hotn&quot; (in the OCR text), it is not possible for the system to generate &quot;both&quot; as one of the high-ranked candidates, because they share only one feature, the bigram &quot;ot&quot;, despite the fact that the conditional probability pr(&quot;hotn&quot; | &quot;both&quot;) might be high. 
Finally, the system suffers from the common limitation of word bigram or trigram models in that it cannot capture discourse properties of context, such as topic and tense, which are sometimes required to select the correct word.</Paragraph> </Section> </Paper>