<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1648">
  <Title>Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology</Title>
  <Section position="4" start_page="0" end_page="409" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"> This section reviews prior work on Arabic OCR for Arabic and OCR error correction.</Paragraph>
    <Section position="1" start_page="0" end_page="408" type="sub_section">
      <SectionTitle>
2.1 Arabic OCR
</SectionTitle>
      <Paragraph position="0"> The goal of OCR is to transform a document image into character-coded text. The usual process is to automatically segment a document image into character images in the proper reading order using image analysis heuristics, apply an automatic classifier to determine the character codes that most likely correspond to each character image, and then exploit sequential context (e.g., preceding and following characters and a list of possible words) to select the most likely character in each position. The character error rate can be influenced by reproduction quality (e.g., original documents are typically better than photocopies), the resolution at which a document was scanned, and any mismatch between the instances on which the character image classifier was trained and the rendering of the characters in the printed document. Arabic OCR presents several challenges, including: * Arabic's cursive script in which most characters are connected and their shape vary with position in the word.</Paragraph>
      <Paragraph position="1"> * The optional use of word elongations and ligatures, which are special forms of certain letter sequences.</Paragraph>
      <Paragraph position="2"> * The presence of dots in 15 of the 28 letters to distinguish between different letters and the optional use of diacritic which can be confused with dirt, dust, and speckle (Darwish and Oard, 2002).</Paragraph>
      <Paragraph position="3">  surface forms, complicates dictionary-based error correction. Arabic words are built from a closed set of about 10,000 root forms that typically contain 3 characters, although 4-character roots are not uncommon, and some 5-character roots do exist. Arabic stems are derived from these root forms by fitting the root letters into a small set of regular patterns, which sometimes includes addition of &amp;quot;infix&amp;quot; characters between two letters of the root (Ahmed, 2000).</Paragraph>
      <Paragraph position="4"> There is a number of commercial Arabic OCR systems, with Sakhr's Automatic Reader and Shonut's Omni Page being perhaps the most widely used. Retrieval of OCR degraded text documents has been reported for many languages, including English (Harding et al., 1997), Chinese (Tseng and Oard, 2001), and Arabic (Darwish and Oard, 2002).</Paragraph>
    </Section>
    <Section position="2" start_page="408" end_page="409" type="sub_section">
      <SectionTitle>
2.2 OCR Error Correction
</SectionTitle>
      <Paragraph position="0"> Much research has been done to correct recognition errors in OCR-degraded collections. There are two main categories of determining how to correct these errors. They are word-level and passage-level post-OCR processing. Some of the kinds of word level post-processing include the use of dictionary lookup, probabilistic relaxation, character and word n-gram frequency analysis (Hong, 1995), and morphological analysis (Oflazer, 1996). Passage-level post-processing techniques include the use of word ngrams, word collocations, grammar, conceptual closeness, passage level word clustering, linguistic context, and visual context. The following introduces some of the error correction techniques.</Paragraph>
      <Paragraph position="1"> * Dictionary Lookup: Dictionary Lookup, which is the basis for the correction reported in this paper, is used to compare recognized words with words in a term list (Church and Gale, 1991; Hong, 1995; Jurafsky and Martin, 2000). If a word is found in the dictionary, then it is considered correct. Otherwise, a checker attempts to find a dictionary word that might be the correct spelling of the misrecognized word.</Paragraph>
      <Paragraph position="2"> Jurafsky and Martin (2000) illustrate the use of a noisy channel model to find the correct spelling of misspelled or misrecognized words. The model assumes that text errors are due to edit operations namely insertions, deletions, and substitutions. Given two words, the number of edit operations required to transform one of the words to the other is called the Levenshtein edit distance (Baeza-Yates and Navarro, 1996). To capture the probabilities associated with different edit operations, confusion matrices are employed. Another source of evidence is the relative probabilities that candidate word corrections would be observed. These probabilities can be obtained using word frequency in text corpus (Jurafsky and Martin, 2000). However, the dictionary lookup approach has the following problems (Hong, 1995): a) A correctly recognized word might not be in the dictionary. This problem could surface if the dictionary is small, if the correct word is an acronym or a named entity that would not normally appear in a dictionary, or if the language being recognized is morphologically complex. In a morphological complex language such as Arabic, German, and Turkish the number of valid word surface forms is arbitrarily large which complicates building dictionaries for spell checking.</Paragraph>
      <Paragraph position="3"> b) A word that is misrecognized is in the dictionary. An example of that is the recognition of the word &amp;quot;tear&amp;quot; instead of &amp;quot;fear&amp;quot;. This problem is particularly acute in a language such as Arabic where a large fraction of three letters sequences are valid words.</Paragraph>
      <Paragraph position="4"> * Character N-Grams: Character n-grams maybe used alone or in combination with dictionary lookup (Lu et al., 1999; Taghva et al., 1994).</Paragraph>
      <Paragraph position="5"> The premise for using n-grams is that some letter sequences are more common than others and other letter sequences are rare or impossible. For example, the trigram &amp;quot;xzx&amp;quot; is rare in the English language, while the trigram &amp;quot;ies&amp;quot; is common. Using this method, an unusual sequence of letters can point to the position of an error in a misrecognized word. This technique is employed by BBN's Arabic OCR system (Lu et al., 1999).</Paragraph>
      <Paragraph position="6"> * Using Morphology: Many morphologically complex languages, such as Arabic, Swedish, Finnish, Turkish, and German, have enormous numbers of possible words. Accounting for and listing all the possible words is not feasible for purposes of error correction. Domeij proposed a method to build a spell checker that utilizes a stem lists and orthographic rules, which govern how a word is written, and morphotactic rules, which govern how morphemes (building blocks of meanings) are allowed to combine, to accept legal combinations of stems (Domeij et al., 1994). By breaking up compound words, dictionary lookup can be applied to individual constituent stems. Similar work was done for Turkish in which an error tolerant finite state  recognizer was employed (Oflazer, 1996). The finite state recognizer tolerated a maximum number of edit operations away from correctly spelled candidate words. This approach was initially developed to perform morphological analysis for Turkish and was extended to perform spelling correction. The techniques used for Swedish and Turkish can potentially be applied to Arabic. Much work has been done on Arabic morphology and can be potentially extended for spelling correction.</Paragraph>
      <Paragraph position="7"> * Word Clustering: Another approach tries to cluster different spellings of a word based on a weighted Levenshtein edit distance. The insight is that an important word, specially acronyms and named-entities, are likely to appear more than once in a passage. Taghva described an English recognizer that identifies acronyms and named-entities, clusters them, and then treats the words in each cluster as one word (Taghva, 1994). Applying this technique for Arabic requires accounting for morphology, because prefixes or suffixes might be affixed to instances of named entities. DeRoeck introduced a clustering technique tolerant of Arabic's complex morphology (De Roeck and Al-Fares, 2000). Perhaps the technique can be modified to make it tolerant of errors.</Paragraph>
      <Paragraph position="8"> * Using Grammar: In this approach, a passage containing spelling errors is parsed based on a language specific grammar. In a system described by Agirre (1998), an English grammar was used to parse sentences with spelling mistakes. Parsing such sentences gives clues to the expected part of speech of the word that should replace the misspelled word. Thus candidates produced by the spell checker can be filtered. Applying this technique to Arabic might prove challenging because the work on Arabic parsing has been very limited (Moussa et al., 2003).</Paragraph>
      <Paragraph position="9"> * Word N-Grams (Language Modeling): A Word n-gram is a sequence of n consecutive words in text. The word n-gram technique is a flexible method that can be used to calculate the likelihood that a word sequence would appear (Tillenius, 1996). Using this method, the candidate correction of a misspelled word might be successfully picked. For example, in the sentence &amp;quot;I bought a peece of land,&amp;quot; the possible corrections for the word peece might be &amp;quot;piece&amp;quot; and &amp;quot;peace&amp;quot;. However, using the n-gram method will likely indicate that the word trigram &amp;quot;piece of land&amp;quot; is much more likely than the trigram &amp;quot;peace of land.&amp;quot; Thus the word &amp;quot;piece&amp;quot; is a more likely correction than &amp;quot;peace&amp;quot;.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>