File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/02/c02-2014_abstr.xml
Size: 6,724 bytes
Last Modified: 2025-10-06 13:42:23
<?xml version="1.0" standalone="yes"?> <Paper uid="C02-2014"> <Title>Recognition Assistance Treating Errors in Texts Acquired from Various Recognition Processes Gabor PROSZEKY MorphoLogic</Title> <Section position="1" start_page="0" end_page="2" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Texts acquired from recognition sources--continuous speech/handwriting recognition and OCR--generally have three types of errors regardless of the characteristics of the source in particular. The output of the recognition process may be (1) poorly segmented or not segmented at all; (2) containing underspecified symbols (where the recognition process can only indicate that the symbol belongs to a specific group), e.g. shape codes; (3) containing incorrectly identified symbols. The project presented in this paper addresses these errors by developing of a unified linguistic framework called the MorphoLogic Recognition Assistant that provides feedback and corrections for various recognition processes. The framework uses customized morpho-syntactic and syntactic analysis where the lexicons and their alphabets correspond to the symbol set acquired from the recognition process. The successful framework must provide three services: (1) proper disambiguated segmentation, (2) disambiguation for underspecified symbols, (3) correction for incorrectly recognized symbols. The paper outlines the methods of morpho-syntactic and syntactic post-processing currently in use.</Paragraph> <Paragraph position="1"> Introduction Recognition processes produce a sequence of discrete symbols that usually do not entirely correspond to characters of printed text. Further on, we refer to this sequence as an input sequence.</Paragraph> <Paragraph position="2"> This framework is actually a second tier of the data flow. The user receives a black box providing linguistically sound and correctly recognized text. Inside the black box, the first tier performs the actual recognition, and the second tier carries out linguistic corrections and disambiguation.</Paragraph> <Paragraph position="3"> unified linguistic framework must perform a transformation where (1) the symbols from the recognition process are converted into characters of written text, and (2) the correlation between the original analog source and the result is the closest possible. A post-processing framework must not simply perform a symbol-to-symbol conversion. A direct conversion is either impossible (phonetic symbols of any kind do not directly correspond to printed characters) or insufficient (source symbols are underspecified or incorrectly recognized). Morpho-lexical and syntactic models can help this process as they recognize elements of the language, extracting meaningful passages from the input sequence. null Lexical databases with fully inflected forms are fairly standard for speech recognition, mainly where a small closed vocabulary is used, and new, unknown or ad hoc word formations are not required (Gibbon et al., 1997). This procedure is convenient in languages with very small inflectional paradigms. An example of a language with few inflections is English, where, in general, three forms exist for nouns and four for verbs. English is therefore not a good example for illustrating inflectional morphology.</Paragraph> <Paragraph position="4"> Agglutinative languages such as Turkish, Finnish or Hungarian, however, have complex inflectional and derivational morphology, with significantly more endings on all verbs, nouns, adjectives and pronouns. The number of endings increase the size of a basic vocabulary by a factor of thousands.</Paragraph> <Paragraph position="5"> Algorithmic morphological techniques have been developed for efficient composition of inflected forms and to avoid a finite but unmanageable explosion of lexicon size. Still, according to Althoff et al. (1996), these techniques have not been applied to any significant extent in speech technology.</Paragraph> <Paragraph position="6"> In this paper, we describe the application of a new method based on morphology and partial parsing. This method uses a unified error model with flexible symbol mapping, facilitating the use of any linguistic module with traditional orthographic lexicons--for any recognition process (OCR, handwriting, speech recognition), even for highly inflectional languages. The integrated system uses our existing morpho-syntactic modules and lexicons.</Paragraph> <Paragraph position="7"> 1 The error model The linguistic correction framework must be aware of three classes of error sources occurring in the input sequence: (1) poor or nonexistent segmentation, (2) underspecified symbols, (3) incorrectly recognised symbols.</Paragraph> <Paragraph position="8"> The input sequence does not appear in the form of written text. It comprises of complex symbol codes in a normalized format, where the codes closely correspond to the signals recognized by the particular recognition process. In the case of OCR or handwriting recognition, this can be a shape code such as <lower> indicating a group of characters. With speech recognition, this is rather a phoneme code such as <e:>. (Here we use the notation of the proposed framework.) Standard orthographic characters may also appear in the input sequence.</Paragraph> <Paragraph position="9"> With all types of recognition processes, there exists no one-to-one mapping between the symbols of the input sequence (the input alphabet) and the orthographic alphabet of the written text. The number of identified phonemes/phoneme complexes or characters/character complexes does not provide information about the number of characters to be used in the output text.</Paragraph> <Paragraph position="10"> Unlike in two-level-morphology, the framework must provide n-to-m character (symbol) mapping, where n [?] m. Mapping between speech and text chunks of different length makes the system able to offer, for example, consonant sequences instead of affricates usually represented by single characters: An example: in OCR outputs, the letter 'm' often occurs in place of the 'rn' sequence. The correction module must be able to transform single 'm'-s into 'rn'. With continuous speech recognition (and, though less frequently, continuous handwriting recognition) it is even possible that a written segment boundary--such as the end of a word or a sentence--occurs within a symbol. The framework must be aware of these schemes as well.</Paragraph> <Paragraph position="11"> The following sections present each error class with Hungarian examples to show the complexity of the linguistic model required by some languages.</Paragraph> </Section> class="xml-element"></Paper>