<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-2014">
  <Title>Recognition Assistance: Treating Errors in Texts Acquired from Various Recognition Processes. Gabor PROSZEKY, MorphoLogic</Title>
  <Section position="2" start_page="2" end_page="2" type="metho">
    <SectionTitle>
2 The basics: symbol mapping
</SectionTitle>
    <Paragraph position="0"> Atomic segments of input sequences are assumed to consist of (underspecified) symbols (phonemes/phoneme complexes, characters/character complexes). The correction framework must have a database of complex symbols--either phoneme codes or shape codes representing the classes of underspecified characters. An obvious way to acquire a database of phonetic descriptions of stems and suffixes (for morphological processing) is to convert the existing (orthographical) lexicon.</Paragraph>
    <Paragraph position="1"> However, this conversion is very complicated and may result in an extremely large database. With speech recognition, for example, all orthographic representations must be converted into as many phonetic allomorphs as possible, on the basis of a grapheme-sequence-to-phoneme-sequence conversion. This set contains every allomorph whose first or last phoneme is subject to contextual change. E.g. ket ('two') is converted to {keVt, keVt y}, because of palatalization before certain words, like nyuVl in keVt y nyuVl, written ket nyul ('two rabbits').</Paragraph>
    <Paragraph position="3"> As the above method has some obvious disadvantages, we decided to separate the symbol mapping from the linguistic processes. We have created a database mapping the recognized symbols to all possible orthographical characters/character sequences. In this scheme, the framework creates several possible orthographical sequences from the input sequence (implemented internally as a directed acyclic graph for performance reasons). The correction framework then segments and validates each sequence using 'traditional' linguistic modules with the original orthographical lexicons. The conversion database uses a unified entry format suitable for all types of recognition processes. Example: &lt;ccs&gt; ((&lt;t&gt;|((c|)c&lt;s&gt;)|)(c(&lt;s&gt;|)c&lt;s&gt;|ts)) This is a phoneme conversion entry. On the left side, a phonetic code is listed in the unified internal representation of the framework. (Note that this input symbol is the result of a mapping from the output of the recognition module.) On the right side, there is a directed acyclic graph (more or less a regular expression) describing all possible orthographic representations of the single phonetic entity.</Paragraph>
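The symbol-mapping scheme described above can be sketched as follows. This is a minimal illustration, not the paper's actual MorphoLogic database: the conversion table, the toy lexicon, and all function names are hypothetical, and a cartesian product stands in for the directed acyclic graph used in the real implementation.

```python
# Sketch: map each recognized phonetic symbol to all orthographic
# variants, expand the input into candidate orthographic sequences,
# then keep only those validated by the ordinary orthographic lexicon.
from itertools import product

# Hypothetical conversion database: phonetic code -> orthographic variants.
CONVERSION = {
    "k": ["k"],
    "eV": ["e", "é"],   # underspecified vowel: plain or accented spelling
    "t": ["t", "tt"],   # single or geminate spelling
}

LEXICON = {"két"}  # toy orthographic lexicon ('two')

def expand(symbols):
    """Enumerate all orthographic sequences for a phonetic symbol list.

    The real framework builds a directed acyclic graph for performance;
    a cartesian product is enough for illustration.
    """
    variants = [CONVERSION[s] for s in symbols]
    return {"".join(combo) for combo in product(*variants)}

def validate(symbols):
    """Keep only the candidates found in the orthographic lexicon."""
    return sorted(expand(symbols) & LEXICON)
```

For the input symbols `["k", "eV", "t"]` this generates four orthographic candidates, of which only the lexicon-validated `két` survives.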
    <Paragraph position="4"> This is the core idea of the framework: the separate conversion process provides for an open architecture where the framework can be attached to any recognition process, and even the linguistic modules are replaceable.</Paragraph>
  </Section>
  <Section position="3" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 Morpho-lexical segmentation
</SectionTitle>
    <Paragraph position="0"> For the simplest example, let us assume that the input sequence consists of phonetic symbols with no segmentation; pauses, however, are indicated by the recognition process. The input sequence is processed symbol by symbol, and when the segmenter encounters a potential segment boundary, it registers the boundary and checks whether the phonetic processor saw any pause, stress or other sign of segmentation at that position in the original speech signal. This may require some interaction with the speech recognizer, but for the sake of simplicity, we describe the operation of the linguistic subsystem only.</Paragraph>
    <Paragraph position="1"> The original architecture design conceives of the framework as a feedback service, one that requests further information from the recognition source. In the current implementation, however, the correction framework can be separated from the recognition process and provide corrected and disambiguated text without feedback to the recognition module.</Paragraph>
    <Paragraph position="2"> In the analysis of the unsegmented signal (see Figure 1), for example, the input slice vonateVr has three morpho-lexically equally likely segmentations: von a ter, vonat er, and vonater. Either the acoustic signal contains information confirming or rejecting some of them, or all of them are temporarily kept and the segmentation process itself filters them out later on. In Figure 1, after reading some further symbols from the input, it becomes clear that the only orthographically correct word boundary is between vonat and erkezik.</Paragraph>
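The idea of keeping all morpho-lexically possible segmentations alive until later input disambiguates them can be sketched in a few lines. The lexicon below is a toy stand-in for the paper's Hungarian examples (von, a, ter, vonat, er, erkezik), and the recursive enumeration is an illustration, not the framework's actual algorithm.

```python
# Sketch: enumerate every way to split an unsegmented character stream
# into lexicon words; ambiguity shrinks as more input arrives.

LEXICON = {"von", "a", "ter", "vonat", "er", "erkezik"}  # toy lexicon

def segmentations(text, prefix=None):
    """Return every split of `text` into lexicon words (recursive search)."""
    if prefix is None:
        prefix = []
    if not text:
        return [prefix]  # consumed the whole input: one valid segmentation
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in LEXICON:
            # keep this boundary hypothesis and segment the remainder
            results.extend(segmentations(text[end:], prefix + [word]))
    return results
```

The short slice `vonater` is ambiguous (`von a ter` vs `vonat er`), but once the further symbols of `erkezik` arrive, only the boundary between `vonat` and `erkezik` survives, mirroring the Figure 1 discussion.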
  </Section>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4 Underspecified forms
</SectionTitle>
    <Paragraph position="0"> It is quite common that the recognition process cannot perfectly identify segments in the original signal source. These are the cases of underspecification. Let us assume that a speech recognition process is unable to identify the value of the binary feature VOICED. In such cases, the linguistic subsystem attempts to find orthographically well-formed morpho-lexical constructions for both the voiced and the voiceless variant of the phoneme in question. In fact, underspecified forms of the input signal are represented either by lists of possible characters--like set representations in two-level morphology (Koskenniemi, 1983):</Paragraph>
    <Paragraph position="2"> ('train is arriving to the second platform') or by underspecified feature complexes:</Paragraph>
    <Paragraph position="4"> where D, G and V are d, g and v, respectively, but not specified as voiced or voiceless.</Paragraph>
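Resolving such an underspecified VOICED feature can be sketched as follows. The pairing table and the toy lexicon are illustrative assumptions, not the paper's data: each capital symbol stands for a consonant whose voicing the recognizer could not determine, so both variants are tried against the lexicon.

```python
# Sketch: expand underspecified voiced/voiceless symbols and keep only
# the orthographically well-formed (lexicon-validated) variants.
from itertools import product

UNDERSPECIFIED = {"D": "dt", "G": "gk", "V": "vf"}  # voiced/voiceless pairs

LEXICON = {"vonat", "fonat"}  # toy lexicon: 'train' and 'braid'

def resolve(word):
    """Expand underspecified symbols; return lexicon-validated spellings."""
    choices = [UNDERSPECIFIED.get(ch, ch) for ch in word]
    return sorted(
        "".join(combo) for combo in product(*choices)
        if "".join(combo) in LEXICON
    )
```

For the input `Vonat` (initial consonant underspecified between v and f), both expansions happen to be words, so both are kept and passed on to the higher-level filters described in the next section.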
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
5 Using higher-level linguistic processes
</SectionTitle>
    <Paragraph position="5"> The linguistic correction framework operates rather inefficiently if it uses morpho-lexical processing only. This results in extreme ambiguity: numerous overgenerated orthographic patterns appear with grammatically incorrect segmentation.</Paragraph>
    <Paragraph position="6"> Thus the process must be improved by adding higher-level linguistic analysis. Currently, the framework uses a partial syntax similar to the mechanism applied in the Hungarian grammar checker module. This partial syntax describes particular syntactic phenomena in order to identify incorrect grammar beyond word boundaries.</Paragraph>
    <Paragraph position="7"> A more efficient post-processing filter is being developed by applying the HumorESK parser module (Proszeky, 1996). Figure 2 shows the possible segmentations of the morphology-only system. In this figure, an asterisk marks syntactically unmotivated word sequences filtered out by the partial syntax or the full parser, operating as a higher-level segmenter.</Paragraph>
    <Paragraph position="8"> In the first 10 segmentations, the personal pronoun ti (2nd person, pl.) does not agree with either the verb ir (3rd person, sing.) or the verb irok (1st person, sing.). Syntactically, the last two segmentations can be accepted (though semantically, and according to topic-focus articulation, Nr. 11 is bizarre). In most cases, the segmentation containing the longest matches in the input sequence is the best orthographical candidate.</Paragraph>
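The agreement filter and the longest-match preference just described can be sketched together. The feature table is hand-coded and hypothetical (a real system would obtain these features from morphological analysis), and the function names are illustrative.

```python
# Sketch: discard segmentations whose subject pronoun disagrees with the
# verb in person/number, then prefer the segmentation with the fewest
# (i.e. longest) segments among the survivors.

# person/number features of a few toy tokens (hypothetical values)
FEATURES = {
    "ti": ("pronoun", 2, "pl"),    # 'you' (2nd person plural)
    "ir": ("verb", 3, "sg"),       # 'writes' (3rd person singular)
    "irok": ("verb", 1, "sg"),     # 'I write' (1st person singular)
}

def agrees(segmentation):
    """Reject a segmentation whose pronoun and verb disagree."""
    pron = next((FEATURES[w] for w in segmentation
                 if w in FEATURES and FEATURES[w][0] == "pronoun"), None)
    verb = next((FEATURES[w] for w in segmentation
                 if w in FEATURES and FEATURES[w][0] == "verb"), None)
    if pron is None or verb is None:
        return True  # nothing to check
    return pron[1:] == verb[1:]  # same person and number required

def best(candidates):
    """Among agreeing segmentations, prefer the one with fewest segments."""
    ok = [s for s in candidates if agrees(s)]
    return min(ok, key=len) if ok else None
```

Applied to the Figure 2 candidates, the filter rejects every segmentation containing ti with a disagreeing verb, and the longest-match preference selects Nr. 12 (nyelveszeti cikket irok) over Nr. 11.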
  </Section>
  <Section position="6" start_page="2" end_page="2" type="metho">
    <SectionTitle>
6 Further development
</SectionTitle>
    <Paragraph position="0"> Morpho-lexical and syntactic segmentation and correction can be very useful in improving the quality of 'traditional' recognition sources. However, it is important to emphasize that the proposed framework would only support existing recognition methods (e.g. likelihood-based mechanisms in speech recognition) rather than replace them. The current design of the framework makes no assumptions about the operation of the underlying recognition process and does not prefer any method over another. In terms of architecture, the correction framework's operation is separated from the recognition module.</Paragraph>
    <Paragraph position="1"> One of the aims of this project is, however, a better interaction between the linguistic and the recognition subsystems. As a first step, this requires a standard feedback interface (yet to be developed).</Paragraph>
    <Paragraph position="2"> Because the current implementation of the MorphoLogic Recognition Assistant framework makes no assumptions about the recognition subsystem, it cannot influence its operation. A standard feedback interface consists of a formalism describing the interaction between a recognition source and the correction framework, regardless of the characteristics of the recognition subsystem.</Paragraph>
    <Paragraph position="3"> Stub modules must be developed to communicate with existing recognition systems.</Paragraph>
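One way such a stub contract might look is sketched below. All class and method names here are hypothetical: the paper explicitly leaves the actual feedback formalism to future work, so this only illustrates the shape of an adapter between a recognition source and the correction framework.

```python
# Sketch: an abstract contract a recognizer-specific stub module would
# implement, plus a trivial stand-in used for testing the framework
# without a real recognizer.
from abc import ABC, abstractmethod

class RecognitionSource(ABC):
    """Hypothetical contract between a recognizer and the framework."""

    @abstractmethod
    def next_symbols(self):
        """Yield the next underspecified symbol complexes."""

    @abstractmethod
    def confirm_boundary(self, position):
        """Report whether the signal shows a pause/stress at `position`."""

class StubSource(RecognitionSource):
    """Trivial stand-in emulating a recognizer for testing."""

    def __init__(self, symbols, boundaries):
        self.symbols = symbols
        self.boundaries = set(boundaries)

    def next_symbols(self):
        yield from self.symbols

    def confirm_boundary(self, position):
        return position in self.boundaries
```

Because the correction framework would talk only to this interface, any recognizer (speech, OCR, handwriting) could be attached by writing one such stub, which is exactly the openness the architecture aims for.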
    <Paragraph position="4"> An example of the dialogue between a phonetic and a linguistic subsystem: first, a superficial acoustic-phonetic analysis offers some sequence of underspecified feature complexes, then the linguistic subsystem attempts to transform them into potential orthographically correct units with surface word boundaries. Finally, the phonetic system
[Figure 2: possible segmentations produced by the morphology-only system]
1. *nyel vesz e ti cikket ir ok
2. *nyel vesz e ti cikket irok
3. *nyel vesze ti cikket ir ok
4. *nyel vesze ti cikket irok
5. *nyelv esz e ti cikket ir ok
6. *nyelv esz e ti cikket irok
7. *nyelvesz e ti cikket ir ok
8. *nyelvesz e ti cikket irok
9. *nyelvesze ti cikket ir ok
10. *nyelvesze ti cikket irok
11. nyelveszeti cikket ir ok
12. nyelveszeti cikket irok
('I am writing a linguistic paper.')</Paragraph>
  </Section>
</Paper>