<?xml version="1.0" standalone="yes"?> <Paper uid="A92-1017"> <Title>Overview of new Phenomena for Spelling Correction</Title> <Section position="3" start_page="126" end_page="128" type="intro"> <SectionTitle> 3 Method </SectionTitle> <Paragraph position="0"> According to a distinction made in the literature (Pollock and Zamora, 1984; Salton, 1989), there are two main approaches to automatic spelling correction: the 'absolute' approach &quot;consists of using a dictionary of commonly misspelled words, and replacing any misspelling detected in a text by the corrected version listed in the dictionary&quot; (Salton, 1989), while the 'relative' approach consists of locating, in a conventional dictionary of correct spellings, words &quot;that are most similar to a misspelling and selecting a correction from these. Generally, the selection method is based on maximizing similarity or minimizing the string-to-string edit distance&quot; (Pollock and Zamora, 1984).</Paragraph> <Paragraph position="1"> Although there is some use of 'absolute' methods in some systems (Pollock and Zamora, 1984), &quot;referencing a dictionary of correctly spelled words&quot; (Frisch and Zamora, 1988) is standard. On that basis, most of the purely motoric single word errors, or &quot;typographical errors&quot; (Berkel and Smedt, 1990), can be corrected. 
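The 'relative' approach of minimizing string-to-string edit distance can be sketched as follows; this is an illustrative reconstruction, not the implementation of any cited system, and the small dictionary is a hypothetical example.

```python
# Minimal sketch of the 'relative' approach: select, from a dictionary
# of correct spellings, the word with the smallest edit distance to the
# misspelling. The dictionary below is a hypothetical example.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct(word: str, dictionary: list[str]) -> str:
    """Return the dictionary entry most similar to `word`."""
    return min(dictionary, key=lambda w: edit_distance(word, w))

print(correct("speling", ["spelling", "spieling", "peeling"]))
```

Real systems additionally prune the candidate set (e.g., by word length or first letter) before computing distances, since scanning a full dictionary per misspelling is costly.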
Some conventional systems additionally try to cope with a certain subset of cognitive, or &quot;orthographical&quot; (Berkel and Smedt, 1990), errors which &quot;result in homophonous strings&quot; and involve &quot;some kind of phonemic transcription&quot; (Berkel and Smedt, 1990) for their correction.</Paragraph> <Paragraph position="2"> Since the cognitive errors outlined in 1 and 2 above are non-standard, in the sense that they are neither motoric (by definition) nor phonologically motivated, a straightforward method to correct them is the 'absolute' one of directly encoding error patterns in a lexicon and replacing each matching occurrence in a text by the correction listed in the system lexicon.</Paragraph> <Paragraph position="3"> Now, in order to treat single (non-) words and compounds in a uniform way, each entry in the system lexicon is modelled as a quintuple <W,L,R,C,E> specifying a pattern of a (multi-) word W for which a correction C will be proposed accompanied by an explanation E iff a given match of W against some passage in the text under scrutiny differs significantly from C and the - possibly empty - left and right contexts L and R of W also match the environment of W's counterpart in the text.</Paragraph> <Paragraph position="4"> Disregarding E for a moment, this is tantamount to saying that each such record is interpreted as a string rewriting rule W --> C / L _ R replacing W (e.g.: Bezug) by C (e.g.: bezug) in the environment L _ R (e.g.: in _ auf).</Paragraph> <Paragraph position="5"> The form of these productions can best be characterized with an eye to the Chomsky hierarchy as unrestricted, since we can have any non-null number of symbols on the LHS replaced by any number of symbols on the RHS, possibly by null (Partee et al., 1990).</Paragraph> <Paragraph position="6"> With an eye to semi-Thue or extended axiomatic systems one could say that a linearly ordered sequence of strings W, C1, C2, ..., Cm is a derivation of Cm iff (1) W is a (faulty) 
string (in the text to be corrected) and (2) each Ci follows from the immediately preceding string by one of the productions listed in the lexicon (Partee et al., 1990).</Paragraph> <Paragraph position="7"> Thus, theoretically, a single mistake can be corrected by applying a whole sequence of productions, though in practice the default is clearly that a correction be done in a single derivational step, at least as long as the system is just operating on strings and not on additional non-terminal symbols.</Paragraph> <Paragraph position="8"> Occurrences of W, L, and R in a text are recognized by pattern matching techniques. An error pattern W ignores the particularly error-prone aspects upper/lower case and word separator (see the examples in 2 above). It thus matches both the correct and incorrect spellings with respect to these features.</Paragraph> <Paragraph position="9"> Besides wildcards for characters, like &quot;*&quot;, a pattern for W, L, or R may also contain wildcards for words, allowing, for example, the specification of a maximal distance of L or R with respect to W. Since the types of errors discussed here only occur within sentences, such a distant match has to be restricted by the sentence boundaries. Thus, by having the system operate sentencewise, any left or right context is naturally restricted to be some string within the same sentence as W or to be a boundary of that sentence (e.g.: a punctuation mark).</Paragraph> <Paragraph position="10"> Any left or right context is either a positive or a negative one, i.e., its components are homogeneously either required or forbidden in order for the corresponding rule to fire. So far it has not been necessary to allow for mixed modes within a left or right context.</Paragraph> <Paragraph position="11"> When a correction C is proposed to the user, a message is additionally displayed identifying the reason why C is correct rather than W. 
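A lexicon entry and its rule-like, context-sensitive application can be sketched as follows. The sample rule (Bezug -> bezug in the environment in _ auf) is the example given above; the record layout, the regex-based context matching, and the case-insensitive comparison are assumptions of this sketch, not the paper's implementation.

```python
import re
from dataclasses import dataclass

@dataclass
class Entry:
    """One lexicon entry <W, L, R, C, E>."""
    w: str      # (multi-)word error pattern, matched case-insensitively
    left: str   # required left-context word L
    right: str  # required right-context word R
    c: str      # correction proposed for W
    e: str      # explanation shown to the user

def apply_entry(entry: Entry, sentence: str) -> str:
    """Rewrite W as C in the environment L _ W _ R, within one sentence."""
    # Case is deliberately ignored in W, so both the correct and the
    # incorrect spelling match; substituting C restores the correct form.
    pattern = re.compile(
        rf"(?<=\b{entry.left}\s)({re.escape(entry.w)})(?=\s{entry.right}\b)",
        re.IGNORECASE,
    )
    return pattern.sub(entry.c, sentence)

# The example from above: 'Bezug' -> 'bezug' in the environment 'in _ auf'.
rule = Entry(w="Bezug", left="in", right="auf", c="bezug",
             e="the idiom 'in bezug auf' is written lower-case")
print(apply_entry(rule, "Das steht in Bezug auf den Text."))
```

A full system would additionally support word wildcards with a maximal distance between L, W, and R, and negative contexts that block a rule when forbidden material is present.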
Depending on the user's knowledge of the language under investigation, the user can take this either as an opportunity to learn or as help in deciding whether to finally accept or reject the proposal.</Paragraph> <Paragraph position="12"> There are two kinds of explanations, absolute and conditional ones. Whereas absolute explanations indicate that the system has necessary and sufficient evidence for W's deviance, there clearly are cases where either W or C could be correct and this question cannot be decided on the basis of the system's lexical information alone. In these cases, a conditional or if-then explanation is given to the user, offering a higher-level decision criterion which the system itself is unable to apply.</Paragraph> <Paragraph position="13"> Take, as an example, the sentence Dieser Film betrifft Alt und Jung.</Paragraph> <Paragraph position="14"> which clearly allows for two readings, one which renders &quot;Alt und Jung&quot; as the false spelling of the idiomatic expression &quot;alt und jung&quot; meaning &quot;everybody&quot;, and another one which takes &quot;Alt und Jung&quot; as the correct form that literally designates the old and the young while excluding the middle-aged. Thus, substitutability by &quot;jedermann&quot; (i.e.: &quot;everybody&quot;) would be an adequate decision criterion to convey to the user. Although the method described above introduces a new kind of lexical data, its (higher-level) error correction still operates on nothing but strings. 
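The distinction between absolute and conditional explanations can be illustrated as follows; the rule contents follow the examples above, while the record format and message wording are hypothetical.

```python
# Sketch of an absolute vs. a conditional (if-then) explanation record.
# The record format is an assumed illustration, not the paper's format.

absolute = {
    "w": "Bezug", "c": "bezug", "kind": "absolute",
    "explanation": "In the idiom 'in bezug auf', 'bezug' is lower-case.",
}

conditional = {
    "w": "Alt und Jung", "c": "alt und jung", "kind": "conditional",
    "explanation": ("If 'Alt und Jung' can be replaced by 'jedermann' "
                    "('everybody'), write 'alt und jung'; otherwise the "
                    "literal reading 'Alt und Jung' is correct."),
}

def report(rule: dict) -> str:
    """Format the message shown alongside a proposed correction."""
    return f"{rule['w']} -> {rule['c']} [{rule['kind']}]: {rule['explanation']}"

print(report(conditional))
```

The conditional message hands the user exactly the higher-level test (substitutability by "jedermann") that the string-based system cannot apply itself.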
No deep and time-consuming analysis, like parsing, is involved.</Paragraph> <Paragraph position="15"> Restricting the system in that way makes our approach to context-sensitivity different from the one considered in (Rimon and Herz, 1991), where context-sensitive spelling verification is done with the help of &quot;local constraints automata (LCAs)&quot; which process contextual constraints on the level of lexical or syntactic categories rather than on the basic level of strings. In fact, proof-reading with LCAs rather amounts to genuine grammar checking and as such belongs to a different and higher level of language checking.</Paragraph> <Paragraph position="16"> Context-sensitive spelling checking, as proposed here, can be regarded as a checking level in its own right, lying in between checking on the word level and grammar checking. It thus could complement the two-level checker discussed in (Vosse, 1992) by correcting especially those errors in idiomatic expressions, like &quot;te alle tijden&quot; -> &quot;te allen tijde&quot;, which cannot be detected on word or sentence level; compare (Vosse, 1992).</Paragraph> </Section> </Paper>