<?xml version="1.0" standalone="yes"?>
<Paper uid="A92-1017">
  <Title>Overview of new Phenomena for Spelling Correction</Title>
  <Section position="4" start_page="128" end_page="129" type="metho">
    <SectionTitle>
4 A Processing Model
</SectionTitle>
    <Paragraph position="0"> A good model of the system is given by a deterministic multitape Turing machine (Hopcroft and Ullman, 1979) consisting of a finite control with, in effect, three tapes and tape heads. The following description relates to the sentence level: Initially, the input appears on the first tape, with each of the tape's cells containing either a word, a blank (symbolized below by a single "B"), or a left or right sentence boundary symbol.</Paragraph>
    <Paragraph position="1"> Thus, any input sentence can be stored by a finite sequence of cells.</Paragraph>
    <Paragraph position="2"> The second tape holds a read-only copy of the initial text. While the first tape will be rewritten, the second serves only as a reference tape. The third tape is also read-only; it holds the finite sequence of lexicon entries.</Paragraph>
    <Paragraph position="3"> Consider the following snapshot of the system:
T1: B in B Bezug B auf B
T2: B in B Bezug B auf B
T3: /b/ezug (1in )1auf bezug
where "Bezug" has been scanned on the reference tape T2, and a pattern /b/ezug has been found in the lexicon T3. The pattern ignores upper/lower case in the match but requires lower-case "bezug" whenever "in" is found 1 word to the left (as expressed by "(1in") and "auf" is found 1 non-blank cell to the right on T2.</Paragraph>
    <Paragraph position="4"> Since the corresponding contexts of /b/ezug can be verified on T2 (by simply moving T2's head to the respective cells, scanning their contents, and comparing these with the relevant information on T3), the error "Bezug" is finally corrected on T1 and a new checking cycle is started with the next word:
T1: B in B bezug B auf B
T2: B in B Bezug B auf B
Note, as should be clear from the outset, that a previous correction on T1 is not available as a context for any subsequent word under scrutiny; only the uncorrected words on T2 are.</Paragraph>
    <Paragraph position="5"> Thus, if it were counterfactually the case that "auf" had to be corrected somehow whenever it appeared to the right of "bezug" as opposed to "Bezug", and given the input of the above example, our system, though producing "bezug" as the left context of "auf" on T1, would clearly fail to correct "auf", since it would still take every context from T2.</Paragraph>
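The checking cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' C implementation: the Entry class, its field names, and the function names are hypothetical, sentence boundary symbols are left out, and a lexicon entry is reduced to one pattern, its left and right context words, and a replacement.

```python
from dataclasses import dataclass

BLANK = "B"  # blank cell symbol, as in the snapshots


@dataclass(frozen=True)
class Entry:
    """Hypothetical lexicon entry: pattern, contexts, and replacement."""
    pattern: str      # matched against T2 ignoring upper/lower case
    left: tuple       # words required to the left, nearest first
    right: tuple      # words required to the right, nearest first
    replacement: str  # form written to T1 when all contexts hold


def words_with_cells(tape):
    """Return (cell index, word) pairs for the non-blank cells of a tape."""
    return [(i, c) for i, c in enumerate(tape) if c != BLANK]


def correct_sentence(cells, lexicon):
    """Run one pass over a sentence; contexts are always read from T2."""
    t1 = list(cells)    # working tape, rewritten during the pass
    t2 = tuple(cells)   # read-only reference copy of the input
    words = words_with_cells(t2)
    for k, (cell, word) in enumerate(words):
        for entry in lexicon:  # T3: finite sequence of lexicon entries
            if word.lower() != entry.pattern.lower():
                continue
            left = [w for _, w in reversed(words[:k])]
            right = [w for _, w in words[k + 1:]]
            if (tuple(left[:len(entry.left)]) == entry.left
                    and tuple(right[:len(entry.right)]) == entry.right):
                t1[cell] = entry.replacement  # correction goes to T1 only
                break
    return t1


LEXICON = [Entry("bezug", left=("in",), right=("auf",), replacement="bezug")]
T1 = correct_sentence(["B", "in", "B", "Bezug", "B", "auf", "B"], LEXICON)
print(" ".join(T1))  # B in B bezug B auf B
```

Because contexts are compared against T2 only, a further (counterfactual) entry such as Entry("auf", left=("bezug",), right=(), replacement="AUF") would not apply to this input: the left neighbour seen on T2 is still "Bezug", which is exactly the limitation noted above.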
    <Paragraph position="6"> Although one can think of other, more realistic cases (e.g., "daß ich Eis laufte" -&gt; "daß ich eislief") which require two or more correction steps such that at least one of these steps ("Eis lief" -&gt; "eislief") depends on another one ("laufte" -&gt; "lief"), there clearly are alternatives (like writing clever lexical entries) besides giving up reading from T2.</Paragraph>
    <Paragraph position="7"> Giving up T2 would mean giving up the simple working hypothesis that all the higher-level errors within a given input sentence can be corrected independently. As a consequence, the system would become much more complex and, probably, less efficient.</Paragraph>
    <Paragraph position="8"> For German, we have not yet faced any (significant amount of) data that would justify a more complex redesign of the system. However, since the data captured in the system's lexicon covers at present some 50% of the relevant phenomena compared to the Duden (Berger 1985), the ultimate complexity of the system has to be regarded as an open and empirical question.</Paragraph>
  </Section>
  <Section position="5" start_page="129" end_page="130" type="metho">
    <SectionTitle>
5 Status of Implementation
</SectionTitle>
    <Paragraph position="0"> A first prototype of the system described above has been developed in C under UNIX within the ESPRIT II project 2315 "Translator's Workbench" (TWB) as one of several separate modules checking basic as well as higher levels of various languages [like grammar and style; see (Thurmair, 1990) and (Winkelmann, 1990)].</Paragraph>
    <Paragraph position="1"> A derived and extended B-release version, covering 3,000 rewriting rules, has been integrated into both a proprietary text processing software under DOS and Microsoft's WINWORD 1.1 under MS WINDOWS 3.0. In each case it runs independently from the built-in standard spelling verifier, although this is not transparent to the user, who perceives just one proofreader checking each sentence of a text twice, i.e., on two different levels.</Paragraph>
    <Paragraph position="2"> On both these implementations, some problems have received practical solutions to an acceptable degree.</Paragraph>
    <Paragraph position="3"> For example, the problem of mistaking an abbreviation for the end of a sentence (because both end with a dot), which could prevent a context from being recognized, is 'solved' by having the sentence segmentation routine always read beyond a known abbreviation. This might eventually result in taking two sentences to be one, but would, of course, not disturb intra-sentential error correction. Nothing, however, prevents the system from stopping at an unknown abbreviation and thereby falling short of a context it otherwise would have recognized. From this it is clear that the system should at least know the most frequent abbreviations of a given language.</Paragraph>
    <Paragraph position="4"> Likewise, the formatting information of a text is preserved to a very high degree during correction, as it should be. Nevertheless, there naturally are cases where some such information will get lost, as is clear from the simple fact that there can be shrinking productions reducing n differently formatted elements on the LHS to m elements on the RHS, with m &lt; n. But these are borderline cases.</Paragraph>
    <Paragraph position="5"> What is less acceptable, for each of the implementations mentioned above, is the lack of integration of the checking on the various levels. Thus, it may happen that the checkers running one after the other over the same text disturb each other's results by proposing antagonistic corrections with respect to one and the same expression: Within the correct passage "in bezug auf", for example, "bezug" will first be regarded as an error by the standard checker, which will then propose to rewrite it as "Bezug". If the user accepts this proposal, he will receive exactly the opposite advice from the context-sensitive checker.</Paragraph>
    <Paragraph position="6"> On the other hand, checking on different levels could nicely go hand in hand and produce synergetic effects: For, clearly, any context-sensitive checking requires that the contexts themselves be correct and thus possibly have been corrected in a previous, possibly context-free, step. The checking of a single word could in turn profit from contextual knowledge in narrowing down the number of correction alternatives to be proposed for a given error: While there may be some eight or nine plausible candidates as corrections of "Bezjg" when regarded in isolation, only one candidate, i.e. "bezug", is left when the context "in _ auf" is taken into account. Thus, there is a strong demand for arriving at a holistic solution for multi-level language checking rather than for just having various level experts particularistically hooked together in series. This will be a task for the future.</Paragraph>
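The narrowing effect described for "Bezjg" can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy vocabulary, the one-letter-substitution candidate generator, and the CONTEXT_RULES table are hypothetical and much smaller than what a real word-level checker or the system's lexicon would provide.

```python
VOCABULARY = ["Bezug", "bezug", "bezog", "Belag"]  # hypothetical word list

# Hypothetical context table: forms licensed between a left and a right word.
CONTEXT_RULES = {("in", "auf"): {"bezug"}}


def differs_by_one_letter(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def candidates(word, vocabulary):
    """Context-free step: propose every word one substitution away."""
    return [v for v in vocabulary
            if differs_by_one_letter(word.lower(), v.lower())]


def narrow(cands, left, right, rules):
    """Context-sensitive step: keep only forms licensed by the neighbours."""
    licensed = rules.get((left, right))
    if licensed is None:
        return cands  # no contextual knowledge: leave the list unchanged
    return [c for c in cands if c in licensed]


isolated = candidates("Bezjg", VOCABULARY)
print(isolated)                                      # ['Bezug', 'bezug', 'bezog']
print(narrow(isolated, "in", "auf", CONTEXT_RULES))  # ['bezug']
```

In this sketch the context-free step runs first and the contextual step only filters its output, which is one simple way of letting the two levels cooperate instead of contradicting each other.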
  </Section>
</Paper>