<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2131">
  <Title>An Agreement Corrector for Russian</Title>
  <Section position="1" start_page="0" end_page="776" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> The paper describes an application-oriented system that corrects agreement errors. In order to correct a sentence with such errors, an extended morphological structure is created which contains various grammatical forms of the words used in the sentence. For this structure the bottom-up parsing is performed, and syntactic structures are found that contain minimal number of changes in comparison with the original sentence. Experiments with real sentences have shown promising results.</Paragraph>
    <Paragraph position="1"> 1 Introduction Correction of agreement errors in Russian texts is a problem of real practical interest. Being a language of the inflectional type, Russian has a rich inflectional system. The paradigm of a typical verb contains 30 finite forms and 190 participles, as well as infinitives and certain other forms. The complete adjectival paradigm contains 57 forms, and the paradigm of a noun contains 12 forms. Although the number of different graphic words in a paradigm is usually less than the total number of forms, it is still rather large. For that reason, agreement errors account for a high proportion of all grammatical errors in Russian texts (here and below the expression 'agreement errors' means the use of words in incorrect forms).</Paragraph>
    <Paragraph position="2"> In this paper we describe an application-oriented system that corrects such errors. The system, called below 'corrector', uses a formal description of the Russian syntax in terms of dependency structures.</Paragraph>
    <Paragraph position="3"> In our case, these structures are directed trees whose nodes represent the words of a sentence, and whose arcs are labelled with names of syntactic relations (see Mel'čuk 1974; Mel'čuk and Pertsov 1987; Apresjan et al. 1992). The corrector is based on the general idea widely used in systems of this kind: if an input sentence is syntactically ill-formed, i.e. it cannot be assigned a syntactic structure (SyntS), the system considers minimal changes that enable it to construct a SyntS, and presents them as possible corrections (see, for example, Carbonell and Hayes 1983; Jensen et al. 1983; Weischedel and Sondheimer 1983; Mellish 1989; Bolioli et al. 1992).</Paragraph>
    <Paragraph position="4"> A segment of a sentence is called 'syntactically connected' if a well-formed dependency tree can be constructed on it. (In terms of constituents, connectedness of a segment would mean that it can be parsed as a single constituent.) The 'degree of syntactic disconnectedness' of a sentence is defined as the least number C of connected segments into which the sentence can be partitioned. Hence, C = 1 if and only if the sentence can be assigned a SyntS; for an &quot;absolutely disconnected&quot; sentence C would be equal to the number of words. The general idea of correction can be expressed in these terms as follows: for an input sentence, which has C &gt; 1, minimal changes are considered that produce sentences with C = 1. A more &quot;indulgent&quot; strategy is also possible which only requires that the value of C for new sentences should be less than the initial value, and not necessarily equal to 1.</Paragraph>
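    <Paragraph> The definition of C above can be sketched as a small dynamic program. This is only an illustration under stated assumptions, not the system's implementation: segments are taken to be contiguous, a single word is assumed to be trivially connected, and the oracle is_connected is a hypothetical stand-in for the parser's decision on whether a dependency tree can be built on a segment.
<![CDATA[
```python
from typing import Callable, Sequence


def disconnectedness(words: Sequence[str],
                     is_connected: Callable[[Sequence[str]], bool]) -> int:
    """Least number C of contiguous connected segments that partition `words`."""
    n = len(words)
    # best[i] = least number of connected segments partitioning words[:i]
    best = [0] + [n + 1] * n
    for end in range(1, n + 1):
        for start in range(end):
            segment = words[start:end]
            # Assumption: a one-word segment is always connected (one-node tree).
            if len(segment) == 1 or is_connected(segment):
                best[end] = min(best[end], best[start] + 1)
    return best[n]


# Toy oracle (hypothetical): only these multi-word segments are connected.
CONNECTED = {("a", "b"), ("c", "d")}


def toy_oracle(segment: Sequence[str]) -> bool:
    return tuple(segment) in CONNECTED


print(disconnectedness(["a", "b", "c", "d"], toy_oracle))      # 2
print(disconnectedness(["a", "b", "c", "d"], lambda s: False))  # 4
```
]]>
    With the toy oracle the four-word input splits into two connected segments, so C = 2; with an oracle that rejects every multi-word segment, C equals the number of words, matching the &quot;absolutely disconnected&quot; case.</Paragraph>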
    <Paragraph position="5"> In the case of correcting agreement errors, changes concern only word forms, while the lexical content and word order of the sentence do not vary.</Paragraph>
    <Paragraph position="6"> First, the corrector tries to improve the sentence by changing a single word; in case of failure, it tries to change a pair of words, then a triple of words, and so on. In practice, particular subsets of words to be changed are not enumerated; instead, bottom-up parsing is performed which constructs syntactic subtrees that contain no more than R modified words. Here R is a parameter which is successively assigned the values 1, 2, ...</Paragraph>
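    <Paragraph> The role of the parameter R can be illustrated with a deliberately naive sketch. It enumerates subsets of words explicitly, which is exactly what the corrector avoids by using bottom-up parsing, so it shows only the outer loop over R and not the parsing method. The names correct, paradigm and is_well_formed are hypothetical, the data is a toy English example rather than Russian, and the input is assumed to be already known to be ill-formed.
<![CDATA[
```python
from itertools import combinations, product
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def correct(words: Sequence[str],
            paradigm: Dict[str, List[str]],
            is_well_formed: Callable[[Sequence[str]], bool],
            max_r: Optional[int] = None) -> Tuple[Optional[int], List[Tuple[str, ...]]]:
    """Return the smallest R with a well-formed variant, and all variants found for it."""
    n = len(words)
    limit = n if max_r is None else min(max_r, n)
    for r in range(1, limit + 1):          # R = 1, 2, ...
        found = []
        for positions in combinations(range(n), r):
            # Other forms of the same word for each position chosen for change.
            options = [[f for f in paradigm.get(words[i], []) if f != words[i]]
                       for i in positions]
            for choice in product(*options):
                candidate = list(words)     # lexical content and word order are kept
                for i, form in zip(positions, choice):
                    candidate[i] = form
                if is_well_formed(candidate):
                    found.append(tuple(candidate))
        if found:
            return r, found
    return None, []


# Toy data (hypothetical): word forms grouped by lexeme, and a list of acceptable sentences.
PARADIGM = {"dog": ["dog", "dogs"], "dogs": ["dog", "dogs"],
            "bark": ["bark", "barks"], "barks": ["bark", "barks"]}
GOOD = {("the", "dog", "barks"), ("the", "dogs", "bark")}

r, corrections = correct(["the", "dogs", "barks"], PARADIGM,
                         lambda s: tuple(s) in GOOD)
print(r, corrections)
```
]]>
    For the toy input both single-word changes yield an acceptable sentence, so the search stops at R = 1 and proposes both as possible corrections.</Paragraph>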
    <Paragraph position="7"> At present, the linguistic information used by the corrector is not complete. The morphological and syntactic dictionaries, which describe respectively the paradigms and syntactic properties of words, cover about 15 thousand words; the grammar does not cover a number of less frequent syntactic constructions. Nevertheless, experiments show that, if supplied with a large morphological dictionary, the corrector even in its current state could effectively process real texts.</Paragraph>
    <Paragraph position="8"> Incompleteness of the syntactic dictionary is overcome by assigning 'standard' entries to the words absent from it (but present in the morphological dictionary). A standard entry describes syntactic properties typical of the words with a particular type of paradigm. Due to incompleteness of the grammar, the corrector fails to construct SyntSs for certain well-formed sentences. (Here and below a sentence is called 'well-formed' if it has one or more SyntSs which are correct with respect to the (hypothetical) complete grammar; otherwise a sentence is called 'ill-formed'.) For that reason, all sentences whose degree of disconnectedness is less than that of the input sentence (C) are regarded as 'improvements'. If C = 1, the input sentence is considered correct, and if C &gt; 1 and improvements have not been found, it is considered 'quasi-correct'. The corrector was tested on sentences chosen at random from the Russian journal Computer Science Abstracts. The experiments are described in detail in Section 5; here we present only the main results.</Paragraph>
    <Paragraph position="9"> Of 100 sentences chosen, 95 were evaluated as correct or quasi-correct; 3 gave 'false alarms', i.e.</Paragraph>
    <Paragraph position="10"> wrong corrections were proposed; 2 cases gave system failure (exhaustion of time or memory quotas).</Paragraph>
    <Paragraph position="11"> The same 100 sentences with single random distortions gave the following results: 14 turned out to be well-formed and were evaluated by the system as correct or quasi-correct; in 79 cases the initial sentences were reconstructed; in 5 cases wrong corrections were proposed; 2 cases gave system failure. The repeated experiment with distorted sentences generated by a different series of pseudo-random numbers gave respectively the figures 10, 84, 5, and 1.</Paragraph>
    <Paragraph position="12"> It can be said that in these experiments the difference in performance between the system described and the &quot;ideal&quot; corrector was 5% for correct sentences and 6-7% for sentences with single distortions. For more than 90% of ill-formed sentences the right corrections were found.</Paragraph>
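    <Paragraph> The percentages above can be re-derived from the reported counts. The following sketch assumes one plausible reading of how they were obtained: the gap to the &quot;ideal&quot; corrector is the share of sentences that ended in a wrong correction or a system failure, and the recovery rate is the number of reconstructed sentences divided by the number of distorted sentences that were actually ill-formed. The variable names are illustrative only.
<![CDATA[
```python
TOTAL = 100

# Undistorted sentences: 95 correct or quasi-correct, 3 false alarms, 2 failures.
clean_gap = 100.0 * (TOTAL - 95) / TOTAL

# Two runs with single random distortions:
# (still well-formed, reconstructed, wrong correction, system failure)
runs = [(14, 79, 5, 2), (10, 84, 5, 1)]

gaps = []
recovery = []
for well_formed, reconstructed, wrong, failure in runs:
    assert well_formed + reconstructed + wrong + failure == TOTAL
    # Share of sentences not handled as an ideal corrector would handle them.
    gaps.append(100.0 * (wrong + failure) / TOTAL)
    # Share of truly ill-formed sentences whose original form was restored.
    ill_formed = TOTAL - well_formed
    recovery.append(round(100.0 * reconstructed / ill_formed, 1))

print(clean_gap)   # 5.0
print(gaps)        # [7.0, 6.0]
print(recovery)    # [91.9, 93.3]
```
]]>
    Under this reading the counts give 5% for the undistorted sentences, 7% and 6% for the two distorted runs, and recovery rates of about 91.9% and 93.3%, consistent with the figures quoted in the text.</Paragraph>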
    <Paragraph position="13"> A natural application of an agreement corrector is to process texts in computer editors. Another possibility is to combine it with a scanner for reading printed texts. Applying this system to problems with &quot;high noise&quot;, such as reading handwritten texts or speech recognition, seems more questionable: observations show that when the density of errors increases, the quality of correction becomes rather low.</Paragraph>
    <Paragraph position="14"> Further development of the corrector includes, as the first step, incorporation of a large morphological dictionary (in the experiments the entries of words absent from the morphological dictionary were added to it before running the corrector, i.e. a &quot;complete&quot; dictionary was simulated). Then the syntactic dictionary and the grammar should be extended, and further debugging on real texts should be carried out.</Paragraph>
  </Section>
</Paper>