File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/c96-2131_metho.xml
Size: 21,742 bytes
Last Modified: 2025-10-06 14:14:11
<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2131"> <Title>An Agreement Corrector for Russian</Title> <Section position="2" start_page="776" end_page="776" type="metho"> <SectionTitle> 2 Parsing the Input Sentence </SectionTitle> <Paragraph position="0"> The corrector begins its work with ordinary morphological analysis and parsing of the input sentence. As a result of morphological analysis, for each word all its possible morphological interpretations (called 'homonyms') are constructed. A homonym consists of a lexeme name, a part-of-speech marker, and a list of values of morphological features, such as number, case, gender, tense, voice, and so on.</Paragraph> <Paragraph position="1"> The set of all homonyms built for a sentence is called its morphological structure (MorphS).</Paragraph> <Paragraph position="2"> The MorphS is regarded as input information for the parser, which is based on the bottom-up principle. The bottom-up method for dependency structures was proposed by Lejkina and Tsejtin (1975). Let a 'fragment' be a dependency tree constructed on a certain segment of a sentence. More precisely, a fragment is a set of homonyms occupying one or more successive positions in the sentence (one homonym in each position) together with a directed tree defined on these homonyms as nodes, the arcs of the tree being labelled with names of syntactic relations (such arcs are called 'syntactic links'). 
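The fragment notion just defined can be sketched in code. The paper does not specify concrete data structures, so the class and field names below are purely illustrative: a homonym bundles a lexeme, a part-of-speech marker, and feature values; a fragment is a contiguous span of homonyms carrying labelled dependency arcs.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Homonym:
    lexeme: str          # lexeme name
    pos: str             # part-of-speech marker
    features: frozenset  # morphological feature values (number, case, ...)

@dataclass
class Fragment:
    # Homonyms occupy successive sentence positions start .. start+len(nodes)-1.
    start: int
    nodes: list                                 # one Homonym per position
    links: list = field(default_factory=list)   # (head_idx, dep_idx, relation)

    @property
    def end(self):
        return self.start + len(self.nodes) - 1
```

A single homonym is the smallest fragment (no links); a complete SyntS is the largest, covering the whole sentence.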
A separate homonym and a SyntS are &quot;extreme&quot; instances of fragments.</Paragraph> <Paragraph position="3"> If two fragments are adjacent in the sentence, then drawing a syntactic link from a certain node of one fragment to the root of the other creates a new fragment on the union of the segments occupied by the initial fragments (this is similar to constructing a new constituent from two adjacent constituents).</Paragraph> <Paragraph position="4"> By such operations, starting from separate homonyms, we can construct a SyntS of the sentence provided it does not contain 'tangles' (Mitjushin 1999); though SyntSs with tangles may occur, they are very infrequent and &quot;pathological&quot;.</Paragraph> <Paragraph position="5"> Correctness of links between fragments is determined by grammar rules, which have access to all information about the fragments to be linked, including the syntactic entries for the homonyms in their nodes.</Paragraph> <Paragraph position="6"> In the course of parsing, only the most preferred of all fragments are built (see Section 4). Besides that, many fragments are excluded at certain intermediate points. As a result, the amount of computation is substantially reduced, and though the expected total number of fragments remains exponential with respect to the length of the sentence, cases of &quot;combinatorial explosion&quot; are fairly infrequent (2% in our experiments).</Paragraph> <Paragraph position="7"> For the set of fragments constructed, the degree of disconnectedness C is computed as the least number of fragments covering all words of the sentence. This value of C will be denoted by C(0). If at least one complete SyntS has been built, then C(0) = 1; otherwise C(0) > 1. 
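The two operations above, combining adjacent fragments and computing the degree of disconnectedness C, admit a compact sketch. This is an assumption-laden illustration, not the paper's implementation: fragments are reduced to `(start, end, links)` span triples, and C is found by a greedy interval cover, which is exact here because fragments are contiguous spans.

```python
def combine(left, right, head_pos, relation):
    """Attach the root of `right` (at its first position) under node
    `head_pos` of `left`; the fragments must be adjacent."""
    l_start, l_end, l_links = left
    r_start, r_end, r_links = right
    assert l_end + 1 == r_start, "fragments must be adjacent"
    new_links = l_links + r_links + [(head_pos, r_start, relation)]
    return (l_start, r_end, new_links)

def disconnectedness(fragments, n_words):
    """C = least number of fragments covering all words of the sentence.
    Greedy choice is optimal for covering points with contiguous spans."""
    spans = [(s, e) for s, e, _ in fragments]
    count, pos = 0, 0
    while pos < n_words:
        # among fragments starting at or before `pos`, take the one
        # reaching furthest to the right
        best = max((e for s, e in spans if s <= pos), default=-1)
        if best < pos:
            raise ValueError("some word is not covered by any fragment")
        count, pos = count + 1, best + 1
    return count
```

If one covering fragment spans the whole sentence, C = 1, i.e. a complete SyntS exists.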
In case C(0) = 1 the sentence is regarded as correct and the process terminates; in case C(0) > 1 an attempt is made to improve the sentence.</Paragraph> </Section> <Section position="3" start_page="776" end_page="777" type="metho"> <SectionTitle> 3 Search for Corrections </SectionTitle> <Paragraph position="0"> The process used to find corrections is quite similar to the ordinary parsing described in the previous section. The main difference is that the parser receives as input not the initial MorphS but an 'extended' one, which is constructed by adding new homonyms to the initial MorphS. The new homonyms arise as the result of varying the forms of the words of the input sentence. The variation concerns only semantically empty morphological features, such as the case of a noun; the number, gender and person of a finite verb; the number, gender and case of an adjective or participle; and the like. Transforming finite verbs into infinitives and vice versa is also regarded as semantically empty.</Paragraph> <Paragraph position="1"> As a result, for each homonym of the initial MorphS a 'set of variants' is built, i.e. a certain set of homonyms of the same lexeme that contains the given homonym. For unchangeable words and in certain other cases no real variation takes place, and the set of variants contains a single element, namely the given homonym. The precise rules for constructing variants may be found in (Mitjushin 1993). The extended MorphS is the union of the sets of variants built for all homonyms of the initial MorphS. 
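The construction of variant sets and of the extended MorphS can be sketched as follows. The real variation rules are those of Mitjushin (1993); the feature inventories below are invented for illustration only, and homonyms are reduced to `(lexeme, pos, features)` triples.

```python
from itertools import product

# Illustrative inventories of semantically empty features per part of
# speech; the actual rules are considerably more refined.
EMPTY_FEATURES = {
    "NOUN": {"case": ["nom", "gen", "dat", "acc", "ins", "loc"]},
    "ADJ": {"case": ["nom", "gen", "dat", "acc", "ins", "loc"],
            "number": ["sg", "pl"],
            "gender": ["m", "f", "n"]},
}

def variant_set(homonym):
    """All homonyms of the same lexeme obtained by varying the
    semantically empty features; contains the given homonym itself."""
    lexeme, pos, feats = homonym
    varied = EMPTY_FEATURES.get(pos)
    if not varied:                  # unchangeable word: a single variant
        return [homonym]
    names = sorted(varied)
    variants = []
    for values in product(*(varied[n] for n in names)):
        new = dict(feats)
        new.update(zip(names, values))
        variants.append((lexeme, pos, new))
    return variants

def extended_morphs(initial):
    """Union of the variant sets of all homonyms of the initial MorphS."""
    out = []
    for h in initial:
        out.extend(variant_set(h))
    return out
```

This reproduces the growth effect reported below: a noun homonym yields six case variants, so the extended MorphS is several times larger than the initial one.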
On average, the extended MorphS is much larger than the initial one: for 100 sentences from the Computer Science Abstracts, the mean number of homonyms in the initial MorphS was 2.4n, while in the extended one it was 12.2n, where n is the number of words in the sentence.</Paragraph> <Paragraph position="2"> The extended MorphS is processed by the parser for various values of the parameter R, which limits the number of changed words in the sentence. R is successively assigned the values 1, 2, ..., Rmax (where Rmax is set by the user), and for each R parsing is repeated from the beginning. Let d be the number of homonyms of a certain fragment which do not belong to the initial MorphS, i.e. whose graphic words are different from the words of the input sentence. In a sense, d is the distance between the fragment and the input sentence. In the course of parsing only those fragments are considered for which d ≤ R; one can imagine that the parsing algorithm remains unchanged, but creation of fragments with d > R is blocked in the grammar rules.</Paragraph> <Paragraph position="3"> For each value of R, the degree of disconnectedness C = C(R) is calculated for the results of parsing. It should be noted that if we put R = 0, then parsing on the extended MorphS actually reduces to parsing on the initial MorphS, which justifies the notation C(0) introduced in the previous section. The function C(R) does not increase for R ≥ 0. The behaviour of the corrector may depend in different ways on the values of C(R), resulting in different modes of operation.</Paragraph> <Paragraph position="4"> If parsing is highly reliable, i.e. the probability of constructing at least one SyntS for a well-formed sentence is close to 1, and the same probability for an ill-formed sentence is close to 0, then it is reasonable to regard all sentences with C(0) > 1 as incorrect and to carry out parsing on the extended MorphS until C(R) = 1 is achieved for some R, i.e. at least one SyntS is constructed. 
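The outer loop of this reliable-grammar mode can be sketched directly. The `parse` function is a hypothetical stand-in for the bottom-up parser of Section 2, assumed to run on the extended MorphS while blocking fragments with d > R, and to report the resulting degree of disconnectedness C together with the fragments built.

```python
def correct(ext_morphs, parse, r_max):
    """Sketch of the correction loop: parse with R = 1, 2, ..., Rmax
    until C(R) = 1, i.e. at least one complete SyntS is built.

    `parse(morphs, max_distance=r)` -> (C, fragments) is an assumed API.
    """
    for r in range(1, r_max + 1):
        c, fragments = parse(ext_morphs, max_distance=r)
        if c == 1:
            return r, fragments     # corrections use exactly r changed words
    return None                     # no correction within Rmax changes
```

Since C(R) is non-increasing, the loop stops at the smallest number of changed words that yields a complete SyntS.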
Then, replacing the homonyms of each SyntS by their graphic representations (that is, transforming them into ordinary words), we obtain hypothetical corrected sentences; some of them may be identical, as different homonyms may be represented by the same graphic words. Each of the created sentences contains exactly R words changed in comparison with the input sentence.</Paragraph> <Paragraph position="5"> If C(R) > 1 for all R ≤ Rmax, corrections cannot be constructed within this mode. However, the corrector can inform the user that the input sentence is syntactically ill-formed, and indicate the gaps in the SyntS, i.e. the boundaries of the fragments which provide the minimal covering of the sentence for R = 0.</Paragraph> <Paragraph position="6"> In our case, due to the incompleteness of the grammar, many well-formed sentences would have C(0) > 1. However, the majority of the missing rules describe links which do not require agreement, and so for almost all well-formed sentences C(R) = C(0) for all R > 0, i.e. they turn out to be 'unimprovable'. Taking this fact into account, we adopted the following strategy: the least R1 is found for which C(R) = C(R1) for all R, R1 ≤ R ≤ Rmax. In other words, R1 is the least value of R at which the situation becomes 'unimprovable' (within the extended MorphS constructed for the input sentence). If R1 > 0, i.e.</Paragraph> <Paragraph position="7"> C(R1) < C(0), then for R = R1 the minimal sets of fragments covering all words of the sentence are considered. Replacing the homonyms of those fragments by their graphic representations, we obtain hypothetical corrected sentences. In the case of overlapping fragments, the problem of choosing among several homonyms could arise, but in practice overlapping does not occur.</Paragraph> <Paragraph position="8"> Experiments with the corrector showed from the very beginning that the process described often generated redundant hypothetical corrections. 
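The search for R1 reduces to a scan over the non-increasing table C(0), C(1), ..., C(Rmax). A minimal sketch, assuming that table has already been computed by repeated parsing:

```python
def find_r1(c_values):
    """Given [C(0), C(1), ..., C(Rmax)], return the least R1 such that
    C(R) = C(R1) for all R1 <= R <= Rmax."""
    final = c_values[-1]            # C(Rmax), the 'unimprovable' level
    r1 = len(c_values) - 1
    while r1 > 0 and c_values[r1 - 1] == final:
        r1 -= 1
    return r1
```

A correction is proposed only when C(R1) < C(0), i.e. when R1 > 0; if R1 = 0 the sentence is 'unimprovable' as it stands.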
So an additional pruning step was introduced, which will be described for the case of SyntSs (for fragments everything is quite similar). The arcs of a SyntS are assigned certain weights which express the relative &quot;strength&quot; or &quot;priority&quot; of the corresponding syntactic relations; the weight of the SyntS is equal to the sum of the weights of its arcs. The maximum weight over all constructed SyntSs is computed, and only SyntSs with that weight are retained.</Paragraph> <Paragraph position="9"> Though this method is simple, it proved to be quite effective and reliable: in most cases the corrector generates a single hypothetical correction, while the probability of losing the right correction is rather small.</Paragraph> </Section> <Section position="4" start_page="777" end_page="779" type="metho"> <SectionTitle> 4 Implementation </SectionTitle> <Paragraph position="0"> The system described is based on the &quot;Meaning-Text&quot; linguistic model (Mel'čuk 1974; see also Mel'čuk and Pertsov 1987) and its computer implementation, a multi-purpose linguistic processor developed at the Institute for Information Transmission Problems of the Russian Academy of Sciences (Apresjan et al. 1992).</Paragraph> <Paragraph position="1"> The corrector employs the morphological and syntactic dictionaries of Russian which are part of the processor. As regards its linguistic content, the grammar of the corrector is similar to the Russian grammar used in the processor, as they describe the same correspondence between Russian sentences and their syntactic structures. However, the corrector uses a different formalism to represent rules, which partly stems from the difference in parsing methods: the processor implements an algorithm of the so-called 'filtering' type, while the corrector uses an algorithm of the 'bottom-up' variety.</Paragraph> <Paragraph position="2"> It should be noted that, in contrast to certain other systems (for example, Jensen et al. 
1983; Weischedel and Sondheimer 1983; Véronis 1988; Chanod et al. 1992), the present corrector does not contain any 'negative' information intended specifically for correcting errors. It contains only 'positive' rules that describe correct SyntSs and their parts and are assumed to be used in ordinary parsing.</Paragraph> <Paragraph position="3"> Correction of errors is reduced to parsing on the extended MorphS, as described in Section 3.</Paragraph> <Paragraph position="4"> In comparison with the experimental version of the system (Mitjushin 1993), in the present version the grammar is augmented, and it is now possible to process words absent from the syntactic dictionary and to consider quasi-correct sentences. Now, to make the corrector applicable to real texts, it is sufficient to supply it with a large morphological dictionary.</Paragraph> <Paragraph position="5"> Such a dictionary, containing about 90 thousand words, has recently been compiled at IITP by Vladimir Sannikov. It is rather close in its lexical content to the Grammatical Dictionary by Zaliznjak (1980), but is based on the model of Russian morphology used in the linguistic processor.</Paragraph> <Paragraph position="6"> Compilation of a large syntactic dictionary is a more labour-consuming task, as its entries contain more complex information. For each word one must specify a set of relevant syntactic features (from the full list of 240 features), a set of semantic categories (from the list of 50 categories), and a government pattern which expresses the requirements that must be fulfilled by the elements representing, in the SyntS, the semantic actants of the word (Mel'čuk 1974; Mel'čuk and Pertsov 1987; Apresjan et al. 
1992).</Paragraph> <Paragraph position="7"> Verbs, nouns, adjectives, and adverbs which are present in the morphological dictionary but absent from the syntactic one are assigned one of the following standard entries: transitive verb, intransitive verb, inanimate masculine noun, animate masculine noun, inanimate feminine noun, animate feminine noun, neuter noun, adjective, adverb.</Paragraph> <Paragraph position="8"> Words of other parts of speech constitute closed classes and must be present in the syntactic dictionary. The standard entries contain &quot;generalized&quot; information which is typical of words of the specified categories. A verb is assumed to be transitive if its paradigm contains passive forms. The gender and animacy of a noun are explicitly indicated in its paradigm.</Paragraph> <Paragraph position="9"> Although this method is rather approximate by its nature, it works quite well: in most cases standard entries do not prevent the parser from building correct or &quot;almost correct&quot; SyntSs (the latter differing from the former in the names of relations on certain arcs). The reason for this is, on the one hand, that the majority of words with highly idiosyncratic behaviour are present in the 15-thousand-word dictionary of the linguistic processor, and, on the other hand, that the syntactic peculiarities of words are often irrelevant to the specific constructions in which they occur (for instance, consider the first two occurrences of the verb be in the sentence To be, or not to be: that is the question).</Paragraph> <Paragraph position="10"> The algorithms by which the corrector constructs the initial and extended MorphSs are similar to the algorithms of morphological analysis and synthesis used in the linguistic processor.</Paragraph> <Paragraph position="11"> Due to space limitations, we cannot describe the parsing algorithm in detail and give only a sketch. The parsing, i.e. 
constructing fragments by the bottom-up procedure, is performed in three stages, in order of decreasing predictability of syntactic links. The parser makes intensive use of the idea of syntactic preference, employed in a wide range of systems based on various principles (see, for example, Tsejtin 1975; Kulagina 1987, 1990; Tsujii et al.</Paragraph> <Paragraph position="12"> 1988; Hobbs and Bear 1990).</Paragraph> <Paragraph position="13"> At the first stage the parser constructs fragments containing 'high-probability' links; as a result, on average 70-80% of all syntactic links of a sentence are established (for details see Mitjushin 1992). At the second stage the fragments are connected with &quot;weaker&quot; and more ambiguous links, like those between a verb or noun and a modifying prepositional phrase. At the third stage &quot;rare&quot; and/or &quot;far&quot; links are established, such as coordination of independent clauses. At the second and third stages attempts are also made to establish links of the earlier stages, as they may have failed at their &quot;own&quot; stage because of missing intermediate links belonging to the later stages.</Paragraph> <Paragraph position="14"> At each stage the sentence is scanned from left to right, and attempts are made to link each fragment with its left neighbours. A strong system of preferences is used which substantially reduces the number of arising fragments. Its main points are: longer fragments are preferred to shorter ones; links of earlier stages are preferred to those of later stages; shorter links are preferred to longer ones. The general rule requires that only the most preferred of all possible actions should be considered. 
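The three preference criteria just listed form a lexicographic ordering and can be sketched as a sort key. This is an illustration under assumed names, with a candidate action reduced to the triple (fragment length, link stage, link length):

```python
def preference_key(action):
    """Lexicographic preference: longer fragments first, then links of
    earlier stages, then shorter links. Smaller key = more preferred."""
    frag_len, link_stage, link_len = action
    return (-frag_len, link_stage, link_len)   # negate to prefer longer

def most_preferred(actions):
    """Keep only the most preferred of all possible actions."""
    best = min(preference_key(a) for a in actions)
    return [a for a in actions if preference_key(a) == best]
```

Only when every action at the best level fails are actions at the next level tried, which keeps the number of fragments under control.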
Only if they all fail are the actions of the next priority level considered, and so on.</Paragraph> <Paragraph position="15"> After each stage only 'maximal' fragments are retained (a fragment is maximal if its segment is not a proper part of the segment of any other fragment). The process terminates after the stage at which complete SyntSs have arisen; otherwise the fragments left after the third stage are regarded as the final result of parsing.</Paragraph> <Paragraph position="16"> It should be noted that grammar rules, by means of special operations, can change the priorities of links and fragments in order to widen the search if there is a danger of &quot;losing&quot; correct fragments. They can also mark fragments which must be retained after the stage even if they are not maximal. As the rules have access to all information about the fragments they consider, this makes it possible to control the parsing process effectively enough, depending on the specific situation in the sentence.</Paragraph> </Section> <Section position="5" start_page="779" end_page="780" type="metho"> <SectionTitle> 5 Preliminary Experiments </SectionTitle> <Paragraph position="0"> In order to evaluate the performance of the corrector, 100 sentences were chosen at random from the journal Computer Science Abstracts ('referativnyj zhurnal Vychislitel'nye Nauki', in Russian). The sentences had to have no more than 50 words and to contain no formulas or words in the Latin alphabet.</Paragraph> <Paragraph position="1"> The words absent from the morphological dictionary were added to it before the experiments (such words covered about 5% of all word occurrences in those sentences). The chosen 100 sentences were processed by the corrector. Then a single random distortion was made in each sentence, and the 100 distorted sentences were processed (this was done twice, with different series of pseudo-random numbers used to generate the distortions). 
As only single distortions were considered, Rmax was fixed at 2.</Paragraph> <Paragraph position="2"> The 100 initial sentences gave the following results. In 75 cases SyntSs were built; 20 sentences were evaluated as quasi-correct, i.e. they had</Paragraph> <Paragraph position="3"> &quot;corrections&quot; were proposed; in one case the time limit (120 seconds) was exceeded; one case gave an overflow of working arrays. Thus, the corrector's reaction was right for 95 sentences.</Paragraph> <Paragraph position="4"> Distortions were generated as follows. A word of the sentence was chosen at random for which the number of homonyms in the extended MorphS was greater than that in the initial one (the mean number of such &quot;changeable&quot; words in a sentence was 14.3, while the mean length of a sentence was 17.6 words). A list of the different graphic words corresponding to those homonyms was built (on average, it contained 7.7 words), and one of the words different from the initial word was chosen at random.</Paragraph> <Paragraph position="5"> All random choices were made with equal probabilities for the outcomes. An additional condition was imposed: the initial word should belong to the set of variants of the new one (sometimes this may not hold). If this condition was not fulfilled, generation of the distorted sentence was repeated.</Paragraph> <Paragraph position="6"> Some of the distorted sentences turn out to be well-formed (for the distortions described, the probability of this is about 10-15%). In most cases such sentences are semantically and/or pragmatically abnormal. However, this cannot be detected at the syntactic level, just as a spelling corrector is helpless if an error transforms a word into another existing word.</Paragraph> <Paragraph position="7"> There were 14 well-formed sentences in the first series of distorted sentences, and 10 in the second series. The corrector evaluated all those sentences as correct or quasi-correct. 
The results for the other distorted sentences are shown in Table 1. On the whole, for the first series of distorted sentences the corrector's reaction was right in 93 cases, and for the second in 94 cases.</Paragraph> <Paragraph position="8"> No regular experiments were carried out for sentences containing more than one distortion. Our experience suggests that if the number of distorted words is small and they are syntactically isolated, i.e. the corresponding nodes are not too close to each other in the SyntS of the original sentence, then the system corrects each distortion independently of the others, as if it were the only one in the sentence. On the other hand, for massively distorted (and not too short) sentences the probability of good results is rather low.</Paragraph> <Paragraph position="9"> The mean processing time on the MicroVAX 3100 computer was 11.2 seconds for an initial sentence (0.64 seconds per word) and 11.4 seconds for a distorted one. Faster performance may be expected when the grammar is enlarged, because the proportion of sentences with SyntSs, as compared with quasi-correct ones, will become higher. For quasi-correct sentences parsing must be performed for all R ≤ Rmax, while for sentences with SyntSs it must be done only for R = 0 (if a correct sentence is to be checked) or for R ≤ K (if K distortions are to be corrected). In our experiments, for initial sentences with SyntSs the mean processing time was 2.6 seconds (0.17 seconds per word, the mean length of such sentences being 15.5 words), and the mean time of parsing was 0.6 seconds.</Paragraph> </Section> </Paper>