<?xml version="1.0" standalone="yes"?> <Paper uid="A92-1015"> <Title>Detecting and Correcting Morpho-syntactic Errors in Real Texts</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2300 RB Leiden, The Netherlands. </SectionTitle> <Paragraph position="0"> need for a parser. After a brief overview of the system and a discussion of the word-level modules, I will describe the grammar formalism, the parser, its mechanism for error detection, and a pre-processor for word lattices. Finally, after looking at the integration of the modules and at some useful heuristics, I will give a summary of the results obtained by a non-interactive Dutch grammar-driven spell checker.</Paragraph> </Section> <Section position="4" start_page="0" end_page="111" type="metho"> <SectionTitle> 2. Morpho-syntactic Errors </SectionTitle> <Paragraph position="0"> This paper is concerned with three types of errors: typographical errors (typing errors or OCR scanning errors), orthographical errors (erroneous transliterations of phonemes to graphemes) and, most importantly, morpho-syntactic errors (resulting from misapplication of morphological inflection and syntactic rules). Simple spell checkers are only able to spot errors leading to non-words; errors involving legally spelled words go unnoticed. These morpho-syntactic errors occur quite frequently in Dutch texts, though, and are considered serious because they are seen as resulting from insufficient language competence rather than from incidental mistakes, such as typographical errors. Therefore they constitute an interesting area for grammar checking in office and language teaching applications. I will now present a classification of the morpho-syntactic errors and some related errors in Dutch (Kempen and Vosse, 1990).</Paragraph> <Section position="1" start_page="0" end_page="111" type="sub_section"> <SectionTitle> 2.1.
Agreement violations </SectionTitle> <Paragraph position="0"> Typical syntactic errors are agreement violations.</Paragraph> <Paragraph position="1"> Though none of the words in the sentence She walk home is incorrect, the sentence is ungrammatical. No simple spelling checking mechanism can find the error, let alone correct it, since it is caused by a relation between two words that need not be direct neighbours. Detection and correction of this type of error requires a robust parser that can handle ungrammatical input.</Paragraph> <Paragraph position="2"> 2.2. Homophonous words Homophony is an important source of orthographical errors: words having the same pronunciation but a different spelling. Dutch examples are ze and zij, sectie and sexy, wort and wordt, and achterruit and achteruit. Such words are easily replaced by one of their homophonous counterparts in written text.</Paragraph> <Paragraph position="3"> The problem with current spell checkers is that they do not notice this substitution, as the substitutes are legal words themselves. In order to detect this substitution, a parser is required, since often a change of syntactic category is involved. In section 4.3.2 I will demonstrate that the treatment of these errors strongly resembles the treatment of non-words.[1]</Paragraph> <Paragraph position="4"> Unfortunately, a parser cannot detect substitutions by homophones which have the same syntactic properties.</Paragraph> <Paragraph position="5"> 2.3. Homophonous inflections A special case of homophonous words are words which differ only in inflection. This type of homophony is very frequent in Dutch and French. French examples are donner, donnez, donné, donnée, donnés and données or cherche, cherches and cherchent. Dutch examples typically involve d/t-errors: -d, -t and -dt sound identical at the end of a word but they often signal different verb inflections.
Examples are the forms gebeurt (third person singular, present tense) and gebeurd (past participle) of the verb gebeuren; word (first person, singular, present tense) and wordt (third person, singular, present tense) of the verb worden; and besteden (infinitive and plural, present tense), besteedden (plural, past tense), and bestede (an adjective, derived from the past participle).</Paragraph> <Paragraph position="6"> However, unlike the general case of homophonous words, homophonous inflections, by their very nature, do not alter the syntactic category of the word but rather its (morpho-syntactic) features. So this type of error can be regarded either as a homophonous word substitution (a spelling error) or as an agreement violation.</Paragraph> </Section> <Section position="2" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 2.4. Word doubling </SectionTitle> <Paragraph position="0"> Notoriously difficult to spot are word doubling errors, especially at the end of a line (&quot;Did you actually see the the error in this sentence?&quot;). A parser surely notices it, but it should not fail to analyze the sentence because of this.</Paragraph> <Paragraph position="1"> 2.5. Errors in idiomatic expressions Idiomatic expressions often cause problems for parsers since they frequently lack a regular syntactic structure and some of their words may be illegal outside the idiomatic context. A Dutch example is te allen tijde (English: at all times). [1] I will not discuss typographical errors resulting in legal words (such as rotsen and rosten) since their treatment is similar.</Paragraph> <Paragraph position="2"> The word tijde occurs only in idiomatic expressions.</Paragraph> <Paragraph position="3"> Whenever it occurs in a normal sentence it must be considered a spelling error. (An English example might be in lieu of.) The problem is even more serious in case of spelling errors. E.g.
the expression above is more often than not written as te alle tijden, which consists of legal words and is syntactically correct as well.</Paragraph> </Section> <Section position="3" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 2.6. Split Compounds </SectionTitle> <Paragraph position="0"> Somewhat similar to idiomatic expressions is the case of compound nouns, verbs, etc. In both Dutch and German these must be written as single words.</Paragraph> <Paragraph position="1"> However, under the ever-advancing influence of English on Dutch, many compounds, especially new ones such as tekst verwerker (text processor) and computer terminal, are written separated by a blank, which usually confuses the parser.</Paragraph> </Section> </Section> <Section position="5" start_page="111" end_page="114" type="metho"> <SectionTitle> 3. System overview </SectionTitle> <Paragraph position="0"> The system presented here consists of two main levels: word level and sentence level. Before entering the sentence level (i.e., parsing a sentence), a spelling module should check all the words in the sentence. This is a rather simple task for a language such as English, but for morphologically complex languages such as Dutch and German, it is by no means trivial. Because compound nouns, verbs and adjectives are written as a single word, they cannot always be looked up in a dictionary, but have to be analyzed instead. There are three problems involved in compound analysis: (1) not every sequence of dictionary words forms a legal compound, (2) certain parts of a compound cannot be found in the dictionary and (3) full analysis usually comes up with too many alternatives. My solution follows the lines set out in (Daelemans, 1987): a deterministic word parser, constrained by the grammar for legal compounds, that comes up with the left-most longest solution first.
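The left-most-longest strategy just described can be sketched in a few lines of Python. This is a simplified illustration, not the system's implementation: the mini-dictionary and the minimum-part-length cutoff are invented for the sketch, and the real analyzer additionally applies a word grammar for legal compounds and handles linking morphemes.

```python
def split_compound(word, dictionary, min_part=3):
    """Greedy left-most-longest split of a compound into dictionary parts.

    Sketch only: the real analyzer also checks a compound grammar
    (not every sequence of dictionary words is a legal compound) and
    deals with linking morphemes such as -s- and -e-.
    """
    parts = []
    pos = 0
    while pos < len(word):
        # Try the longest remaining prefix first (left-most longest).
        for end in range(len(word), pos + min_part - 1, -1):
            if word[pos:end] in dictionary:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            return None  # no dictionary part matches: not a legal compound
    return parts

# Hypothetical mini-dictionary for illustration.
dictionary = {"tekst", "verwerker", "spelling", "correctie"}
print(split_compound("tekstverwerker", dictionary))    # ['tekst', 'verwerker']
print(split_compound("spellingcorrectie", dictionary))
```

A purely greedy split like this can reject compounds that a backtracking analysis would accept; the deterministic variant is attractive, as in the system described, because it is fast on legal compounds.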
This solution is rather fast on legal compounds, while it takes at most O(n²) time for nonexistent words and illegal compounds. The word parser is built upon a simple morphological analyzer, which can analyze prefixes, suffixes and some types of inflection. Both use a dictionary containing 250,000 word forms,[2] derived from 90,000 Dutch lemmata, which appears to be sufficient for most purposes. It is also possible to add extra dictionaries for special types of text.</Paragraph> <Paragraph position="1"> [2] For each lemma the dictionary contains all the inflections and derivations that were found in a large corpus of Dutch text (the INL corpus, compiled by the Instituut voor Nederlandse Lexicografie in Leyden). The dictionary itself is a computerised expanded version of the &quot;Hedendaags Nederlands&quot; (&quot;Contemporary Dutch&quot;) dictionary, published by Van Dale Lexicografie (Utrecht), which was enriched with syntactic information from the CELEX database (University of Nijmegen).</Paragraph> <Paragraph position="2"> If a word does not appear in one of the dictionaries and is not a legal compound either, the spell checker can resort to a correction module. In an interactive situation such a module might present the user with as many alternatives as it can find.</Paragraph> <Paragraph position="3"> Although this 'the-more-the-better' approach is very popular in commercially available spell checkers, it is not a very pleasant one. It is also unworkable in a batch-oriented system, such as the one I am describing here. Ideally, a spelling corrector should come up with one (correct!)
solution, but if the corrector finds more than one alternative, it should assign a score or ranking order to each of the alternatives.</Paragraph> <Paragraph position="4"> The system presented here employs a correction mechanism based on both a variation of trigram analysis (Angell et al., 1983) and triphone analysis (Van Berkel and De Smedt, 1988), extended with a scoring and ranking mechanism. The latter is also used in pruning the search space.[3] Thus the system can handle typographical errors as well as orthographical errors, and includes a satisfactory mechanism for ranking correction alternatives, which is suitable for interactive environments as well as for stand-alone systems.</Paragraph> <Paragraph position="5"> When all words of a text have been checked and, if necessary, corrected, a pre-processor (to be described in section 4.4) combines the words and their corrections into a word lattice. The syntactic parser then checks the grammatical relations between the elements in this lattice. If the parsing result indicates that the sentence contains errors, a syntactic corrector inspects the parse tree and proposes corrections. If there is more than one possible correction, it ranks the correction alternatives and executes the top-most one. Section 4 will describe the parser and the pre-processor in some detail. Due to space limitations, I have to refer to (Vosse, 1991) for further information, e.g. the adaptations that need to be made to the Tomita algorithm in order to keep the parsing process efficient.</Paragraph> <Paragraph position="6"> 4. Shift-Reduce Parsing with ACFGs 4.1. Augmented Context-free Grammars Augmented Context-Free Grammars (ACFGs for short) form an appropriate basis for error detection and correction. Simply put, an ACFG is a Context-Free Grammar in which each non-terminal symbol has a (finite) sequence of attributes, each of which takes as its value a subset of a finite set of symbols.
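To make the attribute mechanism concrete, here is a minimal Python sketch of how a reduction step might instantiate a rule's attributes against the values popped from the stack. The rule format, the feature names such as sg3 and nom, and the uppercase-means-variable convention are all invented for this illustration; they are not the paper's actual formalism.

```python
# Sketch of reduction-time attribute instantiation for an ACFG rule.
# A variable ('X') is bound to the value found on the stack; a constant
# ('sg3', 'nom') must match the stack value exactly, or the reduction fails.

def reduce_rule(lhs_attrs, rhs_attr_specs, stack_values):
    """Instantiate one rule's attributes against the stack.

    lhs_attrs      -- attribute specs of the left-hand side, e.g. ('X',)
    rhs_attr_specs -- one spec tuple per right-hand-side symbol
    stack_values   -- one value tuple per popped stack element
    Returns the attribute values for the new node, or None on failure.
    """
    bindings = {}
    for specs, values in zip(rhs_attr_specs, stack_values):
        for spec, value in zip(specs, values):
            if spec.isupper():                 # a variable such as 'X'
                if spec in bindings and bindings[spec] != value:
                    return None                # conflicting binding: fail
                bindings[spec] = value
            elif spec != value:                # a constant such as 'sg3'
                return None                    # constant not met: fail
    # Evaluate the left-hand side under the collected bindings.
    return tuple(bindings.get(s, s) for s in lhs_attrs)

# S(X) -> NP(X, nom) VP(X): succeeds when subject and verb agree ...
print(reduce_rule(('X',), [('X', 'nom'), ('X',)],
                  [('sg3', 'nom'), ('sg3',)]))   # ('sg3',)
# ... and fails on "she walk": the NP carries sg3, the VP carries pl.
print(reduce_rule(('X',), [('X', 'nom'), ('X',)],
                  [('sg3', 'nom'), ('pl',)]))    # None
```

Note that this sketch, like the system's own mechanism discussed below, copies values rather than sharing them, so it is deliberately weaker than full unification.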
[3] Pruning the search space is almost obligatory, since trigram and triphone analysis require O(n*m) space, where n is the length of the word and m the number of entries in the dictionary. The constant factor involved can be very large, e.g. for words containing the substring ver, which occurs in more than seven out of every hundred words (13,779 triphones and 16,881 trigrams in 237,000 words).</Paragraph> <Paragraph position="7"> In a rule, the value of an attribute can be represented by a constant or by a variable.</Paragraph> <Paragraph position="8"> A simple fragment of an ACFG is for example: [grammar fragment not preserved in the source] In the actual implementation of the parser, the grammatical formalism is slightly more complex as it uses strongly typed attributes and allows restrictions on the values the variables can take, thereby making grammar writing easier and parsing more reliable. The Dutch grammar employed in the system contains nearly 500 rules.</Paragraph> <Paragraph position="9"> 4.2. The parser The construction of the parsing table is accomplished by means of standard LR methods, e.g.</Paragraph> <Paragraph position="10"> SLR(1) or LALR(1), using the &quot;core&quot; grammar (i.e. leaving out the attributes). The parsing algorithm itself barely changes compared to a standard shift-reduce algorithm. The shift step is not changed except for the need to copy the attributes from lexical entries when using a lexicon and a grammar with pre-terminals. The reduction step needs to be extended with an instantiation algorithm to compute the value of the variables and a succeed/fail result. It should fail whenever an instantiation fails or the value of a constant is not met.</Paragraph> <Paragraph position="11"> To accomplish this, the trees stored on the stack should include the values resulting from the evaluation of the right-hand side of the reduced rule.
This makes the instantiation step fairly straightforward.</Paragraph> <Paragraph position="12"> The variables can be bound while the elements are popped from the stack. If a variable is already bound, its value must match the corresponding value on the stack. If this cannot be done or if a constant value in a rule does not match the value on the stack, the reduction step fails. A simple example (not completely following the grammar sample above) may clarify this.</Paragraph> <Paragraph position="13"> In Figure 1a parsing succeeds just as it would have done if only the context-free part of the grammar had been used. The only difference is that the symbols on the stack have attributes attached to them. In Figure 1b however, parsing fails -- not because the context-free part of the grammar does not accept the sentence (the parse table does contain an entry for this case) but because the instantiation of pl and sg3 in rule 1 causes the reduction to fail.</Paragraph> <Paragraph position="14"> Note that the mechanism for variable binding is not completely equivalent to unification. It typically differs from unification in the reduction of the following two rules: [rules not preserved in the source]</Paragraph> <Paragraph position="16"> The reduction of rule 2 will leave two values on the stack rather than an indication that the two variables are one and the same. Therefore X and Y may differ after the reduction of rule 1.</Paragraph> <Section position="1" start_page="113" end_page="114" type="sub_section"> <SectionTitle> 4.3. Parsing Erroneous Input </SectionTitle> <Paragraph position="0"> 4.3.1. Coercing syntactic agreement Figure 1b shows one type of problem I am interested in, but clearly not the way to solve it. Though the parser actually detects the error, it does not give enough information on how to correct it.
It does not even stop at the right place,[4] since the incongruity is only detected once the entire sentence has been read.</Paragraph> <Paragraph position="1"> Therefore the reduction step should undergo further modification. It should not fail whenever the instantiation of a variable fails or a constant in the left-hand side of the rule being reduced does not match the corresponding value on the stack, but mark the incongruity and continue parsing instead. Later in the process, when the parsing has finished, the syntactic corrector checks the marks for incongruity and coerces agreement by feature propagation.</Paragraph> <Paragraph position="2"> This approach contrasts with, e.g., the approach taken by (Schwind, 1988), who proposes to devise an error rule (cf. section 4.3.3) for every unification error of interest. However, this makes efficient parsing with a large grammar nearly impossible since the size of the parsing table is exponentially related to the number of rules.</Paragraph> <Paragraph position="3"> Consider the error in The yelow cab stops. The English spelling corrector on my word processor (MS-Word) offers two alternatives: yellow and yellows. [4] This, of course, is caused by the context-free part of the grammar. If we had created a unique non-terminal for every non-terminal-feature combination, e.g. s -> NP_sing3_nom VP_sing3, parsing would have stopped at the right place (i.e. between &quot;man&quot; and &quot;eat&quot;). This however depends mainly on the structure of the grammar. E.g. in Dutch the direct object may precede the finite verb, in which case agreement can only be checked after having parsed the subject following the finite verb. Then the parser cannot fail before the first NP following the finite verb. This is too late in general.</Paragraph> <Paragraph position="4"> Since the string yelow is obviously incorrect, it has no syntactic category and the sentence cannot be parsed.
One might therefore try to substitute both alternatives and see what the parser comes up with, as in Figure 2. This example clearly shows that the only grammatically correct alternative is yellow. In this way a parser can help the spelling corrector to reduce the set of correction alternatives. Since a realistic natural language parser is capable of parsing words with multiple syntactic categories (e.g. stop is both a noun and a verb), the two entries for yelow can be parsed in a similar fashion. The grammatical alternative(s) can be found by inspecting the resulting parse trees afterwards.</Paragraph> <Paragraph position="5"> In order to handle errors caused by homophones as well, this mechanism needs to be extended. When dealing with legal words it should use their syntactic categories plus the syntactic categories of all possible homophones, plus -- to be on the safe side -- every alternative suggested by the spelling corrector.</Paragraph> <Paragraph position="6"> Afterwards the parse trees need to be examined to see whether the original word or one of its alternatives is preferred.</Paragraph> <Paragraph position="7"> The third and last category of errors the system attempts to deal with consists of the structural errors. General techniques for parsing sentences containing errors are difficult, computationally rather expensive and not completely fool-proof. For these reasons, and because only a very limited number of structural errors occur in real texts, I have developed a different approach. Instead of having a special mechanism in the parser find out the proper alternative, I added error rules to the formalism. The grammar should now contain foreseen improper constructions. These might treat some rare constituent order problems and punctuation problems.</Paragraph> <Paragraph position="8"> Natural language sentences are highly syntactically ambiguous, and allowing errors makes things considerably worse.
Even the simple toy grammar above yields a great number of useless parses on the sentence They think.</Paragraph> <Paragraph position="9"> The word think may have different entries for 1st and 2nd person singular, 1st, 2nd and 3rd person plural and for the infinitive. This would result in one parse tree without an error message and five parse trees indicating that the number of they does not agree with the number of think. By using sets of values instead of single values this number can be reduced, but in general the number of parses will be very large. Especially with larger grammars and longer sentences there will be large numbers of parses with all sorts of error messages.</Paragraph> <Paragraph position="10"> A simple method to differentiate between these parses is to count the number of errors, agreement violations, structural errors and spelling errors in each parse, and to order the parses accordingly. Then one only has to look at the parse(s) with the smallest number of errors. However, this concept of weight needs to be extended, since not all errors are equally probable. Some types of agreement violation simply never occur whereas others are often found in written texts. Orthographical and typographical errors and homophone substitution are frequent phenomena while structural errors are relatively rare. Suppose the parser encounters a sentence like Word je broer geopereerd? (Eng.: Are your brother (being) operated?). In Dutch this is a frequent error (see section 2.3), since the finite verb should indeed be word if je instead of je broer were the subject.</Paragraph> <Paragraph position="11"> (Translating word-by-word into English, the correction is either Is your brother (being) operated? or Are you brother (being) operated? Je is either you or your.) The most likely correction is the first one. How can a syntactic parser distinguish between these two alternatives? My solution involves adding error weights to grammar rules.
These cause a parse in which verb transitivity is violated to receive a heavier penalty than one with incorrect subject-verb agreement. Thus, parse trees can be ordered according to the sum of the error weights of their nodes.</Paragraph> </Section> <Section position="2" start_page="114" end_page="114" type="sub_section"> <SectionTitle> 4.4. Word Lattices </SectionTitle> <Paragraph position="0"> As noted in section 2.5, idiomatic expressions cause parsers a lot of trouble. I therefore propose that the parser should not operate directly on a linear sentence, but on a word lattice that has been prepared by a pre-processor. For a sentence like Hij kan te allen tijde komen logeren (he can come to stay at all times) such a structure might look like Figure 3. Instead of parsing each word of the expression te allen tijde separately, the parser can take it as a single word spanning three word positions at once or as three separate words. Should one of the words in the expression have been misspelled, the pre-processor builds a similar structure, but labels it with an error message containing the correct spelling obtained from the spelling corrector. Word lattices can of course become much more complex than this example.</Paragraph> <Paragraph position="1"> Since there is a pre-processor that is able to combine multiple words into a single item, it might as well be used to aid the parser in detecting two further types of errors. The first one is the Dutch split compound. By simply joining all the adjacent nouns (under some restrictions) the grammar and the parser can proceed as if split compounds do not occur. The second error type is word doubling. The pre-processor can join every subsequent repetition of a word with the previous occurrence so that they will be seen both as two distinct words and as one single word (since not every occurrence of word repetition is wrong).
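The pre-processing steps described above -- one edge per word, an extra edge spanning a known idiom, and an extra edge joining a doubled word -- can be sketched as follows. This is a simplified illustration with invented names; the real pre-processor also attaches error messages and corrections to edges.

```python
def build_lattice(words, idioms):
    """Build a word lattice as a list of edges (start, end, item).

    Sketch only: every word is an edge of length one; a known multi-word
    idiom adds one edge spanning all its words; a doubled word adds an
    extra edge so the parser sees the pair both as two words and as one
    (since not every repetition is an error).
    """
    edges = [(i, i + 1, w) for i, w in enumerate(words)]
    for idiom in idioms:                       # e.g. ('te', 'allen', 'tijde')
        n = len(idiom)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == idiom:
                edges.append((i, i + n, ' '.join(idiom)))
    for i in range(len(words) - 1):            # word doubling: "the the"
        if words[i] == words[i + 1]:
            edges.append((i, i + 2, words[i]))
    return sorted(edges)

words = 'hij kan te allen tijde komen'.split()
for edge in build_lattice(words, [('te', 'allen', 'tijde')]):
    print(edge)  # the idiom edge (2, 5, 'te allen tijde') appears
                 # alongside the six single-word edges
```

The parser can then explore any path through the lattice, choosing per span between the idiom reading and the word-by-word reading.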
Another possibility is to concatenate adjacent words when the concatenated form occurs as one entry in the dictionary. E.g. many people do not know whether to write er op toe zien, erop toezien, or any other combination (though a parser might not always have the right answer either).</Paragraph> </Section> </Section> <Section position="6" start_page="114" end_page="115" type="metho"> <SectionTitle> 5. Integration and Heuristics </SectionTitle> <Paragraph position="0"> The combination of the modules described above -- a spell checker with compound analysis, a spelling corrector, a robust parser and a syntactic corrector -- does not lead by itself to a batch-oriented proof-reading system. Most texts contain not only sentences, but also titles and chapter headings, captions, jargon, proper names, neologisms, interjections, dialogues (&quot;yes&quot;, she said, &quot;yes, that is true, but...&quot;), quotations in other languages, literature references, et cetera, not to mention mark-up and typesetting codes. The system therefore has a mechanism for dealing with the layout aspects of texts and some heuristics for dealing with proper names, jargon and neologisms. The layout aspects include mark-up codes and graphics, title markers and a mechanism for representing diacritics, such as the diaeresis, which is frequent in Dutch.</Paragraph> <Paragraph position="1"> Dictionaries seldom contain all words found in a text. In Dutch, part of the problem can be solved by using compound analysis. However, a misspelled word can sometimes be interpreted as a compound, or as two words accidentally written together. I partially solved this problem by having the compound analyzer repeat the analysis without the word grammar if it fails with the word grammar, and by defining a criterion which marks certain compounds as &quot;suspicious&quot;.[5]
If the analyzer marks the compound as either suspicious or ungrammatical, the spelling corrector is invoked to see if a good alternative (i.e., a closely resembling and frequent word) can be found instead, or else, if the compound was ungrammatical, whether it can be split into separate words. This process is further improved by adding the correct compounds in the text to the internal word list of the spelling corrector.</Paragraph> <Paragraph position="2"> Other words that do not appear in a dictionary are proper names, jargon and neologisms. Therefore the system first scans the entire text for all word types, counting the tokens, before it starts parsing. My rule of thumb is to treat words that appear mainly capitalized in the text as proper names.</Paragraph> <Paragraph position="3"> Frequently occurring words that do not have a good correction are assumed to be neologisms. Both proper names and neologisms are added to the internal word list of the spelling corrector. The main disadvantage of this approach is that it misses consistently misspelled words. At the end of the run, therefore, the system provides a list of all the words it tacitly assumed to be correct, which must then be checked manually.</Paragraph> <Paragraph position="4"> Another feature of the system is that it coerces variant spelling into preferred spelling. This feature also takes into consideration compounds for which there is no official preferred spelling, thus preventing a compound from being written in different ways. E.g. both [5] Misspelled words can often be analyzed as sequences of very small words. E.g. the misspelled kwaliteitesverbetering (which should be kwaliteitsverbetering, Eng.: quality improvement) can be divided into kwaliteit+es+verbetering, which could mean quality ash improvement.
The amount of overgeneration correlates strongly with the size of the dictionary.</Paragraph> <Paragraph position="5"> spellingcorrectie and spellingscorrectie (Eng.: spelling correction) are correct in Dutch. My system only allows one to occur in a text and coerces the least frequently occurring variants into the most frequent one.</Paragraph> <Paragraph position="6"> Last but not least, a few tricks help to reduce parsing time. Since the system cannot detect all types of errors with equal reliability (cf. section 6), I added a d/t-mode in which only sentences that might contain a d/t-error (cf. section 2.3) are parsed. In this mode a pre-processor first checks whether the sentence contains such a &quot;d/t-risk&quot; word. If this is the case the parser is invoked, but error messages not pertaining to this class of errors are suppressed.</Paragraph> <Paragraph position="7"> As d/t-risks show up in less than a quarter of all sentences, parsing time is cut by a factor of four at least. Although this solution can hardly be called elegant, it gives the user a faster and more reliable system.</Paragraph> <Paragraph position="8"> There is also an upper bound on the number of allowed parses. Because analyzing a parse tree takes some time, this speeds up the process. The disadvantage is that the system may choose an unlikely correction more often, as it cannot compare all parse trees. Large sentences with multiple errors may produce thousands of parse trees, each of which has to be scored for comparison. As the allowed number of parses falls below the potential number of parses, the probability that the system overlooks a likely correction grows. But since it produces an error message anyway, albeit an unlikely one, the advantage outweighs the disadvantage.</Paragraph> </Section> </Paper>