<?xml version="1.0" standalone="yes"?> <Paper uid="A88-1011"> <Title>TRIPHONE ANALYSIS: A COMBINED METHOD FOR THE CORRECTION OF ORTHOGRAPHICAL AND TYPOGRAPHICAL ERRORS.</Title> <Section position="3" start_page="0" end_page="77" type="metho"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="77" type="sub_section"> <SectionTitle> 1.1 Error types </SectionTitle> <Paragraph position="0"> Any method for the correction of word level errors in written texts must be carefully tuned. On the one hand, the number of probable corrections should be maximized; on the other hand, the number of unlikely corrections should be minimized. In order to achieve these goals, the characteristics of specific error types must be exploited as much as possible. In this article we distinguish two major types of word level errors: orthographical errors and typographical errors. They have some clearly different characteristics.</Paragraph> <Paragraph position="1"> Orthographical errors are cognitive errors consisting of the substitution of a deviant spelling for a correct one when the author either simply doesn't know the correct spelling, forgot it, or misconceived it. An important characteristic of orthographical errors is that they generally result in a string which is phonologically identical or very similar to the correct string (e.g. indicies instead of indices1). As a consequence, orthographical errors are dependent on the correspondence between spelling and pronunciation in a particular language. Another characteristic is that proper names, infrequent words and foreign words are particularly prone to orthographical errors.</Paragraph> <Paragraph position="2"> 1 All examples of errors given in this article were actually found by the authors in texts written by native speakers of the language in question.</Paragraph> <Paragraph position="3"> Typographical errors are motoric errors caused by hitting the wrong sequence of keys. Hence their characteristics depend on the use of a particular keyboard rather than on a particular language. Roughly eighty percent of these errors can be described as single deletions (e.g. continous), insertions (e.g. explaination), substitutions (e.g. anyboby) or transpositions (e.g. autoamtically), while the remaining twenty percent are complex errors (Peterson, 1980). Some statistical facts about typographical errors are that word-initial errors are rare, and that doubling and undoubling (e.g. succeeed, discusion) are common. In general, typographical errors do not lead to a string which is homophonous with the correct string.</Paragraph> <Paragraph position="6"> Most of the correction methods currently in use in spelling checkers are biased toward the correction of typographical errors. We argue that this is not the right thing to do. Even if orthographical errors are not as frequent as typographical errors, they are not to be neglected for a number of good reasons. First, orthographical errors are cognitive errors, so they are more persistent than typographical errors: proof-reading by the author himself will often fail to lead to correction. Second, orthographical errors leave a worse impression on the reader than typographical errors. Third, the use of orthographical correction for standardization purposes (e.g. consistent use of either British or American spelling) is an important application appreciated by editors.
In this context, our research pays special attention to Dutch, which has a preferred standard spelling but allows alternatives for a great many foreign words, e.g. architect (preferred) vs. architekt (allowed and commonly used in Dutch). Editors of books generally prefer a consistent use of the standard spelling.</Paragraph> <Paragraph position="9"> Finally, we would like to point out that methods for orthographical error correction can not only be applied in text processing, but also in database retrieval. In fact, our research was prompted partly by a project proposal for a user interface to an electronic encyclopedia. One of our experiments, involving a list of some five thousand worldwide geographical names (mainly in Dutch spelling, e.g. Noordkorea, Nieuwzeeland), has yielded very positive results. In this context, the correction of orthographical errors is obviously more important than the correction of typographical errors.</Paragraph> </Section> <Section position="2" start_page="77" end_page="77" type="sub_section"> <SectionTitle> 1.2 Correction strategies </SectionTitle> <Paragraph position="0"> Daelemans, Bakker & Schotel (1984) distinguish between two basic kinds of strategies: statistical and linguistic strategies. Statistical strategies are based on string comparison techniques, often augmented by specific biases using statistical characteristics of some error types, such as the fact that typographical errors do not frequently occur in the beginning of a word.</Paragraph> <Paragraph position="1"> Since these strategies do not exploit any specific linguistic knowledge, they will generally work better for typographical errors than for orthographical errors.</Paragraph> <Paragraph position="2"> Linguistic strategies exploit the fact that orthographical errors often result in homophonous strings (sound-alikes, e.g. consistancy and consistency). They normally involve some kind of phonemic transcription. Typographical errors which do not severely affect the pronunciation, such as doubling and undoubling, may be covered as well, but in general, linguistic strategies will do a poor job on all other typographical errors.</Paragraph> <Paragraph position="3"> Because each type of strategy is oriented toward one class of errors only, what is needed in our opinion is a combined method for orthographical and typographical errors. Our research has explored one approach to this problem, namely, the combination of a linguistic strategy with a statistical one.</Paragraph> <Paragraph position="4"> The remainder of this paper is structured as follows. First we will discuss and criticize some existing statistical and linguistic correction methods. Then we will introduce triphone analysis. Finally we will report some results of an experiment with this method.</Paragraph> </Section> </Section> <Section position="4" start_page="77" end_page="79" type="metho"> <SectionTitle> 2. SOME EXISTING CORRECTION METHODS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="77" end_page="77" type="sub_section"> <SectionTitle> 2.1 Spell </SectionTitle> <Paragraph position="0"> In Peterson's SPELL (Peterson, 1980), all probable corrections are directly generated from an incorrect string by considering the four major single error types. The program first makes a list of all strings from which the incorrect string can be derived by a single deletion, insertion, substitution or transposition. This list is then matched against the dictionary: all strings occurring in both the list and the dictionary are considered probable corrections.</Paragraph>
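As an illustration only, here is a minimal Python sketch of this candidate-generation scheme; the function names, the lower-case alphabet and the example call are illustrative assumptions and not part of SPELL itself.

# Minimal sketch of SPELL-style correction (after Peterson, 1980):
# generate every string that differs from the misspelling by one
# deletion, insertion, substitution or transposition of a single
# character, then keep those strings that occur in the dictionary.

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def single_error_variants(word):
    """All strings from which `word` can be derived by one single error."""
    variants = set()
    for i in range(len(word) + 1):
        # the misspelling may have deleted a character of the original,
        # so the original is recovered by inserting one here
        for c in ALPHABET:
            variants.add(word[:i] + c + word[i:])
    for i in range(len(word)):
        # an inserted character in the misspelling -> delete it
        variants.add(word[:i] + word[i + 1:])
        # a substituted character -> replace it
        for c in ALPHABET:
            variants.add(word[:i] + c + word[i + 1:])
        # transposed adjacent characters -> swap them back
        if i + 1 < len(word):
            variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    variants.discard(word)
    return variants

def spell_corrections(misspelling, dictionary):
    return sorted(single_error_variants(misspelling) & dictionary)

# e.g. spell_corrections("continous", {"continuous", "continues"})
# returns ['continuous']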
<Paragraph position="1"> Although the number of derivations is relatively small for short strings, they often lead to several probable corrections because many of them will actually occur in the dictionary. For longer strings, many possible derivations are considered but most of those will be non-existent words.</Paragraph> <Paragraph position="2"> An advantage of SPELL with respect to all other methods is that short words can be corrected just as well as long ones. A disadvantage is that all complex errors and many orthographical errors fall outside the scope of SPELL.</Paragraph> </Section> <Section position="2" start_page="77" end_page="78" type="sub_section"> <SectionTitle> 2.2 Speedcop </SectionTitle> <Paragraph position="0"> SPEEDCOP (Pollock & Zamora, 1984) uses a special technique for searching and comparing strings. In order to allow a certain measure of similarity, strings are converted into similarity keys which intentionally blur the characteristics of the original strings. The key of the misspelling is looked up in a list of keys for all dictionary entries. The keys found in the list within a certain distance of the target key are considered probable corrections.</Paragraph> <Paragraph position="1"> The blurring of the similarity keys must be carefully fine-tuned. On the one hand, if too much information is lost, too many words collate to the same key. If, on the other hand, too much information is retained, the key will be too sensitive to alterations by misspellings. Two similarity keys are used in SPEEDCOP: a skeleton key and an omission key. These keys are carefully designed in order to partially preserve the characters in a string and their interrelationships. The information contained in the key is ordered according to some characteristics of typographical errors, e.g. the fact that word-initial errors are infrequent and that the sequence of consonants is often undisturbed.</Paragraph> <Paragraph position="2"> The skeleton key contains the first letter of a string, then the remaining consonants and finally the remaining vowels (in order, without duplicates). E.g. the skeleton key of information would be infrmtoa.</Paragraph> <Paragraph position="4"> The advantage of using this key is that some frequent error types such as doubling and undoubling of characters as well as transpositions involving one consonant and one vowel (except for an initial vowel) result in keys which are identical to the keys of the original strings.</Paragraph> <Paragraph position="5"> The most vulnerable aspect of the skeleton key is its dependence on the first few consonants. This turned out to be a problem, especially for omissions. Therefore, a second key, the omission key, was developed. According to Pollock & Zamora (1984), consonants are omitted in the following declining order of frequency: RSTNLCHDPGMFBYWVZXQKJ. The omission key is constructed by first putting the consonants in increasing order of omission frequency and adding the vowels in order of occurrence. E.g. the omission key for information is fmntrioa.</Paragraph>
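A small Python sketch of the two keys, reconstructed from the description above, follows; the identifiers are illustrative and this is not Pollock & Zamora's code. The treatment of y as a consonant simply follows the omission ordering given above.

# Sketch of SPEEDCOP-style similarity keys (after Pollock & Zamora, 1984),
# reconstructed from the prose description above.

VOWELS = set("aeiou")
# consonants in declining order of omission frequency
OMISSION_ORDER = "rstnlchdpgmfbywvzxqkj"

def _unique(chars):
    seen, out = set(), []
    for c in chars:
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out

def skeleton_key(word):
    """First letter, then the remaining consonants, then the remaining
    vowels, each in order of occurrence and without duplicates."""
    word = word.lower()
    first, rest = word[0], word[1:]
    consonants = [c for c in rest if c not in VOWELS and c != first]
    vowels = [c for c in rest if c in VOWELS and c != first]
    return first + "".join(_unique(consonants)) + "".join(_unique(vowels))

def omission_key(word):
    """Consonants in increasing order of omission frequency, followed by
    the vowels in order of occurrence."""
    word = word.lower()
    consonants = _unique(c for c in word if c not in VOWELS)
    vowels = _unique(c for c in word if c in VOWELS)
    consonants.sort(key=OMISSION_ORDER.index, reverse=True)
    return "".join(consonants) + "".join(vowels)

# skeleton_key("information") -> 'infrmtoa'
# omission_key("information") -> 'fmntrioa'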
<Paragraph position="7"> SPEEDCOP exploits the statistical properties of typographical errors well, so it deals better with frequent kinds of typographical errors than with infrequent ones. Because of this emphasis on typographical errors, its performance on orthographical errors will be poor. A specific disadvantage is its dependence on the correctness of initial characters. Even when the omission key is used, word-initial errors involving e.g. j or k do not lead to an appropriate correction.</Paragraph> <SectionTitle> 2.3 Trigram analysis: Fuzzie and Acute </SectionTitle> <Paragraph position="9"> Trigram analysis, as used in FUZZIE (De Heer, 1982) and ACUTE (Angell, 1983), uses a more general similarity measure. The idea behind this method is that a word can be divided into a set of small overlapping substrings, called n-grams, which each carry some information about the identity of a word.</Paragraph> <Paragraph position="10"> When a misspelling has at least one undisturbed n-gram, the correct spelling can still be traced. For natural languages, trigrams seem to have the most suitable length. E.g., counting one surrounding space, the word trigram is represented by the trigrams #tr, tri, rig, igr, gra, ram, and am#. Bigrams are in general too short to contain any useful identifying information while tetragrams and larger n-grams are already close to average word length.</Paragraph> <Paragraph position="11"> Correction using trigrams proceeds as follows. The trigrams in a misspelling are looked up in an inverted file consisting of all trigrams extracted from the dictionary. With each trigram in this inverted file, a list of all words containing the trigram is associated. The words retrieved by means of the trigrams in the misspelling are probable corrections.</Paragraph> <Paragraph position="12"> The difference between FUZZIE and ACUTE is mainly in the criteria which are used to restrict the number of possible corrections. FUZZIE emphasizes frequency as a selection criterion whereas ACUTE also uses word length. Low frequency trigrams are assumed to have a higher identifying value than high frequency trigrams. In FUZZIE, only the correction candidates associated with the n least frequent trigrams, which are called selective trigrams, are considered. ACUTE offers the choice between giving low frequency trigrams a higher value and giving all trigrams the same value.</Paragraph> <Paragraph position="13"> Taking trigram frequency into account has advantages as well as disadvantages. On the one hand, there is a favorable distribution of trigrams in natural languages in the sense that there is a large number of low frequency trigrams. Also, the majority of words contain at least one selective trigram. On the other hand, typographical errors may yield very low frequency trigrams which inevitably get a high information value.</Paragraph> <Paragraph position="14"> In general, trigram analysis works better for long words than for short ones, because a single error may disturb all or virtually all trigrams in a short word. Some advantages of this method are that the error position is not important and that complex errors (e.g. differenent) and, to a certain extent, orthographical errors can often be corrected. A disadvantage which is specific to this method is that transpositions disturb more trigrams than other types of errors and will thus be more difficult to correct.</Paragraph>
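To make the mechanism concrete, here is a rough Python sketch of trigram extraction and FUZZIE-style selection through an inverted file; the identifiers, the frequency threshold and the simple shared-trigram score are illustrative assumptions rather than the actual FUZZIE or ACUTE weighting.

# Rough sketch of trigram analysis with a FUZZIE-style inverted file.
from collections import defaultdict

def trigrams(word):
    """Trigrams of `word`, counting one surrounding space (shown as '#')."""
    padded = "#" + word.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_inverted_file(dictionary):
    """Map each trigram to the set of dictionary words containing it."""
    inverted = defaultdict(set)
    for word in dictionary:
        for t in trigrams(word):
            inverted[t].add(word)
    return inverted

def fuzzie_candidates(misspelling, inverted, max_words=50):
    """Look up only the selective (low-frequency) trigrams and rank the
    retrieved words by how many trigrams they share with the misspelling."""
    query = trigrams(misspelling)
    # approximate a trigram's frequency by the size of its posting list
    selective = [t for t in query if len(inverted.get(t, ())) <= max_words]
    scores = defaultdict(int)
    for t in selective:
        for word in inverted.get(t, ()):
            scores[word] += 1
    return sorted(scores, key=scores.get, reverse=True)

# trigrams("trigram") -> ['#tr', 'tri', 'rig', 'igr', 'gra', 'ram', 'am#']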
<Paragraph position="15"> Trigram analysis lends itself well to extensions. By first selecting a large group of intermediate solutions, i.e. all words which share at least one selective trigram with the misspelling, there is a lot of room for other factors to decide which words will eventually be chosen as probable corrections. ACUTE for example uses word length as an important criterion.</Paragraph> </Section> <Section position="3" start_page="78" end_page="79" type="sub_section"> <SectionTitle> 2.4 The PF-474 chip </SectionTitle> <Paragraph position="0"> The PF-474 chip is a special-purpose VLSI circuit designed for very fast comparison of a string with every entry in a dictionary (Yianilos, 1983). It consists of a DMA controller for handling input from a database (the dictionary), a proximity computer for computing the proximity (similarity) of two strings, and a ranker for ranking the 16 best solutions according to their proximity values.</Paragraph> <Paragraph position="1"> The proximity value (PV) of two strings is a function of the number of corresponding characters of both strings counted in forward and backward directions. It is basically expressed as a ratio.</Paragraph> <Paragraph position="3"> This value can be influenced by manipulating the parameters weight, bias and compensation. The parameter weight makes some characters more important than others. This parameter can e.g. be manipulated to reflect the fact that consonants carry more information than vowels. The parameter bias may correct the weight of a character in either word-initial or word-final position. The parameter compensation determines the importance of an occurrence of a certain character within the word. By using a high compensation/weight ratio, for example, substitution of characters will be less severe than omission. One may force two characters to be considered identical by equalizing their compensation and weight values.</Paragraph> <Paragraph position="4"> An advantage of the PF-474 chip, apart from its high speed, is that it is a general string comparison technique which is not biased toward a particular kind of error. By carefully manipulating the parameters, many orthographical errors may be corrected in addition to typographical errors.</Paragraph>
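The exact ratio used by the chip is not reproduced in this copy of the paper, so the Python fragment below is only a loose software illustration of weighted forward/backward character matching; it is not the PF-474's documented formula, and the handling of the weight parameter (bias and compensation are left out) is a simplification.

# Loose illustration (NOT the PF-474's actual formula) of a proximity
# value based on characters that correspond when the strings are scanned
# in forward and backward directions, with optional per-character weights.

def proximity(a, b, weight=None):
    """Share of position-wise matching characters, scanned from both ends."""
    weight = weight or {}                 # e.g. give consonants more weight
    w = lambda c: weight.get(c, 1.0)
    n = min(len(a), len(b))
    forward = sum(w(a[i]) for i in range(n) if a[i] == b[i])
    backward = sum(w(a[-1 - i]) for i in range(n) if a[-1 - i] == b[-1 - i])
    total = sum(w(c) for c in a) + sum(w(c) for c in b)
    return (forward + backward) / total if total else 0.0

def best_matches(misspelling, dictionary, k=16, weight=None):
    """Rank dictionary entries by proximity and keep the k best
    (the chip's ranker keeps the 16 best solutions)."""
    ranked = sorted(dictionary,
                    key=lambda entry: proximity(misspelling, entry, weight),
                    reverse=True)
    return ranked[:k]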
</Section> <Section position="4" start_page="79" end_page="79" type="sub_section"> <SectionTitle> 2.5 Spell Therapist </SectionTitle> <Paragraph position="0"> SPELL THERAPIST (Van Berkel, 1986) is a linguistic method for the correction of orthographical errors. The misspelling is transcribed into a phonological code which is subsequently looked up in a dictionary consisting of phonological codes with associated spellings. The phonemic transcription, based on the GRAFON system (Daelemans, 1987), is performed in three steps. First the character string is split into syllables. Then a rule-based system converts each syllable into a phoneme string by means of transliteration rules. These syllabic phoneme strings are further processed by phonological rules which take the surrounding syllable context into account and are finally concatenated.</Paragraph> <Paragraph position="1"> The transliteration rules in SPELL THERAPIST are grouped into three ordered lists: one for the onset of the syllable, one for the nucleus, and one for the coda. Each rule consists of a graphemic selection pattern, a graphemic conversion pattern, and a phoneme string. The following rule is an example for Dutch onsets: ((sc(- h i e y)) c /k/)</Paragraph> <Paragraph position="3"> This rule indicates that in a graphemic pattern consisting of sc which is not followed by either h, i, e or y, the grapheme c is to be transcribed as the phoneme /k/.</Paragraph> <Paragraph position="4"> The transcription proceeds as follows. The onset of a syllable is matched with the graphemic selection patterns in the onset rule list. The first rule which matches is selected. Then the characters which match with the conversion pattern are converted into the phoneme string. The same procedure is then performed for the nucleus and coda of the syllable.</Paragraph>
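As a rough illustration of how such a rule list can be applied, a small Python sketch follows. The tuple encoding (preceding context, conversion grapheme, excluded following characters, phoneme) and the two extra rules are assumptions made for the sake of a runnable example; they are not GRAFON's or SPELL THERAPIST's actual rule base.

# Minimal sketch of onset transliteration in the spirit of the rule above.
# Only the first rule comes from the text; the others are assumed fillers.

ONSET_RULES = [
    ("s", "c", set("hiey"), "k"),   # sc not followed by h/i/e/y: c -> /k/
    ("",  "s", set(),       "s"),   # assumed: s -> /s/
    ("",  "c", set(),       "s"),   # assumed fallback: c -> /s/
]

def transcribe_onset(onset):
    """Left-to-right application of the first matching rule per position."""
    phonemes, i = [], 0
    while i < len(onset):
        for preceding, grapheme, not_before, phoneme in ONSET_RULES:
            following = onset[i + len(grapheme):i + len(grapheme) + 1]
            if (onset.startswith(grapheme, i)
                    and onset[:i].endswith(preceding)
                    and following not in not_before):
                phonemes.append(phoneme)
                i += len(grapheme)
                break
        else:                        # no rule matched: copy the character
            phonemes.append(onset[i])
            i += 1
    return phonemes

# transcribe_onset("sc") -> ['s', 'k']
# (with an h, i, e or y following, the first rule would be blocked)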
<Paragraph position="5"> The result of the transcription is then processed by means of phonological rules, which convert a sequence of phonemes into another sequence of phonemes in a certain phonological context on the level of the word. An example for Dutch is the cluster reduction rule which deletes a /t/ in certain consonant clusters.</Paragraph> <Paragraph position="7"> Such rules account for much of the power of SPELL THERAPIST because many homophonous orthographic errors seem to be related to rules such as assimilation (e.g. inplementation) or cluster reduction and degemination (e.g. Dutch kunstof instead of kunststof).</Paragraph> <Paragraph position="8"> This method is further enhanced by the following refinements. First, a spelling may be transcribed into more than one phonological code in order to account for possible pronunciation variants, especially those due to several possible stress patterns. Second, the phonological code itself is designed to intentionally blur some finer phonological distinctions. E.g. in order to account for the fact that short vowels in unstressed syllables are prone to misspellings (e.g. optomization, incoded), such vowels are always reduced to a schwa /ə/. As a result, misspellings of this type will collocate.</Paragraph> <Paragraph position="10"> It is clear that this method is suited only for errors which result in completely homophonous spellings (e.g. issuing, inplementation). A somewhat less stringent similarity measure is created by using a coarse phonological coding, as mentioned above. Still, this method is not suitable for most typographical errors. Moreover, orthographical errors involving 'hard' phonological differences (e.g. managable, recommand) fail to lead to correction.</Paragraph> </Section> </Section> <Section position="5" start_page="79" end_page="81" type="metho"> <SectionTitle> 3. AN INTEGRATED METHOD </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="79" end_page="80" type="sub_section"> <SectionTitle> 3.1 Combining methods </SectionTitle> <Paragraph position="0"> Of the methods described in the previous section, no single method sufficiently covers the whole spectrum of errors. Because each method has its strengths and weaknesses, it is advantageous to combine two methods which supplement each other.</Paragraph> <Paragraph position="1"> Because orthographical errors are the most difficult and persistent, we chose to take a linguistic method as a starting point and added another method to cover its weaknesses. SPELL THERAPIST has two weak points. First, most typographical errors cannot be corrected. Second, even though the phonological codes are somewhat blurred, at least one possible transcription of the misspelling must match exactly with the phonological code of the intended word.</Paragraph> <Paragraph position="2"> A possible solution to both problems consists in applying a general string comparison technique to phonological codes rather than spellings. We decided to combine SPELL THERAPIST with trigram analysis by using sequences of three phonemes instead of three characters. We call such a sequence a triphone and the new strategy triphone analysis.</Paragraph> </Section> <Section position="2" start_page="80" end_page="80" type="sub_section"> <SectionTitle> 3.2 Triphone analysis </SectionTitle> <Paragraph position="0"> Triphone analysis is a fast and efficient method for correcting orthographical and typographical errors. When carefully implemented, it is not significantly slower than trigram analysis. The new method uses only one dictionary in the form of an inverted file of triphones. Such a file is created by first computing phonological variants for each word, then splitting each code into triphones, and finally adding backpointers from each triphone in the file to each spelling in which it occurs. Also, a frequency value is associated with each triphone.</Paragraph> <Paragraph position="2"> The way this inverted file is used during correction is virtually the same as in FUZZIE, except that first all phonological variants of the misspelling have to be generated. The grapheme-to-phoneme conversion is similar to that of SPELL THERAPIST, except that the phonological code is made even coarser by means of various simplifications, e.g. by removing the distinction between tense and lax vowels and by not applying certain phonological rules.</Paragraph> <Paragraph position="3"> The easiest way to select probable corrections from an inverted file is the method used by FUZZIE, because the similarity measure used by ACUTE requires that the number of triphones in the possible correction be known in advance. The problem with this requirement is that phonological variants may have different string lengths and hence a varying number of triphones.</Paragraph> <Paragraph position="4"> Using the FUZZIE method, each phonological variant may select probable corrections by means of the following steps: 1. The phonological code is split into triphones.</Paragraph> <Paragraph position="5"> 2. Each triphone receives an information value depending on its frequency. The sum of all values is 1.</Paragraph> <Paragraph position="6"> 3. The selective triphones (those with a frequency below a certain preset value) are looked up in the inverted file.</Paragraph> <Paragraph position="7"> 4. For all correction candidates found in this way, the similarity with the misspelling is determined by computing the sum of the information values of all triphones shared between the candidate and the misspelling.</Paragraph> <Paragraph position="8"> If a certain candidate for correction is found by more than one phonological variant, only the highest information value for that candidate is retained. After candidates have been selected for all variants, they are ordered by their similarity values. A possible extension could be realized by also taking into account the difference in string length between the misspelling and each candidate.</Paragraph>
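A condensed Python sketch of the selection steps listed above follows. The phonological-variant generation is assumed to exist elsewhere, and the mapping from triphone frequency to information value as well as the selectivity threshold are simplifications of the description, not the actual implementation (which was written in Common Lisp).

# Condensed sketch of triphone-analysis candidate selection (steps 1-4).
# `inverted` maps each triphone to the spellings containing it, and
# `freq` holds triphone frequencies; both come from the triphone
# dictionary described above.  Details are illustrative stand-ins.
from collections import defaultdict

def triphones(phon_code):
    """Step 1: split a phonological code (a string of phoneme symbols)
    into overlapping sequences of three, counting one boundary symbol."""
    padded = "#" + phon_code + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def information_values(tris, freq):
    """Step 2: a value inversely related to frequency, normalised so that
    the values of one code sum to 1."""
    raw = {t: 1.0 / (1 + freq.get(t, 0)) for t in set(tris)}
    total = sum(raw.values())
    return {t: v / total for t, v in raw.items()}

def candidates_for_variant(phon_code, inverted, freq, max_freq=100):
    tris = triphones(phon_code)
    info = information_values(tris, freq)
    # step 3: only selective (low-frequency) triphones are looked up
    selective = [t for t in info if freq.get(t, 0) <= max_freq]
    scores = defaultdict(float)
    for t in selective:
        # step 4: sum the information values of shared triphones
        for spelling in inverted.get(t, ()):
            scores[spelling] += info[t]
    return scores

def triphone_corrections(variants, inverted, freq):
    """Keep, per candidate, the highest score over all phonological
    variants of the misspelling, then rank by that score."""
    best = defaultdict(float)
    for code in variants:
        for spelling, score in candidates_for_variant(code, inverted, freq).items():
            best[spelling] = max(best[spelling], score)
    return sorted(best, key=best.get, reverse=True)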
<Paragraph position="9"> Because processing time increases with each phonological variant, it is important to reduce the number of variants as much as possible. A considerable reduction is achieved by not generating a separate variant for each possible stress pattern. The resulting inaccuracy is largely compensated by the fact that a perfect match is no longer required by the new method.</Paragraph> <Paragraph position="10"> Although this method yields very satisfactory results for both orthographical and typographical errors and for combinations of them, it does have some shortcomings for typographical errors in short words. One problem is that certain deletions cause two surrounding letters to be contracted into very different phonemes. Consider the deletion of the r in very: the pronunciation of the vowels in the resulting spelling, vey, changes substantially. Counting one surrounding space, the misspelling does not have a single triphone in common with the original and so it cannot be corrected.</Paragraph> <Paragraph position="11"> A second problem is that a character (or character cluster) leading to several possible phonemes carries more information than a character leading to a single phoneme. Consequently, an error affecting such a character disturbs more triphones.</Paragraph> </Section> <Section position="3" start_page="80" end_page="81" type="sub_section"> <SectionTitle> 3.3 An experiment </SectionTitle> <Paragraph position="0"> The triphone analysis method presented here has been implemented on a Symbolics LISP Machine and on an APOLLO workstation running Common LISP. After the programs had been completed, we decided to test the new method and compare its qualitative performance with that of the other methods.</Paragraph> <Paragraph position="2"> For a first, preliminary test we chose our domain carefully. The task domain had to be very error-prone, especially with respect to orthographical errors, so that we could elicit errors from human subjects under controlled circumstances. Given these requirements, we decided to choose Dutch surnames as the task domain. In Dutch, many surnames with the same pronunciation have very different spellings. For example, there are 32 different names with the same pronunciation as Theyse, and even 124 ways to spell Craeybeckx! When such a name is written in a dictation task (e.g. during a telephone conversation) the chance of the right spelling being chosen is quite small.</Paragraph> <Paragraph position="3"> For our experiment, we recorded deviant spellings of Dutch surnames generated by native speakers of Dutch in a writing-to-dictation task. A series of 123 Dutch surnames was randomly chosen from a telephone directory. The names were dictated to 10 subjects via a cassette tape recording. A comparison of the subjects' spellings with the intended spellings showed that on the average, subjects wrote down 37.6% of the names in a deviant way. The set of 463 tokens of misspellings contained 188 different types, which were subsequently given as input to implementations of each of the methods 2. The dictionary consisted of 254 names (the 123 names mentioned above plus 131 additional Dutch surnames randomly selected from a different source).
The results of the correction are presented in Tables 1 and 2.</Paragraph> <Paragraph position="4"> [Table 1 (values not preserved in this copy): numbers refer to percentages of recognized (first, second or third choice) or not recognized surnames (n = 188).]</Paragraph> <Paragraph position="5"> [Column headings: 1st choice / 2nd or 3rd / not found. Table 2 (values not preserved in this copy): numbers refer to percentages of recognized (first, second or third choice) or not recognized surnames multiplied by their ... (caption truncated). Footnote 2 fragment: "... instead of using the special hardware."]</Paragraph> </Section> <Section position="4" start_page="81" end_page="81" type="sub_section"> <SectionTitle> 3.4 Discussion </SectionTitle> <Paragraph position="0"> The experiment was designed in order to minimize typographical errors and to maximize orthographical errors. Hence it is not surprising that SPELL and SPEEDCOP, which are very much dependent on the characteristics of typographical errors, do very poorly. What is perhaps most surprising is that SPELL THERAPIST, a method aimed primarily at the correction of orthographical errors, shows worse results than FUZZIE, ACUTE and the PF-474 method, which are general string comparison methods. The reason is that a certain number of orthographical errors turned out to involve real phonological differences. These were probably caused by mishearings rather than misspellings. Poor sound quality of the cassette recorder and dialectal differences between speaker and hearer are possible causes. As expected, triphone analysis yielded the best results: no misspelling failed to be corrected, and only about one in twenty was not returned as the most likely correction.</Paragraph> </Section> </Section> </Paper>