<?xml version="1.0" standalone="yes"?>
<Paper uid="A92-1016">
  <Title>XUXEN: A Spelling Checker/Corrector for Basque Based on Two-Level Morphology Agirre E., Alegria I., Arregi X.,</Title>
  <Section position="3" start_page="119" end_page="119" type="metho">
    <SectionTitle>
2 Brief Description of Basque Morphology
</SectionTitle>
    <Paragraph position="0"> Basque is an agglutinative language; that is, for the formation of words the dictionary entry independently takes each of the elements necessary for the different functions (syntactic case included). More specifically, the affixes corresponding to the determinant, number and declension case are taken in this order and independently of each other (deep morphological structure).</Paragraph>
    <Paragraph position="1"> One of the principal characteristics of Basque is its declension system with numerous cases, which differentiates it from the languages of the surrounding countries. The inflections of determination, number and case appear only after the last element in the noun phrase. This last element may be the noun, but also typically an adjective or a determiner. For example, in etxe zaharreAN (etxe zaharrean: in the old house), etxe is a noun (house), zahar is an adjective (old), r and e are epenthetical elements, A marks the determinate singular, and N marks the inessive case. So, these inflectional elements are not repeated in each individual word of a noun phrase as in the Romance languages.</Paragraph>
    <Paragraph position="2"> Basque declension is unique; that is, there exists a single declension table for all inflectable entries, compared to Latin, for instance, which has 5 declension paradigms. As prepositional functions are realized by case suffixes inside word-forms, Basque has a relatively high power to generate inflected word-forms. For instance, from one noun entry a minimum of 135 inflected forms can be generated. Moreover, while 77 of them are simple combinations of number, determination, and case marks, not capable of further inflection, the other 58 are word-forms ending in one of the two possible genitives or in a sequence composed of a case mark and a genitive mark. For each of those 58 forms, adding again the same set of morpheme combinations (135) recursively generates a new, complete set of forms. This kind of construction reveals a noun ellipsis inside a complex noun phrase and could theoretically be extended ad infinitum; in practice, it is not usual to find more than two levels of this kind of recursion in a word-form but, in turn, some quite frequent forms contain even three or more levels. This means that a morphological analyzer for Basque should be able to recognize the amount</Paragraph>
    <Paragraph position="4"> (to the son) (of the son) (the house of the son) (the one (house) of the son) (to the one (house) of the son) This generation capability is similar for all parts of speech. In the case of adjectives, due to the possibility of graduation, this capability is 4 times greater. Grammatical gender does not exist in Basque; there is no masculine or feminine. However, the verb system sometimes marks the difference, depending on the addressee and the degree of familiarity: this is the case of the allocutive verb forms.</Paragraph>
    <Paragraph position="5"> Verb forms are composed of a main verb and an auxiliary finite form. The verb system in Basque is a rich one: a single finite verb form often contains morphemes corresponding to the ergative, nominative and dative cases. Derivation and composition are quite productive and they are widely used in neologism formation.</Paragraph>
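    <Paragraph> The figures quoted above (135 forms per noun entry, 58 of which can be inflected again) can be turned into a rough count of word-forms per recursion level. The following is only a sketch under the assumption, taken from the text, that every genitive-final form accepts the full set of 135 combinations again; the function name is hypothetical.

```python
BASE_FORMS = 135       # inflected forms per noun entry (figure from the text)
RECURSIVE_FORMS = 58   # forms ending in a genitive, open to further inflection


def forms_up_to(levels):
    """Count word-forms reachable with at most `levels` genitive recursions."""
    total = 0
    open_forms = 1  # number of stems that can still take the 135 combinations
    for _ in range(levels + 1):
        total += open_forms * BASE_FORMS
        open_forms *= RECURSIVE_FORMS
    return total


if __name__ == "__main__":
    for level in range(3):
        print(level, forms_up_to(level))
```

    Under this assumption one noun entry yields 135 forms without recursion, 7,965 with one level and 462,105 with two, which illustrates why storing every word-form is impractical.</Paragraph>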
  </Section>
  <Section position="4" start_page="119" end_page="121" type="metho">
    <SectionTitle>
3 Application of Two-Level Morphology to Basque
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="119" end_page="120" type="sub_section">
      <SectionTitle>
3.1 The Rules
</SectionTitle>
      <Paragraph position="0"> The correlations existing between the lexical level and the surface level due to morphophonological transformations are expressed by means of the rules. For Basque, 21 two-level rules have been defined. These rules fall into four groups according to their motivation: eminently phonological (7 rules), morphological (3 rules), orthographical (5 rules), and both phonological and morphological (6 rules). The effects of the rules are always phonological. Given that suppletion cases are rare in Basque, phonemically unrelated allomorphs of the same morpheme are included in the lexicon system as separate entries. No rules deal with these phenomena. The rules are applied to express three types of realizations: adding, removing, or alternating a character between the lexical and the surface level. These basic transformations can be combined.</Paragraph>
      <Paragraph position="1"> In order to control the application of the rules, 17 selection marks are used. Since two-level rules are sensitive only to the form of the word, these marks inform on part of speech, special endings and other features needed for handling exceptions in rules.</Paragraph>
      <Paragraph position="2"> Examples of rules: in the two rules discussed below, 2 is the selection mark stated at the beginning of affixes requiring epenthetical e, 1 is the selection mark stated at the end of those lemmas with final au diphthong, 6 is the selection mark stated at the end of those lemmas with final hard r, 4 is the selection mark stated at the end of verb infinitives with final n, and &amp; is the selection mark stated at the end of place names with final l or n which forces voicing of following t.</Paragraph>
      <Paragraph position="3"> The first rule states that the selection mark 2 is realized as surface e, always and only when it is preceded either by a consonant or a selection mark 8, or a selection mark 6 realized as surface r, or a selection mark 4.</Paragraph>
      <Paragraph position="4"> The second rule specifies the voicing of lexical t, always and only when it is preceded either by an n or l followed by the selection marks &amp; and 2, or by an n followed by the selection mark 2.</Paragraph>
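      <Paragraph> As an illustration of how the first rule constrains a lexical/surface pairing, the sketch below checks an aligned pair of strings against the context described above. It is not the automaton used by the system: the aligned lexical forms (such as zahar62an), the use of 0 for a null surface character, the vowel set and all function names are assumptions made for the example.

```python
VOWELS = set("aeiou")


def is_consonant(ch):
    return ch.isalpha() and ch not in VOWELS


def licenses_e(prev_lex, prev_surf):
    """Left context of the first rule: a consonant, mark 8, mark 6 realized as r, or mark 4."""
    return (is_consonant(prev_lex)
            or prev_lex == "8"
            or (prev_lex == "6" and prev_surf == "r")
            or prev_lex == "4")


def obeys_epenthesis_rule(lexical, surface):
    """True if every lexical 2 surfaces as e exactly when the left context holds."""
    if len(lexical) != len(surface):
        raise ValueError("aligned strings must have equal length")
    for i, (lex, surf) in enumerate(zip(lexical, surface)):
        if lex != "2":
            continue
        in_context = i > 0 and licenses_e(lexical[i - 1], surface[i - 1])
        if (surf == "e") != in_context:  # "always and only when"
            return False
    return True


if __name__ == "__main__":
    print(obeys_epenthesis_rule("zahar62an", "zaharrean"))  # hard r, then epenthetical e
    print(obeys_epenthesis_rule("etxe2an", "etxe0an"))      # vowel-final stem, no e
```

      The two-way test mirrors the "always and only when" reading of the rule: an e in the wrong context and a missing e in the right context are both rejected.</Paragraph>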
      <Paragraph position="5"> At the moment, the translation of rules into automata required by the two-level formalism is made by hand.</Paragraph>
    </Section>
    <Section position="2" start_page="120" end_page="120" type="sub_section">
      <SectionTitle>
3.2 The Lexicon System
</SectionTitle>
      <Paragraph position="0"> Among the morphological phenomena handled by our system so far, we would like to emphasize the following: whole declension system --including place and person names, special declension of pronouns, adverbs, etc.--, graduation of adjectives, relational endings and prefixes for verb forms --finite and non-finite-- and some frequent and productive cases of derivation and compounding.</Paragraph>
      <Paragraph position="1"> The lexicon system is divided into sublexicons. Lexical representation is defined by associating each entry to its sublexicon and giving it the corresponding continuation class.</Paragraph>
      <Paragraph position="2"> a) Sublexicons: Lemmas, auxiliaries of verbs and finite verb forms, and different affixes corresponding to declension, determination, number, verb endings, and so on are distinguished.</Paragraph>
      <Paragraph position="3"> All of the entries in the sublexicons are coded with their continuation class and morphological information. At present nearly 15,000 items are completely coded in the lexicon system: 8,697 lemmas, 5,439 verb forms and 120 affixes. They are grouped into 94 different sublexicons. Within a short time, this number will be increased in order to code all the 50,000 entries currently present in the database supporting the lexicon. The entry code gives, when appropriate, information on part of speech, determination, number, declension case, gender (exceptional cases), relation (of subordination), part of speech transformation that a relational affix produces, type of verb, root of finite verb forms, tense-mood, grammatical person, etc. along with the specific information each entry requires.</Paragraph>
      <Paragraph position="4"> b) Continuation class: Generalizations are not always possible. For example, while with nouns and adjectives the assignment of a single continuation class to all of the elements of each category has been possible, adverbs, pronouns and verbs have required more particularized solutions. A total of 79 continuation classes have been defined.</Paragraph>
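      <Paragraph> The interplay of sublexicons and continuation classes can be sketched with a toy recognizer. The sublexicon names, class names and the three entries below are hypothetical (the entries follow the etxe + a + n example of Section 2), and the sketch ignores the two-level rules by assuming that lexical and surface forms coincide.

```python
# Each entry: (form, continuation class, may the word end here?)
SUBLEXICONS = {
    "LEMMAS": [("etxe", "DECLENSION", True)],
    "DETERMINATION": [("a", "AFTER_DETERMINER", True)],
    "CASE": [("n", "END", True)],
}

# Each continuation class names the sublexicons allowed next.
CONTINUATION_CLASSES = {
    "ROOT": ["LEMMAS"],
    "DECLENSION": ["DETERMINATION"],
    "AFTER_DETERMINER": ["CASE"],
    "END": [],
}


def segmentations(word, cont_class="ROOT", can_end=False):
    """Return every morpheme segmentation of `word` licensed by the lexicon."""
    if not word:
        return [[]] if can_end else []
    results = []
    for sublexicon in CONTINUATION_CLASSES[cont_class]:
        for form, next_class, final in SUBLEXICONS[sublexicon]:
            if word.startswith(form):
                for rest in segmentations(word[len(form):], next_class, final):
                    results.append([form] + rest)
    return results


if __name__ == "__main__":
    print(segmentations("etxean"))
    print(segmentations("etxen"))
```

      A checker only needs to know whether the returned list is non-empty, whereas an analyzer would keep every segmentation together with the morphological information attached to each entry.</Paragraph>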
      <Paragraph position="5"> The system permits the unlimited accumulation and treatment of information as it extracts data from the dictionary according to the segmentation found. This feature is essential to Basque given that: a) a large amount of morpho-syntactic knowledge can be derived from a single word-form, and b) there is no set theoretical limit to the potential recursion of genitives.</Paragraph>
      <Paragraph position="6"> Separate representation for homographs and homonyms --in the main sublexicon, with the same or different continuation classes-- has been made possible. Although this distinction is not necessarily relevant to morphological analysis, future work on syntax and semantics has been taken into consideration.</Paragraph>
    </Section>
    <Section position="3" start_page="120" end_page="121" type="sub_section">
      <SectionTitle>
3.3 Some Problems and Possible Solutions
</SectionTitle>
      <Paragraph position="0"> Although the notation and concept of continuation class have been used until now, in the authors' opinion this is the weakest point of the formalism. Especially in dealing with the Basque auxiliary verb, cases of long-distance dependencies have been found that cannot be expressed adequately. Different solutions have been proposed to solve similar problems for other languages (Trost, 90; Schiller, 90). The solution suggested below is not as elegant and concise as a word-grammar, but it seems expressive enough and even more efficient when dealing with this kind of problem. To this end, an improved continuation class mechanism is being implemented. This mechanism supports the following two extra features: bans that can be stated together with a continuation class; they are used to express the set of continuation classes forbidden further along the word-form (from the lexical entry defined with this restricted continuation class).</Paragraph>
      <Paragraph position="1"> Examples: bait (PERTSONA - LA - N) states that, among the morphemes following the verb prefix bait in the word-form, those belonging to the continuation class PERTSONA are allowed, but that further on in the word no morphemes belonging to the continuation classes LA or N will be accepted.</Paragraph>
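      <Paragraph> A minimal sketch of how such a ban might be represented follows. It only reads the notation shown in the bait example and tests a sequence of continuation classes against the banned set; the function names and the idea of passing the later classes as a list are assumptions, not a description of the actual implementation.

```python
def parse_restricted_class(spec):
    """Split 'PERTSONA - LA - N' into the allowed class and the banned classes."""
    parts = [part.strip() for part in spec.split(" - ")]
    return parts[0], set(parts[1:])


def violates_bans(later_classes, banned):
    """True if any class met further along the word-form is banned."""
    return any(cls in banned for cls in later_classes)


if __name__ == "__main__":
    allowed, banned = parse_restricted_class("PERTSONA - LA - N")
    print(allowed, sorted(banned))
    print(violates_bans(["PERTSONA", "LA"], banned))
```

      The banned set has to be carried along for the rest of the analysis of the word-form, which is what distinguishes a ban from an ordinary continuation class.</Paragraph>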
      <Paragraph position="2"> continuation class-tree: the lexicon builder can restrict the set of continuation morphemes allowed for a given morpheme by making these morphemes explicit across different segments of the word-form; this is done by giving a parenthesized expression representing a tree. This mechanism improves the expressiveness of the formalism by giving it the additional power to specify constraints on the set of morphemes allowed after the lexicon entry, in effect stating a continuation "path" --not restricted to the immediate morpheme-- which makes that set explicit in a conditioned way.</Paragraph>
      <Paragraph position="3"> The morpheme na (nominative, first person) allows dative morphemes corresponding to the third person after the morpheme tzai (root) but not those corresponding to the first person. Analogously, the theoretically possible hatzain* is not grammatical in Basque because it combines two second person morphemes in nominative and dative cases. The continuation corresponding to na can be stated as follows: na (KI (DAT23 (N_KE)), TZAI (DAT23 (LAT))) which specifies two alternative continuation "paths" allowed after this morpheme: the one including the morphemes in the continuation class KI and the one including those in the continuation class TZAI. In both cases DAT23 restricts the set of morphemes potentially permitted as continuation of those in KI or TZAI, allowing only the 2nd and 3rd person dative morphemes. Without this extension of the formalism, the same effect could be obtained by storing the morpheme tzai repeatedly in two or more different lexicons, but this is not practical when the distance between dependent morphemes is longer. Similarly: ha (KI (DAT13 (N_KE)), TZAI (DAT13 (LAT))) is the way to express that ha (nominative, 2nd person) is to be combined with dative morphemes of 1st and 3rd person but not with those of 2nd.</Paragraph>
      <Paragraph position="4"> Continuation classes N_KE and LAT further restrict the morphemes allowed conditioning them in this case to the classes KI and TZAI respectively. Note that in this example two different cases of long-distance dependency are present.</Paragraph>
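      <Paragraph> The parenthesized notation can be read mechanically as a tree whose root-to-leaf branches are the permitted continuation paths. The sketch below parses the expression given for na and lists those paths; the parser and its function names are illustrative assumptions and say nothing about how the mechanism is implemented in the system.

```python
import re


def parse_tree(expression):
    """Parse 'KI (DAT23 (N_KE)), TZAI (...)' into a list of (name, children) nodes."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[(),]", expression)
    position = 0

    def parse_siblings():
        nonlocal position
        nodes = []
        while True:
            name = tokens[position]
            position += 1
            children = []
            if position != len(tokens) and tokens[position] == "(":
                position += 1              # skip "("
                children = parse_siblings()
                position += 1              # skip ")"
            nodes.append((name, children))
            if position != len(tokens) and tokens[position] == ",":
                position += 1              # another sibling follows
                continue
            return nodes

    return parse_siblings()


def continuation_paths(nodes):
    """List every root-to-leaf sequence of continuation classes."""
    paths = []
    for name, children in nodes:
        if children:
            for tail in continuation_paths(children):
                paths.append([name] + tail)
        else:
            paths.append([name])
    return paths


if __name__ == "__main__":
    tree = parse_tree("KI (DAT23 (N_KE)), TZAI (DAT23 (LAT))")
    print(continuation_paths(tree))
```

      For the na example this yields the two paths KI, DAT23, N_KE and TZAI, DAT23, LAT, matching the two alternatives described in the text.</Paragraph>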
    </Section>
  </Section>
  <Section position="5" start_page="121" end_page="121" type="metho">
    <SectionTitle>
4 The Lexical Database
</SectionTitle>
    <Paragraph position="0"> The lexical database is supported permanently in a relational system. This database is intended as an independent linguistic tool. Within this framework, information about the two-level lexicon system is stored in three different relations.</Paragraph>
    <Paragraph position="1"> Each lexicon is mainly characterized by the susceptibility of its components to be the initial morpheme in a word-form and by whether or not they are of semantic significance.</Paragraph>
    <Paragraph position="2"> In another relation, continuation classes are defined in terms of lexicons or other continuation classes. It is possible to store examples as well.</Paragraph>
    <Paragraph position="3"> Finally, the main component of the database is the set of lexicons with their associated entries: the two-level form of the entry is stored along with its original form, the source from which it has been obtained, examples, and in some cases (lemmas) the usage frequency. Obviously, the linguistic knowledge related to the entry is also stored in this relation.</Paragraph>
    <Paragraph position="4"> A user-friendly interface allows the lexicon builder to add and update entries, check consistency, etc. conveniently. Selection marks depending on knowledge contained in the database, such as part of speech, subcategorization of nouns, special endings for certain categories, etc., may be automatically derived from the information in the database.</Paragraph>
    <Paragraph position="5"> The production of the up-to-date run-time lexicon and continuation class definitions in the format required by the two-level system is obtained automatically from this database by means of specially designed procedures.</Paragraph>
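    <Paragraph> The three relations and the automatic export can be pictured with a small in-memory example. Everything below is a hypothetical sketch using Python's sqlite3 module: the table names, column names, sample row and output format are assumptions, since the text does not give the actual schema or the relational system used.

```python
import sqlite3

SCHEMA = """
CREATE TABLE lexicon (
    name TEXT PRIMARY KEY,
    can_start_word INTEGER NOT NULL,       -- may its entries begin a word-form?
    has_semantic_content INTEGER NOT NULL
);
CREATE TABLE continuation_class (
    name TEXT NOT NULL,
    member TEXT NOT NULL                   -- a lexicon or another continuation class
);
CREATE TABLE entry (
    lexicon TEXT NOT NULL REFERENCES lexicon(name),
    two_level_form TEXT NOT NULL,
    original_form TEXT NOT NULL,
    continuation_class TEXT,
    source TEXT,
    frequency INTEGER                      -- stored for lemmas only
);
"""


def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO lexicon VALUES ('LEMMAS', 1, 1)")
    conn.execute("INSERT INTO continuation_class VALUES ('DECLENSION', 'DETERMINATION')")
    conn.execute(
        "INSERT INTO entry VALUES ('LEMMAS', 'etxe', 'etxe', 'DECLENSION', 'demo', NULL)"
    )
    return conn


def export_runtime_lexicon(conn):
    """Produce one tab-separated line per entry for a run-time lexicon file."""
    rows = conn.execute(
        "SELECT lexicon, two_level_form, continuation_class "
        "FROM entry ORDER BY lexicon, two_level_form"
    )
    return ["\t".join(row) for row in rows]


if __name__ == "__main__":
    print(export_runtime_lexicon(build_demo_db()))
```

    Keeping the database as the single source and regenerating the run-time files from it means that corrections made through the interface reach the analyzer without manual editing.</Paragraph>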
  </Section>
  <Section position="6" start_page="121" end_page="123" type="metho">
    <SectionTitle>
5 The Spelling Checker/Corrector
</SectionTitle>
    <Paragraph position="0"> The morphological analyzer-generator is an indispensable basic tool for future work in the field of automatic processing of Basque, but in addition, it is the underlying basis of the spelling checker/corrector. The spelling checker accepts as good any word which permits a correct morphological breakdown, while the mission of the morphological analyzer is to obtain all of the possible breakdowns and the corresponding information. Languages with a high level of inflection such as Basque make it impossible to store every word-form in a dictionary, even in a very compressed way; so, spelling checking cannot be resolved without adequate morphological treatment of words.</Paragraph>
    <Paragraph position="1"> From the user's point of view, XUXEN is a valid system for analyzing documents produced with any word processor. It operates at a normal speed and takes up a reasonable amount of space, thus allowing it to work on any microcomputer.</Paragraph>
    <Section position="1" start_page="121" end_page="122" type="sub_section">
      <SectionTitle>
5.1 The Spelling Checker
</SectionTitle>
      <Paragraph position="0"> The basic idea of accepting words which have a correct morphological analysis is fulfilled with classic techniques and tools for detecting spelling errors (Peterson, 80). A filter program appropriate for the punctuation problems, capital letters, numbers, control characters and so on has been implemented. In addition to the mentioned problems, difficulties intrinsic to Basque, like word-composition, abbreviations, declension of foreign words, etc. have also been taken into account. Besides this filter, interactive dialogue with the user, buffers for the most frequent words (in order to improve the performance of the system), and maintenance of the user's own dictionary (following the structure of the two-level lexicon) are the essential elements to be added to the morphological analyzer for the creation of a flexible and efficient spelling checker.</Paragraph>
      <Paragraph position="1"> It is very important to notice the necessity of a suitable interface for lexical knowledge acquisition when it comes to managing with precision the inclusion of new lemmas in the user's own dictionary. Without this interface, morphological and morphotactical information essential to the checker would be left unknown and, so, no inflected forms could be accepted. Currently, the system acquires information from the user about part of speech, subcategorization for nouns --person or place names, mainly-- and some morphonological features like final hard-or-soft r distinction. So, by answering several questions, the user makes possible the correct assignment of continuation class and selection marks to the new lemma. In this way, open class entries may be accepted and adequately treated. Entries belonging to other classes may also be entered but no inflected forms of them will be recognized. This ability of the checker to deal correctly with new lemmas requires, in turn, certain grammatical knowledge from the user.</Paragraph>
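      <Paragraph> The checking loop described in this section can be sketched as a thin layer around the morphological analyzer. The class below is an assumption-laden illustration: the analyzer is passed in as a hypothetical callable returning a list of analyses, the filter handles only punctuation and numbers, and the user dictionary is simplified to a set of word-forms rather than a two-level lexicon.

```python
class SpellChecker:
    """Accept a token if the buffer, the user dictionary or the analyzer accepts it."""

    def __init__(self, analyze, frequent_forms=(), user_forms=()):
        self.analyze = analyze                  # hypothetical morphological analyzer
        self.buffer = set(frequent_forms)       # most frequent word-forms
        self.user_dictionary = set(user_forms)  # simplified user dictionary

    def accepts(self, token):
        word = token.strip(".,;:!?()\"'").lower()  # minimal filter step
        if not word or word.isdigit():
            return True
        if word in self.buffer or word in self.user_dictionary:
            return True
        return bool(self.analyze(word))


def toy_analyzer(word):
    # Stand-in for the two-level analyzer: knows a single word-form.
    return [["etxe", "a", "n"]] if word == "etxean" else []


if __name__ == "__main__":
    checker = SpellChecker(toy_analyzer, frequent_forms=["eta"])
    for token in ["Etxean,", "eta", "1992", "etxaen"]:
        print(token, checker.accepts(token))
```

      Consulting the buffer first reflects its stated purpose of improving performance: frequent word-forms are accepted without running the analyzer at all.</Paragraph>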
      <Paragraph position="2"> Our prototype, running on a SUN 3/280 and using a buffer containing 4,096 of the most frequent word-forms, checks an average of 17.1 words per second in a text with a rate of misspellings and unknown words (not present in the current lexicon) of 12.7%. Considering the word-forms the system deems as erroneous, statistical tests have shown that 60% are actual misspellings, 16% would have been recognized had the general lexicon been more comprehensive, and the rest correspond to specific words (technical terms, proper nouns, etc.) which the user should include in his own dictionary.</Paragraph>
      <Paragraph position="3"> Within a short time minor changes will provide greater performance. A PC version is also in use.</Paragraph>
    </Section>
    <Section position="2" start_page="122" end_page="123" type="sub_section">
      <SectionTitle>
5.2 The Spelling Corrector
</SectionTitle>
      <Paragraph position="0"> When a word is not recognized by the spelling checker, the user can choose, among other options, to ask the system for suggestions for replacing the erroneous word. These suggestions, logically, must be correct words which will be similar to the word-form given by the user.</Paragraph>
      <Paragraph position="1"> To find similar words to propose, there are two lines of work: 1) Using as a guide the "sources of error" described by Peterson (Peterson, 80), errors are basically of two types: - Errors due to lack of knowledge of the language: these errors are often not dealt with on the grounds that they are infrequent, but Pollock and Zamora (Pollock, 84) evaluate their frequency at between 10% and 15%.</Paragraph>
      <Paragraph position="2"> Moreover, because Basque is a language whose standardization for written use has begun only in recent years, a higher degree of error would be expected for it.</Paragraph>
      <Paragraph position="3"> - Typographical errors. According to the classic typification by Damerau (Damerau, 64), 80% of "typos" are one of the following four types: one extra character, one missing character, a mistaken character, or the transposition of two consecutive characters.</Paragraph>
      <Paragraph position="4"> Following that, n+26(n-1)+26n+(n-1) possible combinations (n being the length of a word) can be generated; they must be examined to verify their validity and the most probable must be selected. For this examination it is normal to use statistical methods which, though not very reliable, are highly efficient (Pollock, 84).</Paragraph>
      <Paragraph position="5"> 2) Definition of a measurement of distance between words and calculation of which words of the dictionary are at the smallest distance from the erroneous word (Angell, 83; Tanaka, 87). The most frequently used measure is the Levenshtein distance.</Paragraph>
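      <Paragraph> For reference, the distance-based approach of item 2 can be sketched with the standard dynamic-programming formulation of the Levenshtein distance. The helper that ranks dictionary words is an illustrative assumption; it presupposes a list of complete word-forms, which is precisely what a morpheme-based lexicon does not provide.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def nearest(word, dictionary, k=3):
    """Return the k dictionary words closest to `word` (ties broken alphabetically)."""
    return sorted(dictionary, key=lambda w: (levenshtein(word, w), w))[:k]


if __name__ == "__main__":
    print(levenshtein("etxaen", "etxean"))
    print(nearest("etxaen", ["etxean", "etxea", "zahar"], k=2))
```

      Note that plain Levenshtein distance counts a transposition of two adjacent characters as two edits, and that every dictionary word has to be compared with the erroneous one, which is why this approach is slower.</Paragraph>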
      <Paragraph position="6"> This second method, measurement of distance, is slower but much more reliable than the first one, though it is not suitable for a lexicon system where the words are incomplete, as is the case here. Due chiefly to this, the chosen option has been the adaptation of the first method, taking into account the following criteria: Handling of typical errors. A linguistic study has been carried out on typical errors, that is, errors most frequently committed due to lack of knowledge of the language itself or its latest standardization rules, or due to the use of dialectal forms. To store typical errors a parallel two-level lexicon subsystem is used. In this subsystem, each unit is an erroneous morpheme which is directly linked to the corresponding correct one. When searching for words the two-level mechanism is used together with this additional lexicon subsystem. When a word-form is not accepted by the checker, the typical errors subsystem is added and the system retries the orthographical checking. If the incorrect form is now correctly analyzed --so, it contains a typical error-- the correct morpheme corresponding to the erroneous one is directly obtained from the typical errors subsystem. There will also be additional two-level rules, which will reflect the erroneous but typical morphonological alternations in dialectal utilizations or training periods.</Paragraph>
      <Paragraph position="7"> Generating alternatives. Generating alternatives to typographical errors using Damerau's classification.</Paragraph>
      <Paragraph position="8"> Trigram analysis. In generating the alternatives, trigram analysis is used both for discarding some of them as well as for classifying them in order of probability.</Paragraph>
      <Paragraph position="9"> Spelling checking of proposals. On the basis of the three previous criteria alone, incorrect word-forms could be offered to the user. Therefore, the word-forms must be fed into the spelling checker to check whether they are valid or not.</Paragraph>
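      <Paragraph> The generation of alternatives following Damerau's four error types can be sketched as below. This is a generic single-error generator, not the system's code: the 26-letter lowercase alphabet and the function name are assumptions, and the number of distinct strings it returns need not coincide with the raw count given by the formula above.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: 26 letters, as in the count above


def damerau_candidates(word):
    """All strings one Damerau operation away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Undo an extra character: delete one character.
    deletions = {a + b[1:] for a, b in splits if b}
    # Undo a missing character: insert one letter.
    insertions = {a + c + b for a, b in splits for c in ALPHABET}
    # Undo a mistaken character: replace one letter.
    substitutions = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    # Undo a transposition: swap two consecutive characters.
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return (deletions | insertions | substitutions | transpositions) - {word}


if __name__ == "__main__":
    candidates = damerau_candidates("etxaen")
    print(len(candidates), "etxean" in candidates)
```

      Most of the generated strings are not words at all, which is why each surviving candidate still has to pass through the spelling checker before it is proposed.</Paragraph>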
      <Paragraph position="10"> The whole process would be especially slow, due mostly to the checking of alternatives. To speed it up the following techniques have been used: If during the analysis of the word considered wrong a correct morpheme has been found, the criteria of Damerau are applied only to the part not recognized morphologically, so that the number of possibilities is considerably lower. This criterion is applied on the basis that far fewer "typos" are committed at the beginning of a word (Yannakoudakis, 83).</Paragraph>
      <Paragraph position="11"> Moreover, on entering the proposals into the checker, the analysis continues from the state it was in at the end of that last recognized morpheme.</Paragraph>
      <Paragraph position="12"> On doing trigrammatical analysis a trigram table mechanism is used, by means of which generated proposals will be composed only of correct trigrams and classified by their order of probability; thus, correction analysis (the slowest element of the process) is not carried out with erroneous trigrams and the remaining analyses will be in the order of trigrammatical probability. Besides that, the number of proposals to be checked is also limited by filtering the words containing very low frequency trigrams, and never exceeds 20 forms. At any rate, after having obtained three correct proposals, the generation process will end.</Paragraph>
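      <Paragraph> The trigram filtering and the two limits mentioned above (at most 20 forms checked, generation stopped after three correct proposals) can be sketched as follows. The scoring rule used here, ranking a candidate by its rarest trigram, is an assumption, as are the word-boundary padding with # and all function names; the text does not specify how trigram probabilities are combined.

```python
from collections import Counter


def trigrams(word):
    padded = f"#{word}#"  # assumption: mark word boundaries with '#'
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_trigram_table(corpus_words):
    return Counter(t for word in corpus_words for t in trigrams(word))


def rank_candidates(candidates, table, max_checked=20):
    """Drop candidates with unseen trigrams and order the rest by rarest trigram."""
    scored = []
    for candidate in candidates:
        counts = [table[t] for t in trigrams(candidate)]
        if not counts or 0 in counts:  # contains an unattested trigram: discard
            continue
        scored.append((min(counts), candidate))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidate for _, candidate in scored[:max_checked]]


def propose(candidates, table, is_valid, max_proposals=3):
    """Check ranked candidates with the speller and stop after three proposals."""
    proposals = []
    for candidate in rank_candidates(candidates, table):
        if is_valid(candidate):
            proposals.append(candidate)
            if len(proposals) == max_proposals:
                break
    return proposals


if __name__ == "__main__":
    table = build_trigram_table(["etxean", "etxea", "etxe"])
    print(rank_candidates({"etxean", "etxqan", "etxea"}, table))
    print(propose({"etxean", "etxqan", "etxea"}, table, lambda w: w == "etxean"))
```

      Because the expensive step is the morphological check, discarding candidates with unattested trigrams before calling the checker is where most of the time is saved.</Paragraph>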
      <Paragraph position="13"> If a word is detected as a typical error, it will not be verified as a possible "typo". This requires the analysis of typical errors to take place before that of "typos", in spite of their being less probable. The justification is that we are particularly interested in giving preferential treatment to typical errors and, what's more, these can be handled more speedily.</Paragraph>
      <Paragraph position="14"> The average time for the generation of proposals for a misspelt word-form, on the SUN machine cited above, is 1.5 s. The best case, when three or more alternatives are found in the buffer of most frequent words, takes less than 0.1 s. The worst case, when no correct proposals are found for a long word-form and no correct initial morphemes were recognized during its analysis, takes around 6 s.</Paragraph>
    </Section>
  </Section>
</Paper>