File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/83/e83-1008_metho.xml
Size: 14,606 bytes
Last Modified: 2025-10-06 14:11:37
<?xml version="1.0" standalone="yes"?> <Paper uid="E83-1008"> <Title>KNOWLEDGE ENGINEERING APPROACH TO MORPHOLOGICAL ANALYSIS*</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> KNOWLEDGE ENGINEERING APPROACH TO MORPHOLOGICAL ANALYSIS* </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> Finnish is a highly inflectional language. A verb can have over ten thousand different surface forms - nominals slightly fewer. Consequently, a morphological analyzer is an important component of a system aiming at &quot;understanding&quot; Finnish. This paper briefly describes our rule-based heuristic analyzer for Finnish nominal and verb forms. Our tests have shown it to be quite efficient: the analysis of a Finnish word in a running text takes an average of 15 ms of DEC 20 CPU-time.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> I INTRODUCTION </SectionTitle> <Paragraph position="0"> This paper briefly discusses the application of rule-based systems to the morphological analysis of Finnish word forms. Production systems seem to us a convenient way to express the strongly context-sensltive segmentation of Finnish word forms. This work demonstrates that they can be implemented to efficiently perform segmentations and uncover their interpretations.</Paragraph> <Paragraph position="1"> For any computational system aiming at interpreting a highly inflectional language, such as Finnish, the morphological analysis of word forms is an important component. Inflectional suffixes carry syntactic and semantic information which is necessary for a syntactic and logical analysis of a sentence.</Paragraph> <Paragraph position="2"> In contrast to major Indo-European languages, such as English, where morphological analysis is often so simple that reports of systems processing these languages usually omit morphological discussion, the analysis of Finnish word forms is a hard problem.</Paragraph> <Paragraph position="3"> A few algorithmic approaches, i.e. methods using precise and fully-informed decisions, to a morphological analysis of Finnish have been reported. Brodda and Karlsson (1981) attempted to find the most probable morphological segmentation for an arbitrary Finnish surface-word form without a reference to a lexicon. They report surprisingly high success, close to 90 %. However, their system neither transforms stems into a basic form, nor finds morphotactic interpretations. Karttunen et *This research is being supported by SITRA (Finnish National Fund for Research and Development) P.O. Box 329, 00121Helsinki 12, Finland IDigitalSystems Laboratory 21nstitu~e of Mathematics al. (1981) report a LISP-program which searches in a root lexicon and in four segment tables for adjacent parts, which generate a given surface-word form. Koskenniami (1983) describes a relational, symmetric model for analysis, as well as for production of Finnish word forms. He, too, uses a word-root lexicon and suffix lexicons to support comparisons between surface and lexical levels.</Paragraph> <Paragraph position="4"> Our morphological analyzer MORFIN was planned to constitute the first component in our forthcoming Finnish natural-language database query system. We therefore rate highly a computationally efficient method which supports an open lexicon.</Paragraph> <Paragraph position="5"> Lexical entries should carry the minimum of morphological information to allow a casual user to add new entries.</Paragraph> <Paragraph position="6"> We relaxed the requirement of fully informed decisions in favor of progressively generated and tested plausible heuristic hypotheses, dressed in production rules. The analysis of a word in our model represents a multi-level heuristic search.</Paragraph> <Paragraph position="7"> The basic control strategy of MORFIN resembles the one more extensively exploited in the Hearsay-II system (Erman et al.,1980).</Paragraph> </Section> <Section position="4" start_page="0" end_page="49" type="metho"> <SectionTitle> II FINNISH MORPHOTACTICS </SectionTitle> <Paragraph position="0"> Finnish morphotactics is complex by any ordinary standard. Nouns, adjectives and verbs take numerous different forms to express case, number, possession, tense, mood, person and other morpheme categories. The problem of analysis is greatly aggravated by context sensitivity. A word stem may obtain different forms depending on the suffixes attached to it. Some morphemes have stem-dependent segments, and some segments are affected by other segments juxtaposed to it.</Paragraph> <Paragraph position="1"> Due to lack of space, we outline here only the structure of Finnish nominals. The surface form of a Finnish nominal ~ay be composed of the following constituents (parentheses denote</Paragraph> <Paragraph position="3"> The stem endings comprise a large collection of highly context-sensitive segments which link the word roots with the number and case suffixes in phonologically sound ways. The authorative Dictionary of Contemporary Finnish classifies nomi- null nals into 85 distinct paradigms based on the variation in their stem endings in the nominative, genetive, partitive, essive, and illative cases. The plural in a nominal is signaled by an 'i', 'j', 't', or the null string (4) depending on the context. The fourteen cases used in Finnish are expressed by one or more suffix types each.</Paragraph> <Paragraph position="4"> Furthermore, consonant gradation may take place in the roots and stem endlngs with certain manifestations of 'p', 't' or 'k'.</Paragraph> <Paragraph position="5"> As an example, consider the word 'pursi' (=yacht). The dictionary representation 'pu~ si 42' indicates the root 'put', . the stem ending 'si' in the nominative singular case, and the paradigm number 42. Among others, we have the inflections (2) pur + re + d + lla + mne + kin (=also on our yacht) put + s + i + lla + nme + ko (=on our yachts?) Consonant gradation takes place, for instance, in the word 'tak~ i 4' (=coat) as follows: (3) tak + i + ~ + ssa + ni (=in my coat) tak--k + e + i + hi + ni (=into my coats)</Paragraph> </Section> <Section position="5" start_page="49" end_page="49" type="metho"> <SectionTitle> III DESCRIPTION OF THE HEURISTIC METHOD A. Control Structure </SectionTitle> <Paragraph position="0"> Our heuristic method uses the hypothesis-andtest paradigm used in many AI systems. A global database is divided into four distinct levels.</Paragraph> <Paragraph position="1"> Productions, which carry local heuristic knowledge, generate or confirm hypotheses between two levels as shown in the figure.</Paragraph> <Paragraph position="2"> input surface-word form level logical surface-segment configurations in a word, and slice and interprete the word accordingly. We use directly the allomorphic variants of the morphemes. Since possible segment configurations overlap, several mutually exclusive hypotheses are usually produced on the morphotactic level. All valid interpretations of a homographic word form are among them.</Paragraph> <Paragraph position="3"> The extracted rules were packed and compiled into a network of 33 distinct state-transition automata (3 for clitic, I for person, 6 for tense, 3 for case, 2 for number, 5 for adjective comparation, 3 for passive, 5 for participle, and 5 for infinitive segments). These automata were generated by 204 morpheme productions of the form: (4) name: (2nd_context)(Ist context)segment --> POSTULATE-~int er pr etat i on, next ) 'Segment' exhibits an allomorph; the optional 'Ist' and '2nd contexts' indicate 0 to 2 leftcontextual letters. The operation POSTULATE separates a recognized segment, attaches an interpretation to it, and proceeds to the indicated automata ('next'). For example, the production null (5) LZ~n --> POSTULATE(\[gen,sg,...\], ~TGMI, NUM2, PAR I, PAR4, PAR5, INF3, INF4, COMP4\] ) recognizes the substring 'n', if preceeded by a vowel, as an allomorph for the singular genetive case, separates 'n', and proceeds in parallel to two automatons for number, three for participles, two for infinitive, and one for comparation.</Paragraph> </Section> <Section position="6" start_page="49" end_page="50" type="metho"> <SectionTitle> C. Stem Productions </SectionTitle> <Paragraph position="0"> Stem productions are case- and numberspecific heuristic rules (genus-, mood- and tensespeslflc for verbs) postulating nominative singular nouns as basic forms (Ist infinitive for verbs) which, under the postulated morphotactic interpretation, might have resulted in the observed stem form on the morphotactic level. They may reject a candidate stem-form as an impossible transformation, or produce one or more basic-form hypotheses.</Paragraph> <Paragraph position="1"> The Reverse Dictionary of Finnish lists close to 100 000 Finnish words sorted backwards. For each word the dictionary tags its syntactic category and the paradigm number. From that corpus we extracted heuristic information about equivalence classes of stem behavior. This knowledge we dressed into productions of the following form: (6) condition --> POSTULATE(cut,string,shift) If the condition of a production is satisfied, a basic-form hypothesis is postulated on the basic word-form level by cutting the recognized stem, adding a new string (separated by a blank to indicate the boundary between the root and the stem ending), and possibly shifting the blank. These operations are indicated by the arguments 'cut', 'string', and 'shift'. A well-formed condition (WFC) is defined recursively as follows.</Paragraph> <Paragraph position="2"> Any letter in the Finnish alphabet is a WFC, and such a condition is true if the last letter of a stem matches the letter. If &1 ,&2,-.. ,&n are WFCs, then the following constructions are also WFCs: (I) is true if &1 and &2 are true, in that order, under the stipulation that the recognized letters in a stem are consomed. (II) is true if &1 or &2 or ... or &n is true. The testing in (II) proceeds from left to right and halts if recognition occurs. The recognized letters are cons~ed. A capital letter can be used as a macro name for a WFC. For example, a genetive 'n'-specific production null (8) <Ka,y>hde ~> POSTULATE(3,'ksi',0) ('K' is an abbreviation for <d,f,g,h...> - the consonants) recognizes, among other stems, the genetive stem 'kahde' and generates the basic form hypothesis 'ka ksi' (: two).</Paragraph> <Paragraph position="3"> We collected 12 sets of productions for nominal and 6 for verb stems. On average, a set has about 20 rules. These sets were compiled into 18 efficient state-transition automata.</Paragraph> <Paragraph position="4"> We could also apply productions to consonant gradation. However, since a Finnish word can have at most two stems (weak and strong), MORFIN trades storage for computation and stores double stems in the lexicon.</Paragraph> <Paragraph position="5"> D. Dictionary Look-up The dictionary lock-up procedure confirms or rejects the baslc-word form hypotheses that have proliferated from the previous stages by matching them against the lexicon. Thus in MORFIN the only morphological information a dictionary entry carries is the boundary between the root and the stem ending in the basic-word form and grade. All other morphological knowledge is stored in MORFIN in an active form as rules.</Paragraph> <Paragraph position="6"> In MORFIN, input words are totally analyzed before a reference to the lexicon happens. Consequently, also words not existing in the lexicon are analyzed. This fact and the simple lexical form make it easy to add new words in the lexicon: a user simply chooses the right alternative(s) from postulated baslc-word form hypotheses.</Paragraph> </Section> <Section position="7" start_page="50" end_page="50" type="metho"> <SectionTitle> IV DISCUSSION </SectionTitle> <Paragraph position="0"> MORFIN has been fully implemented in standard PASCAL and is in the final stages of testing. The lexicons contain nearly 2000 most frequent Finnish words. In addition to one lexicon for nominals, and one for verbs, MORFIN has two &quot;front&quot; lexicons for unvarying words, and words with slight variation (pronouns, adverbs etc. and those with exceptional forms).</Paragraph> <Paragraph position="1"> Currently MORFIN does not analyze compound nouns into parts (as Karttunen et al. (1981) and Koskenniemi (1983) do). By modifying our system slightly we could do this by calling the system recursively. We rejected this kind of analysis because the semantics of many compounds must be stored as separate lexical entries in our database interface anyway. MORFIN does not 2roduce word * forms as the other two systems do.</Paragraph> <Paragraph position="2"> With respect to the goals we set, our tests rate MORFIN quite well (J~ppinen et al., 1983).</Paragraph> <Paragraph position="3"> Lexical entries are simple and their addition is easy. On average, only around 4 basic-word form hypotheses are produced on the basic-word form level. The analysis of a word in randomly selected newspaper texts takes about 15 ms of DEC 2060 CPUtime. Karttunen et al. (1981) report on their system that &quot;It can analyze a short unambiguous word in less than 20 ms \[DEC-2060/Interlisp\] ... a long word or a compound ... can take ten times longer.&quot; Koskenniemi (1983) writes that &quot;with a large lexicon it L1~is system\] takes about 0.1CPU seconds EBurroughs B7800/PASCAL\] to analyze a reasonably complicated word form.&quot; Both Karttunen et al. (1981) and Koskenniemi (1983) proceed from left to right and compare an input word with forms generated from lexical entries. It is not clear how such models explain the phenomenon that a native speaker of Finnish spontaneously analyzes also granm~atical but meaningless word forms. Most Finns would probably agree that, for instance, 'vimpuloissa' is a plural inessive form of a meaningless word 'vimpula'. How can a model based on comparison function when there is no lexical entry to be compared with? Our model encounters no problems with new or meaningless words. 'Vimpuloissa', if given as an input, would produce, among others, the hypothesis 'vimpul a' with correct interpretation.</Paragraph> <Paragraph position="4"> It would be rejected only because it is a non-existent Finnish word.</Paragraph> </Section> class="xml-element"></Paper>