<?xml version="1.0" standalone="yes"?>
<Paper uid="J86-4001">
  <Title>ASSOCIATIVE MODEL OF MORPHOLOGICAL ANALYSIS: AN EMPIRICAL INQUIRY</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 GENERAL VIEW OF THE ASSOCIATIVE MODEL
</SectionTitle>
    <Paragraph position="0"> An associative model for the analysis of word forms of Finnish consists of a triplet &lt;{MRi}, &lt;*, {SRi}&gt;, where {MRi} is a set of associative morphotactic rules, {SRi} is a set of associative stem rules, and &lt;* is a precedence relation in the set of the morphotactic rules. The rules associate phonemic stimulus with morphemic data. They obey the general form: (4) [ml context] &lt;pl context&gt; key &lt;pr context&gt; [mr context] --&gt; [m result] A rule is applicable and fires whenever a phonemic string identical to key appears somewhere in an input form and the contextual conditions of the rule are satisfied. pl and pr stand for phonemic contexts, and ml and mr designate morphemic contexts. A contextual phonemic string calls for substring identity; morphemic contexts require set inclusion. The indices l and r denote left and right sensitivity, respectively. When a rule fires, it contributes one or more morphemes (m result) as a possible partial interpretation of the word form. There is not necessarily a one-to-one correspondence between keys and morphs. Keys may represent entire morphs or morphs that have been truncated by fusion processes. (Computational Linguistics, Volume 12, Number 4, October-December 1986; Harri Jäppinen and Matti Ylilammi, Associative Model of Morphological Analysis: An Empirical Inquiry.)</Paragraph>
    <Paragraph position="1"> The model is implemented in a sequential machine.</Paragraph>
    <Paragraph position="2"> Therefore the rule (4) has two slightly differing applications: (5a) for morphotactic rules and (5b) for stem rules. Morphotactic rules recognize and interpret all morphs except stem alternants. As our algorithm is tuned to right-to-left sequential processing, these rules are invoked first. Only left phonemic and right morphemic contexts make sense in these rules. Stem rules recognize the allomorph relation of a potentially unlimited number of stems. They use alternant stem endings (a ending) as keys. A stem rule produces a hypothetical basic stem (lexeme) in which the recognized alternant ending is replaced by the basic ending (b ending) shown, which is concatenated with the root. Only left phonemic and right morphemic contexts are meaningful in stem rules in the chosen strategy. We discuss the morphotactic part and the stem alternation part of the model in separate sections below.</Paragraph>
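    <Paragraph> The general rule form can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: the class name Rule, the function fires, and the two sample rules are hypothetical, and the sketch assumes that "set inclusion" means the rule's morphemic context must be contained in the morphemes already found to the right. Only the left phonemic and right morphemic contexts are modelled, as in the right-to-left strategy described above.</Paragraph>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    key: str                              # phonemic key, e.g. a suffix
    result: frozenset                     # morphemes contributed when firing
    pl_context: str = ""                  # phonemes required just left of key
    mr_context: frozenset = frozenset()   # morphemes required to the right


def fires(rule, form, end, morphemes_right):
    """Return the start index of the key if the rule fires, else None.

    The key must end at index `end` of `form` (right-to-left scanning).
    """
    start = end - len(rule.key)
    if start >= 0 and form[start:end] == rule.key:
        # Substring identity for the phonemic context left of the key.
        left_ok = form.endswith(rule.pl_context, 0, start)
        # Set inclusion for the morphemic context to the right (assumption).
        right_ok = rule.mr_context.issubset(morphemes_right)
        if left_ok and right_ok:
            return start
    return None


# Hypothetical rules, for illustration only.
GEN = Rule(key="n", result=frozenset({"sg", "gen"}), pl_context="a")
NEEDS_POSS = Rule(key="n", result=frozenset({"gen"}),
                  mr_context=frozenset({"1pp"}))

print(fires(GEN, "kalan", 5, frozenset()))         # key found after "kala"
print(fires(NEEDS_POSS, "kalan", 5, frozenset()))  # right context not met
```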
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 MORPHOTACTIC MODEL
</SectionTitle>
    <Paragraph position="0"> This section lists the morphemes of Finnish and presents a few outstanding problems in their allomorph relation.</Paragraph>
    <Paragraph position="1"> The discussion uses semiformal generative rules similar to those of Matthews (1972). Then an associative solution to the morphotactic problem is displayed.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 THE MORPHEMES
</SectionTitle>
      <Paragraph position="0"> Expression (3) arranges the morpheme classes of Finnish in three precedence orders, one for the nominal forms, one for the verbal forms, and one for the verbal nominal forms. These morpheme classes have grammatical functions shown in (6). (Clitics have such complex functions in Finnish that we do not attempt to mark them by morphemes but use instead their phonetic realizations in the discussion.) (6) COMPAR = {com(parative), sup(erlative)}</Paragraph>
      <Paragraph position="2"/>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4.2 SOME PROBLEMS IN THE ALLOMORPH RELATIONS
</SectionTitle>
    <Paragraph position="0"> Morphemes of Finnish are complex. Some morphemes have no correlates on the phonemic level, some morphemes have more than one allomorph, and some phoneme strings have the simplest explanation, it seems, if empty morphs are postulated. Further complexities are caused by certain fusion processes. We outline these phenomena below for non-Finnish readers.</Paragraph>
    <Paragraph position="1"> The pl morpheme has the suppletive allomorphs 'i', 'j', or 't', and in some rare cases no phonemic string marks plural. The sg morpheme has no allomorphs. These are not the only morphemes without phonemic correlates. In order to faithfully pair off morphs and morphemes in a one-to-one correspondence, it is convenient to postulate a zero alternant in place of a missing morph. For the lexeme kala ('fish'), for instance, we then get among other derivations between the morphemic (ML) and phonemic levels (PL) the few possibilities shown in Figure 1. The standard treatment assigns four allomorphs for the partitive case: 'a', 'ä', 'ta', and 'tä'. This variety decreases if one posits two allomorphs, 'a' and 'ä', for part and allows the existence of an empty morph 't' to be conditioned by stress. The partitive forms of kala ('fish') and pasuuna ('trombone') in Figure 2 illustrate the interplay between pl and part morphemes in this stipulation. Some alternants, when joined with neighboring affixes, exhibit regularities in behavior which can be captured conveniently by archiphonemes on the mediating morphophonemic level (MPL). The allomorphs of comparison are examples of such alternants, and so are some clitic segments. The use of archiphonemes nicely captures consonant gradation in the former and vowel harmony in the latter. The two part allomorphs discussed above can also be generated via a single archiphoneme 'A' on the morphophonemic level. It is realized as an 'a' or an 'ä' on the phonemic level as vowel harmony demands. In Figure 3 lexemes suuri ('big') and jää ('ice') exemplify how the use of archiphonemes reduces a set of generative rules. There are fusion processes that delete information. These phenomena are easily formulated in generative terms but are problematic for analysis. The leftmost consonant in the possessive morphs</Paragraph>
    <Paragraph position="3"> 3pp:'nsa'), be it a nasal or a fricative, overlaps and dominates the preceding consonant. For the lexeme kala ('fish'), for instance, we get the derivations in Figure 4 in the singular and plural nominative and genitive cases when a possessive segment is present or absent, respectively. Notice how the four forms are distinct when a possessive is absent (kala, kalat, kalan, kalojen) and become threefold ambiguous when the possessive segment is attached (kalamme, kalamme, kalamme, kalojemme).</Paragraph>
    <Paragraph position="4"> This is a general phenomenon. A nominal in Finnish always becomes grammatically ambiguous when a possessive suffix is attached to a singular nominative or genitive, or to a plural nominative form.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 MORPHOTACTIC MODEL
</SectionTitle>
      <Paragraph position="0"> An associative Morphotactic Model (MTModel) is a pair &lt;{MRi},&lt;*&gt;, where {MRi} is a set of morphotactic rules (5a) and &lt;* is a precedence relation in the set. &lt;* is an irreflexive, antisymmetric, and nontransitive relation which imposes a coherence constraint on the rules. Each morphotactic rule associates a morphemic interpretation with a phonemic substring. The relation &lt;* orders the rules in such a way that partial interpretations, when a word form is processed from right to left, contribute to valid total interpretations.</Paragraph>
      <Paragraph position="1"> MRi &lt;* MRj iff MRi can &quot;immediately follow&quot; MRj. A rule can immediately follow another if the key of the former can be juxtaposed to the left of the latter on the phonemic level. The keys may not overlap, or be discontinuous, and their morphemic interpretations must obey the ordering (3).</Paragraph>
      <Paragraph position="2"> For coherence, the model also needs boundary rules.</Paragraph>
      <Paragraph position="3"> Let ε denote a zero key for zero morphs, and α and β mark the zero keys for two special empty sets of morphemes. The &quot;rightmost&quot; morphotactic rule MRα and the &quot;leftmost&quot; morphotactic rule MRβ of the coherence constraint are defined below. The two boundary rules have obvious interpretations: MRα signals the right end of a word form and MRβ indicates a stem boundary.</Paragraph>
      <Paragraph position="5"> Brodda and Karlsson (1980) tried to find the most likely morphotactic segmentation for a given Finnish word form drawn from a running text. The algorithm does not use a lexicon, neither does it associate phonemic segments with their morphemic interpretations.</Paragraph>
      <Paragraph position="6"> From that work we were able to extract and enumerate the valid phonemic keys for the morphotactic rules. The keys were then associated with their morphemic correlates and the rules were organized under the precedence relation &lt;*. The set in (8) lists a small subset of the rule set and a fragment of the coherence constraint. For the sake of brevity, only the key is shown in the left hand side of a rule. To compress the rules, archiphonemes, typed in upper case letters, are used in keys whenever possible. Figure 5 illustrates this part of the coherence constraint in graphic form.</Paragraph>
      <Paragraph position="8"> (8) MRα = α → [ ]</Paragraph>
      <Paragraph position="10"> The rule set and the coherence constraint represent the morphotactic part for morphological analysis. A phoneme string is a morphotactically valid form if there is a &quot;path&quot; between the &quot;rightmost&quot; rule, MRα, and the &quot;leftmost&quot; rule, MRβ, in the coherence constraint. The interpretation of the form is the union of the morphemes associated with the rules along the path. For an ambiguous word form more than one path exists between MRα and MRβ.</Paragraph>
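      <Paragraph> The path condition can be illustrated with a small right-to-left search. This is only a sketch under stated assumptions: the three suffix rules, their names, and the FOLLOWS table (a stand-in for the precedence relation) are hypothetical and far smaller than the 178-rule model; zero-key morph rules are left out. Each result pairs a postulated stem alternant with the union of the morphemes collected along one path from the right end to a stem boundary.</Paragraph>

```python
# name -> (key, morphemes); ALPHA and BETA are the two boundary rules.
RULES = {
    "ALPHA": ("", frozenset()),                 # right end of the word form
    "BETA": ("", frozenset()),                  # stem boundary
    "POSS_1PP": ("mme", frozenset({"1pp"})),    # hypothetical sample rules
    "GEN": ("n", frozenset({"sg", "gen"})),
    "NOM_PL": ("t", frozenset({"pl", "nom"})),
}

# Hypothetical fragment of the precedence relation: for each rule, the
# rules whose keys may stand immediately to its left.
FOLLOWS = {
    "ALPHA": ["POSS_1PP", "GEN", "NOM_PL", "BETA"],
    "POSS_1PP": ["BETA"],
    "GEN": ["BETA"],
    "NOM_PL": ["BETA"],
}


def interpretations(form):
    """Return (stem alternant, morphemes) for every path ALPHA .. BETA."""
    results = []

    def walk(rule_name, end, morphemes):
        for nxt in FOLLOWS.get(rule_name, []):
            key, result = RULES[nxt]
            if nxt == "BETA":
                if end > 0:                      # a non-empty stem remains
                    results.append((form[:end], morphemes))
                continue
            if form.endswith(key, 0, end):       # key adjacent, no overlap
                walk(nxt, end - len(key), morphemes | result)

    walk("ALPHA", len(form), frozenset())
    return results


for stem, morphemes in interpretations("kalan"):
    print(stem, sorted(morphemes))
```

    <Paragraph> In this sketch the unsegmented form itself is also returned as a stem hypothesis; deciding whether any postulated stem corresponds to a real lexeme is left to the stem rules and the lexicon.</Paragraph>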
      <Paragraph position="11"> The fragmentary rule set and the constraint in (8) give, for instance, the following morphotactic interpretations for the ambiguous form kalamme shown in Figure</Paragraph>
      <Paragraph position="13"> The first three are valid interpretations. The verbal interpretation, although morphotactically valid, does not result in an existing verb stem. That interpretation will be rejected by the Stem Alternation Model discussed below.</Paragraph>
      <Paragraph position="14"> That the verbal interpretation is indeed morphotactically plausible can be seen, for instance, with the form palamme, analyzed as pala+[act, ind, pr, 1pp], which is a valid interpretation for the verb lexeme palaa ('burn').</Paragraph>
      <Paragraph position="15"> MTModel for Finnish consists of 178 rules. It is not yet an algorithm. It does not state how analysis is being done, that is, how control is to proceed in an analysis.</Paragraph>
      <Paragraph position="16"> These are matters for the algorithm discussed in a later section. The previous discussion has committed the model to right-to-left processing, but reverse processing or some more advanced control schemes might be used as well.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 STEM ALTERNATION MODEL
</SectionTitle>
    <Paragraph position="0"> For any given word form, MTModel resolves sets of morphemes that make up coherent wholes. MTModel also indicates stem alternant boundaries (MRβ) but leaves the alternants intact. The Stem Alternation Model (SAModel) discussed in this section finds for each postulated stem alternant its basic form(s), or rejects it. We first discuss the stem alternants in Finnish as they are customarily described in the Word and Paradigm Model.</Paragraph>
    <Paragraph position="1"> We then describe associative rules for the analysis of stem alternants.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 THE STEM ALTERNATION PROBLEM
</SectionTitle>
      <Paragraph position="0"> The Standard Dictionary of Modern Finnish (Nykysuomen sanakirja, 1966) describes the behavior of Finnish word forms in terms of the Word and Paradigm Model. It classifies nominals into 82 and verbs into 45 equivalence classes - paradigms - based on variations in their stem alternants. For each paradigm the classification gives a theme word, to represent the class, and its stem alternants. Thus, for instance, the nominal paradigms 10 and 41, and the verb paradigm 25 are listed as in Figure 6.</Paragraph>
      <Paragraph position="1"> The theme words are KALa ('fish'), TOSi ('true'), and TULla ('come'), respectively. (We have slightly edited the entries for our purpose.) Upper case letters in Figure 6 indicate the roots and the stem-forming affixes; lower case letters are reserved strictly for the alternant stem endings.</Paragraph>
      <Paragraph position="2"> The information conveyed by the paradigm tables can be compressed into the two matrices below, which show just the distributions of the stem endings. The rows of the matrices represent the paradigms and the columns morphemic contexts (not given here). Whenever allomorphs generate different stem endings, the endings are enclosed in parentheses. The vertical bars separate singular nominal stems from plural stems and active verbal stems from passive stems. The first column in both matrices represents the ending of the basic form, the lexeme. εv denotes a null ending in a vowel stem, ε a null ending in general. Upper case letters mark here archiphonemes.</Paragraph>
      <Paragraph position="3"> (10) Nominal stem endings:
        01: εv, εv, εv, εv, εv | εv, εv, εv
        02: εv, εv, εv, εv, εv | εv, εv, εv
        03: εv, εv, εv, εv, εv | εv, εv, εv
        04: i, i, i, i, i | i, e, e
        05: i, i, i, i, i | (i,e), e, e
        10: A, A, A, A, A | (O,A), O, O
        41: si, de, t, te, te | s, s, s
        Verbal stem endings:
        01: εv, εv, εv, εv, εv, εv, εv, εv | εv, εv, εv
        02: A, A, e, A, A, A, A, A | e, e, e
        03: tA, dA, εv, tA, tA, dA, tA, tA | de, de, de
        25: la, e, e, e, e, e, e, e | e, e, e
        Each interpretation postulated by MTModel unambiguously chooses a column. The problem of stem alternation follows from the fact that the row of the stem is not known. Should SAModel know, say, that the postulated singular genitive stem käde in the form käden represents paradigm 41, a simple substring replacement operation would produce the correct lexeme KÄsi right away (the singular genitive case occupies the second column in the nominal matrix above).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 STEM ALTERNATION MODEL
</SectionTitle>
      <Paragraph position="0"> Our associative SAModel consists of a set of stem rules {SRi}, each of the form (5b) and retyped below: (11) &lt;pl context&gt; a ending [mr context] --&gt; [conc(ROOT, b ending)] 'a ending' is an alternant stem ending. When a rule fires, its alternant ending is replaced with the basic stem ending ('b ending'). The operator 'conc' concatenates the new ending with the root, producing a hypothetical basic word form. The consonant gradation process in roots is not analyzed in SAModel. Weak and strong stems are dealt with as separate lexemes.</Paragraph>
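      <Paragraph> A stem rule of form (11) can be sketched as below. The sketch is illustrative and not the authors' code: the class StemRule, the function basic_forms, and the tiny lexicon are hypothetical. The first sample rule encodes the paradigm 41 entries 'si' (basic form) and 'de' (singular genitive) from matrix (10), applied to the theme word TOSi; the second stands for a null-ending paradigm and simply proposes the stem itself.</Paragraph>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StemRule:
    a_ending: str          # alternant stem ending (the key)
    b_ending: str          # basic stem ending
    mr_context: frozenset  # morphemes that must be present to the right
    pl_context: str = ""   # phonemes required at the end of the root


def conc(root, b_ending):
    """Concatenate the basic ending with the root."""
    return root + b_ending


def basic_forms(stem, morphemes, rules):
    """Return every hypothetical basic form the rules produce for a stem."""
    forms = []
    for rule in rules:
        if not stem.endswith(rule.a_ending):
            continue
        root = stem[: len(stem) - len(rule.a_ending)]
        if root.endswith(rule.pl_context) and rule.mr_context.issubset(morphemes):
            forms.append(conc(root, rule.b_ending))
    return forms


SG_GEN = frozenset({"sg", "gen"})
RULES = [
    StemRule("de", "si", SG_GEN),   # paradigm 41, second column -> first
    StemRule("", "", SG_GEN),       # null ending: stem is already basic
]
LEXICON = {"tosi"}                  # hypothetical one-word lexicon

candidates = basic_forms("tode", SG_GEN, RULES)
print(candidates)
print([form for form in candidates if form in LEXICON])
```

      <Paragraph> Without phonemic contexts both rules fire for the stem tode, which shows on a small scale why the lexicon check, and the contextual restrictions discussed next, are needed.</Paragraph>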
      <Paragraph position="1"> The paradigm tables (10) yield data for morphemic contexts ('mr context') and alternant and basic endings. Alternant endings are necessary but not sufficient phonemic data for rules. Stem rules without phonemic contexts are too productive.</Paragraph>
      <Paragraph position="2"> Luckily, for phonotactic reasons the orthographic distribution of roots (unvarying parts of stems) is uneven in various paradigms. A manageable number of short phoneme strings suffices to represent all roots of whole paradigms. The Reverse Dictionary of Finnish (Tuomi 1980) lists practically all Finnish basic word forms (in reverse order), including some archaic ones and some of foreign origin. Each lexeme is tagged with its paradigm number and syntactic category. That dictionary was a valuable source for the contextual phoneme strings for the stem rules.</Paragraph>
      <Paragraph position="3"> For stem rules a well-formed phonemic context (WFPC) and its truth value are defined recursively as follows. Any lower case letter in the Finnish alphabet is a WFPC and the context is true if the last letter of a root is identical to that letter. If &amp;1, &amp;2, ..., &amp;n are WFPCs, then the following constructions are also WFPCs: (12) (i) &amp;n...&amp;2&amp;1 (ii) &lt;&amp;1,&amp;2,...,&amp;n&gt; (i) is true if &amp;1 and &amp;2 and ... and &amp;n are true, in that order. Testing continues from the point in a stem where the previous test left off. (ii) is true if &amp;1 or &amp;2 or ... or &amp;n is true. The testing of the &amp;i's halts if a recognition occurs. Each &amp;i starts its test afresh.</Paragraph>
      <Paragraph position="4"> To enhance compact notation we stipulate that a single capital letter may represent a WFPC. Archiphonemes are conveniently expressed as A for &lt;a,ä&gt;, O for &lt;o,ö&gt;, and U for &lt;u,y&gt;; the sets of consonants and vowels appear compactly as K for &lt;d,f,g,h,j,k,l,m,n,p,r,s,t,v&gt; and V for &lt;a,e,i,o,u,y,ä,ö&gt;. But a WFPC of any complexity can be denoted by a single upper case letter.</Paragraph>
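      <Paragraph> The recursive definition of a WFPC lends itself to a short evaluator. The following is a sketch under assumptions: the tuple encoding ("seq" for construction (i), "alt" for construction (ii)) and the function names are hypothetical. Sequence members are listed in testing order, that is, the rightmost letter first, and an alternative stops at the first member that is recognized, as the text describes.</Paragraph>

```python
def match(ctx, root, pos):
    """Test a context leftwards from `pos`; return the new position or None."""
    if isinstance(ctx, str):                 # a single lower case letter
        if pos > 0 and root[pos - 1] == ctx:
            return pos - 1
        return None
    kind, parts = ctx
    if kind == "seq":                        # construction (i): all, in order
        for part in parts:
            pos = match(part, root, pos)     # continue where the last test ended
            if pos is None:
                return None
        return pos
    if kind == "alt":                        # construction (ii): any one
        for part in parts:
            new_pos = match(part, root, pos) # each member starts afresh
            if new_pos is not None:
                return new_pos               # halt at the first recognition
        return None
    raise ValueError(kind)


def alt(letters):
    """Build an alternative from a string of single letters."""
    return ("alt", list(letters))


V = alt("aeiouy\u00e4\u00f6")    # vowels, including a-umlaut and o-umlaut
K = alt("dfghjklmnprstv")        # consonants
KV = ("seq", [V, K])             # written KV in (i): vowel last, consonant before


def holds(ctx, root):
    """True if the context is satisfied at the end of the root."""
    return match(ctx, root, len(root)) is not None


print(holds(KV, "kala"))   # root ends in consonant + vowel
print(holds(KV, "maa"))    # root ends in vowel + vowel
```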
      <Paragraph position="5"> The phonemic contexts vary in complexity in the rules in SAModel. Most of them have a fairly simple structure. Two paradigms are, however, without any phonemic contextual regularity. One is the nominal paradigm 08. The stem of the theme word LOVi ('notch') ends with an i in the basic form and with an e in singular genitive case love+n ('of a notch'). This paradigm represents an old form and the set of its lexemes is closed. All new nominals that end with an i in the basic form retain the i in the genitive case and in other singular cases. For example, the theme word for the paradigm 04 is RISTi ('cross') and its genitive form is risti+n ('of a cross'). The criterion for choosing between paradigms 04 and 08 is not phonotactic; it is diachronic. Therefore, no phonemic context short of a minilexicon would help us to resolve, say, that suurin (a valid superlative form for suuri ('big')) is not SUURi+[sg, gen], as muurin (for muuri ('wall')) is MUURi+[sg, gen]. We solved the problem by using two kinds of i's as the last letter of a lexeme.</Paragraph>
      <Paragraph position="6"> SAModel consists of 280 rules. Added context sensitivity greatly increased the quality of stem rules. A stem alternant produces only slightly more than one basic form on average. The stem rules augment the coherence constraint of MTModel with an obvious component: a morphotactically coherent word form passes the coherence test of SAModel only if at least one of the basic forms generated by the stem rules is a valid lexeme.</Paragraph>
      <Paragraph position="7"> To illustrate the interplay of MTModel and SAModel, käsissämmekö will be analyzed as KÄsi+[pl, in, 1pp, 'kO'] in the way shown in Figure 7. The figure exhibits schematically only the stem rule responsible for the correct lexeme. The morphotactic rules in Figure 7 are from (8). The form gets other morphotactically coherent segmentations as well, but they are rejected by the stem rules and the lexicon.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 ALGORITHM
</SectionTitle>
    <Paragraph position="0"> One can think of various alternative algorithms to realize the model. A multiprocessor environment might make the blackboard strategy used in HearsayII (Erman et al.</Paragraph>
    <Paragraph position="1"> 1980) an attractive alternative. Our choice was a monoprocessor environment and right-to-left strategy: first all morphotactically coherent stem alternants are postulated, then stem rules and dictionary check are invoked in that order for each alternant. The algorithmic issues are briefly discussed in this section.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 MORPHOTACTICS
</SectionTitle>
      <Paragraph position="0"> First we decided to implement MTModel as a structured collection of interconnected &quot;islands&quot;. Each island comprises the possible and mutually exclusive morphotactic rules at any given point of processing; the rules represent valid paths through the island. The coherence constraint provides &quot;bridges&quot; between the islands, a bridge indicating a valid continuation after a walk through an island. Computationally the islands were finite state transition automata.</Paragraph>
      <Paragraph position="1"> There were 32 distinct automata: 3 for clitics, 1 for person, 5 for tense, 3 for case, 2 for number, 3 for passive, 5 for participle, 5 for comparison, and 5 for infinitive rules. To assist automatic compilation from the rules to the automata, the morphotactic rules were slightly modified to read as:</Paragraph>
      <Paragraph position="3"> expressions of two optional sets: for phonemes next to the left and second to the left. The term automaton names the island the rule belongs to; next automata identifies valid continuations after this path. mr context in (5a) is represented implicitly in (13) as the path leading to this rule.</Paragraph>
      <Paragraph position="4"> The example rule belongs to the person automaton.</Paragraph>
      <Paragraph position="5"> The rule recognizes the 1ps suffix n for an active indicative verb if an n is found such that it has an ordinary or stressed vowel first to the left and any phoneme except a long n, o, or ö second to the left. Control proceeds to the automaton Tense1 to identify modal and temporal morphemes. In general more than one continuation automaton is possible.</Paragraph>
      <Paragraph position="6"> The island approach worked quite well. However, it was redundant because identical transition paths existed for different automata. To save memory, we implemented another version of MTModel, this one as an orthographic tree of the keys (and rules). The islands were layered, so to speak, on top of each other. A pass through the constraint in (8), or a walk through an island, corresponds now to a traversal through the tree. Coherence is satisfied if, for each transition along a path from MRα to MRβ, a successful walk through the orthographic tree can take place. Automatic compilation again transforms the rules of the form (13) into the orthographic tree.</Paragraph>
      <Paragraph position="7"> The orthographic tree occupied only about one-tenth of the memory needed for the island approach. Using a novel key-and-lock construct we were also able to speed up the analysis. With each node in the tree a &quot;lock&quot; was associated as a union of the automata names ('next automata' in (13)) in its subtree. Each traversal through the tree provides a &quot;key&quot; as a set of possible continuations. During the next traversal the key is checked in the lock of each node along the path and only a match (nonempty intersection) permits continuation. This method aborts fruitless attempts through the tree early on.</Paragraph>
      <Paragraph position="8"> Morphotactic analysis in the orthographic tree with this lock-and-key approach takes about 40% of the time the original island approach took.</Paragraph>
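      <Paragraph> The lock-and-key idea can be sketched with a small tree of suffix keys stored right to left. This is a simplified illustration, not the authors' data structure: the class Node, the functions insert and traverse, and the sample rules are hypothetical. As a further assumption, each lock here holds the labels of the automata whose rules lie at or below the node, and the key is the set of labels allowed as the next continuation; the exact contents of the locks in the original system may differ.</Paragraph>

```python
class Node:
    def __init__(self):
        self.children = {}   # letter -> Node
        self.lock = set()    # labels of all rules at or below this node
        self.rules = []      # (label, morphemes, next_labels) ending here


def insert(root, key, label, morphemes, next_labels):
    """Store a rule under its key, spelled from right to left."""
    node = root
    node.lock.add(label)
    for letter in reversed(key):
        node = node.children.setdefault(letter, Node())
        node.lock.add(label)
    node.rules.append((label, morphemes, next_labels))


def traverse(root, form, end, key_set):
    """Yield (new_end, morphemes, next_labels) for rules matching before `end`."""
    node, pos = root, end
    while pos > 0:
        node = node.children.get(form[pos - 1])
        # No such branch, or the lock shares nothing with the key: give up.
        if node is None or node.lock.isdisjoint(key_set):
            return
        pos -= 1
        for label, morphemes, next_labels in node.rules:
            if label in key_set:
                yield pos, morphemes, next_labels


tree = Node()
# Hypothetical sample rules; labels follow the automata names in the text.
insert(tree, "mme", "person", frozenset({"1pp"}), frozenset({"tense"}))
insert(tree, "n", "person", frozenset({"1ps"}), frozenset({"tense"}))
insert(tree, "n", "case", frozenset({"gen"}), frozenset({"number"}))

print(list(traverse(tree, "palamme", 7, {"person"})))
print(list(traverse(tree, "palamme", 7, {"case"})))   # aborted at first node
```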
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 STEMS AND LEXICON
</SectionTitle>
      <Paragraph position="0"> The control of the stem rules was first realized as an orthographic tree of &amp;quot;prolonged stem endings&amp;quot;. A prolonged ending concatenates an alternant ending with its contextual strings. The 280 stem rules yield 420 distinct extended stem endings. Exit points were marked in the tree and morpheme contexts were attached to these nodes as exit conditions. Basic stem endings were also associated with the exit points. A stem alternant traversed the tree and produced basic forms along the path whenever the exit condition was satisfied in exit nodes.</Paragraph>
      <Paragraph position="1"> The stem alternant tree, however, wasted so much memory that we also implemented a hash-coded version of the extended endings. This version saves considerable memory without a noticeable increase in the analysis time.</Paragraph>
      <Paragraph position="2"> A word form is valid only if it has at least one coherent morphotactic interpretation and if at least one of the lexemes produced by the stem rules appears in the lexicon. Dictionary organization and its search procedure constitute therefore an integral part of the algorithm. In our implementation the dictionary is composed of three distinct parts. The main dictionary is preceded by a hash-coded lexicon that contains the function words.</Paragraph>
      <Paragraph position="3"> The main dictionary consists of an open set of adjectives, nouns, and verbs (and also numerals). It is implemented as a backward-sorted orthographic tree. The unconventional ordering allows for iterative analysis of compound word forms. Lexemes whose roots participate in consonant-gradation process have two separate lexical entries: weak and strong.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 IMPLEMENTATION AND DEMONSTRATION
</SectionTitle>
    <Paragraph position="0"> The ultimate test of the model and the algorithm lies in its performance. We felt that the primary justification of our model is its capability of meeting certain functional requirements:  The model separates linguistic data from algorithms, as the discussion above has indicated. Due to the rule structure the model has proved to be easy to augment, and now it covers the entire Finnish inflectional morphology. This satisfies the second requirement. In this section we discuss efficiency, open lexicon, and other issues of implementation and demonstrate the implementation.</Paragraph>
    <Paragraph position="1"> For the reasons of efficiency and portability we implemented the algorithm in PASCAL. Separate compiler procedures transform the associative rules into their internal representations, as discussed in the previous section. The orthographic morphotactic tree takes about 4kW and the hash coded extended stem endings 5kW of DEC2060 memory. The procedures that utilize these data structures take about 20kW. The two hash-coded front lexicons reside also in the main memory. They cover already the majority of function words in Finnish. Their data structures and code together occupy 21kW of DEC20 memory. There is also a version on VAX11 and one on IBM PC/XT. In the latter, MORFO, as we call the system, takes up 305kB of memory. That figure includes MS-DOS.</Paragraph>
    <Paragraph position="2"> The main lexicon resides on disc. As of this writing the main lexicon contains over 30,000 of the most frequently used Finnish verbal and nominal lexemes taken from Saukkonen et al. (1979) and from running ordinary texts.</Paragraph>
    <Paragraph position="3"> Figure 8 shows a few sample analyses with the trace mode of the system switched on. Alusta is a highly ambiguous word form in Finnish. MTModel (JAOTIN) finds six coherent morphotactic interpretations for it.</Paragraph>
    <Paragraph position="4"> SAModel (MUOKKAIN) extracts two different basic word forms for the first interpretation, one for the second and the third, three for the fourth, and none for the fifth and the sixth. ('VA', 'HA', and 'NE' stand for strong (or neutral), weak (or neutral) and neutral grade, respectively. The numbers within angle brackets are identifiers of the stem rules.) The weak stem alu is accompanied by its strong partner alku. Of the seven postulated lexemes, five actually occur, found in the main lexicon (SANAKIRJAT). The presence of the affix n (gen, or 1pp) greatly reduces ambiguity as Figure 8 further shows. The morph n is either gen or 1pp person. This information disqualifies the cases el and part.</Paragraph>
    <Paragraph position="5"> We have tested the system rather extensively. In addition to randomly picked word forms we typed in, a typist entered news reports and columns picked from various Finnish newspapers. The test texts also included, of course, function words and compound word forms. Over 300,000 forms have been thus introduced. The analysis of a word form takes about 20ms of DEC2060, 35ms of VAX11/780, and 50ms of VAX11/750 CPU-time on the average. Throughput on an IBM PC/XT is about 95 word forms per minute. These figures satisfy our functional requirement for an efficient analysis method.</Paragraph>
    <Paragraph position="6"> As an example trace of the system at work, the first word forms of Genesis in the Finnish Bible are analyzed by MORFO in the way shown in Figure 9. (Our lexicons carry English equivalents for each lexeme.)</Paragraph>
  </Section>
  <Section position="10" start_page="0" end_page="268" type="metho">
    <SectionTitle>
Figure 8 (residue). Lexemes found for alusta: 1: ALUSTA, 2: ALUSTAA, 3: ALKU, 4: ALUNEN, 5: ALUS. Next input: ALUSTAN
</SectionTitle>
    <Paragraph position="0"> A randomly picked verbal lexeme, say katua ('repent'), to continue in the Biblical domain, has some of its various forms analyzed in Figure 10. Notice, by the way, how the verbal forms katua and kadun are homonymic with partitive and genitive forms of katu ('street').</Paragraph>
    <Paragraph position="1"> (In Figure 10 'imp' stands for past, 'imper' for imperative; 's' for singular in verbs, 'sg' singular in nominals; 'p' for plural in verbs, 'pl' plural in nominals.) The analysis of compound word forms is automatically invoked if none of the basic forms postulated by the stem rules is found in the main lexicon. If this analysis also fails, control proceeds to the lexical acquisition mode. Good Friday is the compound pitkäperjantai in Finnish. (Its literal translation in English is Long Friday.) That compound belongs to a subclass of complex lexical items whose modifying part gets inflected in various cases in agreement with the head. Incidentally, this phenomenon holds also for numerals. Figure 11 shows example analyses of some forms of pitkäperjantai.</Paragraph>
    <Paragraph position="2"> Figure 12. An example of lexical acquisition.</Paragraph>
    <Paragraph position="3"> We may now state in more precise terms in what way our model is capable of supporting open lexicons. Maybe inadvertently, we had not inserted paholainen ('devil') in the lexicon. If we input one of its forms, say paholaisen ('devil's'), the failure of the analysis prompts the user to choose one of the postulated basic forms as shown in Figure 12. When the user has chosen the only valid option (3), has supplied syntactic category ('S' for substantive), and provided its English equivalent, the lexeme enters the lexicon, as the subsequent test proves in Figure 12. In this convenient manner, we have built up our lexicon to hold about 30,000 entries. We continuously augment the lexicon from running texts.</Paragraph>
  </Section>
</Paper>