<?xml version="1.0" standalone="yes"?> <Paper uid="C88-1069"> <Title>Complexity, Two-Level Morphology and Finnish</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. The Two-Level Model </SectionTitle> <Paragraph position="0"> The two-level model provides a language-independent framework for describing phonological and morphological phenomena associated with word inflection, derivation and compounding. The model can be expressed in terms of finite-state machines, and it is easy to implement. The model has, in fact, two aspects: (1) it is a linguistic formalism for describing phonological phenomena, and (2) it is a computational apparatus which implements descriptions of particular languages as operational systems capable of recognizing and generating word-forms.</Paragraph> <Paragraph position="1"> The model consists of three representations (morphological, lexical and surface forms) and two systems (the lexicon and phonological rules) relating them: morphemes in word-form</Paragraph> <Paragraph position="3"/> </Section> <Section position="3" start_page="0" end_page="335" type="metho"> <SectionTitle> TWO-LEVEL RULES </SectionTitle> <Paragraph position="0"> surface representation of word-form. The surface representation is typically a phonemic representation of the word-form, but sometimes graphic or written forms are used instead. The lexical representation is an underlying (postulated) morphophonemic representation of the word stem and affixes.</Paragraph> <Paragraph position="1"> These two representations need not be identical, and in case there are phonological alternations in the language, these representations are more or less different. 
The task of the two-level rule component is to account for any discrepancies between these representations.</Paragraph> <Paragraph position="2"> The task of the lexicon component is two-fold.</Paragraph> <Paragraph position="3"> First, it specifies what kinds of lexical representations are possible according to the inventory of known words and their possible inflectional forms, plus derivations and compounds according to productive rules. The second task of the lexicon is to associate proper morphemes to lexical representations. The task of the lexicon component is considered to be universal.</Paragraph> <Paragraph position="4"> Many languages can be quite well described with rather simple lexicon structures. The lexicon needed for Finnish is basically a set of sublexicons (for stems, case endings, possessive suffixes, clitic particles, tense of verbs, person, etc.). Each entry specifies all continuation lexicons which are possible after that morpheme. This scheme is equivalent to a (partly nondeterministic) finite-state transition network. Two-level rules compare lexical and surface representations. The partitive plural of the Finnish word lasi 'glass' is laseja. This form might be represented as a stem lasi plus a plural ending I plus a partitive ending A. The correspondence would then be: There are three discrepancies here: the stem-final i is realized as e (and not as i as in singular forms), the plural I is realized as j instead of i, and the partitive A is realized as the back vowel a (and not as the front vowel ä). The first discrepancy is described with a two-level rule: i:e <=> _ I: This states that lexical i is realized as surface e if and only if it is followed by a lexical I (the plural affix). 
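The three rules discussed here can be sketched as executable checks over lexical/surface character pairs. This is a simplified illustration, not the actual KIMMO transducer machinery: the pairing of lasi+I+A with laseja comes from the text, but the helper names, the flat pair representation (no insertions or deletions), and the one-directional treatment of the harmony rule are our own simplifications.

```python
# A minimal sketch of checking a lexical/surface correspondence against the
# three two-level rules from the text. Not the actual finite-state compiler;
# rules are hand-coded predicates over (lexical, surface) character pairs.

VOWELS = set("aeiouyäö")   # surface vowels (V)
BACK = set("aou")          # Vback: back vowels
FRONT = set("äöy")         # front vowels; anything here breaks Vnonfront

def check(pairs):
    """pairs is a list of (lexical, surface) character pairs."""
    for k, (lex, surf) in enumerate(pairs):
        nxt = pairs[k + 1] if k + 1 < len(pairs) else (None, None)
        prv = pairs[k - 1] if k > 0 else (None, None)
        # Rule 1: i:e <=> _ I:  (lexical i surfaces as e iff plural I follows)
        if lex == "i" and (surf == "e") != (nxt[0] == "I"):
            return False
        # Rule 2: I:j <=> :V _ :V  (plural I is j iff between surface vowels)
        if lex == "I" and (surf == "j") != (prv[1] in VOWELS and nxt[1] in VOWELS):
            return False
        # Rule 3 (one direction only): A:a => some back vowel occurs earlier,
        # with no intervening front vowel
        if lex == "A" and surf == "a":
            seen_back = False
            for _, s in pairs[:k]:
                if s in BACK:
                    seen_back = True
                elif s in FRONT:
                    seen_back = False
            if not seen_back:
                return False
    return True

# lasi + I (plural) + A (partitive) -> laseja
pairs = list(zip("lasiIA", "laseja"))
print(check(pairs))   # True: all three rules are satisfied
```

Running the same check on a form like lasija (keeping lexical i as surface i before the plural I) fails Rule 1, which is exactly the "if and only if" force of the <=> operator.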
The plural I itself is a bit different from other i's because it is realized as j if and only if it occurs between two surface vowels (let V denote the set of vowels): I:j <=> :V _ :V The realization of the partitive A is an instance of Finnish vowel harmony, which causes endings to agree in frontness or backness with stem vowels. Thus A has two possible realizations: it must be a back vowel iff there are back vowels in the stem: [A:a | O:o | U:u] => :Vback :Vnonfront* _ The set Vback contains the back vowels a, o, and u, whereas Vnonfront contains anything that does not have one of ä ö y on the surface.</Paragraph> <Paragraph position="5"> Phonological two-level descriptions have been made for about twenty different languages up to now. Only about a third of them can be considered to be comprehensive. Typically a description consists of 7-40 rules (English and Classical Greek being the low and high extremes).</Paragraph> <Paragraph position="6"> A special compiler is used for converting these rules into finite-state transducers (Karttunen, Koskenniemi, and Kaplan, 1987). The resulting machines are similar to the ones that were hand-compiled, e.g. in (Koskenniemi, 1983).</Paragraph> </Section> <Section position="4" start_page="335" end_page="335" type="metho"> <SectionTitle> 2. Barton's Challenge </SectionTitle> <Paragraph position="0"> [Barton86] poses a challenge to find the constraint that makes words of a natural language easy to process: &quot;The Kimmo algorithms contain the seeds of complexity, for local evidence does not always show how to construct a lexical-surface correspondence that will satisfy the constraints expressed in a set of two-level automata. These seeds can be exploited in mathematical reductions to show that two-level automata can describe computationally difficult problems in a very natural way. It follows that the finite-state two-level framework itself cannot guarantee computational efficiency. 
If the words of natural languages are easy to analyze, the efficiency of processing must result from some additional property that natural languages have, beyond those that are captured in the two-level model. Otherwise, computationally difficult problems might turn up in the two-level automata for some natural language, just as they do in the artificially constructed languages here. In fact, the reductions are abstractly modeled on the Kimmo treatment of harmony processes and other long-distance dependencies in natural languages.&quot; [Barton86, p56] We suggest that words of natural languages are easy to analyze because morphological grammars are small. As Barton shows, two-level complexity grows rapidly with the number of harmony processes. But, fortunately, natural languages don't have very many harmony processes.</Paragraph> <Paragraph position="1"> Any single language seems to have at most two harmony processes: * zero (most, i.e. some 88% of languages), * one (Uralic, Tungusic, Sahaptian) or * two (most Altaic languages) Even in principle, a three-dimensional vowel harmony is rather improbable, because it would lead to a total (or almost total) collapse of distinctions between vowels. In most languages there are not enough distinctive features in vowels to make a four-way harmony even possible. We have not found any reliable accounts of more than two harmony-like processes in a single language.</Paragraph> <Paragraph position="2"> Normally, most complexity results describe space/time costs as a function of the size of the input. Claims in support of the two-level model are generally of this form; speed is generally measured in terms of the number of letters processed per second. 
Barton's result is somewhat non-standard; it describes costs as a function of the size of the grammar (or more precisely, the number of harmony processes).</Paragraph> <Paragraph position="3"> Complexity results generally don't discuss the &quot;grammar constant&quot; because any particular grammar has just a fixed (and very small) number of rules (such as harmony processes), and thus it isn't very helpful to know how the algorithm would perform if there were more, because there aren't.</Paragraph> <Paragraph position="4"> If phonological grammars were large and complex, there could be efficiency problems, because processing time does depend on the size and structure of the grammar. However, since phonological grammars tend to be relatively small (when compared with the size of the input), it is fairly safe to adopt the grammar constant assumption.</Paragraph> </Section> <Section position="5" start_page="335" end_page="338" type="metho"> <SectionTitle> 3. Barton's Reduction </SectionTitle> <Paragraph position="0"> Let us consider the satisfaction reduction in [Barton86]. Barton used a grammar like the one below to reduce two-level generation to the satisfaction problem.</Paragraph> <Paragraph position="1"> In this artificial grammar, it is assumed that there are an arbitrary number of harmony processes over the letters a, b, c, d, e, f, ...; each letter must correspond to either T (truth) or F (falsehood), consistently throughout the word.</Paragraph> <Paragraph position="2"> This reduction is a generalization of harmony processes which are common in certain families of natural languages. In these languages, stem (and affix) vowels must agree in one or more of the following distinctive features: o Front/back vowels (palatal, velar harmony), e.g. in Uralic and Turkic languages. (Replaced by consonantal palatalization in Karaite, a Turkic language.) o Rounded/unrounded vowels (labial harmony), e.g. in Turkic languages. o Tongue height, e.g. in 
Tungusic languages, o Nasalization, and o Pharyngealization, e.g. emphatic consonants and vowels in Semitic languages. Some processes are classified as umlaut rather than vowel harmonies, but behave similarly. One, still different but relevant, process has been reported in Takelma (Sapir 1922). There, a suffixal /a/ is replaced with an /i/ if the following suffix contains /i/. This rule derives [ikuminininink] from underlying /ikumanananink/.</Paragraph> <Paragraph position="3"> It may be a mistake to classify all of these processes as vowel harmonies, and if so, it only strengthens the claim that languages don't have very many vowel harmony processes.</Paragraph> <Paragraph position="4"> Empirically, we observe that generation time is linear in the length of the word and exponential in the number of harmony processes. That is, given Barton's Satisfaction grammar, words of the form aaa... are processed in time linear in the number of a's, but words of the form abc... are processed in time exponential in the number of different characters.</Paragraph> <Paragraph position="5"> A two-level model with n harmony processes can be reduced to a satisfaction problem with n variables. Thus, it is not surprising to find that the two-level model takes time exponential in the number of harmony processes. 1. Most harmonies are progressive, i.e. the harmony propagates from left to right. A few exceptions to this are mentioned in the literature: Sahaptian (including Nez Perce), Luorawetlan (including Chukchee), Diola Fogny, and Kalenjin languages. These are said to have so-called dominant and recessive vowels, where an occurrence of a dominant vowel in the stem or even in affixes causes the whole word to contain only dominant variants of vowels. 
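The exponential behavior of Barton's construction can be made concrete with a small sketch. Each distinct letter acts like a SAT variable that must be T or F consistently throughout the word; a naive generator that enumerates assignments does work linear in word length but exponential in the number of distinct letters. The function below is our own illustration of the search space, not Barton's automata.

```python
# Sketch of the search space in Barton's Satisfaction grammar: every
# occurrence of a letter must get the same value (T or F), so each distinct
# letter behaves like a SAT variable. Enumerating assignments is linear in
# word length but exponential (2**n) in the number of distinct letters.

from itertools import product

def generations(word):
    """Enumerate all surface forms consistent with the harmony constraint
    that each letter is uniformly T or F throughout the word."""
    letters = sorted(set(word))
    results = []
    for values in product("TF", repeat=len(letters)):  # 2**n assignments
        assignment = dict(zip(letters, values))
        results.append("".join(assignment[c] for c in word))
    return results

print(len(generations("aaaa")))  # 2: one variable, regardless of length
print(len(generations("abc")))   # 8: 2**3 assignments for three variables
```

This mirrors the empirical observation in the text: aaa... is cheap (one harmony process), while abc... blows up with the number of different characters.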
We have found no references to languages with more than one harmony process combined with a (potentially) regressive, or right-to-left, direction.</Paragraph> <Paragraph position="6"> Left-to-right harmony seems to have a virtually unlimited scope because, in addition to inflectional affixes, derivational suffixes can be recursively attached to the stem.</Paragraph> <Paragraph position="7"> Neither progressive nor regressive harmony-like processes cause any nondeterminism in recognition in the two-level model. Even generation of word-forms with progressive harmonies is always quite deterministic. The only truly nondeterministic behavior with vowel harmonies occurs in generation with regressive harmonies, where there is no way to choose among possible realizations of prefix vowels until the word root is seen.</Paragraph> <Paragraph position="8"> An artificial (and almost maximal) example of the unbounded character of Finnish vowel harmony is the following, where back harmony propagates from the verbal root (havai- 'observe') all the way to the last</Paragraph> </Section> <Section position="6" start_page="338" end_page="338" type="metho"> <SectionTitle> 4. Experience With Finnish </SectionTitle> <Paragraph position="0"> However, if there is only a fixed (and small) number of harmony processes, as there is in any natural language, then processing time is found to be linear in input length. This has been our experience, as verified by the following experiment. We collected a word list and measured recognition time as a function of word length in characters. The word list is a combination of two samples from a Finnish newspaper corpus (seven issues of Helsingin Sanomat, consisting of some 400,000 running words): * all Finnish words with 17 or more letters in the whole corpus, plus * some 700 words of running text from the same corpus.</Paragraph> <Paragraph position="1"> (This construction produces very few words with 16 characters.) 
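The asymmetry between progressive and regressive harmony during generation can be sketched as follows. With left-to-right harmony the stem's backness is already known when a suffix archiphoneme A is reached, so A is realized in a single deterministic pass; with regressive harmony, the value of a prefix A is not determined until the root is seen. The functions, vowel sets, and the toy form havaitA are our own illustrative assumptions, not the paper's (truncated) example.

```python
# Sketch contrasting generation with progressive vs. regressive harmony.
# Progressive: one deterministic left-to-right pass suffices, because the
# harmony value is always known before each archiphoneme A is reached.
# Regressive: the decision for a prefix A depends on material to its right,
# so a single-pass generator must delay (or guess and backtrack).

BACK, FRONT = set("aou"), set("äöy")

def realize_progressive(lexical):
    """Deterministic single pass: harmony state is set before each A."""
    back = None
    out = []
    for c in lexical:
        if c in BACK:
            back = True
        elif c in FRONT:
            back = False
        if c == "A":                      # archiphoneme: a (back) or ä (front)
            out.append("a" if back else "ä")
        else:
            out.append(c)
    return "".join(out)

def realize_regressive(lexical):
    """Harmony spreads right-to-left: the value of a prefix A is only fixed
    once the whole word (the root) has been inspected."""
    back = any(c in BACK for c in lexical)
    return "".join(("a" if back else "ä") if c == "A" else c for c in lexical)

print(realize_progressive("havaitA"))   # 'havaita': back root propagates right
```

In recognition, both directions are deterministic in the two-level model, since the surface form already shows the realized vowels; only regressive generation forces the lookahead modeled here.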
Figure 1 plots recognition time (in steps) as a function of word length. Note that the relationship is well modeled by the linear regression line with a slope of 2.43 steps/letter. The data show no hint of an exponential relationship between processing time and word length.</Paragraph> <Paragraph position="2"> One of the two outliers is &quot;lakiasiaintoimistoa,&quot; an 18-letter word that takes 206 steps (11.4 steps/letter). Part of the trouble can be attributed to ambiguity; this word happens to be two ways ambiguous. In addition, there is a false path &quot;laki+asia+into+imis...&quot; that consumes even more resources. The fit of the regression line can be improved considerably by removing these ambiguous words, as illustrated in figure 2.</Paragraph> </Section> </Paper>