File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/94/c94-1035_metho.xml
Size: 12,104 bytes
Last Modified: 2025-10-06 14:13:37
<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1035"> <Title>SYLLABLE-BASED MODEL FOR THE KOREAN MORPHOLOGY</Title> <Section position="4" start_page="0" end_page="221" type="metho"> <SectionTitle> 2. The problem </SectionTitle> <Paragraph position="0"> The two-level model is widely known to be a computationally efficient method for practical systems, provided that the number of rules is small [Bart86, Kosk88].</Paragraph> <Paragraph position="1"> However, when the rulebase is large, it causes an exponential problem. In Korean, a stem is commonly followed by grammatical morphemes. If we use the two-level model for a practical system, a small set of phonological rules and a large set of morpheme isolation rules are required, because there are several thousand combinations of grammatical morphemes [Zhan90].</Paragraph> <Paragraph position="2"> To solve this problem, we can try a 2-pass algorithm: all the possible morphemes are isolated first, and phonological processing is then applied. It is also possible to do the phonological processing first and isolate the morphemes in the second pass.</Paragraph> <Paragraph position="3"> However, this kind of solution causes another serious problem that arises from the conditional restrictions: (1) some morphological transformation occurs not only at a stem but also at a functional morpheme, (2) there are cooccurrence restrictions between two morphemes, (3) morphological transformation occurs only for a special word group.</Paragraph> <Paragraph position="4"> 3. Syllable-based writing system The writing system for most languages is based on a letter set called an alphabet. Instead of a letter set, the Chinese writing system is based on a set of characters, each consisting of one or more letters. Each character is a meaning unit, and words are represented by combinations of characters. 
In Korean, words are represented by one or more characters, as in Chinese. The difference is that a Korean character is a well-formed written syllable, which is a sound unit rather than a meaning unit as in Chinese. A written syllable is a combination of two or three sound symbols and corresponds to a spoken syllable in a one-to-one fashion [Chun90]. Korean words are constructed from the syllable unit as follows.</Paragraph> <Paragraph position="5"> word ::= { syllable }+
syllable ::= open_syll | closed_syll
open_syll ::= initial + medial
closed_syll ::= initial + medial + final</Paragraph> </Section> <Section position="5" start_page="221" end_page="221" type="metho"> <SectionTitle> 4. Idiosyncratic features of syllable </SectionTitle> <Paragraph position="0"> There are 11,172 syllables in the modern Korean language (= 19 initials * 21 medials * 28 finals, that is, 27 final consonants plus one for a null final). However, it is interesting to investigate how syllables are actually used to make words. About 2,350 syllables cover more than 99.9% of the modern Korean words. Furthermore, 267 syllables (11.36% of the 2,350 syllables) are used only for the surface form of verbs, and grammatical morphemes are combinations of 151 syllables (6.43% of the 2,350 syllables). In addition, only a very small set of syllables, 1 to 46 syllables for each type of irregular verb, is tied to the morphological transformation [Kang93]. This kind of information is very useful for improving the efficiency of the morphological analyzer. For example, if a syllable used only for the surface form of a verb is found in a word, we can easily guess that the word is a verb, that the string before that syllable is a stem, and that the rest is a grammatical morpheme. Barring typographic errors, no other analysis is possible.</Paragraph> <Paragraph position="1"> Suppose that X is a set of syllables that are used at the first position of grammatical morphemes. 
We can easily guess that the boundary position of a grammatical morpheme in an n-syllable word is at a syllable $x_j$, where $x_j \in X$ and $1 \le j \le n$. There is no such possibility at other positions. This is based on the fact that only 48 syllables are used for the first position of postpositions and 72 syllables for the first position of final endings in the Korean language.</Paragraph> <Paragraph position="2"> Three kinds of syllable features are defined, according to where the features are extracted from. A 'unit feature' is a syllable feature defined on the syllable itself: if a syllable $x_i$ itself has an idiosyncratic feature, then $x_i$ has that feature as a unit feature. A 'partial feature' is defined by a component of a syllable: a syllable $x_i$ is said to have a partial feature $p$ if $x_i$ includes the component $p$ as an initial, a medial, or a final letter. A 'successive feature' is a meta-level feature defined for two adjacent syllable features. For example, if there is a set of two successive syllables $x_i x_{i+1}$ that construct grammatical morphemes and that cannot construct any noun/verb, then the boundary position of a grammatical morpheme is possible only at syllable $x_i$ or $x_{i+1}$.</Paragraph> </Section> <Section position="6" start_page="221" end_page="222" type="metho"> <SectionTitle> 5. Characteristic function </SectionTitle> <Paragraph position="0"> Idiosyncratic features of syllables are represented using a characteristic set of syllables. Suppose that a part of speech ($i$), a morpheme length ($j$), and the position of a syllable in a word ($k$) are the discriminating features of a characteristic set. Let $P_i$ be the set of syllables that are used for part of speech $i$, $Q_j$ be the set of syllables that are used for morpheme length $j$, and $R_k$ be the set of syllables that are used for the $k$-th syllable position in the word. 
Then, a characteristic set of syllables $A_{\langle i,j,k \rangle}$ is the intersection of $P_i$, $Q_j$, and $R_k$, that is, $A_{\langle i,j,k \rangle} = P_i \cap Q_j \cap R_k$.</Paragraph> <Paragraph position="2"> For the characteristic set of syllables $A_{\langle i,j,k \rangle}$, the characteristic function $C_{A_{\langle i,j,k \rangle}}$ maps a syllable to $\{0,1\}$.</Paragraph> <Paragraph position="3"> [Definition] characteristic function Let $X$ be a set of Korean syllables and $A_{\langle i,j,k \rangle}$ be a characteristic set of syllables where $A_{\langle i,j,k \rangle} \subseteq X$ for part of speech $i$, morpheme length $j$, and the $k$-th position of the morpheme. Define the function $C_{A_{\langle i,j,k \rangle}} : X \to \{0,1\}$ by $C_{A_{\langle i,j,k \rangle}}(x) = 1$ if $x \in A_{\langle i,j,k \rangle}$ and $C_{A_{\langle i,j,k \rangle}}(x) = 0$ otherwise.</Paragraph> <Paragraph position="5"> Many characteristic functions are possible, depending on the arguments $i$, $j$, and $k$.</Paragraph> <Paragraph position="6"> However, only some of them are chosen for morpheme isolation or morphological transformation, and they are reorganized as a syllable information function ($f$) in order to find out the characteristics of a specific syllable. The value of $f(x)$ on a syllable $x$ is defined by the characteristic functions $C_{A_{\langle i,j,k \rangle}}(x)$. Suppose that $\alpha$ is the number of parts of speech and $\beta$ is the maximum number of syllables in a word; then a triple $A_{\langle i,j,k \rangle}$ can be transformed into $A_t$ by the following expression.</Paragraph> <Paragraph position="8"> Let $g$ be a function from a set of syllables to a Cartesian product of characteristic functions, and $h$ be a function from a Cartesian product of characteristic functions to an integer.</Paragraph> <Paragraph position="9"> Then, functions $g$ and $h$ are defined as follows.</Paragraph> <Paragraph position="11"> Now, the syllable information function $f$ is defined as the composition of $h$ and $g$. The domain of the function $f$ is a set of syllables, and the range is a bit string stored as an integer, where bit position $t$ is used for a specific feature and the value of the $t$-th bit indicates whether the syllable has the corresponding feature or not.</Paragraph> <Paragraph position="13"/> </Section> <Section position="7" start_page="222" end_page="224" type="metho"> <SectionTitle> 6. 
Syllable-based formalism </SectionTitle> <Paragraph position="0"> A morphological analysis system is formalized as a function F. The domain of function F is a set of words, and the range of F is a Cartesian product of a set of morphemes and their morpho-syntactic features.</Paragraph> <Paragraph position="2"> F: a set of morpho-syntactic features. Suppose that $m_i$ is a root form of a lexical morpheme, $f_j$ is a combination of features, and $r_k$ is a two-level rule. Then, function F is defined as follows. Function p checks the condition of the two-level rules. Function q generates a combination of the morpho-syntactic features of a word.</Paragraph> <Paragraph position="4"> defined for the morphological analysis. Parts of speech, irregular types, and other features are defined as follows.</Paragraph> <Paragraph position="6"> A syllable-based rule consists of a left-hand side (LHS) and a right-hand side (RHS). They are described by the following primitive functions.</Paragraph> <Paragraph position="8"> insert(x, word, i): insert syllable x at the i-th position
delete(word, i): delete the i-th syllable
'syllable(word,i)' fetches the i-th syllable of word, and 'subsyl(word,i,j)' gets j syllables starting from the i-th syllable of word. $C_{A_{\langle i,j,k \rangle}}$ checks whether a syllable x belongs to a characteristic set of syllables or not. For example, the b-irregular rule in Korean is described as follows. The set '$A_t$' is assumed to be a characteristic set of the last syllables of b-irregular verbs.</Paragraph> <Paragraph position="10"> tail ← subsyl(word, i, n-i-1), change(tail[1], 'we(워)', 'e(어)', M~maO
The b-irregular rule is described in the syllable-based formalism, and it is applied after the isolation of stem parts. So, stem and ending candidates should be identified first. An overall view of the morphological analyzer is shown in the figure. The first step is to find the morpheme boundaries using the characteristic functions for syllables. 
Stem candidates are generated at the second step by the phonological rules. Phonological rules are applied at a syllable w[i] if and only if w[i-1] is an element of a required characteristic set and w[i+1] is the beginning syllable of another morpheme.</Paragraph> <Paragraph position="11"> The following algorithm guesses the beginning position of a grammatical morpheme. In the algorithm, GM_SET1 and GM_SET2 are characteristic sets for the first and the remaining syllables of grammatical morphemes, respectively.</Paragraph> <Paragraph position="13"> Algorithm. morpheme boundary
7. Evaluation of the model
There are two types of candidates for a word. The first type is generated by morpheme isolation at every syllable boundary, and the second type is generated for each morpheme candidate by the phonological rules. We can count the number of candidates as follows. Suppose that $\alpha$ is the maximum number of syllables that cause an inflexion, $\beta$ is the number of candidates for prefinal endings, and $\gamma$ is the maximum number of inflexions for one syllable. In the case of Korean, $\alpha$ is less than $n$, $\beta$ is 2, and $\gamma$ is 3. If a word consists of $n$ syllables, then the maximum number of candidates is $10n + 8\alpha + 2$.</Paragraph> <Paragraph position="14"> - candidates for 1-morpheme word and (noun + postposition)
(1) 1-morpheme word: 1
(2) noun + postposition: n-1
(3) noun + suffix + postposition: n-2
- candidates for irregular verbs and (verb + ending)
(4) verb + ending: n-1+$\alpha$
(5) verb + prefinal_ending + ending: $\beta$
(6) verb inflexion: $\gamma$(n-1+$\alpha$+$\beta$)</Paragraph> <Paragraph position="16"> It is very inefficient to look up the dictionary for all the implausible stems and grammatical morphemes. Only plausible candidates are generated, using the idiosyncratic features of syllables. 
Now, the maximum number of candidates is bounded by a constant, and the number of dictionary accesses is greatly reduced.</Paragraph> <Paragraph position="17"> The previous algorithm has O(n) complexity because it tries to isolate a function word at every syllable position. However, if syllable features are used, the worst-case time complexity of the Korean morphological analysis becomes a constant. In this case, we should use the fact that no stem includes two successive syllables 'xy' such that 'xy' is a substring of a grammatical morpheme.</Paragraph> </Section> </Paper>
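
The boundary-guessing step described in Section 6 can be illustrated with a minimal Python sketch. The algorithm listing itself is not reproduced in this text, so everything below is an assumption rather than the authors' algorithm: the function names `make_syllable_info` and `boundary_candidates`, the two bit positions, and the toy contents of GM_SET1 and GM_SET2 are all hypothetical. The sketch relies only on what the text states: GM_SET1 holds syllables that can begin a grammatical morpheme, GM_SET2 holds syllables that can follow, and the syllable information function packs such set memberships into a bit string.

```python
# Hypothetical bit positions; the paper's actual bit assignment is not given here.
GM_FIRST = 1 << 0  # syllable may begin a grammatical morpheme (member of GM_SET1)
GM_REST = 1 << 1   # syllable may continue a grammatical morpheme (member of GM_SET2)


def make_syllable_info(gm_set1, gm_set2):
    """Build a syllable information function f: syllable -> int bit string."""
    table = {}
    for syllable in gm_set1:
        table[syllable] = table.get(syllable, 0) | GM_FIRST
    for syllable in gm_set2:
        table[syllable] = table.get(syllable, 0) | GM_REST
    # Syllables in neither set have no feature bits set.
    return lambda syllable: table.get(syllable, 0)


def boundary_candidates(word, f):
    """Return the indices where a grammatical morpheme may begin.

    Index i is a candidate if word[i] is in GM_SET1 and every later
    syllable is in GM_SET2. Index 0 is skipped so the stem is not empty.
    """
    candidates = []
    # Scan right to left over the trailing syllables of the word.
    for i in range(len(word) - 1, 0, -1):
        info = f(word[i])
        if info & GM_FIRST:
            candidates.append(i)
        if not (info & GM_REST):
            # word[i] cannot sit inside a grammatical morpheme,
            # so no boundary further to the left is possible.
            break
    return sorted(candidates)


# Toy sets (assumed): they cover only the postpositions 에서는 and 는.
f = make_syllable_info(gm_set1={"에", "는"}, gm_set2={"서", "는"})
print(boundary_candidates("학교에서는", f))  # [2, 4]
```

With these toy sets, the word 학교에서는 yields the candidate positions 2 (에서는) and 4 (는). The right-to-left scan stops at the first syllable that cannot continue a grammatical morpheme, so it inspects only the trailing run of syllables that could belong to one, which is one way to read the claim that the work does not grow with the word length.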