<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1110">
  <Title>Generalized unknown morpheme guessing for hybrid POS tagging of Korean*</Title>
  <Section position="3" start_page="85" end_page="85" type="metho">
    <SectionTitle>
2 Linguistic characteristics of
Korean
</SectionTitle>
    <Paragraph position="0"> Korean is classified as an agglutinative language in which ~.n eojeol consists of several number of morphemes that have clear-cut morpheme boundaries. For examples. 'Q~ ~'~Tl~l ~t r-l-(I caught a cold)&amp;quot; consists of 3 eojeols and</Paragraph>
    <Paragraph position="2"> + c\]-(final ending)/eGE. Below are the characteristics of Korean that must be considered for morphological-level natural language processing and POS tagging.</Paragraph>
    <Paragraph position="3"> As an agglutinative language, Korean POS tagging is usually performed on a morpheme basis rather than an eojeol basis. So, morphological analysis is essential to POS tagging because morpheme segmentation is much more important and difficult than POS assignment. Moreover: morphological analysis should segment out unknown morphemes as well as known morphemes, so unknown morpheme handling should be integrated into the morphological analysis process. There are three possible analyses fl'om the eojeol &amp;quot;Q~&amp;quot; : 'Q(I)/T' + ' ~(subject-marker)/jS', ~ ut-(sprout)/DR' + '~(adnominal)/eCNMG', ?~(fly)/DI' + '~(adnominal)/eCNMG', so morpheme &amp;quot;~Here, '+' is a morpheme boundary in an eojeol and '/' is for the POS tag symbols (see Fig. 1). segmentation is often ambiguous.</Paragraph>
    <Paragraph position="4"> * Korean is a postpositional language with many kind of noun-endings (particles), verb-endings (other endings), and prefinal verb-endings (prefinal endings). It is these functional morphemes, rather than eojeol's order, which determine most of the grammatical relations such as noun's syntactic flmctions, verb's tense, aspect, modals, and even modi~ing relations between eojeots.</Paragraph>
    <Paragraph position="5"> For example. ~/jS' is an atuxiliary particle, so eojeol ~'G-~-&amp;quot; has a subject role due to the particle :~/jS'.</Paragraph>
    <Paragraph position="6"> * Complex spelling changes frequently occur between morphemes when two morphemes combine to form an eojeol. These spelling changes make it difficult to segment the original morphemes out before assigning the POS tag symbols.</Paragraph>
    <Paragraph position="7"> Fig. 1 shows a tag set extracted from 100 full POS tag hierarchies in Korean. This tag set will be used in our experiments in section 6.</Paragraph>
  </Section>
  <Section position="4" start_page="85" end_page="87" type="metho">
    <SectionTitle>
3 Unknown morpheme guessing during
morphological analysis
</SectionTitle>
    <Paragraph position="0"> during morphological analysis Morphological analysis is a basic step to natural language processing which segments input texts into morphotactically connectable morphemes and assigns all possible POS tags to each morpheme by looking up a morpheme dictionary.</Paragraph>
    <Paragraph position="1"> Our morphological analysis follows general three steps (Sproat, 1992): morpheme segmentation, original morpheme recovery from spelling changes, and morphotactics modeling.</Paragraph>
    <Paragraph position="2"> Input texts are scanned from left to right., character3by character, to be matched to morphemes in a morpheme dictionary. The morpheme dictionary (Fig. 2) has a separate entry for each variant form (called allomorph) of the original morpheme form so we can easily reconstruct the original inorphemes from spelling changes.</Paragraph>
    <Paragraph position="3"> For morphotactics modeling, we used the POS tags and the morphotactic adjacency symbols in the dictionary. The full hierarchy of POS tags and morphotactic adjacency symbols are encoded in the morpheme dictionary for each mor- null tag descrip t ion' tag  pheme. To model the morpheme's connectability to one another, besides the morpheme dictionary, the separate morpheme-connectivity table encodes all the connectable pairs of morpheme groups using the morpheme's tag and morphotactic adjacency symbol patterns. After an input eojeol is segmented by trie indexed dictionary search, the morphological analysis checks if each segmentation is grammatically connectable by looking into the morpheme-connectivity table. null For unknown morpheme guessing, we develop a general unknown morpheme estimation method for number-free and position-free unknown morpheme handling. Using a morpheme pattern dictionary, we can look up unknown morphemes in the dictionary exactly same way as we do the registered morphemes. And when morphemes are checked if they are connectable, we can use the iEformation of the adjacent morphemes in the same eojeol. The basic idea of the morpheme-pattern dictionary is to collect all the possible general lexical patterns of Korean morphemes and encode each lexical syllable pattern with all the candidate POS tags. So we can assign initial POS tags to each unknown morpheme by only matching the syllable patterns in the pattern dictionary. In this way, we don't need a special rule-based unknown morpheme handling module in our morphological analyzer, and all the possible POS tags for unknown morphemes can be assigned just like the registered morphemes. This method can guess the POS of each and every unknown morpheme, if more than one unknown morphemes are in an eojeol, regardless of their positions since the morpheme segmentation is applied to both the unknown morphemes and the registered morphemes dur- null ing the trie indexed dictionary search.</Paragraph>
    <Section position="1" start_page="87" end_page="87" type="sub_section">
      <SectionTitle>
3.1 Morpheme pattern dictionary
</SectionTitle>
      <Paragraph position="0"> The morpheme pattern dictionary covers all necessary syllable patterns for unknown morphemes including common nouns, propernouns, adnominals, adverbs, regular and irregular verbs, regular and irregular adjectives, and special symbols for foreign words. The lexical.</Paragraph>
      <Paragraph position="1"> patterns for morphemes are collected from the previous studies (Kang, 1993) where the constraints of Korean syllable patterns as to the morpheme connectabilities are well described.</Paragraph>
      <Paragraph position="2"> Fig. 3 shows some example entries of the morpheme pattern dictionary, where ;Z', ;V', ;*' are meta characters which indicate a consonant, a vowel, and any number of Korean characters respectively. For example, ~L..7_-l--g\]&amp;quot; (thanks), which is a morpheme and an eojeol at the same time, is matched &amp;quot;(ZV*N)&amp;quot; (shown in Fig. 3) in the morpheme pattern dictionary, and is recovered into the original morpheme form &amp;quot;2_~&amp;quot;.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="87" end_page="89" type="metho">
    <SectionTitle>
4 A hybrid tagging model
</SectionTitle>
    <Paragraph position="0"> sentence i ~ ................ .:.. l ........... ~: C/lictloOary ::: morphologiea I !!i ...... ~' ~mo, p.ho~ = \] i.C/o~,iCth,~,.&amp;quot;l ....... .,. ili )-. ':.i tta~t...:..::.)t i analyzer i'i ~. &amp;quot;~::'::':, tram, re .&gt;-:4 ..... t e'~'~': ~J . ~</Paragraph>
    <Paragraph position="2"> chitecture for Korean POS tagging.</Paragraph>
    <Paragraph position="3"> Fig. 4 shows a proposed hybrid architecture for Korean POS tagging with generalized unknown-morpheme guessing. There are three major components: the morphological analyzer with unknown-morpheme handler, the statistical tagger, and the rule-based error corrector. The morphological analyzer segments the morphemes out of eojeols in a sentence and reconstructs the original morphemes from spelling changes from irregular conjugations. It also assigns all possible POS tags to each morpheme by consulting a morpheme dictionary. The unknown-morpheme handler integrated into the morphological analyzer assigns the POS's of the morphemes which are not registered in tim dictionary. null The statistical tagger runs the Viterbi algorithm (Forney, 1973) on the morpheme graph for searching the optimal tag sequence for POS disambiguation. For remeding the defects of a statistical tagger, we introduce a post error-correction mechanism. The error-corrector is a rule-based transformer (Brill, 1992), and it corrects the mis-tagged morphemes by considering the lexical patterns and the necessary contextual information.</Paragraph>
    <Section position="1" start_page="87" end_page="88" type="sub_section">
      <SectionTitle>
4.1 Statistical POS tagger
</SectionTitle>
      <Paragraph position="0"> Statistical tagging model has the morpheme graph as input and selects the best morpheme and POS tag sequence r for sentences represented ill the graph. The morpheme-graph is a compact way of representing nmltiple morpheme sequences for a sentence. We put each morpheme with the tag as a node and the morpheme connectivity as a link.</Paragraph>
      <Paragraph position="1"> Our statistical tagging model is adjusted from standard bi-grams using the Viterbi-search (Cutting et al., 1992) plus on-the-fly extra computing of lexical probabilities for unknown morphemes. The equation of statistical tagging model used is a modified hi-gram model with left to right search:</Paragraph>
      <Paragraph position="3"> where T&amp;quot; is an optimal tag sequence that maximizes the forward Viterbi scores. Pr(tilti-1) is a bi-gram tag transition probability and Pr( ti lrni ) PT(td is a modified morpheme lexical probability. This equation is finally selected from the extensive experiments using the following six different equations:</Paragraph>
      <Paragraph position="5"> In the experiments, we used 10204 morpheme training corpus from :'Kemong Encyclopedia 5,.</Paragraph>
      <Paragraph position="6"> Table 1 shows the tagging performance of each equation.</Paragraph>
      <Paragraph position="7"> Training of the statistical tagging model requires parameter estimation process for two parameters, that is, morpheme lexical probabilities and bi-gram tag transition probabilities. Several studies show that using as much as tagged corpora for training gives much better Sprovided from ETRI pattern dictionary performance than unsupervised training using Baum-Welch algorithm (blerialdo, 1994). So we decided to use supervised training using tagged corpora with relative frequency counts. The three necessary probabilities can be estimated as follows:</Paragraph>
      <Paragraph position="9"> where N(mi, ti) indicates the total number of occurrences of morpheme .mi together with specific tag ti, while N(mi) shows the total number .of occurrences of morpheme rni in the tagged training corpus. The N(ti_l,ti) and N(ti-1) can be interpreted similarly for two consecutive tags ti-1 and ti.</Paragraph>
    </Section>
    <Section position="2" start_page="88" end_page="89" type="sub_section">
      <SectionTitle>
4.2 Lexical probability estimation for
unknown morpheme guessing
</SectionTitle>
      <Paragraph position="0"> unknown morpheme guessing The lexical probabilities for unknown morphemes cannot be pre-calculated using the equation (8), so a special method should be applied. We suggest to use syllable tri-grams since Korean syllables can duly play important roles as restricting units for guessing POS of a Pr(tilmi) morpheme. So the lexical probability e,-(td for unknown morphemes can be estimated using the frequency of syllable tri-gram products according to the following formula:</Paragraph>
      <Paragraph position="2"/>
      <Paragraph position="4"> where ~m' is a morpheme, 'e' is a syllable, 't' is a POS tag, '#&amp;quot; is a morpheme boundary symbol, and ft(eilei-~., el-l) is a frequency data for tag ~t' with cooccurrence syllables el-2, ei-1, ei.</Paragraph>
      <Paragraph position="5"> A tri-gram probabilities are smoothed by. equation (13) to cope with the sparse-data problem.</Paragraph>
      <Paragraph position="6"> For example, &amp;quot;~=1-&amp;quot;. o ,_ is a name of a person, so is an unknown morpheme. The lexical probability of &amp;quot;~'~'&amp;quot; -, o ,~ as tag MPN is estimated using the formula:</Paragraph>
      <Paragraph position="8"> All tri-grams for Korean syllables were pre-calculated and stored in the table, and are applied with the candidate tags during the unknown morpheme POS guessing and smoothing.</Paragraph>
      <Paragraph position="9"> 5 A posteriori error correction rules The statistical morpheme tagging covers only the limited range of contextual information.</Paragraph>
      <Paragraph position="10"> Moreover, it cannot refer to tile lexical patterns as a context for POS disambiguation. As mentioned before, Korean eojeol has very complex morphological structure so it is necessary to look at the functional morphemes selectively to get the grammatical relations between eojeols. For these reasons, we designed error-correcting rules for eojeols to compensate estimation and modeling errors of the statistical morpheme tagging. However, designing the error-correction rules with knowledge engineering is tedious and error-prone. Instead, we adopted Brill's approach (Brill, 1992) to auto.</Paragraph>
      <Paragraph position="11"> matically learn the error-correcting rules from small amount of tagged corpus. Fortunately, Brill showed that we don't need a large amount of tagged corpus to extract the symbolic tagging rules compared with the case in tile statistical tagging. Table 2 shows some rule schemata we used to extract the error-correcting rules: where a rule schema designates the context of rule applications, i.e.. the morpheme position and the lexical/tag decision in the context eojeol.</Paragraph>
      <Paragraph position="12"> The rules which can be automatically learned using table 2's schemata are in the form of table 3, where \[current eojeol or morpheme\] consists of morpheme (with current tag) sequence in the eojeol, and \[corrected eojeol or morpheme\] consists of morpheme (with corrected tag) sequence ill the same eojeol. For example, the rule</Paragraph>
      <Paragraph position="14"> the current eojeol was statistically tagged as common-noun (MC) plus auxiliary particle (iS), but when the next first eojeol's (N1) first position morpheme tag (FT) is another common-noun (MC), the eojeol should be tagged as regular verb (DR) plus adnominal ending (eCNMG). This statistical error is caused from the ambiguity of the morpheme &amp;quot;N&amp;quot; which has two meanings as &amp;quot;Chinese ink:' (noun) and &amp;quot;to eat&amp;quot; (verb). Since the morpheme segmentation is very difficult ill Korean, many of the tagging errors also come from the morpheme segmentation errors. Our error-correcting rules call cope with these morpheme  next first eojeol (N1) first morpheme's tag (FT) previous first eojeol (P1) last morpheme's tag (LT) next second eojeol (N2) first morpheme's tag (FT) next third eojeol (N3) first morpheme's tag (FT) previous first eojeol (P1) last morpheme's lexical form (LM) previous first eojeol (P1) first morpheme's lexical form (FM) next first eojeol (N1) first morpheme's lexical form (FM)  segmentation errors by correcting the errors in the whole eojeol together. For example, the following rule can correct morpheme segmentation errors: \[~/MC 4- ol.7_./jO\]\[P1LM, @\] -~ \[~ o\]/DR + _7-~eCCl. This rule says that the eojeol &amp;quot;@old&amp;quot; is usually segmented as common-noun &amp;quot;~&amp;quot; (meaning string or rope) plus other-particle &amp;quot;o\]..v_.&amp;quot;, but when the morpheme &amp;quot;~&amp;quot; appears before the eojeol, it should be segmented as regular-verb ,;@o\] &amp;quot; (meaning shrink) plus conjunctive-ending ':2.&amp;quot;. This kind of segmentation-error correction can greatly enhance the tagging performance in Korean. The rules are automatically learned by Comparing the correctly tagged corpus with the outputs of the statistical tagger. The training is leveraged (Brill, 1992) so the error-correcting rules are gradually learned as the statistical tagged texts are corrected by the rules learned so far.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="89" end_page="89" type="metho">
    <SectionTitle>
6 Experiment results
</SectionTitle>
    <Paragraph position="0"> For morphological analysis and POS tagging experiments, we used 130000 morpheme-balanced training corpus for statistical parameter estimation and 50000 morpheme corpus for learning the post error-correction rules. These training corpora were collected from various sources such as internet documents, encyclopedia, newspapers, and school textbooks.</Paragraph>
    <Paragraph position="1"> For the test set, we carefully selected three different document sets aiming for a broad coverage. The document set 1 (25299 morphemes; 1338 sentences) is collected from :'Kemong encyclopedia 6,, hotel reservation dialog corpus 7 and internet document, and contains 10% of unknown morphemes. The documents set 2 (15250 morphemes; 5774 sentences) is solely collected from various internet documents from assorted domains such as broadcasting scripts and newspapers, and has about 8.5% of unknown morphemes. The document set 3 (20919 morphemes; 555 sentence) is from Korean standard document collection set called KTSET 2.0 s and contains academic articles and electronic newspapers. This document set contains about 1470 unknown morphemes (mainly technical jargons). null Table 4 showsour taggingperformance for these three document sets. This experiment shows efficiency of our unknown morpheme handling and guessing techniques since we can confirm the sharp performance drops between tagger-a and tagger-b. The post error correction rules are also proved to be effective by the performance drops between the full tagger and taggera, but the drop rates are mild due to the performance saturation at tagger-a, which means that our statistical tagging alone already achieves state-of-the-art performance for Korean morpheme tagging.</Paragraph>
  </Section>
class="xml-element"></Paper>