<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3010">
  <Title>Part-of-Speech Tagging Considering Surface Form for an Agglutinative Language</Title>
  <Section position="3" start_page="0" end_page="2" type="metho">
    <SectionTitle>
2 Korean POS tagging model
</SectionTitle>
    <Paragraph position="0"> In this section, we first describe the standard morpheme-unit tagging model and point out a flaw in it. Then, we describe the proposed model.</Paragraph>
    <Section position="1" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
2.1 Standard morpheme-unit model
</SectionTitle>
      <Paragraph position="0"> This section describes the HMM-based morpheme-unit model. The morpheme-unit POS tagging model is to find the most likely sequence of morphemes M and corresponding POS tags T for a given sentence W, as follows (Kim et al., 1998; Lee et al., 2000):</Paragraph>
      <Paragraph position="2"> In the equation, u denotes the number of morphemes in the sentence. The sequences M = m_1, ..., m_u and T = t_1, ..., t_u</Paragraph>
      <Paragraph position="4"> denote a sequence of u lexical forms of morphemes and a sequence of u morpheme categories (POS tags), respectively. To simplify Equation 2, a Markov assumption is applied.</Paragraph>
      <Paragraph position="6"> Here, t_0 is a pseudo tag which denotes the beginning of a word and is also written as BOW. p denotes the type of transition from the previous tag to the current tag; it has a binary value according to the type of the transition (either an intra-word or an inter-word transition).</Paragraph>
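As a concrete illustration, the factored morpheme-unit model above can be sketched in a few lines of Python. All probabilities, morphemes, and the intra/inter-word transition labels below are toy, hypothetical values, not figures from the paper:

```python
import math

# Hypothetical toy parameters: emission P(m|t) and transition P(t|t_prev, p),
# where p marks whether the transition crosses a word boundary.
EMIT = {("mong-go", "nc"): 0.4, ("leul", "jc"): 0.5}
TRANS = {("BOW", "nc", "inter"): 0.6, ("nc", "jc", "intra"): 0.7}

def hmm_score(morphs, tags, trans_types):
    """Log P(M, T) under the bigram morpheme-unit model:
    sum over i of log P(m_i | t_i) + log P(t_i | t_{i-1}, p_i),
    with t_0 = BOW (beginning of word)."""
    logp, prev = 0.0, "BOW"
    for m, t, p in zip(morphs, tags, trans_types):
        logp += math.log(EMIT[(m, t)]) + math.log(TRANS[(prev, t, p)])
        prev = t
    return logp
```

For example, `hmm_score(["mong-go", "leul"], ["nc", "jc"], ["inter", "intra"])` scores the analysis mong-go/nc + leul/jc of a single word.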
      <Paragraph position="7"> As can be seen, the surface word does not take part in the calculation of the probability. A lexical form of a word can be mapped to more than one surface word. In this case, although different surface forms are given, if they have the same lexical form, then the probabilities will be the same. For example, the lexical form mong-go/nc+leul/jc can be mapped from the two surface forms mong-gol and mong-go-leul. By applying Equation 1 and Equation 2 to both words, the following equations can be derived:</Paragraph>
      <Paragraph position="9"> As a result, we obtain the following equation from Equation 4 and Equation 5:</Paragraph>
      <Paragraph position="11"> That is, the model assumes that the probabilities of tagging results having the same lexical form are identical. However, we can easily show that Equation 6 is mistaken; in fact,</Paragraph>
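The flaw can be made concrete in a short Python sketch: because the standard score depends only on the lexical analysis M, T, two distinct surface words that share an analysis receive exactly the same probability. All numbers here are toy values, not estimates from any corpus:

```python
import math

def standard_score(analysis):
    """Toy standard-model score: P(M, T) depends only on lexical forms and
    tags, never on the surface word they came from."""
    emit = {("mong-go", "nc"): 0.4, ("leul", "jc"): 0.5}
    tr = {("BOW", "nc"): 0.6, ("nc", "jc"): 0.7}
    logp, prev = 0.0, "BOW"
    for m, t in analysis:
        logp += math.log(emit[(m, t)]) + math.log(tr[(prev, t)])
        prev = t
    return logp

# Both surface forms map to the same lexical form mong-go/nc + leul/jc ...
analysis = [("mong-go", "nc"), ("leul", "jc")]
s1 = standard_score(analysis)  # scored for surface form "mong-gol"
s2 = standard_score(analysis)  # scored for surface form "mong-go-leul"
assert s1 == s2  # ... so the model cannot distinguish them (Equation 6)
```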
      <Paragraph position="13"> To overcome this disadvantage, we propose a new tagging model that can consider the surface form.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.2 The proposed model
</SectionTitle>
      <Paragraph position="0"> This section describes the proposed model. To simplify the notation, we introduce a variable R, which denotes a tagging result of a given sentence and consists of M and T.</Paragraph>
      <Paragraph position="2"> mong-go means Mongolia, nc is a common noun, and jc is an objective case postposition.</Paragraph>
      <Paragraph position="3"> The probability P(R|W) is given as follows: r_0 denotes a pseudo variable to indicate the beginning of the word. Equation 9 becomes Equation 10 by the chain rule. To reach a more tractable form, Equation 10 is simplified by a Markov assumption into Equation 11.</Paragraph>
      <Paragraph position="4"> Equation 12 is derived by Bayes' rule, Equation 13 by the chain rule and an independence assumption, and Equation 15 by Bayes' rule. In Equation 15, we call the left term the &amp;quot;morphological analysis model&amp;quot; and the right one the &amp;quot;transition model&amp;quot;.</Paragraph>
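A minimal sketch of how the two factors of Equation 15 combine, assuming per-word analysis probabilities P(r_i|w_i) (as a ProKOMA-style analyzer might supply) and a simple transition table; every entry below is hypothetical:

```python
import math

# Hypothetical analysis-model probabilities P(r | w): note they are
# conditioned on the SURFACE word, unlike the standard model's P(M, T).
ANALYSIS_P = {
    ("mong-gol", "mong-gol/nc"): 0.7,
    ("mong-gol", "mong-go/nc+leul/jc"): 0.3,
    ("mong-go-leul", "mong-go/nc+leul/jc"): 0.9,
}
# Hypothetical transition scores between adjacent word-level results.
TRANS_P = {
    ("<BOW>", "mong-gol/nc"): 0.5,
    ("<BOW>", "mong-go/nc+leul/jc"): 0.5,
}

def proposed_score(words, results):
    """Log score of a tagging result R = r_1..r_n for words w_1..w_n:
    sum of log P(r_i | w_i) (analysis model) + log transition score."""
    logp, prev = 0.0, "<BOW>"
    for w, r in zip(words, results):
        logp += math.log(ANALYSIS_P[(w, r)])
        logp += math.log(TRANS_P[(prev, r)])
        prev = r
    return logp
```

Unlike the standard model, the score now differs for mong-gol and mong-go-leul even when they share the lexical form mong-go/nc+leul/jc, because the analysis model is conditioned on the surface word.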
      <Paragraph position="5"> The morphological analysis model P(r_i|w_i) can</Paragraph>
      <Paragraph position="7"> be implemented in a morphological analyzer. If a morphological analyzer can provide the probability, then the tagger can use the values as they are. In practice, we use the probabilities produced by the morphological analyzer ProKOMA (Lee and Rim, 2004).</Paragraph>
      <Paragraph position="8"> Although it is not necessary to discuss the morphological analysis model in detail, we should note that surface forms are considered here.</Paragraph>
      <Paragraph position="9"> The transition model is a form of point-wise mutual information.</Paragraph>
      <Paragraph position="10"> Here, i denotes the position of the word in the sentence.</Paragraph>
      <Paragraph position="11"> The denominator is the joint probability that the morphemes and the tags within a word appear together, and the numerator is the joint probability that all the morphemes and the tags across the two words appear together. Because of the sparse data problem, these cannot be calculated directly from the training data either. By a Markov assumption, the denominator and the numerator can be broken down into Equation 18 and Equation 19, respectively.</Paragraph>
      <Paragraph position="12"> Equation 19 contains the transition probability between the last morpheme of the (i-1)-th word and the first morpheme of the i-th word.</Paragraph>
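Under these assumptions, a boundary transition score of the point-wise mutual information form can be sketched as follows: the joint probability of the adjacent boundary pair over the product of its marginals, estimated from counts. The counts below are invented for illustration only:

```python
import math

# Toy counts over (morpheme, tag) units; "ha/pv" is a made-up verb entry.
BIGRAM = {(("leul", "jc"), ("ha", "pv")): 30}   # boundary pair co-occurrences
UNIGRAM = {("leul", "jc"): 100, ("ha", "pv"): 200}
TOTAL = 10_000                                   # total unit count

def transition_pmi(last_of_prev, first_of_cur):
    """PMI-style score for the pair (last morpheme/tag of word i-1,
    first morpheme/tag of word i): log P(x, y) / (P(x) * P(y))."""
    p_joint = BIGRAM[(last_of_prev, first_of_cur)] / TOTAL
    p_left = UNIGRAM[last_of_prev] / TOTAL
    p_right = UNIGRAM[first_of_cur] / TOTAL
    return math.log(p_joint / (p_left * p_right))
```

A positive score means the boundary pair co-occurs more often than chance, rewarding tagging results whose adjacent words fit together.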
      <Paragraph position="13"> By applying Equation 18 and Equation 19 to Equation 17, we obtain the following equation. For a given sentence, Figure 2 shows the bigram HMM-based tagging model, and Figure 3 the proposed model. The main difference between the two models is that the proposed model considers surface forms while the HMM does not.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> For evaluation, two data sets are used: the ETRI POS tagged corpus and the KAIST POS tagged corpus. We divided the test data into ten parts. The performances of the model are measured by averaging over the ten test sets in a 10-fold cross-validation experiment. Table 1 shows a summary of the corpora. Generally, POS tagging goes through the following steps: first, a morphological analyzer is run, generating all the possible interpretations for a given input text; then, a POS tagger takes the results as input and chooses the most likely one among them. Therefore, the performance of the tagger depends on that of the preceding morphological analyzer.</Paragraph>
    <Paragraph position="1"> If the morphological analyzer does not generate the exact result, the tagger has no chance to select the correct one; thus, the answer inclusion rate of the morphological analyzer becomes the upper bound of the tagger. Previous works preprocessed the dictionary so that all the exact answers were included in the morphological analyzer's results. However, this evaluation method is, strictly speaking, inappropriate for real applications. In this experiment, we present the accuracy of the morphological analyzer instead of preprocessing the dictionary. ProKOMA's results on the test data are listed in Table 2. In the table, 1-best accuracy is defined as the number of words whose highest-probability result matches the gold standard, over all the words in the test data. This can also be seen as a tagging model that does not consider any outer context. To compare the proposed model with the standard model, the results of the two models are given in Table 3. As can be seen, our model outperforms the HMM model. Moreover, the HMM model is even worse than ProKOMA's 1-best accuracy. This indicates that the standard HMM by itself is not a good model for agglutinative languages.</Paragraph>
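The 1-best accuracy defined above can be sketched as follows (a hypothetical implementation for illustration, not ProKOMA's code):

```python
def one_best_accuracy(ranked_results, gold):
    """1-best accuracy: the number of words whose highest-probability
    analysis matches the gold standard, over all words in the test data.

    ranked_results: per word, candidate analyses sorted best-first.
    gold: the correct analysis for each word.
    """
    correct = sum(1 for cands, g in zip(ranked_results, gold)
                  if cands and cands[0] == g)
    return correct / len(gold)
```

For example, with two words where only the first word's top-ranked analysis is correct, the accuracy is 0.5.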
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4 Conclusion
</SectionTitle>
    <Paragraph position="0"> We have presented a new POS tagging model that can consider the surface form for Korean, which is an agglutinative language. Although the model leaves much room for improvement, it outperforms the HMM-based model according to the experimental results.</Paragraph>
  </Section>
  <Section position="6" start_page="2" end_page="2" type="metho">
    <SectionTitle>
Table 3: Results of the standard HMM and the proposed model
</SectionTitle>
    <Paragraph position="0"> Corpus: ETRI, KAIST. The standard HMM: 87.47, 89.83. The proposed model: 90.66, 92.01.</Paragraph>
  </Section>
</Paper>