File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/i05-2034_metho.xml
Size: 8,380 bytes
Last Modified: 2025-10-06 14:09:37
<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2034"> <Title>Probabilistic Models for Korean Morphological Analysis</Title> <Section position="4" start_page="197" end_page="199" type="metho"> <SectionTitle> 3 Probabilistic morphological analysis model </SectionTitle> <Paragraph position="0"> Probabilistic morphological analysis generates all possible interpretations, together with their probabilities, for a given Eojeol w. The probability that a given Eojeol w is analyzed as a certain interpretation R is written P(R|w). The interpretation R consists of a morpheme sequence M and its corresponding POS sequence T, as given in Equation 1: P(R|w) = P(M,T|w).</Paragraph> <Paragraph position="2"> In the following subsections, we describe three morphological analysis models based on three different linguistic units (Eojeol, morpheme, and syllable).</Paragraph> <Section position="1" start_page="197" end_page="198" type="sub_section"> <SectionTitle> 3.1 Eojeol-unit model </SectionTitle> <Paragraph position="0"> For the Eojeol-unit model, it is sufficient to store the frequency of each Eojeol (surface-level form) together with its interpretations, acquired from the POS-tagged corpus.</Paragraph> <Paragraph position="1"> The probabilities of Equation 1 are estimated by the maximum likelihood estimator (MLE) using relative frequencies in the training data. In written Korean, each character corresponds to one syllable, so we do not distinguish between character and syllable in this paper.</Paragraph> <Paragraph position="2"> ProKOMA extracts only Eojeols that occur five or more times in the training data.</Paragraph> <Paragraph position="3"> The most prominent advantage of Eojeol-unit analysis is its simplicity. As mentioned before, morphological analysis of Korean is very complex. Eojeol-unit analysis avoids this complex process, so it is efficient and fast. It also reduces unnecessary results by producing only the interpretations that actually appeared in the corpus. 
We therefore also expect an improvement in accuracy.</Paragraph> <Paragraph position="4"> Because Korean Eojeols are highly productive, the number of possible Eojeols is very large, so storing all of them is impossible. Using Eojeol-unit analysis alone is therefore undesirable, but a small number of high-frequency Eojeols covers a significant portion of all Eojeols that occur in text, so this model is still helpful.</Paragraph> </Section> <Section position="2" start_page="198" end_page="199" type="sub_section"> <SectionTitle> 3.2 Morpheme-unit model </SectionTitle> <Paragraph position="0"> As discussed, not all Eojeols can be covered by Eojeol-unit analysis. Moreover, the ultimate goal of morphological analysis is to recognize every morpheme within an Eojeol. For these reasons, most previous systems have used the morpheme as the processing unit for morphological analysis.</Paragraph> <Paragraph position="1"> The morpheme-unit morphological analysis model is derived as follows by introducing the lexical form l:</Paragraph> <Paragraph position="3"> where l should satisfy the following condition:</Paragraph> <Paragraph position="5"> is a set of lexical forms that can be derived from the surface form w. This condition means that among all possible lexical forms for</Paragraph> <Paragraph position="7"> ), the only lexical form l is deterministically derived from the interpretation R.</Paragraph> <Paragraph position="9"> Equation 3 assumes that the interpretation R and the surface form w are conditionally independent given the lexical form l. Since a lexical form is just the concatenation of morphemes, the lexical form l can be omitted, as in Equation 4. In Equation 5, the left term P(l|w) denotes &quot;the morphological restoration model&quot;, and the right term P(M,T) denotes &quot;the morpheme segmentation and POS assignment model&quot;.</Paragraph> <Paragraph position="10"> We describe the morphological restoration model first. 
The model gives the probability of the lexical form given a surface form; it encodes the probability that the k substrings of the surface form and of its lexical form correspond to each other. The equation of the model is as follows:</Paragraph> <Paragraph position="12"> denote the j-th substrings of the surface form and the lexical form, respectively. We call such pairs of substrings &quot;morphological information&quot;. This information is acquired as follows. If a surface form (Eojeol) and its lexical form are the same, their syllables are mapped one-to-one and each syllable pair is extracted. Otherwise, a morphological change has occurred; in this case, the pair of substrings spanning the mismatch from its beginning to its end is extracted. The morphological information is also extracted automatically from the training data. Table 2 shows some examples of applying the morphological restoration model.</Paragraph> <Paragraph position="13"> Now we turn to the morpheme segmentation and POS assignment model. It is the joint probability of the morpheme sequence and the tag sequence.</Paragraph> <Paragraph position="15"> are pseudo tags that indicate the beginning and the end of an Eojeol, respectively. We introduce the t_EOW symbol to reflect the preference for a well-formed structure of a given Eojeol. The model is represented as the well-known bigram hidden Markov model (HMM), which is widely used in POS tagging. The morpheme dictionary and the morphosyntactic rules that have been used in the previous</Paragraph> </Section> <Section position="3" start_page="199" end_page="199" type="sub_section"> <SectionTitle> 3.3 Syllable-unit model </SectionTitle> <Paragraph position="0"> One of the most difficult problems in morphological analysis is the unknown-word problem, which arises because we cannot register every possible morpheme in the dictionary. 
In English, contextual information and suffix information are helpful for estimating the POS tag of an unknown word. In Korean, syllable characteristics can be utilized. For instance, the syllable &quot;eoss&quot; can only be a pre-final ending.</Paragraph> <Paragraph position="1"> The syllable-unit model is derived from Equation 4 as follows:</Paragraph> <Paragraph position="3"> is the syllable sequence of the lexical form, and U = u_{1,m} is its corresponding syllable tag sequence.</Paragraph> <Paragraph position="4"> In the above equation, P(l|w) is the same as in the morpheme-unit model (Equation 6), so we use the morpheme-unit model's result as it is. The right term P(C,U) is referred to as &quot;the POS assignment model&quot;.</Paragraph> <Paragraph position="5"> The POS assignment model assigns the m syllables to the m syllable tags: s denote the pseudo syllables and the pseudo tags, respectively. They indicate the beginning of an Eojeol. Analogously, c</Paragraph> <Paragraph position="7"> denote the pseudo syllables and the pseudo tags that indicate the end of an Eojeol, respectively.</Paragraph> <Paragraph position="8"> Two Markov assumptions are applied in Equation 10. One is that the probability of the current syllable c</Paragraph> </Section> </Section> <Section position="5" start_page="199" end_page="199" type="metho"> <SectionTitle> i </SectionTitle> <Paragraph position="0"> conditionally depends only on the previous two syllables and two syllable tags. The other is that the probability of the current syllable tag u</Paragraph> </Section> <Section position="6" start_page="199" end_page="199" type="metho"> <SectionTitle> i </SectionTitle> <Paragraph position="0"> conditionally depends only on the previous syllable, the current syllable, and the previous two syllable tags. 
This model can take broader context into account than the HMM because it makes a less strict independence assumption.</Paragraph> <Paragraph position="1"> In order to convert the syllable sequence C and the syllable tag sequence U to the morpheme sequence M and the morpheme tag sequence T, we can use two additional symbols (&quot;B&quot; and &quot;I&quot;) to indicate the boundaries of morphemes: &quot;B&quot; denotes the first syllable of a morpheme and &quot;I&quot; any non-initial syllable. Examples of syllable-unit tagging with BI symbols are given in Table</Paragraph> </Section> </Paper>
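
The B/I conversion described in the final paragraph can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each syllable tag is a string of the hypothetical form "B-POS" or "I-POS", and the function name, the romanized syllables, and the POS labels in the usage note are illustrative only.

```python
def syllables_to_morphemes(syllables, tags):
    """Group tagged syllables into (morpheme, POS) pairs.

    Assumption (not from the paper): each tag looks like "B-POS" or
    "I-POS", where "B" marks the first syllable of a morpheme and "I"
    marks any non-initial syllable.
    """
    if len(syllables) != len(tags):
        raise ValueError("syllables and tags must have the same length")

    morphemes = []  # each entry is a mutable [form, pos] pair
    for syllable, tag in zip(syllables, tags):
        boundary, _, pos = tag.partition("-")
        if boundary == "B" or not morphemes:
            # "B" opens a new morpheme; a stray leading "I" is treated
            # the same way so that no syllable is dropped.
            morphemes.append([syllable, pos])
        else:
            # "I" extends the current morpheme and keeps the POS that
            # was assigned by the opening "B" syllable.
            morphemes[-1][0] += syllable
    return [(form, pos) for form, pos in morphemes]
```

With the illustrative input `["hak", "gyo", "e"]` and tags `["B-NNG", "I-NNG", "B-JKB"]`, the first two syllables are merged into one morpheme and the third starts a new one.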