<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1060">
  <Title>A Syllable Based Word Recognition Model for Korean Noun Extraction</Title>
  <Section position="3" start_page="0" end_page="1" type="metho">
    <SectionTitle>
2 A new definition of word
</SectionTitle>
    <Paragraph position="0"> The Korean spacing unit is the Eojeol, which is delimited by whitespace, like a word in English. In Korean, an Eojeol is made up of one or more words, and a word is made up of one or more morphemes. Figure 1 shows the relationships among morphemes, words, and Eojeols with an example sentence. Syllables are delimited by hyphens in the figure.</Paragraph>
    <Paragraph position="1"> All previous noun extraction methods regard the morpheme as the processing unit. To extract nouns, the nouns in a given Eojeol must be segmented. Morphological analysis has been used for this purpose, but it requires complicated processing because surface forms are altered by various morphological phenomena such as irregular conjugation of verbs, contraction, and elision. Most of these phenomena occur inside a morpheme or at the boundaries between morphemes, not at the boundaries between words. We have also observed that a noun is a morpheme as well as a word. Thus, from the viewpoint of noun extraction, morphological analysis is unnecessary.</Paragraph>
    <Paragraph position="2"> In Korean linguistics, a word is defined as a morpheme or a sequence of morphemes that can be used independently. Even though a postposition cannot be used independently, it is regarded as a word because it is easily segmented from the preceding word. This definition is rather vague for computational processing. If we followed the linguistic definition, analyzing a word would be as difficult as morphological analysis. For this reason, we define a different notion of a word.</Paragraph>
    <Paragraph position="3"> According to our definition, each uninflected morpheme and each sequence of successive inflected morphemes is regarded as an individual word.</Paragraph>
    <Paragraph position="4">  By virtue of the new definition of a word, we need not consider mismatches between the surface level form and the lexical level one in recognizing words.</Paragraph>
    <Paragraph position="5"> The example sentence &quot;철수는 사람들을 봤다(Cheol-Su saw the persons)&quot; in Figure 1 consists of six words: &quot;철수(Cheol-Su)&quot;, &quot;는(neun)&quot;, &quot;사람(sa-lam)&quot;, &quot;들(deul)&quot;, &quot;을(eul)&quot;, and &quot;봤다(bwass-da)&quot;. Unlike in Korean linguistics, a noun suffix such as &quot;님(nim)&quot;, &quot;들(deul)&quot;, or &quot;적(jeog)&quot; is also regarded as a word because it is an uninflected morpheme.</Paragraph>
  </Section>
  <Section position="4" start_page="1" end_page="5" type="metho">
    <SectionTitle>
3 Syllable based word recognition model
</SectionTitle>
    <Paragraph position="0"> A Korean syllable consists of an obligatory onset (initial grapheme, consonant), an obligatory peak (nuclear grapheme, vowel), and an optional coda (final grapheme, consonant). In theory, the number of syllables that can be used in Korean equals the number of all possible combinations of these graphemes.</Paragraph>
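    <Paragraph> The onset-peak-coda structure can be made concrete with a short sketch. The Python fragment below is an illustration and not part of the original method: it assumes the input is a precomposed syllable from the Unicode Hangul Syllables block (U+AC00 to U+D7A3), which arranges 19 onsets, 21 peaks, and 28 coda slots (27 consonants plus "no coda") in a regular grid. The function names decompose and compose are hypothetical.

```python
# Sketch: split a precomposed Hangul syllable into onset, peak, and coda
# indices, assuming the Unicode Hangul Syllables block layout.
HANGUL_BASE = 0xAC00
NUM_ONSETS = 19  # initial consonants
NUM_PEAKS = 21   # vowels
NUM_CODAS = 28   # 27 final consonants plus "no coda"


def decompose(syllable):
    """Return (onset, peak, coda) indices; coda 0 means the coda is absent."""
    offset = ord(syllable) - HANGUL_BASE
    if offset not in range(NUM_ONSETS * NUM_PEAKS * NUM_CODAS):
        raise ValueError("not a precomposed Hangul syllable")
    onset, rest = divmod(offset, NUM_PEAKS * NUM_CODAS)
    peak, coda = divmod(rest, NUM_CODAS)
    return onset, peak, coda


def compose(onset, peak, coda=0):
    """Inverse of decompose: build the syllable from its three indices."""
    return chr(HANGUL_BASE + (onset * NUM_PEAKS + peak) * NUM_CODAS + coda)
```

 Multiplying the three inventory sizes gives 19 × 21 × 28 = 11,172 combinations, which is the theoretical number of pure Korean syllables in this encoding.</Paragraph>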
    <Paragraph position="1">  Fortunately, only a fixed number of syllables is frequently used in practice.</Paragraph>
    <Paragraph position="2"> A Korean syllable carries more information than a letter of the English alphabet. In addition, Korean syllables have particular characteristics; for example, words do not start with certain syllables. Several attempts have been made to exploit these characteristics. Kang (1995) used syllable information to reduce over-generated results in analyzing conjugated forms of verbs. Syllable statistics have also been used for automatic word spacing (Shim, 1996; Kang and Woo, 2001; Lee et al., 2002).</Paragraph>
    <Paragraph position="3"> The syllable based word recognition model is represented as a function defined by the following equations. It finds the most probable syllable-tag sequence.

 Notes: Korean morphemes can be classified into two types: uninflected morphemes having fixed word forms (such as noun, unconjugated adjective, postposition, adverb, interjection, etc.) and inflected morphemes having conjugated word forms (such as a morpheme with declined or conjugated endings, predicative postposition, etc.). In total, 11,172 (= 19×21×28) pure Korean syllables are possible. In the training data, 2,457 syllables are actually used, including Korean characters and non-Korean characters (e.g. alphabetic letters, digits, Chinese characters, symbols).</Paragraph>
    <Paragraph position="5"> Two independence assumptions are made. One is that the probability of the current syllable tag conditionally depends on only the previous syllable tag. The other is that the probability of the current syllable conditionally depends on the current tag. To reflect word spacing information, which is very useful in Korean POS tagging, Equation 2 is changed to Equation 3, which takes word spacing into account when calculating the transition probabilities, like the equation used in Kim et al. (1998).</Paragraph>
    <Paragraph position="7"> In the equation, $k$ is zero if the transition occurs inside an Eojeol; otherwise $k$ is one.</Paragraph>
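    <Paragraph> The equations referred to in this section can be sketched as follows. This is a reconstruction under stated assumptions rather than the authors' exact notation: $\Phi$ is a placeholder name for the model function, $S = s_1 \ldots s_n$ is the syllable sequence, $T = t_1 \ldots t_n$ is the syllable-tag sequence, and $k_i$ is the spacing indicator for position $i$ (zero inside an Eojeol, one otherwise). The first line is the generic decision rule, the second applies the two independence assumptions, and the third conditions each transition on the spacing indicator, which is one plausible way to incorporate spacing into the transition probabilities.

```latex
\begin{gather*}
\Phi(S) = \operatorname*{arg\,max}_{T} P(T \mid S)
        = \operatorname*{arg\,max}_{T} P(T)\, P(S \mid T) \\
\Phi(S) \approx \operatorname*{arg\,max}_{T}
        \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(s_i \mid t_i) \\
\Phi(S) \approx \operatorname*{arg\,max}_{T}
        \prod_{i=1}^{n} P(t_i \mid t_{i-1}, k_i)\, P(s_i \mid t_i)
\end{gather*}
```

 The step from the first to the second expression drops $P(S)$, which is constant for a given input, before the tag and syllable probabilities are factorized.</Paragraph>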
    <Paragraph position="8"> Word boundaries can be detected by an additional tag. This method has been used in some tasks such as text chunking and named entity recognition to represent a boundary of an element (e.g. individual phrase or named entity). There are several possible representation schemes to do this. The simplest one is the BIO representation scheme (Ramshaw and Marcus, 1995), where a &quot;B&quot; denotes the first item of an element and an &quot;I&quot; any non-initial item, and a syllable with tag &quot;O&quot; is not a part of any element. Because every syllable corresponds to one syllable tag, &quot;O&quot; is not used in our task. The representation schemes used in this paper are described in detail in Section 4.</Paragraph>
    <Paragraph position="9"> The probabilities in Equation 3 are estimated by the maximum likelihood estimator (MLE) using relative frequencies in the training data.</Paragraph>
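    <Paragraph> Relative-frequency estimation can be sketched in a few lines of Python. The fragment below is a minimal illustration, not the authors' implementation: the input format (triples of syllable, tag, and spacing flag), the sentence-start marker "BOS", and the tag strings such as "B-nc" are all assumptions made for the example.

```python
from collections import Counter


def estimate_mle(tagged_sentences):
    """Estimate transition and emission probabilities by relative frequency.

    tagged_sentences: list of sentences, each a list of
    (syllable, tag, spacing) triples, where spacing is 1 if the syllable
    starts a new Eojeol and 0 if it continues the current one.
    """
    trans_counts = Counter()    # (prev_tag, spacing, tag) occurrences
    context_counts = Counter()  # (prev_tag, spacing) occurrences
    emit_counts = Counter()     # (tag, syllable) occurrences
    tag_counts = Counter()      # tag occurrences

    for sentence in tagged_sentences:
        prev_tag = "BOS"  # hypothetical sentence-start marker
        for syllable, tag, spacing in sentence:
            trans_counts[(prev_tag, spacing, tag)] += 1
            context_counts[(prev_tag, spacing)] += 1
            emit_counts[(tag, syllable)] += 1
            tag_counts[tag] += 1
            prev_tag = tag

    # Each probability is a count divided by the count of its context.
    trans = {key: n / context_counts[key[:2]] for key, n in trans_counts.items()}
    emit = {key: n / tag_counts[key[0]] for key, n in emit_counts.items()}
    return trans, emit
```

 Events that never occur in the training data are simply absent from the returned tables, so a decoder has to decide what probability to give them.</Paragraph>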
    <Paragraph position="10">  The most probable sequence of syllable tags in a sentence (a sequence of syllables) can be efficiently computed by using the Viterbi algorithm.</Paragraph>
    <Paragraph position="11"> Since the MLE suffers from zero probabilities, we simply assign a very low value, of the form 1.0×10 raised to a negative exponent, instead.</Paragraph>
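    <Paragraph> Decoding with a probability floor can be sketched as below. This is a generic first-order Viterbi search in log space, written under assumptions rather than taken from the paper: the table layout (keys of the form (prev_tag, spacing, tag) and (tag, syllable)), the "BOS" start marker, and the floor value 1e-100 are hypothetical choices for illustration.

```python
import math

FLOOR = 1e-100  # hypothetical stand-in for the very low value


def viterbi(syllables, spacing, tags, trans, emit, floor=FLOOR):
    """Return the most probable tag sequence for the given syllables.

    spacing[i] is 1 if syllable i starts a new Eojeol, else 0.
    Entries missing from trans or emit fall back to `floor`.
    """
    if not syllables:
        return []

    def log_p(table, key):
        return math.log(table.get(key, floor))

    # best[i][tag] = (log score of the best path ending in tag, previous tag)
    best = [{
        tag: (log_p(trans, ("BOS", spacing[0], tag))
              + log_p(emit, (tag, syllables[0])), None)
        for tag in tags
    }]
    for i in range(1, len(syllables)):
        column = {}
        for tag in tags:
            # Pick the predecessor that maximizes score plus transition.
            prev = max(tags, key=lambda p: best[i - 1][p][0]
                       + log_p(trans, (p, spacing[i], tag)))
            score = (best[i - 1][prev][0]
                     + log_p(trans, (prev, spacing[i], tag))
                     + log_p(emit, (tag, syllables[i])))
            column[tag] = (score, prev)
        best.append(column)

    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(syllables) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]
```

 Working with log probabilities keeps long products from underflowing, and the floor guarantees that the logarithm is always defined.</Paragraph>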
    <Paragraph position="13"> Given a sequence of syllables and syllable tags, it is straightforward to obtain the corresponding sequence of words and word tags. Among the words recognized through this process, we can extract nouns by just selecting words tagged as nouns.</Paragraph>
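    <Paragraph> The conversion from syllable tags back to words can be sketched as follows. The fragment assumes BI-style tags written as a boundary mark and a POS tag joined by a hyphen (for example "B-nc"); this string format and the function names are hypothetical, and only the common noun tag "nc" is taken from the text.

```python
def tags_to_words(syllables, tags):
    """Group syllables into (word, pos) pairs from BI-style syllable tags."""
    words = []
    for syllable, tag in zip(syllables, tags):
        boundary, pos = tag.split("-", 1)
        if boundary == "B" or not words:
            words.append([syllable, pos])  # start a new word
        else:
            words[-1][0] += syllable       # extend the current word
    return [(text, pos) for text, pos in words]


def extract_nouns(syllables, tags, noun_tags=("nc",)):
    """Keep only the words whose POS tag is in noun_tags."""
    return [w for w, pos in tags_to_words(syllables, tags) if pos in noun_tags]
```

 Under these assumptions, a tagged version of the example sentence yields six words, of which only the one tagged "nc" is returned as a noun.</Paragraph>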
  </Section>
  <Section position="5" start_page="5" end_page="5" type="metho">
    <SectionTitle>
4 Constructing training data
</SectionTitle>
    <Paragraph position="0"> Our model is a supervised learning approach, so it requires training data. Because the existing Korean POS tagged corpora are annotated at the morpheme level, we cannot use them as training data without converting them into a form suitable for the word recognition model. The corpus can be converted through the following steps: Step 1: For a given Eojeol, segment it at word boundaries and assign a word tag to each word.</Paragraph>
    <Paragraph position="1"> Step 2: For each separated word, assign the word tag to each syllable in the word according to one of the representation schemes.</Paragraph>
    <Paragraph position="2">  For the purpose of noun extraction, we only select common nouns here (tagged as &quot;nc&quot; or &quot;NC&quot;) among other kinds of nouns.</Paragraph>
    <Paragraph position="3"> In step 1, word boundaries are identified from the uninflected morphemes and the sequences of successive inflected morphemes. An uninflected morpheme becomes one word, and the morpheme's tag is assigned as the word's tag. Successive inflected morphemes form one word, whose tag combines the tags of the first and the last morpheme. For example, the morpheme-unit POS tagged form of the Eojeol &quot;갔었다(gass-eoss-da)&quot; is &quot;가(ga)/pv+았(ass)/ep+었(eoss)/ep+다(da)/ef&quot;, and all of these are inflected morphemes. Hence, the Eojeol &quot;갔었다(gass-eoss-da)&quot; becomes one word, and its tag is represented as &quot;pv ef&quot; by using the first morpheme's tag (&quot;pv&quot;) and the last one's (&quot;ef&quot;).</Paragraph>
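    <Paragraph> Step 1 can be sketched as a single pass over the morphemes of an Eojeol. The fragment below is an illustration under assumptions: the set of inflected tags is a small illustrative subset built from the tags mentioned above, the underscore used to join the first and last tags is a hypothetical separator, and a run consisting of a single inflected morpheme simply keeps its own tag. The sketch joins the lexical forms of the morphemes, so it does not recover contracted surface forms such as gass-eoss-da; aligning words with the surface Eojeol is left out.

```python
# Illustrative subset of inflected morpheme tags (an assumption, not a full tagset).
INFLECTED_TAGS = {"pv", "ep", "ef"}


def morphemes_to_words(morphemes, inflected_tags=INFLECTED_TAGS):
    """Merge the morpheme-level analysis of one Eojeol into (word, tag) pairs.

    morphemes: list of (form, tag) pairs in order.
    Each uninflected morpheme becomes its own word; each maximal run of
    successive inflected morphemes becomes one word.
    """
    words = []
    run = []  # current run of successive inflected morphemes

    def flush():
        if run:
            text = "".join(form for form, _ in run)
            first, last = run[0][1], run[-1][1]
            tag = first if len(run) == 1 else first + "_" + last
            words.append((text, tag))
            run.clear()

    for form, tag in morphemes:
        if tag in inflected_tags:
            run.append((form, tag))
        else:
            flush()                     # close any pending inflected run
            words.append((form, tag))   # uninflected morpheme is a word
    flush()
    return words
```

 For the example analysis ga/pv + ass/ep + eoss/ep + da/ef, all four morphemes fall into one run, so the sketch returns a single word tagged with the first and last tags.</Paragraph>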
    <Paragraph position="4"> In step 2, a syllable tag is assigned to each of the syllables forming a word. The syllable tag should express not only the POS tag but also the boundary of the word. To detect word boundaries, we use the following four representation schemes. BI representation scheme: assign the &quot;B&quot; tag to the first syllable of a word and the &quot;I&quot; tag to the others. BIS representation scheme: assign the &quot;S&quot; tag to a syllable that forms a word by itself; the other tags (&quot;B&quot; and &quot;I&quot;) are the same as in the BI representation scheme.</Paragraph>
    <Paragraph position="5"> IE representation scheme: assign the &quot;E&quot; tag to the last syllable of a word and the &quot;I&quot; tag to the others. IES representation scheme: assign the &quot;S&quot; tag to a syllable that forms a word by itself; the other tags (&quot;I&quot; and &quot;E&quot;) are the same as in the IE representation scheme.</Paragraph>
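    <Paragraph> The four representation schemes differ only in which boundary mark each syllable receives, which the following sketch makes explicit. The function name, the hyphen joining the boundary mark to the POS tag, and the assumption that each syllable is one precomposed character are all choices made for this illustration.

```python
def syllable_tags(word, pos, scheme="BI"):
    """Assign one boundary-plus-POS tag to each syllable of a word."""
    n = len(word)  # assumes one character per syllable
    if n == 0:
        return []
    if scheme in ("BI", "BIS"):
        marks = ["B"] + ["I"] * (n - 1)   # mark the first syllable
    elif scheme in ("IE", "IES"):
        marks = ["I"] * (n - 1) + ["E"]   # mark the last syllable
    else:
        raise ValueError("unknown scheme: " + scheme)
    if scheme in ("BIS", "IES") and n == 1:
        marks = ["S"]  # a single syllable that forms a word by itself
    return [mark + "-" + pos for mark in marks]
```

 A one-syllable word therefore receives "B" under BI, "E" under IE, and "S" under both BIS and IES, while longer words are tagged identically by each scheme and its "S" variant.</Paragraph>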
    <Paragraph position="6"> Table 1 shows an example of assigning syllable-level word tags to the morpheme-level POS tagged corpus.</Paragraph>
  </Section>
</Paper>