<?xml version="1.0" standalone="yes"?> <Paper uid="J02-1004"> <Title>Technology</Title> <Section position="3" start_page="54" end_page="54" type="metho"> <SectionTitle> 2. Linguistic Characteristics of Korean </SectionTitle> <Paragraph position="0"> Korean is classified as an agglutinative language. In Korean, an eojeol consists of several morphemes that have clear-cut morpheme boundaries. For example, na-neun gamgi-e geol-lyeoss-dda 'I caught a cold' consists of 3 eojeols and 7 morphemes:</Paragraph> <Paragraph position="2"> are the characteristics of Korean that must be considered for morphological-level natural language processing and POS tagging.</Paragraph> <Paragraph position="3"> POS tagging of Korean is usually performed on a morpheme basis rather than on an eojeol basis. Accordingly, morphological analysis is essential to POS tagging because morpheme segmentation is much more important and difficult than POS assignment. Moreover, morphological analysis should segment eojeols that contain unknown morphemes as well as known morphemes. Hence, unknown-morpheme handling should be integrated into the morphological analysis process. Because a single eojeol can have many possible analyses (e.g., na-neun: na('I')/T + neun('topic marker')/jS, na('sprout')/DR + neun('adnominal')/eCNMG, nal('fly')/DI + neun('adnominal')/eCNMG), morpheme segmentation is inherently ambiguous.</Paragraph> <Paragraph position="4"> Korean is a postpositional language with many kinds of noun endings (particles), verb endings, and prefinal verb endings. 
It is these functional morphemes, rather than the order of eojeols, that determine grammatical</Paragraph> </Section> <Section position="4" start_page="54" end_page="57" type="metho"> <SectionTitle> 4 Here, &quot;+&quot; represents a morpheme boundary in an eojeol and &quot;/&quot; introduces the POS tag symbols (see </SectionTitle> <Paragraph position="0"> relations such as a noun's syntactic function, a verb's tense, aspect, modals, and even modifying relations between eojeols. For example, ga/jC is a case particle, so the eojeol uri('we')-ga has a subject role due to the particle ga/jC. Korean has a clear syllable structure within the morpheme; most nominal content morphemes keep their surface form when they are combined with functional morphemes.</Paragraph> <Paragraph position="1"> Korean is basically an SOV language but has relatively free word order compared with English. The weight in Equation (1) (Section 4.1) reflects the fact that transition probability is less important in Korean than in English. However, Korean does have some word order constraints: verbs must appear in sentence-final position, and modifiers must be placed before the element they modify. Some order constraints must therefore be used selectively as contextual information in the POS tagging process; the design of the error correction rules (Section 4.3) takes this into account.</Paragraph> <Paragraph position="2"> Complex spelling changes (irregular conjugations) frequently occur between morphemes when two morphemes combine to form an eojeol.</Paragraph> <Paragraph position="3"> These spelling changes make it difficult to segment the original morphemes before the POS tag symbols are assigned.</Paragraph> <Paragraph position="4"> The unknown-morpheme problem in Korean differs in some ways from the unknown-word problem in English. In English, it is easy to identify unknown words because they occur between spaces. 
However, in Korean, since unknown morphemes are hidden in an eojeol, we only know that morphological analysis failed in that eojeol; pinpointing the exact unknown morphemes is usually difficult. This is why, unlike in English, it is not possible to fully guess an unknown morpheme using only affixes. The distribution of POS tags for unknown morphemes extracted from a 130,000-morpheme training corpus (9,718 unknown morphemes) is shown in Table 1. The distribution from even a small corpus shows that we need to estimate various parts of speech for unknown morphemes rather than simply guess them as nouns.</Paragraph> <Paragraph position="5"> Table 2 shows the tagset that was used in the experiments reported in Section 5. The tagset was selected from hierarchically organized POS tags for Korean. We defined about 100 different POS tags, which can be used in morphological analysis as well as in POS tagging. We also designed over 300 morphotactic adjacency symbols to be used in morpheme connectivity checks for correct morpheme segmentation (to be explained in the next section). The POS tags are hierarchically organized symbols Computational Linguistics Volume 28, Number 1 that were iteratively refined from the eight major grammatical categories of Korean: nominal, predicate, modifier, particle, ending, affix, special symbol, and interjection. For a given morpheme, the acronym of a path name in the symbol hierarchy up to a certain level is assigned as a POS tag.</Paragraph> <Paragraph position="6"> The rest of the detailed hierarchies, which are related only to morpheme connectivity, are independently assigned as morphotactic adjacency symbols. Therefore, we can use either full or partial path names as POS tags in order to adjust the total number of tags. The size of the tagset can thus be adapted by refining grammatical categories that are more pertinent to a given application. 
For example, for text-indexing applications, we refine nominals more than predicates since index terms are usually nominals in these applications.</Paragraph> </Section> </Paper>
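<!--
Editorial sketch, kept inside an XML comment so the file stays well-formed.
It illustrates the analysis notation used in Section 2, where "+" marks a
morpheme boundary and "/" introduces the POS tag, on the three readings of
the eojeol na-neun quoted there. The function names, the SURFACE constant,
and the plain-string input format (English glosses dropped) are assumptions
made for illustration; this is not the morphological analyzer described in
the paper.

```python
# Sketch only: names and input format are hypothetical, not the paper's code.

SURFACE = "naneun"  # the eojeol na-neun written without the hyphen

# The three analyses quoted in the text, with the English glosses dropped.
ANALYSES = [
    "na/T + neun/jS",       # 'I' + topic marker
    "na/DR + neun/eCNMG",   # 'sprout' + adnominal
    "nal/DI + neun/eCNMG",  # 'fly' + adnominal
]


def parse_analysis(analysis):
    """Split an analysis into (morpheme, tag) pairs on '+' and '/'."""
    pairs = []
    for chunk in analysis.split("+"):
        morpheme, tag = chunk.strip().split("/")
        pairs.append((morpheme, tag))
    return pairs


def matches_surface(pairs, surface):
    """True if the morphemes concatenate to the surface form unchanged."""
    return "".join(morpheme for morpheme, _ in pairs) == surface


for analysis in ANALYSES:
    pairs = parse_analysis(analysis)
    print(pairs, matches_surface(pairs, SURFACE))
```

In this sketch the third reading does not concatenate back to the surface
form, because its stem is nal rather than na. That appears to be an instance
of the spelling changes that the text says make it hard to recover the
original morphemes, so a real analyzer could not rely on simple string
splitting of the eojeol.
-->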