<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1110">
  <Title>Generalized unknown morpheme guessing for hybrid POS tagging of Korean*</Title>
  <Section position="2" start_page="0" end_page="85" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Part-of-speech (POS) tagging faces many difficult problems, such as insufficient training data, inherent POS ambiguities, and, most seriously, unknown words. Unknown words are ubiquitous in any application and cause major tagging failures in many cases. Since Korean is an agglutinative language, we face unknown-morpheme problems instead of unknown-word problems in our POS tagging.</Paragraph>
    <Paragraph position="1"> The usual way of handling unknown morphemes has been to guess the possible POS's for an unknown morpheme by checking connectable functional morphemes in the same eojeol¹ (Kang, 1993). * This project was supported by KOSEF (teukjeongkicho #970-1020-301-3, 1997).</Paragraph>
    <Paragraph position="2"> In this way, possible POS's can be guessed for a single unknown morpheme only when it is positioned at the beginning of an eojeol. If an eojeol contains more than one unknown morpheme, or if unknown morphemes appear in positions other than the first, the previous methods cannot efficiently estimate them. So, we propose a morpheme-pattern dictionary which enables us to treat unknown morphemes in the same way as registered known morphemes, and thereby to guess them regardless of their number and positions in an eojeol. The unknown-morpheme handling using the morpheme-pattern dictionary is integrated into a hybrid POS disambiguation.</Paragraph>
    <Paragraph position="3"> POS disambiguation has usually been performed by statistical approaches, mainly using hidden Markov models (HMMs) (Cutting et al., 1992; Kupiec, 1992; Weischedel et al., 1993).</Paragraph>
    <Paragraph position="4"> However, since statistical approaches take into account neighboring tags only within a limited window (usually two or three), the decision sometimes cannot cover all the linguistic contexts necessary for POS disambiguation. Also, these approaches are inappropriate for idiomatic expressions, for which lexical terms need to be directly referenced. Statistical approaches are especially insufficient for agglutinative languages (such as Korean), which usually have complex morphological structures. In agglutinative languages, a word (called an eojeol in Korean) usually consists of a separable single stem morpheme plus one or more functional morphemes, and a POS tag should be assigned to each morpheme to cope with the complex morphological phenomena. ¹ An eojeol is a Korean spacing unit (similar to an English word) which usually consists of one or more stem morphemes and functional morphemes.</Paragraph>
    <Paragraph position="5"> Recently, rule-based approaches have been re-studied to overcome the limitations of statistical approaches by learning symbolic tagging rules automatically from a corpus (Brill, 1992; Brill, 1994). Some systems even perform POS tagging as part of a syntactic analysis process (Voutilainen, 1995). However, rule-based approaches alone are, in general, not very robust, and not portable enough to be adjusted to new tag sets and new languages. Also, their performance is usually no better than that of their statistical counterparts (Brill, 1992).</Paragraph>
    <Paragraph position="6"> To gain portability and robustness, and also to overcome the limited coverage of statistical approaches, we adopt a hybrid method that combines statistical and rule-based approaches for POS disambiguation.</Paragraph>
  </Section>
</Paper>