<?xml version="1.0" standalone="yes"?> <Paper uid="J02-1004"> <Title>Technology</Title> <Section position="2" start_page="1" end_page="54" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Part-of-speech (POS) tagging involves many difficult problems, such as insufficient amounts of training data, inherent POS ambiguities, and (most seriously) many types of unknown words. Unknown words are ubiquitous in any application and cause major tagging failures in many cases. Since Korean is an agglutinative language, it presents more serious problems with unknown morphemes than with unknown words because more than one morpheme can be unknown in a single word and morpheme <!-- Computational Linguistics Volume 28, Number 1 --> Previous techniques for guessing unknown words mostly use guessing rules that analyze word features by examining leading and trailing characters. Most of them analyze trailing characters together with other features such as capitalization and hyphenation (Kupiec 1992; Weischedel et al. 1993). Some of them use more morphologically oriented word features such as suffixes, prefixes, and character lengths (Brill 1995; Voutilainen 1995). The guessing rules are usually handcrafted using knowledge of morphology but are sometimes acquired automatically from lexicons and corpora (Brill 1995; Mikheev 1996; Oflazer and Tür 1996). Previously developed methods for guessing unknown morphemes in Korean are not much different from the methods used for English. They rely on rules that reflect knowledge of Korean morphology and word formation. The usual way of handling unknown morphemes is to guess all the possible POS tags for an unknown morpheme by checking connectable functional morphemes in the same eojeol (Kang 1993).</Paragraph> <Paragraph position="1"> However, this approach can guess probable POS tags only for a single unknown morpheme that occurs at the beginning of an eojeol. 
Unlike in English, in Korean, more than one unknown morpheme can appear in a single eojeol because an eojeol can include complex components such as Chinese characters, Japanese words, and other foreign words. If an eojeol contains more than one unknown morpheme or if the unknown morphemes appear in other than first position in the eojeol, all previous methods fail to estimate them efficiently. This is why we try to avoid conventional guessing rules using word morphology features such as those proposed in Mikheev (1996) and Oflazer and Tür (1996).</Paragraph> <Paragraph position="2"> In this paper, we propose a syllable-pattern-based generalized unknown-morpheme estimation method using a morpheme pattern dictionary that enables us to treat unknown morphemes in the same way as registered known morphemes, and thereby to guess them regardless of their number or position in an eojeol. The method for estimating unknown morphemes using the morpheme pattern dictionary in Korean needs to be tightly integrated into morphological analysis and POS disambiguation systems.</Paragraph> <Paragraph position="3"> POS disambiguation has usually been performed by statistical approaches, mainly using the hidden Markov model (HMM) in English research communities (Cutting et al. 1992; Kupiec 1992; Weischedel et al. 1993). These approaches are also dominant for Korean, with slight improvements to accommodate the agglutinative nature of Korean. For Korean, early HMM tagging was based on eojeols. The eojeol-based tagging model calculates lexical and transition probabilities with eojeols as a unit; it suffers from severe data sparseness problems since a single eojeol consists of many different morphemes (Lee, Choi, and Kim 1993). Later, morpheme-based HMM tagging was tried; such models assign a single tag to a morpheme regardless of the spacing in a sentence. 
Morpheme-based tagging can reduce data sparseness problems but incurs multiple observation sequences in Viterbi decoding since an eojeol can be segmented in many different ways. Researchers then tried many ways of reducing the computation caused by multiple observation sequences, such as shared word sequences and virtual words (Kim, Lim, and Seo 1995) and a two-ply HMM that computes over morpheme units but is restricted to within an eojeol (Kim, Im, and Im 1996). <!-- Lee, Cha, and Lee Syllable-Pattern-Based Unknown-Morpheme Estimation --> However, since statistical approaches take neighboring tags into account only within a limited window (usually two or three), sometimes the decision fails to cover important linguistic contexts necessary for POS disambiguation. Also, approaches using only statistical methods are inappropriate for idiomatic expressions, for which lexical terms need to be directly referenced. In particular, statistical approaches alone do not suffice for agglutinative languages, which usually have complex morphological structures. In agglutinative languages, a word usually consists of one or more stem morphemes plus a series of functional morphemes; therefore, each morpheme should receive a POS tag appropriate to its functional role to cope with the complex morphological phenomena in such languages. Recently, rule-based approaches, which learn symbolic tagging rules automatically from a corpus, have been reconsidered to overcome the limitations of statistical approaches (Brill 1995). Some systems even perform POS tagging as part of a syntactic analysis process (Voutilainen 1995). Following the success of transformation-based approaches, attempts have been made to use transformation rules in systems for tagging Korean (Im, Kim, and Im 1996). However, in general, rule-based approaches alone are not very robust and are not portable enough to be adjusted to new tagsets or new languages. 
Also, they usually perform no better than their statistical counterparts (Brill 1995). To gain portability and robustness and also to overcome the limited coverage of statistical approaches, we need to combine the two approaches to gain the advantages of each. In this paper, we propose a hybrid method that combines statistical and rule-based approaches to POS disambiguation and can be tightly coupled with generalized unknown-morpheme-guessing techniques.</Paragraph> </Section> </Paper>