File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/p03-1060_intro.xml

Size: 3,613 bytes

Last Modified: 2025-10-06 14:01:48

<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1060">
  <Title>A Syllable Based Word Recognition Model for Korean Noun Extraction</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Noun extraction is a process to find every noun in a document (Lee et al., 2001). In Korean, Nouns are used as the most important terms (features) that express the document in NLP applications such as information retrieval, document categorization, text summarization, information extraction, and etc.</Paragraph>
    <Paragraph position="1"> Korean is a highly agglutinative language and nouns are included in Eojeols. An Eojeol is a surface level form consisting of more than one combined morpheme. Therefore, morphological analysis or POS tagging is required to extract Korean nouns.</Paragraph>
    <Paragraph position="2"> The previous Korean noun extraction methods are classified into two categories: morphological analysis based method (Kim and Seo, 1999; Lee et al., 1999a; An, 1999) and POS tagging based method (Shim et al., 1999; Kwon et al., 1999). The morphological analysis based method tries to generate all possible interpretations for a given Eojeol by implementing a morphological analyzer or a simpler method using lexical dictionaries. It may over-generate or extract inaccurate nouns due to lexical ambiguity and shows a low precision rate. Although several studies have been proposed to reduce the over-generated results of the morphological analysis by using exclusive information (Lim et al., 1995; Lee et al., 2001), they cannot completely resolve the ambiguity.</Paragraph>
    <Paragraph position="3"> The POS tagging based method chooses the most probable analysis among the results produced by the morphological analyzer. Due to the resolution of the ambiguities, it can obtain relatively accurate results. But it also suffers from errors not only produced by a POS tagger but also triggered by the preceding morphological analyzer.</Paragraph>
    <Paragraph position="4"> Furthermore, both methods have serious deficien-</Paragraph>
    <Paragraph position="6"> cies in that they require considerable manual labor to construct and maintain linguistic knowledge and suffer from the unknown word problem. If a morphological analyzer fails to recognize an unknown noun in an unknown Eojeol, the POS tagger would never extract the unknown noun. Although the morphological analyzer properly recognizes the unknown noun, it would not be extracted due to the sparse data problem.</Paragraph>
    <Paragraph position="7"> This paper proposes a new noun extraction method that uses a syllable based word recognition model. The proposed method does not require labor for constructing and maintaining linguistic knowledge and it can also alleviate the unknown word problem or the sparse data problem. It finds the most probable syllable-tag sequence of the input sentence by using statistical information and extracts nouns by detecting the word boundaries. The statistical information is automatically acquired from a POS annotated corpus and the word boundary can be detected by using an additional tag to represent the boundary of a word.</Paragraph>
    <Paragraph position="8"> This paper is organized as follows. In Section 2, the notion of word is defined. Section 3 presents the syllable based word recognition model. Section 4 describes the method of constructing the training data from existing POS tagged corpora. Section 5 discusses experimental results. Finally, Section 6 concludes the paper.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML