<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1027">
  <Title>Compound Noun Segmentation Based on Lexical Data Extracted from Corpus*</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Morphological analysis is crucial for processing agglutinative languages such as Korean, since words in such languages have many morphological variants.</Paragraph>
    <Paragraph position="1"> A sentence is represented by a sequence of eojeols, which are the syntactic units delimited by spacing characters in Korean. Unlike an English word, an eojeol is not a single word but is composed of a series of words (content words and function words). In particular, since an eojeol can often contain more than one noun, we cannot interpret the sentence or phrase properly without segmenting it accurately.</Paragraph>
    <Paragraph position="2"> The problem in compound noun segmentation is that it is not possible to register all compound nouns in the dictionary, since nouns form an open class and their number is very large. Thus, without a segmentation process, unregistered compound nouns must be treated as unseen words. Furthermore, accurate compound noun segmentation plays an important role in application systems. It is required for improving recall and precision in Korean information retrieval and for obtaining better translations in machine translation. * This work was supported by a KOSEF postdoctoral fellowship grant.</Paragraph>
    <Paragraph position="3"> For example, suppose that the compound noun 'seol'agsan-gugrib-gongwon' (Seol'ag Mountain National Park) appears in documents.</Paragraph>
    <Paragraph position="4"> A user might want to retrieve documents about 'seol'agsan' (Seol'ag Mountain), and the documents containing 'seol'agsan-gugrib-gongwon' are likely to be of interest to that user as well. Therefore, the compound noun should be segmented correctly before indexing so that those documents can be retrieved with the query 'seol'agsan'. Likewise, to translate 'seol'agsan-gugrib-gongwon' as 'Seol'ag Mountain National Park', its constituents must first be identified through segmentation.</Paragraph>
    <Paragraph position="5"> This paper presents two methods for the segmentation of compound nouns. First, we extract compound nouns from a large corpus, manually divide them into simple nouns, and construct a hand-built segmentation dictionary from them. The dictionary includes compound nouns that are frequently used or that require exceptional processing. It contains about 100,000 entries.</Paragraph>
    <Paragraph position="6"> Second, a segmentation algorithm is applied if the compound noun does not exist in the built-in dictionary. Basically, the segmenter is based on the frequencies of individual nouns extracted from a corpus.</Paragraph>
    <Paragraph position="7"> However, it is difficult to distinguish proper nouns from common nouns, since Korean offers no clue such as capital letters. Thus, a large amount of lexical knowledge alone does not yield good results if it contains incorrect data, and it is not appropriate to use frequencies obtained by automatically tagging a large corpus. Moreover, sufficient lexical data cannot be acquired from a small tagged corpus.</Paragraph>
    <Paragraph position="8"> In this paper, we propose a method for obtaining simple nouns and their frequencies from frequently occurring eojeols, exploiting the repetitiveness of natural language. The number of eojeols investigated is manually tractable, and the frequently used nouns extracted from them are crucial for compound noun segmentation. Furthermore, we propose min-max composition to divide a sequence of syllables, which is shown by experiments to be an effective method.</Paragraph>
    <Paragraph position="9"> To briefly show why we select this operation, consider the following example. Suppose that a compound noun is composed of four syllables 's1s2s3s4'. There are several possible segmentations of this sequence of syllables, of which we consider the following two: (s1/s2s3s4) and (s1s2/s3s4). Assume that 's1' is a frequently appearing word in texts, whereas 's2s3s4' rarely occurs as a word. On the other hand, 's1s2' and 's3s4' occur frequently, although not as frequently as 's1'. In this case, the more likely segmentation is (s1s2/s3s4). This means that a sequence of syllables should not be divided into a frequently occurring part and a rarely occurring part. In this sense, min-max is the appropriate operation for the selection. In other words, the min value is selected between the two sequences of syllables in each candidate segmentation, and then the max is taken over the selected min values. To apply the operation repeatedly, we use a CYK-style tabular parsing algorithm.</Paragraph>
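    <Paragraph position="10"> The min-max selection over a CYK-style table can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the function name segment_min_max, the toy frequencies, the choice to let an unsplit span compete with its splits, and the tie-breaking rule are all assumptions made for illustration.
```python
def segment_min_max(syllables, freq):
    """Return (score, segments) for the best min-max segmentation.

    syllables: list of syllable strings.
    freq: dict mapping a noun (concatenated syllables) to its frequency.
    Both the data layout and the scoring details are illustrative assumptions.
    """
    n = len(syllables)
    # best[i][j] holds (score, segments) for the span syllables[i:j].
    best = [[None] * (n + 1) for _ in range(n + 1)]
    # Fill the table bottom-up by span length, as in CYK parsing.
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            word = "".join(syllables[i:j])
            # Option 1: keep the whole span as a single noun.
            cell = (freq.get(word, 0), [word])
            # Option 2: split at k; a split is only as good as its weaker half (min).
            for k in range(i + 1, j):
                left_score, left_segs = best[i][k]
                right_score, right_segs = best[k][j]
                score = min(left_score, right_score)
                # Keep the strongest candidate seen so far (max); ties keep the earlier one.
                if score > cell[0]:
                    cell = (score, left_segs + right_segs)
            best[i][j] = cell
    return best[0][n]


# Hypothetical frequencies mirroring the example: 's1' is very frequent,
# 's2s3s4' is rare, and 's1s2' and 's3s4' are frequent but less so than 's1'.
toy_freq = {"s1": 100, "s2s3s4": 1, "s1s2": 50, "s3s4": 40}
score, segments = segment_min_max(["s1", "s2", "s3", "s4"], toy_freq)
print(score, segments)  # 40 ['s1s2', 's3s4']
```
    With these toy values, the split (s1/s2s3s4) scores min(100, 1) = 1 and the split (s1s2/s3s4) scores min(50, 40) = 40, so the max over the min values selects (s1s2/s3s4), in agreement with the example.</Paragraph>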
  </Section>
</Paper>