<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1087"> <Title>A Probabilistic Approach to Compound Noun Indexing in Korean Texts</Title> <Section position="2" start_page="0" end_page="514" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Automatic indexing produces a document representation that makes the content of the document more explicit. Indices that are carefully chosen to represent a document improve retrieval performance in both accuracy and time efficiency. The potential of a candidate index is often judged on the basis of its discriminating power over a document set as well as its linguistic significance in the document.</Paragraph> <Paragraph position="1"> Thus, a good index term should distinguish a certain class of documents from the rest of the documents and be relevant to the subject matter of the class of documents to be indexed by the term.</Paragraph> <Paragraph position="2"> In general, automatic indexing consists of the identification of index terms and the assignment of weights to the terms (Salton 1983).</Paragraph> <Paragraph position="3"> An index term can be either a simple noun or a compound noun composed of more than one simple noun. Compound nouns tend to carry more specific contextual information than simple nouns, and thus they are likely to contribute to retrieval precision. Compound nouns may also contain useful simple nouns that usually refer to general contexts, and these will boost retrieval recall. Processing compound nouns means decomposing them into simple nouns and evaluating the simple nouns as potential index terms.
In both identifying and evaluating index terms, compound nouns require a different strategy from that for simple nouns.</Paragraph> <Paragraph position="4"> The identification of compound nouns involves a certain degree of linguistic or statistical analysis that varies from simple stemming to morphological analysis (Fagan 1989).</Paragraph> <Paragraph position="5"> What makes compound nouns even more complicated to handle in Korean documents is the convention for writing them. In Korean, compound nouns may be written with or without intervening blanks between constituent nouns. Arbitrarily long compound nouns are possible and not rare in real texts. The decomposition of a compound noun is particularly problematic because of the severe ambiguity of segmentations.</Paragraph> <Paragraph position="6"> In this paper, we propose a method to identify and evaluate the candidate index terms from compound nouns. First, each possible decomposition of a compound noun is identified. To see the potential of the component nouns of the decomposition, we observe how the component nouns are distributed over the total document set, and also examine how the simple and compound nouns of the current document are distributed over the same document set. The similarity of the two distributions indicates how consistently the two term sets will behave given a query at retrieval time.</Paragraph> <Paragraph position="7"> The proposed method assumes a dictionary of nouns that is automatically constructed from the document set. This practice has never been tried in Korean document indexing, but it has some important merits. The laborious manual construction of nominal dictionaries is not needed. Since the noun dictionary contains only the nouns that occur in the document set, the ambiguity in analyzing words is greatly reduced.</Paragraph> <Paragraph position="8"> Previous research on the problem of compound noun indexing in Korean has been done in two directions.
One approach adopts a full-scale morphological analysis to decompose a word into a sequence of the smallest morpheme units, all of which are treated as index terms. The other approach tries to avoid the complexity of the full-scale analysis by using bigrams, as in (Fujii 1993; Lee 1996; Ogawa 1993). Since these methods take all the components of compound nouns as index terms without evaluation, irrelevant terms can decrease retrieval precision.</Paragraph> <Paragraph position="9"> Experiments on 1000 documents show that our evaluation scheme gave results closer to human intuition and maintained the highest precision ratio of the existing methods.</Paragraph> <Paragraph position="10"> In the following section, a brief review of related work on automatic indexing for Korean documents is made. Section 3 explains the proposed method in detail. The verification of the method through experiments is described in Section 4.</Paragraph> <Paragraph position="11"> Section 5 concludes the paper.</Paragraph> </Section> </Paper>