<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1106">
  <Title>Corpus-Based Learning of Compound Noun Indexing *</Title>
  <Section position="4" start_page="57" end_page="58" type="metho">
    <SectionTitle>
2 Previous Research
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="57" end_page="57" type="sub_section">
      <SectionTitle>
2.1 Compound Noun Indexing
</SectionTitle>
      <Paragraph position="0"> There have been two different approaches to compound noun indexing: statistical and linguistic. In one statistical method, (Fagan, 1989) indexed phrases using six different parameters, including information on co-occurrence of phrase elements, relative location of phrase elements, etc., and achieved reasonable performance. However, his method did not show consistent, substantial improvements in effectiveness across five experimental document collections. (Strzalkowski et al., 1996; Evans and Zhai, 1996) indexed subcompounds from complex noun phrases using noun-phrase analysis. These methods need to find head-modifier relations in noun phrases and therefore require difficult syntactic parsing in Korean.</Paragraph>
      <Paragraph position="1"> For Korean, in one statistical method, (Lee and Ahn, 1996) indexed general Korean nouns using n-grams without linguistic knowledge, and the experimental results showed that the proposed method might be almost as effective as linguistic noun indexing. However, this method can generate many spurious n-grams, which decrease the precision of search performance. In linguistic methods, (Kim, 1994) used five manually chosen compound noun indexing rule patterns based on linguistic knowledge. However, this method cannot index the diverse types of compound nouns. (Won et al., 2000) used a full parser and increased the precision in search experiments. However, this linguistic method cannot be applied robustly to unrestricted texts. In summary, the previous methods, whether statistical or linguistic, have their own shortcomings. Statistical methods require significant amounts of co-occurrence information for reasonable performance and cannot index the diverse types of compound nouns. Linguistic methods need compound noun indexing rules written by humans and sometimes produce meaningless compound nouns, which decreases the performance of information retrieval systems. They also cannot cover the various types of compound nouns because of the limitation of human linguistic knowledge.</Paragraph>
      <Paragraph position="2"> In this paper, we present a hybrid method that uses linguistic rules, but these rules are automatically acquired from a large corpus through statistical learning. Our method generates more diverse compound noun indexing rule patterns than the previous standard methods (Kim, 1994; Lee et al., 1997), because the previous methods use only the most general rule patterns (shown in Table 2) and are based solely on human linguistic knowledge.</Paragraph>
      <Paragraph position="3">  Table 2: Compound noun indexing rule patterns for Korean:
- Noun without case markers / Noun
- Noun with a genitive case marker / Noun
- Noun with a nominal case marker or an accusative case marker / Verbal common noun or adjectival common noun
- Noun with an adnominal ending / Noun
- Noun within a predicate particle phrase / Noun
(The two nouns before and after a slash in a pattern can form a single compound noun.)</Paragraph>
    </Section>
    <Section position="2" start_page="57" end_page="58" type="sub_section">
      <SectionTitle>
2.2 Compound Noun Filtering
</SectionTitle>
      <Paragraph position="0"> Compound noun indexing methods, whether statistical or linguistic, tend to generate spurious compound nouns when they are actually applied. Since an information retrieval system can be evaluated by its effectiveness and also by its efficiency (van Rijsbergen, 1979), the spurious compound nouns should be efficiently filtered. (Kando et al., 1998) argued that, for Japanese, the smaller the number of index terms, the better the performance of the information retrieval system should be.</Paragraph>
      <Paragraph position="1">  For Korean, (Won et al., 2000) showed that segmentation of compound nouns is more efficient than compound noun synthesis in search performance. There have been many works on compound noun filtering: (Kim, 1994) used mutual information only, and (Yun et al., 1997) used mutual information and the relative frequency of POS (Part-Of-Speech) pairs together. (Lee et al., 1997) used stop-word dictionaries which were constructed manually. Most of the previous filtering methods applied a single, uniform method to the generated compound nouns irrespective of the different origins of the compound noun indexing rules, and they cause many problems due to data sparseness in the dictionary and training data. Our approach solves the data sparseness problem by using co-occurrence information on automatically extracted compound noun elements together with a statistical precision measure that best fits each rule.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="58" end_page="62" type="metho">
    <SectionTitle>
3 Overall System Architecture
</SectionTitle>
      <Paragraph position="0"> The compound noun indexing system proposed in this paper consists of two major modules: one for automatically extracting compound noun indexing rules (in Figure 1) and the other for indexing documents, filtering the automatically generated compound nouns, and weighting the indexed compound nouns (in Figure 2).</Paragraph>
      <Paragraph position="1">  There are three major steps in automatically extracting compound noun indexing rules. The first step is to collect compound noun statistical information, and the second step is to extract the rules from a large tagged corpus using the collected statistical information. The final step is to learn each rule's precision.</Paragraph>
    <Section position="1" start_page="58" end_page="59" type="sub_section">
      <SectionTitle>
4.1 Collecting Compound Noun
Statistics
</SectionTitle>
      <Paragraph position="0"> We collected 10,368 initial compound noun seeds from various types of well-balanced documents, such as the ETRI Kemong encyclopaedia 2 and many dictionaries on the Internet, as shown in Table 3. This small number of seeds is bootstrapped to extract the compound noun indexing rules for various corpora.</Paragraph>
      <Paragraph position="1">  To collect statistics on complete compound nouns, we made a 1,000,000-eojeol (an eojeol is the Korean spacing unit, which corresponds to an English word or phrase) tagged corpus for the compound noun indexing experiment from a large document set (Korean Information Base). (Footnote 2: Courteously provided by ETRI, Korea.)</Paragraph>
      <Paragraph position="2">  We collected complete compound nouns (a continuous noun sequence composed of at least two nouns, on the condition that both the preceding and the following POS of the sequence are not nouns (Yoon et al., 1998)) composed of 1 - 3 nouns from the tagged training corpus (Table 4).</Paragraph>
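The collection step above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the single-letter noun tag "N" and the list-of-(word, tag) sentence shape are assumptions for illustration, not the paper's actual POSTAG set.

```python
from collections import Counter

def collect_complete_compounds(tagged_sentences, max_len=3):
    """tagged_sentences: list of [(word, pos), ...]; pos "N" marks a noun.
    Counts noun runs bounded by non-noun POS on both sides ("complete"
    compound nouns), keeping runs of 2 up to max_len nouns."""
    counts = Counter()
    for sent in tagged_sentences:
        run = []
        for word, pos in sent + [("", "EOS")]:  # sentinel flushes the final run
            if pos == "N":
                run.append(word)
            else:
                if len(run) > 1 and max_len >= len(run):
                    counts[tuple(run)] += 1
                run = []
    return counts

corpus = [[("information", "N"), ("retrieval", "N"), ("is", "V"), ("fun", "N")]]
print(collect_complete_compounds(corpus))
# Counter({('information', 'retrieval'): 1})
```

The sentinel tag at the end of each sentence ensures a noun run that reaches the sentence boundary is still counted.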
    </Section>
    <Section position="2" start_page="59" end_page="59" type="sub_section">
      <SectionTitle>
4.2 Extracting Indexing Rules
</SectionTitle>
      <Paragraph position="0"> We define a template (in Table 5) to extract the compound noun indexing rules from a POS tagged corpus.</Paragraph>
      <Paragraph position="1"> The template means that if a front-condition-tag, a rear-condition-tag, and the sub-string-tags coincide with the input sentence tags, the lexical items in the synthesis position of the sentence can be indexed as a compound noun "x / y" (for 3-noun compounds, "x / y / z"). The tags used in the template are POS (Part-Of-Speech) tags, and we use the POSTAG set (Table 17).</Paragraph>
      <Paragraph position="2"> The following is an algorithm to extract compound noun indexing rules from a large tagged corpus using the two-noun compound seeds and the template defined above. The rule extraction scope is limited to the end of a sentence or, if there is a conjunctive ending (eCC) in the sentence, only to the conjunctive ending of the sentence. A rule extraction example is shown in Figure 3.</Paragraph>
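A minimal sketch of extracting one rule from a tagged sentence, given a two-noun seed, under the template above. The boundary markers "BOS"/"EOS", the tag names ("N", "jG", "jO", "eCC"), and the romanized example words are assumptions for illustration; only the eCC scope limit and the front/sub-string/rear structure come from the text.

```python
def extract_rule(tagged, seed):
    """tagged: [(word, pos), ...]; seed: (x, y), a two-noun compound seed.
    Returns (front-condition-tag, sub-string-tags, rear-condition-tag)
    or None if the seed does not occur within the extraction scope."""
    x, y = seed
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    # limit the extraction scope to a conjunctive ending (eCC), if present
    end = tags.index("eCC") if "eCC" in tags else len(tags)
    try:
        i = words.index(x)
        j = words.index(y, i + 1)
    except ValueError:
        return None
    if j >= end:
        return None
    front = tags[i - 1] if i > 0 else "BOS"
    rear = tags[j + 1] if len(tags) > j + 1 else "EOS"
    return (front, tuple(tags[i:j + 1]), rear)

# hypothetical sentence: seygyey-uy yeksa-lul (noun, genitive, noun, object marker)
sent = [("seygyey", "N"), ("uy", "jG"), ("yeksa", "N"), ("lul", "jO")]
print(extract_rule(sent, ("seygyey", "yeksa")))
# ('BOS', ('N', 'jG', 'N'), 'jO')
```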
      <Paragraph position="3">  The next step is to refine the extracted rules to select the proper ones. We used a rule filtering algorithm (Algorithm 2) based on frequency, together with the heuristic that rules containing negative lexical items (shown in Table 6) tend to make spurious compound nouns.</Paragraph>
      <Paragraph position="4">  Algorithm 2: Filtering extracted rules using frequency and heuristics  1. For each compound noun seed, select the rules whose frequency is greater than 2.  2. Among the rules selected by step 1, select only rules that are extracted by at least 2 seeds.</Paragraph>
      <Paragraph position="5"> 3. Discard rules which contain negative lexical items.</Paragraph>
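The three steps of Algorithm 2 can be sketched directly; the data shapes (a per-seed map of rule frequencies, rules as tuples of strings) are assumptions for illustration.

```python
from collections import Counter

def filter_rules(rules_by_seed, negative_items):
    """rules_by_seed: {seed: {rule: frequency}}; rules are tuples of strings.
    negative_items: lexical items (Table 6) that mark spurious rules."""
    # step 1: per seed, keep rules whose frequency is greater than 2
    kept = {seed: {r for r, f in rules.items() if f > 2}
            for seed, rules in rules_by_seed.items()}
    # step 2: keep only rules extracted by at least 2 seeds
    support = Counter(r for rules in kept.values() for r in rules)
    selected = {r for r, n in support.items() if n >= 2}
    # step 3: discard rules containing a negative lexical item
    return {r for r in selected
            if not any(item in r for item in negative_items)}
```

Steps 1 and 2 together demand both within-seed frequency and cross-seed support, which is what suppresses rules that happen to fit a single seed by chance.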
      <Paragraph position="6">  We extracted 2,036 rules from the large tagged corpus (Korean Information Base, 1,000,000 eojeol) using the above Algorithm 2. Among the filtered rules, there are 19 rules with negative lexical items, and we finally selected 2,017 rules. Table 7 shows the distribution of the final rules according to the number of elements in their sub-string-tags.</Paragraph>
      <Paragraph position="7">  (Rules whose sub-string-tags contain over 6 tags: 1.6%.) The automatically extracted rules have more rule patterns and lexical items than the human-made rules, so they can cover more diverse types of compound nouns (Table 8). When checking the overlap between the two rule collections, we found that the manual linguistic rules are a subset of our automatically generated statistical rules. Table 9 shows some example rules newly generated by our extraction algorithm, which were missing in the manual rule patterns.</Paragraph>
    </Section>
    <Section position="3" start_page="59" end_page="61" type="sub_section">
      <SectionTitle>
4.3 Learning the Precision of
Extracted Rules
</SectionTitle>
      <Paragraph position="0"> In the proposed method, we use the precision of rules to solve the compound noun over-generation and data sparseness problems. The precision of a rule is defined by Prec(rule) = N_actual / N_candidate, where Prec(rule) is the precision of the rule, N_actual is the number of actual compound nouns, and N_candidate is the number of compound noun candidates generated by the automatic indexing rules.</Paragraph>
      <Paragraph position="1"> To calculate the precision, we need a measure for compound noun identification. (Su et al., 1994) showed that the average mutual information of a compound noun tends to be higher than that of a non-compound noun, so we use mutual information as the measure for identifying compound nouns. If the mutual information of a compound noun candidate is higher than the average mutual information of the compound noun seeds, we decide that it is a compound noun. For mutual information (MI), we use two different equations: one for two-element compound nouns (Church and Hanks, 1990) and the other for three-element compound nouns (Su et al., 1994). The equation for two-element compound nouns is as follows:</Paragraph>
      <Paragraph position="3"> I(x; y) = log2( P(x, y) / (P(x) P(y)) ), where x and y are two words in the corpus, and I(x; y) is the mutual information of these two words (in this order). Table 10 shows the average MI values of the two- and three-element seeds.</Paragraph>
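The two-element MI and the seed-average decision rule can be sketched from plain counts; the count-based probability estimates and the function names are illustrative assumptions.

```python
import math

def mutual_information(pair_count, x_count, y_count, n_words, n_pairs):
    """Church and Hanks (1990) form: I(x; y) = log2( P(x, y) / (P(x) P(y)) ),
    with probabilities estimated from corpus counts."""
    p_xy = pair_count / n_pairs
    p_x = x_count / n_words
    p_y = y_count / n_words
    return math.log2(p_xy / (p_x * p_y))

def is_compound(mi, seed_average):
    # accept a candidate whose MI exceeds the seed average (e.g. 3.56
    # for two-element seeds, per Table 10)
    return mi > seed_average
```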
      <Paragraph position="4"> Table 10: Average value of the mutual information (MI) of compound noun seeds (2 elements: 3.56; 3 elements: 3.62). The MI was calculated from the statistics of the complete compound nouns collected from the tagged training corpus (see Section 4.1). However, complete compound nouns are continuous noun sequences and cause the data sparseness problem. Therefore, we need to expand the statistics. Figure 4 shows the architecture of the precision learning module, which expands the statistics of the complete compound nouns, along with an algorithmic explanation (Algorithm 3) of the process. Table 11 shows the improvement in the average precision during the repetitive execution of this learning process.</Paragraph>
      <Paragraph position="5">  2. Calculate the average precision of the rules.</Paragraph>
      <Paragraph position="6"> 3. Multiply a rule's precision by the frequency of the compound noun made by the rule.</Paragraph>
      <Paragraph position="7"> We call this value the modified frequency (MF).</Paragraph>
      <Paragraph position="8"> 4. Collect the same compound nouns, and sum all the modified frequencies for each compound noun.</Paragraph>
      <Paragraph position="9"> 5. If the summed modified frequency is greater than a threshold, add this compound noun to the complete compound noun statistical information.</Paragraph>
      <Paragraph position="10"> 6. Calculate all rules' precision again using the changed complete compound noun statistical information.</Paragraph>
      <Paragraph position="11"> 7. Calculate the average precision of the rules. 8. If the average precision of the rules is equal to the previous average precision, stop. Otherwise, go to step 2.</Paragraph>
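The iterative steps of Algorithm 3 can be condensed into a loop. This is a sketch under assumed data shapes (rules keyed by id, candidates as Counters, the statistics as a set), not the authors' implementation; the precision of a rule follows the Prec(rule) = N_actual / N_candidate definition above.

```python
from collections import Counter

def rule_precision(candidates, stats):
    """Prec(rule) = N_actual / N_candidate over the rule's candidates."""
    actual = sum(1 for c in candidates if c in stats)
    return actual / len(candidates)

def learn_precisions(compounds_by_rule, stats, threshold, max_iter=20):
    """compounds_by_rule: {rule_id: Counter(candidate -> frequency)};
    stats: set of known complete compound nouns (grows during learning)."""
    prev_avg = None
    prec = {}
    for _ in range(max_iter):
        prec = {r: rule_precision(c, stats)
                for r, c in compounds_by_rule.items()}
        avg = sum(prec.values()) / len(prec)
        if avg == prev_avg:            # step 8: average precision unchanged
            break
        mf = Counter()                 # steps 3-4: modified frequency
        for r, compounds in compounds_by_rule.items():
            for noun, freq in compounds.items():
                mf[noun] += prec[r] * freq
        for noun, score in mf.items():
            if score > threshold:      # step 5: promote into the statistics
                stats.add(noun)
        prev_avg = avg
    return prec
```

Promoting high-scoring compounds into the statistics is what expands the complete-compound counts and alleviates the data sparseness noted above.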
      <Paragraph position="12">  In this section, we explain how to use the automatically extracted rules to actually index the compound nouns, and describe how to filter and weight the indexed compound nouns.</Paragraph>
    </Section>
    <Section position="4" start_page="61" end_page="61" type="sub_section">
      <SectionTitle>
5.1 Compound Noun Indexing
</SectionTitle>
      <Paragraph position="0"> To index compound nouns from documents, we use a natural language processing engine, SKOPE (Standard KOrean Processing Engine) (Cha et al., 1998), which processes documents by analyzing words into morphemes and tagging parts of speech. The tagging results are compared with the automatically learned compound noun indexing rules and, if they coincide, we index the matched terms as compound nouns. Figure 5 shows the process of compound noun indexing with an example.</Paragraph>
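Applying a learned rule at indexing time amounts to sliding its sub-string-tag pattern over the sentence tags and checking the boundary tags. A sketch, reusing the assumed tag names and (word, tag) sentence shape from the earlier examples:

```python
def apply_rule(tagged, rule):
    """tagged: [(word, pos), ...]; rule: (front, sub-string-tags, rear).
    Returns the compound nouns indexed by the rule, joined as "x/y"."""
    front, sub, rear = rule
    words = [w for w, _ in tagged]
    tags = ["BOS"] + [t for _, t in tagged] + ["EOS"]  # padded boundary tags
    hits, n = [], len(sub)
    for i in range(1, len(tags) - n):  # positions in the padded tag list
        if (tuple(tags[i:i + n]) == sub
                and tags[i - 1] == front and tags[i + n] == rear):
            # collect the nouns at the "N" positions of the pattern
            nouns = [words[i - 1 + k] for k, t in enumerate(sub) if t == "N"]
            hits.append("/".join(nouns))
    return hits
```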
    </Section>
    <Section position="5" start_page="61" end_page="62" type="sub_section">
      <SectionTitle>
5.2 Compound Noun Filtering
</SectionTitle>
      <Paragraph position="0"> Among the compound nouns indexed above, there can still be meaningless compound nouns, which increase the number of index terms and the search time. To solve the compound noun over-generation problem, we experiment with seven different filtering methods (shown in Table 12), analyzing their relative effectiveness and efficiency, as shown in Table 16. These methods can be divided into three categories: the first uses MI, the second uses the frequency of the compound nouns (FC), and the last uses the frequency of the compound noun elements (FE). MI (Mutual Information) is a measure of word association, used under the assumption that a highly associated word n-gram is more likely to be a compound noun. FC is used under the assumption that a frequently encountered word n-gram is more likely to be a compound than a rarely encountered n-gram.</Paragraph>
      <Paragraph position="1"> FE is used under the assumption that a word n-gram with a frequently encountered specific element is more likely to be a compound. In methods C, D, E, and F, each threshold was decided by calculating the average number of compound nouns of each method.</Paragraph>
      <Paragraph position="2">  (MI) A. Mutual information of compound noun elements (threshold 0). (MI) B. Mutual information of compound noun elements (threshold: the average MI of the compound noun seeds). (FC) C. Frequency of compound nouns ... The method that generated the smallest number of compound nouns had the best efficiency and showed reasonable effectiveness (Table 16). On the basis of this filtering method, we develop a smoothing method by combining the precision of the rules with the mutual information of the compound noun elements, and propose our final filtering method (H) as follows:</Paragraph>
      <Paragraph position="4"> where a is a weighting coefficient and Precision is the precision of the applied rule, learned in Section 4.3.</Paragraph>
      <Paragraph position="5"> For the three-element compound nouns, the MI part is replaced with the three-element MI equation 3 (Su et al., 1994).</Paragraph>
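The exact equation for method H is not recoverable from this copy; one plausible reading of "combining the precision of rules with the mutual information" with a weighting coefficient a is a linear interpolation. The sketch below is a hypothetical reconstruction under that assumption, not the paper's formula.

```python
def filter_score(mi, rule_precision, a=0.5):
    # hypothetical combination: a * MI + (1 - a) * rule precision
    return a * mi + (1 - a) * rule_precision

def keep_compound(mi, rule_precision, threshold, a=0.5):
    # keep the candidate when the combined score clears a threshold
    return filter_score(mi, rule_precision, a) > threshold
```

Because the precision term does not depend on corpus co-occurrence counts, such a combination smooths the score for candidates whose MI estimate is unreliable due to data sparseness.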
    </Section>
  </Section>
</Paper>