<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1219">
  <Title>Extraction of Chinese Compound Words - An Experimental Study on a Very Large Corpus</Title>
  <Section position="3" start_page="0" end_page="132" type="metho">
    <SectionTitle>
2 Technique description
</SectionTitle>
    <Paragraph position="0"> Statistical extraction of Chinese compounds has been used in (Lee-Feng Chien 1997)(WU Dekai and Xuanyin XIA 1995) and (Ming-Wen Wu and Keh-Yih Su 1993). The basic idea is that a  Chinese compound should appear as a stable sequence in corpus. That is, the components in the compound are strongly correlated, while the components lie at both ends should have low correlations with otiter words.</Paragraph>
    <Paragraph position="1"> The method consists of two steps. At fast, a fist of candidate compounds is extracted from a very large corpus by using mutual information. Then, context dependency is used to remove undesirable compounds. In what follows, we will describe them in more detail.</Paragraph>
    <Section position="1" start_page="132" end_page="132" type="sub_section">
      <SectionTitle>
2.1 Mutual Information
</SectionTitle>
      <Paragraph position="0"> According to our study on Chinese corpora, most compounds are of length less than 5 characters. The average length of words in the segmented-corpus is of approximately 1.6 characters. Therefore, only word bi-gram, tn'gram, and quad-gram in the corpus are of interest to us in compound extraction.</Paragraph>
      <Paragraph position="1"> We use a criterion, called mutual inform~on, to evaluate the correlation of different components in the compound. Mutual information Ml(x,y) of a</Paragraph>
      <Paragraph position="3"> Where f(x) is the occurrence frequency of word x in the corpus, and fix, y) is the occurrence frequency of the word pair (x,y) in the corpus. The higher the value of MI is, the more likely x and y are to form a compound.</Paragraph>
      <Paragraph position="4"> The mutual information MI(x,y,z) of tri-gram</Paragraph>
      <Paragraph position="6"> The estimation of mutual information of quadgrams is similar to that of tri-grams. The extracted compounds should be of higher value of MI than a pre-set threshold.</Paragraph>
    </Section>
    <Section position="2" start_page="132" end_page="132" type="sub_section">
      <SectionTitle>
2.2 Context Dependency
</SectionTitle>
      <Paragraph position="0"> Figure 1 The extracted Chinese compounds should be complete. That is, we should generate a whole word, not a part of it. For example, ~,~-~-~t'~J(missile defense plan) is a complete word, and -~.~0-1~J~ (missile defense) is not, although both have relatively high value of mutual information.</Paragraph>
      <Paragraph position="1"> Therefore, we use another feature, called context dependency. The contexts of the word l~(defense) are illustrated by figure 1.</Paragraph>
      <Paragraph position="2"> A compound X has NO left context dependency if</Paragraph>
      <Paragraph position="4"> Where tl, t2 are threshold value, j\[.) is frequency, L is the set of left adjacent strings of X, tz~L and ILl means the number of unique left adjacent strings. Similarly, a compound X has NO right context dependency if RSize ~ R l&gt; t3 or f(/~) &lt; t4 MaxR = MAX a f ( X ) Where tl, t2, t3, t4 are threshold value, f(.) is frequency, R is the set of right adjacent strings of X, tiER and \[R I means the number of unique left adjacent strings.</Paragraph>
      <Paragraph position="5"> The extracted complete compounds should have neither left nor fight context dependency.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="132" end_page="136" type="metho">
    <SectionTitle>
3 Experimental results
</SectionTitle>
    <Paragraph position="0"> In our experiments, three corpora were used to test the performance of the presented approach.</Paragraph>
    <Paragraph position="1"> These corpora are described in table 1. Corpus A consists of local news with more than 325 million characters. Corpus B consists of documents from different domains of novel, news, technique report, etc., with approximately 650 million characters. Corpus C consists of People's Daily news and Xinhua news from TREC5 and TREC6 (Harman and Voorhees, 1996) with 75 million characters.</Paragraph>
    <Paragraph position="2">  'cCOrpus \[TREC 5/6 Chinese 75 M ,, ,coMus I I In the first experiment, we test the perfomaance of our method on corpus A, which is homogeneity in style. We then use corpus B in the second experiment to test if the method works as well on the corpus that is heterogeneity in style. We also use different parameter settings in order to figure out the best combination of the two statistical features, i.e. mutual information and context dependency. In the third experiment, we apply the results of the method to information retrieval system. We extract new compounds on corpus C, and add them to the indexing lexicon, and we achieve a higher average precision-recall. In all experiments, corpora are segmented automatically into words using a lexicon consisting of 65,502 entries.</Paragraph>
    <Section position="1" start_page="133" end_page="134" type="sub_section">
      <SectionTitle>
3.1 Compounds Extraction from Homogeneous
Corpus
</SectionTitle>
      <Paragraph position="0"> Corpus A contains political and economic news.</Paragraph>
      <Paragraph position="1"> In this series of tests, we gradually loosen the conditions to form a compound, i.e. MI threshold becomes smaller and MaxL/MaxR becomes larger.</Paragraph>
      <Paragraph position="2"> Results for quad-graras, tri-graras and bi-grams are shown in tables 2,3,4. Some compounds extracted are fisted in table 5.</Paragraph>
      <Paragraph position="3">  It turns out that our algorithm successfully extracted a large number of new compounds (&gt;50000) from raw texts. Compared with previous methods described in the next section, the precision is very high. We can also find that there is little precision loss when we loose restriction. The result may be due to three reasons. First, the two statistical features really characterize the nature of compounds, and provide a simple and efficient way to estimate the possibility of a word sequence being a compound. Second, the corpus we use is very large. It is always true that more data leads to better results. Third, the corpus we used in this experiment is homogeneity in style.</Paragraph>
      <Paragraph position="4"> The raw corpus is composed of news on politics, economy, science and technology. These are formal articles, and the sentences and compounds are well normalized and strict. This is very helpful for compound extraction.</Paragraph>
    </Section>
    <Section position="2" start_page="134" end_page="136" type="sub_section">
      <SectionTitle>
3.2 Compounds Extraction from Heterogeneous
Corpus
</SectionTitle>
      <Paragraph position="0"> In this experiment, we use a heterogeneous corpus. It is a combination of corpus A, and some other novels, technique reports, etc. For simplicity, we discuss the extraction of bi-gram compounds only. In comparison with the first experiment, we find that the precision is strongly affected by the corpus we used. As shown in table 6, for each corpus, we use the same parameter setting, say MI &gt;0.005, LSize &gt;3, MaxL &lt;0.5,  As we mentioned early, the larger the corpus we use, the better results we obtain. Therefore, we intuitively expect better result on corpus B, which is larger than corpus A. But, the result shown in table 6 is just the opposite.</Paragraph>
      <Paragraph position="1"> There are mainly two reasons for this. The first one is that our method works better on homogeneous corpus than on heterogeneous corpus. The second one is that it might not be suitable to use the same parameter settings on two different corpora. We then try different parameter settings on corpus B.</Paragraph>
      <Paragraph position="2"> There are two groups of parameters. MI measures the correlation between adjacent words, and other four parameters, namely LSize, RSize, MaxL, and MaxR, measure the context dependency. Therefore, each time, we fix one parameter, and relax another from fight to loose to see what happens. The Number of extracted compounds and precision of each parameter setting are shown in table 7.</Paragraph>
      <Paragraph position="3">  Table 7 shows the extraction results with different parameters. These results fit our intuition. While parameters become more and more strict, less and less compounds are found and precisions become higher. This phenomena is also illustrated in figure 2 and 3, in which the &amp;quot;correct compounds extracted&amp;quot; is an estimation from tableT, i.e. number of compounds found x precision. (These two figures are very useful for one who wants to automatically extract a new lexicon with pre-defined size from a large corpus.)  The precision of extraction is estimated in the following way. We extract a set of compounds based on a seres of pre-defined parameter set. For each set of compotinds, we randomly select 200 compounds. Then we merge those selected compounds to a new file for manually check. This file consists of about 9,800 new compounds because there are 49 compounds lists. One person will group these 'compounds' into two sets, say set A and set B. Set A contains the items that are considered to be correct, and set B contains incorrect ones. Then for each original group of about 200 compounds we select in the first step, we check how many items that also appear in set A and how many items in set B. Suppose these two values are al and bl, then we estimate the precision as al/(al+bl).</Paragraph>
      <Paragraph position="4"> So, there are two important points in our evaluation process. First, it is difficult to give a definition of the term &amp;quot;compound&amp;quot; to be accepted popularly. Different people may have different judgement. Only one person takes part in the evaluation in our experiment. This can eliminate the effect of divergence among different persons. Second, we merge those items together. This can eliminate the effect of different time period. One may feel tired after checked too many items. If he checks those 49 files one by one, the latter results are incomparable with the previous one.</Paragraph>
      <Paragraph position="5"> The precisions estimated by the above method are not exactly correct. However, as described above, the precisions of different parameter settings are comparable. In this experiment, what we want to show is how the parameter settings affect the results.</Paragraph>
      <Paragraph position="6"> Both MI and CD can affect number of extracted compounds, as shown in table 7. Compared with MI, CD has stronger effect in this aspect. For each row in table 7, numbers of extracted compounds finally decrease to 10% of that showed in the first column. For each column, while MI changes from 0.0002 to 0.0014, the number is decreased of about 20%. This may be explained by the fact that it is difficult for candidate to fulfill all four restrictions in CD simultaneously. Many disqualified candidates are cut off. Table 7 lists the precisions of extracted results. It shows that there is no clear increasing/decreasing pattern in each row. That is to say, CD doesn't strongly affect the precision. When we check each column, we can see that precision is in a growing progress. As we defined above, MI and CD are two different measurements. What role they play in our extraction procedure? Our conclusion is that mutual information mainly affects the precision while context dependency mainly affects the count of extracted items. This conclusion is also confirmed by Fig2 and Fig3. That is, the curves in Fig2 are more fiat than corresponding curves in Fig3.</Paragraph>
    </Section>
    <Section position="3" start_page="136" end_page="136" type="sub_section">
      <SectionTitle>
3.3 Testing the Extracted Compounds in
Information Retrieval
</SectionTitle>
      <Paragraph position="0"> In this experiment, we apply our method to improve information retrieval results. We use SMART system (Buckley 1985) for our experiments. SMART is a robust, efficient and flexible information retrieval system. The corpus used in this experiment is TREC Chinese corpus (Harman and Voorhees, 1996). The corpus contains about 160,000 articles, including articles published in the People's Daily from 1991 to 1993, and a part of the news released by the Xinhua News Agency in 1994 and 1995. A set of 54 queries has been set up and evaluated by people in NIST(Nafional Institute of Standards and Technology).</Paragraph>
      <Paragraph position="1"> We first use an initial lexicon consisting of 65,502 entries to segment the corpus. When running SMART on the segmented corpus, we obtain an average precision of 42.90%.</Paragraph>
      <Paragraph position="2"> Then we extract new compounds from the segmented corpus, and add them into the initial lexicon. With the new lexicon, the TREC Chinese corpus is re-segmented. When running SMART on this re-segmented corpus, we obtain an average precision of 43.42%, which shows a slight improvement of 1.2%.</Paragraph>
      <Paragraph position="3"> Further analysis shows that the new lexicon brings positive effect to 10 queries and negative effect to 4 queries. For other 40 queries, there is no obvious effect. Some improved queries are listed in table 8 as well as new compounds being contained.</Paragraph>
      <Paragraph position="4"> As an example, we give the segmentation results with the two lexicons for query 23 in table  Query 23 segment with small lexicon , bk Query 23 segment with new lexicon Another interesting example is query 30. There is no new compound extracted from that query. Its result is also improved significantly because its relevant documents are segmented better than before.</Paragraph>
      <Paragraph position="5"> Because the compounds extracted from the corpus are not exactly correct, the new lexicon will bring negative effect to some queries, such as query 10. The retrieval precision changes from 0.3086 to 0.1359. The main reason is that &amp;quot;~ \[\]~&amp;quot;(Chinese XinJiang) is taken as a new compound in the query.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="136" end_page="138" type="metho">
    <SectionTitle>
4 Related works
</SectionTitle>
    <Paragraph position="0"> Several methods have been proposed for extracting compounds from corpus by statistical approaches. In this section, we will briefly describe some of them.</Paragraph>
    <Paragraph position="1"> (Lee-Feng Chien 1997) proposed an approach based on PAT-Tree to automatically extracting domain specific terms from online text collections. Our method is primary derived from (Lee-Feng Chien 1997), and use the similar statistical features, i.e. mutual informan'on and context dependency. The difference is that we use n-gram instead of PAT-Tree, due to the efficiency issue. Another difference lies in the experiments. In Chien's work, only domain specific terms are extracted from domain specific corpus, and the size of the corpus is relatively small, namely 1,872 political news abstracts.</Paragraph>
    <Paragraph position="2"> (Cheng-Huang Tung and His-Jian Lee 1994) also presented an efficient method for identifying unknown words from a large corpus. The statistical features used consist of string (character sequence) frequency and entropy of left/fight neighbonng characters (similar to left/fight context dependency). The corpus consists of 178,027 sentences, representing a total of more than 2 million Chinese characters. 8327 unknown words were identified and 5366 items of them were confirmed manually.</Paragraph>
    <Paragraph position="3"> (Ming-Wen Wu and Keh-Yih Su 1993) presented a method using mutual information and relative frequency. 9,124 compounds are extracted from the corpus consists of 74,404 words, with the precision of 47.43%. In this method, the compound extraction problem is formulated as classification problem. Each bi-grarn (tri-grarn) is assigned to one of those two clusters. It also needs a training corpus to estimate parameters for classification model. In our method, we didn't  make use of any training corpus. Another difference is that they use the method for English compounds extraction while we extract Chinese compounds in our experiments.</Paragraph>
    <Paragraph position="4"> (Pascale Fung 1998) presented two simple systems for Chinese compound extraction----CXtract. CXtract uses predominantly statistical lexical information to find term boundaries in large text. Evaluations on the corpus consisting of 2 million characters show that the average precision is 54.09%.</Paragraph>
    <Paragraph position="5"> We should note that since the experiment setup and evaluation systems of the methods mentioned above are not identical, the results are not comparable. However, by showing our experimental results on much larger and heterogenous corpus, we can say that our method is an efficient and robust one.</Paragraph>
  </Section>
class="xml-element"></Paper>