<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1126"> <Title>Discriminative Pruning of Language Models for Chinese Word Segmentation</Title>
<Section position="5" start_page="1001" end_page="1004" type="metho"> <SectionTitle> 3 Discriminative Pruning for Chinese Word Segmentation </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="1001" end_page="1002" type="sub_section"> <SectionTitle> 3.1 Problem Definition </SectionTitle>
<Paragraph position="0"> In this paper, the discussion is restricted to the bigram language model P(w_i | w_{i-1}). In a bigram model, three kinds of parameters are involved: the bigram probability P(w_i | w_{i-1}) for each bigram w_{i-1}w_i seen in the training corpus, the unigram probability P(w_i), and the backoff coefficient α(w_{i-1}). As equation (3) shows, the probability of an unseen bigram is computed as the product of the unigram probability and the corresponding backoff coefficient:

P(w_i | w_{i-1}) = α(w_{i-1}) · P(w_i)    (3)

If we remove a seen bigram from the model, we can still yield a bigram probability for it by regarding it as an unseen bigram. Thus, we can reduce the number of bigram probabilities explicitly stored in the model, and the model size decreases. This is the foundation of bigram model pruning.</Paragraph>
<Paragraph position="1"> The research issue is to find an effective criterion to compute the "importance" of each bigram. Here, "importance" indicates the performance loss caused by pruning the bigram. Generally, given a target model size, the method for language model pruning is described in Figure 1.

Figure 1. Language model pruning algorithm:
1. Given the desired model size, compute the number of bigrams that should be pruned. The number is denoted as m;
2. Compute the "importance" of each bigram;
3. Sort all bigrams in the language model according to their "importance";
4. Remove the m most "unimportant" bigrams from the model;
5. Re-compute the backoff coefficients in the model.</Paragraph>
<Paragraph position="2"> In fact, deciding which bigrams should be excluded from the model is equivalent to deciding which bigrams should be included in the model. Hence, we suggest a growing algorithm through which a model of the desired size can also be achieved. It is illustrated in Figure 2. Here, two terms are introduced: the full-bigram model is the unpruned model containing all bigrams seen in the training corpus, and the base model is, at this point, the unigram model.

Figure 2. Growing algorithm:
1. Given the desired model size, compute the number of bigrams that should be added into the base model. The number is denoted as n;
2. Compute the "importance" of each bigram included in the full-bigram model but excluded from the base model;
3. Sort the bigrams according to their "importance";
4. Add the n most "important" bigrams into the base model;
5. Re-compute the backoff coefficients in the base model.</Paragraph>
<Paragraph position="3"> For the discriminative pruning method suggested in this paper, the growing algorithm, instead of the pruning algorithm, is applied to generate the model of the desired size. Accordingly, the "importance" of each bigram indicates the performance improvement caused by adding the bigram into the base model.</Paragraph> </Section>
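As a concrete illustration of Figure 1, the following is a minimal Python sketch of pruning a Katz-style backoff bigram model held in plain dictionaries. The dictionary layout, the `importance` argument (which would be supplied by a pruning criterion such as the one in Section 3.2), and the backoff renormalization formula are assumptions of this sketch, not code from the paper.

```python
import math

def recompute_backoff(bigram_prob, unigram_prob):
    """Recompute alpha(w_prev) so that probabilities sum to one:
    alpha(w_prev) = (1 - sum of kept P(w|w_prev)) / (1 - sum of P(w) over kept w)."""
    alpha, kept = {}, {}
    for (w_prev, w), p in bigram_prob.items():
        kept.setdefault(w_prev, []).append((w, p))
    for w_prev in unigram_prob:
        pairs = kept.get(w_prev, [])
        num = 1.0 - sum(p for _, p in pairs)
        den = 1.0 - sum(unigram_prob[w] for w, _ in pairs)
        alpha[w_prev] = num / den if den > 0 else 1.0
    return alpha

def prune(bigram_prob, unigram_prob, importance, m):
    """Figure 1: remove the m least 'important' bigrams and re-normalize."""
    ranked = sorted(bigram_prob, key=lambda bg: importance[bg])  # least important first
    for bg in ranked[:m]:
        del bigram_prob[bg]          # the bigram now backs off to alpha * unigram
    alpha = recompute_backoff(bigram_prob, unigram_prob)
    return bigram_prob, alpha

def bigram_logprob(w_prev, w, bigram_prob, unigram_prob, alpha):
    """Equation (3): unseen (or pruned) bigrams back off to alpha(w_prev) * P(w)."""
    if (w_prev, w) in bigram_prob:
        return math.log(bigram_prob[(w_prev, w)])
    return math.log(alpha.get(w_prev, 1.0) * unigram_prob[w])
```

The growing algorithm of Figure 2 is the mirror image: instead of deleting the m lowest-ranked bigrams from the full model, it copies the n highest-ranked bigrams into the base model and then recomputes the backoff coefficients in the same way.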
<Section position="2" start_page="1002" end_page="1002" type="sub_section"> <SectionTitle> 3.2 Discriminative Pruning Criterion </SectionTitle>
<Paragraph position="0"> Given a Chinese character string S, a word segmentation system chooses a sequence of words W* as the segmentation result, satisfying:

W* = argmax_W [log P(W) + log P(S | W)]    (4)

The sum of the two logarithm probabilities in equation (4) is called the discriminant function:

g(S, W; L, G) = log P(W; G) + log P(S | W; L)    (5)

where G denotes the language model that is used to compute P(W), and L denotes the generative model that is used to compute P(S|W). In language model pruning, L is held fixed.</Paragraph>
<Paragraph position="1"> The discriminative pruning criterion is inspired by comparing the sentences segmented with the full-bigram model G_F and with the base model G_B. Given a sentence S, the full-bigram model chooses W_F* and the base model chooses W_B* as the segmentation result, satisfying:

W_F* = argmax_W g(S, W; L, G_F)    (6)
W_B* = argmax_W g(S, W; L, G_B)    (7)

Here, given a language model G, we define a misclassification function representing the difference between the discriminant functions of W_B* and W_F*:

d(S; L, G) = g(S, W_B*; L, G) - g(S, W_F*; L, G)    (8)

The misclassification function reflects which one of W_F* and W_B* is inclined to be chosen as the segmentation result. If W_F* ≠ W_B*, we may extract some hints from the comparison of the two and select a few valuable bigrams. By adding these bigrams to the base model, we should make the model choose the correct answer between W_F* and W_B*. If W_F* = W_B*, no hints can be extracted.</Paragraph>
<Paragraph position="2"> Let W_0 be the known correct word sequence. Under the precondition W_F* ≠ W_B*, we describe our method in the following three cases.</Paragraph>
<Paragraph position="3"> Case 1: W_F* = W_0 and W_B* ≠ W_0. Here, the full-bigram model chooses the correct answer, while the base model does not. Based on equations (6), (7) and (8), we know that d(S; L, G_F) < 0 and d(S; L, G_B) > 0. It implies that adding bigrams into the base model may lead the misclassification function from positive to negative. Which bigram should be added depends on the variation of the misclassification function caused by adding it. If adding a bigram makes the misclassification function smaller, it should be added with higher priority.</Paragraph>
<Paragraph position="4"> We add each bigram individually to G_B and compute the resulting variation of the misclassification function; in the derivation, n(w_{i-1}w_i; W) denotes the number of times the bigram w_{i-1}w_i appears in a word sequence W.</Paragraph>
<Paragraph position="5"> Note that in equation (12), the base model is treated as a bigram model instead of a unigram model. The reason lies in two respects. First, the unigram model can be regarded as a particular bigram model in which all backoff coefficients are set to 1. Second, the base model is not always a unigram model during the step-by-step growing algorithm, which will be discussed in the next subsection.</Paragraph>
<Paragraph position="6"> The bigram probability added to the base model is extracted from the full-bigram model, so P'(w_i | w_{i-1}) equals the probability stored in the full-bigram model. In addition, similar deductions can be conducted on the second bracket in equation (11). Thus, we obtain the variation of the misclassification function caused by adding a bigram, as given in equation (13).</Paragraph>
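To make the bookkeeping behind this derivation concrete, here is a small sketch based only on equations (3), (5) and (8): the change in the misclassification function caused by copying one bigram from the full-bigram model into the base model, under the simplifying assumption that the backoff coefficients are left untouched. It is one plausible reading of the quantity referred to in equation (13), not a reproduction of the paper's exact formula.

```python
import math

def added_bigram_gain(w_prev, w, full_bigram_prob, unigram_prob, alpha_base):
    """Per-occurrence change in log P(W) under the base model when the bigram
    (w_prev, w) is copied in from the full-bigram model: its occurrences stop
    backing off to alpha(w_prev) * P(w) (equation (3)) and use P_F(w|w_prev)
    instead. Backoff coefficients are assumed unchanged in this sketch."""
    new_lp = math.log(full_bigram_prob[(w_prev, w)])
    old_lp = math.log(alpha_base.get(w_prev, 1.0) * unigram_prob[w])
    return new_lp - old_lp

def count_bigram(bigram, words):
    """n(bigram; W): number of times the bigram occurs in word sequence W."""
    return sum(1 for pair in zip(words, words[1:]) if pair == bigram)

def delta_misclassification(bigram, w_b_star, w_f_star,
                            full_bigram_prob, unigram_prob, alpha_base):
    """Change in d(S) = g(S, W_B*) - g(S, W_F*) (equation (8)) caused by adding
    the bigram: W_B*, W_F* and the generative term are fixed, so only the
    language-model scores change, and only where the bigram occurs. The change
    is therefore the count difference times the per-occurrence gain."""
    gain = added_bigram_gain(bigram[0], bigram[1],
                             full_bigram_prob, unigram_prob, alpha_base)
    return (count_bigram(bigram, w_b_star) - count_bigram(bigram, w_f_star)) * gain
```

Under this reading, a bigram that occurs in W_F* but not in W_B*, and that the full model scores above its backoff estimate, yields a negative change, driving the misclassification function from positive toward negative, as case 1 requires.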
<Paragraph position="7"> In cases 1 and 2, bigrams are added so that the discriminant function of the correct word sequence becomes bigger and that of the incorrect word sequence becomes smaller. In case 3, both W_F* and W_B* are incorrect. Thus, the misclassification function in equation (8) does not represent the likelihood that S will be incorrectly segmented, and the variation of the misclassification function in equation (13) cannot be used to measure the "importance" of a bigram. Here, sentence S is ignored, and the "importance" of every bigram on S is zero.</Paragraph>
<Paragraph position="8"> The above three cases are designed for one sentence. The "importance" of each bigram on the whole training corpus is the sum of its "importance" on each single sentence.</Paragraph> </Section>
<Section position="3" start_page="1002" end_page="1004" type="sub_section"> <SectionTitle> An Example of Computing the "Importance" of Bigrams </SectionTitle>
<Paragraph position="0"> We illustrate the process of computing the "importance" of bigrams with a simple example. Suppose S is "Zhe (zhe4) Yang (yang4) Cai (cai2) Neng (neng2) Geng (geng4) Fang (fang1) Bian (bian4)". The segmented result using the full-bigram model is "ZheYang (zhe4yang4) / Cai (cai2) / Neng (neng2) / Geng (geng4) / FangBian (fang1bian4)", which is the correct word sequence. The segmented result using the base model is "ZheYang (zhe4yang4) / CaiNeng (cai2neng2) / Geng (geng4) / FangBian (fang1bian4)". Obviously, this matches case 1. The bigram "ZheYang Cai" (zhe4yang4 cai2) occurs once in W_F* and does not occur in W_B*, so its "importance" on sentence S is obtained from equation (13) with these counts.</Paragraph> </Section>
<Section position="4" start_page="1004" end_page="1004" type="sub_section"> <SectionTitle> 3.3 Step-by-step Growing </SectionTitle>
<Paragraph position="0"> Given the target model size, we can add the exact number of bigrams to the base model at one time by using the growing algorithm illustrated in Figure 2. But it is more suitable to adopt the step-by-step growing algorithm illustrated in Figure 4.

Figure 4. Step-by-step growing algorithm:
1. Given step size s;
2. Set the base model to be the unigram model;
3. Segment the corpus with the full-bigram model;
4. Segment the corpus with the base model;
5. Compute the "importance" of each bigram included in the full-bigram model but excluded from the base model;
6. Sort the bigrams according to their "importance";
7. Add the s bigrams with the biggest "importance" to the base model;
8. Re-compute the backoff coefficients in the base model;
9. If the base model is still smaller than the desired size, go to step 4; otherwise, stop.</Paragraph>
<Paragraph position="1"> As shown in equation (13), the "importance" of each bigram depends on the base model. Initially, the base model is set to the unigram model. With bigrams added in, it becomes a growing bigram model, so the base-model segmentation W_B* and the misclassification function change. Hence, the bigrams already added affect the calculation of the "importance" of the bigrams still to be added. Generally, adding more bigrams at one time leads to more of this negative impact. Thus, it is expected that models produced by the step-by-step growing algorithm achieve better performance than those produced by the one-shot growing algorithm, and that a smaller step size leads to even better performance.</Paragraph> </Section> </Section>
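A minimal sketch of the loop in Figure 4 is given below. The model interface (num_bigrams, bigrams, add_bigram, bigram_prob, recompute_backoff) and the helper functions segment_corpus and compute_importance are hypothetical placeholders standing in for the Viterbi segmenter of Section 4.1 and the criterion of Section 3.2; the positive-importance stopping test is one interpretation of the saturation behaviour described in Section 4.5.

```python
def step_by_step_growing(full_model, unigram_model, corpus, step_size, target_size,
                         segment_corpus, compute_importance):
    """Figure 4: grow the base model by step_size bigrams per iteration,
    re-segmenting the corpus with the current base model each time."""
    base_model = unigram_model                        # step 2
    full_segs = segment_corpus(corpus, full_model)    # step 3 (fixed across iterations)
    while base_model.num_bigrams() < target_size:     # step 9
        base_segs = segment_corpus(corpus, base_model)             # step 4
        candidates = set(full_model.bigrams()) - set(base_model.bigrams())
        importance = compute_importance(candidates, full_segs, base_segs,
                                        full_model, base_model)    # step 5
        ranked = [bg for bg in sorted(candidates,                  # step 6
                                      key=lambda bg: importance[bg], reverse=True)
                  if importance[bg] > 0]
        if not ranked:              # saturation state: nothing useful left to add
            break
        for bg in ranked[:step_size]:                              # step 7
            base_model.add_bigram(bg, full_model.bigram_prob(bg))
        base_model.recompute_backoff()                             # step 8
    return base_model
```

Re-segmenting with the current base model at every iteration is what distinguishes this loop from the one-shot growing algorithm of Figure 2, at the cost of repeated decoding passes over the training corpus.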
<Section position="6" start_page="1004" end_page="1006" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="1004" end_page="1004" type="sub_section"> <SectionTitle> 4.1 Experiment Settings </SectionTitle>
<Paragraph position="0"> The training corpus comes from People's Daily 2000 and contains about 25 million Chinese characters. It is manually segmented into word sequences according to the word segmentation specification of Peking University (Yu et al., 2003). The testing text, provided by Peking University, comes from the second international Chinese word segmentation bakeoff organized by SIGHAN. It is a part of People's Daily 2001, consisting of about 170K Chinese characters.</Paragraph>
<Paragraph position="1"> The vocabulary is automatically extracted from the training corpus, and words occurring only once are removed. Finally, about 67K words are included in the vocabulary.</Paragraph>
<Paragraph position="2"> The full-bigram model and the unigram model are trained with the CMU language model toolkit (Clarkson and Rosenfeld, 1997). Without any count cut-off, the full-bigram model contains about 2 million bigrams.</Paragraph>
<Paragraph position="3"> The word segmentation system is developed based on a source-channel model similar to that described in (Gao et al., 2003). The Viterbi algorithm is applied to find the best word segmentation path.</Paragraph> </Section>
<Section position="2" start_page="1004" end_page="1004" type="sub_section"> <SectionTitle> 4.2 Evaluation Metrics </SectionTitle>
<Paragraph position="0"> The language models built in our experiments are evaluated by two metrics. One is the F-Measure of the word segmentation result; the other is language model perplexity.</Paragraph>
<Paragraph position="1"> For F-Measure evaluation, we first segment the raw testing text using the model to be evaluated. Then, the segmented result is evaluated by comparison with the gold standard set, using the evaluation tool from the word segmentation bakeoff. F-Measure is calculated as:

F-Measure = 2 · Precision · Recall / (Precision + Recall)</Paragraph>
<Paragraph position="2"> For perplexity evaluation, the language model to be evaluated is used to provide the bigram probability of each word in the testing text. The perplexity is derived from the mean logarithm probability, as shown in equation (18):

PP = 2^( -(1/N) Σ_{i=1..N} log2 P(w_i | w_{i-1}) )    (18)</Paragraph> </Section>
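For illustration, the sketch below computes word-segmentation precision, recall, and F-Measure by matching character spans; it mimics what the bakeoff scoring tool reports but is not that tool. The usage example reuses the sentence from Section 3.2 (zhe4yang4 cai2 neng2 geng4 fang1bian4).

```python
def to_spans(words):
    """Convert a segmented sentence (list of words) into character-offset spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def segmentation_f_measure(gold_sentences, pred_sentences):
    """Precision = correct words / predicted words, Recall = correct words / gold
    words, F = 2PR/(P+R). A predicted word is correct if its character span
    coincides with a gold word's span."""
    correct = n_pred = n_gold = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        correct += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = correct / n_pred
    recall = correct / n_gold
    return 2 * precision * recall / (precision + recall)

# Gold "这样/才/能/更/方便" vs. base-model prediction "这样/才能/更/方便"
print(segmentation_f_measure([["这样", "才", "能", "更", "方便"]],
                             [["这样", "才能", "更", "方便"]]))   # about 0.667
```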
<Section position="3" start_page="1004" end_page="1005" type="sub_section"> <SectionTitle> 4.3 Comparison of Pruning Methods </SectionTitle>
<Paragraph position="0"> The Kullback-Leibler distance (KLD) based method is the state-of-the-art method and is taken as the baseline. (Our pilot study shows that the method based on Kullback-Leibler distance outperforms the methods based on the other criteria introduced in Section 2.) The pruning algorithm illustrated in Figure 1 is used for KLD based pruning. The growing algorithms illustrated in Figure 2 and Figure 4 are used for the discriminative pruning method. Growing algorithms are not applied to KLD based pruning, because the computation of KLD is independent of the base model. At step 1 of KLD based pruning, m is set so as to produce ten models containing 10K, 20K, ..., 100K bigrams. We apply each of the models to the word segmentation system and evaluate the segmented results with the evaluation tool. The F-Measures of the ten models are illustrated in Figure 5, denoted by "KLD".</Paragraph>
<Paragraph position="1"> For the discriminative pruning criterion, the growing algorithm illustrated in Figure 2 is used first, with the unigram model as the base model. At step 1, n is set to 10K, 20K, ..., 100K separately. At step 2, the "importance" of each bigram is computed following Figure 3. Ten models are produced and evaluated. Their F-Measures are also illustrated in Figure 5, denoted by "Discrim".</Paragraph>
<Paragraph position="2"> By adding bigrams step by step as illustrated in Figure 4, with the step size set to 10K, 5K, and 2K separately, we obtain another three series of models, denoted by "Step-10K", "Step-5K" and "Step-2K" in Figure 5. We also include in Figure 5 the performance of the count cut-off method; it is clearly inferior to the other methods.</Paragraph>
<Paragraph position="3"> First, we compare the performance of "KLD" and "Discrim". When the model size is small, such as for the models containing fewer than 70K bigrams, "Discrim" performs better than "KLD". For the models containing more than 70K bigrams, "KLD" performs better than "Discrim". The reason is that the added bigrams affect the calculation of the "importance" of the bigrams still to be added, as discussed in Section 3.3.</Paragraph>
<Paragraph position="4"> If we add the bigrams step by step, better performance is achieved. From Figure 5, it can be seen that all of the models generated by the step-by-step growing algorithm outperform "KLD" and "Discrim" consistently. Compared with the baseline KLD based method, the step-by-step growing methods result in at least a 0.2 percent improvement for each model size.</Paragraph>
<Paragraph position="5"> Comparing "Step-10K", "Step-5K" and "Step-2K", they perform differently before the 60K-bigram point and almost identically after it. The reason is that they are approaching their saturation states, which will be discussed in Section 4.5. Before the 60K-bigram point, a smaller step size yields better performance.</Paragraph>
<Paragraph position="6"> An example of a detailed comparison is shown in Table 1, where the F-Measure is 96.33%. The last column shows the relative model sizes with respect to the KLD pruned model. It shows that, at an F-Measure of 96.33%, the number of bigrams decreases by up to 90%.</Paragraph> </Section>
<Section position="4" start_page="1005" end_page="1006" type="sub_section"> <SectionTitle> 4.4 Correlation between Perplexity and F-Measure </SectionTitle>
<Paragraph position="0"> Perplexities of the models built above are evaluated over the gold standard set. Figure 6 shows how the perplexities vary with the number of bigrams in the models. Here, we notice that the KLD models achieve the lowest perplexities. This is not a surprising result, because the goal of KLD based pruning is to minimize the Kullback-Leibler distance, which can be interpreted as a relative change of perplexity (Stolcke, 1998).</Paragraph>
<Paragraph position="1"> Now we compare Figure 5 and Figure 6. Perplexities of the KLD models are much lower than those of the other models, but their F-Measures are much worse than those of the step-by-step growing models. It implies that lower perplexity does not always lead to higher F-Measure.</Paragraph>
<Paragraph position="2"> However, when the comparison is restricted to a single pruning method, the case is different. For each pruning method, as more bigrams are included in the model, the perplexity curve falls and the F-Measure curve rises, which implies that the two are correlated. We compute the Pearson product-moment correlation coefficient between perplexity and F-Measure for each pruning method, as listed in Table 2. It shows that within each method the correlation is very strong.</Paragraph>
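For reference, the Pearson product-moment correlation coefficient used here can be computed with the textbook formula below; this is not code from the paper, and the numbers in the usage example are made up for illustration (perplexity falling as F-Measure rises gives a strong negative correlation), not measurements from the experiments.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical illustration only: monotonically falling perplexity with rising
# F-Measure yields a correlation close to -1.
perplexities = [520.0, 480.0, 450.0, 430.0, 415.0]
f_measures = [0.958, 0.960, 0.962, 0.963, 0.964]
print(pearson(perplexities, f_measures))
```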
<Paragraph position="3"> To sum up, the correlation between language model perplexity and system performance (here represented by F-Measure) depends on whether the models come from the same pruning method. If so, the correlation is strong; otherwise, it is weak.</Paragraph> </Section>
<Section position="5" start_page="1006" end_page="1006" type="sub_section"> <SectionTitle> 4.5 Combination of Saturated Model and KLD </SectionTitle>
<Paragraph position="0"> The above experimental results show that the step-by-step growing models achieve the best performance when fewer than 100K bigrams are added. Unfortunately, they cannot grow to an arbitrary desired size. A bigram has no chance to be added into the base model unless it appears in the misaligned part of the segmented corpus, where W_F* ≠ W_B*, and it is likely that not all bigrams get this opportunity. As more and more bigrams are added into the base model, the training corpus segmented with the current base model approaches the one segmented with the full-bigram model. Gradually, no bigram can be added into the current base model. At that point, the model stops growing and reaches its saturation state. A model that has reached its saturation state is called a saturated model. In our experiments, the three step-by-step growing models reach their saturation states when about 100K bigrams have been added.</Paragraph>
<Paragraph position="1"> By combining with the baseline KLD based method, we obtain models that outperform the baseline for any model size. We combine them as follows. If the desired model size is smaller than that of the saturated model, step-by-step growing is applied. Otherwise, the Kullback-Leibler distance is used for further growing over the saturated model. For instance, by growing over the saturated model of "Step-2K", we obtain combined models containing from 100K to 2 million bigrams. The performance of the combined models and that of the baseline KLD models are illustrated in Figure 7. It shows that the combined models perform consistently better than the KLD models over the whole range of bigram numbers. Finally, the two curves converge at the performance of the full-bigram model.</Paragraph> </Section> </Section> </Paper>