<?xml version="1.0" standalone="yes"?> <Paper uid="P00-1073"> <Title>Distribution-Based Pruning of Backoff Language Models</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Backoff Bigram and Cutoff </SectionTitle> <Paragraph position="0"> One of the most successful forms of SLM is the n-gram LM. An n-gram LM estimates the probability of a word given the n-1 previous words, $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$. In practice, n is usually set to 2 (bigram) or 3 (trigram). For simplicity, we restrict our discussion to the bigram, $P(w_i \mid w_{i-1})$,</Paragraph> <Paragraph position="2"> which assumes that the probability of a word depends only on the identity of the immediately preceding word. Our approach, however, extends to any n-gram.</Paragraph> <Paragraph position="3"> Perplexity is the most common metric for evaluating a bigram LM. It is defined as $PP = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{i-1})}$ (1)</Paragraph> <Paragraph position="5"> where N is the length of the testing data. The perplexity can be roughly interpreted as the geometric mean of the branching factor of the document when presented to the language model. Clearly, lower perplexities are better.</Paragraph> <Paragraph position="6"> One of the key issues in language modelling is the problem of data sparseness. To deal with the problem, (Katz, 1987) proposed a backoff scheme, which is widely used in bigram language modelling. The backoff scheme estimates the probability of an unseen bigram by utilizing unigram estimates. It is of the form: $P(w_i \mid w_{i-1}) = P_d(w_i \mid w_{i-1})$ if $c(w_{i-1}, w_i) > 0$, and $P(w_i \mid w_{i-1}) = \alpha(w_{i-1}) P(w_i)$ otherwise (2)</Paragraph> <Paragraph position="8"> where $c(w_{i-1}, w_i)$ is the frequency of the word pair $(w_{i-1}, w_i)$ in training data, $P_d$ represents the Good-Turing discounted estimate for seen word pairs, and $\alpha(w_{i-1})$ is a normalization factor.</Paragraph> <Paragraph position="9"> Due to the memory limitation in realistic applications, only a finite set of word pairs have conditional probabilities $P(w_i \mid w_{i-1})$ explicitly</Paragraph> <Paragraph position="11"> represented in the model, especially when the model is trained on a large corpus. The remaining word pairs are assigned a probability by back-off (i.e. unigram estimates). 
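The backoff lookup and the perplexity measure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy probability tables and the names unigram, bigram, backoff_weight, bigram_prob, and perplexity are all hypothetical, the discounted estimates are simply given rather than derived by Good-Turing discounting, and N is taken to be the number of predicted words.

```python
import math

# Hypothetical toy parameters; every number here is illustrative only.
unigram = {"a": 0.5, "b": 0.3, "c": 0.2}      # unigram estimates P(w)
bigram = {("a", "b"): 0.6, ("a", "a"): 0.2}   # explicit discounted estimates P_d(w | prev)


def backoff_weight(prev):
    """Normalization factor for prev: leftover bigram mass over leftover unigram mass.

    Assumes at least one word is unseen after prev, so the denominator is non-zero.
    """
    seen = [w for (p, w) in bigram if p == prev]
    left_mass = 1.0 - sum(bigram[(prev, w)] for w in seen)
    left_unigram = 1.0 - sum(unigram[w] for w in seen)
    return left_mass / left_unigram


def bigram_prob(prev, word):
    """Use the explicit estimate if stored, otherwise back off to the scaled unigram."""
    if (prev, word) in bigram:
        return bigram[(prev, word)]
    return backoff_weight(prev) * unigram[word]


def perplexity(words):
    """Compute 2 ** (-(1/N) * sum of log2 P(w_i | w_{i-1})) over consecutive word pairs."""
    pairs = list(zip(words, words[1:]))
    log_sum = sum(math.log2(bigram_prob(p, w)) for p, w in pairs)
    return 2 ** (-log_sum / len(pairs))
```

Calling perplexity(["a", "b", "c"]) scores a short sequence in which the pair (a, b) is stored explicitly and the pair (b, c) is served by the backoff path, which is the situation that arises for every bigram removed from the model.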
The goal of bigram pruning is to remove uncommon explicit bigram estimates $P(w_i \mid w_{i-1})$</Paragraph> <Paragraph position="13"> from the model to reduce the number of parameters, while minimizing the performance loss. The most common way to eliminate such estimates is by means of count cutoffs (Jelinek, 1990). A cutoff is chosen, say 2, and all probabilities stored in the model with 2 or fewer counts are removed. This method assumes that there is not much difference between a bigram occurring once, twice, or not at all. Just by excluding those bigrams with a small count from a model, a significant saving in memory can be achieved. In a typical training corpus, roughly 65% of unique bigram sequences occur only once.</Paragraph> <Paragraph position="14"> Recently, several improvements over count cutoffs have been proposed. (Seymore and Rosenfeld, 1996) proposed a different pruning scheme for backoff models, where bigrams are ranked by a weighted difference of the log probability estimate before and after pruning. Bigrams whose difference is below a threshold are pruned.</Paragraph> <Paragraph position="15"> (Stolcke, 1998) proposed a criterion for pruning based on the relative entropy between the original and the pruned model. The relative entropy measure can be expressed as a relative change in training data perplexity. All bigrams that change perplexity by less than a threshold are removed from the model. 
Stolcke also concluded that, for practical purposes, the method in (Seymore and Rosenfeld, 1996) is a very good approximation to this method.</Paragraph> <Paragraph position="16"> All previous cutoff methods described above use a similar criterion for pruning, that is, the difference (or information loss) between the original estimate and the backoff estimate.</Paragraph> <Paragraph position="17"> After ranking, all bigrams with a sufficiently small difference are pruned, since they carry little information beyond the backoff estimate.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Distribution-Based Cutoff </SectionTitle> <Paragraph position="0"> As described in the previous section, previous cutoff methods assume that training data covers testing data. Bigrams that are infrequent in training data are also assumed to be infrequent in testing data, and will be cut off. But in the real world, no matter how large the training data is, it is always very sparse compared to all data in the world. Furthermore, training data will be biased by its mixture of domain, time, or style. For example, if we use newspaper text in training, a name like &quot;Lewinsky&quot; may have high frequency in certain years but not others; if we use Gone with the Wind in training, &quot;Scarlett O'Hara&quot; will have disproportionately high probability and will not be cut off.</Paragraph> <Paragraph position="1"> We propose another approach to pruning.</Paragraph> <Paragraph position="2"> We aim to keep bigrams that are more likely to occur in a new document. We therefore propose a new criterion for pruning parameters from bigram models, based on the bigram distribution, i.e. the probability that a bigram will occur in a new document. 
All bigrams whose probability is below a threshold are removed.</Paragraph> <Paragraph position="3"> We estimate the probability that a bigram occurs in a new document by dividing training data into partitions, called subunits, and using a cross-validation-like approach. In the remaining part of this section, we first investigate several methods for term distribution modelling, and extend them to bigram distribution modelling. Then we investigate the effects of the definition of the subunit, and experiment with various ways to divide a training set into subunits. Experiments show that this not only allows a much more efficient computation for bigram distribution modelling, but also results in a more general bigram model, in spite of the domain, style, or temporal bias of training data.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Measure of Generality Probability </SectionTitle> <Paragraph position="0"> In this section, we discuss in detail how to estimate the probability that a bigram occurs in a new document. For simplicity, we define a document as the subunit of the training corpus.</Paragraph> <Paragraph position="1"> In the next section, we will loosen this constraint.</Paragraph> <Paragraph position="2"> Term distribution models estimate the probability $P_i(k)$,</Paragraph> <Paragraph position="4"> the proportion of times that a word $w_i$ appears k times in a document. In bigram distribution models, we wish to model the probability that a word pair $(w_{i-1}, w_i)$ will occur</Paragraph> <Paragraph position="6"> in a new document. The probability can be expressed as the measure of the generality of a bigram. 
Thus, in what follows, it is denoted by $P_{gen}(w_i \mid w_{i-1})$.</Paragraph> <Paragraph position="8"> The higher $P_{gen}(w_i \mid w_{i-1})$ is, for one particular document, the less informative the bigram is, but for all documents, the more general the bigram is.</Paragraph> <Paragraph position="9"> We now consider several methods for term distribution modelling, which are widely used in Information Retrieval, and extend them to bigram distribution modelling. These methods include models based on the Poisson distribution (Mood et al., 1974), inverse document frequency (Salton and Michael, 1983), and Katz's K mixture (Katz, 1996).</Paragraph> <Paragraph position="10"> The standard probabilistic model for the distribution of a certain type of event over units of a fixed size (such as periods of time or volumes of liquid) is the Poisson distribution, which is defined as follows: $P(k; \lambda_i) = e^{-\lambda_i} \frac{\lambda_i^k}{k!}$ (3)</Paragraph> <Paragraph position="12"> In the most common model of the Poisson distribution in IR, the parameter $\lambda_i > 0$ is the</Paragraph> <Paragraph position="14"> average number of occurrences of $w_i$ per document, that is, $\lambda_i = cf_i / N$, where $cf_i$ is the number of occurrences of $w_i$ in the collection and N is</Paragraph> <Paragraph position="16"> the total number of documents in the collection.</Paragraph> <Paragraph position="17"> In our case, the event we are interested in is the occurrence of a particular word pair $(w_{i-1}, w_i)$,</Paragraph> <Paragraph position="19"> and the fixed unit is the document. We can use the Poisson distribution to estimate an answer to the question: what is the probability that a word pair occurs in a document? Therefore, we obtain $P_{gen}(w_i \mid w_{i-1}) = 1 - P(0; \lambda_i) = 1 - e^{-\lambda_i}$, with $\lambda_i$ computed from the counts of the word pair (4)</Paragraph> <Paragraph position="21"> Because this quantity increases monotonically with the collection frequency of the word pair, this criterion is equivalent to count cutoff.</Paragraph> <Paragraph position="22"> IDF is a widely used measure of specificity (Salton and Michael, 1983). It is the opposite of generality. Therefore we can also derive generality from IDF. 
IDF is defined as follows: $IDF = -\log_2 \frac{df_i}{N}$ (5)</Paragraph> <Paragraph position="24"> where, in the case of bigram distribution, N is the total number of documents, and $df_i$</Paragraph> <Paragraph position="26"> is the number of documents that contain the word pair $(w_{i-1}, w_i)$. When $df_i = 1$, $IDF = \log_2 N$, which gives full weight to a word pair $(w_{i-1}, w_i)$</Paragraph> <Paragraph position="28"> that occurred in one document. Therefore, let us assume,</Paragraph> <Paragraph position="30"> It turns out that based on IDF, our criterion is equivalent to the count cutoff weighted by the inverse of IDF. Unfortunately, experiments show that using (6) directly does not yield any improvement. In fact, it is even worse than count cutoff methods. Therefore, we use the following form instead,</Paragraph> <Paragraph position="32"> where $\alpha$ is a weighting factor tuned to maximize the performance.</Paragraph> <Paragraph position="33"> As stated in (Manning and Schutze, 1999), the Poisson estimates are good for non-content words, but not for content words. Several improvements over Poisson have been proposed. These include the two-Poisson model (Harter, 1975) and Katz's K mixture model (Katz, 1996). The K mixture is the better of the two. It is also a simpler distribution that fits empirical distributions of content words as well as non-content words. Therefore, we use the K mixture for bigram distribution modelling. According to (Katz, 1996), the K mixture model estimates the probability that word $w_i$ appears k times in a document as $P_i(k) = (1 - \alpha)\,\delta_{k,0} + \frac{\alpha}{\beta + 1} \left( \frac{\beta}{\beta + 1} \right)^k$ (8), where $\delta_{k,0} = 1$ if and only if k = 0, and $\delta_{k,0} = 0$ otherwise. 
$\alpha$ and $\beta$ are parameters that can be fit using the observed mean $\lambda$ and the observed inverse document frequency IDF as follows: $\lambda = \frac{cf}{N}$ (9), $IDF = \log_2 \frac{N}{df}$ (10), $\beta = \lambda \times 2^{IDF} - 1 = \frac{cf - df}{df}$ (11), $\alpha = \frac{\lambda}{\beta}$ (12)</Paragraph> <Paragraph position="35"> where again, cf is the total number of occurrences of word $w_i$</Paragraph> <Paragraph position="37"> in the collection, df is the number of documents in the collection that $w_i$</Paragraph> <Paragraph position="39"> occurs in, and N is the total number of documents.</Paragraph> <Paragraph position="40"> The bigram distribution model is a variation of the above K mixture model, where we estimate the probability that a word pair $(w_{i-1}, w_i)$ occurs in a document by equation (13),</Paragraph> <Paragraph position="42"> where K is dependent on the size of the subunit: the larger the subunit, the larger the value (in our experiments, we set K from 1 to 3), and $P_i(k)$ is estimated</Paragraph> <Paragraph position="44"> by equation (8), where $\alpha$ and $\beta$ are estimated by equations (9) to (12). Accordingly, cf is the total number of occurrences of a word pair</Paragraph> <Paragraph position="46"> $(w_{i-1}, w_i)$ in the collection, df is the number of documents that contain $(w_{i-1}, w_i)$, and N is the</Paragraph> <Paragraph position="48"> total number of documents.</Paragraph> <Paragraph position="49"> Our experiments show that K mixture is the best among the three in most cases. Some partial experimental results are shown in table</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Algorithm </SectionTitle> <Paragraph position="0"> The bigram distribution model suggests a simple thresholding algorithm for bigram backoff model pruning: 1. Select a threshold th.</Paragraph> <Paragraph position="1"> 2. Compute the probability that each bigram occurs in a document individually by equation (13).</Paragraph> <Paragraph position="2"> 3. 
Remove all bigrams whose probability of occurring in a document is less than th, and recompute the backoff weights.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> In this section, we report experimental results comparing distribution based bigram pruning with the count cutoff pruning method.</Paragraph> <Paragraph position="1"> In conventional approaches, a document is defined as the subunit of training data for term distribution estimation. But for a very large training corpus that consists of millions of documents, the estimation of the bigram distribution is very time-consuming. To cope with this problem, we use a cluster of documents as the subunit. As the number of clusters can be controlled, we can define an efficient computation method and optimise the clustering algorithm.</Paragraph> <Paragraph position="2"> In what follows, we report the experimental results with the document and the cluster defined as the subunit, respectively. In our experiments, documents are clustered in three ways: by similar domain, style, or time. In all experiments described below, we use an open test set consisting of 15 million characters that have been proofread and balanced among domain, style and time.</Paragraph> <Paragraph position="3"> Training data are obtained from newspaper text (People's Daily) and novels.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Using Documents as Subunits </SectionTitle> <Paragraph position="0"> Figure 1 shows the results when we define a document as the subunit. 
We used approximately 450 million characters of People's Daily training data (1996). Figure 1: cutoff pruning and distribution based bigram pruning using a document as the subunit.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Using Clusters by Domain as Subunits </SectionTitle> <Paragraph position="0"> Figure 2 shows the results when we define a domain cluster as the subunit. We again used approximately 450 million characters of People's Daily training data (1996). To cluster the documents, we used an SVM classifier developed by Platt (Platt, 1998) to group documents of similar domains together automatically and to obtain a domain hierarchy incrementally. We also added a constraint to balance the size of each cluster, and finally obtained 105 clusters. It turns out that using domain clusters as subunits performs almost as well as using documents as subunits.</Paragraph> <Paragraph position="1"> Furthermore, we found that by using the pruning criterion based on bigram distribution, many domain-specific bigrams are pruned. This results in a relatively domain-independent language model. Therefore, we call this pruning method domain subtraction based pruning.</Paragraph> <Paragraph position="2"> Figure 2: cutoff pruning and distribution based bigram pruning using a domain cluster as the subunit.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Using Clusters by Style as Subunits </SectionTitle> <Paragraph position="0"> Figure 3 shows the results when we define a style cluster as the subunit. For this experiment, we used 220 novels written by different writers, each approximately 500 kilobytes in size, and defined each novel as a style cluster. Just as in domain clustering, we found that by using the pruning criterion based on bigram distribution, many style-specific bigrams are pruned. This results in a relatively style-independent language model. 
Therefore, we call this pruning method style subtraction based pruning.</Paragraph> <Paragraph position="1"> Figure 3: cutoff pruning and distribution based bigram pruning using a style cluster as the subunit.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Using Clusters by Time as Subunits </SectionTitle> <Paragraph position="0"> In practice, it is relatively easy to collect large amounts of training text from newspapers. For example, many Chinese SLMs are trained on newspaper text, which is of high quality and consistent in style. The disadvantage is the temporal term phenomenon: some bigrams are used frequently during one time period, and then never used again.</Paragraph> <Paragraph position="1"> Figure 4 shows the results when we define a temporal cluster as the subunit. In this experiment, we used approximately 9,200 million characters of People's Daily training data (1978-1997). We simply grouped the documents published in the same month of the same year into one cluster. Therefore, we obtained 240 clusters in total. Similarly, we found that by using the pruning criterion based on bigram distribution, many time-specific bigrams are pruned. This results in a relatively time-independent language model. Therefore, we call this pruning method temporal subtraction based pruning.</Paragraph> <Paragraph position="2"> Figure 4: cutoff pruning and distribution based bigram pruning using a temporal cluster as the subunit.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 Summary </SectionTitle> <Paragraph position="0"> In our research lab, we are particularly interested in the problem of pinyin to Chinese character conversion, which has a memory limitation of 2MB for programs. At 2MB memory, our method leads to a 7-9% word perplexity reduction, as displayed in table 2. 
(Table 2: language models of size 2M.)</Paragraph> <Paragraph position="1"> As shown in Figures 1-4, although perplexity rises sharply as the size of the language model decreases, the models created with bigram distribution based pruning have consistently lower perplexity than those created with the count cutoff method. Furthermore, when modelling bigram distribution on document clusters, our pruning method results in a more general n-gram backoff model, which resists the domain, style, or temporal bias of the training data.</Paragraph> </Section> </Section> </Paper>