<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1002"> <Title>Sequential Conditional Generalized Iterative Scaling</Title> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Previous Work </SectionTitle> <Paragraph position="0"> Although sequential updating was described for joint probabilities in the original paper on GIS by Darroch and Ratcliff (1972), GIS with sequential updating for conditional models appears previously unknown.</Paragraph> <Paragraph position="1"> Note that in the NLP community, almost all maxent models have been conditional models (which are typically far more efficient to learn), and none to our knowledge has used this speedup.</Paragraph> <Paragraph position="2"> There appear to be two main reasons this speedup has not been used before for conditional models.</Paragraph> <Paragraph position="3"> One issue is that for joint models, it turns out to be more natural to compute the sums s[x], while for conditional models, it is more natural to compute the λs and not store the sums s. Storing s is essential for our speedup. Also, one of the first and best known uses of conditional maxent models is for language modeling (Rosenfeld, 1994), where the number of output classes is the vocabulary size, typically 5,000-60,000 words. For such applications, the array s[j,y] would be of a size at least 5000 times the number of training instances: clearly impractical (but see below for a recently discovered trick). (Berger et al. (1996) use an algorithm that might appear sequential, but an examination of the definition of f# and related work shows that it is not.)</Paragraph> <Paragraph position="4"> Thus, it is unsurprising that this speedup was forgotten.</Paragraph> <Paragraph position="5"> There have been several previous attempts to speed up maxent modeling. Best known is the work of Della Pietra et al. (1997), the Improved Iterative Scaling (IIS) algorithm. Instead of treating f# as a constant, we can treat it as a function of x and y, f^{\#}(x,y) = \sum_i f_i(x,y), and solve for the update \delta_i in</Paragraph> <Paragraph position="6"> \sum_j f_i(x_j, y_j) = \sum_j \sum_y P_{\lambda}(y|x_j)\, f_i(x_j, y)\, e^{\delta_i f^{\#}(x_j, y)} \qquad (4)</Paragraph> <Paragraph position="7"> Notice that in the special case where f#(x, y) is constant, this is the same as the GIS update; IIS improves on GIS to the extent that f#(x_j, y) is typically smaller than its maximum value f#. It is, however, hard to think of applications where this difference is typically large. We only know of one limited experiment comparing IIS to GIS (Lafferty, 1995). That experiment showed roughly a factor of 2 speedup. It should be noted that compared to GIS, IIS is much harder to implement efficiently. When solving Equation 4, one uses an algorithm such as Newton's method that repeatedly evaluates the function. Either one must repeatedly cycle through the training data to compute the right hand side of this equation, or one must use tricks such as bucketing by the values of f#(x_j, y). The first option is inefficient and the second adds considerably to the complexity of the algorithm.</Paragraph> <Paragraph position="10"> Note that IIS and SCGIS can be combined by using an update rule where one solves for \delta_i in</Paragraph> <Paragraph position="11"> \sum_j f_i(x_j, y_j) = \sum_j \sum_y P_{\lambda}(y|x_j)\, f_i(x_j, y)\, e^{\delta_i f_i(x_j, y)} \qquad (5)</Paragraph> <Paragraph position="12"> For many model types, the f_i take only the values 1 or 0. In this case, Equation 5 reduces to the normal SCGIS update.</Paragraph> <Paragraph position="13"> Brown (1959) describes Iterative Scaling (IS), applied to joint probabilities, and Jelinek (1997, page 235) shows how to apply IS to conditional probabilities. For binary-valued features, without the caching trick, SCGIS is the same as the algorithm described by Jelinek. The advantage of SCGIS over IS is the caching - without which there is no speedup - and, because it is a variation on GIS, it can be applied to non-binary valued features. Also, with SCGIS, it is clear how to apply other improvements such as the smoothing technique of Chen and Rosenfeld (1999).
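To make the role of the caching concrete, here is a minimal sketch of one sequential pass in the style described above, restricted to binary-valued features; the sparse data layout and the names (feat_occurrences, observed, s, z) are our own illustration, not the paper's code, and smoothing is omitted.

import math

def scgis_pass(lam, feat_occurrences, observed, s, z):
    """One sequential pass over all features (binary-feature case).

    lam[i]               current parameter for feature i
    feat_occurrences[i]  sparse list of (j, y) pairs where f_i(x_j, y) = 1
    observed[i]          number of training instances j with f_i(x_j, y_j) = 1
    s[j][y]              cached sum over i of lam[i] * f_i(x_j, y)
    z[j]                 cached sum over y of exp(s[j][y])
    """
    for i, occurrences in enumerate(feat_occurrences):
        # Expected count of feature i under the current model, computed from
        # the cached sums rather than by re-summing all features from scratch.
        expected = sum(math.exp(s[j][y]) / z[j] for j, y in occurrences)
        if expected == 0.0 or observed[i] == 0.0:
            continue  # smoothing, which would handle zero counts, is omitted here
        delta = math.log(observed[i] / expected)  # step size; the slowing factor is 1 for binary features
        lam[i] += delta
        # Update only the cached entries that this feature touches, keeping
        # s and the normalizers z consistent for the next feature.
        for j, y in occurrences:
            z[j] -= math.exp(s[j][y])
            s[j][y] += delta
            z[j] += math.exp(s[j][y])
    return lam

Because each feature only visits the (j, y) pairs where it is non-zero, a full pass costs roughly the same as one GIS iteration, while each feature receives the larger, feature-specific step; this is the speedup that is lost if s is not stored.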
Several techniques have been developed specifically for speeding up conditional maxent models, especially when Y is large, as in language modeling; space precludes a full discussion here. These techniques include unigram caching, cluster expansion (Lafferty et al., 2001; Wu and Khudanpur, 2000), and word clustering (Goodman, 2001). Of these, the best appears to be word clustering, which leads to up to a factor of 35 speedup, and which has an additional advantage: it allows the SCGIS speedup to be used when there are a large number of outputs.</Paragraph> <Paragraph position="14"> The word clustering speedup (which can be applied to almost any problem with many outputs, not just words) works as follows. Notice that in both GIS and SCGIS, there are key loops over all outputs, y. Even with certain optimizations that can be applied, the length of these loops will still be bounded by, and often be proportional to, the number of outputs. We therefore change from a model of the form P(y|x) to modeling P(cluster(y)|x) × P(y|x, cluster(y)).</Paragraph> <Paragraph position="15"> Consider a language model in which y is a word, x represents the words preceding y, and the vocabulary size is 10,000 words. Then for a model P(y|x), there are 10,000 outputs. On the other hand, if we create 100 word clusters, each containing 100 words, then for a model P(cluster(y)|x) there are 100 outputs, and for a model P(y|x, cluster(y)) there are also 100 outputs. Thus, instead of training one model with time proportional to 10,000, we train two models, each with time proportional to 100: in this example, a 50 times speedup. In practice, the speedups are not quite so large, but we do achieve speedups of up to a factor of 35. Although the model form learned is not exactly the same as the original model, the perplexity of the form using two models is actually marginally lower (better) than the perplexity of the form using a single model, so there does not seem to be any disadvantage to using it.</Paragraph>
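As a concrete illustration of this decomposition, the sketch below shows how the two smaller models combine at evaluation time; the function and variable names (cluster_of, p_cluster_given_x, p_word_given_x_cluster) are our own placeholders, not an interface from the paper.

def cluster_decomposed_prob(x, y, cluster_of, p_cluster_given_x, p_word_given_x_cluster):
    """P(y|x) factored as P(cluster(y)|x) * P(y|x, cluster(y)).

    cluster_of[y]                    cluster id of word y
    p_cluster_given_x(c, x)          first model: about 100 outputs in the example above
    p_word_given_x_cluster(y, x, c)  second model: about 100 outputs per cluster
    """
    c = cluster_of[y]
    return p_cluster_given_x(c, x) * p_word_given_x_cluster(y, x, c)

Each of the two models is trained as an ordinary conditional maxent model, so the loops over outputs in GIS or SCGIS run over roughly 100 + 100 = 200 outputs per instance instead of 10,000, which is where the factor of 50 in the example comes from.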
<Paragraph position="16"> The word clustering technique can be extended to use multiple levels. For instance, by putting words into superclusters, such as their part of speech, and clusters, such as semantically similar words of a given part of speech, one could use a three-level model. In fact, the technique can be extended to up to log Y levels with two outputs per level, meaning that the space requirements are proportional to 2 log Y instead of to the original Y. Since SCGIS works by increasing the step size, and the cluster-based speedup works by increasing the speed of the inner loop (which SCGIS shares), we expect that the two techniques would complement each other well, and that the speedups would be nearly multiplicative. Very preliminary language modeling experiments are consistent with this analysis.</Paragraph> <Paragraph position="17"> There has been interesting recent unpublished work by Minka (2001). While this work is very preliminary, and the experimental setting somewhat unrealistic (dense features artificially generated), especially for many natural language tasks, the results are dramatic enough to be worth noting. In particular, Minka found that a version of conjugate gradient descent worked extremely well - much faster than GIS. If the problem domain resembles Minka's, then conjugate gradient descent and related techniques are well worth trying, and it would be interesting to try these techniques for more realistic tasks.</Paragraph> <Paragraph position="18"> SCGIS turns out to be related to boosting. As shown by Collins et al. (2002), boosting is in some ways a sequential version of maxent. The single largest difference between our algorithm and Collins' is that we update each feature in order, while Collins' algorithms select a (possibly new) feature to update. Those algorithms also require more storage than ours when data is sparse: fast implementations require storage of both the training data matrix (to compute which feature to update) and the transpose of the training data matrix (to perform the update efficiently).</Paragraph> </Section> <Section position="5" start_page="2" end_page="3" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> In this section, we give experimental results showing that SCGIS converges up to an order of magnitude faster than GIS, or more, depending on the number of non-zero indicator functions and the method of measuring performance.</Paragraph> <Paragraph position="1"> There are at least three ways in which one could measure the performance of a maxent model: the objective function optimized by GIS/SCGIS; the entropy on test data; and the percent correct on test data. The objective function for both SCGIS and GIS, when smoothing is used, is Equation 3: the probability of the training data times the probability of the model. The most interesting measure, the percent correct on test data, tends to be noisy.</Paragraph> <Paragraph position="2"> For a test corpus, we chose to use exactly the same training and test data, problems, and feature sets used by Banko and Brill (2001). These problems consist of trying to guess which of two confusable words, e.g. "their" or "there", a user intended. Banko and Brill chose this data to be representative of typical machine learning problems, and, by trying it across data sizes and different pairs of words, it exhibits a good range of different behaviors. Banko and Brill used a standard set of features, including words within a window of 2, part-of-speech tags within a window of 2, pairs of word or tag features, and whether or not a given word occurred within a window of 9. Altogether, they had 55 feature types. That is, there were many thousands of features in the model (depending on the exact model), but at most 55 could be "true" for a given training or test instance.</Paragraph> <Paragraph position="4"> We examine the performance of SCGIS versus GIS across three different axes. The most important variable is the number of features. In addition to trying Banko and Brill's 55 feature types, we tried using feature sets with 5 feature types (words within a window of 2, plus the "unigram" feature) and 15 feature types (words within a window of 2, tags within a window of 2, the unigram, and pairs of words within a window of 2). We also tried not using smoothing, and we tried varying the training data size.</Paragraph> <Paragraph position="5"> In Table 1, we present a "typical" configuration, using 55 feature types, 10 million words of training data, and smoothing with a Gaussian prior. The first two columns show the different confusable words. Each remaining column shows the ratio of how much longer (in terms of elapsed time) it takes GIS to achieve the same results as 10 iterations of SCGIS. An "XXX" denotes a case in which GIS did not achieve the performance level of SCGIS within 1000 iterations; XXXs were not included in averages. The "objec" column shows the ratio of time to achieve the same value of the objective function (Equation 3); the "ent" column shows the ratio of time to achieve the same test entropy; and the "cor" column shows the ratio of time to achieve the same test error rate.</Paragraph> <Paragraph position="8"> For all three measurements, the ratio can be up to a factor of 30, though the average is somewhat lower, and in two cases, GIS converged faster.</Paragraph>
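These ratios are read off timing curves: how long GIS needs to match what SCGIS reaches after 10 iterations. A minimal sketch of that bookkeeping is given below; the curve format and function names are our own assumptions, not the authors' scripts.

def time_to_reach(curve, target, higher_is_better=False):
    """curve: list of (elapsed_seconds, metric) pairs, one per iteration.
    Returns the first elapsed time at which the metric is at least as good
    as target, or None if it is never reached (an "XXX" entry)."""
    for elapsed, value in curve:
        good = (value >= target) if higher_is_better else (target >= value)
        if good:
            return elapsed
    return None

def speedup_ratio(gis_curve, scgis_curve, iterations=10, higher_is_better=False):
    # Target: whatever SCGIS achieves after a fixed number of iterations (10 here).
    scgis_time, scgis_value = scgis_curve[iterations - 1]
    gis_time = time_to_reach(gis_curve, scgis_value, higher_is_better)
    if gis_time is None:
        return None  # corresponds to an "XXX" cell
    return gis_time / scgis_time

For the objective function (which is maximized) higher_is_better would be True; for test entropy and test error rate it would be False.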
<Paragraph position="9"> In Table 2 we repeat the experiment, but without smoothing. On the objective function - which with no smoothing is just the training entropy - the increase from SCGIS is even larger. On the other two measures, the increase from SCGIS is smaller than it was with smoothing, but still consistently large. (On a 1.7 GHz Pentium IV with 10,000,000 words of training data and 5 feature types, it took between .006 and .24 seconds per iteration of SCGIS, and between .004 and .18 seconds for GIS. With 55 feature types, it took between .05 and 1.7 seconds for SCGIS and between .03 and 1.2 seconds for GIS. Note that many experiments use much larger datasets or many more feature types; run time scales linearly with training data size.)</Paragraph> <Paragraph position="10"> In Tables 3 and 4, we show results with small and medium feature sets. As can be seen, the speedups with the smaller feature set (5 feature types) are less than the speedups with the medium-sized feature set (15 feature types), which in turn are smaller than the baseline speedups with 55 feature types.</Paragraph>
                         objec    ent    cor
  accept except            6.0    4.8    3.7
  affect effect            3.6    3.6    1.0
  among between            5.8    1.0    0.7
  its it's                 8.7    5.6    3.3
  peace piece             25.2    2.9    XXX
  principal principle      6.7   18.6    1.0
  then than                6.9    6.7    9.6
  their there              4.7    4.2    3.6
<Paragraph position="11"> Notice that across all experiments, there were no cases where GIS converged faster than SCGIS on the objective function; two cases where it converged faster on test data entropy; and 5 cases where it converged faster on test data correctness. The objective function measure is less noisy than test data entropy, and test data entropy is less noisy than test data error rate: the noisier the data, the more chance of an unexpected result. Thus, one possibility is that these cases are simply due to noise. Similarly, the four cases in which GIS never reached the test data entropy of SCGIS and the four cases in which GIS never reached the test data error rate of SCGIS might also be attributable to noise. There is an alternative explanation that might be worth exploring. On a different data set, 20 newsgroups, we found that early stopping techniques were helpful, and that GIS and SCGIS benefited differently depending on the exact settings. It is possible that effects similar to the smoothing effect of early stopping played a role in both the XXX cases (in which SCGIS presumably benefited more from these effects) and in the cases where GIS beat SCGIS (in which GIS presumably benefited more).
Additional research would be required to determine which explanation - early stopping or noise - is correct, although we suspect both explanations apply in some cases.</Paragraph> <Paragraph position="12"> We also ran experiments that were the same as the baseline experiment, except changing the training data size to 50 million words and to 1 million words. We found that the individual speedups were often different at the different sizes, but overall they did not appear to be systematically higher or lower, or qualitatively different.</Paragraph> </Section> </Paper>