Sequential Conditional Generalized Iterative Scaling (P02-1002)

2 Algorithms

We begin by describing the classic GIS algorithm. Recall that GIS converges towards a model in which, for each $f_i$, the observed count of the indicator function in the training data matches its expected count under the model:

$$\mathrm{observed}[i] = \sum_j f_i(x_j, y_j) \;=\; \sum_j \sum_y P_\Lambda(y \mid x_j)\, f_i(x_j, y) = \mathrm{expected}[i].$$

Whenever they are not equal, we can move them closer. One simple idea is to just add $\log \frac{\mathrm{observed}[i]}{\mathrm{expected}[i]}$ to $\lambda_i$. The problem with this is that it ignores the interaction with the other $\lambda$s. If updates to other $\lambda$s made on the same iteration of GIS have a similar effect, we could easily go too far, and even make things worse. GIS therefore introduces a slowing factor, $f^\#$, equal to the largest total value of the indicator functions,

$$f^\# = \max_{j,y} \sum_i f_i(x_j, y),$$

and uses the update

$$\lambda_i \leftarrow \lambda_i + \frac{1}{f^\#} \log \frac{\mathrm{observed}[i]}{\mathrm{expected}[i]}.$$

This update provably converges to the global optimum. GIS for joint models was given by Darroch and Ratcliff (1972); the conditional version is due to Brown et al. (unpublished), as described by Rosenfeld (1994).

Many published versions of the GIS algorithm require inclusion of a "slack" indicator function so that the same number of constraints always applies. In practice it is only necessary that the total of the indicator functions be bounded by $f^\#$, not necessarily equal to it. Alternatively, one can see this as including the slack indicator but fixing the corresponding $\lambda$ to 0 and never updating it, so that it can be omitted from all equations; the proofs that GIS improves at each iteration and that there is a global optimum still hold.

In practice, we use the pseudocode of Figure 1. We will write I for the number of training instances, F for the number of indicator functions, and Y for the number of output classes (values for y). We assume that we keep a data structure listing, for each training instance, the indicator functions with non-zero values; that is, the training data is stored as a sparse matrix indexed by instance.
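To make the loop structure concrete, here is a minimal Python sketch of one GIS iteration in the spirit of Figure 1 (the figure itself is not reproduced in this extract). The data layout and the names data, labels, lambdas, and f_sharp are our own illustrative assumptions, not the paper's code.

```python
import math

def gis_iteration(data, labels, lambdas, num_features, f_sharp):
    # data[j] maps each output y to a list of (i, v) pairs with v = f_i(x_j, y) != 0,
    # labels[j] is the training label y_j, lambdas[i] holds the current weights,
    # and f_sharp = max over (j, y) of sum_i f_i(x_j, y).
    observed = [0.0] * num_features
    expected = [0.0] * num_features

    for j, instance in enumerate(data):                 # outer loop: training instances
        # Unnormalized scores s[y] = sum_i lambda_i f_i(x_j, y) and normalizer z.
        s = {y: sum(lambdas[i] * v for i, v in feats) for y, feats in instance.items()}
        z = sum(math.exp(s_y) for s_y in s.values())
        for y, feats in instance.items():               # inner loop: outputs
            p = math.exp(s[y]) / z                      # P(y | x_j) under the current model
            for i, v in feats:
                expected[i] += p * v
                if y == labels[j]:
                    observed[i] += v

    # Update every lambda at once, slowed by the global factor f_sharp.
    for i in range(num_features):
        if observed[i] > 0 and expected[i] > 0:
            lambdas[i] += math.log(observed[i] / expected[i]) / f_sharp
```

(In a real implementation the observed counts would be computed once, outside the iteration loop; they are recomputed here only to keep the sketch self-contained.)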
Now we can describe our variation on GIS. Basically, instead of updating all the $\lambda$s simultaneously, we loop over each indicator function and compute an update for that indicator function in turn. In particular, the first change we make is to exchange the outer loops over training instances and indicator functions. Notice that in order to do this efficiently, we also need to rearrange our data structures: while we previously assumed that the training data was stored as a sparse matrix of indicator functions with non-zero values for each instance, we now assume that the data is stored as a sparse matrix of instances with non-zero values for each indicator. The size of the two matrices is obviously the same.

The next change we make is to update each $\lambda_i$ in the inner loop, immediately after expected[i] is computed, rather than after expected values for all features have been computed. If we update the features one at a time, then the meaning of $f^\#$ changes. In the original version of GIS, $f^\#$ is the largest total of all features. However, $f^\#$ only needs to be the largest total of the features being updated, and in this case there is only one such feature. Thus, instead of $f^\# = \max_{j,y} \sum_i f_i(x_j, y)$, we can use $\max_{j,y} f_i(x_j, y)$. In many max-ent applications, the $f_i$ take on only the values 0 or 1, and thus, typically, $\max_{j,y} f_i(x_j, y) = 1$. Therefore, instead of slowing by a factor of $f^\#$, there may be no slowing at all!

We make one last change in order to get a speedup. Rather than recompute, for each instance j and each output y, the sum $s[j,y] = \sum_i \lambda_i f_i(x_j, y)$ and the corresponding normalizing factor $z[j] = \sum_y e^{s[j,y]}$, we instead keep these arrays computed as invariants, and incrementally update them whenever a $\lambda_i$ changes. With this important change, we now get a substantial speedup. The code for this transformed algorithm is given in Figure 2.
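The Python sketch below shows the transformed loop structure with the incremental maintenance of s and z. It follows the description above under the same illustrative assumptions as the GIS sketch, with the training matrix transposed into by_feature, precomputed observed counts, and per-feature maxima max_f; it is not a reproduction of Figure 2.

```python
import math

def scgis_pass(by_feature, observed, lambdas, s, z, max_f):
    # by_feature[i] lists (j, y, v) triples with v = f_i(x_j, y) != 0 (the transposed matrix),
    # observed[i] is the training count of feature i, max_f[i] = max over (j, y) of f_i(x_j, y),
    # and the invariants s[j][y] = sum_i lambda_i f_i(x_j, y), z[j] = sum_y exp(s[j][y]) hold on entry.
    for i, entries in enumerate(by_feature):            # outer loop: indicator functions
        expected_i = 0.0
        for j, y, v in entries:
            expected_i += v * math.exp(s[j][y]) / z[j]  # uses the cached s and z

        if observed[i] > 0 and expected_i > 0:
            # Update this single lambda immediately, slowed only by max_f[i]
            # (often 1 for binary features) rather than by the global f#.
            delta = math.log(observed[i] / expected_i) / max_f[i]
            lambdas[i] += delta

            # Incrementally restore the s and z invariants for the affected entries.
            for j, y, v in entries:
                old = math.exp(s[j][y])
                s[j][y] += delta * v
                z[j] += math.exp(s[j][y]) - old
```

Because only the entries that touch feature i change, restoring the invariants costs no more than computing expected_i itself, which is consistent with the per-iteration costs analyzed in Section 2.1.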
The space of models in the form of Equation 1 is convex, with a single global optimum. Thus, GIS and SCGIS are guaranteed to converge towards the same point. For convergence proofs, see Darroch and Ratcliff (1972), who prove convergence of the algorithm for joint models.

2.1 Time and Space

In this section, we analyze the time and space requirements for SCGIS compared to GIS. The space results depend on Y, the number of output classes. When Y is small, SCGIS requires only a small amount more space than GIS. Note that in Section 3, we describe a technique that, when there are many output classes, uses clustering both to get a speedup and to reduce the number of outputs, thus alleviating the space issues.

Typically for GIS, the training data is stored as a sparse matrix, of size T, of all non-zero indicator functions for each instance j and output y. The transposed matrix used by SCGIS is the same size T.

In order to make the relationship between GIS and SCGIS clearer, the algorithms in Figures 1 and 2 are given with some wasted space. For instance, the matrix s[j,y] of sums of $\lambda$s only needs to be a simple array s[y] for GIS, but we wrote it as a matrix so that it would have the same meaning in both algorithms. In the space and time analyses, we will assume that such space-wasting constructions are optimized out before coding.

Now we can analyze the space and time for GIS. GIS requires the training matrix, of size T, the $\lambda$s, of size F, and the expected and observed arrays, which are also of size F. Thus, GIS requires space O(T + F). Since T must be at least as large as F (we can eliminate any indicator functions that do not appear in the training data), this is O(T).

SCGIS is potentially somewhat larger. SCGIS also needs to store the training data, albeit in a different form, but one that is also of size T. In particular, the matrix is interchanged so that its outermost index is over indicator functions instead of training instances. SCGIS also needs the observed and $\lambda$ arrays, both of size F, the array z[j] of size I, and, more importantly, the full array s[j,y], which is of size IY. In many problems Y is small, often 2, and IY is negligible, but in problems like language modeling, Y can be very large (60,000 or more). The overall space for SCGIS, O(T + IY), is thus essentially the same as for GIS when Y is small, but much larger when Y is large; see, however, the optimization described in Section 3.

Now, consider the time for each algorithm to execute one iteration. Assume that for every instance and output there is at least one non-zero indicator function, which is true in practice. Notice that for GIS, the top loops end up iterating over all non-zero indicator functions, for each output, for each training instance. In other words, they examine every entry in the training matrix once, and thus require time T. The bottom loops simply require time F, which is smaller than T. Thus, GIS requires time O(T).

For SCGIS, the top loops are also over each non-zero entry in the training data, which takes time O(T). The bottom loops also require time O(T). Thus, one iteration of SCGIS takes about as long as one iteration of GIS; in practice, in our implementation, each SCGIS iteration takes about 1.3 times as long as each GIS iteration. The speedup in SCGIS comes from the step size: the update in GIS is slowed by $f^\#$, while the update in SCGIS is not. Thus, we expect SCGIS to converge by up to a factor of $f^\#$ faster. For many applications, $f^\#$ can be large.
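As a purely hypothetical illustration of these bounds (all numbers below are invented for the example, not taken from the paper's experiments), consider a problem of moderate size:

```latex
% Invented numbers, for illustration only.
% Suppose I = 10^5 instances, Y = 2 outputs, F = 5 \times 10^5 binary features,
% and roughly 20 active features per (instance, output) pair, so f^\# \approx 20.
\begin{align*}
T    &\approx I \cdot Y \cdot 20 = 4 \times 10^6
      && \text{matrix entries scanned per iteration, by GIS and by SCGIS alike,}\\
IY   &= 2 \times 10^5
      && \text{extra cells for the cached } s[j,y] \text{ array, negligible next to } T,\\
f^\# &\approx 20
      && \text{so the larger step size can cut the number of iterations by up to roughly } 20\times.
\end{align*}
```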
The speedup from the larger step size is difficult to analyze rigorously, and it may not be obvious whether the speedup we observe is actually due to the $f^\#$ improvement or to the caching. Note that without the caching, each iteration of SCGIS would be $O(f^\#)$ times slower than an iteration of GIS, so the caching is certainly a key component. But with the caching, each iteration of SCGIS is still marginally slower than one of GIS (by a small constant factor). In Section 4, we empirically observe that fewer iterations are required to achieve a given level of convergence, and that this reduction is very roughly proportional to $f^\#$. Thus, the speedup does appear to come from the larger step size. However, the exact speedup from the step size depends on many factors, including how correlated the features are and the order in which they are trained.

Although we are not aware of any problems where maxent training data does not fit in main memory and yet the model can be learned in reasonable time, it is comforting that SCGIS, like GIS, requires sequential, not random, access to the training data. So, if one wanted to train a model using a large amount of data on disk or tape, this could still be done with reasonable efficiency, as long as the s and z arrays, for which we need random access, fit in main memory.

All of these analyses have assumed that the training data is stored as a precomputed sparse matrix of the non-zero values $f_i(x_j, y)$ for each output. In some applications, such as language modeling, this is not the case; instead, the $f_i(x_j, y)$ are computed on the fly. However, with a bit of thought, those data structures can also be rearranged.

Chen and Rosenfeld (1999) describe a technique for smoothing maximum entropy models that is the best currently known. Maximum entropy models are naturally maximally smooth, in the sense that they are as close as possible to uniform, subject to satisfying the constraints. However, in practice, there may be enough constraints that the models are not nearly smooth enough: they overfit the training data. Chen and Rosenfeld describe a technique whereby a Gaussian prior on the parameters is assumed. The models no longer satisfy the constraints exactly, but work much better on test data. In particular, instead of attempting to maximize the probability of the training data alone, they maximize a slightly different objective function, the probability of the training data times the prior probability of the model:

$$\arg\max_\Lambda \; \prod_j P_\Lambda(y_j \mid x_j) \;\times\; \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(\frac{-\lambda_i^2}{2\sigma^2}\right),$$

where the probability of the $\lambda$s is a simple normal distribution with 0 mean and a standard deviation of $\sigma$. Chen and Rosenfeld describe a modified update rule in which, to find the updates, one solves for $\delta_i$ in

$$\mathrm{observed}[i] \;=\; \sum_j \sum_y P_\Lambda(y \mid x_j)\, f_i(x_j, y)\, e^{\delta_i f^\#} \;+\; \frac{\lambda_i + \delta_i}{\sigma^2}.$$
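This modified update no longer has a closed-form solution, but $\delta_i$ can be found with any one-dimensional root finder, since the right-hand side is monotonically increasing in $\delta_i$. The sketch below uses Newton's method; the function name and the choice of Newton's method are our own illustrative assumptions, not a prescription from Chen and Rosenfeld or from this paper.

```python
import math

def smoothed_delta(observed, expected, lam, f_sharp, sigma, iters=50, tol=1e-10):
    # Solve observed = expected * exp(delta * f_sharp) + (lam + delta) / sigma**2 for delta.
    # g is monotonically increasing and convex in delta, so Newton's method from 0 converges.
    delta = 0.0
    for _ in range(iters):
        g = expected * math.exp(delta * f_sharp) + (lam + delta) / sigma ** 2 - observed
        g_prime = expected * f_sharp * math.exp(delta * f_sharp) + 1.0 / sigma ** 2
        step = g / g_prime
        delta -= step
        if abs(step) < tol:
            break
    return delta
```

Any other one-dimensional root finder would work equally well here; Newton's method is used only because the function being solved is smooth and convex, so it converges in a handful of steps.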