<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1039"> <Title>Exponential Priors for Maximum Entropy Models</Title> <Section position="3" start_page="2" end_page="2" type="metho"> <SectionTitle> 2 Learning algorithms and discounting </SectionTitle> <Paragraph position="0"> In this section we derive a learning algorithm for exponential priors, with provable convergence properties, and show that it leads to a simple discounting formula. Note that the simple learning algorithm is an important contribution: the algorithm for a Gaussian prior is quite a bit more complicated, and previous related work with the Laplacian prior (two-sided exponential) has had a difficult time finding learning algorithms; because the Laplacian does not have a continuous first derivative, and because the exponential prior is bounded at 0, standard gradient descent type algorithms may exhibit poor behavior. Williams (1995) devotes a full ten pages to describing a somewhat heuristic approach for solving this problem, and even this discussion concludes &quot;In summary it is left to the reader to supply the algorithms for determining successive search directions and the initially preferred value of&quot; the step size (page 130).</Paragraph> <Paragraph position="1"> We show that a very simple variation on a standard algorithm, Generalized Iterative Scaling (GIS) (Darroch and Ratcliff, 1972), solves this problem. In particular, as we will show, while GIS uses an update rule of the form</Paragraph> <Paragraph position="3"> Williams' algorithm is for the much more complex case of a multi-layer network, while ours is for the one layer case, but there are no obvious simplifications to his approach for the one layer case.</Paragraph> <Paragraph position="4"> our modified algorithm uses a rule of the form</Paragraph> <Paragraph position="6"> Note that there are two different styles of model that one can use, especially in the common case that there are two outputs (values for y.) Consider a word sense disambiguation problem, trying to determine whether a word like &quot;bank&quot; means the river or financial sense, with questions like whether or not the word &quot;water&quot; occurs nearby. One could have a single indicator function f</Paragraph> <Paragraph position="8"> < [?]. We call this style &quot;double sided&quot; indicators.</Paragraph> <Paragraph position="9"> Alternatively, one could have two indicator functions,</Paragraph> <Paragraph position="11"> (x, y)=1 if water occurs nearby and y=financial.In this case, one could allow either [?][?] <l , exactly the same results will be achieved. With a Laplacian (double sided exponential), one could also use either style. With an exponential prior, only positive values are allowed, so one must use the double sided style, so that one can learn that some indicators push towards one sense, and some push towards the other - that is, rather than having one weight which is positive or negative, we have two weights which are positive or zero, one of which pushes towards one answer, and the other pushing towards the other.</Paragraph> <Paragraph position="12"> We derive our constraints and learning algorithm by very closely following the derivation of the algorithm by Chen and Rosenfeld. It will be convenient to maximize the log of the expression in 2 rather than the expression itself. This leads to an objective function:</Paragraph> <Paragraph position="14"> Note that this objective function is convex (since it is the sum of two convex functions.) 
Now we wish to find this maximum. Normally we would do so by setting the derivative to 0, but the bound $\lambda_k \ge 0$ changes things a bit: the maximum can occur either at a point where the partial derivative is zero with $\lambda_k > 0$, or at the boundary $\lambda_k = 0$; we can explicitly check the value of the objective function at the point $\lambda_k = 0$. When there is a maximum with $\lambda_k > 0$, we know that the partial derivative with respect to $\lambda_k$ is zero:

$$\frac{\partial O}{\partial \lambda_k} = \sum_j f_k(x_j, y_j) - \sum_j \sum_y P_\Lambda(y \mid x_j) f_k(x_j, y) - \alpha_k = 0. \tag{4}$$

Let $\mathrm{observed}[k] = \sum_j f_k(x_j, y_j)$ and $\mathrm{expected}[k] = \sum_j \sum_y P_\Lambda(y \mid x_j) f_k(x_j, y)$. Equation 4 implies that at the optimum, when $\lambda_k > 0$,

$$\mathrm{expected}[k] = \mathrm{observed}[k] - \alpha_k,$$

the absolute discounting equation. Sometimes it is better for $\lambda_k$ to be set to 0; the other possible optimal point is $\lambda_k = 0$ with $\mathrm{expected}[k] \ge \mathrm{observed}[k] - \alpha_k$. One of these two cases must hold at the optimum:

$$\text{either}\quad \lambda_k > 0,\ \mathrm{expected}[k] = \mathrm{observed}[k] - \alpha_k, \quad\text{or}\quad \lambda_k = 0,\ \mathrm{expected}[k] \ge \mathrm{observed}[k] - \alpha_k. \tag{5}$$

Notice an important property of exponential priors (analogous to a similar property for Laplacian priors (Williams, 1995; Tibshirani, 1994)): they often favor parameters that are exactly 0. This leads to a kind of natural pruning for exponential priors, not found with Gaussian priors, under which parameters are only very rarely exactly 0. (Note, however, that one should not increase the $\alpha_k$'s in the exponential prior to control pruning, as this may lead to oversmoothing. If additional pruning is needed for speed or memory savings, feature selection techniques should be used, such as pruning small or infrequent parameters, instead of a strengthened prior.)

Now we can derive the update equations. The derivation is exactly the same as Chen and Rosenfeld's (2000), with the minor change of an exponential prior instead of a Gaussian prior (we include it in the appendix). In the end, we get an update of the form

$$\lambda_k \leftarrow \max\!\left(0,\ \lambda_k + \frac{1}{f^\#} \log \frac{\mathrm{observed}[k] - \alpha_k}{\mathrm{expected}[k]}\right).$$

Compare this to the corresponding equation with a Gaussian prior (Chen and Rosenfeld, 2000). With a Gaussian prior, one can derive an equation of the form

$$\mathrm{observed}[k] = \mathrm{expected}[k]\, e^{\delta_k f^\#} + \frac{\lambda_k + \delta_k}{\sigma_k^2}$$

and then solve for the step size $\delta_k$. There is no closed-form solution: it must be found using numerical methods, such as Newton's method, making this update much more complex and time consuming than the exponential-prior update. One can also derive variations on this update. For instance, in the Appendix we derive an update for Improved Iterative Scaling (Della Pietra et al., 1997) with an exponential prior. Perhaps more importantly, one can also derive updates for Sequential Conditional Generalized Iterative Scaling (SCGIS) (Goodman, 2002), which is several times faster than GIS. The SCGIS update for binary features with an exponential prior is simply

$$\lambda_k \leftarrow \max\!\left(0,\ \lambda_k + \log \frac{\mathrm{observed}[k] - \alpha_k}{\mathrm{expected}[k]}\right).$$

One might wonder why we do not simply use conjugate gradient (CG) methods, which have been shown to converge quickly for maxent. There has been no formal comparison of SCGIS to conjugate gradient methods; in our own pilot studies, SCGIS is sometimes faster and sometimes slower. Also, some versions of CG use heuristics for the step size, and lose their convergence guarantees. Finally, SCGIS is simpler to implement than conjugate gradient: even for those with a conjugate gradient library, the parameter constraints (for an exponential prior) or discontinuous derivatives (for a Laplacian) mean that standard conjugate gradient techniques need to be at least somewhat modified.
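As a sketch of how the clipped update above becomes a training loop (again our own illustrative code, using the same dense arrays as the earlier sketch; a real GIS implementation would also enforce the requirement that $\sum_k f_k(x,y)$ be constant across examples, e.g. with a slack feature, which this sketch glosses over):

```python
import numpy as np

def gis_exponential_prior(f_obs, F, alpha, iters=100, eps=1e-12):
    """GIS with an exponential prior:
    lam_k <- max(0, lam_k + (1/f#) * log((observed[k] - alpha_k) / expected[k]))"""
    J, Y, K = F.shape
    f_sharp = F.sum(axis=2).max()   # f# = max over (x, y) of sum_k f_k(x, y)
    observed = f_obs.sum(axis=0)    # observed[k] = sum_j f_k(x_j, y_j)
    lam = np.zeros(K)
    for _ in range(iters):
        scores = F @ lam                             # (J, Y)
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)            # P(y | x_j) under current weights
        expected = np.einsum('jy,jyk->k', P, F)      # expected[k]
        # Discounted counts: a feature with observed[k] <= alpha_k gets a very
        # negative step, so the max() below clips its weight to zero (pruning).
        disc = np.maximum(observed - alpha, eps)
        lam = np.maximum(0.0, lam + np.log(disc / np.maximum(expected, eps)) / f_sharp)
    return lam
```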
Good-Turing discounting has been used or suggested for language modeling several times (Rosenfeld, 1994, page 38; Lau, 1994). In particular, it has been suggested to use an update of the form

$$\lambda_k \leftarrow \lambda_k + \frac{1}{f^\#} \log \frac{GT(\mathrm{observed}[k])}{\mathrm{expected}[k]},$$

where $GT(\mathrm{observed}[k])$ is the Good-Turing discounted value of $\mathrm{observed}[k]$. This update has a problem, as noted by its proponents: the constraints are probably now inconsistent - there is no model that can simultaneously satisfy all of them. We note that a simple variation on this update, inspired by the exponential prior, does not have this problem: set

$$\alpha_k = \mathrm{observed}[k] - GT(\mathrm{observed}[k])$$

for each $\mathrm{observed}[k]$, and apply the exponential-prior update. This does not constitute a Bayesian prior, since the value is picked after the counts are observed, but it does lead to a concave objective function with a global maximum, and the update function will converge towards this maximum. Variations on the constraints of Equation 5 apply to this modified objective function. Furthermore, in the experimental results section, we will see that on a language modeling task this modified update function outperforms the traditional update. By using a well-motivated approach inspired by exponential priors, we can find a simple variation that has better performance both theoretically and empirically.

3 Previous Work

There has been a fair amount of previous work on regularization of maxent models. Early approaches focused on feature selection (Della Pietra et al., 1997) or, similarly, count cutoffs (Rosenfeld, 1994). By not using all features, there is typically extra probability left over for unobserved events, which is distributed in a maximum entropy fashion. The problem with this approach is that it ignores useful information: although low-count or low-discrimination features may cause overfitting, they do contain valuable information. Because of this, more recent approaches (Rosenfeld, 1994, page 38; Lau, 1994) have tried techniques such as Good-Turing discounting (Good, 1953).

There are a number of other approaches (Khudanpur, 1995; Newman, 1997) based on the fuzzy maxent framework (Della Pietra and Della Pietra, 1993). Chen and Rosenfeld (2000) give a more complete discussion of these approaches.

Chen and Rosenfeld (2000), following a suggestion of Lafferty, implemented a Gaussian prior for maxent models. They compared this technique to most of the previously discussed techniques on a language modeling task, and concluded that it was consistently the best technique.

Tibshirani (1994) introduced Laplacian priors for linear models (linear regressions) and showed that an objective function that penalizes the absolute values of the parameters corresponds to a Laplacian prior. He called this the Least Absolute Shrinkage and Selection Operator (LASSO) and showed that it leads to a type of feature selection.
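To make that correspondence explicit (a standard identity, stated here in our notation rather than Tibshirani's): for a linear model with Gaussian noise, the LASSO estimate

$$\hat{\beta} = \arg\min_{\beta} \sum_i (y_i - x_i \cdot \beta)^2 + \gamma \sum_j |\beta_j|$$

is the MAP estimate under a Laplacian prior $p(\beta_j) \propto \exp(-\gamma'|\beta_j|)$, where $\gamma'$ is proportional to $\gamma$ with a constant depending on the noise variance; the exponential prior used in this paper is the same density restricted to $\beta_j \ge 0$.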
Exponential priors are sometimes called single-sided Laplacians, so the two techniques are very closely related - so closely that we would not want to claim that either is better than the other.

Williams (1995) introduced a Laplacian prior for neural networks. Single-layer neural networks with certain loss functions are equivalent to logistic regression/maximum entropy models. Williams' algorithm was for a more general case, and much more complex than the one we describe.

More recently, Figueiredo et al. (2003), in unpublished independent work, also examined Laplacian priors for logistic regression, deriving a somewhat more complex algorithm than ours, but one that they extended to partially supervised learning. They showed that the results are comparable to transductive SVMs.

Perkins and Theiler (2003) also used logistic regression with a Laplacian prior. Their learning algorithm was conjugate gradient descent. Normally, conjugate gradient methods are only used on objectives with a continuous first derivative, so their code was modified to prune weights that go exactly to zero.

Our main contribution, then, is not the use of Laplacian priors with logistic regression, nor even the first good learning algorithm for this model type. Our contributions are: performing an explicit comparison to a Gaussian prior and showing improved performance on real data; noticing that the fixed point of the models leads to absolute discounting; showing that iterative-scaling-style algorithms, including GIS, IIS, and SCGIS, can be trivially modified to use this prior; and explicitly showing that in at least one case, learned parameters are actually consistent with this prior.

4 Kneser-Ney smoothing

In this section, we help explain the excellent performance of Kneser-Ney smoothing, the best-performing language model smoothing technique. This justification not only answers an important question - why do discounting methods work well? - but also gives guidance for extending Kneser-Ney smoothing to problems with fractional counts, where solutions were not previously known.

Chen and Goodman (1999) performed an extensive comparison of different smoothing (regularization) techniques for language modeling. They found that a version of Kneser-Ney smoothing (Kneser and Ney, 1995) was consistently the best-performing technique. Unfortunately, while there are partial theoretical justifications for Kneser-Ney smoothing in terms of preserving marginals, one important part has previously had no justification: Kneser-Ney smoothing discounts counts, while most conventional regularization techniques, justified by Dirichlet priors, add to counts. Given that discounting was previously unjustified, it is exciting that we have found a way to explain it. In fact, we can show that the backoff version of Kneser-Ney smoothing is an excellent approximation to a maximum entropy model with an exponential prior. In particular, Kneser-Ney smoothing is derived by assuming absolute discounting, and then finding the distribution that preserves marginals, i.e. making expected = observed - discount. The differences between backoff Kneser-Ney smoothing and maxent models with exponential priors are twofold.
First, the backoff version does not exactly preserve marginals: an approximation is made. Second, backoff Kneser-Ney always performs discounting, even when this effectively results in lowering the probability, i.e. the equivalent of a negative value for $\lambda$. There is also an interpolated version of Kneser-Ney smoothing, which works even better. It exactly preserves marginals (except for discounting). Also, the secondary distribution is combined with the primary distribution; this has several effects, including making it unlikely to get the equivalent of a large negative $\lambda$ value.
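For reference, here are the two variants in the bigram case, in the standard notation of the smoothing literature (e.g., Chen and Goodman, 1999) rather than notation defined in this paper. With discount $D$, backoff Kneser-Ney is

$$P_{\mathrm{BKN}}(w_i \mid w_{i-1}) = \begin{cases} \dfrac{c(w_{i-1} w_i) - D}{c(w_{i-1})} & \text{if } c(w_{i-1} w_i) > 0, \\ \gamma(w_{i-1})\, P_{\mathrm{cont}}(w_i) & \text{otherwise,} \end{cases}$$

while the interpolated version mixes the lower-order distribution into every probability:

$$P_{\mathrm{IKN}}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - D,\, 0)}{c(w_{i-1})} + \gamma(w_{i-1})\, P_{\mathrm{cont}}(w_i),$$

where $P_{\mathrm{cont}}(w_i)$ is proportional to the number of distinct contexts that $w_i$ follows and $\gamma(w_{i-1})$ is chosen so the distribution normalizes. The discounted numerators are the counterpart of the absolute discounting fixed point $\mathrm{expected}[k] = \mathrm{observed}[k] - \alpha_k$ derived in Section 2.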