<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1039">
  <Title>Exponential Priors for Maximum Entropy Models</Title>
  <Section position="2" start_page="0" end_page="2" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Conditional Maximum Entropy (maxent) models have been widely used for a variety of tasks, including language modeling (Rosenfeld, 1994), part-of-speech tagging, prepositional phrase attachment, and parsing (Ratnaparkhi, 1998), word selection for machine translation (Berger et al., 1996), and finding sentence boundaries (Reynar and Ratnaparkhi, 1997). They are also sometimes called logistic regression models, maximum likelihood exponential models, or log-linear models, and are even equivalent to a form of perceptron/single-layer neural network. In particular, perceptrons that use the standard sigmoid function and optimize for log-loss are equivalent to maxent models. Multi-layer neural networks that optimize log-loss are closely related, and much of the discussion will apply to them implicitly.</Paragraph>
    <Paragraph position="1"> Conditional maxent models have traditionally either been unregularized or regularized with a Gaussian prior on the parameters. We will show that several advantages can be gained by instead using an exponential distribution as the prior. We will show that, in at least one case, an exponential prior experimentally better matches the actual distribution of the parameters. We will also show that it can lead to improved accuracy and a simpler learning algorithm. In addition, the exponential prior inspires an improved version of Good-Turing discounting with lower perplexity.</Paragraph>
    <Paragraph position="2"> Conditional maxent models are of the form $P_\Lambda(y|x) = \frac{\exp\left(\sum_{i=1}^{F} \lambda_i f_i(x,y)\right)}{\sum_{y'} \exp\left(\sum_{i=1}^{F} \lambda_i f_i(x,y')\right)}$, where x is an input vector, y is an output, the $f_i$ are the so-called indicator functions or feature values that are true if a particular property of x, y is true, F is the number of such features, $\Lambda$ represents the parameter set $\lambda_1, \ldots, \lambda_F$, and $\lambda_i$ is a weight for the indicator $f_i$. For instance, if trying to do word sense disambiguation for the word &quot;bank&quot;, x would be the context around an occurrence of the word; y would be a particular sense, e.g. financial or river; $f_i(x,y)$ could be 1 if the context includes the word &quot;money&quot; and y is the financial sense; and $\lambda_i$ would be a large positive number.</Paragraph>
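The model form just described can be sketched in a few lines of Python. This is an illustrative sketch with our own (hypothetical) names, not code from the paper:

```python
import math

def maxent_prob(weights, features, x, y, labels):
    """P(y|x) under a conditional maxent (log-linear) model:
    exp(sum_i lambda_i * f_i(x, y)), normalized over all possible outputs."""
    def score(label):
        return sum(weights[i] * f(x, label) for i, f in enumerate(features))
    z = sum(math.exp(score(lab)) for lab in labels)  # normalizing constant
    return math.exp(score(y)) / z

# The "bank" example: a single indicator that fires when the context
# contains "money" and the candidate sense is "financial".
features = [lambda x, y: 1 if "money" in x and y == "financial" else 0]
weights = [2.0]  # a large positive weight for that indicator
labels = ["financial", "river"]
p = maxent_prob(weights, features, ["money", "deposit"], "financial", labels)
```

With the feature firing only for the financial sense, p works out to exp(2)/(exp(2)+1), i.e. the financial sense dominates, as the text suggests.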
    <Paragraph position="7"> Maxent models have several valuable properties (Della Pietra et al. (1997) give a good overview). The most important is constraint satisfaction. For a given $f_i$, let observed[i] $= \sum_j f_i(x_j, y_j)$ be the number of times $f_i$ is observed to occur in the training data. For a model $P_\Lambda$ with parameters $\Lambda$, we can see how many times the model predicts that $f_i$ would be expected to occur: expected[i] $= \sum_j \sum_y P_\Lambda(y|x_j) f_i(x_j, y)$. Maxent models have the property that expected[i] = observed[i] for all i. These equalities are called constraints. The next important property is that the likelihood of the training data is maximized (thus the name maximum likelihood exponential model). Third, the model is as similar as possible to the uniform distribution (it minimizes the Kullback-Leibler divergence), subject to the constraints, which is why these models are called maximum entropy models.</Paragraph>
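The observed and expected counts above translate directly into code. The following is an illustrative sketch with hypothetical names, not code from the paper:

```python
import math

def observed_counts(data, features):
    """observed[i]: number of times feature f_i fires in the training data."""
    return [sum(f(x, y) for x, y in data) for f in features]

def expected_counts(data, features, weights, labels):
    """expected[i]: the model's predicted count of f_i, summing
    P(y|x) * f_i(x, y) over every training input x and every label y."""
    def prob(x, y):
        def score(lab):
            return sum(w * f(x, lab) for w, f in zip(weights, features))
        z = sum(math.exp(score(lab)) for lab in labels)
        return math.exp(score(y)) / z
    return [sum(prob(x, y) * f(x, y) for x, _ in data for y in labels)
            for f in features]

# Tiny illustration: one feature that fires on the "pos" label.
data = [("a", "pos"), ("b", "neg")]
features = [lambda x, y: 1 if y == "pos" else 0]
obs = observed_counts(data, features)
exp_counts = expected_counts(data, features, [0.0], ["pos", "neg"])
```

At a trained optimum (without a prior) the two count vectors coincide; in this toy case they already agree at zero weights because the label split is even.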
    <Paragraph position="14"> This last property - similarity to the uniform distribution - is a form of regularization. However, it turns out to be an extremely weak one - it is not uncommon for models, especially those that use all or most possible features, to assign near-zero probabilities (or, if the $\lambda$s may be infinite, even actual zero probabilities), and to exhibit other symptoms of severe overfitting. There have been a number of approaches to this problem, which we will discuss in more detail in Section 3. The most relevant approach, however, is the work of Chen and Rosenfeld (2000), who implemented a Gaussian prior for maxent models. They compared this technique to most of the previously implemented techniques on a language modeling task, and concluded that it was consistently the best. We thus use it as a baseline for our comparisons, and similar considerations motivate our own technique, an exponential prior. Chen and Rosenfeld place a Gaussian prior with mean 0 and variance $\sigma_i^2$ on the model parameters (the $\lambda_i$s), and then find a model that maximizes the posterior probability of the data and the model. In particular, maxent models without priors use the parameters $\Lambda$ that maximize $\arg\max_\Lambda \prod_j P_\Lambda(y_j|x_j)$, where the $(x_j, y_j)$ are training data instances. With a Gaussian prior we instead find $\arg\max_\Lambda \prod_j P_\Lambda(y_j|x_j) \times \prod_i \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left(-\frac{\lambda_i^2}{2\sigma_i^2}\right)$. In this case, a trained model does not satisfy the constraints expected[i] = observed[i] but, as Chen and Rosenfeld show, instead satisfies the constraints expected[i] = observed[i] $- \lambda_i/\sigma_i^2$ (Equation 1). That is, instead of a model that matches the observed count, we get a model matching the observed count minus $\lambda_i/\sigma_i^2$: in language modeling terms, this is &quot;discounting.&quot;</Paragraph>
    <Paragraph position="24"> We do not believe that all models are generated by the same process, and therefore we do not believe that a single prior will work best for all problem types.</Paragraph>
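The discounted constraint under the Gaussian prior falls out of the gradient of the log posterior: differentiating with respect to each parameter and setting the result to zero recovers it. A minimal sketch (the helper name is ours):

```python
def gaussian_map_gradient(observed, expected, lam, sigma2):
    """Per-parameter gradient of the log posterior under a zero-mean
    Gaussian prior: observed[i] - expected[i] - lam[i] / sigma2[i].
    At the optimum each entry is zero, which is exactly
    expected[i] = observed[i] - lam[i] / sigma2[i]."""
    return [o - e - l / s2
            for o, e, l, s2 in zip(observed, expected, lam, sigma2)]
```

Note that the discount term depends on the current value of lam[i], which is why this constraint shifts during training, unlike the constant discount of the exponential prior introduced below.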
    <Paragraph position="25"> In particular, as we will describe in our experimental results section, when looking at one particular set of parameters, we noticed that it was not at all Gaussian, but much more similar to a zero-mean Laplacian, $p(\lambda_i) = \frac{1}{2\beta} \exp(-|\lambda_i|/\beta)$. In some cases, learned parameter distributions will not match the prior distribution, but in some cases they will, so it seemed worth exploring one of these alternate forms. Later, when we try to optimize our models, the parameter search will turn out to be much simpler with an exponential prior, so we focus on that distribution.</Paragraph>
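One concrete way to compare the two shapes is to score a learned parameter vector under each zero-mean density and see which assigns it higher log-likelihood. This is a sketch under assumed scale parameters, not the paper's procedure:

```python
import math

def gaussian_loglik(params, sigma):
    """Log-likelihood of parameters under a zero-mean Gaussian prior."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - l ** 2 / (2 * sigma ** 2) for l in params)

def laplacian_loglik(params, beta):
    """Log-likelihood under a zero-mean Laplacian:
    p(l) = exp(-abs(l) / beta) / (2 * beta)."""
    return sum(-math.log(2 * beta) - abs(l) / beta for l in params)

# A heavy-tailed parameter set (many near zero, one large) is penalized
# far more by the Gaussian's quadratic term than by the Laplacian's
# linear one.
params = [0.0, 0.1, -0.1, 3.0]
```

The quadratic-versus-linear penalty is the essential difference: large weights are much less surprising under the Laplacian.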
    <Paragraph position="28"> With an exponential prior, $p(\lambda_i) = \alpha_i \exp(-\alpha_i \lambda_i)$ for $\lambda_i \ge 0$, we maximize $\arg\max_\Lambda \prod_j P_\Lambda(y_j|x_j) \times \prod_i \alpha_i \exp(-\alpha_i \lambda_i)$, subject to the constraints $\lambda_i \ge 0$ (Equation 2). As we will describe, it is significantly simpler to perform this maximization than the Gaussian maximization. Furthermore, as we will describe, models satisfying Equation 2 will have the property that, for each $\lambda_i > 0$, expected[i] = observed[i] $- \alpha_i$. In other words, we essentially just discount the observed counts by the constant $\alpha_i$ (which is the reciprocal of the standard deviation), subject to the constraint that $\lambda_i$ is non-negative. This is much simpler and more intuitive than the constraint with the Gaussian prior (Equation 1), since that constraint changes as the value of $\lambda_i$ changes. Furthermore, as we will describe in Section 3, discounting by a constant is a common technique for language model smoothing (Ney et al., 1994; Chen and Goodman, 1999), but one that has not previously been well justified; the exponential prior gives it some Bayesian justification.</Paragraph>
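Discounting by a constant, as in the absolute-discounting smoothing the text cites (Ney et al., 1994), can be sketched as follows. The function name is ours, and the backoff redistribution of the freed mass is omitted:

```python
def absolute_discount(counts, d):
    """Absolute discounting in the style of Ney et al. (1994): subtract a
    constant d from each count (floored at zero) before normalizing by the
    original total. The freed probability mass (not modeled here) would
    normally be redistributed to a backoff distribution."""
    total = sum(counts.values())
    return {w: max(c - d, 0.0) / total for w, c in counts.items()}
```

For example, with counts {"a": 3, "b": 1} and d = 0.5, each event's count is reduced by the same constant, mirroring how the exponential prior shifts every observed count by alpha_i.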
    <Paragraph position="35"> In Section 5 we will show that on two very different tasks - grammar checking and a collaborative filtering task - the exponential prior yields lower error rates than the Gaussian.</Paragraph>
  </Section>
</Paper>