<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1039">
<Title>Exponential Priors for Maximum Entropy Models</Title>
<Section position="6" start_page="2" end_page="2" type="evalu">
<SectionTitle> 5 Experimental Results </SectionTitle>
<Paragraph position="0"> In this section, we detail our experimental results, showing that exponential priors outperform Gaussian priors on two different data sets and inspire improvements for a third. For all experiments except the language model experiments, we used a single variance for both the Gaussian and the exponential prior, rather than one per parameter, with the variance optimized on held-out data. For the language modeling experiments, we used three variances, one each for the unigram, bigram, and trigram models, again optimized on held-out data.</Paragraph>
<Paragraph position="1"> We were inspired to use an exponential prior by an actual examination of a data set. In particular, we used the grammar-checking data of Banko and Brill (2001).</Paragraph>
<Paragraph position="2"> We chose this set because there are commonly used versions both with small amounts of data (which is when we expect the prior to matter) and with large amounts of data (which is required to see easily what the distribution over "correct" parameter values is). For one experiment, we trained a model using a Gaussian prior on a large amount of data. We then found those parameters (λ's) that had at least 35 training instances, enough to typically overcome the prior and train the parameter reliably, and graphed the distribution of these parameters. While it is common to look at the distribution of data, the NLP and machine learning communities rarely examine distributions of model parameters; yet this seems like a good way to get inspiration for priors to try, using the parameters with enough data to help guess the priors for those with less, or at least to determine the correct form for the prior, if not the exact values.</Paragraph>
<Paragraph position="3"> The results are shown in Figure 1, which is a histogram of the number of λ's with a given value. If the distribution were Gaussian, we would expect this to look like an upside-down parabola.</Paragraph>
<Paragraph position="4"> If the distribution were Laplacian, we would expect it to appear as a triangle (with the bottom formed by the x-axis). Indeed, it does appear to be roughly triangular, and to the extent that it diverges from this shape, it is convex, whereas a Gaussian would be concave. We do not believe that the exponential prior is right for every problem; our argument here is that, based on both better accuracy (our next experiment) and a better fit to at least some of the parameters, the exponential prior is better for some models. We then tried actually using exponential priors with this application, and were able to demonstrate improvements. (Of course, those parameters with lots of data might be generated from a different prior than those with little data; this technique is meant as a form of inspiration and evidence, not of proof. Similarly, all parameters may be generated by some other process, e.g. a mixture of Gaussians. Finally, a prior might be accurate but still behave poorly, because it might interact poorly with other approximations. For instance, it might interact poorly with the fact that we use the argmax rather than the true Bayesian posterior over models.)</Paragraph>
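As a concrete illustration of this diagnostic, the following sketch (not the authors' code; the weights and per-feature counts are synthetic stand-ins) filters the parameters with at least 35 training instances and histograms them. One natural reading of the parabola/triangle comparison above is that the counts are shown on a logarithmic axis, since the log of a Gaussian density is a parabola and the log of a Laplacian density is piecewise linear; the sketch therefore uses a log count axis.

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic stand-ins for what would come from a maxent model trained
    # with a Gaussian prior: one weight (lambda) and one count of training
    # instances per feature.
    rng = np.random.default_rng(0)
    weights = rng.laplace(loc=0.0, scale=0.5, size=20000)
    feature_counts = rng.poisson(lam=40, size=20000)

    # Keep only parameters with enough data to overcome the prior.
    reliable = weights[feature_counts >= 35]

    # Log count axis: Gaussian-distributed weights would trace an upside-down
    # parabola, Laplacian-distributed weights a triangle.
    plt.hist(reliable, bins=100, log=True)
    plt.xlabel("parameter value (lambda)")
    plt.ylabel("number of features (log scale)")
    plt.savefig("lambda_histogram.png")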
<Paragraph position="5"> We trained on the 100,000-word training set (raw text, only some of which contains examples of the confusable word pairs of interest, so the actual number of training examples for each word pair was less than 100,000). We tried a range of variances for both the Gaussian and the exponential prior, and found the single value that was best on average across all ten pairs. With this best setting, we achieved a 14.51% geometric average error rate with the exponential prior, and 15.45% with the Gaussian. To avoid any form of cheating, we then tried 10 different word pairs (the same as those used by Banko and Brill (2001)) with this best parameter setting.</Paragraph>
<Paragraph position="6"> The results were 18.07% and 19.47% for the exponential and Gaussian priors, respectively. (The overall higher error rates are due to the test-set words being slightly more difficult.) We also tried experiments with 1 million and 10 million words of training data, but there were no consistent differences; improved smoothing mostly matters with small amounts of training data.</Paragraph>
<Paragraph position="7"> We also tried experiments with a collaborative-filtering-style task, television show recommendation, based on Nielsen data. The data set used, and the definition of the collaborative filtering (CF) score, are the same as used by Kadie et al. (2002), although our random train/test split is not the same, so the results are not strictly comparable. We first ran experiments with different priors on a held-out section of the training data, and then, using the single best value for the prior (the same one across all features), we ran on the test data. With a Gaussian prior, the CF score was 42.11, while with an exponential prior it was 45.86, a large improvement.</Paragraph>
<Paragraph position="8"> Finally, we ran experiments with language modeling, with mixed success. We used 1,000,000 words of training data from the WSJ corpus (a small model, but one where smoothing matters) and a trigram model with a cluster-based speedup (Goodman, 2001). We evaluated on test data using the standard language modeling measure, perplexity, where lower scores are better. We tried six experiments: Katz smoothing, a widely used version of Good-Turing smoothing (perplexity 238.0); Good-Turing discounting to smooth maxent (perplexity 224.8); our variation on Good-Turing, inspired by exponential priors, in which λ's are bounded at 0 (perplexity 204.5); an exponential prior (perplexity 190.8); a Gaussian prior (perplexity 183.7); and interpolated modified Kneser-Ney smoothing (perplexity 180.2). On the one hand, an exponential prior is worse than a Gaussian prior in this case, and modified interpolated Kneser-Ney smoothing, within noise of a Gaussian prior, remains the best known smoothing technique (Chen and Goodman, 1999). On the other hand, searching for prior parameters is extremely time consuming, and Good-Turing is one of the few parameter-free smoothing methods. Of the three Good-Turing-based smoothing methods, the one inspired by exponential priors was the best.</Paragraph>
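For readers unfamiliar with the metric, the perplexities above are the standard quantity: 2 raised to the per-word cross-entropy of the test data, as the next paragraph notes. A minimal sketch of the computation follows; it is not the authors' evaluation code, and prob(word, history) is a hypothetical placeholder for a smoothed model's conditional probability function.

    import math

    def perplexity(test_words, prob):
        # Perplexity = 2 ** (average negative log2 probability per word).
        # `prob(word, history)` is a hypothetical callable returning the model's
        # conditional probability of `word` given its history (for a trigram
        # model, the two preceding words); it must be nonzero for every word.
        total_log2 = 0.0
        for i, word in enumerate(test_words):
            history = tuple(test_words[max(0, i - 2):i])
            total_log2 += math.log2(prob(word, history))
        return 2.0 ** (-total_log2 / len(test_words))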
<Paragraph position="9"> Note that perplexity is 2^entropy, and in general we have found that exponential priors work slightly worse on entropy measures than the Gaussian prior, even when they are better on accuracy. This may be because an exponential prior "throws away" some information whenever the λ would otherwise be negative. (In a pilot experiment with a variation that does not throw away information, the entropies are closer to those of the Gaussian.)</Paragraph>
</Section>
</Paper>