<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2101">
  <Title>Minimum Risk Annealing for Training Log-Linear Models</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Direct Minimization of Error
</SectionTitle>
    <Paragraph position="0"> Researchers in empirical natural language processinghaveexpendedsubstantialinkandeffortin null developing metrics to evaluate systems automatically against gold-standard corpora. The ongoing evaluationliteratureisperhapsmostobviousinthe machine translation community's efforts to better BLEU (Papineni et al., 2002).</Paragraph>
    <Paragraph position="1"> Despite this research, parsing or machine translation systems are often trained using the much simpler and harsher metric of maximum likelihood. One reason is that in supervised training, the log-likelihood objective function is generally convex, meaning that it has a single global maximum that can be easily found (indeed, for supervised generative models, the parameters at this maximum may even have a closed-form solution).</Paragraph>
    <Paragraph position="2"> In contrast to the likelihood surface, the error surface for discrete structured prediction is not only riddled with local minima, but piecewise constant [?]This work was supported by an NSF graduate research fellowship for the first author and by NSF ITR grant IIS0313193 and ONR grant N00014-01-1-0685. The views expressed are not necessarily endorsed by the sponsors. We thank Sanjeev Khudanpur, Noah Smith, Markus Dreyer, and the reviewers for helpful discussions and comments.</Paragraph>
    <Paragraph position="3"> and not everywhere differentiable with respect to the model parameters (Figure 1). Despite these difficulties, some work has shown it worthwhile to minimize error directly (Och, 2003; Bahl et al., 1988).</Paragraph>
    <Paragraph position="4"> We show improvements over previous work on error minimization by minimizing the risk or expected error--a continuous function that can be  derivedbycombiningthelikelihoodwithanyevaluation metric (SS2). Seeking to avoid local minima, deterministic annealing (Rose, 1998) gradually changes the objective function from a convex entropy surface to the more complex risk surface (SS3). We also discuss regularizing the objective function to prevent overfitting (SS4). We explain how to compute expected loss under some evaluation metrics common in natural language tasks (SS5). We then apply this machinery to training log-linear combinations of models for dependency parsing and for machine translation (SS6). Finally, we note the connections of minimum risk training to max-margin training and minimum Bayes risk decoding (SS7), and recapitulate our results (SS8).</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="788" type="metho">
    <SectionTitle>
2 Training Log-Linear Models
</SectionTitle>
    <Paragraph position="0"> In this work, we focus on rescoring with log-linear models. In particular, our experiments consider log-linear combinations of a relatively small number of features over entire complex structures, such as trees or translations, known in some previous work as products of experts (Hinton, 1999) or logarithmic opinion pools (Smith et al., 2005).</Paragraph>
    <Paragraph position="1"> A feature in the combined model might thus be a log probability from an entire submodel. Giving this feature a small or negative weight can discount a submodel that is foolishly structured, badly trained, or redundant with the other features.</Paragraph>
    <Paragraph position="2"> For each sentence xi in our training corpus S, we are given Ki possible analyses yi,1,...yi,Ki.</Paragraph>
    <Paragraph position="3"> (These may be all of the possible translations or parse trees; or only the Ki most probable under  tem: while other parameters are held constant, we vary the weights on the distortion and word penalty features. Note the piecewise constant regions with several local maxima. some other model; or only a random sample of size Ki.) Each analysis has a vector of real-valued features (i.e., factors, or experts) denoted fi,k. The score of the analysis yi,k is th * fi,k, the dot product of its features with a parameter vector th. For each sentence, we obtain a normalized probability distribution over the Ki analyses as</Paragraph>
    <Paragraph position="5"> We wish to adjust this model's parameters th to minimize the severity of the errors we make when using it to choose among analyses. A loss function Ly[?](y) assesses a penalty for choosing y when y[?] is correct. We will usually write this simply as L(y) since y[?] is fixed and clear from context. For clearer exposition, we assume below that the total loss over some test corpus is the sum of the losses on individual sentences, although we will revisit that assumption in SS5.</Paragraph>
    <Section position="1" start_page="787" end_page="787" type="sub_section">
      <SectionTitle>
2.1 Minimizing Loss or Expected Loss
</SectionTitle>
      <Paragraph position="0"> One training criterion directly mimics test conditions. It looks at the loss incurred if we choose the best analysis of each xi according to the model:</Paragraph>
      <Paragraph position="2"> Since small changes in th either do not change the best analysis or else push a different analysis to the top, this objective function is piecewise constant, hence not amenable to gradient descent.</Paragraph>
      <Paragraph position="3"> Och (2003) observed, however, that the piecewiseconstant property could be exploited to characterize the function exhaustively along any line in parameter space, and hence to minimize it globally along that line. By calling this global line minimization as a subroutine of multidimensional optimization, he was able to minimize (2) well enough to improve over likelihood maximization for training factored machine translation systems.</Paragraph>
      <Paragraph position="4"> Instead of considering only the best hypothesis for any th, we can minimize risk, i.e., the expected loss under pth across all analyses yi:</Paragraph>
      <Paragraph position="6"> This &amp;quot;smoothed&amp;quot; objective is now continuous and differentiable. However, it no longer exactly mimics test conditions, and it typically remains nonconvex, so that gradient descent is still not guaranteed to find a global minimum. Och (2003) found that such smoothing during training &amp;quot;gives almost identical results&amp;quot; on translation metrics.</Paragraph>
      <Paragraph position="7"> The simplest possible loss function is 0/1 loss, where L(y) is 0 if y is the true analysis y[?]i and 1 otherwise. This loss function does not attempt to give partial credit. Even in this simple case, assuming P negationslash= NP, there exists no general polynomial-time algorithm for even approximating (2) to within any constant factor, even for Ki = 2 (Hoffgen et al., 1995, from Theorem 4.10.4).1 The same is true for for (3), since for Ki = 2 it can be easily shown that the min 0/1 risk is between 50% and 100% of the min 0/1 loss.</Paragraph>
    </Section>
    <Section position="2" start_page="787" end_page="788" type="sub_section">
      <SectionTitle>
2.2 Maximizing Likelihood
</SectionTitle>
      <Paragraph position="0"> Rather than minimizing a loss function suited to the task, many systems (especially for language modeling) choose simply to maximize the probability of the gold standard. The log of this likelihood is a convex function of the parameters th:</Paragraph>
      <Paragraph position="2"> where y[?]i is the true analysis of sentence xi. The only wrinkle is that pth(y[?]i  |xi) may be left undefined by equation (1) if y[?]i is not in our set of Ki hypotheses. When maximizing likelihood, therefore, we will replace y[?]i with the min-loss analysis in the hypothesis set; if multiple analyses tie  1Knownalgorithmsareexponentialbutonlyinthedimensionality of the feature space (Johnson and Preparata, 1978).  [?]10 [?]5 0 5 10  weight varies: the gray line (g = [?]) shows true BLEU (to be optimizedinequation(2)). Theblacklinesshowtheexpected BLEU as g in equation (5) increases from 0.1 toward[?]. for this honor, we follow Charniak and Johnson (2005) in summing their probabilities.2 Maximizing (4) is equivalent to minimizing an upper bound on the expected 0/1 loss summationtexti(1 [?] pth(y[?]i  |xi)). Though the log makes it tractable, this remains a 0/1 objective that does not give partial credit to wrong answers, such as imperfect but useful translations. Most systems should be evaluated and preferably trained on less harsh metrics.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="788" end_page="788" type="metho">
    <SectionTitle>
3 Deterministic Annealing
</SectionTitle>
    <Paragraph position="0"> To balance the advantages of direct loss minimization, continuous risk minimization, and convex optimization, deterministic annealing attempts the solution of increasingly difficult optimization problems (Rose, 1998). Adding a scale hyperparameter g to equation (1), we have the following family of distributions:</Paragraph>
    <Paragraph position="2"> When g = 0, all yi,k are equally likely, giving the uniform distribution; when g = 1, we recover the model in equation (1); and as g - [?], we approach the winner-take-all Viterbi function that assigns probability 1 to the top-scoring analysis.</Paragraph>
    <Paragraph position="3"> For a fixed g, deterministic annealing solves</Paragraph>
    <Paragraph position="5"> 2An alternative would be to artificially add y[?] i (e.g., the reference translation(s)) to the hypothesis set during training. We then increase g according to some schedule and optimize th again. When g is low, the smooth objective might allow us to pass over local minima thatcould open upat higherg. Figure3 shows how the smoothing is gradually weakened to reach the risk objective (3) as g - 1 and approach the true error objective (2) as g - [?].</Paragraph>
    <Paragraph position="6"> Our risk minimization most resembles the work of Rao and Rose (2001), who trained an isolated-word speech recognition system for expected word-error rate. Deterministic annealing has also been used to tackle non-convex likelihood surfaces in unsupervised learning with EM (Ueda and Nakano, 1998; Smith and Eisner, 2004). Other work on &amp;quot;generalized probabilistic descent&amp;quot; minimizes a similar objective function but with g held constant (Katagiri et al., 1998).</Paragraph>
    <Paragraph position="7"> Although the entropy is generally higher at lower values of g, it varies as the optimization changes th. In particular, a pure unregularized log-linear model such as (5) is really a function ofg*th, so the optimizer could exactly compensate for increased g by decreasing the th vector proportionately!3 Most deterministic annealing procedures, therefore, express a direct preference on the entropy H, and choose g and th accordingly:</Paragraph>
    <Paragraph position="9"> In place of a schedule for raising g, we now use a cooling schedule to lower T from [?] to [?][?], thereby weakening the preference for high entropy. The Lagrange multiplier T on entropy is called &amp;quot;temperature&amp;quot; due to a satisfying connection to statistical mechanics. Once T is quite cool, it is common in practice to switch to raising g directly and rapidly (quenching) until some convergence criterion is met (Rao and Rose, 2001).</Paragraph>
  </Section>
  <Section position="6" start_page="788" end_page="789" type="metho">
    <SectionTitle>
4 Regularization
</SectionTitle>
    <Paragraph position="0"> Informally, high temperature or g &lt; 1 smooths our model during training toward higher-entropy conditional distributions that are not so peaked at the desired analyses y[?]i . Another reason for such smoothing is simply to prevent overfitting to these training examples.</Paragraph>
    <Paragraph position="1"> A typical way to control overfitting is to use a quadratic regularizing term, ||th||2 or more generallysummationtextd th2d/2s2d. Keepingthissmallkeepsweights  low and entropy high. We may add this regularizer to equation (6) or (7). In the maximum likelihood framework, we may subtract it from equation (4), which is equivalent to maximum a posteriori estimation with a diagonal Gaussian prior (Chen and Rosenfeld, 1999). The variance s2d may reflect a prior belief about the potential usefulness of feature d, or may be tuned on heldout data.</Paragraph>
    <Paragraph position="2"> Another simple regularization method is to stop cooling before T reaches 0 (cf. Elidan and Friedman (2005)). If loss on heldout data begins to increase, we may be starting to overfit. This technique can be used along with annealing or quadratic regularization and can achieve additional accuracy gains, which we report elsewhere (Dreyer et al., 2006).</Paragraph>
  </Section>
  <Section position="7" start_page="789" end_page="790" type="metho">
    <SectionTitle>
5 Computing Expected Loss
</SectionTitle>
    <Paragraph position="0"> At each temperature setting of deterministic annealing, we need to minimize the expected loss on the training corpus. We now discuss how this expectation is computed. When rescoring, we assume that we simply wish to combine, in some way, statistics of whole sentences4 to arrive at the overall loss for the corpus. We consider evaluation metrics for natural language tasks from two broadly applicable classes: linear and nonlinear.</Paragraph>
    <Paragraph position="1"> A linear metric is a sum (or other linear combination) of the loss or gain on individual sentences. Accuracy--in dependency parsing, part-of-speech tagging, and other labeling tasks--falls into this class, as do recall, word error rate in ASR, and the crossing-brackets metric in parsing. Thanks to thelinearityofexpectation, wecaneasilycompute our expected loss in equation (6) by adding up the expected loss on each sentence.</Paragraph>
    <Paragraph position="2"> Some other metrics involve nonlinear combinations over the sentences of the corpus. One common example is precision, P def= summationtexti ci/summationtexti ai, where ci is the number of correctly posited elements, and ai is the total number of posited elements, in the decoding of sentence i. (Depending on the task, the elements may be words, bigrams, labeled constituents, etc.) Our goal is to maximize P, so during a step of deterministic annealing, we need to maximize the expectation of P when the sentences are decoded randomly according to equation (5). Although this expectation is continuous and differentiable as a function of 4Computing sentence xi's statistics usually involves iterating over hypotheses yi,1,...yi,Ki. If these share substructure in a hypothesis lattice, dynamic programming may help. th, unfortunately it seems hard to compute for any given th. We observe however that an equivalent goal is to minimize [?]logP. Taking that as our loss function instead, equation (6) now needs to minimize the expectation of [?]logP,5 which decomposes somewhat more nicely:</Paragraph>
    <Paragraph position="4"> where the integer random variables A = summationtexti ai and C = summationtexti ci count the number of posited and correctly posited elements over the whole corpus.</Paragraph>
    <Paragraph position="5"> To approximate E[g(A)], where g is any twicedifferentiable function (here g = log), we can approximate g locally by a quadratic, given by the</Paragraph>
    <Paragraph position="7"> Here uA = summationtexti uai and s2A = summationtexti s2ai, since A is a sum of independent random variables ai (i.e., given the current model parameters th, our randomized decoder decodes each sentence independently). In other words, given our quadratic approximation to g, E[g(A)] depends on the (true) distribution of A only through the single-sentence means uai and variances s2ai, which can be found by enumerating the Ki decodings of sentence i.</Paragraph>
    <Paragraph position="8"> The approximation becomes arbitrarily good as we anneal g - [?], since then s2A - 0 and E[g(A)] focuses on g near uA. For equation (8),</Paragraph>
    <Paragraph position="10"> and E[logC] is found similarly.</Paragraph>
    <Paragraph position="11"> Similar techniques can be used to compute the expected logarithms of some other non-linear metrics, such as F-measure (the harmonic mean of precision and recall)6 and Papineni et al. (2002)'s  BLEU translation metric (the geometric mean of several precisions). In particular, the expectation of log BLEU distributes over its N +1 summands:</Paragraph>
    <Paragraph position="13"> where Pn is the precision of the n-gram elements in the decoding.7 As is standard in MT research, we take wn = 1/N and N = 4. The first term in the BLEU score is the log brevity penalty, a continuous function of A1 (the total number of uni-gram tokens in the decoded corpus) that fires only ifA1 &lt; r (the average word count of the reference corpus). We again use a Taylor series to approximate the expected log brevity penalty.</Paragraph>
    <Paragraph position="14"> Wementionanalternativewaytocompute(say) theexpectedprecisionC/A: integratenumerically over the joint density of C and A. How can we obtain this density? As (C,A) = summationtexti(ci,ai) is a sum of independent random length-2 vectors, its mean vector and 2 x 2 covariance matrix can be respectively found by summing the means and covariancematricesofthe(ci,ai), eachexactlycomputed from the distribution (5) over Ki hypotheses. We can easily approximate (C,A) by the (continuous) bivariate normal with that mean and covariance matrix8--or else accumulate an exact representation of its (discrete) probability mass function by a sequence of numerical convolutions.</Paragraph>
  </Section>
class="xml-element"></Paper>