<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2101">
  <Title>Minimum Risk Annealing for Training Log-Linear Models</Title>
  <Section position="8" start_page="790" end_page="792" type="evalu">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"> We tested the above training methods on two different tasks: dependency parsing and phrase-based machine translation. Since the basic setup was the same for both, we outline it here before describing the tasks in detail.</Paragraph>
    <Paragraph position="1"> In both cases, we start with 8 to 10 models (the &amp;quot;experts&amp;quot;) already trained on separate training data. To find the optimal coefficients th for a log-linear combination of these experts, we use separate development data, using the following procedure due to Och (2003):  1. Initialization: Initializeth tothe0vector. For each development sentence xi, set its Ki-best list to [?] (thus Ki = 0).</Paragraph>
    <Paragraph position="2"> 7BLEU is careful when measuring ci on a particular decoding yi,k. It only counts the first two copies of the (e.g.) as correct if the occurs at most twice in any reference translation of xi. This &amp;quot;clipping&amp;quot; does not affect the rest of our method. 8Reasonable for a large corpus, by Lyapunov's central limit theorem (allows non-identically distributed summands). 2. Decoding: For each development sentence xi, use the current th to extract the 200 analyses yi,k with the greatest scores expth *fi,k.</Paragraph>
    <Paragraph position="3"> Calcuate each analysis's loss statistics (e.g., ci and ai), and add it to the Ki-best list if it is not already there.</Paragraph>
    <Paragraph position="4"> 3. Convergence: If Ki has not increased for any development sentence, or if we have reached our limit of 20 iterations, stop: the search has converged.</Paragraph>
    <Paragraph position="5"> 4. Optimization: Adjust th to improve our ob null jective function over the whole development corpus. Return to step 2.</Paragraph>
    <Paragraph position="6"> Our experiments simply compare three procedures at step 4. We may either * maximize log-likelihood (4), a convex function, at a given level of quadratic regularization, by BFGS gradient descent; * minimize error (2) by Och's line search method, which globally optimizes each component of th while holding the others constant;9 or * minimize the same error (2) more effectively, by raising g - [?] while minimizing the annealed risk (6), that is, coolingT - [?][?](or g - [?]) and at each value, locally minimizing equation (7) using BFGS.</Paragraph>
    <Paragraph position="7"> Since these different optimization procedures will usually find different th at step 4, their K-best lists will diverge after the first iteration.</Paragraph>
    <Paragraph position="8"> For final testing, we selected among several variants of each procedure using a separate small heldout set. Final results are reported for a larger, disjoint test set.</Paragraph>
    <Section position="1" start_page="790" end_page="792" type="sub_section">
      <SectionTitle>
6.1 Machine Translation
</SectionTitle>
      <Paragraph position="0"> For our machine translation experiments, we trained phrase-based alignment template models of Finnish-English, French-English, and German-English, as follows. For each language pair, we aligned 100,000 sentence pairs from European Parliament transcripts using GIZA++. We then used Philip Koehn's phrase extraction software to merge the GIZA++ alignments and to extract 9The component whose optimization achieved the lowest loss is then updated. The process iterates until no lower loss can be found. In contrast, Papineni (1999) proposed a linear programming method that may search along diagonal lines.  and score the alignment template model's phrases (Koehn et al., 2003).</Paragraph>
      <Paragraph position="1"> The Pharaoh phrase-based decoder uses precisely the setup of this paper. It scores a candidate translation (including its phrasal alignment to the original text) as th * f, where f is a vector of the following 8 features:  1. the probability of the source phrase given the target phrase 2. the probability of the target phrase given the source phrase 3. the weighted lexical probability of the source words given the target words 4. the weighted lexical probability of the target words given the source words 5. a phrase penalty that fires for each template in the translation 6. a distortion penalty that fires when phrases translate out of order 7. a word penalty that fires for each English  word in the output 8. a trigram language model estimated on the English side of the bitext Our goal was to train the weights th of these 8 features. We used the method described above, employing the Pharaoh decoder at step 2 to generate the 200-best translations according to the current th. As explained above, we compared three procedures at step 4: maximum log-likelihood by gradient ascent; minimum error using Och's line-search method; and annealed minimum risk. As our development data for training th, we used 200 sentence pairs for each language pair.</Paragraph>
      <Paragraph position="2"> Since our methods can be tuned with hyperparameters, we used performance on a separate 200-sentence held-out set to choose the best hyper-parameter values. The hyperparameter levels for each method were * maximum likelihood: a Gaussian prior with all s2d at 0.25, 0.5, 1, or [?] * minimum error: 1, 5, or 10 different random starting points, drawn from a uniform  sentence test corpora, after training the 8 experts on 100,000 sentence pairs and fitting their weights th on 200 more, using settingstunedonafurther200. Thecurrentminimumriskannealing method achieved significant improvements over minimum error and maximum likelihood at or below the 0.001 level, using a permutation test with 1000 replications. distribution on [[?]1,1]x[[?]1,1]x***, when optimizing th at an iteration of step 4.10 * annealed minimum risk: with explicit entropy constraints, starting temperature T [?] {100,200,1000}; stopping temperature T [?] {0.01,0.001}. The temperature was cooled by half at each step; then we quenched by doubling g at each step. (We also ran experiments with quadratic regularization with all s2d at 0.5, 1, or 2 (SS4) in addition to the entropy constraint. Also, instead of the entropy constraint, we simply annealed on g while adding a quadratic regularization term. None of these regularized models beat the best setting of standard deterministic annealing on heldout or test data.) Finalresultsonaseparate2000-sentencetestset are shown in table 1. We evaluated translation using BLEU with one reference translation and n-grams up to 4. The minimum risk annealing procedure significantly outperformed maximum likelihood and minimum error training in all three language pairs (p &lt; 0.001, paired-sample permutation test with 1000 replications).</Paragraph>
      <Paragraph position="3"> Minimum risk annealing generally outperformed minimum error training on the held-out set, regardless of the starting temperatureT. However, higher starting temperatures do give better performance and a more monotonic learning curve (Figure 3), a pattern that held up on test data. (In the same way, for minimum error training, 10That is, we run step 4 from several starting points, finishingatseveraldifferentpoints; wepickthefinishingpointwith lowest development error (2). This reduces the sensitivity of this method to the starting value of th. Maximum likelihood is not sensitive to the starting value of th because it has only a global optimum; annealed minimum risk is not sensitive to it either, because initially g [?] 0, making equation (6) flat.  imum risk training with different starting temperatures, versus minimum error training with 10 random restarts.  restarts vs. only 1.</Paragraph>
      <Paragraph position="4"> more random restarts give better performance and a more monotonic learning curve--see Figure 4.) Minimum risk annealing did not always win on the training set, suggesting that its advantage is not superior minimization but rather superior generalization: under the risk criterion, multiple lowloss hypotheses per sentence can help guide the learner to the right part of parameter space.</Paragraph>
      <Paragraph position="5"> Although the components of the translation and languagemodelsinteractincomplexways, theimprovement on Finnish-English may be due in part to the higher weight that minimum risk annealing found for the word penalty. That system is therefore more likely to produce shorter output like i have taken note of your remarks and i also agree with that . than like this longer output from the minimum-error-trained system: i have taken note of your remarks and i shall also agree with all that the union .</Paragraph>
      <Paragraph position="6"> We annealed using our novel expected-BLEU approximation from SS5. We found this to perform significantly better on BLEU evaluation than if we trained with a &amp;quot;linearized&amp;quot; BLEU that summed per-sentence BLEU scores (as used in minimum Bayes risk decoding by Kumar and Byrne (2004)).</Paragraph>
    </Section>
    <Section position="2" start_page="792" end_page="792" type="sub_section">
      <SectionTitle>
6.2 Dependency Parsing
</SectionTitle>
      <Paragraph position="0"> We trained dependency parsers for three different languages: Bulgarian, Dutch, and Slovenian.11 Input sentences to the parser were already tagged for parts of speech. Each parser employed 10 experts, each parameterized as a globally normalized log-linear model (Lafferty et al., 2001). For example, the 9th component of the feature vector fi,k (which described the kth parse of the ith sentence) was the log of that parse's normalized probability according to the 9th expert.</Paragraph>
      <Paragraph position="1"> Each expert was trained separately to maximize the conditional probability of the correct parse given the sentence. We used 10 iterations of gradient ascent. To speed training, for each of the first 9 iterations, the gradient was estimated on a (different) sample of only 1000 training sentences.</Paragraph>
      <Paragraph position="2"> We then trained the vector th, used to combine the experts, to minimize the number of labeled dependency attachment errors on a 200-sentence development set. Optimization proceeded over lists of the 200-best parses of each sentence produced by a joint decoder using the 10 experts.</Paragraph>
      <Paragraph position="3"> Evaluating on labeled dependency accuracy on 200 test sentences for each language, we see that minimum error and annealed minimum risk training are much closer than for MT. For Bulgarian and Dutch, they are statistically indistinguishable using a paired-sample permutations test with 1000 replications. Indeed, on Dutch, all three optimization procedures produce indistinguishable results. On Slovenian, annealed minimum risk training does show a significant improvement over the other two methods. Overall, however, the results for this task are mediocre. We are still working on improving the underlying experts.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>