<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1661">
  <Title>Statistical Ranking in Tactical Generation</Title>
  <Section position="8" start_page="521" end_page="523" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> In this section we present contrastive results for the models defined in Section 2 above, evaluated against the exact match accuracy and word accuracy as described in Section 2.4.</Paragraph>
    <Paragraph position="1"> As can be seen in Table 6, both the MaxEnt and SVM learner does a much better job than the D2-gram model at identifying the correct reference strings. The two discriminative models perform very similarly, however, although the Max-Ent model often seems to do slightly better.</Paragraph>
    <Paragraph position="2"> When working with a cross-validation set-up the difference between the learners can conveniently be tested using an approach such as the cross-validated paired D8-test described by Dietterich (1998). We also tried this approach using the Wilcoxon Matched-Pairs Signed-Ranks test as a non-parametric alternative without the assumption of normality of differences made in the D8-test. However, none of the two tests found that the differences between the MaxEnt model and the SVM model were significant for AB BP BCBMBCBH (using two-sided tests).</Paragraph>
    <Paragraph position="3"> Note that, due to memory constraints, we only included a random sample of maximum 50 non-preferred realizations per item in the training data used for the SVM ranker. Even so, the SVM trained on the full 'Jotunheimen' data had a total of 66,621 example vectors in its training data, which spawned a total of 639,301 preference constraints with respect to the optimization problem of Equations 8 and 10. We did not try to maximize performance on the development data by repeatedly training with different random samples, but this might be one way to improve the results.</Paragraph>
    <Paragraph position="4"> Although we were only able to present results using linear kernels for the SVM ranker in this paper, preliminary experiments using a polynomial kernel seem to give promising results. Due to memory constraints and long convergence times, we were only able to train such a model on half of the 'Jotunheimen' data. However, when testing on the remaining half, it achieved an exact match accuracy of BJBDBMBCBFB1. This is comparable to the performance achieved by the linear SVM through full 10-fold training and testing. Moreover, there is reason to believe that these results will improve once we manage to train on the full data set.</Paragraph>
    <Paragraph position="5"> In order to assess the effect of increasing the size of the training set, Figure 3 presents learning curves for two MaxEnt configurations, viz. the basic configurational model and the one including all features but the language model. Each data point  ferent models. Data items are binned with respect to the number of distinct realizations.</Paragraph>
    <Paragraph position="6">  inative models are averages from 10-fold cross-validation. A model trained on the entire 'Jotunheimen' data was used when testing on 'Rondane'. Note that the training accuracy of the SVM learner on the 'Jotunheimen' training set is 91.69%, while it's 92.99% for the MaxEnt model.  though there appears to be a saturation effect in model performance with increasing amounts of 'Jotunheimen' training data, for the richer configuration (using all features but the language model) further enlarging the training data still seems attractive. null corresponds to average exact match performance for 10-fold cross-validation on 'Jotunheimen', but restricting the amount of training data presented to the learner to between 10 and 100 per cent of the total. At 60 per cent training data, the two models already perform at BIBCBMBIB1 and BIBKBMBGB1 accuracy, and the learning curves are starting to flatten out.</Paragraph>
    <Paragraph position="7"> Somewhat remarkably, the richer model including partial daughter back-off, grandparenting, and lexical type trigrams already outperforms the baseline model by a clear margin with just a small fraction of the training data, so the MaxEnt learner appears to make effective use of the greatly enlarged feature space.</Paragraph>
    <Paragraph position="8"> When testing against the 'Rondane' held-out set and comparing to performance on the 'Jotunheimen' cross-validation set, we see that the performance of both the MaxEnt model and the SVM degrades quite a bit. Of course, some drop in performance is to be expected as the estimation parameters had been tuned to this development set.</Paragraph>
    <Paragraph position="9"> Furthermore, as can be seen from Table 2, the baseline is also slightly lower for the 'Rondane' test set as the average number of realizations is higher. Also, while basically from the same domain, the two text collections differ noticeably in style: 'Jotunheimen' is based on edited, high-quality guide books; 'Rondane' has been gathered from a variety of web sites. Note, however, that the performance of the BNC D2-gram model seems to be more stable across the different data sets.</Paragraph>
    <Paragraph position="10"> In any case we see that, for our realization ranking task, the use of discriminative models in combination with structural features extracted from treebanks, clearly outperforms the surface oriented, generative D2-gram model. This is in spite of the relatively modest size of the treebanked training data available to the discriminative models. On the 'Rondane' test set the reduction in error rate for the combined MaxEnt model relative to the D2-gram LM, is BEBEBMBCBFB1. The error reduction for the SVM over the LM on 'Rondane' is BEBCBMBIBFB1.</Paragraph>
    <Paragraph position="11"> Another factor that is likely to be important for the differences in performance is the fact that the treebank data is better tuned to the domain of application or the test data. The D2-gram language model, on the other hand, was only trained on the general-domain BNC data. Note, however, that when testing on 'Rondane', we also tried to combine this general-domain model with an additional in-domain model trained only on the text that formed the basis of the 'Jotunheimen' treebank, a total of 5024 sentences. The optimal weights for linearly combining these two models were calculated using the interpolation tool in the CMU toolkit (using the expectation maximization (EM) algorithm, minimizing the perplexity on a held out data set of 330 sentences). However, when applied to the 'Rondane' test set, this in- null els, viz. the BNC LM only, the MaxEnt model by itself (using all feature types except the LM probability), and the combined MaxEnt model. The intermediate column corresponds to ties or partial errors, i.e. the number of items for which multiple candidates were ranked at the top, of which some were actually preferred and some not. Primarily this latter error type is reduced by including the LM feature in the MaxEnt universe.</Paragraph>
    <Paragraph position="12"> terpolated model failed to improve on the results achieved by just using the larger general-domain model alone. This is probably due to the small amount of domain specific data that we presently have available for training.</Paragraph>
    <Paragraph position="13"> Another observation about our D2-gram experiments that is worth a mention is that we found that ranking realizations according to non-normalized log probabilities directly resulted in much better accuracy than using a length normalized score such as the geometric mean.</Paragraph>
    <Paragraph position="14"> Finally, Table 7 breaks down per-item exact match errors for three distinct ranking configurations, viz. the BNC LM only, the structural Max-Ent model only, and the combined MaxEnt model, which includes the LM probability as an additional feature; all numbers are for application to the held-out 'Rondane' test set. Further contrasting the first two of these, the BNC LM yields 129 unique errors, in the sense that the structural Max-Ent makes the correct predictions on these items, contrasted to 98 unique errors in the structural MaxEnt model. When compared to the only 124 errors made equally by both rankers, we conclude that the different approaches have partially complementary strengths and weaknesses. This observation is confirmed in the relatively substantial improvement in ranking performance of the combined model on the 'Rondane' test: The exact match accuracies of the D2-gram model, the basic MaxEnt model and the combined model are BHBGBMBDBLB1, BHBLBMBGBFB1 and BIBGBMBEBKB1, respectively.</Paragraph>
  </Section>
class="xml-element"></Paper>