<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0403"> <Title>Active learning for HPSG parse selection</Title> <Section position="8" start_page="0" end_page="0" type="evalu"> <SectionTitle> 7 Results </SectionTitle> <Paragraph position="0"> Figure 4 shows the performance of the LL-CONFIG model as more examples are chosen according to the different selection methods. As can be seen, tree entropy and disagreement are equally effective, and both significantly improve on random selection.4 Selection by sentence length is worse than random until 2100 examples have been annotated. Selecting more ambiguous sentences does eventually perform significantly better than random, but its accuracy does not rise nearly as steeply as that of tree entropy and disagreement selection. (Footnote 4: LL-CONFIG was paired with LL-NGRAM for preferred parse disagreement in Figure 4(a).)</Paragraph> <Paragraph position="1"> Table 4 shows the precise values for all methods using different amounts of annotated sentences. The accuracies for entropy and disagreement are statistically significant improvements over random selection. Using a pair-wise t-test, the values for 500, 1000, and 2000 examples are significant at 99% confidence, and those for 3000 at a slightly lower confidence.5 Compared to random selection using 3000 examples, tree entropy and disagreement achieve higher accuracy while reducing the number of training examples needed by more than one half. Though selection by ambiguity does provide a reduction over random selection, it does not enjoy the same rapid increase as tree entropy and disagreement, and it performs roughly equal to or worse than random until 1100 examples, as is evident in Figure 4(b).</Paragraph> <Paragraph position="2"> [Table caption fragment: ...selection methods to outperform random selection with 3000 examples. The final column gives the percentage reduction in the number of examples used.]</Paragraph> <Paragraph position="3"> We also tested preferred parse disagreement by pairing LL-CONFIG with the perceptrons.
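As a concrete illustration, tree-entropy selection of the kind evaluated here can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names and the assumption that the model exposes a normalized probability distribution over each sentence's candidate parses are mine.

```python
import math

def tree_entropy(parse_probs):
    """Shannon entropy (in bits) of a model's probability
    distribution over the candidate parses of one sentence."""
    return -sum(p * math.log2(p) for p in parse_probs if p > 0)

def select_by_entropy(pool, parse_dist, n):
    """Return the n unannotated sentences whose parse distributions
    have the highest entropy, i.e. where the model is least certain.
    `parse_dist` (hypothetical) maps a sentence to its parse
    probabilities."""
    ranked = sorted(pool, key=lambda s: tree_entropy(parse_dist(s)),
                    reverse=True)
    return ranked[:n]
```

A sentence with a near-uniform parse distribution is maximally uncertain and is selected ahead of one with a sharply peaked distribution.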
The performance in these cases was nearly identical to that given for selection by disagreement in Figure 4, which used LL-CONFIG and LL-NGRAM for the committee. This indicates that differences either in the algorithm or in the feature set used are enough to bias the learners sufficiently for them to disagree on informative examples. This provides flexibility for applying selection by disagreement in different contexts, where it may be easier to employ different feature sets than different algorithms, or vice versa. (Footnote 5: The slightly lower confidence for 3000 examples reflects the fact that the small size of the corpus leaves the selection techniques with fewer informative examples to choose from, and thereby less room to differentiate themselves from random selection.)</Paragraph> <Paragraph position="4"> [Figure caption fragment: ...random, ambiguity, and sentence length.]</Paragraph> <Paragraph position="5"> The fact that using the same feature set with different algorithms is effective for active learning is interesting and is echoed by similar findings for co-training (Goldman and Zhou, 2000).</Paragraph> <Paragraph position="6"> Given the similar performance of tree entropy and preferred parse disagreement, it is interesting to see whether they select essentially the same examples. One case where they might not overlap is a distribution with two sharp spikes, which would be likely to provide excellent discriminating information. Though such a distribution has low entropy, each model might be biased toward a different spike, so the example would be selected by disagreement. To test this, we ran a further experiment with a combined selection method that takes the intersection of tree entropy and disagreement. At each round, we randomly choose examples from the pool of unannotated sentences and sort them according to tree entropy, from highest to lowest.
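Preferred-parse disagreement can be sketched in the same style. The committee interface below (each member maps a sentence to per-parse scores) and all names are illustrative assumptions, not the paper's code; the idea is simply to keep sentences on which the members' top-ranked parses differ.

```python
def preferred_parse(scores):
    """Index of the top-scoring parse under one committee member."""
    return max(range(len(scores)), key=scores.__getitem__)

def select_by_disagreement(pool, committee, n):
    """Collect up to n sentences on which the committee members
    prefer different parses. `committee` (hypothetical interface)
    is a list of functions mapping a sentence to a list of
    per-parse scores."""
    chosen = []
    for sentence in pool:
        picks = {preferred_parse(member(sentence)) for member in committee}
        if len(picks) > 1:  # members disagree on the best parse
            chosen.append(sentence)
            if len(chosen) == n:
                break
    return chosen
```

As the text notes, the members can differ in learning algorithm, in feature set, or in both; any source of bias that makes their preferred parses diverge on informative examples will do.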
From the first 100 of these examples, we take the first n examples that are also selected by disagreement, varying the number selected in the same manner as in the previous experiments. When the size of the intersection is less than the number to be selected, we select the remainder according to tree entropy.</Paragraph> <Paragraph position="7"> The performance of combined selection is compared with entropy and random selection in Figure 5 and Table 6. There is a slight, though not significant, improvement over entropy on its own. The improvement over random is significant for all values, using a pair-wise t-test at 99% confidence. The combined approach requires 1200 examples on average to outperform random selection with 3000 examples, a 60.0% reduction that improves on either method on its own.</Paragraph> <Paragraph position="8"> [Table caption fragment: ...combined selection with different amounts of training data.]</Paragraph> <Paragraph position="9"> Tracking the examples chosen by tree entropy and disagreement at each round verifies that they do not select precisely the same examples. It thus appears that disagreement-based selection helps tease out examples that contain better discriminating information than other examples with higher entropy. This may in effect be approximating a more general method that could directly identify such examples.</Paragraph> <Paragraph position="10"> The accuracy of LL-CONFIG when using all 4802 available training examples for the tenfold cross-validation is 74.80%, and combined selection improves on this by reaching 75.26% (on average) with 3000 training examples. Furthermore, though active learning was halted at 3000 examples, the accuracy for all the selection methods was still increasing at this point, and it is likely that even higher accuracy would be achieved by allowing more examples to be selected.
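The combined procedure just described (rank by tree entropy, intersect the top candidates with those the committee disagrees on, and backfill from the entropy ranking) can be sketched as below. The window of 100 candidates follows the text; the callback names `parse_dist` and `disagrees` are assumptions, not the paper's interface.

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a parse probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def combined_select(pool, parse_dist, disagrees, n, window=100):
    """Sort the candidate pool by tree entropy (highest first), take
    from the top `window` candidates those the committee also
    disagrees on, and if fewer than n qualify, fill the remainder
    by entropy rank alone."""
    ranked = sorted(pool, key=lambda s: entropy(parse_dist(s)),
                    reverse=True)
    chosen = [s for s in ranked[:window] if disagrees(s)][:n]
    if len(chosen) < n:
        chosen += [s for s in ranked if s not in chosen][: n - len(chosen)]
    return chosen
```

The disagreement filter thus acts as a tie-breaker among high-entropy candidates, which matches the observation that it teases out the better-discriminating examples.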
Sample selection thus appears to identify highly informative subsets as well as to reduce the number of examples needed.</Paragraph> <Paragraph position="11"> Finally, we considered one further question regarding the behavior of sample selection under different conditions: can an impoverished model select informative examples for a more capable one? That is, if active learning is actually used to extend a corpus, will the examples selected for annotation still be of high utility if we later devise a better feature selection strategy that gives rise to better models? [Figure caption fragment: ...tree entropy alone and tree entropy combined with preferred parse disagreement.]</Paragraph> <Paragraph position="12"> To test this, we created a log-linear model that uses only bigrams, used it to select examples by tree entropy, and simultaneously trained and tested LL-CONFIG on those examples. Using all training material, the bigram model performs much worse than LL-CONFIG overall: 71.43% versus 74.80%.</Paragraph> <Paragraph position="13"> LL-CONFIG is thus a sort of passenger of the weaker bigram model, which drives the selection process. Figure 6 compares the accuracy of LL-CONFIG under this condition (which involved only one tenfold cross-validation run) with its accuracy when LL-CONFIG itself chooses examples according to tree entropy. Random selection is also included for reference.</Paragraph> <Paragraph position="14"> [Figure caption fragment: ...tree entropy according to LL-CONFIG itself and when LL-CONFIG is the passenger of an impoverished model.]</Paragraph> <Paragraph position="15"> This experiment demonstrates that although accuracy does not rise as quickly as when LL-CONFIG itself selects examples, it is still significantly better than random (at 95% confidence) despite the bigram model's poorer performance. We can thus expect samples chosen by the current best model to be informative, though not necessarily optimal, for improved models in the future.</Paragraph> </Section> </Paper>