<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0403">
<Title>Active learning for HPSG parse selection</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Even with significant resources such as the Penn Treebank, a major bottleneck for improving statistical parsers has been the lack of sufficient annotated material from which to estimate their parameters. Most statistical parsing research, such as Collins (1997), has centered on training probabilistic context-free grammars using the Penn Treebank. For richer linguistic frameworks, such as Head-Driven Phrase Structure Grammar (HPSG), there is even less annotated material available for training stochastic parsing models. There is thus a pressing need to create significant volumes of annotated material in a logistically efficient manner. Even if it were possible to bootstrap from the Penn Treebank, it is still unlikely that there would be sufficient quantities of high-quality material. There has been a strong focus in recent years on using the active learning technique of selective sampling to reduce the amount of human-annotated training material needed to train models for various natural language processing tasks. The aim of selective sampling is to identify the most informative examples, according to some selection method, from a large pool of unlabelled material.</Paragraph>
<Paragraph position="1"> Such selected examples are then manually labelled. Selective sampling has typically been applied to classification tasks, but it has also been shown to reduce the number of examples needed to induce Lexicalized Tree Insertion Grammars from the Penn Treebank (Hwa, 2000).</Paragraph>
<Paragraph position="2"> The suitability of active learning for HPSG-type grammars has not yet been explored. This paper addresses the problem of minimizing the human effort expended in creating annotated training material for HPSG parse selection by using selective sampling. We do so in the context of Redwoods (Oepen et al., 2002), a treebank that contains HPSG analyses for sentences from the Verbmobil appointment scheduling and travel planning domains.</Paragraph>
<Paragraph position="3"> We show that sample selection metrics based on tree entropy (Hwa, 2000) and on disagreement between two different parse selection models significantly reduce the number of annotated sentences needed to match a given level of performance achieved by random selection. Furthermore, by combining these two methods as an ensemble selection method, we require even fewer examples, achieving a 60% reduction in the amount of annotated training material needed to outperform a model trained on randomly selected material. These results suggest that significant reductions in human effort can be realized through selective sampling when creating annotated material for linguistically rich grammar formalisms.</Paragraph>
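To make the selection metrics concrete, the sketch below shows one plausible form of the combined selective-sampling loop: score each unlabelled sentence by the entropy of a model's parse distribution plus the disagreement between two parse selection models, annotate the highest-scoring sentences, and retrain. All interfaces (parse_distribution, best_parse, retrain) and the particular combination rule are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of an ensemble selective-sampling loop, assuming
# hypothetical parse-selection model objects; none of these names are
# taken from the paper.
import math

def tree_entropy(parse_probs):
    """Entropy of a model's distribution over a sentence's candidate parses.

    Higher entropy means the model is less certain which analysis is
    correct, so the sentence is (by this metric) more informative to
    annotate. Some variants normalize by log(number of parses) to offset
    differing ambiguity rates across sentences.
    """
    return -sum(p * math.log(p) for p in parse_probs if p > 0.0)

def disagreement(model_a, model_b, sentence):
    """1 if the two parse selection models prefer different parses, else 0."""
    return int(model_a.best_parse(sentence) != model_b.best_parse(sentence))

def select_batch(pool, model_a, model_b, batch_size):
    """Rank unlabelled sentences by a combined score and return the top few.

    Summing the two metrics is just one plausible ensemble; the paper
    combines them, but not necessarily in this exact way.
    """
    def score(sentence):
        probs = model_a.parse_distribution(sentence)  # P(parse | sentence)
        return tree_entropy(probs) + disagreement(model_a, model_b, sentence)
    return sorted(pool, key=score, reverse=True)[:batch_size]

# Active learning loop: select, annotate, retrain, repeat.
# while pool:
#     batch = select_batch(pool, log_linear, perceptron, batch_size=50)
#     labelled += [annotate(s) for s in batch]  # human picks the correct parse
#     pool = [s for s in pool if s not in batch]
#     log_linear.retrain(labelled)
#     perceptron.retrain(labelled)
```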
<Paragraph position="4"> As the basis of our active learning approach, we create both log-linear and perceptron models, the latter of which has not previously been used for feature-based grammars.</Paragraph>
<Paragraph position="5"> We show that the different biases of the two types of models are sufficient to create diverse members for a committee, even when they use exactly the same features. With respect to the features used to train models, we demonstrate that a very simple feature selection strategy that ignores the proper structure of trees is competitive with one that fully respects tree configurations.</Paragraph>
<Paragraph position="6"> The structure of the paper is as follows. In sections 2 and 3, we briefly introduce active learning and the Redwoods treebank. Section 4 discusses the parse selection models that we use in the experiments. In sections 5 and 6, we explain the different selection methods that we use for active learning and explicate the setup in which the experiments were conducted. Finally, the results of the experiments are presented and discussed in section 7.</Paragraph>
</Section>
</Paper>