<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0403"> <Title>Active learning for HPSG parse selection</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Active Learning </SectionTitle> <Paragraph position="0"> Active learning attempts to reduce the number of examples needed for training statistical models by allowing the machine learner to participate directly in creating the corpus it uses. There are several approaches to active learning; here, we focus on selective sampling (Cohn et al., 1994), which involves identifying the most informative examples from a pool of unlabelled data and presenting only these examples to a human expert for annotation. The two main flavors of selective sampling are certainty-based methods and committee-based methods (Thompson et al., 1999). For certainty-based selection, the examples chosen for annotation are those for which a single learner is least confident, as determined by some criterion. Committee-based selection involves groups of learners that each maintain different hypotheses about the problem; examples on which the learners disagree in some respect are typically regarded as the most informative. Active learning has been successfully applied to a number of natural language tasks, including text categorization (Lewis and Gale, 1994) and part-of-speech tagging (Engelson and Dagan, 1996). Hwa (2000) shows that certainty-based selective sampling can reduce the amount of training material needed for inducing Probabilistic Lexicalized Tree Insertion Grammars by 36% without degrading the quality of the grammars. Like Hwa, we investigate active learning for parsing and thus seek informative sentences; however, rather than inducing grammars, our task is to select the best parse from the output of an existing hand-crafted grammar by using the Redwoods treebank.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Redwoods Treebank </SectionTitle> <Paragraph position="0"> The English Resource Grammar (ERG, Flickinger (2000)) is a broad-coverage HPSG grammar that provides deep semantic analyses of sentences but, because of its purely symbolic nature, has no means to prefer some analyses over others. To address this limitation, the Redwoods treebank has been created to provide annotated training material so that statistical models for ambiguity resolution can be combined with the precise interpretations produced by the ERG (Oepen et al., 2002).</Paragraph> <Paragraph position="1"> Whereas the Penn Treebank has an implicit grammar underlying its parse trees, Redwoods uses the ERG explicitly. For each utterance, Redwoods enumerates the set of analyses, represented as derivation trees, licensed by the ERG and identifies which analysis is the preferred one. For example, Figure 1 shows the preferred derivation tree, out of three ERG analyses, for what can I do for you?. [Figure 1: The preferred derivation tree for what can I do for you?. The node labels are the names of the ERG rules used to build the analysis.]</Paragraph> <Paragraph position="2"> From such derivation trees, the parse trees and semantic interpretations can be recovered using an HPSG parser.</Paragraph> <Paragraph position="3"> Redwoods is (semi-automatically) updated after changes have been made to the ERG, and it has thus far gone through three growths.
Some salient characteristics of the first and third growths are given in Table 1 for utterances for which a unique preferred parse has been identified and for which there are at least two analyses.1 The ambiguity increased considerably between the first and third growths, reflecting the increased coverage of the ERG for more difficult sentences.</Paragraph> <Paragraph position="4"> [Footnote 1: There are over 1400 utterances in both versions for which the ERG produces only one analysis and which are therefore irrelevant for parse selection. They contain no discriminating information and are thus not useful for the machine learning algorithms discussed in the next section.] [Table 1: The Redwoods subsets used for the parse selection task. The columns indicate the number of sentences in each subset, their average length, and their average number of parses.]</Paragraph> <Paragraph position="5"> The small size of the treebank makes it essential to explore the possibility of using methods such as active learning to speed the creation of more annotated material for training parse selection models.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Parse Selection </SectionTitle> <Paragraph position="0"> Committee-based active learning requires multiple learners with different biases that sometimes cause them to make different predictions. As in co-training, one way such diverse learners can be created is by using independent or partially independent feature sets to reduce the error correlation between the learners. Another way is to use different machine learning algorithms trained on the same feature set. In this section, we discuss the two feature sets and two machine learning algorithms that are used to produce four distinct models, and we give their overall performance on the parse selection task.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Features </SectionTitle> <Paragraph position="0"> Our two feature sets are created by using only the derivation trees made available in Redwoods. The configurational set is loosely based on the derivation tree features given by Toutanova and Manning (2002), and thus encodes standard relations such as grandparent-of and left-sibling for the nodes in the tree. The ngram set is created by flattening derivation trees and treating them as strings of rule names over which ngrams are extracted, taking up to four rule names at a time and including the number of intervening parentheses between them. We ignore orthographic values for both feature sets.</Paragraph> <Paragraph position="1"> As examples of typical ngram features, the derivation tree given in Figure 1 generates features such as those depicted in Figure 2. [Figure 2: Example ngram features extracted from the derivation tree in Figure 1.] Such features provide a reasonable approximation of trees that implicitly encodes many of the interesting relationships that are typically gathered from them, such as grandparent and sibling relations. They also capture further relationships that cross the brackets of the actual tree, providing some more long-distance relationships than the configurational features.</Paragraph> </Section>
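To make the ngram feature scheme concrete, here is a minimal sketch of how such features could be extracted, assuming derivation trees are given as bracketed strings of rule names. The tokenisation, the feature encoding, and the rule names in the toy example are illustrative assumptions, not taken from the ERG or from the paper.

```python
# Illustrative sketch: ngram features over a flattened derivation tree,
# recording the number of intervening parentheses between rule names.
import re

def flatten(derivation):
    """Split a bracketed derivation string into rule names and parentheses."""
    return re.findall(r"[()]|[^\s()]+", derivation)

def ngram_features(derivation, max_n=4):
    """Emit ngrams of up to max_n rule names; consecutive names are annotated
    with the count of parentheses that separate them in the flattened tree."""
    tokens = flatten(derivation)
    names = [(i, tok) for i, tok in enumerate(tokens) if tok not in "()"]
    features = []
    for start in range(len(names)):
        for n in range(1, max_n + 1):
            window = names[start:start + n]
            if len(window) < n:
                break
            parts = [window[0][1]]
            for (i_prev, _), (i_next, name) in zip(window, window[1:]):
                gap = sum(1 for t in tokens[i_prev + 1:i_next] if t in "()")
                parts.append(f"{gap}:{name}")
            features.append("|".join(parts))
    return features

# Toy derivation tree with invented rule names.
tree = "(root (whatvp (hcomp can i)) (hcomp do (hcomp for you)))"
print(ngram_features(tree)[:6])
```

Because the gap counts preserve how many brackets are crossed between rule names, such flattened ngrams can encode relations (e.g., parent or sibling configurations) without walking the tree itself, which is the property the paragraph above attributes to the ngram set.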
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Algorithms </SectionTitle> <Paragraph position="0"> We use both log-linear and perceptron algorithms to create parse selection models. Both frameworks use iterative procedures to determine the weights $\lambda_j$ from annotated training material. Though they are otherwise quite different, this commonality facilitates their use in a committee since they can work with the same training material. When preparing the training material, we record observations about the distribution of analyses with a binary distinction that simply identifies the preferred parse, rather than using a full regression approach that recognizes similarities between the preferred parse and some of the dispreferred ones.</Paragraph> <Paragraph position="1"> Log-linear models have previously been used for stochastic unification-based grammars by Johnson et al. (1999) and Osborne (2000). Using Redwoods-1, Toutanova and Manning (2002) have shown that log-linear models for parse selection considerably outperform PCFG models trained on the same features. By using features based on both derivation trees and semantic dependency trees, they achieved 83.32% exact match whole-sentence parse selection with an ensemble of log-linear models that used different subsets of the feature space.</Paragraph> <Paragraph position="2"> As is standard for parse selection using log-linear modelling, we model the probability of an analysis $t_i$ given a sentence $s$ with the set of analyses $\tau = \{t_1, \ldots, t_m\}$ as $$P(t_i \mid s) = \frac{\exp\bigl(\sum_j \lambda_j f_j(t_i)\bigr)}{Z_s}$$ where $f_j(t_i)$ returns the number of times feature $j$ occurs in analysis $t_i$ and $Z_s$ is a normalization factor for the sentence. The parse with the highest probability is taken as the preferred parse for the model.2 [Footnote 2: When only an absolute ranking of analyses is required, it is unnecessary to exponentiate and compute $Z_s$.] We use the limited memory variable metric algorithm (Malouf, 2002) to determine the weights.</Paragraph> <Paragraph position="3"> Perceptrons have been used by Collins and Duffy (2002) to re-rank the output of a PCFG, but have not previously been applied to feature-based grammars. Standard perceptrons assign a score rather than a probability to each analysis. Scores are computed by taking the inner product of the analysis' feature vector with the parameter vector: $$\mathrm{score}(t_i) = \sum_j \lambda_j f_j(t_i)$$ The preferred parse is that with the highest score out of all analyses of a sentence.</Paragraph> </Section>
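As a rough illustration of how the two model types rank analyses from the same feature counts, the sketch below computes the perceptron-style inner-product score and the normalized conditional log-linear distribution for one sentence's parse set. It is not the authors' implementation; the feature dictionaries and weights are invented for the example.

```python
# Minimal sketch of conditional log-linear vs. perceptron parse scoring,
# assuming each analysis is a dict of feature counts and weights is a dict
# mapping feature names to learned values.
import math

def linear_score(weights, feats):
    """Inner product of the feature-count vector with the parameter vector
    (the perceptron score; also the unnormalized log-linear score)."""
    return sum(weights.get(f, 0.0) * count for f, count in feats.items())

def loglinear_probs(weights, analyses):
    """Conditional distribution P(t_i | s) over the analyses of one sentence."""
    scores = [linear_score(weights, feats) for feats in analyses]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)                        # normalization factor Z_s
    return [e / z for e in exps]

def best_parse(weights, analyses):
    """Index of the preferred parse: highest score, hence highest probability."""
    return max(range(len(analyses)),
               key=lambda i: linear_score(weights, analyses[i]))

# Toy example: three competing analyses with invented features.
weights = {"hcomp|0:hcomp": 0.7, "grandparent(hcomp,whatvp)": -0.3}
analyses = [{"hcomp|0:hcomp": 2}, {"grandparent(hcomp,whatvp)": 1}, {}]
print(loglinear_probs(weights, analyses), best_parse(weights, analyses))
```

Note that the ranking induced by the two scoring rules is the same here; the difference that matters for active learning is that only the log-linear model yields a probability distribution over the parses.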
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Performance </SectionTitle> <Paragraph position="0"> Using the two feature sets (configurational and ngram) with both log-linear and perceptron algorithms, we create the four models shown in Table 2. To test their overall accuracy, we measured performance using exact match: a model is awarded a point if it picks some parse for a sentence and that parse is the best analysis. We averaged performance over ten runs using a cross-validation strategy. For each run, we randomly split the corpus into ten roughly equally-sized subsets and tested the accuracy for each subset after training a model on the other nine. The accuracy when a model ranks $n$ parses highest is reported as top-$n$.</Paragraph> <Paragraph position="1"> The results for the four models on both Redwoods-1 and Redwoods-3 are given in Table 3, along with a baseline of randomly selecting parses. As can be seen, the increased ambiguity in the later version heavily impacts the accuracy.</Paragraph> <Paragraph position="2"> The performance of LL-CONFIG on Redwoods-1 matches the accuracy of the best stand-alone log-linear model reported by Toutanova and Manning (2002), which uses essentially the same features. The log-linear model that utilizes the ngram features is not far behind, indicating that these simple features do indeed capture important generalizations about the derivation trees.</Paragraph> <Paragraph position="3"> The perceptrons both perform worse than the log-linear models. More important, however, is that each model disagrees with all of the others on roughly 20% of the examples, indicating that differentiation by using either a different feature set or a different machine learning algorithm is sufficient to produce models with different biases. This is essential for setting up committee-based active learning and could also make the models informative members of an ensemble for parse selection.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Selecting Examples for Annotation </SectionTitle> <Paragraph position="0"> In applying active learning to parse selection, we investigate two primary sample selection methods, one certainty-based and the other committee-based, and compare them to several baseline methods.</Paragraph> <Paragraph position="1"> The single-learner method uses tree entropy (Hwa, 2000), which measures the uncertainty of a learner based on the conditional distribution it assigns to the parses of a given sentence. Following Hwa, we use the following evaluation function to quantify uncertainty based on tree entropy: $$f_{te}(s, \tau) = - \sum_{t \in \tau} P(t \mid s) \log P(t \mid s)$$ where $\tau$ denotes the set of analyses produced by the ERG for the sentence. Higher values of $f_{te}(s, \tau)$ indicate examples on which the learner is most uncertain and which are thus presumably more informative. The intuition behind tree entropy is that sentences should have a skewed distribution over their parses and that deviation from this signals learner uncertainty. Calculating tree entropy is trivial with the conditional log-linear models described in section 4. Of course, tree entropy cannot be straightforwardly used with standard perceptrons since they do not determine a distribution over the parses of a sentence.</Paragraph> <Paragraph position="2"> The second sample selection method is inspired by the Query by Committee algorithm (Freund et al., 1997; Argamon-Engelson and Dagan, 1999) and co-testing (Muslea et al., 2000). Using a fixed committee consisting of two distinct models, the examples we select for annotation are those for which the two models disagree on the preferred parse. We will refer to this method as preferred parse disagreement. The intuition behind this method is that the different biases of the learners will lead to different predictions on some examples and thus identify examples for which at least one of them is uncertain.</Paragraph>
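The two selection criteria reduce to a few lines once model predictions are available. The sketch below is a hypothetical illustration, not the paper's code: it assumes each unlabelled sentence comes with a conditional parse distribution (for tree entropy) or with the preferred-parse indices chosen by two committee members (for disagreement), and the function names and pool representation are invented.

```python
# Illustrative sketch of the two sample selection criteria over a pool of
# unlabelled sentences, given model outputs for each sentence's analyses.
import math

def tree_entropy(probs):
    """Uncertainty of one conditional model over one sentence's parses:
    f_te = -sum_t P(t|s) log P(t|s)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_by_entropy(pool_probs, k):
    """Certainty-based selection: pick the k sentences whose parse
    distributions have the highest tree entropy."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: tree_entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

def select_by_disagreement(preds_a, preds_b):
    """Committee-based selection: pick every sentence on which the two
    models disagree about the preferred parse."""
    return [i for i, (a, b) in enumerate(zip(preds_a, preds_b)) if a != b]

# Toy pool of three sentences with two or three analyses each.
pool_probs = [[0.95, 0.05], [0.4, 0.35, 0.25], [0.6, 0.4]]
print(select_by_entropy(pool_probs, 1))              # most uncertain sentence
print(select_by_disagreement([0, 1, 0], [0, 2, 0]))  # sentences with disagreement
```

As the sketch makes explicit, disagreement selection needs only each model's single preferred parse, not a distribution, which is why it remains applicable to the perceptron models even though tree entropy does not.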
<Paragraph position="3"> We compare tree entropy and disagreement with the following three baseline selection methods to ensure the significance of the results:</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Experimental Setup </SectionTitle> <Paragraph position="0"> The pseudo-code for committee-based active learning with two members is given in Figure 3. Starting with a small amount of initial annotated training material, the learners on the committee are used to select examples according to the method being used. These examples are then manually annotated and added to the set of labelled training material, and the learners are retrained on the extended set. This loop continues until all available unannotated examples are exhausted, or until some other pre-determined condition is met.</Paragraph> <Paragraph position="1"> As is standard for active learning experiments, we quantify the effect of different selection techniques by using them to select subsets of the material already annotated in Redwoods-3. For the experiments, we used tenfold cross-validation by moving a fixed window of 500 sentences through Redwoods-3 for the test set and selecting samples from the remaining 4802 sentences. Each run of active learning begins with 50 randomly chosen, annotated seed sentences. At each round, new examples are selected for annotation from a randomly chosen subset according to the operative selection method until the total amount of annotated training material made available to the learners reaches 3000. We select 25 examples at a time until the training set contains 1000 examples, then 50 at a time until it has 2000, and finally 100 at a time until it has 3000. The results for each selection method are averaged over four tenfold cross-validation runs.</Paragraph> <Paragraph position="2"> Whereas Hwa (2000) evaluated the effectiveness of selective sampling according to the number of brackets needed to create the parse trees for selected sentences, we compare selection methods based on the absolute number of sentences they select. This is realistic in the Redwoods setting since the derivation trees are created automatically from the ERG, and the task of the human annotator is to select the best from all licensed parses. Annotation in Redwoods uses an interface that presents local discriminants which disambiguate large portions of the parse forest, so options are narrowed down quickly even for sentences with a large number of parses.</Paragraph> </Section> </Paper>