<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1016"> <Title>Active Learning for Statistical Natural Language Parsing</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Sentence Distance and Clustering </SectionTitle> <Paragraph position="0"> To characterize the representativeness of a sentence, we need to know how far apart two sentences are, so that we can measure roughly how many similar sentences there are in the active training set. For our purpose, the distance ought to have the property that two sentences with similar structures have a small distance, even if they are lexically different. This leads us to define the distance between two sentences based on their parse trees, which are obtained by applying an existing model to the active training set. However, computing the distance between two parse trees requires a digression on how they are represented in our parser.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Event Representation of Parse Trees </SectionTitle> <Paragraph position="0"> A statistical parser computes $P(T|S)$, the probability of a parse $T$ given a sentence $S$. Since the space of entire parses is too large to be modeled directly, a parse tree $T$ is decomposed into a series of individual actions $a_1, a_2, \ldots, a_n$. In the parser (Jelinek et al., 1994) used in this study, this is accomplished through a bottom-up-left-most (BULM) derivation. In the BULM derivation, there are three types of parse actions: tag, label and extension. There is a corresponding vocabulary for tags and for labels, and there are four extension directions: RIGHT, LEFT, UP and UNIQUE. If a child node is the only node under a label, the child node is said to extend UNIQUE to its parent node; if there are multiple children under a parent node, the left-most child is said to extend RIGHT to the parent node, the right-most child is said to extend LEFT to the parent node, while all other intermediate children are said to extend UP to their parent node. The BULM derivation is best explained by the example in Figure 1.</Paragraph> <Paragraph position="1"> [Figure 1: the example parse tree represented as 17 parsing actions: tags (1,3,5,7,11,13) in blue boxes, labels (9,15,17) in green underlines, and extensions (2,4,6,8,10,12,14,16) in red parentheses. Numbers indicate the order of actions.]</Paragraph> <Paragraph position="2"> The input sentence is "fly from new york to boston". Numbers on its semantic parse tree indicate the order of parse actions, while colors indicate types of actions: tags are numbered in blue boxes, extensions in red parentheses and labels in green underlines. For this example, the first action is tagging the first word "fly" given the sentence; the second action is extending the tag wd RIGHT, as the tag wd is the left-most child of the constituent S; the third action is tagging the second word "from" given the sentence and the two preceding actions, and so on.</Paragraph> <Paragraph position="3"> We define an event as a parse action together with its context. The BULM derivation converts a parse tree into a unique sequence of parse events, and a valid event sequence corresponds to a unique parse tree. Therefore a parse tree can be equivalently represented by a sequence of events.</Paragraph>
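<Paragraph> To make the BULM derivation concrete, the following sketch linearizes a toy parse tree into (action type, action) events following the ordering described above: a word is tagged, every non-root node extends toward its parent once its subtree is complete, and a constituent is labeled after all of its children. The nested-tuple tree encoding and all names are illustrative assumptions, not the parser's data structures.

# Illustrative sketch of a bottom-up-left-most (BULM) linearization of a parse
# tree into (action_type, action) events. The nested-tuple tree format is a
# hypothetical stand-in for the parser's internal representation.
def bulm_events(node, is_root=True, direction=None, events=None):
    """node is either ("TAG", word_tag) for a tagged word, or
    (constituent_label, [children]) for an internal constituent."""
    if events is None:
        events = []
    label, payload = node
    if label == "TAG":
        # a word is tagged first ...
        events.append(("tag", payload))
    else:
        children = payload
        n = len(children)
        for i, child in enumerate(children):
            if n == 1:
                d = "UNIQUE"          # only child
            elif i == 0:
                d = "RIGHT"           # left-most of several children
            elif i == n - 1:
                d = "LEFT"            # right-most child
            else:
                d = "UP"              # intermediate children
            bulm_events(child, is_root=False, direction=d, events=events)
        # ... a constituent is labeled once all of its children are complete ...
        events.append(("label", label))
    if not is_root:
        # ... and every non-root node then extends toward its parent.
        events.append(("ext", direction))
    return events

# Toy example (the structure is hypothetical, not the tree of Figure 1):
tree = ("S", [("TAG", "wd"),
              ("LOC", [("TAG", "city"), ("TAG", "city")])])
print(bulm_events(tree))
# [('tag', 'wd'), ('ext', 'RIGHT'), ('tag', 'city'), ('ext', 'RIGHT'),
#  ('tag', 'city'), ('ext', 'LEFT'), ('label', 'LOC'), ('ext', 'LEFT'),
#  ('label', 'S')]
</Paragraph>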
<Paragraph position="4"> Let $\mathcal{T}(S)$ be the set of tagging actions, $\mathcal{L}(S)$ the labeling actions and $\mathcal{E}(S)$ the extending actions of $S$, and let $h(a)$ be the sequence of actions taken ahead of the action $a$. Then $P(T|S)$ can be rewritten as $$P(T|S) = \prod_{a \in \mathcal{T}(S)} P(a|S, h(a)) \prod_{a \in \mathcal{L}(S)} P(a|S, h(a)) \prod_{a \in \mathcal{E}(S)} P(a|S, h(a)).$$ Note that the raw context space $\{(S, h(a))\}$ is too huge to store and manipulate efficiently. In our implementation, contexts are internally represented as bitstrings through a set of pre-designed questions, and the answers to each question are represented as bitstrings. To support questions like "what is the previous word (or tag, label, extension)?", the word, tag, label and extension vocabularies are all encoded as bitstrings. Words are encoded through an automatic clustering algorithm (Brown et al., 1992), while tags, labels and extensions are normally encoded using diagonal bits. An example can be found in (Luo et al., 2002).</Paragraph> <Paragraph position="5"> In summary, a parse tree can be represented uniquely by a sequence of events, while each event can in turn be represented as a bitstring. With this in mind, we are now ready to define a structural distance for two sentences given an existing model.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Sentence Distance </SectionTitle> <Paragraph position="0"> Recall that it is assumed that there is a statistical parser $M$ trained with a small amount of annotated data. To infer the structures of two sentences $S_1$ and $S_2$, we apply $M$ to obtain their most likely parses $T_1$ and $T_2$. The distance between $S_1$ and $S_2$ is defined as the distance between the sentence-parse pairs $(S_1, T_1)$ and $(S_2, T_2)$: $$d(S_1, S_2) = d\big((S_1, T_1), (S_2, T_2)\big).$$ To emphasize the dependency on $M$, we denote the distance as $d_M(S_1, S_2)$. The underlying assumption is that $S_1$ and $S_2$ have similar true parses if they have similar structures under the current model $M$.</Paragraph> <Paragraph position="1"> We have shown in Section 2.1 that a parse tree can be represented by a sequence of events, each of which can in turn be represented as a bitstring by answering questions. Let $E_i = \{(h_{ik}, a_{ik}) : k = 1, \ldots, n_i\}$ be the event sequence of $T_i$, where $h_{ik}$ is the context and $a_{ik}$ the parsing action of the $k$-th event of the parse tree $T_i$. We define the distance between the two sentences $S_1, S_2$ as the distance between their event sequences: $$d_M(S_1, S_2) = d(E_1, E_2). \qquad (3)$$ The distance between two sequences $E_1$ and $E_2$ is computed as the editing distance using dynamic programming (Rabiner and Juang, 1993). We now describe the distance between two individual events.</Paragraph> <Paragraph position="2"> We take advantage of the fact that the contexts $h_{ik}$ can be encoded as bitstrings, and define the distance between two contexts as the Hamming distance between their bitstring representations. We further define the distance between two parsing actions as follows: it is either $0$ or a constant $c$ if the two parse actions are of the same type (recall there are three types of parsing actions: tag, label and extension), and infinity if they are of different types. We choose $c$ to be the number of bits in a context bitstring, to emphasize the importance of parsing actions in the distance computation. Formally, the distance between two events is $$d\big((h_1, a_1), (h_2, a_2)\big) = d_H(h_1, h_2) + d_A(a_1, a_2), \qquad (4)$$ where $d_H$ is the Hamming distance and $d_A(a_1, a_2)$ is $0$ if $a_1 = a_2$, $c$ if $a_1 \neq a_2$ but the actions are of the same type, and $\infty$ otherwise. Computing the editing distance (3) requires dynamic programming and is computationally expensive.</Paragraph>
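<Paragraph> The distance just defined can be sketched in a few lines: contexts are compared by Hamming distance over their bitstrings, actions by the 0 / c / infinity rule of (4), and the two event sequences are aligned with a standard edit-distance dynamic program. Using $c$ as the insertion and deletion cost, like all names here, is an illustrative assumption rather than the paper's stated choice.

# Illustrative sketch of the model-based sentence distance: an edit distance
# over event sequences, where each event is (context_bits, action_type, action).
# The equal-length-bitstring assumption and the insert/delete cost are ours.
import math

def hamming(bits1, bits2):
    # contexts are encoded as equal-length bitstrings, e.g. "0110..."
    return sum(b1 != b2 for b1, b2 in zip(bits1, bits2))

def event_dist(e1, e2, c):
    (h1, t1, a1), (h2, t2, a2) = e1, e2
    if t1 != t2:                       # different action types (tag/label/ext)
        return math.inf
    action_cost = 0 if a1 == a2 else c
    return hamming(h1, h2) + action_cost

def sentence_dist(events1, events2, c):
    """Edit distance between two event sequences (insertions and deletions
    cost c; substitutions cost event_dist)."""
    n, m = len(events1), len(events2)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * c
    for j in range(1, m + 1):
        D[0][j] = j * c
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + c,                      # deletion
                          D[i][j - 1] + c,                      # insertion
                          D[i - 1][j - 1] + event_dist(events1[i - 1],
                                                       events2[j - 1], c))
    return D[n][m]
</Paragraph>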
<Paragraph position="3"> To speed up the computation, we can choose to ignore the difference in contexts; in other words, (4) becomes $$d\big((h_1, a_1), (h_2, a_2)\big) = d_A(a_1, a_2).$$</Paragraph> <Paragraph position="4"> Given a set of $N$ sentences, we define the density of a sample $S_i$ as $$\rho(S_i) = \Big[ \frac{1}{N-1} \sum_{j \neq i} d_M(S_i, S_j) \Big]^{-1}.$$ That is, the sample density is defined as the inverse of its average distance to the other samples. We also define the centroid of a set of sentences as the member with the maximum density.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 K-Means Clustering </SectionTitle> <Paragraph position="0"> With the model-based distance measure defined above, we can use the K-means algorithm to cluster sentences.</Paragraph> <Paragraph position="1"> A sketch of the algorithm (Jelinek, 1997) is as follows. Let $\{S_1, S_2, \ldots, S_N\}$ be the set of sentences to be clustered. 1. Initialization: partition $\{S_1, S_2, \ldots, S_N\}$ into $k$ initial clusters. 2. Re-estimation: compute the centroid of each cluster. 3. Re-assignment: assign each sentence to the cluster with the closest centroid. Steps 2 and 3 are repeated until the algorithm converges (e.g., the relative change of the total distortion is smaller than a threshold).</Paragraph> <Paragraph position="2"> For each iteration we need to compute the distances between samples and cluster centroids, and the pair-wise distances within each cluster. The basic operation here is computing the distance between two sentences, which involves a dynamic programming process and is time-consuming. If we assume the $N$ samples are uniformly distributed among the $k$ clusters, the complexity of this algorithm is approximately $O(kN + N^2/k)$ distance computations per iteration. Note that a centroid has to be one of the samples, since it is not clear how to average sentences.</Paragraph> <Paragraph position="3"> To speed up, the dynamic programming is constrained so that only the band surrounding the diagonal line (Rabiner and Juang, 1993) is allowed, and repeated sentences are stored as a unique copy with a count so that computation for the same sentence pair is never repeated. The latter is quite effective for dialog systems, as a sentence is often seen more than once in the training corpus.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Uncertainty Measures </SectionTitle> <Paragraph position="0"> Intuitively, we would like to select samples on which the current model is not doing well. The current model's uncertainty about a sentence could arise because similar sentences are under-represented in the (annotated) training set, or because similar sentences are intrinsically difficult. We take advantage of the availability of parsing scores from the existing statistical parser and propose three entropy-based uncertainty scores.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Change of Entropy </SectionTitle> <Paragraph position="0"> After the decision trees are grown, we can compute the entropy of each leaf node $l$ as $$H(l) = -\sum_{i} P(i|l) \log P(i|l),$$ where $i$ ranges over the tag, label or extension vocabulary, depending on the type of the tree, and the total model entropy $H$ is the sum of the leaf entropies weighted by the relative leaf counts. To compute the change of entropy $\Delta H$ caused by a sentence $S$, the events of the decoded $S$ are added to the leaf counts and $H$ is recomputed; since only the counts of the leaves visited by these events change, we only have to visit leaf nodes where counts change. In other words, $\Delta H$ can be computed efficiently. $\Delta H$ characterizes how much a sentence $S$ surprises the existing model: if the addition of the events due to $S$ changes many leaf distributions $P(i|l)$, and consequently $H$, the sentence is probably not well represented in the initial training set and $\Delta H$ will be large. We would like to annotate these sentences.</Paragraph> </Section>
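<Paragraph> A minimal sketch of the change-of-entropy score, under our own simplifying assumptions: decision-tree leaves are plain count tables, the model entropy is the count-weighted sum of leaf entropies, and the score is recomputed from scratch rather than incrementally (the text notes that in practice only the leaves whose counts change need to be revisited).

# Illustrative sketch of the change-of-entropy score. Decision-tree leaves are
# represented as count tables {leaf_id: {outcome: count}}; the exact weighting
# of leaf entropies is an assumption for illustration.
import math

def leaf_entropy(counts):
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values() if n)

def tree_entropy(leaves):
    # total entropy: leaf entropies weighted by relative leaf counts (assumed)
    grand = sum(sum(c.values()) for c in leaves.values())
    return sum((sum(c.values()) / grand) * leaf_entropy(c)
               for c in leaves.values())

def delta_H(leaves, sentence_events):
    """sentence_events: (leaf_id, outcome) pairs obtained by dropping the
    events of a decoded sentence down the grown decision trees.
    Recomputes H before and after adding the counts; an efficient version
    would only revisit the leaves whose counts actually change."""
    H_before = tree_entropy(leaves)
    updated = {leaf: dict(c) for leaf, c in leaves.items()}
    for leaf, outcome in sentence_events:
        updated.setdefault(leaf, {})
        updated[leaf][outcome] = updated[leaf].get(outcome, 0) + 1
    return abs(tree_entropy(updated) - H_before)
</Paragraph>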
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Sentence Entropy </SectionTitle> <Paragraph position="0"> Now let us consider another measurement, which seeks to address the intrinsic difficulty of a sentence. Intuitively, we can consider a sentence more difficult if it potentially has more parses. We calculate the entropy of the distribution over all candidate parses, the sentence entropy, to measure this intrinsic ambiguity.</Paragraph> <Paragraph position="1"> Given a sentence $S$, the existing model $M$ can generate the top $K$ most likely parses together with their scores, $$\{(T_i, p_i) : i = 1, 2, \ldots, K\},$$ where $T_i$ is the $i$-th possible parse and $p_i$ is its associated score. Without confusion, we drop $p_i$'s dependency on $M$ and define the sentence entropy as $$H_s(S) = -\sum_{i=1}^{K} p_i \log p_i.$$</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Word Entropy </SectionTitle> <Paragraph position="0"> As we can imagine, a long sentence tends to have more possible parsing results, not because it is difficult but simply because it is long. To counter this effect, we can normalize the sentence entropy by the length of the sentence to obtain the per-word entropy $$H_w(S) = \frac{H_s(S)}{|S|},$$ where $|S|$ is the number of words in $S$. Comparing the three uncertainty scores against sentence length, $\Delta H$ favors longer sentences more. This can be explained as follows: longer sentences tend to have more complex structures (extensions and labelings) than shorter sentences, and the models for these complex structures are relatively less trained than the models for tagging. As a result, longer sentences have a higher change of entropy, in other words, a larger impact on the models. As explained above, longer sentences also have larger sentence entropy. After normalizing, this trend is reversed for word entropy.</Paragraph> </Section> </Section>
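<Paragraph> The sentence and word entropies of Sections 3.2 and 3.3 reduce to a few lines; the sketch below assumes the parser returns the top-K parse scores and renormalizes them over the K-best list, which is an assumption about the interface rather than a statement of the parser's API.

# Illustrative computation of sentence entropy and per-word entropy from the
# top-K parse scores. The score list is a hypothetical stand-in for the
# parser's actual output.
import math

def sentence_entropy(parse_scores):
    """parse_scores: scores of the top-K candidate parses of a sentence."""
    total = sum(parse_scores)
    probs = [s / total for s in parse_scores]   # renormalize over the K-best list
    return -sum(p * math.log(p) for p in probs if p > 0)

def word_entropy(parse_scores, num_words):
    # normalize by sentence length to remove the bias toward long sentences
    return sentence_entropy(parse_scores) / num_words

# e.g. a 6-word sentence with three candidate parses:
scores = [0.7, 0.2, 0.1]
print(sentence_entropy(scores), word_entropy(scores, 6))
</Paragraph>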
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experimental Results and Analysis </SectionTitle> <Paragraph position="0"> All experiments are done with a shallow semantic parser (a.k.a. classer (Davies et al., 1999)) of the natural language understanding component of DARPA Communicator (DARPA Communicator Website, 2000). We built an initial model using 1000 sentences. We have 20951 unlabeled sentences from which the active learner selects samples, and an independent test set of 4254 sentences. A fixed batch size $B = 100$ is used throughout our experiments.</Paragraph> <Paragraph position="1"> Exact match is used to compute the accuracy, i.e., the accuracy is the number of sentences whose decoded trees are exactly the same as the human annotation, divided by the number of sentences in the test set. The effectiveness of active learning is measured by comparing the learning curves (i.e., test accuracy vs. number of training sentences) of active learning and random selection.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Sample Selection Schemes </SectionTitle> <Paragraph position="0"> We experimented with two basic sample selection algorithms. The first selects samples based solely on uncertainty scores, while the second clusters sentences and then selects the most uncertain ones from each cluster.</Paragraph> <Paragraph position="1"> Uncertainty Only: at each active learning iteration, the $B$ most uncertain sentences are selected. The drawback of this selection method is that it risks selecting outliers, because outliers are likely to get high uncertainty scores under the existing models. Figure 3 shows the test accuracy of this selection method against the number of samples selected from the active training set. Short sentences tend to have higher values of $H_w$, while their sentence-level uncertainty scores ($\Delta H$ or $H_s$) are low. Since we use sentences as the basic units, it is not surprising that the $H_w$-based method performs poorly while the other two perform very well.</Paragraph> <Paragraph position="2"> Most Uncertain Per Cluster: in our implementation, we cluster the active training set so that the number of clusters equals the batch size, and this scheme then selects the sentence with the highest uncertainty score from each cluster. We expect that restricting sample selection to each cluster would fix the problem that $H_w$ tends to be large for short sentences, as short sentences are likely to fall into one cluster and long sentences will get a fair chance to be selected in the other clusters. This is verified by the learning curves in Figure 4. Indeed, $H_w$ performs as well as $H_s$ most of the time, and all active learning algorithms perform better than random selection.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Weighting Samples </SectionTitle> <Paragraph position="0"> In the sample selection process we calculate the density of each sample. For the samples that are selected, we also know their correct annotations, which can be used to evaluate the model's performance on them. We exploit this knowledge and experiment with two weighting schemes.</Paragraph> <Paragraph position="1"> Weight by Density: a sample with higher density should be assigned a greater weight, because the model can benefit more from learning this sample as it has more neighbors. We calculate the density of a sample inside its cluster, so we need to adjust the density by the cluster size to avoid an unwanted bias toward small clusters; for a cluster $C = \{S_i\}$, the weight of the selected sample is its within-cluster density adjusted by the cluster size $|C|$.</Paragraph> <Paragraph position="2"> Weight by Performance: another way to improve the model is to focus it on its weaknesses when it knows about them. The model can test itself on its training set, where the truth is known, and assign greater weights to sentences it parses incorrectly. In our experiment, weights are updated as follows: the initial weight of a sentence is its count, and if the human annotation of a selected sentence differs from the current model output, its weight is multiplied by a constant factor greater than one. We did not experiment with more complicated weighting schemes (like AdaBoost) since we only want to see whether weighting has any effect on the active learning results.</Paragraph>
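<Paragraph> A sketch of the density weighting described above, under our assumption that the within-cluster density is simply scaled by the cluster size; the exact adjustment used in the experiments is not spelled out in the text.

# Illustrative sketch of weighting a selected sample by its within-cluster
# density, adjusted by cluster size. The scaling rule is an assumption.
def density(sentence, cluster, dist):
    """Inverse of the average distance from `sentence` to the other members."""
    others = [s for s in cluster if s is not sentence]
    if not others:
        return 1.0
    avg = sum(dist(sentence, o) for o in others) / len(others)
    return 1.0 / max(avg, 1e-9)

def density_weight(selected, cluster, dist):
    # scale by cluster size so that small clusters are not favored
    return len(cluster) * density(selected, cluster, dist)
</Paragraph>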
<Paragraph position="3"> Figure 5 and Figure 6 show the learning curves when the selected samples are weighted by density and by performance, respectively. The effect of weighting samples is highlighted in Table 1, where results are obtained after 1000 samples have been selected using the same uncertainty score $H_w$ but with different weighting schemes. Weighting samples by density leads to the best performance. Since weighting samples by density is a way to tweak the sample distribution of the training set toward the distribution of the entire sample space, including unannotated sentences, this indicates that it is important to ensure that the distribution of the training set matches that of the sample space. Therefore, we believe that clustering is a necessary and useful step.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Effect of Clustering </SectionTitle> <Paragraph position="0"> We compare the best learning curve obtained when an uncertainty score alone is used to select samples (i.e., sentence entropy, as in Figure 3) with the best learning curve resulting from clustering together with the word entropy $H_w$. It is clear that clustering results in a better learning curve.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Summary Result </SectionTitle> <Paragraph position="0"> Figure 8 shows the best active learning result compared with that of random selection. The learning curve for active learning is obtained using $H_w$ as the uncertainty measure, with the selected samples weighted by density. Both active learning and random selection are run 40 times, each time selecting 100 samples. The horizontal line on the graph is the performance when all 20K sentences are used. It is remarkable that active learning can use far fewer samples (usually less than one third) to achieve the same level of performance as random selection. And after only about 2800 sentences are selected, the active learning result becomes very close to the best possible accuracy.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Previous Work </SectionTitle> <Paragraph position="0"> While active learning has been studied extensively in the context of machine learning (Cohn et al., 1996; Freund et al., 1997), and has been applied to text classification (McCallum and Nigam, 1998) and part-of-speech tagging (Dagan and Engelson, 1995), there are only a handful of studies on natural language parsing (Thompson et al., 1999) and (Hwa, 2000; Hwa, 2001). (Thompson et al., 1999) uses active learning to acquire a shift-reduce parser, and the uncertainty of an unparseable sentence is defined as the number of operators applied successfully divided by the number of words. It is more natural to define uncertainty scores in our study because of the availability of parse scores. (Hwa, 2000; Hwa, 2001) is related closely to our work in that both use entropy-based uncertainty scores, but Hwa does not characterize the distribution of the sample space. Knowing the distribution of the sample space is important, since an uncertainty measure, if used alone for sample selection, is likely to select outliers. (Stolcke, 1998) used an entropy-based criterion to reduce the size of backoff n-gram language models.</Paragraph> <Paragraph position="1"> The major contribution of this paper is that a model-based distance measure is proposed and used in active learning. The distance measures the structural difference of two sentences relative to an existing model. A similar idea is also exploited in (McCallum and Nigam, 1998), where the authors use the divergence between the unigram word distributions of two documents to measure their difference.
This distance enables us to cluster the active training set, and a sample is then selected and weighted based on both its uncertainty score and its density. (Sarkar, 2001) applied co-training to statistical parsing, where two component models are trained and the most confident parsing outputs of the existing model are incorporated into the next round of training. This is a different avenue for reducing annotation work, in that the current model output is used directly and no human annotation is assumed. (Luo et al., 1999; Luo, 2000) also aimed at making use of unlabeled data to improve statistical parsers, by transforming model parameters.</Paragraph> </Section> </Paper>