<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1016"> <Title>Active Learning for Statistical Natural Language Parsing</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> A prerequisite for building statistical parsers (Jelinek et al., 1994; Collins, 1996; Ratnaparkhi, 1997; Charniak, 1997) is the availability of a (large) corpus of parsed sentences. Acquiring such a corpus is expensive and time-consuming, and is often the bottleneck in building a parser for a new application or domain. The goal of this study is to use active learning to reduce the number of annotated sentences (and hence the development time) required for a statistical parser to achieve satisfactory performance.</Paragraph> <Paragraph position="1"> Active learning has been studied in the context of many natural language processing (NLP) applications, such as information extraction (Thompson et al., 1999), text classification (McCallum and Nigam, 1998) and natural language parsing (Thompson et al., 1999; Hwa, 2000), to name a few. The basic idea is to couple knowledge acquisition, e.g., annotating sentences for parsing, tightly with model training, as opposed to treating the two separately. In our setup, we assume that a small set of annotated sentences is initially available and is used to build a statistical parser. We also assume that a large corpus of unannotated sentences is at our disposal; this corpus is called the active training set. A batch of samples1 is selected using the algorithms developed here, annotated by humans, and then added to the training data to rebuild the model. The procedure is iterated until the model reaches a certain accuracy level.</Paragraph> <Paragraph position="2"> Our efforts are devoted to two aspects: first, we believe that the selected samples should reflect the underlying distribution of the training corpus. 
In other words, the selected samples need to be representative. To this end, a model-based structural distance is defined to quantify how far apart two sentences are, and with the help of this distance, the active training set is clustered so that we can define and compute the density of a sample; second, we propose and test several entropy-based measures that quantify the uncertainty of a sample in the active training set under an existing model, since it makes sense to ask human annotators to label the portion of the data on which the existing model is not doing well. Samples are selected from the clusters based on their uncertainty scores.</Paragraph> <Paragraph position="3"> The rest of the paper is organized as follows. In Section 2, a structural distance is first defined based on the sequential representation of a parse tree. It is then straightforward to employ a k-means algorithm to cluster sentences in the active training set. Section 3 is devoted to confidence measures, where three uncertainty measures are proposed. Active learning results on the shallow semantic parser of an air travel dialog system are presented in Section 4. A summary of related work is given in Section 5. The paper closes with conclusions and future work. 1A sample means a sentence in this paper.</Paragraph> <Paragraph position="4"> Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 120-127.</Paragraph> </Section> </Paper>
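The selection loop this introduction describes (cluster the unannotated pool with a distance-based clustering, score each sentence's uncertainty under the current model with an entropy measure, then pick the most uncertain samples per cluster for annotation) can be sketched in a few lines of Python. Everything below is a hypothetical stand-in for the paper's components: Jaccard word-set distance in place of the model-based structural distance, k-medoids in place of k-means, and a toy uniform-probability model in place of parser-derived probabilities.

```python
import math
import random


def sentence_distance(s1, s2):
    # Stand-in for the paper's model-based structural distance:
    # Jaccard distance between the two sentences' word sets.
    a, b = set(s1.split()), set(s2.split())
    return 1.0 - len(a & b) / len(a | b)


def assign_clusters(sentences, medoids):
    # Assign each sentence to the nearest medoid.
    clusters = [[] for _ in medoids]
    for s in sentences:
        i = min(range(len(medoids)),
                key=lambda j: sentence_distance(s, medoids[j]))
        clusters[i].append(s)
    return clusters


def cluster_sentences(sentences, k, iters=10, seed=0):
    # k-medoids over the sentence distance; stands in for the
    # paper's k-means clustering of the active training set.
    rng = random.Random(seed)
    medoids = rng.sample(sentences, k)
    for _ in range(iters):
        clusters = assign_clusters(sentences, medoids)
        medoids = [
            min(c, key=lambda m: sum(sentence_distance(m, o) for o in c))
            if c else medoids[i]
            for i, c in enumerate(clusters)
        ]
    return assign_clusters(sentences, medoids)


def uncertainty(model, sentence):
    # Entropy of the model's probability distribution over analyses
    # of the sentence: high entropy = the model is unsure.
    probs = model(sentence)
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_batch(unlabeled, model, k, batch_size):
    # One active-learning round: cluster the pool, then take the most
    # uncertain sentences from each cluster. The returned batch would
    # then be hand-annotated and the parser retrained on it.
    per_cluster = max(1, batch_size // k)
    batch = []
    for members in cluster_sentences(unlabeled, k):
        ranked = sorted(members, key=lambda s: uncertainty(model, s),
                        reverse=True)
        batch.extend(ranked[:per_cluster])
    return batch[:batch_size]
```

Selecting from every cluster keeps the batch representative of the corpus distribution, while the per-cluster entropy ranking targets the sentences the current model handles worst; iterating `select_batch`, annotation, and retraining gives the loop described above.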