<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1042">
<Title>Minimizing Manual Annotation Cost In Supervised Training From Corpora</Title>
<Section position="3" start_page="0" end_page="319" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0">Many corpus-based methods for natural language processing (NLP) rely on supervised training, that is, on acquiring information from a manually annotated corpus. Reducing annotation cost is therefore an important research goal for statistical NLP. The ultimate reduction in annotation cost is achieved by unsupervised training methods, which do not require an annotated corpus at all (Kupiec, 1992; Merialdo, 1994; Elworthy, 1994). It has been shown, however, that some supervised training prior to the unsupervised phase is often beneficial, and fully unsupervised training may not be feasible for certain tasks. This paper investigates an approach for optimizing the supervised training (learning) phase, reducing the annotation effort required to achieve a desired level of accuracy in the trained model.</Paragraph>
<Paragraph position="1">In this paper, we investigate and extend the committee-based sample selection approach to minimizing training cost (Dagan and Engelson, 1995).</Paragraph>
<Paragraph position="2">When using sample selection, a learning program examines many unlabeled (not annotated) examples and selects for labeling only those that are most informative for the learner at each stage of training (Seung, Opper, and Sompolinsky, 1992; Freund et al., 1993; Lewis and Gale, 1994; Cohn, Atlas, and Ladner, 1994). This avoids redundantly annotating many examples that contribute roughly the same information to the learner.</Paragraph>
<Paragraph position="3">Our work focuses on sample selection for training probabilistic classifiers. In statistical NLP, probabilistic classifiers are often used to select a preferred analysis of the linguistic structure of a text, for example its syntactic structure (Black et al., 1993), word categories (Church, 1988), or word senses (Gale, Church, and Yarowsky, 1993). As a representative task for probabilistic classification in NLP, we experiment in this paper with sample selection for the popular and well-understood method of stochastic part-of-speech tagging using Hidden Markov Models.</Paragraph>
<Paragraph position="4">We first review the basic approach of committee-based sample selection and its application to part-of-speech tagging. This basic approach gives rise to a family of algorithms, including the original algorithm of Dagan and Engelson (1995), which we then describe. First, we present the 'simplest' committee-based selection algorithm, which has no parameters to tune. We then generalize the selection scheme, allowing more options for adapting and tuning the approach to specific tasks. The paper compares the performance of several instantiations of the general scheme, including a batch selection method similar to that of Lewis and Gale (1994).</Paragraph>
<Paragraph position="5">In particular, we found that the simplest version of the method achieves a significant reduction in annotation cost, comparable to that of the other versions.</Paragraph>
<Paragraph position="6">We also evaluate the computational efficiency of the different variants and the number of unlabeled examples they consume. Finally, we study the effect of sample selection on the size of the model acquired by the learner.</Paragraph>
</Section>
</Paper>
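
The committee-based selection idea sketched in the introduction can be illustrated with a short example. The snippet below is not the authors' algorithm but a simplified, assumed variant: committee members are drawn from Dirichlet posteriors over per-word tag counts (a unigram stand-in for the paper's HMM tagger), disagreement is measured by vote entropy, and sentences whose average disagreement exceeds a hypothetical threshold are selected for annotation. The function names, the threshold value, and the toy counts are all illustrative.

```python
# Minimal sketch of committee-based sample selection (query by committee).
# Assumption: a unigram tag model replaces the full HMM tagger of the paper.
import math
from collections import defaultdict

import numpy as np

def sample_committee(tag_counts, tags, k=5, prior=1.0):
    """Draw k committee members: for each word, a tag distribution sampled
    from a Dirichlet posterior over the labeled counts seen so far."""
    members = []
    for _ in range(k):
        member = {}
        for word, counts in tag_counts.items():
            alpha = [counts.get(t, 0) + prior for t in tags]
            member[word] = dict(zip(tags, np.random.dirichlet(alpha)))
        members.append(member)
    return members

def vote_entropy(votes, k):
    """Entropy of the committee's votes for one word; higher entropy means
    more disagreement, i.e. a more informative example."""
    ent = 0.0
    for count in votes.values():
        p = count / k
        ent -= p * math.log(p, 2)
    return ent

def select_for_labeling(sentence, members, tags, threshold=0.3):
    """Select the sentence for manual annotation if the committee's average
    per-word disagreement exceeds the (illustrative) threshold."""
    total = 0.0
    for word in sentence:
        votes = defaultdict(int)
        for member in members:
            dist = member.get(word, {t: 1.0 / len(tags) for t in tags})
            votes[max(dist, key=dist.get)] += 1  # each member votes for its best tag
        total += vote_entropy(votes, len(members))
    return (total / len(sentence)) > threshold

# Toy usage: counts would normally come from the examples labeled so far.
tags = ["DET", "NOUN", "VERB"]
tag_counts = {"the": {"DET": 50},
              "dog": {"NOUN": 20, "VERB": 1},
              "barks": {"VERB": 5, "NOUN": 4}}
committee = sample_committee(tag_counts, tags, k=5)
print(select_for_labeling(["the", "dog", "barks"], committee, tags))
```

In the setting the paper actually studies, each committee member would correspond to a complete HMM tagger whose parameters are sampled from posteriors over the labeled data, and the votes would be the tag sequences those taggers assign; the selection logic over the committee's disagreement remains the same in spirit.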