File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/p96-1042_metho.xml
Size: 25,774 bytes
Last Modified: 2025-10-06 14:14:18
<?xml version="1.0" standalone="yes"?> <Paper uid="P96-1042"> <Title>Minimizing Manual Annotation Cost In Supervised Training From Corpora</Title> <Section position="4" start_page="319" end_page="320" type="metho"> <SectionTitle> 2 Probabilistic Classification </SectionTitle> <Paragraph position="0"> This section presents the framework and terminology assumed for probabilistic classification, as well as its instantiation for stochastic bigram part-of-speech tagging.</Paragraph> <Paragraph position="1"> A probabilistic classifier classifies input examples e by classes c ∈ C, where C is a known set of possible classes. Classification is based on a score function, FM(c, e), which assigns a score to each possible class of an example. The classifier then assigns the example to the class with the highest score. FM is determined by a probabilistic model M. In many applications, FM is the conditional probability function, PM(c|e), specifying the probability of each class given the example, but other score functions that correlate with the likelihood of the class are often used.</Paragraph> <Paragraph position="2"> In stochastic part-of-speech tagging, the model assumed is a Hidden Markov Model (HMM), and input examples are sentences. The class c to which a sentence is assigned is a sequence of the parts of speech (tags) for the words in the sentence. The score function is typically the joint (or conditional) probability of the sentence and the tag sequence.¹ The tagger then assigns the sentence to the tag sequence which is most probable according to the HMM.</Paragraph> <Paragraph position="3"> The probabilistic model M, and thus the score function FM, are defined by a set of parameters, {αi}. During training, the values of the parameters are estimated from a set of statistics, S, extracted from a training set of annotated examples. We denote a particular model by M = {ai}, where each ai is a specific value for the corresponding αi.</Paragraph> <Paragraph position="4"> In bigram part-of-speech tagging the HMM model M contains three types of parameters: transition probabilities P(ti→tj) giving the probability of tag tj occurring after tag ti, lexical probabilities P(t|w) giving the probability of tag t labeling word w, and tag probabilities P(t) giving the marginal probability² of a tag occurring. The values of these parameters are estimated from a tagged corpus which provides a training set of labeled examples (see Section 4.1).</Paragraph> <Paragraph position="5"> ² P(w|t) ∝ P(t|w) / P(t) (Church, 1988).
The methods we investigate approach this evaluation implicitly, measuring an example's informativeness as the uncertainty in its classification given the current training data (Seung, Opper, and Sompolinsky, 1992; Lewis and Gale, 1994; MacKay, 1992). The reasoning is that if an example's classification is uncertain given current training data then the example is likely to contain unknown information useful for classifying similar examples in the future.</Paragraph> <Paragraph position="6"> We investigate the committee-based method, where the learning algorithm evaluates an example by giving it to a committee containing several variant models, all 'consistent' with the training data seen so far. The more the committee members agree on the classification of the example, the greater our certainty in its classification. This is because when the training data entails a specific classification with high certainty, most (in a probabilistic sense) classifiers consistent with the data will produce that classification.
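As a concrete reference for the Section 2 setup, here is a minimal sketch (hypothetical data layout and helper names, not the paper's implementation) of a bigram-HMM score function and the generic argmax classifier. The lexical term assumes the relation P(w|t) ∝ P(t|w)/P(t) from footnote 2.

```python
import math

def sequence_score(words, tags, trans, lex, tag_prob):
    """Score a candidate tag sequence for a sentence under a bigram HMM:
    log P(t1) + sum log P(t_i -> t_{i+1}) + sum log P(w_i | t_i).
    trans[(ti, tj)] = P(ti -> tj), lex[(t, w)] = P(t|w), and tag_prob[t] = P(t)
    are parameters estimated from the annotated training set."""
    score = math.log(tag_prob[tags[0]])
    for prev, cur in zip(tags, tags[1:]):
        score += math.log(trans[(prev, cur)])
    for w, t in zip(words, tags):
        # P(w|t) is proportional to P(t|w) / P(t) (footnote 2)
        score += math.log(lex[(t, w)] / tag_prob[t])
    return score

def classify(example, candidate_classes, score_fn):
    """Generic probabilistic classifier: pick the class c maximizing F_M(c, e)."""
    return max(candidate_classes, key=lambda c: score_fn(c, example))
```

In practice a tagger would search the space of tag sequences with a dynamic-programming procedure such as Viterbi rather than enumerating candidate classes; the sketch only makes the score function concrete.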
The committee-based approach was first proposed in a theoretical context for learning binary non-probabilistic classifiers (Seung, Opper, and Sompolinsky, 1992; Freund et al., 1993). In this paper, we extend our previous work (Dagan and Engelson, 1995) where we applied the basic idea of the committee-based approach to probabilistic classification. Taking a Bayesian perspective, the posterior probability of a model, P(M|S), is determined given statistics S from the training set (and some prior distribution for the models). Committee members are then generated by drawing models randomly from P(M|S). An example is selected for labeling if the committee members largely disagree on its classification. This procedure assumes that one can sample from the models' posterior distribution, at least approximately. To illustrate the generation of committee members, consider a model containing a single binomial parameter α (the probability of a success), with estimated value a. The statistics S for such a model are given by N, the number of trials, and x, the number of successes in those trials.</Paragraph> <Paragraph position="7"> Given N and x, the 'best' parameter value may be estimated by one of several estimation methods.</Paragraph> <Paragraph position="8"> For example, the maximum likelihood estimate for α is a = x/N, giving the model M = {a} = {x/N}. When generating a committee of models, however, we are not interested in the 'best' model, but rather in sampling the distribution of models given the statistics. For our example, we need to sample the posterior density of estimates for α, namely P(α = a|S). Sampling this distribution yields a set of estimates scattered around x/N (assuming a uniform prior), whose variance decreases as N increases. In other words, the more statistics there are for estimating the parameter, the more similar are the parameter values used by different committee members.</Paragraph> <Paragraph position="9"> For models with multiple parameters, parameter estimates for different committee members differ more when they are based on low training counts, and they agree more when based on high counts. Selecting examples on which the committee members disagree contributes statistics to currently uncertain parameters whose uncertainty also affects classification. It may sometimes be difficult to sample P(M|S) due to parameter interdependence. Fortunately, models used in natural language processing often assume independence between most model parameters. In such cases it is possible to generate committee members by sampling the posterior distribution for each independent group of parameters separately.</Paragraph> </Section> <Section position="5" start_page="320" end_page="320" type="metho"> <SectionTitle> 4 Bigram Part-Of-Speech Tagging </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="320" end_page="320" type="sub_section"> <SectionTitle> 4.1 Sampling model parameters </SectionTitle> <Paragraph position="0"> In order to generate committee members for bigram tagging, we sample the posterior distributions for transition probabilities, P(ti→tj), and for lexical probabilities, P(t|w) (as described in Section 2).</Paragraph> <Paragraph position="1"> Both types of parameters we sample have the form of multinomial distributions. Each multinomial random variable corresponds to a conditioning event and its values are given by the corresponding set of conditioned events.
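The single-parameter illustration above can be made concrete with a short sketch. Assuming the truncated-normal approximation of the posterior introduced in Section 4.1 (uniform prior; illustrative function names, not the paper's code), each committee member draws its own value for the parameter, and the draws cluster more tightly as N grows.

```python
import random

def sample_binomial_posterior(x, N):
    """Draw one estimate for a binomial parameter alpha given x successes
    in N trials: a normal around the MLE x/N with variance mu*(1-mu)/N,
    truncated to [0, 1] (uniform prior assumed)."""
    mu = x / N
    sigma = (mu * (1.0 - mu) / N) ** 0.5
    while True:
        a = random.gauss(mu, sigma)
        if 0.0 <= a <= 1.0:
            return a

def draw_committee(x, N, k):
    """Generate k committee members, each using its own sampled value."""
    return [sample_binomial_posterior(x, N) for _ in range(k)]

print(draw_committee(x=3, N=10, k=5))      # few counts: values spread widely around 0.3
print(draw_committee(x=300, N=1000, k=5))  # many counts: values tightly clustered near 0.3
```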
For example, a transition probability parameter P(ti→tj) has conditioning event ti and conditioned event tj.</Paragraph> <Paragraph position="2"> Let {ui} denote the set of possible values of a given multinomial variable, and let S = {ni} denote a set of statistics extracted from the training set for that variable, where ni is the number of times that the value ui appears in the training set for the variable, defining N = Σi ni. The parameters whose posterior distributions we wish to estimate are αi = P(ui).</Paragraph> <Paragraph position="3"> The maximum likelihood estimate for each of the multinomial's distribution parameters, αi, is âi = ni/N. In practice, this estimator is usually smoothed in some way to compensate for data sparseness. Such smoothing typically reduces slightly the estimates for values with positive counts and gives small positive estimates for values with a zero count. For simplicity, we describe here the approximation of P(αi = ai|S) for the unsmoothed estimator.³</Paragraph> <Paragraph position="4"> We approximate the posterior P(αi = ai|S) by first assuming that the multinomial is a collection of independent binomials, each of which corresponds to a single value ui of the multinomial; we then normalize the values so that they sum to 1. For each such binomial, we approximate P(αi = ai|S) as a truncated normal distribution (restricted to [0,1]), with estimated mean μ = ni/N and variance σ² = μ(1-μ)/N.⁴</Paragraph> <Paragraph position="5"> ³ In the implementation we smooth the MLE by interpolation with a uniform probability distribution, following Merialdo (1994). Approximate adaptation of P(αi = ai|S) to the smoothed version of the estimator is simple. ⁴ This approximation can in fact be avoided: the posterior probability P(αi = ai|S) for the multinomial is given exactly by the Dirichlet distribution (Johnson, 1972) (which reduces to the Beta distribution in the binomial case). In this work we assumed a uniform prior distribution for each model parameter; we have not addressed the question of how to best choose a prior for this problem.</Paragraph> <Paragraph position="6"> To generate a particular multinomial distribution, we randomly choose values for the binomial parameters ai from their approximated posterior distributions (using the simple sampling method given in (Press et al., 1988, p. 214)), and renormalize them so that they sum to 1. Finally, to generate a random HMM given statistics S, we choose values independently for the parameters of each multinomial, since all the different multinomials in an HMM are independent.</Paragraph> </Section> <Section position="2" start_page="320" end_page="320" type="sub_section"> <SectionTitle> 4.2 Examples in bigram training </SectionTitle> <Paragraph position="0"> Typically, concept learning problems are formulated such that there is a set of training examples that are independent of each other. When training a bigram model (indeed, any HMM), this is not true, as each word is dependent on the one before it. This problem is solved by considering each sentence as an individual example. More generally, it is possible to break the text at any point where tagging is unambiguous.</Paragraph> <Paragraph position="1"> We thus use unambiguous words (those with only one possible part of speech) as example boundaries in bigram tagging.
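Putting Section 4.1 together, the following sketch shows one way the sampling could be realized (unsmoothed MLE, uniform prior, and illustrative names assumed; the paper's implementation additionally smooths the estimates): each αi is drawn from a normal around ni/N truncated to [0,1], the draws are renormalized to sum to 1, and a random HMM is assembled by sampling every multinomial independently.

```python
import random

def sample_multinomial(counts):
    """counts: {value: n_i} for one conditioning event (a tag or a word).
    Returns one sampled probability distribution {value: alpha_i}."""
    N = sum(counts.values())
    sampled = {}
    for value, n in counts.items():
        mu = n / N
        sigma = (mu * (1.0 - mu) / N) ** 0.5
        # truncated-normal draw, here approximated by clipping to [0, 1]
        sampled[value] = min(1.0, max(0.0, random.gauss(mu, sigma)))
    total = sum(sampled.values())
    return {v: a / total for v, a in sampled.items()}  # renormalize to sum to 1

def sample_hmm(transition_counts, lexical_counts):
    """Generate one committee member: every multinomial in the HMM is
    sampled independently from its approximate posterior."""
    transitions = {t: sample_multinomial(c) for t, c in transition_counts.items()}
    lexical = {w: sample_multinomial(c) for w, c in lexical_counts.items()}
    return transitions, lexical
```

Clipping is a cruder stand-in for a proper truncated-normal draw (the paper uses the sampling method of Press et al., 1988); rejection sampling, as in the binomial sketch above, would be closer.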
Using unambiguous words as example boundaries allows us to train on smaller examples, focusing training more on the truly informative parts of the corpus.</Paragraph> </Section> </Section> <Section position="6" start_page="320" end_page="321" type="metho"> <SectionTitle> 5 Selection Algorithms </SectionTitle> <Paragraph position="0"> Within the committee-based paradigm there exist different methods for selecting informative examples.</Paragraph> <Paragraph position="1"> Previous research in sample selection has used either sequential selection (Seung, Opper, and Sompolinsky, 1992; Freund et al., 1993; Dagan and Engelson, 1995), or batch selection (Lewis and Catlett, 1994; Lewis and Gale, 1994). We describe here general algorithms for both sequential and batch selection.</Paragraph> <Paragraph position="2"> Sequential selection examines unlabeled examples as they are supplied, one by one, and measures the disagreement in their classification by the committee. Those examples determined to be sufficiently informative are selected for training. Most simply, we can use a committee of size two and select an example when the two models disagree on its classification. This gives the following, parameter-free, two member sequential selection algorithm, executed for each unlabeled input example e:
1. Draw 2 models randomly from P(M|S), where S are statistics acquired from previously labeled examples;
2. Classify e by each model, giving classifications c1 and c2;
3. If c1 ≠ c2, select e for annotation;
4. If e is selected, get its correct label and update S accordingly.</Paragraph> <Paragraph position="4"> This basic algorithm needs no parameters. If desired, it is possible to tune the frequency of selection by changing the variance of P(M|S) (or the variance of P(αi = ai|S) for each parameter), where larger variances increase the rate of disagreement among the committee members. We implemented this effect by employing a temperature parameter t, used as a multiplier of the variance of the posterior parameter distribution.</Paragraph> <Paragraph position="5"> A more general algorithm results from allowing (i) a larger number of committee members, k, in order to sample P(M|S) more precisely, and (ii) more refined example selection criteria. This gives the following general sequential selection algorithm, executed for each unlabeled input example e:
1. Draw k models {Mi} randomly from P(M|S) (possibly using a temperature t);
2. Classify e by each model Mi, giving classifications {ci};
3. Measure the disagreement D over {ci};
4. Decide whether to select e for annotation, based on the value of D;
5. If e is selected, get its correct label and update S accordingly.</Paragraph>
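A sketch of the general sequential selection loop just listed; the two member algorithm is the special case k = 2 with "any disagreement selects". The callables passed in (sample_model, classify, and so on) are assumed stand-ins for the components described earlier, and temperature multiplies the posterior variance when models are drawn.

```python
def sequential_selection(unlabeled_stream, stats, sample_model, classify,
                         measure_disagreement, should_select, get_label,
                         update_stats, k=2, temperature=1.0):
    """Examine unlabeled examples one by one; have a committee of k sampled
    models classify each one, and request annotation when they disagree enough."""
    for example in unlabeled_stream:
        committee = [sample_model(stats, temperature) for _ in range(k)]
        votes = [classify(model, example) for model in committee]
        disagreement = measure_disagreement(votes)
        if should_select(disagreement):          # e.g. any disagreement when k == 2
            label = get_label(example)           # the manual annotation step
            update_stats(stats, example, label)  # fold the new counts into S
```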
<Paragraph position="6"> It is easy to see that two member sequential selection is a special case of general sequential selection, where any disagreement is considered sufficient for selection. In order to instantiate the general algorithm for larger committees, we need to define (i) a measure for disagreement (Step 3), and (ii) a selection criterion (Step 4).</Paragraph> <Paragraph position="7"> Our approach to measuring disagreement is to use the vote entropy, the entropy of the distribution of classifications assigned to an example ('voted for') by the committee members. Denoting the number of committee members assigning c to e by V(c, e), the vote entropy is:</Paragraph> <Paragraph position="8"> D = - (1 / log k) Σc (V(c, e) / k) log (V(c, e) / k)</Paragraph> <Paragraph position="9"> (Dividing by log k normalizes the scale for the number of committee members.) Vote entropy is maximized when all committee members disagree, and is zero when they all agree.</Paragraph> <Paragraph position="10"> In bigram tagging, each example consists of a sequence of several words. In our system, we measure D separately for each word, and use the average entropy over the word sequence as a measurement of disagreement for the example. We use the average entropy rather than the entropy over the entire sequence, because the number of committee members is small with respect to the total number of possible tag sequences. Note that we do not look at the entropy of the distribution given by each single model to the possible tags (classes), since we are only interested in the uncertainty of the final classification (see the discussion in Section 7).</Paragraph> <Paragraph position="11"> We consider two alternative selection criteria (for Step 4). The simplest is thresholded selection, in which an example is selected for annotation if its vote entropy exceeds some threshold θ. The other alternative is randomized selection, in which an example is selected for annotation based on the flip of a coin biased according to the vote entropy, with a higher vote entropy entailing a higher probability of selection. We define the selection probability as a linear function of vote entropy: p = gD, where g is an entropy gain parameter. The selection method we used in our earlier work (Dagan and Engelson, 1995) is randomized sequential selection using this linear selection probability model, with parameters k, t and g.</Paragraph> <Paragraph position="12"> An alternative to sequential selection is batch selection. Rather than evaluating examples individually for their informativeness, a large batch of examples is examined, and the m best are selected for annotation. The batch selection algorithm, executed for each batch B of N examples, is as follows:
1. For each example e in B:
(a) Draw k models randomly from P(M|S);
(b) Classify e by each model, giving classifications {ci};
(c) Measure the disagreement De for e over {ci};
2. Select for annotation the m examples from B with the highest De;
3. Update S by the statistics of the selected examples.</Paragraph> <Paragraph position="13"> This procedure is repeated sequentially for successive batches of N examples, returning to the start of the corpus at the end. If N is equal to the size of the corpus, batch selection selects the m globally best examples in the corpus at each stage (as in (Lewis and Catlett, 1994)). On the other hand, as N decreases, batch selection becomes closer to sequential selection.</Paragraph> </Section>
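The disagreement measure and the selection criteria of this section can be sketched as follows (illustrative code, not the published implementation): vote entropy normalized by log k and averaged per word, thresholded selection against θ, randomized selection with probability p = gD, and batch selection of the m highest-disagreement examples.

```python
import math
import random
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's vote distribution, normalized by log k."""
    k = len(votes)
    counts = Counter(votes)
    ent = -sum((v / k) * math.log(v / k) for v in counts.values())
    return ent / math.log(k) if k > 1 else 0.0

def example_disagreement(per_word_votes):
    """For tagging, average the vote entropy over the words of the example."""
    return sum(vote_entropy(v) for v in per_word_votes) / len(per_word_votes)

def thresholded_select(D, theta):
    return D > theta

def randomized_select(D, gain):
    return random.random() < gain * D   # biased coin flip with p = g * D

def batch_select(disagreements, m):
    """Indices of the m examples in the batch with the highest disagreement."""
    ranked = sorted(range(len(disagreements)), key=lambda i: disagreements[i], reverse=True)
    return ranked[:m]
```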
<Section position="7" start_page="321" end_page="323" type="metho"> <SectionTitle> 6 Experimental Results </SectionTitle> <Paragraph position="0"> This section presents results of applying committee-based sample selection to bigram part-of-speech tagging, as compared with complete training on all examples in the corpus. Evaluation was performed using the University of Pennsylvania tagged corpus from the ACL/DCI CD-ROM I. For ease of implementation, we used a complete (closed) lexicon which contains all the words in the corpus.</Paragraph> <Paragraph position="1"> The committee-based sampling algorithm was initialized using the first 1,000 words from the corpus, and then sequentially examined the following examples in the corpus for possible labeling. The training set consisted of the first million words in the corpus, with sentence ordering randomized to compensate for inhomogeneity in corpus composition. The test set was a separate portion of the corpus, consisting of 20,000 words. We compare the amount of training required by different selection methods to achieve a given tagging accuracy on the test set, where both the amount of training and tagging accuracy are measured over ambiguous words.⁵ The effectiveness of randomized committee-based selection for part-of-speech tagging, with 5 and 10 committee members, was demonstrated in (Dagan and Engelson, 1995). Here we present and compare results for batch, randomized, thresholded, and two member committee-based selection.</Paragraph> <Paragraph position="2"> ⁵ Note that most other work on tagging has measured accuracy over all words, not just ambiguous ones. Complete training of our system on 1,000,000 words gave us an accuracy of 93.5% over ambiguous words, which corresponds to an accuracy of 95.9% over all words in the test set, comparable to other published results on bigram tagging.</Paragraph> <Paragraph position="3"> [Figure 1 (partial caption): ...randomized and thresholded runs, k = 5 and t = 50. (a) Number of ambiguous words selected for labeling versus classification accuracy achieved. (b) Accuracy versus number of words examined from the corpus (both labeled and unlabeled).]</Paragraph> <Paragraph position="4"> [Figure 2 (partial caption): (a) Accuracy achieved versus batch size at different numbers of selected training words. (b) Accuracy versus number of words examined from the corpus for different batch sizes.]</Paragraph> <Paragraph position="5"> Figure 1 presents the results of comparing the several selection methods against each other. The plots shown are for the best parameter settings that we found through manual tuning for each method. Figure 1(a) shows the advantage that sample selection gives with regard to annotation cost. For example, complete training requires annotated examples containing 98,000 ambiguous words to achieve a 92.6% accuracy (beyond the scale of the graph), while the selective methods require only 18,000-25,000 ambiguous words to achieve this accuracy. We also find that, to a first approximation, all selection methods considered give similar results. Thus, it seems that a refined choice of the selection method is not crucial for achieving large reductions in annotation cost.</Paragraph> <Paragraph position="6"> [Figure 3 (partial caption): The number of frequency counts > 0, plotted (y-axis) versus classification accuracy achieved (x-axis). (a) Lexical counts (freq(t, w)). (b) Bigram counts (freq(t1→t2)).]</Paragraph> <Paragraph position="7"> This equivalence of the different methods also largely holds with respect to computational efficiency. Figure 1(b) plots classification accuracy versus number of words examined, instead of those selected. We see that while all selective methods are less efficient in terms of examples examined than complete training, they are comparable to each other.
Two member selection seems to have a clear, though small, advantage.</Paragraph> <Paragraph position="8"> In Figure 2 we investigate further the properties of batch selection. Figure 2(a) shows that accuracy increases with batch size only up to a point, and then starts to decrease. This result is in line with theoretical difficulties with batch selection (Freund et al., 1993) in that batch selection does not account for the distribution of input examples. Hence, once batch size increases past a point, the input distribution has too little influence on which examples are selected, and hence classification accuracy decreases. Furthermore, as batch size increases, computational efficiency, in terms of the number of examples examined to attain a given accuracy, decreases tremendously (Figure 2(b)).</Paragraph> <Paragraph position="9"> The ability of committee-based selection to focus on the more informative parts of the training corpus is analyzed in Figure 3. Here we examined the number of lexical and bigram counts that were stored (i.e., were non-zero) during training, using the two member selection algorithm and complete training. As the graphs show, the sample selection method achieves the same accuracy as complete training with fewer lexical and bigram counts. This means that many counts in the data are less useful for correct tagging, as replacing them with smoothed estimates works just as well.⁶ Committee-based selection ignores such counts, focusing on parameters which improve the model. This behavior has the practical advantage of reducing the size of the model significantly (by a factor of three here). Also, the average count is lower in a model constructed by selective training than in a fully trained model, suggesting that the selection method avoids using examples which increase the counts for already known parameters.</Paragraph> </Section> <Section position="8" start_page="323" end_page="324" type="metho"> <SectionTitle> 7 Discussion </SectionTitle> <Paragraph position="0"> Why does committee-based sample selection work? Consider the properties of those examples that are selected for training. In general, a selected training example will contribute data to several statistics, which in turn will improve the estimates of several parameter values. An informative example is therefore one whose contribution to the statistics leads to a significantly useful improvement of model parameter estimates. Model parameters for which acquiring additional statistics is most beneficial can be characterized by the following three properties:
1. The current estimate of the parameter is uncertain due to insufficient statistics in the training set. Additional statistics would bring the estimate closer to the true value.</Paragraph> <Paragraph position="1"> 2. Classification of examples is sensitive to changes in the current estimate of the parameter. Otherwise, even if the current value of the parameter is very uncertain, acquiring additional statistics will not change the resulting classifications.
3. The parameter affects the classification of many examples; parameters that affect only few examples have low overall utility.</Paragraph> <Paragraph position="2"> The committee-based selection algorithms work because they tend to select examples that affect parameters with the above three properties. Property 1 is addressed by randomly drawing the parameter values for committee members from the posterior distribution given the current statistics.
When the statistics for a parameter are insufficient, the variance of the posterior distribution of the estimates is large, and hence there will be large differences in the values of the parameter chosen for different committee members. Note that property 1 is not addressed when uncertainty in classification is only judged relative to a single model⁷ (as in, e.g., (Lewis and Gale, 1994)). Property 2 is addressed by selecting examples for which committee members highly disagree in classification (rather than measuring disagreement in parameter estimates). Committee-based selection thus addresses properties 1 and 2 simultaneously: it acquires statistics just when uncertainty in current parameter estimates entails uncertainty regarding the appropriate classification of the example. Our results show that this effect is achieved even when using only two committee members to sample the space of likely classifications. By appropriate classification we mean the classification given by a perfectly-trained model, that is, one with accurate parameter values.</Paragraph> <Paragraph position="3"> Note that this type of uncertainty regarding the identity of the appropriate classification is different from uncertainty regarding the correctness of the classification itself. For example, sufficient statistics may yield an accurate 0.51 probability estimate for a class c in a given example, making it certain that c is the appropriate classification. However, the certainty that c is the correct classification is low, since there is a 0.49 chance that c is the wrong class for the example. A single model can be used to estimate only the second type of uncertainty, which does not correlate directly with the utility of additional training.</Paragraph> <Paragraph position="4"> Finally, property 3 is addressed by independently examining input examples which are drawn from the input distribution. In this way, we implicitly model the distribution of model parameters used for classifying input examples. Such modeling is absent in batch selection, and we hypothesize that this is the reason for its lower effectiveness.</Paragraph> </Section> </Paper>