<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1027">
  <Title>Semi-Supervised Conditional Random Fields for Improved Sequence Segmentation and Labeling</Title>
  <Section position="4" start_page="209" end_page="210" type="metho">
    <SectionTitle>
2 Semi-supervised CRF training
</SectionTitle>
    <Paragraph position="0"> In what follows, we use the same notation as (Lafferty et al. 2001). Let $X$ be a random variable over data sequences to be labeled, and $Y$ be a random variable over corresponding label sequences. All components, $Y_i$, of $Y$ are assumed to range over a finite label alphabet $\mathcal{Y}$. For example, $X$ might range over sentences and $Y$ over part-of-speech taggings of those sentences; hence $\mathcal{Y}$ would be the set of possible part-of-speech tags in this case.</Paragraph>
    <Paragraph position="1"> Assume we have a set of labeled examples, $D^l = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$, and unlabeled examples, $D^u = \{x^{(N+1)}, \ldots, x^{(M)}\}$. We would like to build a CRF model
$$p_\theta(y \mid x) = \frac{1}{Z_\theta(x)} \exp\Big(\sum_k \theta_k f_k(x, y)\Big) = \frac{1}{Z_\theta(x)} \exp\big(\langle \theta, \mathbf{f}(x, y)\rangle\big)$$
over sequential input and output data $(x, y)$, where $\theta = (\theta_1, \ldots, \theta_K)$ is a parameter vector, $\mathbf{f}(x, y) = (f_1(x, y), \ldots, f_K(x, y))$ is a vector of feature functions, and $Z_\theta(x) = \sum_y \exp\langle \theta, \mathbf{f}(x, y)\rangle$ is a normalizing factor. Our goal is to learn such a model from the combined set of labeled and unlabeled examples, $D^l \cup D^u$.</Paragraph>
    <Paragraph position="2"> The standard supervised CRF training procedure is based upon maximizing the log conditional likelihood of the labeled examples in $D^l$:
$$CL(\theta) = \sum_{i=1}^{N} \log p_\theta\big(y^{(i)} \mid x^{(i)}\big) - U(\theta) \qquad (1)$$
where $U(\theta)$ is a standard regularizer on $\theta$, e.g. $U(\theta) = \|\theta\|^2 / 2\sigma^2$. Regularization can be used to limit over-fitting on rare features and avoid degeneracy in the case of correlated features. Obviously, (1) ignores the unlabeled examples in $D^u$.</Paragraph>
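To make the model and objective (1) concrete, the following minimal sketch computes $p_\theta(y \mid x)$ and the penalized log conditional likelihood by brute-force enumeration of label sequences. It is only feasible for tiny label sets and short sequences, and the label alphabet, feature functions, and data are hypothetical stand-ins, not the feature set used later in the paper.

import itertools
import math

LABELS = ["B", "I", "O"]  # hypothetical label alphabet

def features(x, y):
    """Sparse feature counts for a (sentence, label sequence) pair:
    (label, word) emission indicators and (label, label) transition indicators."""
    feats = {}
    for i, (word, tag) in enumerate(zip(x, y)):
        feats[("emit", tag, word)] = feats.get(("emit", tag, word), 0) + 1
        prev = y[i - 1] if i > 0 else "start"
        feats[("trans", prev, tag)] = feats.get(("trans", prev, tag), 0) + 1
    return feats

def score(theta, x, y):
    """Linear score dot(theta, f(x, y))."""
    return sum(theta.get(k, 0.0) * v for k, v in features(x, y).items())

def log_partition(theta, x):
    """log Z_theta(x), enumerating all label sequences (exponential; toy sizes only)."""
    scores = [score(theta, x, y) for y in itertools.product(LABELS, repeat=len(x))]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def log_prob(theta, x, y):
    """log p_theta(y | x)."""
    return score(theta, x, y) - log_partition(theta, x)

def supervised_objective(theta, labeled, sigma2=10.0):
    """Objective (1): penalized log conditional likelihood with U(theta) = ||theta||^2 / (2 sigma^2)."""
    ll = sum(log_prob(theta, x, y) for x, y in labeled)
    reg = sum(v * v for v in theta.values()) / (2.0 * sigma2)
    return ll - reg

# toy usage with made-up data
labeled = [(["p53", "binds", "DNA"], ("B", "O", "O"))]
theta = {("emit", "B", "p53"): 1.0}
print(supervised_objective(theta, labeled))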
    <Paragraph position="5"> To make full use of the available training data, we propose a semi-supervised learning algorithm that exploits a form of entropy regularization on the unlabeled data. Specifically, for a semi-supervised CRF, we propose to maximize the following objective</Paragraph>
    <Paragraph position="7"> $$RL(\theta) = \sum_{i=1}^{N} \log p_\theta\big(y^{(i)} \mid x^{(i)}\big) - U(\theta) + \gamma \sum_{i=N+1}^{M} \sum_{y} p_\theta\big(y \mid x^{(i)}\big) \log p_\theta\big(y \mid x^{(i)}\big) \qquad (2)$$
where the first two terms are the penalized log conditional likelihood of the labeled data under the CRF, (1), and the final term is the negative conditional entropy of the CRF on the unlabeled data. Here, $\gamma$ is a tradeoff parameter that controls the influence of the unlabeled data.</Paragraph>
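Continuing the toy sketch above, objective (2) only adds the scaled negative conditional entropy of the model on the unlabeled sequences. The helper below again enumerates label sequences exhaustively and takes the labeled-data objective and log-probability functions as arguments (for instance, the hypothetical ones defined after (1)); it is meant only to make the objective concrete.

import itertools
import math

def neg_conditional_entropy(theta, x, labels, log_prob):
    """sum_y p_theta(y | x) log p_theta(y | x), by exhaustive enumeration (toy sizes only)."""
    total = 0.0
    for y in itertools.product(labels, repeat=len(x)):
        lp = log_prob(theta, x, y)
        total += math.exp(lp) * lp
    return total

def semi_supervised_objective(theta, labeled, unlabeled, gamma, labels,
                              log_prob, supervised_objective):
    """Objective (2): penalized likelihood on labeled data plus
    gamma times the negative conditional entropy on unlabeled data."""
    obj = supervised_objective(theta, labeled)
    for x in unlabeled:
        obj += gamma * neg_conditional_entropy(theta, x, labels, log_prob)
    return obj

# e.g., with the helpers from the previous sketch:
# semi_supervised_objective(theta, labeled, [["MDM2", "inhibits", "p53"]], gamma=0.5,
#                           labels=LABELS, log_prob=log_prob,
#                           supervised_objective=supervised_objective)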
    <Paragraph position="9">  This approach resembles that taken by (Grandvalet and Bengio 2004) for single variable classification, but here applied to structured CRF training. The motivation is that minimizing conditional entropy over unlabeled data encourages the algorithm to find putative labelings for the unlabeled data that are mutually reinforcing with the supervised labels; that is, greater certainty on the putative labelings coincides with greater conditional likelihood on the supervised labels, and vice versa.</Paragraph>
    <Paragraph position="10"> For a single classification variable this criterion has been shown to effectively partition unlabeled data into clusters (Grandvalet and Bengio 2004; Roberts et al. 2000).</Paragraph>
    <Paragraph position="11"> To motivate the approach in more detail, consider the overlap between the distribution $p_\theta(y \mid x)\,\tilde{p}(x)$ induced by the model over a label sequence $y$ and the empirical distribution $\tilde{p}(x)$ of the unlabeled data $D^u$. The overlap can be measured by the Kullback-Leibler divergence $D\big(p_\theta(y \mid x)\tilde{p}(x) \,\|\, \tilde{p}(x)\big)$. It is well known that Kullback-Leibler divergence (Cover and Thomas 1991) is non-negative and increases as the overlap between the two distributions decreases.</Paragraph>
    <Paragraph position="12"> In other words, maximizing Kullback-Leibler divergence implies that the overlap between the two distributions is minimized. The total overlap over all possible label sequences can be defined as
$$\sum_y D\big(p_\theta(y \mid x)\tilde{p}(x) \,\big\|\, \tilde{p}(x)\big) = \sum_y \sum_x p_\theta(y \mid x)\,\tilde{p}(x) \log \frac{p_\theta(y \mid x)\,\tilde{p}(x)}{\tilde{p}(x)} = \sum_x \tilde{p}(x) \sum_y p_\theta(y \mid x) \log p_\theta(y \mid x),$$
which motivates the negative entropy term in (2).</Paragraph>
    <Paragraph position="13"> The combined training objective (2) exploits unlabeled data to improve the CRF model, as we will see. However, one drawback with this approach is that the entropy regularization term is not concave. To see why, note that the entropy regularizer can be seen as a composition, $f(\theta) = g(h(\theta))$, where the inner functions are the sequence probabilities $h_y(\theta) = p_\theta(y \mid x)$ and the outer function is $g(u) = \sum_y u_y \log u_y$. A sufficient condition for the concavity of such a composition, given by (Boyd and Vandenberghe 2004), is that $g$ be concave and nondecreasing in each argument and that each $h_y$ be concave. Since $g$ is not nondecreasing here, $f$ is not necessarily concave. So in general there are local maxima in (2).</Paragraph>
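The non-concavity can be seen numerically even in the simplest single-variable case. The toy check below (not from the paper) evaluates the negative entropy of a binary logistic model $p = \sigma(\theta)$ and verifies by finite differences that its curvature changes sign along the parameter axis, so an objective containing this term can have local maxima.

import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def neg_entropy(theta):
    """sum_y p(y) log p(y) for the two-class distribution (sigmoid(theta), 1 - sigmoid(theta))."""
    p = sigmoid(theta)
    return p * math.log(p) + (1 - p) * math.log(1 - p)

def second_difference(f, t, h=1e-3):
    return (f(t + h) - 2 * f(t) + f(t - h)) / (h * h)

# Near theta = 0 the curvature is positive (locally convex), far from 0 it is negative,
# so the term is neither convex nor concave over the whole parameter line.
for theta in (0.0, 0.5, 3.0, 6.0):
    print(theta, second_difference(neg_entropy, theta))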
  </Section>
  <Section position="5" start_page="210" end_page="211" type="metho">
    <SectionTitle>
3 An efficient training procedure
</SectionTitle>
    <Paragraph position="0"> As (2) is not concave, many of the standard global maximization techniques do not apply. However, one can still use unlabeled data to improve a supervised CRF via iterative ascent. To derive an efficient iterative ascent procedure, we need to compute the gradient of (2) with respect to the parameters $\theta$. Taking the derivative of the objective function (2) with respect to $\theta$ yields
$$\frac{\partial RL(\theta)}{\partial \theta} = \sum_{i=1}^{N} \mathbf{f}\big(x^{(i)}, y^{(i)}\big) - \sum_{i=1}^{N} \sum_{y} p_\theta\big(y \mid x^{(i)}\big)\, \mathbf{f}\big(x^{(i)}, y\big) - \frac{d U(\theta)}{d \theta} + \gamma \sum_{i=N+1}^{M} \mathrm{cov}_{p_\theta(y \mid x^{(i)})}\big[\mathbf{f}\big(x^{(i)}, y\big)\big]\, \theta \qquad (3)$$</Paragraph>
    <Paragraph position="2"> The first three items on the right hand side are just the standard gradient of the CRF objective, $\partial CL(\theta) / \partial \theta$ (Lafferty et al. 2001), and the final item is the gradient of the entropy regularizer (the derivation of which is given in Appendix A).</Paragraph>
    <Paragraph position="5"> Here, $\mathrm{cov}_{p_\theta(y \mid x^{(i)})}\big[\mathbf{f}(x^{(i)}, y)\big]$ is the conditional covariance matrix of the features, $f_j(x, y)$, given the sample sequence $x^{(i)}$. In particular, the $(j, k)$-th entry of this matrix is
$$\mathrm{cov}_{p_\theta(y \mid x)}\big[f_j(x, y),\, f_k(x, y)\big] = E_{p_\theta(y \mid x)}\big[f_j(x, y)\, f_k(x, y)\big] - E_{p_\theta(y \mid x)}\big[f_j(x, y)\big]\, E_{p_\theta(y \mid x)}\big[f_k(x, y)\big] \qquad (4)$$</Paragraph>
    <Paragraph position="11"> To efficiently calculate the gradient, we need to be able to efficiently compute the expectations with respect to $y$ in (3) and (4). However, this can pose a challenge in general, because there are exponentially many values for $y$. Techniques for computing the linear feature expectations in (3) are already well known if $y$ is sufficiently structured (e.g. $y$ forms a Markov chain) (Lafferty et al. 2001). However, we now have to develop efficient techniques for computing the quadratic feature expectations in (4).</Paragraph>
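A direct, exponential-time reference implementation of (4) on toy data simply enumerates all label sequences and accumulates the first and second moments of the features. The sketch below does this, taking the (hypothetical) features and log_prob helpers from the earlier sketch as arguments, and is intended only as a check against which an efficient dynamic-programming implementation could be tested.

import itertools
import math
from collections import defaultdict

def conditional_covariance(theta, x, labels, features, log_prob):
    """The (j, k) entries of equation (4), cov[f_j, f_k] under p_theta(y | x),
    computed by exhaustive enumeration of label sequences (toy sizes only)."""
    E_f = defaultdict(float)    # E[f_j]
    E_ff = defaultdict(float)   # E[f_j f_k]
    for y in itertools.product(labels, repeat=len(x)):
        p = math.exp(log_prob(theta, x, y))
        f = features(x, y)
        for j, vj in f.items():
            E_f[j] += p * vj
            for k, vk in f.items():
                E_ff[(j, k)] += p * vj * vk
    keys = list(E_f)
    return {(j, k): E_ff.get((j, k), 0.0) - E_f[j] * E_f[k] for j in keys for k in keys}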
    <Paragraph position="12"> For the quadratic feature expectations, first note that the diagonal terms, $j = k$, are straightforward: since each feature is an indicator, we have $f_j(x, y)^2 = f_j(x, y)$, and therefore the diagonal terms in the conditional covariance are just linear feature expectations, $E_{p_\theta(y \mid x)}\big[f_j(x, y)^2\big] = E_{p_\theta(y \mid x)}\big[f_j(x, y)\big]$, as before.</Paragraph>
    <Paragraph position="17"> For the off-diagonal terms, $j \neq k$, however, we need to develop a new algorithm. Fortunately, for structured label sequences, $Y$, one can devise an efficient algorithm for calculating the quadratic expectations based on nested dynamic programming. To illustrate the idea, we assume that the dependencies of $Y$, conditioned on $X$, form a Markov chain.</Paragraph>
    <Paragraph position="18"> Define one feature for each state pair $(y', y)$ and one feature for each state-observation pair $(y, x)$, both as indicator functions of the corresponding label transition and label-word occurrence. Following (Lafferty et al. 2001), we also add special start and stop states, $Y_0 = \mathrm{start}$ and $Y_{n+1} = \mathrm{stop}$. The conditional probability of a label sequence can now be expressed concisely in matrix form. For each position $i$ in the observation sequence $x$, define the $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix random variable $M_i(x) = [M_i(y', y \mid x)]$, where
$$M_i(y', y \mid x) = \exp\Big(\sum_k \theta_k f_k\big(e_i,\, Y|_{e_i} = (y', y),\, x\big)\Big)$$
and $e_i$ is the edge with labels $(Y_{i-1}, Y_i)$. The conditional probability of a label sequence is then
$$p_\theta(y \mid x) = \frac{1}{Z_\theta(x)} \prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x), \qquad Z_\theta(x) = \Big[\prod_{i=1}^{n+1} M_i(x)\Big]_{\mathrm{start},\,\mathrm{stop}}.$$</Paragraph>
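A minimal sketch of this matrix construction for a linear-chain CRF is shown below. For simplicity it assumes the only features are label-transition and label-word indicators with weight dictionaries theta_trans and theta_emit (hypothetical names, a simplification of the feature definitions above): each entry $M_i(y', y \mid x)$ is the exponentiated local score of the edge $(y', y)$ at position $i$, and $Z_\theta(x)$ is read off the product of these matrices.

import numpy as np

def transition_matrices(x, labels, theta_trans, theta_emit):
    """Build M_1(x), ..., M_{n+1}(x) for a linear-chain CRF with added start/stop states.
    theta_trans[(y_prev, y)] and theta_emit[(y, word)] are sparse weight dictionaries."""
    states = ["start"] + labels + ["stop"]
    idx = {s: i for i, s in enumerate(states)}
    n = len(x)
    Ms = []
    for i in range(n + 1):
        M = np.zeros((len(states), len(states)))
        for yp in states:
            for y in states:
                # structural constraints: start only precedes the first edge,
                # stop only terminates the last edge
                ok = ((y == "stop") == (i == n)) and ((yp == "start") == (i == 0)) \
                     and yp != "stop" and y != "start"
                if not ok:
                    continue
                s = theta_trans.get((yp, y), 0.0)
                if i < n:
                    s += theta_emit.get((y, x[i]), 0.0)
                M[idx[yp], idx[y]] = np.exp(s)
        Ms.append(M)
    return Ms, idx

def partition_function(Ms, idx):
    """Z_theta(x) = [M_1 ... M_{n+1}]_{start, stop}."""
    prod = np.eye(Ms[0].shape[0])
    for M in Ms:
        prod = prod @ M
    return prod[idx["start"], idx["stop"]]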
    <Paragraph position="24"> With these definitions, the expectation of the product of each pair of feature functions can be computed efficiently. First define the summary matrix
$$M_{(a+1,\, b-1)}(y', y \mid x) = \Big[\prod_{i=a+1}^{b-1} M_i(x)\Big]_{y',\, y},$$
which accumulates the weight of all partial label paths between positions $a$ and $b$. Then the quadratic feature expectations can be computed by the following recursion, where the two double sums in each expectation correspond to the two cases depending on which feature occurs first ($f_k$ occurring before $f_l$, or the reverse).</Paragraph>
    <Paragraph position="30"> The computation of these expectations can be organized in a trellis, as illustrated in Figure 1.</Paragraph>
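One way to realize the nested dynamic program is sketched below for the case where both features are transition (edge) indicators; it is a plausible reconstruction rather than the paper's exact recursion. Using the forward and backward vectors together with the summary (product) matrix over the gap between two positions, the joint probability of taking one labeled edge at position a and another at a later position b factors into five pieces, and E[f_k f_l] sums this over all ordered position pairs, covering both orders of occurrence. The matrices Ms and index map idx are assumed to come from the construction sketched above.

import numpy as np

def forward_backward(Ms, idx):
    """alpha[i] and beta[i] vectors over the extended state set, i = 0 .. n+1."""
    S = Ms[0].shape[0]
    n1 = len(Ms)                      # n + 1 edges
    alpha = [np.zeros(S) for _ in range(n1 + 1)]
    beta = [np.zeros(S) for _ in range(n1 + 1)]
    alpha[0][idx["start"]] = 1.0
    beta[n1][idx["stop"]] = 1.0
    for i in range(n1):
        alpha[i + 1] = alpha[i] @ Ms[i]
    for i in range(n1, 0, -1):
        beta[i - 1] = Ms[i - 1] @ beta[i]
    return alpha, beta

def quadratic_edge_expectation(Ms, idx, edge_k, edge_l):
    """E[f_k f_l] for two transition-indicator features, where f_k counts occurrences
    of the labeled edge edge_k = (y_prev, y) and f_l counts occurrences of edge_l."""
    alpha, beta = forward_backward(Ms, idx)
    Z = alpha[len(Ms)][idx["stop"]]
    uk, vk = idx[edge_k[0]], idx[edge_k[1]]
    ul, vl = idx[edge_l[0]], idx[edge_l[1]]
    n1 = len(Ms)
    total = 0.0
    for a in range(n1):
        for b in range(n1):
            if a == b:
                # both indicators fire on the same edge only if the edges coincide
                if edge_k == edge_l:
                    total += alpha[a][uk] * Ms[a][uk, vk] * beta[a + 1][vk] / Z
                continue
            first, last = (a, b) if a < b else (b, a)
            (fu, fv), (lu, lv) = ((uk, vk), (ul, vl)) if a < b else ((ul, vl), (uk, vk))
            # summary matrix over the gap between the two edges
            summary = np.eye(len(alpha[0]))
            for i in range(first + 1, last):
                summary = summary @ Ms[i]
            total += (alpha[first][fu] * Ms[first][fu, fv] * summary[fv, lu]
                      * Ms[last][lu, lv] * beta[last + 1][lv]) / Z
    return total

In this naive form the summary matrices are recomputed for every pair of positions; caching the running products, in line with organizing the computation in a trellis as in Figure 1, removes the redundant work and keeps the overall cost polynomial, as discussed in the next section.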
    <Paragraph position="31"> Once we obtain the gradient of the objective function (2), we use L-BFGS, a limited-memory quasi-Newton optimization algorithm (McCallum 2002; Nocedal and Wright 2000), to find a local maximum, with the initial value set to the optimal solution of the supervised CRF on the labeled data.</Paragraph>
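The optimization step can be sketched with an off-the-shelf limited-memory quasi-Newton routine; here scipy's L-BFGS-B stands in for the implementation cited above, and the objective and gradient callables are assumed to be supplied (for example, assembled from the pieces sketched earlier). Minimizing the negated objective from the supervised solution corresponds to the ascent described in the text.

import numpy as np
from scipy.optimize import minimize

def train_semi_supervised(objective, gradient, theta_supervised):
    """Locally maximize the non-concave objective (2) with L-BFGS,
    starting from the supervised CRF solution.
    objective(theta) -> float, gradient(theta) -> array of the same shape as theta."""
    result = minimize(
        fun=lambda th: -objective(th),            # minimize the negative of (2)
        x0=np.asarray(theta_supervised, dtype=float),
        jac=lambda th: -np.asarray(gradient(th)),
        method="L-BFGS-B",
    )
    return result.x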
  </Section>
  <Section position="6" start_page="211" end_page="214" type="metho">
    <SectionTitle>
4 Time and space complexity
</SectionTitle>
    <Paragraph position="0"> The time and space complexity of the semi-supervised CRF training procedure is greater than that of standard supervised CRF training, but nevertheless remains a small degree polynomial in the size of the training data. Let $N_l$ be the number of labeled training sequences, $N_u$ the number of unlabeled training sequences, $n_l$ the length of a labeled training sequence, $n_u$ the length of an unlabeled training sequence, $n_t$ the length of a test sequence, and $s$ the number of labels (states).</Paragraph>
    <Paragraph position="2"> Then the time required to classify a test sequence is $O(n_t s^2)$, independent of training method, since the Viterbi decoder needs to access each path.</Paragraph>
    <Paragraph position="3"> For training, supervised CRF training requires $O(N_l n_l s^2)$ time per iteration, whereas semi-supervised CRF training requires $O(N_l n_l s^2 + N_u n_u^2 s^3)$ time per iteration. The additional cost for semi-supervised training arises from the extra nested loop required to calculate the quadratic feature expectations, which introduces an additional factor of $n_u s$.</Paragraph>
    <Paragraph position="4"> However, the space requirements of the two training methods are the same. That is, even though the covariance matrix has size $O(K^2)$, where $K$ is the number of features, there is never any need to store the entire matrix in memory. Rather, since we only need to compute the product of the covariance with $\theta$, the calculation can be performed iteratively without using extra space beyond that already required by supervised CRF training.</Paragraph>
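The space argument can be made concrete: the product $\mathrm{cov}[\mathbf{f}]\,\theta$ needed in (3) equals $E[\mathbf{f}\,\langle\mathbf{f},\theta\rangle] - E[\mathbf{f}]\,E[\langle\mathbf{f},\theta\rangle]$, so it can be accumulated feature by feature without ever materializing the $K \times K$ covariance matrix. The sketch below assumes an iterator over label-sequence posteriors (exhaustive for a toy problem; in practice the dynamic program above would supply the required expectations).

from collections import defaultdict

def cov_times_theta(posterior_iter, theta):
    """Compute cov[f] @ theta as E[f * dot(f, theta)] - E[f] * E[dot(f, theta)] in O(K) memory.
    posterior_iter yields (probability, sparse_feature_dict) pairs, one per label sequence y."""
    E_f = defaultdict(float)             # E[f_j]
    E_dot = 0.0                          # E[dot(f, theta)]
    E_f_dot = defaultdict(float)         # E[f_j * dot(f, theta)]
    for p, f in posterior_iter:
        dot = sum(theta.get(j, 0.0) * v for j, v in f.items())
        E_dot += p * dot
        for j, v in f.items():
            E_f[j] += p * v
            E_f_dot[j] += p * v * dot
    return {j: E_f_dot[j] - E_f[j] * E_dot for j in E_f}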
    <Paragraph position="5"> (Figure 1 caption) Trellis for computing the expectation of a product of a pair of feature functions, $f_k$ vs $f_l$, where the feature $f_k$ occurs first. This leads to one double sum.
5 Identifying gene and protein mentions
We have developed our new semi-supervised training procedure to address the problem of information extraction from biomedical text, which has received significant attention in the past few years. We have specifically focused on the problem of identifying explicit mentions of gene and protein names (McDonald and Pereira 2005). Recently, McDonald and Pereira (2005) obtained interesting results on this problem by using a standard supervised CRF approach. However, our contention is that stronger results could be obtained in this domain by exploiting a large corpus of un-annotated biomedical text to improve the quality of the predictions, which we now show.</Paragraph>
    <Paragraph position="6"> Given a biomedical text, the task of identifying gene mentions can be interpreted as a tagging task, where each word in the text is labeled with a tag that indicates whether it is the beginning of a gene mention (B), the continuation of a gene mention (I), or outside of any gene mention (O).</Paragraph>
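For example, a made-up sentence would be encoded as follows, with the gene mention spanning exactly the tokens tagged B and I:

# Hypothetical example of BIO encoding for gene-mention tagging.
tokens = ["The", "p53", "tumor", "suppressor", "binds", "DNA", "."]
tags   = ["O",   "B",   "I",     "I",          "O",     "O",   "O"]
# "p53 tumor suppressor" is one gene mention: B marks its first token,
# I marks its continuation, and O marks tokens outside any mention.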
    <Paragraph position="7"> To compare the performance of different taggers learned by different mechanisms, one can measure the precision, recall and F-measure, given by
$$\mathrm{precision} = \frac{\#\,\text{correct predictions}}{\#\,\text{predicted gene mentions}}, \qquad \mathrm{recall} = \frac{\#\,\text{correct predictions}}{\#\,\text{true gene mentions}}, \qquad \text{F-measure} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$</Paragraph>
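These quantities follow directly from the mention counts; a small helper (names hypothetical) is:

def prf(num_correct, num_predicted, num_true):
    """Precision, recall and F-measure from gene-mention counts."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_true if num_true else 0.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

# e.g. the supervised baseline counts reported later in this section:
# prf(2435, 3334, 7472) gives approximately (0.73, 0.33, 0.45)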
    <Paragraph position="9"> In our evaluation, we compared the proposed semi-supervised learning approach to the state-of-the-art supervised CRF of McDonald and Pereira (2005), and also to self-training (Celeux and Govaert 1992; Yarowsky 1995), using the same feature set as (McDonald and Pereira 2005). The CRF training procedures, supervised and semi-supervised, were run with the same regularization function, $U(\theta) = \|\theta\|^2 / 2\sigma^2$, used in (McDonald and Pereira 2005).</Paragraph>
    <Paragraph position="12"> First we evaluated the performance of the semi-supervised CRF in detail, by varying the ratio between the amount of labeled and unlabeled data, and also varying the tradeoff parameter $\gamma$. We chose a labeled training set $A$ consisting of 5448 words, and considered alternative unlabeled training sets $B$ (5210 words), $C$ (10,208 words), and $D$ (25,145 words), consisting of the same, 2 times and 5 times as many sentences as $A$ respectively.</Paragraph>
    <Paragraph position="13"> All of these sets were disjoint and selected randomly from the full corpus (the smaller corpus used in (McDonald et al. 2005), consisting of 184,903 words in total). To determine sensitivity to the parameter $\gamma$ we examined a range of discrete values.</Paragraph>
    <Paragraph position="15"> In our first experiment, we train the CRF models using labeled set $A$ and unlabeled sets $B$, $C$ and $D$ respectively, and then test the performance on the sets $B$, $C$ and $D$ respectively. The results of our evaluation are shown in Table 1. The performance of the supervised CRF algorithm, trained only on the labeled set $A$, is given in the first row of Table 1 (corresponding to $\gamma = 0$). By comparison, the results obtained by the semi-supervised CRFs on the held-out sets $B$, $C$ and $D$ are given in Table 1 for increasing values of $\gamma$.</Paragraph>
    <Paragraph position="16"> The results of this experiment demonstrate quite clearly that in most cases the semi-supervised CRF obtains higher precision, recall and F-measure than the fully supervised CRF, yielding a 20% improvement in the best case.</Paragraph>
    <Paragraph position="17"> In our second experiment, we again train the CRF models using labeled set $A$ and unlabeled sets $B$, $C$ and $D$ respectively with increasing values of $\gamma$, but we test the performance on the held-out set $E$, which is the full corpus minus the labeled set $A$ and unlabeled sets $B$, $C$ and $D$. The results of our evaluation are shown in Table 2 and Figure 2, which also shows the performance of the supervised CRF algorithm, trained only on the labeled set $A$. In particular, by using the supervised CRF model, the system predicted 3334 out of 7472 gene mentions, of which 2435 were correct, resulting in a precision of 0.73, recall of 0.33 and F-measure of 0.45. The other curves in Figure 2 are those of the semi-supervised CRFs.</Paragraph>
    <Paragraph position="18"> The results of this experiment demonstrate quite clearly that the semi-supervised CRFs simultaneously increase both the number of predicted gene mentions and the number of correct predictions; thus the precision remains almost the same as the supervised CRF, while the recall increases significantly. Both experiments, as illustrated in Figure 2 and Tables 1 and 2, clearly show that better results are obtained by incorporating additional unlabeled training data, even when evaluating on disjoint testing data (Figure 2). The performance of the semi-supervised CRF is not overly sensitive to the tradeoff parameter $\gamma$, except that $\gamma$ cannot be set too large.</Paragraph>
    <Section position="1" start_page="213" end_page="214" type="sub_section">
      <SectionTitle>
5.1 Comparison to self-training
</SectionTitle>
      <Paragraph position="0"> For completeness, we also compared our results to the self-learning algorithm, which has commonly been referred to as bootstrapping in natural language processing and was originally popularized by the work of Yarowsky on word sense disambiguation (Abney 2004; Yarowsky 1995). In fact, similar ideas have been developed in pattern recognition under the name of the decision-directed algorithm (Duda and Hart 1973), and can also be traced back to the 1970s in the EM literature (Celeux and Govaert 1992). The basic algorithm works as follows: given $D^l$ and $D^u$, start with the model trained on the labeled set $D^l$ alone; at each iteration $t$, label every sequence in $D^u$ with its most likely (Viterbi) label sequence under the current model and retrain the CRF on $D^l$ together with these self-labeled examples; if the self-assigned labels do not change, stop; otherwise set $t = t + 1$ and iterate.</Paragraph>
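A compact sketch of this self-training loop is given below; train_crf and viterbi_decode are placeholders for whatever supervised CRF implementation is used, and this variant adds all self-labeled sequences each round (confidence thresholds are a common refinement not considered here).

def self_train(labeled, unlabeled, train_crf, viterbi_decode, max_iters=20):
    """Decision-directed / self-training loop: repeatedly train on labeled plus
    self-labeled data until the guessed labels stop changing.
    train_crf(pairs) -> model, viterbi_decode(model, x) -> label sequence."""
    previous_guesses = None
    model = train_crf(labeled)
    for _ in range(max_iters):
        guesses = [tuple(viterbi_decode(model, x)) for x in unlabeled]
        if guesses == previous_guesses:
            break                                  # labels unchanged: stop
        previous_guesses = guesses
        model = train_crf(labeled + list(zip(unlabeled, guesses)))
    return model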
      <Paragraph position="1"> We implemented this self-training approach and tried it in our experiments. Unfortunately, we were not able to obtain any improvements over the standard supervised CRF with self-learning, using the sets $D^l = A$ and $D^u \in \{B, C, D\}$. The semi-supervised CRF remains the best of the approaches we have tried on this problem.</Paragraph>
    </Section>
  </Section>
</Paper>