File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-1019_metho.xml

Size: 23,220 bytes

Last Modified: 2025-10-06 14:08:25

<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1019">
  <Title>Investigating Loss Functions and Optimization Methods for Discriminative Learning of Label Sequences</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Discriminative Modeling of Label
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Sequences Learning
</SectionTitle>
      <Paragraph position="0"> Label sequence learning is, formally, the problem of learning a function that maps a sequence of observations a7a9a8 a10a12a11a14a13a16a15a17a11a19a18a20a15a22a21a23a21a23a21a23a15a17a11a1a24a26a25 to a label sequence  a10a12a28 a13 a15a17a28 a18 a15a22a21a23a21a23a21a23a15a17a28 a24 a25 , where each a28a30a29a32a31a34a33 , the set of individual labels. For example, in POS tagging, the words a11a35a29 's construct a sentence a7 , and a27 is the labelling of the sentence where a28 a29 is the part of speech tag of the word a11 a29 . We are interested in the supervised learning setting, where we are given a corpus,  a25 in order to learn the classifier.</Paragraph>
      <Paragraph position="1"> The most popular model for label sequence learning is the Hidden Markov Model (HMM). An HMM, as a generative model, is trained by finding the joint probability distribution over the observation and label sequencesa41 a10a7 a15 a27 a25 that explains the corpus a36 the best (Figure 1a). In this model, each random variable is assumed to be independent of the other random variables, given its parents. Because of the long distance dependencies of natural languages that cannot be modeled by sequences, this conditional independence assumption is violated in many NLP tasks.</Paragraph>
      <Paragraph position="2"> Another shortcoming of this model is that, due to its generative nature, overlapping features are difficult to use in HMMs. For this reason, HMMs have been standardly used with current word-current label, and previous label(s)-current label features. However, if we incorporate information about the neighboring words and/or information about more detailed characteristics of the current word directly to our model, rather than propagating it through the previous labels, we may hope to learn a better classifier.</Paragraph>
      <Paragraph position="3"> Many different models, such as Maximum Entropy Markov Models (MEMMs) (McCallum et al., 2000), Projection based Markov Models (PMMs) (Punyakanok and Roth, 2000) and Conditional Random Fields (CRFs) (Lafferty et al., 2001), have been proposed to overcome these problems. The common property of these models is their discriminative approach. They model the probability distribution of the label sequences given the observation sequences:</Paragraph>
      <Paragraph position="5"> The best performing models of label sequence learning are MEMMs or PMMs (also known as Maximum Entropy models) whose features are carefully designed for the specific tasks (Ratnaparkhi, 1999; Toutanova and Manning, 2000). However, maximum entropy models suffer from the so called label bias problem, the problem of making local decisions (Lafferty et al., 2001). Lafferty et al. (2001) show that CRFs overcome the label-bias problem and outperform MEMMs in POS tagging.</Paragraph>
      <Paragraph position="6"> CRFs define a probability distribution over the whole sequence a27 , globally conditioning over the whole observation sequence a7 (Figure 1b). Because they condition on the observation (as opposed to generating it), they can use overlapping features.</Paragraph>
      <Paragraph position="7"> The features a44 a10a7 a15 a27 a15a17a45a17a25 used in this paper are of the form:  1. Current label and information about the observation sequence, such as the identity or spelling features of a word that is within a window  of the word currently labelled. Each of these features corresponds to a choice of a28 a29 and a11a47a46 where a48 a31a50a49a16a45a52a51a54a53a55a15a22a21a23a21a23a21a23a15a17a45a37a15a22a21a23a21a23a21a23a15a17a45a14a56a57a53a59a58 and a53 is the half window size 2. Current label and the neighbors of that label, i.e. features that capture the inter-label dependencies. Each of these features corresponds to a choice of a28 a29 and the neighbors of a28 a29 , e.g. in a bigram model, a44 a10a12a28 a29a1a0 a13a22a15a17a28 a29a25 . The conditional probability distribution defined by this model is :</Paragraph>
      <Paragraph position="9"> where a16 a14 's are the parameters to be estimated from the training corpus C and a19a21a3 a10a7 a25 is a normalization term to assure a proper probability distribution. In order to simplify the notation, we introduce a22 a14 a10a7 a15 a27 a25 a8 a12 a29 a44 a14 a10a7 a15 a27 a15a17a45a17a25 , which is the number of times feature a44 a14 is observed in a10a7 a15 a27 a25 pair and, a23 a3 a10a7 a15 a27 a25 a8 a12 a14 a16 a14 a22 a14 a10a7 a15 a27 a25 , which is the linear combination of all the features with</Paragraph>
      <Paragraph position="11"/>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Loss Functions for Label Sequences
</SectionTitle>
    <Paragraph position="0"> Given the theoretical advantages of discriminative models over generative models and the empirical support by (Klein and Manning, 2002), and that CRFs are the state-of-the-art among discriminative models for label sequences, we chose CRFs as our model, and trained by optimizing various objective functions a31 a3 a10a36 a25 with respect to the corpus a36 . The application of these models to the label sequence problems vary widely. The individual labels might constitute chunks (e.g. Named-Entity Recognition, shallow parsing), or they may be single entries (e.g. POS tagging). The difficulty, therefore the accuracy of the tasks are very different from each other. The evaluation of the systems differ from one task to another, and the nature of the statistical noise level is task and corpus dependent.</Paragraph>
    <Paragraph position="1"> Given this variety, using objective functions tailored for each task might result in better classifiers. We consider two dimensions in designing objective functions: exponential versus logarithmic loss functions, and sequential versus pointwise optimization functions.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Exponential vs Logarithmic Loss functions
</SectionTitle>
      <Paragraph position="0"> Most estimation procedures in NLP proceed by maximizing the likelihood of the training data. To  tions in a binary classification problem overcome the numerical problems of working with a product of a large number of small probabilities, usually the logarithm of the likelihood of the data is optimized. However, most of the time, these systems, sequence labelling systems in particular, are tested with respect to their error rate on test data, i.e. the fraction of times the function a23 a3 assigns a higher score to a label sequence a27 (such that a27a33a32a8 a27a24a34 ) than the correct label sequence a27a4a34 for every observation</Paragraph>
      <Paragraph position="2"> more natural objective to minimize.</Paragraph>
      <Paragraph position="4"> ranks higher than the correct label sequences for the training instances in the corpus a36 . Since optimizing the rank loss is NP-complete, one can optimize an upper bound instead, e.g. an exponential loss function: null</Paragraph>
      <Paragraph position="6"> The exponential loss function is well studied in the Machine Learning domain. The advantage of the exp-loss over the log-loss is its property of penalizing incorrect labellings very severely, whereas it penalizes almost nothing when the label sequence is correct. This is a very desirable property for a classifier. Figure 2 shows this property of exp-loss in contrast to log-loss in a binary classification problem. However this property also means that, exp-loss has the disadvantage of being sensitive to noisy data, since systems optimizing exp-loss spends more effort on the outliers and tend to be vulnerable to noisy data, especially label noise.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Sequential vs Pointwise Loss functions
</SectionTitle>
      <Paragraph position="0"> In many applications it is very difficult to get the whole label sequence correct since most of the time classifiers are not perfect and as the sequences get longer, the probability of predicting every label in the sequence correctly decreases exponentially. For this reason performance is usually measured pointwise, i.e. in terms of the number of individual labels that are correctly predicted. Most common optimization functions in the literature, however, treat the whole label sequence as one label, penalizing a label sequence that has one error and a label sequence that is all wrong in the same manner. We may be able to develop better classifiers by using a loss function more similar to the evaluation function. One possible way of accomplishing this may be minimizing pointwise loss functions. Sequential optimizations optimize the joint conditional probability distribution a2 a3 a10a27a43a42a7 a25 , whereas pointwise optimizations that we propose optimize the marginal conditional probability distribution, a2a20a3 a10a12a28 a29 a42a7 a34 a25 a8</Paragraph>
      <Paragraph position="2"/>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Four Loss functions
</SectionTitle>
      <Paragraph position="0"> We derive four loss functions by taking the cross product of the two dimensions discussed above: a5 Sequential Log-loss function: This function, based on the standard maximum likelihood optimization, is used with CRFs in (Lafferty et al., 2001).</Paragraph>
      <Paragraph position="2"> tion, was first introduced in (Collins, 2000) for NLP tasks with a structured output domain.</Paragraph>
      <Paragraph position="3"> However, there, the sum is not over the whole possible label sequence set, but over the a11 best label sequences generated by an external mechanism. Here we include all possible label sequences; so we do not require an external mechanism to identify the best a11 sequences..</Paragraph>
      <Paragraph position="4"> As shown in (Altun et al., 2002) it is possible to sum over all label sequences by using a dynamic algorithm.</Paragraph>
      <Paragraph position="6"> Note that the exponential loss function is just the inverse conditional probability plus a constant. null a5 Pointwise Log-loss function: This function optimizes the marginal probability of the labels at each position conditioning on the observation sequence:</Paragraph>
      <Paragraph position="8"> Obviously, this function reduces to the sequential log loss if the length of the sequence is a14 . a5 Pointwise Exp-loss function: Following the parallelism in log-loss vs exp-loss functions of sequential optimization (log vs inverse conditional probability), we propose minimizing the pointwise exp-loss function below, which reduces to the standard multi-class exponential loss when the length of the sequence is a14 .</Paragraph>
      <Paragraph position="10"/>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Comparison of the Four Loss Functions
</SectionTitle>
    <Paragraph position="0"> We now compare the performance of the four loss functions described above. Although (Lafferty et al., 2001) proposes a modification of the iterative scaling algorithm for parameter estimation in sequential log-loss function optimization, gradient-based methods have often found to be more efficient for minimizing the convex loss function in Eq. (1) (Minka, 2001). For this reason, we use a gradient based method to optimize the above loss functions.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Gradient Based Optimization
</SectionTitle>
      <Paragraph position="0"> The gradients of the four loss function can be computed as follows:</Paragraph>
      <Paragraph position="2"> where expectations are taken w.r.t. a2a20a3 a10a27a43a42a7 a25 .</Paragraph>
      <Paragraph position="3"> Thus at the optimum the empirical and expected values of the sufficient statistics are equal. The loss function and the derivatives can be calculated with one pass of the forward-backward algorithm.</Paragraph>
      <Paragraph position="5"> At the optimum the empirical values of the sufficient statistics equals their conditional expectations where the contribution of each instance is weighted by the inverse conditional probability of the instance. Thus this loss function focuses on the examples that have a lower conditional probability, which are usually the examples that the model labels incorrectly. The computational complexity is the same as the log-loss case.</Paragraph>
      <Paragraph position="7"> At the optimum the expected value of the sufficient statistics conditioned on the observation  a34 are equal to their expected value when also conditioned on the correct label sequence a28 a34a29 . The computations can be done using the dynamic programming described in (Kakade et al., 2002), with the computational complexity of the forward-backward algorithm scaled by a</Paragraph>
      <Paragraph position="9"> At the optimum the expected value of the sufficient statistics conditioned on a11 a34 are equal to the value when also conditioned on a28 a34a29 , where each point is weighted by a2 a3 a10a12a28 a34a29 a42a7</Paragraph>
      <Paragraph position="11"> putational complexity is the same as the log-loss case.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Experimental Setup
</SectionTitle>
      <Paragraph position="0"> Before presenting the experimental results of the comparison of the four loss functions described above, we describe our experimental setup. We ran experiments on Part-of-Speech (POS) tagging and Named-Entity-Recognition (NER) tasks.</Paragraph>
      <Paragraph position="1"> For POS tagging, we used the Penn TreeBank corpus. There are 47 individual labels in this corpus. Following the convention in POS tagging, we used a Tag Dictionary for frequent words. We used Sections 1-21 for training and Section 22 for testing. For NER, we used a Spanish corpus which was provided for the Special Session of CoNLL2002 on NER. There are training and test data sets and the training data consists of about 7200 sentences. The individual label set in the corpus consists of 9 labels: the beginning and continuation of Person, Organization, Location and Miscellaneous names and nonname tags.</Paragraph>
      <Paragraph position="2"> We used three different feature sets: a5a8a7 a14 is the set of bigram features, i.e. the current tag and the current word, the current tag and previous tags.</Paragraph>
      <Paragraph position="3"> a5a8a7a10a9 consists of a7 a14 features and spelling features of the current word (e.g. &amp;quot;Is the current word capitalized and the current tag is Person-Beginning?&amp;quot;). Some of the spelling features, which are mostly adapted from (Bikel et al., 1999) are the last one, two and three letters of the word; whether the first letter is lower case, upper case or alphanumeric; whether the word is capitalized and contains a dot; whether all the letters are capitalized; whether the word contains a hyphen.</Paragraph>
      <Paragraph position="4"> a5a8a7a10a11 includes a7a12a9 features not only for the current word but also for the words within a fixed window of size a53 . a7a12a9 is an instance of a7a12a11 where  Bank.</Paragraph>
      <Paragraph position="5"> For NER, we used a window of size 3 (i.e. considered features for the previous and next words). Since the Penn TreeBank is very large, including a7a12a11 features, i.e. incorporating the information in the neighboring words directly to the model, is intractable. Therefore, we limited our experiments to a7 a14 and a7a12a9 features for POS tagging.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Experimental Results
</SectionTitle>
      <Paragraph position="0"> As a gradient based optimization method, we used an off-the-shelf optimization tool that uses the limited-memory updating method. We observed that this method is faster to converge than the conjugate gradient descent method. It is well known that optimizing log-loss functions may result in overfitting, especially with noisy data. For this reason, we used a regularization term in our cost functions. We experimented with different regularization terms. As expected, we observed that the regularization term increases the accuracy, especially when the training data is small; but we did not observe much difference when we used different regularization terms. The results we report are with the Gaussian prior regularization term described in (Johnson et al., 1999).</Paragraph>
      <Paragraph position="1"> Our goal in this paper is not to build the best tagger or recognizer, but to compare different loss functions and optimization methods. Since we did not spend much effort on designing the most useful features, our results are slightly worse than, but comparable to the best performing models.</Paragraph>
      <Paragraph position="2"> We extracted corpora of different sizes (ranging from 300 sentences to the complete corpus) and ran experiments optimizing the four loss functions using different feature sets. In Table 1 and Table 2, we report the accuracy of predicting every individual label. It can be seen that the test accuracy obtained by different loss functions lie within a relatively small range and the best performance depends on what kind of features are included in the model.</Paragraph>
      <Paragraph position="3">  corpus. The window size is 3 for a7a12a11 .</Paragraph>
      <Paragraph position="4"> We observed similar behavior when the training set is smaller. The accuracy is highest when more features are included to the model. From these results we conclude that when the model is the same, optimizing different loss functions does not have much effect on the accuracy, but increasing the variety of the features included in the model has more impact.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Optimization methods
</SectionTitle>
    <Paragraph position="0"> In Section 4, we showed that optimizing different loss function does not have a large impact on the accuracy. In this section, we investigate different methods of optimization. The conjugate based method used in Section 4 is an exact method. If the training corpus is large, the training may take a long time, especially when the number of features are very large. In this method, the optimization is done in a parallel fashion by updating all of the parameters at the same time. Therefore, the resulting classifier uses all the features that are included in the model and lacks sparseness.</Paragraph>
    <Paragraph position="1"> We now consider two approximation methods to optimize two of the loss functions described above.</Paragraph>
    <Paragraph position="2"> We first present a perceptron algorithm for labelling sequences. This algorithm performs parallel optimization and is an approximation of the sequential log-loss optimization. Then, we present a boosting algorithm for label sequence learning. This algorithm performs sequential optimization by updating one parameter at a time. It optimizes the sequential exp-loss function. We compare these methods with the exact method using the experimental setup presented in Section 4.2.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Perceptron Algorithm for Label Sequences
</SectionTitle>
      <Paragraph position="0"> Calculating the gradients, i.e. the expectations of features for every instance in the training corpus can be computationally expensive if the corpus is very large. In many cases, a single training instance might be as informative as all of the corpus to update the parameters. Then, an online algorithm which makes updates by using one training example may converge much faster than a batch algorithm. If the distribution is peaked, one label is more likely than others and the contribution of this label dominates the expectation values. If we assume this is the case, i.e. we make a Viterbi assumption, we can calculate a good approximation of the gradients by considering only the most likely, i.e. the best label sequence according to the current model. The following on-line perceptron algorithm (Algorithm 1), presented in (Collins, 2002), uses these two approximations: Algorithm 1 Label sequence Perceptron algorithm .</Paragraph>
      <Paragraph position="1">  At each iteration, the perceptron algorithm calculates an approximation of the gradient of the sequential log-loss function (Eq. 3) based on the current training instance. The batch version of this algorithm is a closer approximation of the optimization of sequential log-loss, since the only approximation is the Viterbi assumption. The stopping criteria may be convergence, or a fixed number of iterations over the training data.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Boosting Algorithm for Label Sequences
</SectionTitle>
      <Paragraph position="0"> The original boosting algorithm (AdaBoost), presented in (Schapire and Singer, 1999), is a sequential learning algorithm to induce classifiers for single random variables. (Altun et al., 2002) presents a boosting algorithm for learning classifiers to predict label sequences. This algorithm minimizes an upper bound on the sequential exp-loss function (Eq. 2).</Paragraph>
      <Paragraph position="1"> As in AdaBoost, a distribution over observations is  This distribution which expresses the importance of every training instance is updated at each round, and the algorithm focuses on the more difficult examples. The sequence Boosting algorithm (Algorithm 2) optimizes an upper bound on the sequential exp-loss function by using the convexity of the exponential function. a24a26a25a28a27a30a29a14 is the maximum difference of the sufficient statistic a22 a14 in any label sequence and the correct label sequence of any observation a7 a34 . a24a31a25a33a32a35a34a14 has a similar meaning. a24a37a36  As it can be seen from Line 4 in Algorithm 2, the feature that was added to the ensemble at each round is determined by a function of the gradient of the sequential exp-loss function (Eq. 4). At each round, one pass of the forward backward algorithm over the training data is sufficient to calculate a41 a14 's for all a1 . Considering the sparseness of the features in each training instance, one can restrict the forward backward pass only to the training instances that contain the feature that is added to the ensemble in the last round. The stopping criteria may be a fixed number of rounds, or by cross-validation on a heldout corpus. null</Paragraph>
    </Section>
  </Section>
</Paper>