<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1019">
  <Title>Investigating Loss Functions and Optimization Methods for Discriminative Learning of Label Sequences</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Until recent years, generative models were the most common approach for many NLP tasks. Recently, there is a growing interest on discriminative models in the NLP community, and these models were shown to be successful for different tasks(Lafferty et al., 2001; Ratnaparkhi, 1999; Collins, 2000). Discriminative models do not only have theoretical advantages over generative models, as we discuss in Section 2, but they are also shown to be empirically favorable over generative models when features and objective functions are fixed (Klein and Manning, 2002).</Paragraph>
    <Paragraph position="1"> In this paper, we use discriminative models to investigate the optimization of different objective functions by a variety of optimization methods. We focus on label sequence learning tasks. Part-of-Speech (POS) tagging and Named Entity Recognition (NER) are the most studied applications among these tasks. However, there are many others, such as chunking, pitch accent prediction and speech edit detection. These tasks differ in many aspects, such as the nature of the label sequences (chunks or individual labels), their difficulty and evaluation methods. Given this variety, we think it is worthwhile to investigate how optimizing different objective functions affects performance. In this paper, we varied the scale (exponential vs logarithmic) and the manner of the optimization (sequential vs pointwise) and using different combinations, we designed 4 different objective functions. We optimized these functions on NER and POS tagging tasks. Despite our intuitions, our experiments show that optimizing objective functions that vary in scale and manner do not affect accuracy much. Instead, the selection of the features has a larger impact.</Paragraph>
    <Paragraph position="2"> The choice of the optimization method is important for many learning problems. We would like to use optimization methods that can handle a large number of features, converge fast and return sparse classifiers. The importance of the features, and therefore the importance of the ability to cope with a larger number of features is well-known. Since training discriminative models over large corpora can be expensive, an optimization method that converges fast might be advantageous over others. A sparse classifier has a shorter test time than a denser classifier. For applications in which the test time is crucial, optimization methods that result in sparser classifiers might be preferable over other methods  CRFs. Shaded areas indicate variables that the model conditions on.</Paragraph>
    <Paragraph position="3"> even if their training time is longer. In this paper we investigate these aspects for different optimization methods, i.e. the number of features, training time and sparseness, as well as the accuracy. In some cases, an approximate optimization that is more efficient in one of these aspects might be preferable to the exact method, if they have similar accuracy. We experiment with exact versus approximate as well as parallel versus sequential optimization methods.</Paragraph>
    <Paragraph position="4"> For the exact methods, we use an off-the-shelf gradient based optimization routine. For the approximate methods, we use a perceptron and a boosting algorithm for sequence labelling which update the feature weights parallel and sequentially respectively.</Paragraph>
  </Section>
class="xml-element"></Paper>