File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/n03-1028_intro.xml
Size: 5,393 bytes
Last Modified: 2025-10-06 14:01:41
<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1028"> <Title>Shallow Parsing with Conditional Random Fields</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Sequence analysis tasks in language and biology are often described as mappings from input sequences to sequences of labels encoding the analysis. In language processing, examples of such tasks include part-of-speech tagging, named-entity recognition, and the task we shall focus on here, shallow parsing. Shallow parsing identi es the non-recursive cores of various phrase types in text, possibly as a precursor to full parsing or information extraction (Abney, 1991). The paradigmatic shallow-parsing problem is NP chunking, which nds the non-recursive cores of noun phrases called base NPs. The pioneering work of Ramshaw and Marcus (1995) introduced NP chunking as a machine-learning problem, with standard datasets and evaluation metrics. The task was extended to additional phrase types for the CoNLL-2000 shared task (Tjong Kim Sang and Buchholz, 2000), which is now the standard evaluation task for shallow parsing.</Paragraph> <Paragraph position="1"> Most previous work used two main machine-learning approaches to sequence labeling. The rst approach relies on k-order generative probabilistic models of paired input sequences and label sequences, for instance hidden Markov models (HMMs) (Freitag and McCallum, 2000; Kupiec, 1992) or multilevel Markov models (Bikel et al., 1999). The second approach views the sequence labeling problem as a sequence of classi cation problems, one for each of the labels in the sequence. The classi cation result at each position may depend on the whole input and on the previous k classi cations. 1 The generative approach provides well-understood training and decoding algorithms for HMMs and more general graphical models. However, effective generative models require stringent conditional independence assumptions. For instance, it is not practical to make the label at a given position depend on a window on the input sequence as well as the surrounding labels, since the inference problem for the corresponding graphical model would be intractable. Non-independent features of the inputs, such as capitalization, suf xes, and surrounding words, are important in dealing with words unseen in training, but they are dif cult to represent in generative models.</Paragraph> <Paragraph position="2"> The sequential classi cation approach can handle many correlated features, as demonstrated in work on maximum-entropy (McCallum et al., 2000; Ratnaparkhi, 1996) and a variety of other linear classi ers, including winnow (Punyakanok and Roth, 2001), AdaBoost (Abney et al., 1999), and support-vector machines (Kudo and Matsumoto, 2001). Furthermore, they are trained to minimize some function related to labeling error, leading to smaller error in practice if enough training data are available. In contrast, generative models are trained to maximize the joint probability of the training data, which is 1Ramshaw and Marcus (1995) used transformation-based learning (Brill, 1995), which for the present purposes can be tought of as a classi cation-based method.</Paragraph> <Paragraph position="3"> not as closely tied to the accuracy metrics of interest if the actual data was not generated by the model, as is always the case in practice.</Paragraph> <Paragraph position="4"> However, since sequential classi ers are trained to make the best local decision, unlike generative models they cannot trade off decisions at different positions against each other. In other words, sequential classi ers are myopic about the impact of their current decision on later decisions (Bottou, 1991; Lafferty et al., 2001).</Paragraph> <Paragraph position="5"> This forced the best sequential classi er systems to resort to heuristic combinations of forward-moving and backward-moving sequential classi ers (Kudo and Matsumoto, 2001).</Paragraph> <Paragraph position="6"> Conditional random elds (CRFs) bring together the best of generative and classi cation models. Like classication models, they can accommodate many statistically correlated features of the inputs, and they are trained discriminatively. But like generative models, they can trade off decisions at different sequence positions to obtain a globally optimal labeling. Lafferty et al. (2001) showed that CRFs beat related classi cation models as well as HMMs on synthetic data and on a part-of-speech tagging task.</Paragraph> <Paragraph position="7"> In the present work, we show that CRFs beat all reported single-model NP chunking results on the standard evaluation dataset, and are statistically indistinguishable from the previous best performer, a voting arrangement of 24 forward- and backward-looking support-vector classi ers (Kudo and Matsumoto, 2001). To obtain these results, we had to abandon the original iterative scaling CRF training algorithm for convex optimization algorithms with better convergence properties. We provide detailed comparisons between training methods.</Paragraph> <Paragraph position="8"> The generalized perceptron proposed by Collins (2002) is closely related to CRFs, but the best CRF training methods seem to have a slight edge over the generalized perceptron.</Paragraph> </Section> class="xml-element"></Paper>