<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1081"> <Title>Improving Data Driven Wordclass Tagging by System Combination</Title> <Section position="1" start_page="0" end_page="491" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> In this paper we examine how the differences in modelling between different data driven systems performing the same NLP task can be exploited to yield a higher accuracy than the best individual system. We do this by means of an experiment involving the task of morpho-syntactic wordclass tagging. Four well-known tagger generators (Hidden Markov Model, Memory-Based, Transformation Rules and Maximum Entropy) are trained on the same corpus data. After comparison, their outputs are combined using several voting strategies and second stage classifiers. All combination taggers outperform their best component, with the best combination showing a 19.1% lower error rate than the best individual tagger.</Paragraph> <Paragraph position="1"> Introduction In all Natural Language Processing (NLP) systems, we find one or more language models which are used to predict, classify and/or interpret language related observations. Traditionally, these models were categorized as either rule-based/symbolic or corpusbased/probabilistic. Recent work (e.g. Brill 1992) has demonstrated clearly that this categorization is in fact a mix-up of two distinct Categorization systems: on the one hand there is the representation used for the language model (rules, Markov model, neural net, case base, etc.) and on the other hand the manner in which the model is constructed (hand crafted vs. data driven).</Paragraph> <Paragraph position="2"> Data driven methods appear to be the more popular. This can be explained by the fact that, in general, hand crafting an explicit model is rather difficult, especially since what is being modelled, natural language, is not (yet) wellunderstood. When a data driven method is used, a model is automatically learned from the implicit structure of an annotated training corpus. This is much easier and can quickly lead to a model which produces results with a 'reasonably' good quality.</Paragraph> <Paragraph position="3"> Obviously, 'reasonably good quality' is not the ultimate goal. Unfortunately, the quality that can be reached for a given task is limited, and not merely by the potential of the learning method used. Other limiting factors are the power of the hard- and software used to implement the learning method and the availability of training material. Because of these limitations, we find that for most tasks we are (at any point in time) faced with a ceiling to the quality that can be reached with any (then) available machine learning system. However, the fact that any given system cannot go beyond this ceiling does not mean that machine learning as a whole is similarly limited. A potential loophole is that each type of learning method brings its own 'inductive bias' to the task and therefore different methods will tend to produce different errors.</Paragraph> <Paragraph position="4"> In this paper, we are concerned with the question whether these differences between models can indeed be exploited to yield a data driven model with superior performance.</Paragraph> <Paragraph position="5"> In the machine learning literature this approach is known as ensemble, stacked, or combined classifiers. 
<Paragraph position="6"> We will execute our investigation by means of an experiment. The NLP task used in the experiment is morpho-syntactic wordclass tagging. The reasons for this choice are several. First of all, tagging is a widely researched and well-understood task (cf. van Halteren (ed.) 1998). Second, current performance levels on this task still leave room for improvement: 'state of the art' performance for data-driven automatic wordclass taggers (tagging English text with single tags from a low-detail tagset) is 96-97% correctly tagged words. Finally, a number of rather different methods are available that generate a fully functional tagging system from annotated text.</Paragraph>
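As a concrete illustration of the simplest combination strategy discussed above, the sketch below combines the output tag sequences of several taggers by majority vote, breaking ties in favour of the individually most accurate tagger. This is a minimal Python sketch under assumed conventions: the function name, the tagger labels, and the tie-breaking rule are inventions of this example, not the authors' implementation, and the paper itself evaluates several more refined voting schemes as well as second-stage classifiers.

```python
from collections import Counter

def majority_vote(tag_sequences, precedence):
    """Combine per-token tags from several taggers by simple majority vote.

    tag_sequences: dict mapping tagger name -> list of tags, one per token.
    precedence: tagger names ordered by individual accuracy, used for ties.
    """
    n_tokens = len(next(iter(tag_sequences.values())))
    combined = []
    for i in range(n_tokens):
        # Count how many taggers propose each tag for token i.
        votes = Counter(seq[i] for seq in tag_sequences.values())
        best = max(votes.values())
        tied = {tag for tag, count in votes.items() if count == best}
        # Break ties in favour of the most accurate individual tagger.
        for tagger in precedence:
            if tag_sequences[tagger][i] in tied:
                combined.append(tag_sequences[tagger][i])
                break
    return combined

# Hypothetical outputs for a three-token sentence; the second token is a 2-2 tie.
outputs = {
    "hmm":    ["DT", "NN", "VBZ"],
    "memory": ["DT", "NN", "VBZ"],
    "rules":  ["DT", "JJ", "VBZ"],
    "maxent": ["DT", "JJ", "VBZ"],
}
print(majority_vote(outputs, ["maxent", "hmm", "memory", "rules"]))
# -> ['DT', 'JJ', 'VBZ']
```

</Section>
</Paper>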