<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0433"> <Title>Kowloon</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Classification Methods </SectionTitle> <Paragraph position="0"> To carry out the stacking and voting experiments, we constructed a number of relatively strong individual component models of the following kinds.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Boosting </SectionTitle> <Paragraph position="0"> The main idea behind boosting algorithms is that a set of many weak classifiers can be effectively combined to yield a single strong classifier. The weak classifiers are trained sequentially, with each one focusing more heavily on the instances that the previous classifiers found difficult to classify.</Paragraph> <Paragraph position="1"> For the boosting framework, our system uses AdaBoost.MH (Freund and Schapire, 1997), an n-ary classification variant of the original binary AdaBoost algorithm. It performs well on a number of natural language processing problems, including text categorization (Schapire and Singer, 2000) and word sense disambiguation (Escudero et al., 2000). In particular, boosting has been shown to yield language-independent NER models that perform exceptionally well (Wu et al., 2002; Carreras et al., 2002).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Support Vector Machines </SectionTitle> <Paragraph position="0"> Support Vector Machines (SVMs) have gained a considerable following in recent years (Boser et al., 1992), particularly in dealing with high-dimensional spaces such as those commonly found in natural language problems like text categorization (Joachims, 1998). 
SVMs have shown promise when applied to chunking (Kudo and Matsumoto, 2001) and named entity recognition (Sassano and Utsuro, 2000; McNamee and Mayfield, 2002), though performance is quite sensitive to parameter choices.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Transformation-based Learning </SectionTitle> <Paragraph position="0"> Transformation-based learning (TBL) (Brill, 1995) is a rule-based machine learning algorithm that was first introduced by Brill for part-of-speech tagging.</Paragraph> <Paragraph position="1"> The central idea of transformation-based learning is to learn an ordered list of rules that progressively improve upon the current state of the training set. An initial assignment is made based on simple statistics, and then rules are greedily learned to correct the mistakes, until no net improvement can be made.</Paragraph> <Paragraph position="2"> The experiments presented in this paper were performed using the fnTBL toolkit (Ngai and Florian, 2001), which implements several optimizations in rule learning that drastically reduce the time needed for training.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Data Resources </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Preprocessing the Data </SectionTitle> <Paragraph position="0"> The data provided by the CoNLL organizers was sentence-delimited, tokenized, and hand-annotated with named entity chunks. The English data was automatically labeled with part-of-speech and chunk tags from the memory-based tagger and chunker (Daelemans et al., 1996), and the German data was labeled with the decision-tree-based TreeTagger (Schmid, 1994). We replaced the English part-of-speech tags with those generated by a transformation-based learner (Ngai and Florian, 2001). 
The chunk tags did not appear to help in either case and were discarded.</Paragraph> <Paragraph position="1"> As we did not want to rely too heavily on characteristics specific to the Indo-European language family, we did not perform detailed morphological analysis; instead, we approximated it by simply extracting the prefixes and suffixes of up to 4 characters from all the words.</Paragraph> <Paragraph position="2"> In order to let the system generalize over word types, we normalized the case information of all the words in the corpus by converting them to uniform lower case. To recapture the lost information, each word was annotated with a tag that specified whether it was in all lower case, all upper case, or mixed case.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Gazetteers </SectionTitle> <Paragraph position="0"> Apart from the training and test data, the CoNLL organizers also provided two lists of named entities, one in English and one in Dutch. Part of the challenge for this year's shared task was to find ways of using this resource in the system.</Paragraph> <Paragraph position="1"> To supplement the provided gazetteers, a large collection of names and words was downloaded from various web sources. This collection was used to compile a gazetteer of 120k uncategorized English proper names and a lexicon of 500k common English words. As there were no supplied gazetteers for German, we also compiled a gazetteer of 8000 German names, which were mostly personal first and last names and geographical locations, and a lexicon of 32k common German words.</Paragraph> <Paragraph position="2"> Named entities in the corpus that appeared in the gazetteers were identified lexically or using a maximum forward match algorithm similar to that used in Chinese word segmentation. 
Once named entities have been identified in this preprocessing step, each word can then be annotated with an NE chunk tag corresponding to the output of the gazetteer matching. The learner can view the NE chunk tag as an additional feature.</Paragraph> <Paragraph position="3"> The variations in this approach come from resolving conflicts between different possible type information for the same NE. The different ways that we dealt with the problem were: (1) Rank all the NE types by frequency in the training corpus. In the case of a conflict, default to the more common NE type. (2) Give all the possible NE types to the boosting learner as a set of possible NE chunk tags.</Paragraph> <Paragraph position="4"> (3) Discard the NE type information and annotate each word with a tag indicating whether it is inside an NE.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Classifier Combination </SectionTitle> <Paragraph position="0"> It is well known that several available classifiers can be combined in various ways to create a system that outperforms the best individual classifier.</Paragraph> <Paragraph position="1"> Since we had several classifiers available to us, it was reasonable to investigate combining them in different ways.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Stacking </SectionTitle> <Paragraph position="0"> Like voting, stacking is a learning paradigm that constructs a combined model from several classifiers. The basic concept behind stacking is to train two or more classifiers sequentially, with each successive classifier incorporating the results of the previous ones in some fashion.</Paragraph> <Paragraph position="1"> As mentioned above, at the most basic level, lexicon and gazetteer information was integrated into our classifiers by including it as additional features. 
However, we also experimented with several different ways of incorporating this information via stacking. One possible approach was to view the gazetteers as a separate system that would produce an output, and then to use stacking to combine the outputs.</Paragraph> <Paragraph position="2"> One of the most straightforward approaches to stacking can be applied to tasks that are naturally divisible into hierarchically ordered subtasks. An example approach, which was taken by several of the participating teams in the CoNLL-2002 shared task, is to split the NER task into an identification phase, where named entities are identified in the text, and a classification phase, where the identified named entities are categorized into the various subtypes. Provided that the performance of the individual classifiers is fairly high (otherwise errors made in the earlier stages could propagate down the chain), this has the advantage of reducing the complexity of the task for each individual classifier.</Paragraph> <Paragraph position="3"> To construct such a system, we trained a stacked AdaBoost.MH classifier to perform NE reclassification on boundaries identified in the base model. The output of the initial models is postprocessed to remove all NE type information and then passed to this stacked classifier. As Table 1 shows, stacking the boosting models yields a significant gain in performance.</Paragraph> <Paragraph position="4"> Another approach to stacking that we investigated in this work involves a closer interaction between the models. The general idea is for a given model to use the output of another trained model as its initial state, and to improve upon it. 
The idea is that the second model, with a different learning and representation bias, will be able to move out of the local maximum that the previous model has settled into.</Paragraph> <Paragraph position="5"> To accomplish this we introduced Stacked TBL (STBL), a variant of TBL tuned for this purpose (Wu et al., 2003). We found TBL to be an appropriate point of departure since it starts from an initial state of classification and learns rules to iteratively correct the current labeling. We aimed to use STBL to improve the base model from the preceding section.</Paragraph> <Paragraph position="6"> STBL proved quite effective; in fact it yielded the best base model performance among all our models. Table 2 shows the result of stacking STBL on the boosting base model.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Voting </SectionTitle> <Paragraph position="0"> The simplest approach to combining classifiers is through voting, which examines the outputs of the various models and selects the classifications whose weight exceeds some threshold, where the weight depends on the models that proposed the classification. The variations in this approach stem from the method by which weights are attached to the models. It is possible to assign varying weights to the models, in effect giving one model more &quot;say&quot; than the others. In our system, however, we simply assigned each model equal weight, and selected classifications which were proposed by a majority of models.</Paragraph> <Paragraph position="1"> Voting was thus used to further improve the base model. 
Four models chosen for heterogeneity participated in the voting: two variants of the AdaBoost.MH model, the SVM model, and the Stacked TBL model.</Paragraph> <Paragraph position="2"> As before, the stacked AdaBoost.MH reclassifier was applied to the voted result, yielding a final stacked voted stacked model.</Paragraph> <Paragraph position="3"> This model gave the best overall results on the task as a whole. Table 3 shows the results of our system.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Overall Results </SectionTitle> <Paragraph position="0"> Complete results on the development and test sets, for both English and German, are shown in Table 4.</Paragraph> </Section> </Paper>