<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1036"> <Title>Combining Outputs of Multiple Japanese Named Entity Chunkers by Stacking</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. Combine the outputs of the several systems: </SectionTitle> <Paragraph position="0"> previously studied techniques include: i) voting techniques (van Halteren et al., 1998; Tjong Kim Sang, 2000; Henderson and Brill, 1999; Henderson and Brill, 2000), ii) switching among several systems according to confidence values they provide (Henderson and Brill, 1999), and iii) stacking techniques (Wolpert, 1992), which train a second-stage classifier for combining the outputs of the first-stage classifiers (van Halteren et al., 1998; Brill and Wu, 1998; Tjong Kim Sang, 2000).</Paragraph> <Paragraph position="1"> In this paper, we propose a method for combining the outputs of (Japanese) named entity chunkers which belongs to the family of stacking techniques. In sub-process 1, we focus on models which differ in the lengths of the preceding/subsequent contexts incorporated in the models. As the base model for supervised learning of Japanese named entity chunking, we employ a maximum entropy model (Uchimoto et al., 2000), which performed best among the machine-learning-based systems at the IREX (Information Retrieval and Extraction Exercise) Workshop (IREX Committee, 1999). Uchimoto et al. (2000) reported that the optimal length of preceding/subsequent context to incorporate in the model is two morphemes to both the left and the right of the current position. In this paper, we train several maximum entropy models which differ in the lengths of these contexts, and then combine their outputs.</Paragraph> <Paragraph position="2"> As sub-process 2, we propose to apply a stacking technique which learns a classifier for combining the outputs of several named entity chunkers.</Paragraph> <Paragraph position="3"> This second-stage classifier learns rules for accepting/rejecting the outputs of the individual named entity chunkers. The proposed method can be applied even when the number of constituent systems is quite small (e.g., two). In the experimental evaluation, we show that combining the best-performing model of Uchimoto et al. (2000) with a model that performs poorly on its own, but extracts named entities quite different from those of the best-performing model, helps improve the performance of the best model.</Paragraph>
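<Paragraph position="4"> To make the stacking scheme concrete, the following is a minimal sketch (ours, not taken from the systems described in this paper) of a second-stage combiner that accepts or rejects each candidate named entity proposed by the first-stage chunkers; the feature names, data layout, and the simple count-based scorer are illustrative assumptions rather than the method evaluated later.

# Minimal sketch of a stacking combiner: for each candidate named entity
# proposed by the first-stage chunkers, a second-stage classifier decides
# whether to accept ("+") or reject ("-") it.  All names are illustrative.
from collections import Counter

def candidate_features(candidate, proposing_systems):
    """Features of one candidate: which systems proposed it, its NE type, its length."""
    feats = ["sys=" + ",".join(sorted(proposing_systems))]
    feats.append("netag=" + candidate["netag"])
    feats.append("mlength=" + str(candidate["mlength"]))
    return feats

def train_combiner(training_candidates):
    """Count how often each feature co-occurs with the accept/reject labels."""
    counts = {}
    for candidate, systems, label in training_candidates:   # label is "+" or "-"
        for f in candidate_features(candidate, systems):
            counts.setdefault(f, Counter())[label] += 1
    return counts

def accept(counts, candidate, systems):
    """Accept the candidate if, on balance, its features favour the "+" label."""
    score = 0
    for f in candidate_features(candidate, systems):
        c = counts.get(f, Counter())
        score += c["+"] - c["-"]
    return score > 0
</Paragraph>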
</Section> <Section position="4" start_page="0" end_page="3" type="metho"> <SectionTitle> 2 Named Entity Chunking based on Maximum Entropy Models </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Task of the IREX Workshop </SectionTitle> <Paragraph position="0"> The task of named entity recognition in the IREX workshop is to recognize the eight named entity types in Table 1 (IREX Committee, 1999). The organizer of the IREX workshop provided 1,174 newspaper articles, which include 18,677 named entities, as the training data. In the formal run (general domain) of the workshop, the participating systems were requested to recognize 1,510 named entities included in the held-out 71 newspaper articles.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Named Entity Chunking </SectionTitle> <Paragraph position="0"> We first provide our definition of the task of Japanese named entity chunking (Sekine et al., 1998; Borthwick et al., 1998; Uchimoto et al., 2000). Suppose that a sequence of morphemes is given as below: M_1 ... M_{i-1} M_i M_{i+1} ... M_n. Given that the current position is at the morpheme M_i, the task of named entity chunking is to assign a chunking state (to be described in section 2.3.1) to the morpheme M_i at the current position, considering the patterns of the surrounding morphemes. Note that in the supervised learning phase, we can use the chunking information on which morphemes constitute a named entity, and which morphemes are in the left/right contexts of the named entity.</Paragraph> </Section> <Section position="3" start_page="0" end_page="3" type="sub_section"> <SectionTitle> 2.3 The Maximum Entropy Model </SectionTitle> <Paragraph position="0"> In the maximum entropy model (Della Pietra et al., 1997), the conditional probability of the output y given the context x is estimated as p(y|x) = (1/Z(x)) exp(Σ_i λ_i f_i(x,y)), where Z(x) = Σ_y exp(Σ_i λ_i f_i(x,y)). Binary-valued indicator functions called feature functions f_i(x,y) are introduced for expressing a set of &quot;features&quot;, or &quot;attributes&quot;, of the context x and the output y, and a parameter λ_i is estimated for each feature function f_i. Uchimoto et al. (2000) define the context x as the patterns of the surrounding morphemes as well as the morpheme at the current position, and the output y as the named entity chunking state to be assigned to the morpheme at the current position.</Paragraph> <Paragraph position="1"> Uchimoto et al. (2000) classify the named entity chunking states into the following 40 tags: * Each of the eight named entity types plus an &quot;OPTIONAL&quot; type is divided into four chunking states, namely the beginning/middle/end of a named entity, or a named entity consisting of a single morpheme. This amounts to 9 x 4 = 36 classes.</Paragraph> <Paragraph position="2"> * Three more classes are distinguished for morphemes immediately preceding/following a named entity, as well as for a morpheme between two named entities.</Paragraph> <Paragraph position="3"> * Other morphemes are assigned the class &quot;OTHER&quot;.</Paragraph> <Paragraph position="4"> Following Uchimoto et al. (2000), feature functions are defined for the morpheme at the current position as well as for the surrounding contexts. More specifically, the following three types of feature functions are used: 1. 2052 lexical items that are observed five times or more within two morphemes of named entities in the training corpus.</Paragraph> <Paragraph position="5"> 2. parts-of-speech tags of morphemes.</Paragraph> <Paragraph position="6"> 3. character types of morphemes (i.e., Japanese (hiragana or katakana), Chinese (kanji), numbers, English alphabets, symbols, and their combinations).</Paragraph> <Paragraph position="7"> Minor modifications from those of Uchimoto et al. (2000) are: i) we used character types of morphemes because they are known to be useful in Japanese named entity chunking, and ii) the sets of parts-of-speech tags are different. As a Japanese morphological analyzer, we used BREAKFAST (Sassano et al., 1997) with a set of about 300 part-of-speech tags. BREAKFAST achieves 99.6% part-of-speech accuracy on newspaper articles.</Paragraph>
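<Paragraph position="8"> As an illustration of these three feature types, the following sketch (ours, not the feature extractor actually used with BREAKFAST) derives lexical-item, part-of-speech, and character-type features for a morpheme at a relative position in the context window; the dictionary-based morpheme representation and the character-type rules are simplified assumptions.

# Illustrative sketch of the three feature types (lexical item, part-of-speech,
# character type) extracted for a morpheme in the context window.
def char_type(surface):
    """Rough character-type classification of a morpheme's surface string."""
    types = set()
    for ch in surface:
        code = ord(ch)
        if code in range(0x3040, 0x30A0):
            types.add("hiragana")
        elif code in range(0x30A0, 0x3100):
            types.add("katakana")
        elif code in range(0x4E00, 0xA000):
            types.add("kanji")
        elif ch.isdigit():
            types.add("number")
        elif ch.isalpha():
            types.add("alphabet")
        else:
            types.add("symbol")
    return "+".join(sorted(types))

def morpheme_features(morpheme, frequent_lexical_items, position):
    """Features of the morpheme at a relative position (-2 .. +2) from the current one."""
    feats = []
    if morpheme["surface"] in frequent_lexical_items:   # lexical items seen 5+ times
        feats.append("lex[%d]=%s" % (position, morpheme["surface"]))
    feats.append("pos[%d]=%s" % (position, morpheme["pos"]))
    feats.append("ctype[%d]=%s" % (position, char_type(morpheme["surface"])))
    return feats
</Paragraph>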
<Paragraph position="9"> As for the number of preceding/subsequent morphemes used as contextual clues, we consider the following models.</Paragraph> <Paragraph position="10"> 5-gram model This model considers the preceding two morphemes M_{i-2}, M_{i-1} and the subsequent two morphemes M_{i+1}, M_{i+2}, as well as the morpheme M_i at the current position, as the contextual clue. Both in (Uchimoto et al., 2000) and in this paper, this is the model which performs best among all the individual models without system combination. We also consider the following three modifications to those models: * with all features * with lexical items and parts-of-speech tags (without the character types) of</Paragraph> <Paragraph position="11"> In our experiments, the number of features is 13,200 for the 5-gram model and 15,071 for the 9-gram model. The number of feature functions is 31,344 for the 5-gram model and 35,311 for the 9-gram model.</Paragraph> <Paragraph position="12"> Training a variable length (5-9-gram) model, testing with the 9-gram model The major disadvantage of the 5/7/9-gram models is that in the training phase they do not take into account whether or not the preceding/subsequent morphemes constitute one named entity together with the morpheme at the current position. Considering this disadvantage, we examine another model, namely a variable length model, which incorporates variable length contextual information. In the training phase, this model considers which of the preceding/subsequent morphemes constitute one named entity together with the morpheme at the current position (Sassano and Utsuro, 2000). It also considers several morphemes in the left/right contexts of the named entity. Here we restrict this model to explicitly considering the cases of named entities of length up to three morphemes, and to only implicitly considering those longer than three morphemes. We also restrict it to considering two morphemes in both the left and the right contexts of the named entity.</Paragraph> <Paragraph position="13"> 1. In the cases where the current named entity consists of up to three morphemes, all the constituent morphemes are regarded as within the current named entity. An example of this case is one where the current named entity consists of three morphemes, and the current position is at the middle of those constituent morphemes. 2. In the cases where the current named entity consists of more than three morphemes, only three of the constituent morphemes are regarded as within the current named entity and the rest are treated as if they were outside the named entity. For example, suppose that the current named entity consists of four morphemes. In the testing phase, we apply this model considering the preceding four morphemes as well as the subsequent four morphemes at every position, as in the case of the 9-gram model.</Paragraph> <Paragraph position="14"> We consider the following three modifications to this model, where we suppose that the morpheme at the current position is M_i.</Paragraph> <Paragraph position="15"> Note that, as opposed to the training phase, the length of the preceding/subsequent contexts is fixed in the testing phase of this model. Although this discrepancy between training and testing damages the performance of this single model (section 4.1), it is more important to note that this model tends to have a distribution of correct/over-generated named entities different from that of the 5-gram model. In section 4, we experimentally show that this difference is the key to improving named entity chunking performance by system combination.</Paragraph>
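<Paragraph position="16"> As a rough illustration of how the variable length model described above could window a training event, the following sketch (ours, under stated assumptions) keeps at most three constituent morphemes of the current named entity, truncating longer entities, and adds two morphemes of left/right context; the exact truncation rule, in particular which three constituents are kept, is an assumption rather than a detail given in this paper.

# Minimal sketch of a variable-length training window: at most three morphemes
# inside the current named entity (assumed here to be those nearest the current
# position), plus two context morphemes on each side of the entity.
def variable_length_window(morphemes, ne_start, ne_end, current):
    """Return (inside, left_context, right_context) as lists of morpheme indices."""
    inside = list(range(ne_start, ne_end + 1))
    if len(inside) > 3:
        # truncate to the three constituents closest to the current position (assumption)
        inside = [i for i in inside if abs(i - current) in (0, 1)]
    left = [i for i in range(ne_start - 2, ne_start) if i >= 0]
    right = [i for i in range(ne_end + 1, ne_end + 3) if i in range(len(morphemes))]
    return inside, left, right
</Paragraph>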
</Section> <Section position="4" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.1 Data Sets </SectionTitle> <Paragraph position="0"> The following gives the training and test data sets for our framework of learning to combine the outputs of named entity chunkers.</Paragraph> <Paragraph position="1"> 1. TrI: training data set for learning the individual named entity chunkers.</Paragraph> <Paragraph position="2"> 2. TrC: training data set for learning a classifier for combining the outputs of the individual named entity chunkers.</Paragraph> <Paragraph position="3"> 3. Ts: test data set for evaluating the classifier for combining the outputs of the individual named entity chunkers.</Paragraph> </Section> <Section position="5" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.2 Procedure </SectionTitle> <Paragraph position="0"> The following gives the procedure for learning the classifier to combine the outputs of named entity chunkers using TrI and TrC.</Paragraph> <Paragraph position="1"> 1. Train the individual named entity chunkers using TrI.</Paragraph> <Paragraph position="2"> 2. Apply the individual named entity chunkers to TrC and obtain the lists of chunked named entities.</Paragraph> <Paragraph position="3"> 3. Align the lists of chunked named entities according to the positions of the chunked named entities in the text TrC, and obtain the event expression TrCev of TrC.</Paragraph> <Paragraph position="4"> 4. Train the classifier NEchk_cmb for combining the outputs of the individual named entity chunkers using the event expression TrCev.</Paragraph> <Paragraph position="5"> The following gives the procedure for applying the learned classifier to Ts.</Paragraph> <Paragraph position="6"> 1. Apply the individual named entity chunkers to Ts and obtain the lists of chunked named entities.</Paragraph> <Paragraph position="7"> 2. Align the lists of chunked named entities according to the positions of the chunked named entities in the text Ts, and obtain the event expression Tsev of Ts.</Paragraph> <Paragraph position="8"> 3. Apply NEchk_cmb to Tsev and evaluate its performance.</Paragraph> </Section> </Section> <Section position="5" start_page="3" end_page="4" type="metho"> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.3 Data Expressions 3.3.1 Events </SectionTitle> <Paragraph position="0"> The event expression TrCev of TrC is obtained by aligning the lists NEList of chunked named entities, and is represented as a sequence of segments, where each segment is a set of aligned named entities. Chunked named entities are aligned under the constraint that those which share at least one constituent morpheme have to be aligned into the same segment.</Paragraph>
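<Paragraph position="1"> To illustrate this alignment constraint, the following sketch (ours; the span-based representation of entities is an assumption, not the paper's data format) groups the named entities chunked by the individual systems into segments so that any two entities sharing at least one constituent morpheme fall into the same segment.

# Illustrative sketch of the alignment step: entities are (system_id, start, end)
# spans over morpheme positions; overlapping spans are merged into one segment.
def align_into_segments(entities):
    """Group (system_id, start, end) spans into segments of overlapping entities."""
    entities = sorted(entities, key=lambda e: (e[1], e[2]))   # sort by start position
    segments = []
    for ent in entities:
        if not segments or ent[1] > segments[-1]["end"]:
            # no shared morpheme with the current segment: start a new segment
            segments.append({"entities": [ent], "end": ent[2]})
        else:
            # shares a morpheme with the current segment: extend it
            segments[-1]["entities"].append(ent)
            segments[-1]["end"] = max(segments[-1]["end"], ent[2])
    return segments
</Paragraph>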
<Paragraph position="2"> Examples of segments, into which named entities chunked by two systems are aligned, are shown in Table 2. In the first segment SegEv_i, given the sequence of the two morphemes, the system No.0 decided to extract two named entities, while the system No.1 chunked the two morphemes into one named entity. In those event expressions, systems indicates the list of the indices of the systems which output the named entity, mlength gives the number of constituent morphemes, NEtag gives one of the nine named entity types, POS gives the list of parts-of-speech of the constituent morphemes, and class_NE indicates whether the named entity is a correct one compared against the gold standard (&quot;+&quot;) or one over-generated by the systems (&quot;-&quot;). In the second segment SegEv_{i+1}, only the system No.1 decided to extract a named entity from the sequence of the three morphemes. In this case, the event expression for the system No.0 is the one which indicates that no named entity is extracted by the system No.0.</Paragraph> <Paragraph position="3"> In the training phase, each segment SegEv_j of the event expression constitutes a minimal unit of an event, from which features for learning the classifier are extracted. In the testing phase, the classes of each system's outputs are predicted against each segment.</Paragraph> <Paragraph position="4"> In principle, features for learning the classifier for combining outputs of named entity chunkers are represented as a set of pairs of the system indices list &lt;p,...,q&gt; and a feature expression F of the named entity.</Paragraph> </Section> <Section position="2" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 3.4 Learning Algorithm </SectionTitle> <Paragraph position="0"> We apply a simple decision list learning method to the task of learning a classifier for combining the outputs of named entity chunkers. A decision list (Yarowsky, 1994) is a sorted list of decision rules, each of which decides the value of the class given some feature f of an event. The decision rules in a decision list are sorted in descending order with respect to some preference value, and rules with higher preference values are applied first when the decision list is applied to new test data. In this paper, we simply sort the decision list according to the conditional probability P(class_i | f) of the class of the i-th system's output given a feature f.</Paragraph> </Section> </Section> </Paper>