<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1660"> <Title>Empirical Study on the Performance Stability of Named Entity Recognition Model across Domains</Title> <Section position="4" start_page="509" end_page="509" type="metho"> <SectionTitle> 2 Chinese NER Based on Multilevel Linguistic Features </SectionTitle> <Paragraph position="0"> In this paper, we focus on recognizing four types of NEs: Persons (PER), Locations (LOC), Organizations (ORG) and miscellaneous named entities (MISC) which do not belong to the previous three groups (e.g. products, conferences, events, brands, etc.). All the NER models in the following experiments are trained with a Chinese NER system. In this section, we simply describe this Chinese NER system. The Robust Risk Minimization (RRM) Classification method and multi-level linguistic features are used in this system (Guo et al., 2005).</Paragraph> <Section position="1" start_page="509" end_page="509" type="sub_section"> <SectionTitle> 2.1 Robust Risk Minimization Classifier </SectionTitle> <Paragraph position="0"> We can view the NER task as a sequential classification problem. If toki (i = 0,1,...,n) denotes the sequence of tokenized text which is the input to the system, then every token toki should be assigned a class-label ti.</Paragraph> <Paragraph position="1"> The class label value ti associated with each token toki is predicted by estimating the conditional probability P(ti = c|xi) for every possible class-label value c, where xi is a feature vector associated with token toki.</Paragraph> <Paragraph position="2"> We assume that P(ti = c|xi) = P(ti = c|toki,{tj}j[?]i). The feature vector xi can depend on previously predicted class labels {tj}j[?]i, but the dependency is typically assumed to be local.</Paragraph> <Paragraph position="3"> In the RRM method, the above conditional probability model has the following parametric form:</Paragraph> <Paragraph position="5"> of y into the interval [0, 1]. wc is a linear weight vector and bc is a constant. Parameters wc and bc can be estimated from the training data. Given training data (xi,ti) for i = 1,...,n, the model is estimated by solving the following optimization problem for each c (Zhang et al., 2002):</Paragraph> <Paragraph position="7"> where yic = 1 when ti = c, and yic = [?]1 otherwise. The function f is defined as:</Paragraph> <Paragraph position="9"> Given the above conditional probability model, the best possible sequence of ti's can be estimated by dynamic programming in the decoding stage (Zhang et al., 2002).</Paragraph> </Section> <Section position="2" start_page="509" end_page="509" type="sub_section"> <SectionTitle> 2.2 Multilevel Linguistic Features </SectionTitle> <Paragraph position="0"> This Chinese NER system uses Chinese characters (not Chinese words) as the basic token units, and then maps word-based features that are associated with each word into corresponding features of those characters that are contained in the word. This approach can effectively incorporate both character-based features and word-based features. In general, we may regard this approach as information integration from linguistic views at different abstraction levels.</Paragraph> <Paragraph position="1"> We integrate a diverse set of local linguistic features, including word segmentation information, Chinese word patterns, complex lexical linguistic features (e.g. part of speech and semantic features), aligned at the character level. 
<Paragraph position="2"> In addition, we also use external NE hints and gazetteers, including surnames, location suffixes, organization suffixes, titles, high-frequency Chinese characters in Chinese names and transliterated foreign names, and lists of locations and organizations. In this system, the local linguistic features of a token unit are derived from the sentence containing that token unit. All special linguistic patterns (i.e. date, time and numeral expressions) are encoded into pattern-specific class labels aligned with the tokens.</Paragraph> </Section> </Section> <Section position="5" start_page="509" end_page="513" type="metho"> <SectionTitle> 3 Impact of Training Data Size and Domain Information on the NER Performance </SectionTitle> <Paragraph position="0"> In practice, it is very important to maintain the performance stability of NER models across domains. However, the performance usually becomes unstable when NER models are applied to different domains. In this section, we focus on the impact of the training data size and of domain information on NER performance.</Paragraph> <Section position="1" start_page="510" end_page="510" type="sub_section"> <SectionTitle> 3.1 Data </SectionTitle> <Paragraph position="0"> We built a large-scale, high-quality Chinese NE annotated corpus. The corpus size is 114.25M Chinese characters. All the data are news articles selected from several Chinese newspapers in 2001 and 2002. All the NEs in the corpus are manually tagged. Documents in the corpus are also manually classified into eight domain categories: politics, sports, science, economics, entertainment, life, society and others. Cross-validation is employed to ensure the tagging quality.</Paragraph> <Paragraph position="1"> All the training data and test data in the experiments are selected from this Chinese annotated corpus. The general training data are randomly selected from the corpus without distinguishing their domain categories. All the domain-specific training data are selected from the corpus according to their domain categories. One general test data set and seven domain-specific test data sets are used in our experiments (see Table 1). The size of the general test data set is 1.34M Chinese characters. The seven domain-specific test sets are extracted from the general test data set according to the document domain categories.</Paragraph> <Paragraph position="2"> [Table 1: the general test data set and the seven domain-specific test data sets] </Paragraph> <Paragraph position="3"> In our evaluation, only NEs with correct boundaries and correct class labels are counted as correct recognitions. We use the standard P (Precision), R (Recall), and F-measure (defined as 2PR/(P+R)) to measure the performance of NER models.</Paragraph> </Section> <Section position="2" start_page="510" end_page="511" type="sub_section"> <SectionTitle> 3.2 Impact of Training Data Size on the NER Performance across Domains </SectionTitle> <Paragraph position="0"> The amount of annotated data is always a bottleneck for supervised learning methods in practice. Thus, we evaluate the impact of training data size on the NER performance across domains.</Paragraph> <Paragraph position="1"> [Figure 1: performance curves of the general and domain-specific NER models] </Paragraph> <Paragraph position="2"> In this baseline experiment, an initial general NER model is first trained with 0.1M general data. The NER model is then incrementally retrained by adding 0.1M new general training data each time, until the performance is no longer significantly enhanced.</Paragraph>
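To make the evaluation criterion of Section 3.1 concrete (an entity counts as correct only if both its boundaries and its class label match the gold annotation), here is a minimal scoring sketch. The (start, end, label) entity representation and the example values are assumptions for illustration, not data from the paper.

```python
def prf(gold_entities, predicted_entities):
    """Exact-match NER scoring: an entity is a (start, end, label) triple and
    is correct only when both boundaries and label match the gold annotation."""
    gold = set(gold_entities)
    pred = set(predicted_entities)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Example: one boundary error and one missed entity.
gold = [(0, 2, "PER"), (5, 9, "ORG"), (12, 14, "LOC")]
pred = [(0, 2, "PER"), (5, 8, "ORG")]
print(prf(gold, pred))  # (0.5, 0.333..., 0.4)
```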
<Paragraph position="3"> The NER performance curve (labelled with the tag "General") over the whole retraining process is shown in Figure 1. The experimental results show that the performance of the general NER model is significantly enhanced in the first several retraining cycles as more training data are used. However, once the general training data set exceeds 2.4M, the performance enhancement is very slight.</Paragraph> <Paragraph position="4"> In order to analyze how the training data size impacts the performance of NER models in specific domains, seven domain-specific NER models are built using a similar retraining process. Each domain-specific NER model is also first trained with 0.1M domain-specific data. Then, each initial domain-specific NER model is incrementally retrained by adding 0.1M new domain-specific data each time.</Paragraph> <Paragraph position="5"> The performance curves of these domain-specific NER models are also shown in Figure 1 (see the curves labelled with the domain tags). Although the initial performance of each domain-specific NER model varies with the domain, the performance is likewise significantly enhanced in the first several retraining cycles. When the size of the domain-specific training data set is above a certain threshold, the performance enhancement becomes very slight as well.</Paragraph> <Paragraph position="6"> The final performance of the trained NER models and the corresponding training data sets are shown in Table 2.</Paragraph> <Paragraph position="7"> From these NER performance curves, we obtain the following observations.</Paragraph> <Paragraph position="8"> 1. The more training data are used, the higher the NER performance that can be achieved. However, it is difficult to significantly enhance the performance once the training data size is above a certain threshold.</Paragraph> <Paragraph position="9"> 2. The threshold of the training data size and the final achieved performance vary with domains (see Table 2). For example, in the entertainment domain the threshold is 0.6M and the final F-measure reaches 83.31%; in the economics domain the threshold is 1.7M and the corresponding F-measure is 85.46%.</Paragraph> </Section> <Section position="3" start_page="511" end_page="513" type="sub_section"> <SectionTitle> 3.3 The Performance Stability of Each NE Type Recognition across Domains </SectionTitle> <Paragraph position="0"> Statistics on our large-scale annotated corpus (shown in Table 3) show that the distribution of NE types varies with domains. We define "NE density" to quantitatively measure the NE distribution in an annotated data set: NE density is the count of NE instances per one thousand Chinese characters. A higher NE density usually indicates that more NEs are contained in the data set. We may easily measure the distribution of each NE type across domains using NE density. In this annotated corpus, PER, LOC and ORG have similar NE densities, while MISC has the smallest NE density. Each NE type also has a different NE density in each domain. For example, the NE densities of ORG and LOC are much higher than that of PER in the economics domain; PER and LOC have higher NE densities than ORG in the politics domain; and PER has the highest NE density among these NE types in both the sports and entertainment domains. The unbalanced NE distribution across domains shows that news articles in different domains usually focus on different NE types. These distribution features imply that each NE type has a different degree of domain dependency.</Paragraph>
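As defined above, NE density is simply the number of NE instances per thousand Chinese characters. A minimal, hypothetical sketch of how it could be computed per domain follows; the document fields ('domain', 'text', 'entities') are assumptions made for illustration, not the corpus format used in this study.

```python
from collections import defaultdict

def ne_density_by_domain(docs):
    """NE density = NE instances per 1,000 Chinese characters, per domain.

    Each doc is assumed to be a dict with 'domain' (str), 'text' (str)
    and 'entities' (list of annotated NEs)."""
    chars = defaultdict(int)
    entities = defaultdict(int)
    for doc in docs:
        chars[doc["domain"]] += len(doc["text"])
        entities[doc["domain"]] += len(doc["entities"])
    return {d: 1000.0 * entities[d] / chars[d] for d in chars if chars[d] > 0}
```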
<Paragraph position="1"> The performance stability of domain-focused NE type recognition becomes more important in domain-specific applications. For example, since economic news articles usually focus on ORG and LOC NEs, high-quality LOC and ORG recognition models are more valuable in the economics domain. In addition, these distribution features can also be used to guide training and test data selection.</Paragraph> <Paragraph position="2"> In this experiment, the performance stability of NER models across domains is evaluated, especially the performance stability of each NE type recognition. The general NER model is trained with 2.4M general data. The seven domain-specific models are trained with the corresponding domain-specific training sets (see Table 2 in Section 3.2).</Paragraph> <Paragraph position="3"> The performance stability of the general NER model is first evaluated on the general and domain-specific test data sets (see Table 1 in Section 3.1). The performance curves of the general NER model are shown in Figure 2, including the total F-measure curve of the NER model (labelled with the tag "All") and the F-measure curves of each NE type recognition in the specific domains (labelled with the respective NE tags).</Paragraph> <Paragraph position="4"> The performance stability of the seven domain-specific NER models is also evaluated. Each domain-specific NER model is tested on the general test data set and the other six domain-specific test data sets. The experimental results are shown in Table 5. The performance curves of three domain-specific NER models are shown in Figure 3, Figure 4 and Figure 5 respectively.</Paragraph> <Paragraph position="5"> From these experimental results, we draw the following conclusions.</Paragraph> <Paragraph position="6"> 1. The performance stability of all the NER models is limited across domains. When a NER model is employed in a new domain, its performance usually decreases, and is usually much lower than the performance of the corresponding domain-specific NER model.</Paragraph> <Paragraph position="7"> 2. The general NER model has better performance stability than the domain-specific NER models when they are applied to new domains (see Table 5). A domain-specific model can usually achieve a higher performance in its corresponding domain after being trained with a smaller amount of domain-specific annotated data (see Table 2 in Section 3.2). However, the performance stability of domain-specific NER models is poor across different domains. Thus, in practice it is common to build a general NER model for general applications.</Paragraph> <Paragraph position="8"> 3. The performance of PER, LOC and ORG recognition is better than that of MISC recognition (see Figures 2-5). The main reason for the poor performance of MISC recognition is that there are fewer common indicative features among the various MISC NEs, which we do not further distinguish. In addition, the NE density of MISC is much lower than that of PER, LOC and ORG, so there are relatively few positive training samples for MISC recognition.</Paragraph> <Paragraph position="9"> 4. Different NE types have different degrees of domain dependency, and the performance stability of each NE type recognition varies with domains (see Figures 2-5). The performance of PER and LOC recognition is more stable across domains; thus, little effort is needed to adapt existing high-quality general PER and LOC recognition models to domain-specific applications. Since ORG and MISC NEs usually contain more domain-specific semantic information, ORG and MISC are more domain-dependent than PER and LOC. Thus, more domain-specific features should be mined for ORG and MISC recognition.</Paragraph> </Section> </Section>
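The cross-domain evaluation described in Section 3.3 amounts to scoring every trained model on every test set and tabulating the F-measures. A minimal, hypothetical sketch of building such a table follows; the `evaluate` callback stands in for the exact-match scoring sketched earlier and is an assumption, not the authors' tooling.

```python
def cross_domain_f_table(models, test_sets, evaluate):
    """Tabulate F-measure for every (model, test set) pair.

    models:    dict mapping a model name (e.g. 'general', 'economics')
               to a trained NER model.
    test_sets: dict mapping a test set name to its annotated data.
    evaluate:  callable (model, data) -> F-measure.
    """
    return {
        model_name: {test_name: evaluate(model, data)
                     for test_name, data in test_sets.items()}
        for model_name, model in models.items()
    }
```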
<Section position="6" start_page="513" end_page="515" type="metho"> <SectionTitle> 4 Use Informative Training Samples to Enhance the Performance of NER Models across Domains </SectionTitle> <Paragraph position="0"> A higher-performance system usually requires more features and a larger amount of training data, which in turn require more system memory and a more efficient training method; these may not be available. Within the limitations of the available training data and computational resources, it is necessary for us either to limit the number of features or to select more informative data that can be handled efficiently by the training algorithm. Active learning methods are commonly employed in text classification (McCallum and Nigam, 1998) and have only recently been applied to NER (Shen et al., 2004).</Paragraph> <Paragraph position="1"> In order to enhance the performance and overcome the limitations of the available training data and computational resources, we present an informative sample selection method using a variant of uncertainty sampling (Lewis and Catlett, 1994). The main steps are described as follows (a sketch of the refinement step is given after the steps).</Paragraph> <Paragraph position="2"> 1. Build an initial NER model (F-measure = 76.24%) using an initial data set. The initial data set (about 1M Chinese characters) is randomly selected from the annotated corpus.</Paragraph> <Paragraph position="3"> 2. Refine the training set by adding more informative samples and removing redundant samples. In this refinement phase, all of the data are annotated by the current recognition model (e.g. the initial model built in Step 1). Each annotation has a confidence score associated with the prediction; in general, an annotation with a lower confidence score usually indicates a wrong prediction. The confidence score of a whole sample sentence is defined as the average of the confidence scores of all the annotations contained in the sentence. Thus, we add sample sentences with lower confidence scores to the training set. Meanwhile, in order to keep the training set at a reasonable size, old training sample sentences with higher confidence scores are removed from the current training set. In each retraining phase, all of the sample sentences are sorted by confidence score; the 1000 new sample sentences with the lowest confidence scores are added to the current training set, and the 500 old training sample sentences with the highest confidence scores are removed from it.</Paragraph> <Paragraph position="4"> 3. Retrain a new Chinese NER model with the newly refined training set.</Paragraph> <Paragraph position="5"> 4. Repeat Step 2 and Step 3 until the performance no longer improves.</Paragraph> <Paragraph position="6"> We apply this informative sample selection method to incrementally build the general-domain NER model. The size of the final informative training sample set is 1.05M Chinese characters.</Paragraph>
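A minimal sketch of the refinement step (Step 2) is given below. It assumes sentences are hashable items and that a confidence function returns the average annotation confidence of the current model on a sentence; the 1000/500 defaults follow the description above, while everything else (names, data representation) is an illustrative assumption.

```python
def sentence_confidence(annotation_confidences):
    """Confidence of a sentence = average confidence of its annotations under
    the current model (sentences with no annotations are treated as fully
    confident here, which is an assumption)."""
    if not annotation_confidences:
        return 1.0
    return sum(annotation_confidences) / len(annotation_confidences)

def refine_training_set(training_sents, pool_sents, confidence,
                        n_add=1000, n_remove=500):
    """One refinement step of the informative sample selection.

    confidence(sentence) -> float scores a sentence with the current model.
    Low-confidence pool sentences are added (they are the most informative);
    high-confidence old training sentences are dropped to keep the training
    set at a reasonable size.
    """
    pool_by_conf = sorted(pool_sents, key=confidence)       # least confident first
    added, remaining_pool = pool_by_conf[:n_add], pool_by_conf[n_add:]

    train_by_conf = sorted(training_sents, key=confidence)  # least confident first
    kept = train_by_conf[:-n_remove] if n_remove > 0 else list(train_by_conf)

    return kept + added, remaining_pool
```

The outer loop (Steps 2-4) would retrain the model on the returned training set and repeat until the F-measure stops improving.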
<Paragraph position="7"> This informative training sample set has a higher NE density than the random training data set. We denote the general NER model trained with the informative sample set the "general informative model", and the general-domain model trained with 2.4M randomly selected general training data the "general random model". The performance curves of the general NER models after being trained with informative samples and random data respectively are shown in Figure 6. The experimental results (see Table 6) show that there is a significant enhancement in F-measure when informative training samples are used: compared with the random model, the informative model increases the F-measure by 4.21 percentage points.</Paragraph> <Paragraph position="8"> [Table 6: performance by NE type using the informative sample set vs. the random training set] </Paragraph> <Paragraph position="9"> The informative model is also evaluated on the domain-specific test sets. The experimental results are shown in Table 7. We view the performance of each domain-specific NER model as the baseline performance in its corresponding domain (see Table 8), denoted F_baseline. The performance of the informative model in specific domains is very close to the corresponding F_baseline (see Figure 7). We define the domain-specific average F-measure of a NER model, denoted \bar{F}, as the average of its F-measures over the seven specific domains; the average of the F_baseline values is denoted \bar{F}_baseline, and the averages for the informative model and the random model are denoted \bar{F}_informative and \bar{F}_random respectively. Compared with \bar{F}_baseline (81.47%), the informative model increases \bar{F} by 1.05 percentage points, whereas \bar{F} decreases by 2.67 percentage points with the random model. In particular, the performance of the informative model is better than the corresponding baseline performance in the politics, life, society and science domains. Moreover, the size of the informative sample set is much smaller than the life domain training set (1.7M).</Paragraph> <Paragraph position="10"> [Table 8: F(%) of the informative model, the random model and the corresponding domain-specific model in each specific domain (economics, politics, sports, entertainment, life, society, science), together with the average \bar{F}] </Paragraph> <Paragraph position="11"> The informative model performs much better than the random model in specific domains (see Table 8 and Figure 7): \bar{F}_informative is 82.52% while \bar{F}_random is 78.80%, i.e. the informative model increases \bar{F} by 3.72 percentage points. The informative model is also more stable than the random model across specific domains (see Table 8): the standard deviation of the F-measure is 4.74 for the informative model and 4.94 for the random model. Our experience with incremental sample selection provides the following hints.</Paragraph> <Paragraph position="12"> 1. The performance of a NER model across domains can be significantly enhanced after it is trained with informative samples. In order to obtain a high-quality and stable NER model, it is only necessary to keep the informative samples. Informative sample selection can alleviate the problem of obtaining a large amount of annotated data, and it is also an effective method for overcoming the potential limitations of computational resources.</Paragraph> <Paragraph position="13"> 2. In learning NER models, annotated results with lower confidence scores are more useful than samples with higher confidence scores.
This is consistent with other studies on active learning.</Paragraph> </Section> </Paper>