<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1075"> <Title>Multi-Criteria-based Active Learning for Named Entity Recognition</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Multi-criteria for NER Active Learning </SectionTitle> <Paragraph position="0"> Support Vector Machines (SVM) is a powerful machine learning method, which has been applied successfully in NER tasks, such as (Kazama et al.</Paragraph> <Paragraph position="1"> 2002; Lee et al. 2003). In this paper, we apply active learning methods to a simple and effective SVM model to recognize one class of names at a time, such as protein names, person names, etc. In NER, SVM is to cla ssify a word into positive class &quot;1&quot; indicating that the word is a part of an entity, or negative class &quot;-1&quot; in dicating that the word is not a part of an entity. Each word in SVM is represented as a high-dimensional feature vector including surface word information, orthographic features, POS feature and semantic trigger features (Shen et al. 2003). The semantic trigger features consist of some special head nouns for an entity class which is supplied by users. Furthermore, a window (size = 7), which represents the local context of the target word w, is also used to classify w. However, for active learning in NER, it is not reasonable to select a single word without context for human to label. Even if we require human to label a single word, he has to make an addition effort to refer to the context of the word. In our active learning process, we select a word sequence which consists of a machine-annotated named entity and its context rather than a single word.</Paragraph> <Paragraph position="2"> Therefore, all of the measures we propose for active learning should be applied to the machine-annotated named entities and we have to further study how to extend the measures for words to named entities. Thus, the active learning in SVM-based NER will be more complex than that in simple classification tasks, such as text classif ication on which most SVM active learning works are conducted (Schohn and Cohn 2000; Tong and Koller 2000; Brinker 2003). In the next part, we will introduce informativeness, representativeness and diversity measures for the SVM-based NER.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Informativeness </SectionTitle> <Paragraph position="0"> The basic idea of informativeness criterion is similar to certainty-based sample selection methods, which have been used in many previous works. In our task, we use a distance-based measure to evaluate the informativeness of a word and extend it to the measure of an entity using three scoring functions. We prefer the examples with high informative degree for which the current model are most uncertain.</Paragraph> <Paragraph position="1"> In the simplest linear form, training SVM is to find a hyperplane that can separate the posit ive and negative examples in training set with maximum margin. The margin is defined by the distance of the hyperplane to the nearest of the positive and negative examples. The training examples which are closest to the hyperplane are called support vectors. In SVM, only the support vectors are useful for the classific ation, which is different from statistical models. SVM training is to get these support vectors and their weights from training set by solving quadratic programming problem. 
<Paragraph position="2"> Intuitively, we consider the informativeness of an example to be the effect it can have on the support vectors when it is added to the training set. An example may be informative for the learner if the distance of its feature vector to the hyperplane is less than that of the support vectors to the hyperplane (which is equal to 1). This intuition is also justified by (Schohn and Cohn 2000; Tong and Koller 2000) based on a version space analysis. They state that labeling an example that lies on or close to the hyperplane is guaranteed to have an effect on the solution. In our task, we use the distance to measure the informativeness of an example.</Paragraph>
<Paragraph position="3"> The distance of a word's feature vector to the hyperplane is computed as follows:</Paragraph>
<Paragraph position="4"> Dist(w) = | Σ_{i=1..N} α_i y_i k(s_i, w) + b | </Paragraph>
<Paragraph position="5"> where w is the feature vector of the word, α_i, y_i and s_i correspond to the weight, the class and the feature vector of the ith support vector respectively, b is the bias of the hyperplane, and N is the number of support vectors in the current model.</Paragraph>
<Paragraph position="6"> We select the example with minimal Dist, which indicates that it comes closest to the hyperplane in feature space. This example is considered the most informative for the current model.</Paragraph>
<Paragraph position="7"> Based on the above informativeness measure for a word, we compute the overall informativeness degree of a named entity NE. In this paper, we propose three scoring functions as follows. Let NE = w_1...w_N, in which w_i is the feature vector of the ith word of NE.</Paragraph>
<Paragraph position="8"> * Info_Avg: The informativeness of NE is scored by the average distance of the words in NE to the hyperplane.</Paragraph>
<Paragraph position="9"> Info(NE) = (1/N) Σ_{w_i ∈ NE} Dist(w_i) </Paragraph>
<Paragraph position="10"> where w_i is the feature vector of the ith word in NE.</Paragraph>
<Paragraph position="11"> * Info_Min: The informativeness of NE is scored by the minimal distance of the words in NE.</Paragraph>
<Paragraph position="12"> Info(NE) = Min_{w_i ∈ NE} { Dist(w_i) } </Paragraph>
<Paragraph position="13"> * Info_S/N: If the distance of a word to the hyperplane is less than a threshold α (= 1 in our task), the word is considered to have a short distance. Then, we compute the proportion of the number of words with short distance to the total number of words in the named entity and use this proportion to quantify the informativeness of the named entity: Info(NE) = NUM(Dist(w_i) < α) / N, where NUM(·) is the number of words in NE whose distance to the hyperplane is less than α.</Paragraph>
<Paragraph position="14"> In Section 4.3, we will evaluate the effectiveness of these scoring functions.</Paragraph>
</Section>
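A minimal sketch of the distance measure and the three scoring functions defined above; the toy support vectors, weights, bias and word feature vectors are invented purely to make the snippet runnable:

```python
# Hypothetical sketch of Dist(w) and the Info_Avg / Info_Min / Info_S/N scores.
import numpy as np

def dist(w, support_vectors, alphas, ys, b, kernel):
    """Dist(w) = |sum_i alpha_i * y_i * k(s_i, w) + b|."""
    return abs(sum(a * y * kernel(s, w) for a, y, s in zip(alphas, ys, support_vectors)) + b)

def info_avg(dists):             # Info_Avg: average distance of the entity's words
    return float(np.mean(dists))

def info_min(dists):             # Info_Min: distance of the word closest to the hyperplane
    return float(np.min(dists))

def info_sn(dists, alpha=1.0):   # Info_S/N: fraction of words with distance below alpha
    return float(np.mean(np.asarray(dists) < alpha))

# toy model: two support vectors in a 3-dimensional feature space, linear kernel
linear = lambda s, w: float(np.dot(s, w))
svs    = [np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 0.0])]
alphas, ys, b = [0.8, 0.5], [+1, -1], -0.1

# feature vectors of the words of one machine-annotated entity NE = w1 w2
entity_words = [np.array([1.0, 0.2, 0.4]), np.array([0.1, 0.9, 0.0])]
dists = [dist(w, svs, alphas, ys, b, linear) for w in entity_words]
print(info_avg(dists), info_min(dists), info_sn(dists))
```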
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Representativeness </SectionTitle>
<Paragraph position="0"> In addition to the most informative example, we also prefer the most representative example. The representativeness of an example can be evaluated based on how many examples are similar or near to it. So, the examples with a high representativeness degree are less likely to be outliers. Adding them to the training set will have an effect on a large number of unlabeled examples. There are only a few works considering this selection criterion (McCallum and Nigam 1998; Tang et al. 2002), and both of them are specific to their tasks, viz. text classification and statistical parsing. In this section, we compute the similarity between words using a general vector-based measure, extend this measure to the named entity level using the dynamic time warping algorithm, and quantify the representativeness of a named entity by its density.</Paragraph>
<Paragraph position="1"> In the general vector space model, the similarity between two vectors can be measured by computing the cosine value of the angle between them. The smaller the angle is, the more similar the two vectors are. This measure, called the cosine-similarity measure, has been widely used in information retrieval tasks (Baeza-Yates and Ribeiro-Neto 1999). In our task, we also use it to quantify the similarity between two words. In particular, the calculation in SVM needs to be performed in a higher dimensional space by using a certain kernel function k(w_i, w_j).</Paragraph>
<Paragraph position="2"> Therefore, we adapt the cosine-similarity measure to SVM as follows:</Paragraph>
<Paragraph position="3"> Sim(w_i, w_j) = k(w_i, w_j) / sqrt( k(w_i, w_i) · k(w_j, w_j) ) </Paragraph>
<Paragraph position="4"> where w_i and w_j are the feature vectors of the words i and j. This calculation is also supported by (Brinker 2003)'s work. Furthermore, if we use the linear kernel k(w_i, w_j) = w_i · w_j, the measure is the same as the traditional cosine similarity measure cos θ = (w_i · w_j) / (||w_i|| ||w_j||) and may be regarded as a general vector-based similarity measure.</Paragraph>
<Paragraph position="5"> In this part, we compute the similarity between two machine-annotated named entities given the similarities between words. Regarding an entity as a word sequence, this work is analogous to the alignment of two sequences. We employ the dynamic time warping (DTW) algorithm (Rabiner et al. 1978) to find an optimal alignment between the words in the two sequences which maximizes the accumulated similarity degree between the sequences. Here, we adapt it to our task. A sketch of the modified algorithm is as follows.</Paragraph>
<Paragraph position="7"> Let NE1 and NE2 be the two word sequences to be matched. NE1 and NE2 consist of M and N words respectively. NE1(n) = w_1n and NE2(m) = w_2m. A similarity value Sim(w_1n, w_2m) is known for every pair of words (w_1n, w_2m) within NE1 and NE2. The goal of DTW is to find a path, m = map(n), which maps n onto the corresponding m such that the accumulated similarity Sim* along the path is maximized.</Paragraph>
<Paragraph position="8"> Certainly, the overall similarity measure Sim* has to be normalized, as longer sequences normally give a higher similarity value. So, the similarity between the two sequences NE1 and NE2 is calculated as Sim* normalized by the length of the alignment path. Given a set of machine-annotated named entities NESet = {NE1, ... , NEN}, the representativeness of a named entity NEi in NESet is quantified by its density. The density of NEi is defined as the average similarity between NEi and all the other entities NEj in NESet as follows:</Paragraph>
<Paragraph position="9"> Density(NEi) = ( Σ_{j ≠ i} Sim(NEi, NEj) ) / (N − 1) </Paragraph>
<Paragraph position="10"> If NEi has the largest density among all the entities in NESet, it can be regarded as the centroid of NESet and also as the most representative example in NESet.</Paragraph>
</Section>
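The word- and entity-level similarity and the density measure might look like the following sketch. The paper does not spell out the exact DTW recursion or the normalization it uses, so the monotonic alignment and the division by max(|NE1|, |NE2|) below are assumptions, and the toy feature vectors are invented:

```python
# Hypothetical sketch of kernel cosine similarity, DTW-based entity similarity and density.
import numpy as np

def k_linear(u, v):
    return float(np.dot(u, v))

def word_sim(u, v, k=k_linear):
    """Kernel cosine similarity: k(u, v) / sqrt(k(u, u) * k(v, v))."""
    return k(u, v) / np.sqrt(k(u, u) * k(v, v))

def entity_sim(ne1, ne2, k=k_linear):
    """Align the two word sequences with a DTW-style recursion, accumulate the
    word-to-word similarities along the best monotonic path, and normalize
    (here by the longer length -- a modelling assumption)."""
    n, m = len(ne1), len(ne2)
    acc = np.full((n + 1, m + 1), -np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = max(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = best_prev + word_sim(ne1[i - 1], ne2[j - 1], k)
    return acc[n, m] / max(n, m)

def density(i, ne_set, k=k_linear):
    """Density(NE_i): average similarity between NE_i and every other entity in NESet."""
    sims = [entity_sim(ne_set[i], ne_set[j], k) for j in range(len(ne_set)) if j != i]
    return float(np.mean(sims))

# toy NESet: three entities, each a short list of 3-dimensional word vectors
ne_set = [
    [np.array([1.0, 0.0, 0.2]), np.array([0.9, 0.1, 0.0])],
    [np.array([0.8, 0.1, 0.3])],
    [np.array([0.0, 1.0, 0.0]), np.array([0.1, 0.9, 0.2])],
]
print([round(density(i, ne_set), 3) for i in range(len(ne_set))])  # largest = most representative
```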
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Diversity </SectionTitle>
<Paragraph position="0"> The diversity criterion is to maximize the training utility of a batch. We prefer a batch in which the examples have high variance with respect to each other. For example, given a batch size of 5, we try not to select five repetitious examples at a time. To our knowledge, there is only one work (Brinker 2003) exploring this criterion. In our task, we propose two methods, a local one and a global one, to make the examples in a batch diverse enough.</Paragraph>
<Paragraph position="1"> For a global consideration, we cluster all named entities in NESet based on the similarity measure proposed in Section 2.2.2. The named entities in the same cluster may be considered similar to each other, so we select named entities from different clusters at one time. We employ a K-means clustering algorithm (Jelinek 1997), which is shown in Figure 1.</Paragraph>
<Paragraph position="2"> Figure 1 (K-Means Clustering algorithm). Given: NESet = {NE1, ... , NEN}; suppose the number of clusters is K. Initialization: randomly and equally partition {NE1, ... , NEN} into K initial clusters Cj (j = 1, ... , K). Loop until the number of changes of the centroids of all clusters is less than a threshold: find the centroid of each cluster Cj (j = 1, ... , K), and repartition each example into the cluster whose centroid it is most similar to.</Paragraph>
<Paragraph position="3"> In each round, we need to compute the pair-wise similarities within each cluster to get the centroid of the cluster. And then, we need to compute the similarities between each example and all centroids to repartition the examples. So, the algorithm is time-consuming. Based on the assumption that the N examples are uniformly distributed among the K clusters, the time complexity of the algorithm is about O(N^2/K + NK) (Tang et al. 2002). In one of our experiments, the size of NESet (N) is around 17000 and K is equal to 50, so the time complexity is about O(10^6). For efficiency, we may filter the entities in NESet before clustering them, which will be further discussed in Section 3.</Paragraph>
<Paragraph position="4"> For a local consideration, when selecting a machine-annotated named entity, we compare it with all previously selected named entities in the current batch. If the similarity between them is above a threshold β, this example is not allowed to be added into the batch. The order of selecting examples is based on some measure, such as the informativeness measure, the representativeness measure or their combination. This local selection method is shown in Figure 2. In this way, we avoid selecting examples that are too similar to each other (similarity value ≥ β) in a batch. The threshold β may be set to the average similarity between the examples in NESet.</Paragraph>
<Paragraph position="5"> This consideration only requires O(NK + K^2) computational time. In one of our experiments (N ≈ 17000 and K = 50), the time complexity is about O(10^5). It is more efficient than the clustering algorithm described in Section 2.3.1.</Paragraph>
</Section> </Section>
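A sketch of the local diversity check described above (the global variant is standard K-means clustering over entities and is omitted). The ordering by a generic score and the reuse of entity_sim from the earlier sketch are assumptions, not the authors' implementation:

```python
# Hypothetical sketch of the local diversity-based batch construction.
def select_batch_local(candidates, scores, batch_size, sim, beta):
    """Walk through the candidates in decreasing order of some selection score
    (informativeness, representativeness or a combination) and admit a candidate
    only if its similarity to every example already in the batch is below beta."""
    batch = []  # indices of the selected examples
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    for i in order:
        if len(batch) >= batch_size:
            break
        if all(sim(candidates[i], candidates[j]) < beta for j in batch):
            batch.append(i)
    return batch

# usage sketch (names taken from the earlier snippets; beta could be the
# average pairwise similarity of the entities in NESet):
# selected = select_batch_local(ne_set, info_scores, 50, entity_sim, beta)
```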
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Sample Selection strategies </SectionTitle>
<Paragraph position="0"> In this section, we will study how to combine and strike a proper balance between these criteria, viz. informativeness, representativeness and diversity, to reach the maximum effectiveness in NER active learning. We build two strategies to combine the measures proposed above. These strategies are based on the varying priorities of the criteria and the varying degrees to which the criteria are satisfied.</Paragraph>
<Paragraph position="1"> * Strategy 1: We first consider the informativeness criterion. We choose the m examples with the highest informativeness scores from NESet and put them into an intermediate set called INTERSet. By this preselection, we make the selection process faster in the later steps, since the size of INTERSet is much smaller than that of NESet. Then we cluster the examples in INTERSet and choose the centroid of each cluster into a batch called BatchSet. The centroid of a cluster is the most representative example in that cluster, since it has the largest density. Furthermore, the examples in different clusters may be considered diverse from each other. By this means, we consider the representativeness and diversity criteria at the same time. This strategy is shown in Figure 3. One limitation of this strategy is that the clustering result may not reflect the distribution of the whole sample space, since for efficiency we only cluster the examples in INTERSet. The other is that the representativeness of an example is only evaluated within a cluster; if the cluster size is too small, the most representative example in this cluster may not be representative in the whole sample space.</Paragraph>
<Paragraph position="2"> Figure 3 (Strategy 1). Given: BatchSet with the maximal size K; INTERSet with the maximal size M.</Paragraph>
<Paragraph position="3"> * Strategy 2: (Figure 4) We combine the informativeness and representativeness criteria using the function λ·Info(NEi) + (1−λ)·Density(NEi), in which the Info and Density values of NEi are normalized first. The individual importance of each criterion in this function is adjusted by the trade-off parameter λ (0 ≤ λ ≤ 1), which is set to 0.6 in our experiment. First, we select a candidate example NEi with the maximum value of this function from NESet. Second, we consider the diversity criterion using the local method of Section 2.3. We add the candidate example NEi to the batch only if NEi is different enough from any previously selected example in the batch. The threshold β is set to the average pair-wise similarity of the entities in NESet.</Paragraph>
</Section>
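Putting the pieces together, Strategy 2 might be sketched as below. The min-max normalization is an assumption (the paper only states that the Info and Density values are normalized first), and the helper names follow the earlier hypothetical snippets:

```python
# Hypothetical sketch of Strategy 2: lambda*Info + (1-lambda)*Density with a local diversity check.
import numpy as np

def min_max_normalize(xs):
    xs = np.asarray(xs, dtype=float)
    span = xs.max() - xs.min()
    return (xs - xs.min()) / span if span > 0 else np.zeros_like(xs)

def strategy2(candidates, info, dens, batch_size, sim, beta, lam=0.6):
    """Rank candidates by lam*Info + (1-lam)*Density (both normalized), then admit a
    candidate to the batch only if it is dissimilar enough (similarity < beta) to
    every previously selected example."""
    score = lam * min_max_normalize(info) + (1.0 - lam) * min_max_normalize(dens)
    batch = []
    for i in np.argsort(-score):   # candidates in decreasing order of the combined score
        if len(batch) >= batch_size:
            break
        if all(sim(candidates[i], candidates[j]) < beta for j in batch):
            batch.append(int(i))
    return batch
```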
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experimental Results and Analysis </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Experiment Settings </SectionTitle>
<Paragraph position="0"> In order to evaluate the effectiveness of our selection strategies, we apply them to recognize protein (PRT) names in the biomedical domain using the GENIA corpus V1.1 (Ohta et al. 2002) and person (PER), location (LOC) and organization (ORG) names in the newswire domain using the MUC-6 corpus. First, we randomly split the whole corpus into three parts: an initial training set to build an initial model, a test set to evaluate the performance of the model and an unlabeled set to select examples from. The size of each data set is shown in Table 1. Then, iteratively, we select a batch of examples following the selection strategies proposed, ask human experts to label them and add them to the training set. The batch size K is 50 in GENIA and 10 in MUC-6. Each example is defined as a machine-recognized named entity and its context words (previous 3 words and next 3 words).</Paragraph>
<Paragraph position="1"> The goal of our work is to minimize the human annotation effort needed to learn a named entity recognizer with the same performance level as supervised learning. The performance of our model is evaluated using &quot;precision/recall/F-measure&quot;.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Overall Result in GENIA and MUC-6 </SectionTitle>
<Paragraph position="0"> In this section, we evaluate our selection strategies by comparing them with a random selection method, in which a batch of examples is randomly selected in each iteration, on the GENIA and MUC-6 corpora.</Paragraph>
<Paragraph position="1"> Table 2 shows the amount of training data needed to achieve the performance of supervised learning using the various selection methods, viz. Random, Strategy1 and Strategy2. In GENIA, we find: * The model achieves 63.3 F-measure using 223K words in supervised learning.</Paragraph>
<Paragraph position="2"> * The best performer is Strategy2 (31K words), requiring less than 40% of the training data that Random (83K words) does and 14% of the training data that supervised learning does.</Paragraph>
<Paragraph position="3"> * Strategy1 (40K words) performs slightly worse than Strategy2, requiring 9K more words. This is probably because Strategy1 cannot avoid selecting outliers if a cluster is too small.</Paragraph>
<Paragraph position="4"> * Random (83K words) requires about 37% of the training data that supervised learning does. This indicates that only the words in and around a named entity are useful for classification, and that the words far from the named entity may not be helpful.</Paragraph>
<Paragraph position="5"> Furthermore, when we apply our model to the newswire domain (MUC-6) to recognize person, location and organization names, Strategy1 and Strategy2 show an even more promising result in comparison with supervised learning and Random, as shown in Table 2. On average, about 95% of the data can be saved while achieving the same performance as supervised learning in MUC-6. This is probably because NER in the newswire domain is much simpler than that in the biomedical domain (Shen et al. 2003) and because named entities are fewer and distributed much more sparsely in newswire texts than in biomedical texts.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Effectiveness of Informativeness-based Selection Method </SectionTitle>
<Paragraph position="0"> In this section, we investigate the effectiveness of the informativeness criterion in the NER task. Figure 5 shows a plot of training data size versus the F-measure achieved by the informativeness-based measures of Section 2.1 (Info_Avg, Info_Min and Info_S/N) as well as by Random. We make the comparisons on the GENIA corpus. In Figure 5, the horizontal line is the performance level (63.3 F-measure) achieved by supervised learning (223K words). We find that the three informativeness-based measures perform similarly and that each of them outperforms Random. Table 3 highlights the data sizes needed to achieve the peak performance using these selection methods. We find that Random (83K words) on average requires over 1.5 times as much data as the informativeness-based selection methods to achieve the same performance.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Effectiveness of Two Sample Selection Strategies </SectionTitle>
<Paragraph position="0"> In addition to the informativeness criterion, we further incorporate the representativeness and diversity criteria into active learning using the two strategies described in Section 3.
Comparing the two strategies with Info_Min, the best of the single-criterion-based selection methods, we aim to justify that representativeness and diversity are also important factors for active learning. Figure 6 shows the learning curves for the various methods: Strategy1, Strategy2 and Info_Min. In the beginning iterations (F-measure < 60), the three methods perform similarly. But as the training set grows larger, the efficiency of Strategy1 and Strategy2 becomes evident. Table 4 highlights the final result of the three methods. In order to reach the performance of supervised learning, Strategy1 (40K words) and Strategy2 (31K words) require about 80% and 60%, respectively, of the data that Info_Min (51.9K words) does. So we believe that the effective combination of informativeness, representativeness and diversity helps to learn the NER model more quickly and at a lower annotation cost.</Paragraph>
<Paragraph position="1"> Table 4: Training data needed by the multi-criteria-based selection strategies and the informativeness-criterion-based selection (Info_Min) to achieve the same performance level as the supervised learning.</Paragraph>
</Section> </Section> </Paper>