<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1015"> <Title>Bootstrapping Coreference Classifiers with Multiple Machine Learning Algorithms</Title> <Section position="4" start_page="2" end_page="3" type="metho"> <SectionTitle> 3 Learning Algorithms </SectionTitle> <Paragraph position="0"> We employ naive Bayes and decision list learners in our single-view, multiple-learner framework for bootstrapping coreference classifiers. This section gives an overview of the two learners.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 Naive Bayes </SectionTitle> <Paragraph position="0"> A naive Bayes (NB) classifier is a generative classifier that assigns to a test instance with feature values x_1, ..., x_n the class

c^* = \arg\max_y P(y \mid x_1, \ldots, x_n) = \arg\max_y \frac{P(x_1, \ldots, x_n \mid y) \, P(y)}{P(x_1, \ldots, x_n)} = \arg\max_y P(y) \prod_i P(x_i \mid y).

The first equality above follows from the definition of MAP, the second one from Bayes rule, and the last one from the conditional independence assumption of the feature values. We determine the class priors P(y) and the class densities P(x_i | y) directly from the training data using add-one smoothing.</Paragraph> </Section>
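For concreteness, the following is a minimal sketch of a naive Bayes classifier with add-one smoothing that makes the MAP decision above. It is an illustration only, not the classifier used in the paper: the class name, the "+"/"-" label encoding, and the tuple-of-feature-values instance format are assumptions.

```python
from collections import defaultdict
import math

class NaiveBayes:
    """Minimal naive Bayes over discrete feature values with add-one smoothing.

    An instance is a tuple of feature values; labels might be "+"/"-" for
    coreferent vs. not coreferent (an assumed encoding).
    """

    def fit(self, instances, labels):
        self.classes = sorted(set(labels))
        self.class_counts = defaultdict(int)                       # N(y)
        self.value_counts = defaultdict(lambda: defaultdict(int))  # N(x_i = v, y)
        self.feature_values = defaultdict(set)                     # observed values per feature
        for x, y in zip(instances, labels):
            self.class_counts[y] += 1
            for i, v in enumerate(x):
                self.value_counts[y][(i, v)] += 1
                self.feature_values[i].add(v)
        self.n = len(labels)
        return self

    def log_posterior(self, x, y):
        # log P(y) + sum_i log P(x_i | y), both estimated with add-one smoothing.
        logp = math.log((self.class_counts[y] + 1.0) / (self.n + len(self.classes)))
        for i, v in enumerate(x):
            # Unseen test values simply fall back to the smoothed numerator of 1.
            numer = self.value_counts[y][(i, v)] + 1.0
            denom = self.class_counts[y] + len(self.feature_values[i])
            logp += math.log(numer / denom)
        return logp

    def predict(self, x):
        # MAP decision: return the class that maximizes the (log) posterior.
        return max(self.classes, key=lambda y: self.log_posterior(x, y))
```

A call such as NaiveBayes().fit(train_X, train_y).predict(x) then returns the MAP class for x.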
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Decision Lists </SectionTitle> <Paragraph position="0"> Our decision list (DL) algorithm is based on that described in Collins and Singer (1999). For each available feature f_i in the training data, the learner induces an element of the decision list for each class y. The elements in the list are sorted in decreasing order of the strength associated with each element, which is defined as the conditional probability P(y | f_i) and is estimated from the training data as follows:

P(y \mid f_i) = \frac{N(f_i, y) + a}{N(f_i) + ka},

where N(x) is the frequency of event x in the training data, a is a smoothing parameter, and k is the number of classes. In this paper, k = 2 and we set a to 0.01.</Paragraph> <Paragraph position="1"> A test instance is assigned the class associated with the first element of the list whose predicate is satisfied by the description of the instance.</Paragraph> <Paragraph position="2"> While generative classifiers estimate class densities, discriminative classifiers like decision lists focus on approximating class boundaries. Table 1 provides the justifications for choosing these two learners as components in our single-view, multi-learner bootstrapping algorithm. Based on observations of the coreference task and the features employed by our coreference system, the justifications suggest that the two learners can potentially compensate for each other's weaknesses. (This justifies the use of a decision list as a potential classifier for bootstrapping; see Yarowsky (1995) for details.)</Paragraph> </Section> <Section position="3" start_page="2" end_page="3" type="sub_section"> <SectionTitle> Table 1: Observations and Justifications </SectionTitle> <Paragraph position="0"> The justifications for choosing the underlying learning algorithms for bootstrapping coreference classifiers are based on the corresponding observations on the coreference task and the features used by the coreference system.</Paragraph> <Paragraph position="1"> Observation: Many feature-value pairs alone can determine the class value. For example, two NPs cannot be coreferent if they differ in gender or semantic class. Justification: Decision lists draw a decision boundary based on a single feature-value pair and can take advantage of this observation directly. On the other hand, naive Bayes classifiers make a decision based on a combination of features and thus cannot take advantage of this observation directly.</Paragraph> <Paragraph position="2"> Observation: The class distributions in coreference data sets are skewed; specifically, the fact that most NP pairs in a document are not coreferent implies that the negative instances grossly outnumber the positives. Justification: Naive Bayes classifiers are fairly resistant to class skewness, which can only exert its influence on classifier prediction via the class priors. On the other hand, decision lists suffer from skewed class distributions: elements corresponding to the negative class tend to aggregate towards the beginning of the list, causing the classifier to perform poorly on the minority class.</Paragraph> <Paragraph position="3"> Observation: Many instances contain redundant information as far as classification is concerned. For example, two NPs may differ in both gender and semantic class, but knowing one of these two differences is sufficient for determining the class value. Justification: Both naive Bayes classifiers and decision lists can take advantage of data redundancy. Frequency counts of feature-value pairs in these classifiers are updated independently, and thus a single instance can possibly contribute to the discovery of more than one useful feature-value pair. In contrast, some classifiers such as decision trees cannot take advantage of this redundancy because of their intrinsic nature of recursive data partitioning.</Paragraph> </Section> </Section> <Section position="5" start_page="3" end_page="4" type="metho"> <SectionTitle> 4 Multi-View Co-Training </SectionTitle> <Paragraph position="0"> In this section, we describe the Blum and Mitchell (B&M) multi-view co-training algorithm and apply it to coreference resolution.</Paragraph> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.1 The Multi-View Co-Training Algorithm </SectionTitle> <Paragraph position="0"> The intuition behind the B&M co-training algorithm is to train two classifiers that can help augment each other's labeled data by exploiting two separate but redundant views of the data. Specifically, each classifier is trained using one view of the labeled data and predicts labels for all instances in the data pool, which consists of a randomly chosen subset of the unlabeled data. Each then selects its most confident predictions and adds the corresponding instances, with their predicted labels, to the labeled data while maintaining the class distribution in the labeled data.</Paragraph> <Paragraph position="1"> The number of instances to be added to the labeled data by each classifier at each iteration is limited by a pre-specified growth size to ensure that only the instances that have a high probability of being assigned the correct label are incorporated. The data pool is replenished with instances from the unlabeled data and the process is repeated.</Paragraph> <Paragraph position="2"> During testing, each classifier makes an independent decision for a test instance. In this paper, the decision associated with the higher confidence is taken to be the final prediction for the instance.</Paragraph> </Section>
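The following sketch shows one way the B&M-style co-training loop of Section 4.1 can be organized. It is not the authors' implementation: the fit/predict_proba classifier interface, the "+"/"-" labels, and the simplified treatment of the class-distribution constraint (noted in a comment) are assumptions made for illustration.

```python
import random

def cotrain(clf1, clf2, view1, view2, labeled, unlabeled,
            pool_size=500, growth_size=50, iterations=100):
    """Sketch of B&M-style multi-view co-training.

    labeled   : list of (instance, label) pairs, label in {"+", "-"}
    unlabeled : list of instances
    view1/2   : functions mapping an instance to that view's feature vector
    clf1/2    : classifiers with fit(X, y) and predict_proba(X) returning
                per-instance [P(-), P(+)] (assumed interface)
    """
    unlabeled = list(unlabeled)
    random.shuffle(unlabeled)
    pool = [unlabeled.pop() for _ in range(min(pool_size, len(unlabeled)))]

    for _ in range(iterations):
        # Train each classifier on its own view of the shared labeled data.
        y = [lab for _, lab in labeled]
        clf1.fit([view1(x) for x, _ in labeled], y)
        clf2.fit([view2(x) for x, _ in labeled], y)

        # Each classifier adds its most confidently labeled pool instances.
        # (The full algorithm also keeps the class distribution of the labeled
        # data fixed, which is omitted here for brevity.)
        newly_used = set()
        for clf, view in ((clf1, view1), (clf2, view2)):
            probs = clf.predict_proba([view(x) for x in pool])
            ranked = sorted(range(len(pool)), key=lambda i: max(probs[i]), reverse=True)
            for i in ranked[:growth_size]:
                if i in newly_used:
                    continue
                label = "+" if probs[i][1] >= probs[i][0] else "-"
                labeled.append((pool[i], label))
                newly_used.add(i)

        # Replenish the data pool from the unlabeled data.
        pool = [x for i, x in enumerate(pool) if i not in newly_used]
        while len(pool) < pool_size and unlabeled:
            pool.append(unlabeled.pop())

    return clf1, clf2
```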
<Section position="2" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 4.2 Experimental Setup </SectionTitle> <Paragraph position="0"> One of the goals of the experiments is to enable a fair comparison of the multi-view algorithm with our single-view bootstrapping algorithm. Since the B&M co-training algorithm is sensitive not only to the views employed but also to other input parameters such as the pool size and the growth size (Pierce and Cardie, 2001), we evaluate the algorithm under different parameter settings, as described below.</Paragraph> <Paragraph position="1"> Evaluation. We use the MUC-6 (1995) and MUC-7 (1998) coreference data sets for evaluation. The training set is composed of 30 &quot;dry run&quot; texts, from which 491659 and 482125 NP pair instances are generated for the MUC-6 and MUC-7 data sets, respectively. Unlike Ng and Cardie (2003), where we choose one of the dry run texts (contributing approximately 3500-3700 instances) to form the labeled data set, here we randomly select 1000 instances. The remaining instances are used as unlabeled data. Testing is performed by applying the bootstrapped coreference classifier and the clustering algorithm described in section 2 to the 20-30 &quot;formal evaluation&quot; texts for each of the MUC-6 and MUC-7 data sets.</Paragraph> <Paragraph position="2"> Two sets of experiments are conducted, one using naive Bayes as the underlying supervised learning algorithm and the other using the decision list learner. All results reported are averages across five runs.</Paragraph> <Paragraph position="3"> Co-training parameters. The co-training parameters are set as follows.</Paragraph> <Paragraph position="4"> Views. We used three methods to generate the views from the 25 features used by the coreference system: Mueller et al.'s (2002) greedy method, random splitting of features into views, and splitting of features according to the feature type (i.e., lexico-syntactic vs. non-lexico-syntactic features). Space limitation precludes a detailed description of these methods; see Ng and Cardie (2003) for details.</Paragraph> <Paragraph position="5"> Pool size. We tested values of 500, 1000, and 5000.</Paragraph> <Paragraph position="6"> Growth size. We tested values of 10, 50, 100, and 200.</Paragraph> </Section> <Section position="3" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 4.3 Results and Discussion </SectionTitle> <Paragraph position="0"> Results are shown in Table 2, where performance is reported in terms of recall, precision, and F-measure using the model-theoretic MUC scoring program (Vilain et al., 1995); except for the baselines, the best results (F-measure) achieved by the algorithms are shown. The baseline coreference system, which is trained only on the initially labeled data using all of the features, achieves an F-measure of 51.6 (NB) and 28.7 (DL) on the MUC-6 data set and 40.1 (NB) and 45.8 (DL) on MUC-7.</Paragraph> <Paragraph position="1"> The results shown in row 2 of Table 2 correspond to the best F-measure scores achieved by co-training across all of the parameter combinations described in the previous subsection.
In comparison to the baseline, co-training is able to improve system performance in only two of the four classifier/data set combinations: F-measure increases by 2% and 6% for MUC-6/DL and MUC-7/NB, respectively. Nevertheless, co-training produces high-precision classifiers in all four cases (at the expense of recall). In practical applications in which precision is critical, the co-training classifiers may be preferable to the baseline classifiers despite the fact that they achieve similar F-measure scores.</Paragraph> <Paragraph position="2"> Figure 1 depicts the learning curve for the co-training run that gives rise to the best F-measure for the MUC-6 data set using naive Bayes (pool size = 500, growth size = 50, views formed by randomly splitting the features). The horizontal (dotted) line shows the performance of the baseline system, as described above. As co-training progresses, F-measure rises to 48.7 at iteration ten and gradually drops to and stabilizes at 42.9. We observe similar performance trends for the other classifier/data set combinations. The drop in F-measure is potentially due to the pollution of the labeled data by mislabeled instances (Pierce and Cardie, 2001).</Paragraph> </Section> </Section> <Section position="6" start_page="4" end_page="4" type="metho"> <SectionTitle> 5 Single-View Bootstrapping </SectionTitle> <Paragraph position="0"> In this section, we describe and evaluate our single-view, multi-learner bootstrapping algorithm, which combines ideas from Goldman and Zhou (2000) and Steedman et al. (2003b). We will start by giving an overview of these two co-training algorithms.</Paragraph> </Section> <Section position="7" start_page="4" end_page="5" type="metho"> <SectionTitle> 5.1 Related Work </SectionTitle> <Paragraph position="0"> The Goldman and Zhou (G&Z) Algorithm.</Paragraph> <Paragraph position="1"> This single-view algorithm begins by training two classifiers on the initially labeled data using two different learning algorithms; it requires that each classifier partition the instance space into a set of equivalence classes (e.g., in a decision tree, each leaf node defines an equivalence class). Each classifier then considers each equivalence class and uses hypothesis testing to determine whether adding all unlabeled instances within the equivalence class to the other classifier's labeled data will improve the performance of its counterpart. The process is then repeated until no more instances can be labeled.</Paragraph> <Paragraph position="2"> The Steedman et al. (Ste) Algorithm. This algorithm is a variation of B&M applied to two diverse statistical parsers. Initially, each parser is trained on the labeled data. Each then parses and scores all sentences in the data pool, and adds the most confidently parsed sentences to the training data of the other parser. The parsers are retrained, and the process is repeated for several iterations.</Paragraph> <Paragraph position="3"> The algorithm differs from B&M in three main respects. First, the training data of the two parsers diverge after the first co-training iteration. Second, the data pool is flushed and refilled entirely with instances from the unlabeled data after each iteration.</Paragraph> <Paragraph position="4"> This reduces the possibility of having unreliably labeled sentences accumulating in the pool.
Finally, the two parsers, each of which is assumed to hold a unique &quot;view&quot; of the data, are effectively two different learning algorithms.</Paragraph> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.2 Our Single-View Bootstrapping Algorithm </SectionTitle> <Paragraph position="0"> As mentioned before, our algorithm uses two different learning algorithms to train two classifiers on the same set of features (i.e., the full feature set).</Paragraph> <Paragraph position="1"> At each bootstrapping iteration, each classifier labels and scores all instances in the data pool. The highest scored instances labeled by one classifier are added to the training data of the other classifier and vice versa. Since the two classifiers are trained on the same view, it is important to maintain a separate training set for each classifier: this reduces the probability that the two classifiers converge to the same hypothesis at an early stage and hence implicitly increases the ability to bootstrap. As in Ste, the entire data pool is replenished with instances drawn from the unlabeled data after each iteration, and the process is repeated. Our algorithm is thus effectively Ste applied to coreference resolution: instead of two parsing algorithms that correspond to different features, we use two learning algorithms, each of which relies on the same set of features, as in G&Z. The similarities and differences among B&M, G&Z, Ste, and our algorithm are summarized in Table 3: B&M bootstraps from different views, G&Z and our algorithm from different learners, and Ste from different parsers.</Paragraph> </Section>
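As a concrete illustration of the single-view, multi-learner loop of Section 5.2, here is a hedged sketch. The fit/predict_proba classifier interface, the label encoding, and the fixed iteration count are assumptions, not the authors' code; the two classifiers would be instantiated with, e.g., the naive Bayes and decision list learners of Section 3.

```python
import random

def single_view_bootstrap(clf_a, clf_b, labeled_a, labeled_b, unlabeled,
                          pool_size=5000, growth_size=50, iterations=1000):
    """Sketch of the single-view, multi-learner bootstrapping loop.

    clf_a, clf_b        : two different learning algorithms, both trained on the
                          full feature set; assumed to expose fit(X, y) and
                          predict_proba(X) -> per-instance [P(-), P(+)]
    labeled_a, labeled_b: separate labeled sets, lists of (instance, label)
    unlabeled           : list of instances (feature vectors)
    """
    unlabeled = list(unlabeled)
    random.shuffle(unlabeled)

    for _ in range(iterations):
        # The data pool is flushed and refilled from the unlabeled data
        # after each iteration (as in Ste).
        pool, unlabeled = unlabeled[:pool_size], unlabeled[pool_size:]
        if not pool:
            break

        clf_a.fit([x for x, _ in labeled_a], [y for _, y in labeled_a])
        clf_b.fit([x for x, _ in labeled_b], [y for _, y in labeled_b])

        # Each classifier labels and scores the whole pool; its most confident
        # predictions are added to the OTHER classifier's labeled data.
        for src, dst_labeled in ((clf_a, labeled_b), (clf_b, labeled_a)):
            probs = src.predict_proba(pool)
            ranked = sorted(range(len(pool)), key=lambda i: max(probs[i]), reverse=True)
            for i in ranked[:growth_size]:
                label = "+" if probs[i][1] >= probs[i][0] else "-"
                dst_labeled.append((pool[i], label))

    return clf_a, clf_b
```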
<Section position="2" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 5.3 Results and Discussion </SectionTitle> <Paragraph position="0"> We tested different pool sizes and growth sizes as specified in section 4.2 to determine the best parameter setting for our algorithm. For both data sets, the best F-measure score is achieved using a pool size of 5000 and a growth size of 50. The results under this parameter setting are given in row 3 of Table 2. In comparison to the baseline, we see dramatic improvement in F-measure for both classifiers and both data sets. In addition, we see simultaneous gains in recall and precision in all cases except MUC-7/DL. Furthermore, single-view bootstrapping beats co-training (in terms of F-measure scores) by a large margin in all four cases. These results provide suggestive evidence that single-view, multi-learner bootstrapping might be a better alternative to its multi-view, single-learner counterpart for coreference resolution.</Paragraph> <Paragraph position="1"> The bootstrapping run that corresponds to this parameter setting for the MUC-6 data set using naive Bayes is shown in Figure 2. Again, we see a &quot;typical&quot; bootstrapping curve: an initial rise in F-measure followed by a gradual deterioration. A comparison with Figure 1 shows that the recall level achieved by co-training is much lower than that achieved by single-view bootstrapping.</Paragraph> <Paragraph position="2"> This appears to indicate that each co-training view is insufficient for learning the target concept: the feature split limits any interaction of features that could produce better recall.</Paragraph> <Paragraph position="3"> Finally, Figure 2 shows that performance increases most rapidly in the first 200 iterations. This provides indirect evidence that the two classifiers have acquired different hypotheses from the initial data and are exchanging information with each other. To ensure that the classifiers are indeed benefiting from each other, we conducted a self-training experiment for each classifier separately: at each self-training iteration, each classifier labels all 5000 instances in the data pool using all available features and selects the most confidently labeled 50 instances for addition to its labeled data. (Note that this is self-training without bagging, unlike the self-training algorithm discussed in Ng and Cardie (2003).)</Paragraph> <Paragraph position="4"> The best F-measure scores achieved by self-training are shown in the last row of Table 2. Overall, self-training yields only marginal performance gains over the baseline.</Paragraph> <Paragraph position="5"> Nevertheless, self-training outperforms co-training in both cases where naive Bayes is used.</Paragraph> <Paragraph position="6"> While these results seem to suggest that co-training is inherently handicapped for coreference resolution, there are two plausible explanations against this conclusion. First, the fact that self-training has access to all of the available features may account for its superior performance to co-training. This is again partially supported by the fact that the recall level achieved by co-training is lower than that of self-training in both cases in which self-training outperforms co-training. Second, 1000 instances may simply not be sufficient for co-training to be effective for this task: in related work (Ng and Cardie, 2003), we find that starting with 3500-3700 labeled instances instead of 1000 allows co-training to improve the baseline by 4.6% and 9.5% in F-measure using naive Bayes classifiers for the MUC-6 and MUC-7 data sets, respectively.</Paragraph> </Section> </Section> <Section position="8" start_page="5" end_page="6" type="metho"> <SectionTitle> 6 An Alternative Ranking Method </SectionTitle> <Paragraph position="0"> As we have seen before, F-measure scores ultimately decrease as bootstrapping progresses. If the drop were caused by the degradation in the quality of the bootstrapped data, then a more &quot;conservative&quot; instance selection method than that of B&M would help alleviate this problem. Our hypothesis is that selection methods that are based solely on the confidence assigned to an instance by a single classifier may be too liberal. In particular, these methods allow the addition of instances with opposing labels to the labeled data; this can potentially result in increased incompatibility between the classifiers. Consequently, we develop a new procedure for ranking instances in the data pool. The bootstrapping algorithm then selects the highest ranked instances to add to the labeled data in each iteration.</Paragraph> <Paragraph position="1"> The method favors instances whose label is agreed upon by both classifiers (Preference 1). However, incorporating instances that are confidently labeled by both classifiers may reduce the probability of acquiring new information from the data. Therefore, the method imposes an additional preference for instances that are confidently labeled by one classifier but not both (Preference 2). If none of the instances receives the same label from the two classifiers, the method resorts to the &quot;rank-by-confidence&quot; method used by B&M (Preference 3).</Paragraph> <Paragraph position="2"> More formally, define a binary classifier as a function that maps an instance to a value that indicates the probability that the instance is labeled as positive, and let u be a function that rounds a number to its nearest integer. Given two binary classifiers C1 and C2, the ranking method shown in Figure 3 uses the three preferences described above to impose a partial ordering on the instances to be selected and added to the training set of one of the classifiers; instances to be added to the other classifier's training set are ranked in the same way, with the roles of the two classifiers reversed.</Paragraph>
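Because Figure 3 itself is not reproduced in this extract, the sketch below is only one plausible reading of the three preferences: the confidence threshold, the teacher/student naming, and the ordering within the agreed group are assumptions rather than the paper's exact procedure.

```python
def rank_instances(teacher, student, pool, conf_threshold=0.9):
    """Sketch of the three-preference ranking of Section 6.

    teacher, student : functions mapping an instance to P(positive); the
                       teacher's predictions label instances added to the
                       student's labeled data (assumed interface)
    pool             : candidate instances from the data pool
    conf_threshold   : what counts as "confidently labeled" (an assumption)
    Returns the pool sorted from most to least preferred.
    """
    u = lambda p: int(round(p))          # u: round a probability to 0 or 1
    conf = lambda p: abs(p - 0.5) * 2.0  # confidence proxy in [0, 1] (assumption)

    if not any(u(teacher(x)) == u(student(x)) for x in pool):
        # Preference 3: no instance receives the same label from both
        # classifiers, so fall back to B&M-style rank-by-confidence.
        return sorted(pool, key=lambda x: conf(teacher(x)), reverse=True)

    def key(x):
        pt, ps = teacher(x), student(x)
        agree = u(pt) == u(ps)                                    # Preference 1
        one_confident = (conf(pt) >= conf_threshold) != (conf(ps) >= conf_threshold)  # Preference 2
        # Remaining ties broken by the teacher's confidence (an assumption).
        return (agree, agree and one_confident, conf(pt))

    return sorted(pool, key=key, reverse=True)
```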
<Paragraph position="3"> Steedman et al. (2003a) also investigate instance selection methods for co-training, but their goal is primarily to use selection methods as a means to explore the trade-off between maximizing coverage and maximizing accuracy. In contrast, our focus here is on examining whether a more conservative ranking method can alleviate the problem of performance deterioration. Nevertheless, Preference 2 is inspired by their S_int-n selection method, which selects an instance if it belongs to the intersection of the set of the n percent highest scoring instances of one classifier and the set of the n percent lowest scoring instances of the other. To our knowledge, no previous work has examined a ranking method that combines the three preferences described above.</Paragraph> <Paragraph position="4"> To compare our ranking procedure with B&M's rank-by-confidence method, we repeat the bootstrapping experiment shown in Figure 2 except that we replace B&M's ranking method with ours. The learning curves generated using the two ranking methods with naive Bayes for the MUC-6 data set (pool size = 5000, growth size = 50) are shown in Figure 4. The results are consistent with our intuition regarding the two ranking methods. The B&M ranking method is more liberal.</Paragraph> <Paragraph position="5"> In particular, each classifier always selects the most confidently labeled instances to add to the other's labeled data at each iteration. If the underlying learners have indeed induced two different hypotheses from the data, then each classifier can potentially acquire informative instances from the other and yield performance improvements very rapidly.</Paragraph> <Paragraph position="6"> In contrast, our ranking method is more conservative in that it places more emphasis on maintaining labeled data accuracy than the B&M method does. As a result, the classifier learns at a slower rate than in the B&M case: it is not until iteration 600 that we see a sharp rise in F-measure. Due to the &quot;liberal&quot; nature of the B&M method, however, its performance drops dramatically as bootstrapping progresses, whereas ours just dips temporarily.
This can potentially be attributed to the more rapid injection of mislabeled instances into the labeled data in the B&M case. At iteration 2800, our method starts to outperform B&M's. Overall, our ranking method does not exhibit the performance trend observed with the B&M method: except for the spike between iterations 0 and 100, F-measure does not deteriorate as bootstrapping progresses. Since it is hard to determine a &quot;good&quot; stopping point for bootstrapping due to the paucity of labeled data in a weakly supervised setting, our ranking method can potentially serve as an alternative to the B&M method.</Paragraph> </Section> </Paper>