<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1042"> <Title>Uncertainty Reduction in Collaborative Bootstrapping: Measure and Algorithm</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Collaborative Bootstrapping and Uncertainty Reduction </SectionTitle> <Paragraph position="0"> We consider the collaborative bootstrapping problem. Let $X$ denote a set of instances (feature vectors) and let $Y$ denote a set of labels (classes). Given a number of labelled instances, we are to construct a function $h: X \rightarrow Y$. We also refer to it as a classifier. In collaborative bootstrapping, we consider the use of two partial functions $h_1$ and $h_2$, which either output a class label or a special symbol $\perp$ denoting 'no decision'.</Paragraph> <Paragraph position="1"> Co-training and bilingual bootstrapping are two examples of collaborative bootstrapping.</Paragraph> <Paragraph position="2"> In co-training, the two collaborating classifiers are assumed to be based on two different views, namely two different subsets of the entire feature set. Formally, the two views are respectively interpreted as two functions $X_1(x)$ and $X_2(x)$, $\forall x \in X$. Thus, the two collaborating classifiers $h_1$ and $h_2$ in co-training can be respectively represented as $h_1(X_1(x))$ and $h_2(X_2(x))$.</Paragraph> <Paragraph position="3"> In bilingual bootstrapping, a number of classifiers are created in the two languages. The classes of the classifiers correspond to word senses and do not overlap, as shown in Figure 1. For example, the classifier $h_1(x|E_1)$ in language 1 takes sense 2 and sense 3 as classes. The classifier $h_2(x|C_1)$ in language 2 takes sense 1 and sense 2 as classes, and the classifier $h_2(x|C_2)$ takes sense 3 and sense 4 as classes. Here we use $E_1, C_1, C_2$ to denote different words in the two languages. Collaborative bootstrapping is performed between the classifiers $h_1(\cdot)$ in language 1 and the classifiers $h_2(\cdot)$ in language 2 (see Li and Li, 2002 for details). For the classifier $h_1(x|E_1)$ in language 1, we assume that there is a pseudo classifier $h_2(x|C_1,C_2)$ in language 2, which functions as a collaborator of $h_1(x|E_1)$. The pseudo classifier $h_2(x|C_1,C_2)$ is based on $h_2(x|C_1)$ and $h_2(x|C_2)$, and takes sense 2 and sense 3 as classes. Formally, the two collaborating classifiers (one real classifier and one pseudo classifier) in bilingual bootstrapping are respectively represented as $h_1(x|E)$ and $h_2(x|C)$, $\forall x \in X$.</Paragraph> <Paragraph position="4"> Next, we introduce the notion of uncertainty reduction in collaborative bootstrapping.</Paragraph> <Paragraph position="5"> Definition 1 The uncertainty $U(h)$ of a classifier $h$ is defined as:</Paragraph> <Paragraph position="6"> $U(h) = P(\{x \mid h(x) = \perp, x \in X\})$ (1)</Paragraph> <Paragraph position="7"> In practice, we define $U(h)$ as $U(h) = P(\{x \mid C(h(x) = y) < \theta, \forall y \in Y, x \in X\})$ (2) where $\theta$ denotes a predetermined threshold and $C(\cdot)$ denotes the confidence score of the classifier $h$.</Paragraph> <Paragraph position="8"> Definition 2 The conditional uncertainty $U(h|y)$ of a classifier $h$ given a class $y$ is defined as:</Paragraph> <Paragraph position="9"> $U(h|y) = P(\{x \mid h(x) = \perp, x \in X\} \mid y)$ (3)</Paragraph> <Paragraph position="10"> We note that the uncertainty (or conditional uncertainty) of a classifier (a partial function) is an indicator of the accuracy of the classifier.
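For concreteness, $U(h)$ in Equation (2) and $U(h|y)$ in Definition 2 can be estimated on a sample as in the following minimal sketch (ours, not part of the paper); the names confidence_scores and labels, the data layout, and the default threshold are illustrative assumptions.

    from collections import defaultdict

    def uncertainty(confidence_scores, theta=0.8):
        """Estimate U(h) as in Equation (2): the fraction of instances for which
        the classifier's confidence stays below theta for every class."""
        uncertain = [scores for scores in confidence_scores
                     if max(scores.values()) < theta]
        return len(uncertain) / len(confidence_scores)

    def conditional_uncertainty(confidence_scores, labels, theta=0.8):
        """Estimate U(h|y) for each true class y (Definition 2)."""
        total = defaultdict(int)
        uncertain = defaultdict(int)
        for scores, y in zip(confidence_scores, labels):
            total[y] += 1
            if max(scores.values()) < theta:
                uncertain[y] += 1
        return {y: uncertain[y] / total[y] for y in total}

    # confidence_scores: one dict of class->confidence per instance, e.g.
    # [{'sense1': 0.9, 'sense2': 0.1}, {'sense1': 0.55, 'sense2': 0.45}, ...]
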
Let us consider an ideal case in which the classifier achieves 100% accuracy when it can make a classification decision and achieves 50% accuracy when it cannot (assume that there are only two classes).</Paragraph> <Paragraph position="11"> Thus, the total accuracy on the entire data space is $1 - 0.5 \cdot U(h)$.</Paragraph> <Paragraph position="12"> Definition 3 Given the two classifiers $h_1$ and $h_2$ in collaborative bootstrapping, the uncertainty reduction of $h_1$ with respect to $h_2$ (denoted as $UR(h_1 \backslash h_2)$) is defined as:</Paragraph> <Paragraph position="13"> $UR(h_1 \backslash h_2) = P(\{x \mid h_1(x) = \perp, h_2(x) \ne \perp, x \in X\})$</Paragraph> <Paragraph position="14"> Similarly, we have $UR(h_2 \backslash h_1) = P(\{x \mid h_2(x) = \perp, h_1(x) \ne \perp, x \in X\})$. Uncertainty reduction is an important factor for determining the performance of collaborative bootstrapping. In collaborative bootstrapping, the more the uncertainty of one classifier can be reduced by the other classifier, the higher the performance that can be achieved by the classifier (the more effective the collaboration is).</Paragraph> </Section> <Section position="5" start_page="0" end_page="21" type="metho"> <SectionTitle> 4 Uncertainty Correlation Coefficient Measure </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Measure </SectionTitle> <Paragraph position="0"> We introduce the measure of uncertainty correlation coefficient (UCC) for collaborative bootstrapping. Definition 4 Given the two classifiers $h_1$ and $h_2$, the conditional uncertainty correlation coefficient (CUCC) between $h_1$ and $h_2$ given a class $y$ (denoted as $r^y_{h_1h_2}$) is defined as $r^y_{h_1h_2} = \frac{P(h_1(x) = \perp, h_2(x) = \perp \mid y)}{P(h_1(x) = \perp \mid y) \cdot P(h_2(x) = \perp \mid y)}$. A corresponding unconditional measure, the uncertainty correlation coefficient (UCC) $r_{h_1h_2}$, is defined over the entire data space (Definition 5). The CUCC (UCC) indicates how strongly the uncertainties of the two classifiers are related. If UCC is high, then there is a large portion of instances which are uncertain for both of the classifiers. Note that UCC is a symmetric measure from both classifiers' perspectives, while UR is an asymmetric measure from one classifier's perspective (either $UR(h_1 \backslash h_2)$ or $UR(h_2 \backslash h_1)$).</Paragraph> </Section> <Section position="2" start_page="0" end_page="21" type="sub_section"> <SectionTitle> 4.2 Theoretical Analysis </SectionTitle> <Paragraph position="0"> Theorem 1 reveals the relationship between the CUCC (UCC) measure and uncertainty reduction.</Paragraph> <Paragraph position="1"> Assume that the classifier $h_1$ can collaborate with either of the two classifiers $h_2$ and $h_2'$. The two classifiers $h_2$ and $h_2'$ have equal conditional uncertainties. The CUCC values between $h_1$ and $h_2'$ are smaller than the CUCC values between $h_1$ and $h_2$. Then, according to Theorem 1, $h_1$ should collaborate with $h_2'$, because $h_2'$ can help reduce its uncertainty more and thus improve its accuracy more.</Paragraph> <Paragraph position="2"> Theorem 1 Given the two classifier pairs $(h_1, h_2)$ and $(h_1, h_2')$, if $r^y_{h_1h_2'} \le r^y_{h_1h_2}$ and $U(h_2'|y) = U(h_2|y)$ for any class $y$, then $UR(h_1 \backslash h_2') \ge UR(h_1 \backslash h_2)$. Proof: We can decompose the uncertainty $U(h_1)$ of $h_1$ as follows: $U(h_1) = \sum_{y} P(y) U(h_1|y)$. By Definitions 3 and 4, $UR(h_1 \backslash h_2) = \sum_{y} P(y) U(h_1|y)(1 - r^y_{h_1h_2} U(h_2|y))$; since $r^y_{h_1h_2'} \le r^y_{h_1h_2}$ and $U(h_2'|y) = U(h_2|y)$ for every $y$, each term of the sum for $h_2'$ is at least as large as the corresponding term for $h_2$, and the result follows. Theorem 1 states that the lower the CUCC values are, the higher the performance that can be achieved in collaborative bootstrapping.</Paragraph> <Paragraph position="3"> Definition 6 The two classifiers in co-training are said to satisfy the view independence assumption (Blum and Mitchell, 1998), if the following equations hold for any class $y$: $P(X_1 = x_1 \mid Y = y, X_2 = x_2) = P(X_1 = x_1 \mid Y = y)$ and $P(X_2 = x_2 \mid Y = y, X_1 = x_1) = P(X_2 = x_2 \mid Y = y)$.</Paragraph> <Paragraph position="4"> Theorem 2 indicates that in co-training with view independence, the CUCC values ($r^y_{h_1h_2}$, $\forall y$) are small: under view independence the joint probability of both classifiers being uncertain factorizes, so $r^y_{h_1h_2} = 1$ for any class $y$. According to Theorem 1, it is then easy to reduce the uncertainties of the classifiers. That is to say, co-training with view independence can perform well.</Paragraph>
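For illustration, the UR and CUCC measures defined above can be estimated from held-out data as in the following minimal sketch (ours, not part of the paper); the data layout (per-instance boolean 'uncertain' indicators plus gold labels) and all names are assumptions.

    def uncertainty_reduction(unc1, unc2):
        """UR(h1\h2), Definition 3: fraction of instances that are uncertain
        for h1 but not for h2. unc1, unc2 are lists of booleans."""
        n = len(unc1)
        return sum(1 for u1, u2 in zip(unc1, unc2) if u1 and not u2) / n

    def cucc(unc1, unc2, labels, y):
        """r^y_{h1h2}, Definition 4: P(both uncertain | y) divided by the
        product P(h1 uncertain | y) * P(h2 uncertain | y)."""
        idx = [i for i, label in enumerate(labels) if label == y]
        p1 = sum(unc1[i] for i in idx) / len(idx)
        p2 = sum(unc2[i] for i in idx) / len(idx)
        p12 = sum(unc1[i] and unc2[i] for i in idx) / len(idx)
        return p12 / (p1 * p2) if p1 > 0 and p2 > 0 else float('nan')

Under view independence the joint probability in the numerator factorizes, so the estimate tends to 1, which is the behaviour Theorem 2 predicts and Section 4.3 observes empirically.
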
<Paragraph position="5"> How to conduct a theoretical evaluation of the CUCC measure in bilingual bootstrapping is still an open problem.</Paragraph> </Section> <Section position="3" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 4.3 Experimental Results </SectionTitle> <Paragraph position="0"> We conducted experiments to empirically evaluate the UCC values of collaborative bootstrapping. We also investigated the relationship between UCC and accuracy. The results indicate that the theoretical analysis in Section 4.2 is correct.</Paragraph> <Paragraph position="1"> In the experiments, we define accuracy as the percentage of instances whose assigned labels agree with their 'true' labels. Moreover, when we refer to UCC, we mean the UCC value on the test data. We set the value of $\theta$ in Equation (2) to 0.8.</Paragraph> </Section> <Section position="4" start_page="21" end_page="21" type="sub_section"> <SectionTitle> Co-Training for Artificial Data Classification </SectionTitle> <Paragraph position="0"> We used the data in (Nigam and Ghani, 2000) to conduct co-training. We utilized the articles from four newsgroups (see Table 1). Each group had 1000 texts.</Paragraph> <Paragraph position="1"> By joining together randomly selected texts from each of the two newsgroups in the first row as positive instances, and joining together randomly selected texts from each of the two newsgroups in the second row as negative instances, we created a two-class classification data set with view independence. The joining was performed under the condition that the words in the two newsgroups in the first column came from one vocabulary, while the words in the newsgroups in the second column came from the other vocabulary.</Paragraph> <Paragraph position="2"> We also created a set of classification data without view independence. To do so, we randomly split all the features of the pseudo texts into two subsets such that each of the subsets contained half of the features.</Paragraph> <Paragraph position="3"> We next applied the co-training algorithm to the two data sets.</Paragraph> <Paragraph position="4"> We conducted the same pre-processing in the two experiments. We discarded the header of each text, removed stop words from each text, and made each text have the same length, as was done in (Nigam and Ghani, 2000). We discarded 18 texts from the entire 2000 texts, because their main contents were binary codes, encoding errors, etc.</Paragraph> <Paragraph position="5"> We randomly separated the data and performed co-training with the random feature split and with the natural feature split five times. The results (cf. Table 2) were thus averaged over five trials. In each trial, we used 3 texts for each class as labelled training instances, 976 texts as testing instances, and the remaining 1000 texts as unlabelled training instances.</Paragraph> <Paragraph position="6"> From Table 2, we see that the UCC value of the natural split (in which view independence holds) is lower than that of the random split (in which view independence does not hold). That is to say, with the natural split there are fewer instances which are uncertain for both of the classifiers. The accuracy of the natural split is higher than that of the random split. Theorem 1 states that the lower the CUCC values are, the higher the performance that can be achieved. The results in Table 2 agree with the claim of Theorem 1. (Note that it is easier to use CUCC for theoretical analysis, but easier to use UCC for empirical analysis.)</Paragraph>
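The data construction just described can be sketched as follows (our own illustration; the function and variable names are made up): each pseudo text joins one text from each of two newsgroups, which gives a natural two-view split, and a random split is obtained by shuffling the full vocabulary.

    import random

    def make_two_view_instance(text_a, text_b):
        """Join one text from each newsgroup; the words of text_a form view 1
        and the words of text_b form view 2 (natural split)."""
        return {'view1': text_a.split(), 'view2': text_b.split()}

    def random_feature_split(vocabulary, seed=0):
        """Randomly split the whole feature set into two halves (random split),
        so that view independence no longer holds by construction."""
        features = list(vocabulary)
        random.Random(seed).shuffle(features)
        half = len(features) // 2
        return set(features[:half]), set(features[half:])
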
<Paragraph position="7"> We also see that the UCC value of the natural split (view independence) is about 1.0. This result agrees with Theorem 2.</Paragraph> <Paragraph position="8"> Co-Training for Web Page Classification We used the same data as in (Blum and Mitchell, 1998) to perform co-training for web page classification. The web page data consisted of 1051 web pages collected from the computer science departments of four universities. The goal of classification was to determine whether a web page was concerned with an academic course. 22% of the pages were actually related to academic courses. The features of each page could be separated into two independent parts: one part consisted of the words occurring in the page itself, and the other part consisted of the words occurring in the anchor texts of links pointing to the page.</Paragraph> <Paragraph position="9"> We randomly split the data into three subsets: a labelled training set, an unlabelled training set, and a test set. The labelled training set had 3 course pages and 9 non-course pages. The test set had 25% of the pages. The unlabelled training set had the remaining pages. The setting for the experiment was almost the same as that of Nigam and Ghani's. One exception was that we did not conduct feature selection, because we were not able to follow their method from their paper.</Paragraph> <Paragraph position="10"> We repeated the experiment five times and evaluated the results in terms of UCC and accuracy. Table 3 shows the average accuracy and UCC value over the five trials.</Paragraph> </Section> <Section position="5" start_page="21" end_page="21" type="sub_section"> <SectionTitle> Bilingual Bootstrapping </SectionTitle> <Paragraph position="0"> We also used the same data as in (Li and Li, 2002) to conduct bilingual bootstrapping and word sense disambiguation.</Paragraph> <Paragraph position="1"> The sense disambiguation data were related to seven ambiguous English words, each having two Chinese translations. The goal was to determine the correct Chinese translations of the ambiguous English words, given English sentences containing the ambiguous words.</Paragraph> <Paragraph position="2"> For each word, there were two seed words used as labelled instances for training, a large number of unlabelled instances (sentences) in both English and Chinese for training, and about 200 labelled instances (sentences) for testing. Details of the data are shown in Table 4.</Paragraph> <Paragraph position="3"> We used the data to perform bilingual bootstrapping and word sense disambiguation. The setting for the experiment was exactly the same as that of Li and Li's. Table 3 shows the accuracy and UCC value for each word.</Paragraph> <Paragraph position="4"> From Table 3 we see that both co-training and bilingual bootstrapping have low UCC values (around 1.0). With lower UCC (CUCC) values, higher performances can be achieved, according to Theorem 1. Their accuracies are indeed high.
Note that since the features and classes for each word in bilingual bootstrapping and those for web page classification in co-training are different, it is not meaningful to directly compare their UCC values.</Paragraph> </Section> </Section> <Section position="6" start_page="21" end_page="21" type="metho"> <SectionTitle> 5 Uncertainty Reduction Algorithm </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 5.1 Algorithm </SectionTitle> <Paragraph position="0"> We propose a new algorithm for collaborative bootstrapping (both co-training and bilingual bootstrapping). In the algorithm, the collaboration between the classifiers is driven by uncertainty reduction. Specifically, one classifier always selects the unlabelled instances that are most uncertain for it and asks the other classifier to label them. Thus, the two classifiers can help each other more effectively.</Paragraph> <Paragraph position="1"> There exists, therefore, a similarity between our algorithm and active learning. In active learning the learner always asks the supervisor to label the examples that are most uncertain for it, while in our algorithm one classifier always asks the other classifier to label the examples that are most uncertain for it. Figure 2 shows the algorithm: Input: a set of labelled instances and a set of unlabelled instances. Loop while there exist unlabelled instances { Create classifier $h_1$ using the labelled instances; Create classifier $h_2$ using the labelled instances; Pick up $b_y$ unlabelled instances whose labels are most certain for $h_1$ but most uncertain for $h_2$, label them with $h_1$ and add them into the set of labelled instances; Pick up $b_y$ unlabelled instances whose labels are most certain for $h_2$ but most uncertain for $h_1$, label them with $h_2$ and add them into the set of labelled instances; }. Actually, our new algorithm differs from the previous algorithm in only one point; Figure 2 highlights the point in italics. In the previous algorithm, when a classifier labels unlabelled instances, it labels those instances whose labels are most certain for the classifier. In contrast, in our new algorithm, when a classifier labels unlabelled instances, it labels those instances whose labels are most certain for the classifier, but at the same time most uncertain for the other classifier.</Paragraph> <Paragraph position="7"> As one implementation, for each class $y$, $h_1$ first selects its most certain $a_y$ instances, $h_2$ next selects from them its most uncertain $b_y$ instances ($b_y \le a_y$), and finally $h_1$ labels the $b_y$ instances with label $y$ (collaboration in the opposite direction is performed similarly). We use this implementation in our experiments described below.</Paragraph> </Section>
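A compact sketch of this implementation is given below (our own illustration, not the authors' code); the training interface, the confidence method, and the fixed sizes playing the roles of $a_y$ and $b_y$ are all assumptions.

    def collaborative_bootstrap(labeled, unlabeled, classes, train, a=20, b=5):
        """Uncertainty-reduction driven collaborative bootstrapping (cf. Figure 2).
        labeled: list of (x, y) pairs; unlabeled: list of x; train(labeled, view)
        is assumed to fit a classifier exposing a confidence(x, y) method."""
        while unlabeled:
            h1 = train(labeled, view=1)
            h2 = train(labeled, view=2)
            for teacher, student in ((h1, h2), (h2, h1)):
                for y in classes:
                    # the teacher's a_y most certain unlabelled instances for class y
                    certain = sorted(unlabeled,
                                     key=lambda x: -teacher.confidence(x, y))[:a]
                    # among them, the b_y instances the student is most uncertain about
                    picked = sorted(certain,
                                    key=lambda x: max(student.confidence(x, c)
                                                      for c in classes))[:b]
                    labeled.extend((x, y) for x in picked)
                    unlabeled = [x for x in unlabeled if x not in picked]
        return labeled

In the old algorithm the second selection (by the student's uncertainty) is simply dropped, and the teacher labels its own most certain instances.
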
<Section position="2" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 5.2 Experimental Results </SectionTitle> <Paragraph position="0"> We conducted experiments to test the effectiveness of our new algorithm. The experimental results indicate that the new algorithm performs better than the previous algorithm. We refer to them as 'new' and 'old' respectively.</Paragraph> </Section> <Section position="3" start_page="21" end_page="21" type="sub_section"> <SectionTitle> Co-Training for Artificial Data Classification </SectionTitle> <Paragraph position="0"> We used the artificial data in Section 4.3 and conducted co-training with both the old and new algorithms. Table 5 shows the results.</Paragraph> <Paragraph position="1"> We see that in co-training the new algorithm performs as well as the old algorithm when UCC is low (view independence holds), and the new algorithm performs significantly better than the old algorithm when UCC is high (view independence does not hold).</Paragraph> <Paragraph position="2"> Co-Training for Web Page Classification We used the web page classification data in Section 4.3 and conducted co-training using both the old and new algorithms. Table 6 shows the results. We see that the new algorithm performs as well as the old algorithm for this data set. Note that here UCC is low.</Paragraph> </Section> <Section position="4" start_page="21" end_page="21" type="sub_section"> <SectionTitle> Bilingual Bootstrapping </SectionTitle> <Paragraph position="0"> We used the word sense disambiguation data in Section 4.3 and conducted bilingual bootstrapping using both the old and new algorithms. Table 7 shows the results. We see that the performance of the new algorithm is slightly better than that of the old algorithm. Note that here the UCC values are also low.</Paragraph> <Paragraph position="1"> We conclude that for both co-training and bilingual bootstrapping, the new algorithm performs significantly better than the old algorithm when UCC is high, and performs as well as the old algorithm when UCC is low. Recall that when UCC is high, there are more instances which are uncertain for both classifiers, and when UCC is low, there are fewer such instances. Note that in practice it is difficult to find a situation in which UCC is completely low (e.g., one in which the view independence assumption completely holds), and thus the new algorithm should be more useful than the old algorithm in practice. To verify this, we conducted an additional experiment.</Paragraph> <Paragraph position="2"> Again, since the features and classes for each word in bilingual bootstrapping and those for web page classification in co-training are different, it is not meaningful to directly compare their UCC values.</Paragraph> <Paragraph position="3"> Co-Training for News Article Classification In the additional experiment, we used the data from two newsgroups (comp.graphics and comp.os.ms-windows.misc) in the dataset of (Joachims, 1997) to conduct co-training for text classification.</Paragraph> <Paragraph position="4"> There were 1000 texts for each group. We viewed the former group as the positive class and the latter group as the negative class. We applied the new and old algorithms. We conducted 20 trials in the experiment. In each trial we randomly split the data into labelled training, unlabelled training and test data sets. We used 3 texts per class as labelled instances for training, 994 texts for testing, and the remaining 1000 texts as unlabelled instances for training. We performed the same pre-processing as that in (Nigam and Ghani, 2000). Table 8 shows the results of the 20 trials. The accuracies are averaged over each 5 trials. From the table, we see that co-training with the new algorithm significantly outperforms co-training with the old algorithm and also 'single bootstrapping'. Here, 'single bootstrapping' refers to the conventional bootstrapping method in which a single classifier repeatedly boosts its performance with all the features.
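For contrast with the collaborative loop sketched after Section 5.1, single bootstrapping can be written as follows (again a rough sketch with assumed interfaces, not the authors' code): a single classifier trained on all the features repeatedly labels the unlabelled instances it is most confident about and retrains on them.

    def single_bootstrap(labeled, unlabeled, classes, train, b=5):
        """Conventional single bootstrapping: one classifier, all features,
        self-labelling its own most certain unlabelled instances."""
        while unlabeled:
            h = train(labeled)
            scored = sorted(unlabeled,
                            key=lambda x: -max(h.confidence(x, y) for y in classes))
            picked = scored[:b]
            labeled.extend((x, max(classes, key=lambda y: h.confidence(x, y)))
                           for x in picked)
            unlabeled = [x for x in unlabeled if x not in picked]
        return labeled
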
The above experimental results indicate that our new algorithm for collaborative bootstrapping performs significantly better than the old algorithm when the collaboration is difficult, and performs as well as the old algorithm when the collaboration is easy. Therefore, it is better to always employ the new algorithm.</Paragraph> <Paragraph position="5"> Another conclusion from the results is that we can apply our new algorithm to any single bootstrapping problem. More specifically, we can randomly split the feature set and use our algorithm to perform co-training with the split subsets.</Paragraph> </Section> </Section> </Paper>