<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1046">
  <Title>Bootstrapping</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 View Independence
</SectionTitle>
    <Paragraph position="0"> Blum and Mitchell assume that each instance x consists of two "views" x1, x2. We can take this as the assumption of functions X1 and X2 such that X1(x) = x1 and X2(x) = x2. They propose that views are conditionally independent given the label.</Paragraph>
    <Paragraph position="1"> Definition 1 A pair of views x1, x2 satisfy view independence just in case, for all y:
Pr[X1 = x1 | X2 = x2, Y = y] = Pr[X1 = x1 | Y = y]
Pr[X2 = x2 | X1 = x1, Y = y] = Pr[X2 = x2 | Y = y]</Paragraph>
    <Paragraph position="3"> A classification problem instance satisfies view independence just in case all pairs x1, x2 satisfy view independence.</Paragraph>
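    <Paragraph> To make Definition 1 concrete, here is a minimal sketch (mine, not from the paper) of how view independence could be checked on a small labeled sample; the toy data and the decision to test only the X1 direction are illustrative, and the symmetric check for X2 is analogous.
from collections import Counter

def view_independence_violation(samples):
    """samples: list of (x1, x2, y) triples.
    Returns the largest gap |Pr[X1=x1 | X2=x2, Y=y] - Pr[X1=x1 | Y=y]|."""
    joint = Counter(samples)                        # counts of (x1, x2, y)
    by_x2y = Counter((x2, y) for _, x2, y in samples)
    by_x1y = Counter((x1, y) for x1, _, y in samples)
    by_y = Counter(y for _, _, y in samples)
    worst = 0.0
    for (x1, x2, y), c in joint.items():
        p_given_both = c / by_x2y[(x2, y)]          # Pr[X1=x1 | X2=x2, Y=y]
        p_given_y = by_x1y[(x1, y)] / by_y[y]       # Pr[X1=x1 | Y=y]
        worst = max(worst, abs(p_given_both - p_given_y))
    return worst

# Toy sample in which the two views are independent given the label.
data = [("a", "c", "+"), ("a", "d", "+"), ("b", "c", "+"), ("b", "d", "+"),
        ("a", "c", "-"), ("a", "c", "-"), ("b", "c", "-"), ("b", "c", "-")]
print(view_independence_violation(data))            # 0.0 for this sample
</Paragraph>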
    <Paragraph position="4"> There is a related independence assumption that will prove useful. Let us define H1 to consist of rules that are functions of X1 only, and define H2 to consist of rules that are functions of X2 only.</Paragraph>
    <Paragraph position="5"> Definition 2 A pair of rules F ∈ H1, G ∈ H2 satisfies rule independence just in case, for all u, v, y:
Pr[F = u | G = v, Y = y] = Pr[F = u | Y = y]
and similarly for F ∈ H2, G ∈ H1. A classification problem instance satisfies rule independence just in case all opposing-view rule pairs satisfy rule independence.</Paragraph>
    <Paragraph position="8"> If instead of generating H1 and H2 from X1 and X2, we assume a set of features F (which can be thought of as binary rules), and take H1 = H2 = F, then rule independence becomes the Naive Bayes independence assumption.</Paragraph>
    <Paragraph position="11"> The following theorem is not difficult to prove; we omit the proof.</Paragraph>
    <Paragraph position="12"> Theorem 1 View independence implies rule independence.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Rule Independence and Bootstrapping
</SectionTitle>
      <Paragraph position="0"> Blum and Mitchell's paper suggests that rules that agree on unlabelled instances are useful in bootstrapping.</Paragraph>
      <Paragraph position="1"> Definition 3 The agreement rate between rules F and G is the probability that they make the same prediction, Pr[F = G], taken over the instances on which both rules make a prediction.</Paragraph>
      <Paragraph position="3"> Note that the agreement rate between rules makes no reference to labels; it can be determined from unlabeled data.</Paragraph>
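      <Paragraph> As a concrete illustration (my own, not code from the paper), the agreement rate can be estimated from unlabeled data alone by comparing the predictions of the two rules; the feature names and the two toy instances below are hypothetical.
def agreement_rate(rule_f, rule_g, unlabeled, abstain=None):
    """Fraction of unlabeled instances, among those both rules label,
    on which rule_f and rule_g make the same prediction."""
    both = agree = 0
    for x in unlabeled:
        u, v = rule_f(x), rule_g(x)
        if u is not abstain and v is not abstain:
            both += 1
            agree += int(u == v)
    return agree / both if both else 0.0

# Two feature-based rules applied to instances represented as feature sets.
rule_f = lambda x: "+" if "N:New-York" in x else "-"
rule_g = lambda x: "+" if "X:mayor-of" in x else "-"
unlabeled = [{"N:New-York", "X:mayor-of"}, {"N:Metals-Inc", "X:president-of"}]
print(agreement_rate(rule_f, rule_g, unlabeled))     # 1.0: the rules agree on both
</Paragraph>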
      <Paragraph position="4"> The algorithm that Blum and Mitchell describe does not explicitly search for rules with good agreement; nor does agreement rate play any direct role in the learnability proof given in the Blum and Mitchell paper.</Paragraph>
      <Paragraph position="5"> The second lack is emended in (Dasgupta et al., 2001). They show that, if view independence is satisfied, then the agreement rate between opposing-view rules F and G upper bounds the error of F (or G). The following statement of the theorem is simplified and assumes non-abstaining binary rules.</Paragraph>
      <Paragraph position="6"> Theorem 2 For all F ∈ H1, G ∈ H2 that satisfy rule independence and are nontrivial predictors in the sense that minu Pr[F = u] > Pr[F ≠ G], one of the following inequalities holds:
Pr[F ≠ Y] ≤ Pr[F ≠ G]
Pr[F̄ ≠ Y] ≤ Pr[F ≠ G]</Paragraph>
      <Paragraph position="8"> If F agrees with G on all but ε unlabelled instances, then either F or F̄ predicts Y with error no greater than ε. A small amount of labelled data suffices to choose between F and F̄.</Paragraph>
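      <Paragraph> To see the bound in action, here is a small numerical check (my own hypothetical model, not an example from the paper): with F and G conditionally independent given Y, the disagreement rate Pr[F ≠ G] upper bounds the error of F or of its complement.
from itertools import product

p_y = {"+": 0.6, "-": 0.4}        # Pr[Y = y]
p_f_plus = {"+": 0.8, "-": 0.3}   # Pr[F = + | Y = y]
p_g_plus = {"+": 0.75, "-": 0.2}  # Pr[G = + | Y = y]

def prob(f_val, g_val, y):
    pf = p_f_plus[y] if f_val == "+" else 1 - p_f_plus[y]
    pg = p_g_plus[y] if g_val == "+" else 1 - p_g_plus[y]
    return p_y[y] * pf * pg        # rules are conditionally independent given Y

triples = list(product("+-", "+-", "+-"))
disagree = sum(prob(f, g, y) for f, g, y in triples if f != g)    # Pr[F != G] = 0.362
err_f = sum(prob(f, g, y) for f, g, y in triples if f != y)       # Pr[F != Y] = 0.24
err_f_bar = 1 - err_f                                             # error of the complement
# Nontriviality holds (the smaller label probability of F is 0.4, above 0.362),
# and the smaller of the two errors is indeed bounded by the disagreement:
print(disagree >= min(err_f, err_f_bar))                          # True
</Paragraph>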
      <Paragraph position="9"> I give a geometric proof sketch here; the reader is referred to the original paper for a formal proof. Consider figures 1 and 2. In these diagrams, area represents probability. For example, the leftmost box (in either diagram) represents the instances for which Y = +, and the area of its upper left quadrant represents the probability that Y = + and both F and G predict +. In general, in such a diagram either the horizontal or the vertical line is broken, as in figure 2. In the special case in which rule independence is satisfied, both horizontal and vertical lines are unbroken, as in figure 1.</Paragraph>
      <Paragraph position="12"> Theorem 2 states that disagreement upper bounds error. First let us consider a lemma, to wit: disagreement upper bounds minority probabilities. Define the minority value of F given Y = y to be the value of F that is less probable given Y = y, and the minority probability to be the probability of the minority value. (Note that minority probabilities are conditional probabilities, and distinct from the marginal probability minu Pr[F = u] mentioned in the theorem.) In figure 1a, the areas of disagreement are the upper right and lower left quadrants of each box, as marked. The areas of minority values are marked in figure 1b. It should be obvious that the area of disagreement upper bounds the area of minority values.</Paragraph>
      <Paragraph position="15"> The error values of F are the values opposite to the values of Y: the error value is − when Y = + and + when Y = −. When minority values are error values, as in figure 1, disagreement upper bounds error, and theorem 2 follows immediately.</Paragraph>
      <Paragraph position="18"> However, three other cases are possible. One possibility is that minority values are opposite to error values. In this case, the minority values of F̄ are error values, and disagreement between F and G upper bounds the error of F̄.</Paragraph>
      <Paragraph position="20"> This case is admitted by theorem 2. In the final two cases, minority values are the same regardless of the value of Y. In these cases, however, the predictors do not satisfy the "nontriviality" condition of theorem 2, which requires that minu Pr[F = u] be greater than the disagreement between F and G.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 The Unreasonableness of Rule Independence
</SectionTitle>
    <Paragraph position="0"> Rule independence is a very strong assumption; one remarkable consequence will show just how strong it is. The precision of a rule F is defined to be Pr[Y = + | F = +]. (We continue to assume non-abstaining binary rules.) If rule independence holds, knowing the precision of any one rule allows one to exactly compute the precision of every other rule given only unlabeled data and knowledge of the size of the target concept.</Paragraph>
    <Paragraph position="1"> Let F and G be arbitrary rules based on independent views. We first derive an expression for the precision of F in terms of G:
P(Y|F) = Σg P(Y|G = g, F) P(G = g|F)
       = Σg P(Y|G = g) P(G = g|F)
Note that the second line is derived from the first by rule independence.</Paragraph>
    <Paragraph position="3"> To compute the expression on the righthand side of the last line, we require P(Y|G), P(Y), P(G|F), and P(G). The first value, the precision of G, is assumed known. The second value, P(Y), is also assumed known; it can at any rate be estimated from a small amount of labeled data. The last two values, P(G|F) and P(G), can be computed from unlabeled data.</Paragraph>
    <Paragraph position="4"> Thus, given the precision of an arbitrary rule G, we can compute the precision of any other-view rule F. Then we can compute the precision of rules based on the same view as G by using the precision of some other-view rule F. Hence we can compute the precision of every rule given the precision of any one.</Paragraph>
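    <Paragraph> The computation just described is easy to write down; the sketch below (my own, valid only under the rule-independence assumption and the derivation above) propagates the known precision of G to an opposite-view rule F using only P(Y), P(G), and P(G|F), the last two of which come from unlabeled data. The numbers in the usage line are hypothetical.
def propagate_precision(prec_g, p_y, p_g, p_g_given_f):
    """Estimate P(Y=+ | F=+) from the precision of an opposite-view rule G.
    prec_g      = P(Y=+ | G=+)   (assumed known)
    p_y         = P(Y=+)         (known, or estimated from a little labeled data)
    p_g         = P(G=+)         (estimated from unlabeled data)
    p_g_given_f = P(G=+ | F=+)   (estimated from unlabeled data)"""
    # P(Y=+ | G=-) follows from the marginals:
    prec_g_neg = (p_y - prec_g * p_g) / (1 - p_g)
    # Rule independence licenses summing over G's prediction:
    return prec_g * p_g_given_f + prec_g_neg * (1 - p_g_given_f)

print(propagate_precision(prec_g=0.9, p_y=0.186, p_g=0.15, p_g_given_f=0.7))  # about 0.648
</Paragraph>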
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Some Data
</SectionTitle>
    <Paragraph position="0"> The empirical investigations described here and below use the data set of (Collins and Singer, 1999). The task is to classify names in text as person, location, or organization. There is an unlabeled training set containing 89,305 instances, and a labeled test set containing 289 persons, 186 locations, 402 organizations, and 123 "other", for a total of 1,000 instances. Instances are represented as lists of features.</Paragraph>
    <Paragraph position="1"> Intrinsic features are the words making up the name, and contextual features are features of the syntactic context in which the name occurs. For example, consider Bruce Kaplan, president of Metals Inc. This text snippet contains two instances. The first has intrinsic features N:Bruce-Kaplan, C:Bruce, and C:Kaplan ("N" for the complete name, "C" for "contains"), and contextual feature M:president ("M" for "modified by"). The second instance has intrinsic features N:Metals-Inc, C:Metals, C:Inc, and contextual feature X:president-of ("X" for "in the context of").</Paragraph>
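    <Paragraph> For concreteness, a small sketch (mine, not Collins and Singer's code) of how the intrinsic and contextual features of the two instances in the snippet could be constructed:
def name_features(name_tokens, context_feature):
    """Intrinsic features of a name plus one contextual feature:
    N: the complete name, C: each contained word, M:/X: the syntactic context."""
    feats = ["N:" + "-".join(name_tokens)]
    feats += ["C:" + tok for tok in name_tokens]
    feats.append(context_feature)
    return feats

print(name_features(["Bruce", "Kaplan"], "M:president"))
print(name_features(["Metals", "Inc"], "X:president-of"))
# ['N:Bruce-Kaplan', 'C:Bruce', 'C:Kaplan', 'M:president']
# ['N:Metals-Inc', 'C:Metals', 'C:Inc', 'X:president-of']
</Paragraph>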
    <Paragraph position="2"> Let us define Y(x) = + if x is a "location" instance, and Y(x) = − otherwise. We can estimate P(Y) from the test sample; it contains 186/1000 location instances, giving P(Y) = .186.</Paragraph>
    <Paragraph position="3"> Let us treat each feature F as a rule predicting + when F is present and − otherwise. The precision of F is P(Y|F). The internal feature N:New-York has precision 1. This permits us to compute the precision of various contextual features, as shown in the "Co-training" column of Table 1. We note that the numbers do not even look like probabilities. The cause is the failure of view independence to hold in the data, combined with the instability of the estimator. (The "Yarowsky" column uses a seed rule to estimate P(Y|F), as is done in the Yarowsky algorithm, and the "Truth" column shows the true value of P(Y|F).)</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Relaxing the Assumption
</SectionTitle>
    <Paragraph position="0"> Nonetheless, the unreasonableness of view independence does not mean we must abandon theorem 2. In this section, we introduce a weaker assumption, one that is satisfied by the data, and we show that theorem 2 holds under this weaker assumption.</Paragraph>
    <Paragraph position="1"> There are two ways in which the data can diverge from conditional independence: the rules may either be positively or negatively correlated, given the class value. Figure 2a illustrates positive correlation, and figure 2b illustrates negative correlation.</Paragraph>
    <Paragraph position="2"> If the rules are negatively correlated, then their disagreement (shaded in figure 2) is larger than if they are conditionally independent, and the conclusion of theorem 2 is maintained a fortiori. Unfortunately, in the data, they are positively correlated, so the theorem does not apply. Let us quantify the amount of deviation from conditional independence. We define the conditional dependence of F and G given Y = y, written dy, as a nonnegative measure of how much the conditional distribution of G given Y = y changes with the value of F. If F and G are conditionally independent, then dy = 0.</Paragraph>
    <Paragraph position="5"> This permits us to state a weaker version of rule independence: Definition 4 Rules F and G satisfy weak rule dependence just in case, for y ∈ {+, −}:
dy ≤ (p2q1 − p1p2) / (2p1q1)
where p1 and p2 are the minority probabilities of F and G, respectively, given Y = y, and q1 = 1 − p1.</Paragraph>
    <Paragraph position="7"> By definition, p1 and p2 cannot exceed 0.5. If p1 = 0.5, then weak rule dependence reduces to independence: if p1 = 0.5 and weak rule dependence is satisfied, then dy must be 0, which is to say, F and G must be conditionally independent. However, as p1 decreases, the permissible amount of conditional dependence increases.</Paragraph>
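    <Paragraph> A small helper (my own coding of the condition as reconstructed above) makes the trade-off explicit: the larger the minority probability p1, the less conditional dependence is tolerated.
def satisfies_weak_rule_dependence(d_y, p1, p2):
    """d_y: conditional dependence of F and G given Y = y (nonnegative).
    p1, p2: minority probabilities of F and G given Y = y (each at most 0.5)."""
    q1 = 1.0 - p1
    bound = (p2 * q1 - p1 * p2) / (2.0 * p1 * q1)
    return bound >= d_y

print(satisfies_weak_rule_dependence(0.05, 0.3, 0.2))   # bound is about 0.19, so True
print(satisfies_weak_rule_dependence(0.05, 0.5, 0.2))   # bound is 0, so False
</Paragraph>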
    <Paragraph position="8"> We can now state a generalized version of theorem 2: Theorem 3 For all F ∈ H1, G ∈ H2 that satisfy weak rule dependence and are nontrivial predictors in the sense that minu Pr[F = u] > Pr[F ≠ G], one of the following inequalities holds:
Pr[F ≠ Y] ≤ Pr[F ≠ G]
Pr[F̄ ≠ Y] ≤ Pr[F ≠ G]</Paragraph>
    <Paragraph position="10"> Consider figure 3. This illustrates the most relevant case, in which F and G are positively correlated given Y. (Only the case Y = + is shown; the case Y = − is similar.) We assume that the minority values of F are error values; the other cases are handled as in the discussion of theorem 2.</Paragraph>
    <Paragraph position="11"> Let u be the minority value of G when Y = +.</Paragraph>
    <Paragraph position="12"> In figure 3, a is the probability that G = u when F takes its minority value, and b is the probability that G = u when F takes its majority value.</Paragraph>
    <Paragraph position="13"> The value r = a − b is the difference; it coincides with the conditional dependence, so in particular we may write dy = a − b.</Paragraph>
    <Paragraph position="18"> Observe further that p2, the minority probability of G when Y = +, is a weighted average of a and b, namely, p2 = p1a + q1b. Combining this with the equation dy = a − b allows us to express a and b in terms of the remaining variables, to wit: a = p2 + q1dy and b = p2 − p1dy.</Paragraph>
    <Paragraph position="19"> In order to prove theorem 3, we need to show that the area of disagreement (B ∪ C) upper bounds the area of the minority value of F (A ∪ B). This is true just in case C is larger than A, which is to say, if bq1 ≥ ap1. Substituting our expressions for a and b into this inequality and solving for dy yields:
dy ≤ (p2q1 − p1p2) / (2p1q1)
In short, disagreement upper bounds the minority probability just in case weak rule dependence is satisfied, proving the theorem.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
8 The Greedy Agreement Algorithm
</SectionTitle>
    <Paragraph position="0"> Dasgupta, Littman, and McAllester suggest a possible algorithm at the end of their paper, but they give only the briefest suggestion, and do not implement or evaluate it. I give here an algorithm, the Greedy Agreement Algorithm, that constructs paired rules that agree on unlabeled data, and I examine its performance. The algorithm is given in figure 4:
Input: seed rules F, G
loop
  for each atomic rule H
    G' = G + H
    evaluate cost of (F, G')
  keep lowest-cost G'
  if G' is worse than G, quit
  swap F, G'</Paragraph>
    <Paragraph position="2"> The algorithm begins with two seed rules, one for each view. At each iteration, each possible extension to one of the rules is considered and scored. The best one is kept, and attention shifts to the other rule. A complex rule (or classifier) is a list of atomic rules H, each associating a single feature h with a label ℓ: H(x) = ℓ if x has feature h, and H(x) = ⊥ otherwise. A given atomic rule is permitted to appear multiple times in the list. Each atomic rule occurrence gets one vote, and the classifier's prediction is the label that receives the most votes. In case of a tie, there is no prediction.</Paragraph>
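    <Paragraph> The voting scheme just described is simple to implement; the following is a minimal sketch (mine) of a complex rule as a list of (feature, label) atomic rules that vote, abstaining on ties or when no rule fires. The example rules are hypothetical.
from collections import Counter

def predict(atomic_rules, instance, abstain=None):
    """atomic_rules: list of (feature, label) pairs; duplicates get extra votes.
    instance: a set of features. Returns the majority label, or abstains on a tie."""
    votes = Counter(label for feat, label in atomic_rules if feat in instance)
    if not votes:
        return abstain
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return abstain                       # tie: no prediction
    return ranked[0][0]

rules = [("N:New-York", "location"), ("X:president-of", "organization")]
print(predict(rules, {"N:New-York", "C:New", "C:York"}))   # 'location'
print(predict(rules, {"C:Kaplan"}))                        # None (abstain)
</Paragraph>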
    <Paragraph position="3"> The cost of a classifier pair (F, G) is based on a more general version of theorem 2 that admits abstaining rules. The following theorem is based on (Dasgupta et al., 2001).</Paragraph>
    <Paragraph position="4"> Theorem 4 If view independence is satisfied, and if F and G are rules based on different views, then one of the following holds:
Pr[F ≠ Y | F ≠ ⊥] ≤ δ/(μ − δ)
Pr[F̄ ≠ Y | F ≠ ⊥] ≤ δ/(μ − δ)
where δ = Pr[F ≠ G | F ≠ ⊥, G ≠ ⊥] is the disagreement rate and μ = minu Pr[F = u | F ≠ ⊥, G ≠ ⊥].</Paragraph>
    <Paragraph position="6"> In other words, for a given binary rule F, a pessimistic estimate of the number of errors made by F is δ/(μ − δ) times the number of instances labeled by F, plus the number of instances left unlabeled by F. Finally, we note that the cost of F is sensitive to the choice of G, but the cost of F with respect to G is not necessarily the same as the cost of G with respect to F. To get an overall cost, we average the cost of F with respect to G and G with respect to F.</Paragraph>
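    <Paragraph> The pieces can be assembled as follows. This sketch (my own, not Abney's implementation) scores a candidate pair from their predictions on unlabeled data using the pessimistic estimate described above; the δ/(μ − δ) factor follows the reconstruction of theorem 4 given here and should be read as an assumption. In the loop of figure 4, each candidate extension G' would be scored by pair_cost applied to the predictions of F and G' on the unlabeled training set.
from collections import Counter

def rule_cost(preds_f, preds_g, abstain=None):
    """Pessimistic error estimate for F given G: (delta/(mu - delta)) times the
    number of instances F labels, plus the number F leaves unlabeled, where delta
    is the disagreement rate and mu the smaller of F's label probabilities, both
    measured over instances that both rules label (binary, possibly abstaining rules)."""
    labeled_f = sum(1 for u in preds_f if u is not abstain)
    unlabeled_f = len(preds_f) - labeled_f
    both = [(u, v) for u, v in zip(preds_f, preds_g)
            if u is not abstain and v is not abstain]
    if not both:
        return float("inf")
    delta = sum(1 for u, v in both if u != v) / len(both)
    counts = Counter(u for u, _ in both)
    mu = 1.0 - max(counts.values()) / len(both)   # smaller label probability of F
    if delta >= mu:
        return float("inf")                       # nontriviality fails: no usable bound
    return (delta / (mu - delta)) * labeled_f + unlabeled_f

def pair_cost(preds_f, preds_g):
    """Symmetrized cost of the pair, as described in the text."""
    return 0.5 * (rule_cost(preds_f, preds_g) + rule_cost(preds_g, preds_f))
</Paragraph>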
    <Paragraph position="7">  Figure 5 shows the performance of the greedy agreement algorithm after each iteration. Because not all test instances are labeled (some are neither persons nor locations nor organizations), and because classifiers do not label all instances, we show precision and recall rather than a single error rate. The contour lines show levels of the F-measure (the harmonic mean of precision and recall). The algorithm is run to convergence, that is, until no atomic rule can be found that decreases cost. It is interesting to note that there is no significant overtraining with respect to F-measure. The final values are 89.2/80.4/84.5 recall/precision/F-measure, which compare favorably with the performance of the Yarowsky algorithm (83.3/84.6/84.0).</Paragraph>
    <Paragraph position="8"> (Collins and Singer, 1999) add a special final round to boost recall, yielding 91.2/80.0/85.2 for the Yarowsky algorithm and 91.3/80.1/85.3 for their version of the original co-training algorithm. All four algorithms essentially perform equally well; the advantage of the greedy agreement algorithm is that we have an explanation for why it performs well.</Paragraph>
  </Section>
  <Section position="10" start_page="0" end_page="0" type="metho">
    <SectionTitle>
9 The Yarowsky Algorithm
</SectionTitle>
    <Paragraph position="0"> For Yarowsky's algorithm, a classifier again consists of a list of atomic rules. The prediction of the classifier is the prediction of the first rule in the list that applies. The algorithm constructs a classifier iteratively, beginning with a seed rule.</Paragraph>
    <Paragraph position="1"> In the variant we consider here, one atomic rule is added at each iteration. An atomic rule Fℓ is chosen only if its precision, Pr[G = ℓ | F = ℓ] (as measured using the labels assigned by the current classifier G), exceeds a fixed threshold θ. (Yarowsky (1995), citing Yarowsky (1994), actually uses a superficially different score that is, however, a monotone transform of precision, hence equivalent to precision, since it is used only for sorting.) Yarowsky does not give an explicit justification for the algorithm. I show here that the algorithm can be justified on the basis of two independence assumptions. In what follows, F represents an atomic rule under consideration, and G represents the current classifier. Recall that Yℓ is the set of instances whose true label is ℓ, and Gℓ is the set of instances {x : G(x) = ℓ}.</Paragraph>
    <Paragraph position="2"> We write G/ for the set of instances labeled by the current classifier, that is, {x : G(x) ≠ ⊥}.</Paragraph>
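    <Paragraph> The variant just described can be sketched compactly (my own simplification, not Yarowsky's or Collins and Singer's code): at each iteration, the candidate atomic rule with the highest measured precision above the threshold θ is appended to the decision list. Choosing the highest-precision candidate among those above threshold is an assumption on my part, since the text leaves the selection rule open.
def yarowsky(instances, candidate_rules, seed, theta=0.95, max_iters=100):
    """instances: list of feature sets. candidate_rules, seed: (feature, label) pairs.
    The classifier is a decision list: the first rule whose feature is present wins."""
    decision_list = [seed]

    def classify(x):
        for feat, label in decision_list:
            if feat in x:
                return label
        return None                                # abstain

    for _ in range(max_iters):
        current = [classify(x) for x in instances]
        best = None
        for feat, label in candidate_rules:
            if (feat, label) in decision_list:
                continue
            covered = [g for x, g in zip(instances, current)
                       if feat in x and g is not None]
            if not covered:
                continue
            precision = sum(1 for g in covered if g == label) / len(covered)
            if precision > theta and (best is None or precision > best[0]):
                best = (precision, feat, label)
        if best is None:
            break                                  # no candidate exceeds the threshold
        decision_list.append((best[1], best[2]))
    return decision_list

seed = ("N:New-York", "location")
candidates = [("X:mayor-of", "location"), ("X:president-of", "organization")]
docs = [{"N:New-York", "X:mayor-of"}, {"N:Albany", "X:mayor-of"}]
print(yarowsky(docs, candidates, seed, theta=0.5))   # seed rule plus ('X:mayor-of', 'location')
</Paragraph>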
    <Paragraph position="3"> The first assumption is precision independence. Definition 5 Candidate rule Fℓ and classifier G satisfy precision independence just in case
P(Yℓ | Fℓ, G/) = P(Yℓ | Fℓ)</Paragraph>
    <Paragraph position="5"> A bootstrapping problem instance satisfies precision independence just in case all rules G and all atomic rules Fℓ that nontrivially overlap with G (both Fℓ ∩ G/ and Fℓ − G/ are nonempty) satisfy precision independence.</Paragraph>
    <Paragraph position="6"> Precision independence is stated here so that it looks like a conditional independence assumption, to emphasize the similarity to the analysis of co-training. In fact, it is only "half" an independence assumption: for precision independence, it is not necessary that P(Yℓ | F̄ℓ, G/) = P(Yℓ | F̄ℓ).</Paragraph>
    <Paragraph position="7"> The second assumption is that classifiers make balanced errors; that is, among the labeled instances in Fℓ, the classifier assigns label ℓ to as many instances whose true label is not ℓ as it assigns some other label to instances whose true label is ℓ:
P(Ȳℓ, Gℓ | Fℓ, G/) = P(Yℓ, Ḡℓ | Fℓ, G/)</Paragraph>
    <Paragraph position="9"> Let us first consider a concrete (but hypothetical) example. Suppose the initial classifier correctly labels 100 out of 1000 instances, and makes no mistakes. Then the initial precision is 1 and recall is 0.1. Suppose further that we add an atomic rule that correctly labels 19 new instances, and incorrectly labels one new instance.</Paragraph>
    <Paragraph position="11"> The rule's precision is 0.95. The precision of the new classifier (the old classifier plus the new atomic rule) is 119/120 = 0.99. Note that the new precision lies between the old precision and the precision of the rule. We will show that this is always the case, given precision independence and balanced errors.</Paragraph>
    <Paragraph position="12"> We need to consider several quantities: the precision of the current classifier, P(Yℓ | Gℓ); the precision of the rule under consideration, P(Yℓ | Fℓ); the precision of the rule on the current labeled set, P(Yℓ | Fℓ, G/); and the precision of the rule as measured using estimated labels, P(Gℓ | Fℓ, G/).</Paragraph>
    <Paragraph position="13"> The assumption of balanced errors implies that measured precision equals true precision on labeled instances; that is, P(Gℓ | Fℓ, G/) = P(Yℓ | Fℓ, G/). (We assume here that all instances have true labels.) This, combined with precision independence, implies that the precision of Fℓ as measured on the labeled set is equal to its true precision P(Yℓ | Fℓ).</Paragraph>
    <Paragraph position="16"> Now consider the precision of the old and new classifiers at predicting ℓ. Of the instances that the old classifier labels ℓ, let A be the number that are correctly labeled and B be the number that are incorrectly labeled. Defining Nt = A + B, the precision of the old classifier is Qt = A/Nt. Let ΔA be the number of new instances that the rule under consideration correctly labels, and let ΔB be the number that it incorrectly labels. Defining n = ΔA + ΔB, the precision of the rule is q = ΔA/n. The precision of the new classifier is Qt+1 = (A + ΔA)/Nt+1, where Nt+1 = Nt + n, which can be written as:
Qt+1 = (Nt/Nt+1) Qt + (n/Nt+1) q
That is, the precision of the new classifier is a weighted average of the precision of the old classifier and the precision of the new rule.</Paragraph>
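    <Paragraph> Plugging in the running example (a seed classifier with 100 correct labels, then a rule adding 19 correct and 1 incorrect) confirms the weighted-average identity; the check below is simple arithmetic.
import math

A, Nt = 100, 100                  # old classifier: all 100 labeled instances correct
Qt = A / Nt                       # old precision = 1.0
dA, dB = 19, 1                    # the new rule's correct and incorrect labels
n = dA + dB
q = dA / n                        # rule precision = 0.95
Nt1 = Nt + n
Qt1 = (A + dA) / Nt1              # new precision = 119/120
weighted = (Nt / Nt1) * Qt + (n / Nt1) * q
print(Qt1, math.isclose(Qt1, weighted))   # 0.9916..., True: it lies between q and Qt
</Paragraph>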
    <Paragraph position="19"> An immediate consequence is that, if we only accept rules whose precision exceeds a given threshold θ, then the precision of the new classifier exceeds θ. Since measured precision equals true precision under our previous assumptions, it follows that the true precision of the final classifier exceeds θ if the measured precision of every accepted rule exceeds θ.</Paragraph>
    <Paragraph position="20"> Moreover, observe that recall can be written as Rt = QtNt/Nℓ, where Nℓ is the number of instances whose true label is ℓ. If Qt > θ, then recall is bounded below by θNt/Nℓ, which grows as Nt grows.</Paragraph>
    <Paragraph position="21"> Hence we have proven the following theorem. Theorem 5 If the assumptions of precision independence and balanced errors are satisfied, then the Yarowsky algorithm with threshold θ obtains a final classifier whose precision is at least θ. Moreover, recall is bounded below by θNt/Nℓ, a quantity which increases at each round.</Paragraph>
    <Paragraph position="22"> Intuitively, the Yarowsky algorithm increases recall while holding precision above a threshold that represents the desired precision of the final classifier. The empirical behavior of the algorithm, as shown in figure 6, is in accordance with this analysis.</Paragraph>
    <Paragraph position="23"> We have seen, then, that the Yarowsky algorithm, like the co-training algorithm, can be justified on the basis of an independence assumption, precision independence. It is important to note, however, that the Yarowsky algorithm is not a special case of co-training. Precision independence and view independence are distinct assumptions; neither implies the other.</Paragraph>
  </Section>
</Paper>