Understanding the Yarowsky Algorithm

The assumption of precision independence is problematic not only because it restricts the data sets on which the algorithm can be shown effective, but also for additional internal reasons. A detailed discussion would take us too far afield here, but suffice it to say that precision independence is a property that it would be preferable not to assume, but rather to derive from more basic properties of a data set, and that closer empirical study shows that precision independence fails to be satisfied in some data sets on which the Yarowsky algorithm is effective.

This article proposes a different approach. Instead of making assumptions about the data, it views the Yarowsky algorithm as optimizing an objective function. We will show that several variants of the algorithm (though not the algorithm in precisely its original form) optimize either negative log likelihood H or an alternative objective function, K, that imposes an upper bound on H.

Ideally, we would like to show that the Yarowsky algorithm minimizes H. Unfortunately, we are not able to do so. But we are able to show that a variant of the Yarowsky algorithm, which we call Y-1/DL-EM, decreases H in each iteration. It combines the outer loop of the Yarowsky algorithm with a different inner loop based on the expectation-maximization (EM) algorithm.

A second proposed variant of the Yarowsky algorithm, Y-1/DL-1, has the advantage that its inner loop is very similar to the original Yarowsky inner loop, unlike Y-1/DL-EM, whose inner loop bears little resemblance to the original. Y-1/DL-1 has the disadvantage that it does not directly reduce H, but we show that it does reduce the alternative objective function K.

We also consider a third variant, YS. It differs from Y-1/DL-EM and Y-1/DL-1 in that it updates sequentially (adding a single rule in each iteration), rather than in parallel (updating all rules in each iteration). Besides having the intrinsic interest of sequential update, YS can be proven effective when using exactly the same smoothing method as used in the original Yarowsky algorithm, in contrast to Y-1/DL-1, which uses either no smoothing or a nonstandard "variable smoothing." YS is proven to decrease K.

The Yarowsky algorithm variants that we consider are summarized in Table 1. To the extent that these variants capture the essence of the original algorithm, we have a better formal understanding of its effectiveness. Even if the variants are deemed to depart substantially from the original algorithm, we have at least obtained a family of new bootstrapping algorithms that are mathematically understood.

Table 1 (fragment)
Yarowsky algorithm variants.
YS-P: sequential update, "antismoothing"
YS-R: sequential update, no smoothing
YS-FS: sequential update, original Yarowsky smoothing

2. The Generic Yarowsky Algorithm

2.1 The Original Algorithm Y-0

The original Yarowsky algorithm, which we refer to as Y-0, is given in Table 2.
It is an iterative algorithm. One begins with a seed set L^(0) of labeled examples and a set V^(0) of unlabeled examples. At each iteration, a classifier is constructed from the labeled examples; then the classifier is applied to the unlabeled examples to create a new labeled set.

To discuss the algorithm formally, we require some notation. We assume first a set of examples X and, for each example x in X, a feature set F_x; example x possesses feature f if and only if f ∈ F_x.

We also require a series of labelings Y^(t), where t represents the iteration number. We write Y_x^(t) for the label of example x under labeling Y^(t). An unlabeled example is one for which Y_x^(t) is undefined, in which case we write Y_x^(t) = ⊥. We write V^(t) for the set of unlabeled examples and L^(t) for the set of labeled examples. It will also be useful to have a notation L_j for the set of examples with label j and L_f for the set of labeled examples with feature f, trusting to the index to discriminate between L_f (labeled examples with feature f) and L_j (labeled examples with label j). We always use f and g to represent features and j and k to represent labels. The reader may wish to refer to Table 3, which summarizes notation used throughout the article.

Table 3
Summary of notation.
X: set of examples, both labeled and unlabeled
Y: the current labeling; Y^(t) is the labeling at iteration t
L: the (current) set of labeled examples
V: the (current) set of unlabeled examples
x: an example index
f, g: feature indices
j, k: label indices
q̂_f(j): "peaked" precision (equation (25))
j+: the label that maximizes precision q_f(j) for a given feature f (equation (26))
j∗: the label that maximizes rule score θ_fj for a given feature f (equation (28))
u(·): uniform distribution

In each iteration, the Yarowsky algorithm uses a supervised learner to train a classifier on the labeled examples. Let us call this supervised learner the base learning algorithm; it is a function from (X, Y^(t)) to a classifier p drawn from a space of classifiers P. It is assumed that the classifier makes confidence-weighted predictions. That is, the classifier defines a scoring function p(x, j), and the predicted label for example x is

$$\hat y = \arg\max_j p(x, j) \tag{1}$$

Ties are broken arbitrarily. Technically, we assume a fixed order over labels and define the maximization as returning the first label in the ordering, in case of a tie.

It will be convenient to assume that the scoring function is nonnegative and bounded, in which case we can normalize it to make p(x, j) a conditional distribution over labels j for a given example x. Henceforward, we write p_x(j) for the normalized scoring function, understanding p_x to be a probability distribution over labels j. We call this distribution the prediction distribution of the classifier on example x.

To complete an iteration of the Yarowsky algorithm, one recomputes labels for examples. Specifically, the label ŷ is assigned to example x if the score p_x(ŷ) exceeds a threshold ζ, called the labeling threshold. The new labeled set L^(t+1) consists of the examples so labeled, with the exception that the seed examples always retain their original labels: Y^(0) constitutes the original manually labeled data, as opposed to data that have been labeled by the learning algorithm itself. The algorithm continues until convergence. The particular base learning algorithm that Yarowsky uses is deterministic, in the sense that the classifier induced is a deterministic function of the labeled data. Hence, the algorithm is known to have converged at whatever point the labeling remains unchanged. A minimal sketch of this outer loop follows.
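To make the outer loop concrete, here is a minimal Python sketch of Y-0 as just described. The names (train_base_learner, zeta, the dict-based representations) are illustrative conventions of this sketch, not of the original paper, and the base learner is left abstract, as in the text.

```python
def yarowsky_y0(X, seed_labels, train_base_learner, zeta, max_iters=100):
    """Generic Yarowsky outer loop (Y-0): repeatedly train a base
    classifier on the currently labeled examples, then relabel any
    example whose top predicted label scores above the threshold zeta.
    Seed labels are retained unchanged throughout."""
    labels = dict(seed_labels)            # example -> label
    classifier = None
    for _ in range(max_iters):
        classifier = train_base_learner(X, labels)    # the inner loop
        new_labels = dict(seed_labels)    # seeds always keep their labels
        for x in X:
            if x in seed_labels:
                continue
            p_x = classifier(x)           # prediction distribution: label -> prob
            y_hat = max(p_x, key=p_x.get) # equation (1); ties broken arbitrarily
            if p_x[y_hat] > zeta:         # labeling threshold
                new_labels[x] = y_hat
        if new_labels == labels:          # labeling unchanged: converged
            break
        labels = new_labels
    return classifier, labels
```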
Note that the algorithm as stated leaves the base learning algorithm unspecified. We can distinguish between the generic Yarowsky algorithm Y-0, for which the base learning algorithm is an open parameter, and the specific Yarowsky algorithm, which includes a specification of the base learner. Informally, we call the generic algorithm the outer loop and the base learner the inner loop of the specific Yarowsky algorithm. The base learner that Yarowsky assumes is a decision list induction algorithm. We postpone discussion of it until Section 3.

2.2 An Objective Function

Machine learning algorithms are typically designed to optimize some objective function that represents a formal measure of performance. The maximum-likelihood criterion is the most commonly used objective function. Suppose we have a set of examples L, with labels Y_x for x ∈ L, and a parametric family of models p_θ such that p(j|x; θ) represents the probability of assigning label j to example x, according to the model. The likelihood of θ is the probability of the full data set according to the model, viewed as a function of θ, and the maximum-likelihood criterion instructs us to choose the parameter settings θ̂ that maximize likelihood, or equivalently, log-likelihood:

$$\ell(\theta) = \sum_{x \in L} \log p(Y_x \mid x; \theta) = \sum_{x \in L} \sum_j [[j = Y_x]] \log p(j \mid x; \theta)$$

(The notation [[Φ]] represents the truth value of the proposition Φ; it is one if Φ is true and zero otherwise.) Let us define

$$\phi_x(j) \equiv [[j = Y_x]]$$

Note that φ_x satisfies the formal requirements of a probability distribution over labels j: specifically, it is a point distribution with all its mass concentrated on Y_x. We call it the labeling distribution. Now we can write

$$\ell(\theta) = \sum_{x \in L} \sum_j \phi_x(j) \log p_x(j) = -\sum_{x \in L} H(\phi_x \,\|\, p_x) \tag{2}$$

In (2) we have written p_x for the distribution p(· | x; θ), leaving the dependence on θ implicit. We have also used the nonstandard notation H(p ∥ q) for what is sometimes called cross entropy. It is easy to verify that

$$H(p \,\|\, q) = H(p) + D(p \,\|\, q) \tag{3}$$

where H(p) is the entropy of p and D is Kullback-Leibler divergence. Note that when p is a point distribution, H(p) = 0 and hence H(p ∥ q) = D(p ∥ q). In particular:

$$\ell(\theta) = -\sum_{x \in L} D(\phi_x \,\|\, p_x)$$

Thus when, as here, φ_x is a point distribution, we can restate the maximum-likelihood criterion as instructing us to choose the model that minimizes the total divergence between the empirical labeling distributions φ_x and the model's prediction distributions p_x.

To extend ℓ(θ) to unlabeled examples, we need only observe that unlabeled examples are ones about whose labels the data provide no information. Accordingly, we revise the definition of φ_x to treat unlabeled examples as ones whose labeling distribution is the maximally uncertain distribution, which is to say, the uniform distribution:

$$\phi_x(j) \equiv \begin{cases} [[j = Y_x]] & \text{for } x \in L \\ 1/L & \text{for } x \in V \end{cases}$$

where L in the expression 1/L denotes the number of labels. With this extension, the total cross entropy is minimized when the labels of examples agree with the predictions of the model. In short, we adopt as objective function

$$H \equiv \sum_{x \in X} H(\phi_x \,\|\, p_x)$$

We seek to minimize H. A small illustration in code follows.
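As a small illustration (a sketch, with invented container conventions: labels is a dict from example to label, and predict returns a dict from label to probability), the objective H sums cross entropies of prediction distributions against labeling distributions, treating unlabeled examples as uniformly labeled:

```python
import math

def cross_entropy(phi, p):
    # H(phi || p) = -sum_j phi(j) log p(j)
    return -sum(phi[j] * math.log(p[j]) for j in phi if phi[j] > 0)

def objective_H(examples, labels, predict, num_labels):
    """H = sum_x H(phi_x || p_x); phi_x is a point distribution for
    labeled x and the uniform distribution for unlabeled x."""
    H = 0.0
    for x in examples:
        p = predict(x)  # prediction distribution: label -> probability
        if x in labels:
            phi = {j: (1.0 if j == labels[x] else 0.0) for j in p}
        else:
            phi = {j: 1.0 / num_labels for j in p}
        H += cross_entropy(phi, p)
    return H
```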
2.3 The Modified Algorithm Y-1

We can show that a modified version of the Yarowsky algorithm finds a local minimum of H. Two modifications are necessary:

- The labeling function Y is recomputed in each iteration as before, but with the constraint that an example once labeled stays labeled. The label may change, but a labeled example cannot become unlabeled again.
- We eliminate the threshold ζ or (equivalently) fix it at 1/L. As a result, the only examples that remain unlabeled after the labeling step are those for which p_x is the uniform distribution. The problem with an arbitrary threshold is that it prevents the algorithm from converging to a minimum of H. A threshold that gradually decreases to 1/L would also address the problem but would complicate the analysis.

The modified algorithm, Y-1, is given in Table 4.

Table 4
The modified generic Yarowsky algorithm (Y-1).
(1) Given: X, Y^(0)
(2) For t ∈ {0, 1, ...}:
    (2.1) Train classifier on (L^(t), Y^(t)); result is p^(t+1)
    (2.2) For each example x ∈ X: set ŷ = arg max_j p_x^(t+1)(j); label x with ŷ if x is already labeled or p_x^(t+1)(ŷ) > 1/L; otherwise leave x unlabeled
    (2.3) Stop when the labeling no longer changes

To obtain a proof, it will be necessary to make an assumption about the supervised classifier p^(t+1) induced by the base learner in step 2.1 of the algorithm. A natural assumption is that the base learner chooses p^(t+1) so as to minimize the divergence between the labeling and prediction distributions on the labeled examples; a weaker assumption suffices, namely that the base learner never increases that divergence:

$$\sum_{x \in L^{(t)}} D(\phi_x^{(t)} \,\|\, p_x^{(t+1)}) \le \sum_{x \in L^{(t)}} D(\phi_x^{(t)} \,\|\, p_x^{(t)}) \tag{7}$$

A base learner that minimizes the divergence certainly satisfies the weaker assumption (7), inasmuch as the option of setting p^(t+1) = p^(t) is always available.

We also consider a somewhat stronger assumption, namely, that the base learner reduces divergence over all examples, not just over labeled examples:

$$\sum_{x \in X} D(\phi_x^{(t)} \,\|\, p_x^{(t+1)}) \le \sum_{x \in X} D(\phi_x^{(t)} \,\|\, p_x^{(t)}) \tag{8}$$

If a base learning algorithm satisfies (8), the proof of Theorem 1 is shorter; but (7) is the more natural condition for a base learner to satisfy.

We can now state the main theorem of this section.

Theorem 1
If the base learning algorithm satisfies (7) or (8), algorithm Y-1 decreases H at each iteration until it reaches a critical point of H.

We require the following lemma in order to prove the theorem:

Lemma 1
For all distributions p,

$$-\log \max_j p(j) \le -\frac{1}{L} \sum_k \log p(k)$$

that is, the cross entropy H(π ∥ p), where π is the point distribution concentrated on the mode of p, is no greater than H(u ∥ p), with u the uniform distribution. We have equality only if p(k) = max_j p(j) for all k, that is, only if p is the uniform distribution.
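Lemma 1 is easy to check numerically: for any non-uniform p, the cross entropy against the point distribution at the mode is strictly smaller than against the uniform distribution. The example distribution below is arbitrary.

```python
import math

p = {"a": 0.5, "b": 0.3, "c": 0.2}   # an arbitrary prediction distribution
L = len(p)

h_point = -math.log(max(p.values()))                # H(point-at-mode || p)
h_unif = -sum(math.log(v) for v in p.values()) / L  # H(u || p)
assert h_point <= h_unif   # Lemma 1; equality only when p is uniform
print(h_point, h_unif)     # 0.693... < 1.168...
```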
We now prove the theorem.

Proof of Theorem 1
The algorithm produces a sequence of labeling distributions φ^(0), φ^(1), ... and a sequence of classifiers p^(1), p^(2), .... In the training step (2.1) of the algorithm, we hold φ fixed and change p, and in the labeling step (2.2), we hold p fixed and change φ. We will show that the training step minimizes H as a function of p, and the labeling step minimizes H as a function of φ except in examples in which it is at a critical point of H. Hence, H is nonincreasing in each iteration of the algorithm and is strictly decreasing unless (φ^(t), p^(t+1)) is a critical point of H.

Let us consider the labeling step first. In this step, p is held constant, but φ (possibly) changes, and we have

$$\Delta H = \sum_x \Delta H(x), \qquad \Delta H(x) \equiv H(\phi_x^{(t+1)} \,\|\, p_x^{(t+1)}) - H(\phi_x^{(t)} \,\|\, p_x^{(t+1)})$$

We can show that ΔH is nonpositive if we can show that ΔH(x) is nonpositive for all x. By Lemma 1, labeling x with the mode of p_x^(t+1) guarantees that ΔH(x) ≤ 0, except possibly where φ_x^(t+1) = u, that is, for examples x that remain unlabeled at t + 1. However, in algorithm Y-1, any example that is unlabeled at t + 1 is necessarily also unlabeled at t, so for any such example, ΔH(x) = 0. Hence, if any label changes in the labeling step, H decreases, and if no label changes, H remains unchanged; in either case, H does not increase.

We can show further that even for examples x ∈ V^(t+1), we are at a critical point of H. For any such example, p_x^(t+1) is the uniform distribution (otherwise Y-1 would have labeled x). Hence the divergence between φ_x and p_x is zero, and thus at a minimum. It would be possible to decrease H(φ_x ∥ p_x) by decreasing the entropy H(φ_x), but all directions of motion (all ways of selecting labels to receive increased probability mass) are equally good. That is to say, the gradient of H is zero; we are at a critical point. Essentially, we have reached a saddle point. We have minimized H with respect to φ_x(j) along those dimensions with a nonzero gradient. Along the remaining dimensions, we are actually at a local maximum, but without a gradient to choose a direction of descent.

Now let us consider the algorithm's training step (2.1). In this step, φ is held constant, so the change in H is equal to the change in D; recall that H(φ ∥ p) = H(φ) + D(φ ∥ p). By the hypothesis of the theorem, there are two cases: the base learner satisfies either (7) or (8). If it satisfies (8), the base learner minimizes D as a function of p, and hence it follows immediately that it minimizes H as a function of p.

Suppose instead that the base learner satisfies (7). We can express H as

$$H = \sum_x H(\phi_x) + \sum_{x \in L^{(t)}} D(\phi_x \,\|\, p_x) + \sum_{x \in V^{(t)}} D(\phi_x \,\|\, p_x)$$

In the training step, the first term remains constant. The second term decreases, by hypothesis. But the third term may increase. However, we can show that any increase in the third term is more than offset in the labeling step.

Consider an arbitrary example x in V^(t). Since it is unlabeled at time t, we know that φ_x^(t) is the uniform distribution u. If p_x^(t+1) is also uniform, then x remains unlabeled and H(x) does not change. Otherwise, x is labeled in the labeling step. Hence the value of H(x) at the end of the iteration, after the labeling step, is H(φ_x^(t+1) ∥ p_x^(t+1)) = −log max_j p_x^(t+1)(j), which by Lemma 1 is no greater than H(u ∥ p_x^(t+1)), the value after the training step. Thus, although the training step may increase H(x) for such an example, if we consider the change overall, we find that the increase in the training step is more than offset in the labeling step.
3.1 The Original Decision List Induction Algorithm DL-0

When one speaks of the Yarowsky algorithm, one often has in mind not just the generic algorithm Y-0 (or Y-1), but an algorithm whose specification includes the particular choice of base learning algorithm made by Yarowsky. Specifically, Yarowsky's base learner constructs a decision list, that is, a list of rules of the form f → j, where f is a feature and j is a label, with score θ_fj. A rule f → j matches example x if x possesses the feature f. The label predicted for a given example x is the label of the highest-scoring rule that matches x.

Yarowsky uses smoothed precision for rule scoring. As the name suggests, smoothed precision q̃_f(j) is a smoothed version of (raw) precision q_f(j), which is the probability that rule f → j is correct given that it matches:

$$q_f(j) \equiv \frac{|L_{fj}|}{|L_f|} \tag{9}$$

where L_f is the set of labeled examples that possess feature f, and L_fj is the set of labeled examples with feature f and label j. Smoothed precision q̃(j | f; ε) is defined as follows:

$$\tilde q(j \mid f; \varepsilon) \equiv \frac{|L_{fj}| + \varepsilon}{|L_f| + L\varepsilon}$$

where L is the number of labels. We also write q̃_f(j) when ε is clear from context. Yarowsky defines a rule's score to be its smoothed precision: θ_fj = q̃_f(j). Anticipating later needs, we will also consider raw precision as an alternative: θ_fj = q_f(j). Both raw and smoothed precision have the properties of a conditional probability distribution. Generally, we view θ_fj as a conditional distribution over labels j for a fixed feature f.

Yarowsky defines the confidence of the decision list to be the score of the highest-scoring rule that matches the instance being classified. This is equivalent to defining

$$p_x(j) \propto \max_{f \in F_x} \theta_{fj} \tag{11}$$

(Recall that F_x is the set of features of x.) Since the classifier's prediction for x is defined, in equation (1), to be the label that maximizes p_x(j), definition (11) implies that the classifier's prediction is the label of the highest-scoring rule matching x, as desired.

We have written ∝ in (11) rather than = because maximizing θ_fj across f ∈ F_x for each label j will not in general yield a probability distribution over labels, though the scores will be positive and bounded, and hence normalizable. Considering only the final predicted label ŷ for a given example x, the normalization has no effect, inasmuch as all scores θ_fj being compared are scaled in the same way.

As characterized by Yarowsky, a decision list contains only those rules f → j whose score q̃_f(j) exceeds the labeling threshold ζ. This can be seen purely as an efficiency measure. Including rules whose score falls below the labeling threshold will have no effect on the classifier's predictions, as the threshold will be applied when the classifier is applied to examples. For this reason, we do not prune the list. That is, we represent a decision list as a set of parameters {θ_fj}, one for every possible rule f → j in the cross product of the set of features and the set of labels.

The decision list induction algorithm used by Yarowsky is summarized in Table 5; we refer to it as DL-0.

Table 5
The decision list induction algorithm DL-0. The value accumulated in N[f, j] is |L_fj|, and the value accumulated in Z[f] is |L_f|.
(0) Given: a fixed value for ε > 0. Initialize arrays N[f, j] = 0, Z[f] = 0 for all f, j
(1) For each example x ∈ L:
    (1.1) Let j be the label of x
    (1.2) Increment N[f, j], Z[f], for each feature f of x
(2) For each feature f and label j: set θ_fj = (N[f, j] + ε) / (Z[f] + Lε)
(*) p_x(j) ∝ max_{f ∈ F_x} θ_fj

Note that the step labeled (*) is not actually a step of the induction algorithm but rather specifies how the decision list is used to compute a prediction distribution p_x for a given example x. A runnable sketch of DL-0 follows.
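A minimal sketch of DL-0 as summarized in Table 5, with smoothed-precision scores and the "max" prediction rule (11). The container conventions (labeled as a dict from example to label, features(x) returning the feature set F_x) are assumptions of this sketch.

```python
from collections import defaultdict

def train_dl0(labeled, features, label_set, eps):
    """DL-0: one pass over the labeled data accumulates N[f, j] = |L_fj|
    and Z[f] = |L_f|; every rule f -> j is then scored by smoothed
    precision (N[f, j] + eps) / (Z[f] + L * eps)."""
    N = defaultdict(float)                # N[(f, j)]
    Z = defaultdict(float)                # Z[f]
    for x, j in labeled.items():
        for f in features(x):
            N[(f, j)] += 1
            Z[f] += 1
    L = len(label_set)
    theta = {(f, j): (N[(f, j)] + eps) / (Z[f] + L * eps)
             for f in Z for j in label_set}

    def predict(x):
        # "max" definition (11): score each label by its best matching rule;
        # an unseen rule gets eps / (L * eps) = 1/L, as the smoothing implies.
        scores = {j: max(theta.get((f, j), 1.0 / L) for f in features(x))
                  for j in label_set}
        z = sum(scores.values())
        return {j: s / z for j, s in scores.items()}   # normalize

    return theta, predict
```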
Unfortunately, we cannot prove anything about DL-0 as it stands. In particular, we are unable to show that DL-0 reduces the divergence between prediction and labeling distributions (7). In the next section, we describe an alternative decision list induction algorithm, DL-EM, that does satisfy (7); hence we can apply Theorem 1 to the combination Y-1/DL-EM to show that it reduces H. However, a disadvantage of DL-EM is that it does not resemble the algorithm DL-0 used by Yarowsky. We return in Section 3.4 to a close variant of DL-0 called DL-1 and show that though it does not directly reduce H, it does reduce the upper bound K.

3.2 The Decision List Induction Algorithm DL-EM

The algorithm DL-EM is a special case of the EM algorithm. We consider two versions of the algorithm: DL-EM-L and DL-EM-X. They differ in that DL-EM-L is trained on labeled examples only, whereas DL-EM-X is trained on both labeled and unlabeled examples. However, the basic outline of the algorithm is the same for both.

First, the DL-EM algorithms do not assume Yarowsky's definition of p_x, given in (11). As discussed above, the parameters θ_fj can be thought of as defining a prediction distribution θ_f(j) over labels j for each feature f. Where (11) combines the prediction distributions θ_f for the features of example x by maximizing, DL-EM combines them by averaging:

$$p_x(j) = \frac{1}{|F_x|} \sum_{f \in F_x} \theta_{fj} \tag{12}$$

This is a convex combination of the distributions θ_f, and since any convex combination of distributions is also a distribution, it follows that p_x as defined in (12) is a probability distribution.

The two definitions for p_x(j), (11) and (12), will often have the same mode ŷ, but that is guaranteed only in the rather severely restricted case of two features and two labels. Under definition (11), the prediction is determined entirely by the strongest feature, whereas under definition (12) it is possible for a coalition of weaker features to outvote the strongest one. Yarowsky explicitly wished to avoid the possibility of such interactions. Nonetheless, definition (12), used by DL-EM, turns out to make analysis of other base learners more manageable, and we will assume it henceforth, not only for DL-EM, but also for the algorithms DL-1 and YS discussed in subsequent sections.

DL-EM also differs from DL-0 in that DL-EM does not construct a classifier "from scratch" but rather seeks to improve on a previous classifier. In the context of the Yarowsky algorithm, the previous classifier is the one from the previous iteration of the outer loop. We write θ^old_fj for the parameters and p^old_x for the prediction distributions of the previous classifier.

Conceptually, DL-EM considers the label j assigned to an example x to be generated by choosing a feature f ∈ F_x and then assigning the label j according to the feature's prediction distribution θ_f(j). The choice of feature f is a hidden variable. The degree to which an example labeled j is imputed to feature f is determined by the old distribution:

$$p^{\mathrm{old}}(f \mid x, j) = \frac{\theta^{\mathrm{old}}_{fj}}{\sum_{g \in F_x} \theta^{\mathrm{old}}_{gj}}$$

One can think of p^old(f | x, j) either as the posterior probability that feature f was responsible for the label j, or as the portion of the labeled example (x, j) that is imputed to feature f. We also write p^old_xj(f) as a synonym for p^old(f | x, j). The new estimate θ_fj is obtained by summing imputed occurrences of (f, j) and normalizing across labels. For DL-EM-L, this takes the form

$$\theta_{fj} = \frac{\sum_{x \in L_j} p^{\mathrm{old}}_{xj}(f)}{\sum_k \sum_{x \in L_k} p^{\mathrm{old}}_{xk}(f)}$$

A sketch of this update in code follows.
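A sketch of the DL-EM-L update above, under the same container assumptions as before: the E-step divides each labeled example among its features according to the old parameters, and the M-step renormalizes each feature's imputed counts across labels.

```python
from collections import defaultdict

def dl_em_l_update(labeled, features, label_set, theta_old):
    """One DL-EM-L iteration: impute each labeled example (x, j) to its
    features via p_old(f | x, j), then normalize per feature across labels."""
    counts = defaultdict(float)                     # counts[(f, j)]: imputed mass
    for x, j in labeled.items():
        fs = features(x)
        z = sum(theta_old[(g, j)] for g in fs)      # normalizer over F_x
        for f in fs:
            counts[(f, j)] += theta_old[(f, j)] / z # p_old(f | x, j)
    theta = {}
    for f in {f for (f, _) in counts}:
        total = sum(counts[(f, k)] for k in label_set)
        for j in label_set:
            theta[(f, j)] = counts[(f, j)] / total
    return theta
```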
The algorithm is summarized in Table 6.

The second version of the algorithm, DL-EM-X, is summarized in Table 7. It is like DL-EM-L, except that it uses the update rule

$$\theta_{fj} \propto \sum_{x \in L_j} p^{\mathrm{old}}_{xj}(f) + \frac{1}{L} \sum_{x \in V} p^{\mathrm{old}}_{xj}(f) \tag{13}$$

normalized across labels. Update rule (13) includes unlabeled examples as well as labeled examples. Conceptually, it divides each unlabeled example equally among the labels, then divides the resulting fractional labeled example among the example's features.

We note that both variants of the DL-EM algorithm constitute a single iteration of an EM-like algorithm. A single iteration suffices to prove the following theorem, though multiple iterations would also be effective:

Theorem 2
The classifier produced by the DL-EM-L algorithm satisfies equation (7), and the classifier produced by the DL-EM-X algorithm satisfies equation (8).

Combining Theorems 1 and 2 yields the following corollary:

Corollary
The Yarowsky algorithm Y-1, using DL-EM-L or DL-EM-X as its base learning algorithm, decreases H at each iteration until it reaches a critical point of H.

For the proof, let θ^old represent the parameter values at the beginning of the call to DL-EM, let θ represent a family of free variables that we will optimize, and let p^old and p be the corresponding prediction distributions. The labeling distribution φ is fixed. For any set of examples α, let ΔD_α denote the change in Σ_{x∈α} D(φ_x ∥ p_x) resulting from the change in θ. We are obviously particularly interested in two cases: that in which α is the set of all examples X (for DL-EM-X) and that in which α is the set of labeled examples L (for DL-EM-L). In either case, we will show that ΔD_α ≤ 0, with equality only if no choice of θ decreases D. One first derives an expression for −ΔD_α, and then seeks a solution to the family of equations that results from expressing the gradient of (16) as a linear combination of the gradients of the constraints.

3.3 The Objective Function K

Y-1/DL-EM is the only variation on the Yarowsky algorithm that we can show to reduce negative log-likelihood, H. The variants that we discuss in the remainder of the article, Y-1/DL-1 and YS, reduce an alternative objective function, K, which we now define.

The value K (or, more precisely, the value K/m, where m is the number of features per example, which for simplicity we take to be the same for all examples) is an upper bound on H, which we derive using Jensen's inequality, as follows:

$$H = -\sum_x \sum_j \phi_x(j) \log \frac{1}{m} \sum_{f \in F_x} \theta_{fj} \;\le\; \frac{1}{m} \sum_x \sum_{f \in F_x} \sum_j \phi_x(j) \log \frac{1}{\theta_{fj}} \;=\; \frac{K}{m}$$

where

$$K \equiv \sum_x \sum_{f \in F_x} \sum_j \phi_x(j) \log \frac{1}{\theta_{fj}}$$

Since every term of K is nonnegative, K is reduced to zero if all examples are labeled, each feature concentrates its prediction distribution in a single label, and the label of every example agrees with the prediction of every feature it possesses. In this limiting case, any minimizer of K is also a minimizer of H. A short computation of K in code follows.
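A short computation of K under the same conventions as the earlier sketches; K charges every (example, feature, label) triple with log loss against the feature's own distribution, which is what makes K/m an upper bound on H:

```python
import math

def objective_K(examples, labels, features, theta, label_set):
    """K = sum_x sum_{f in F_x} sum_j phi_x(j) * log(1 / theta[f, j]),
    with phi_x a point distribution for labeled x and uniform otherwise."""
    L = len(label_set)
    K = 0.0
    for x in examples:
        for f in features(x):
            if x in labels:
                K += -math.log(theta[(f, labels[x])])     # point phi_x
            else:
                K += sum(-math.log(theta[(f, j)]) for j in label_set) / L
    return K
```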
We hasten to add a proviso: it is not possible to reduce K to zero for all data sets. The following provides a necessary and sufficient condition for being able to do so. Consider an undirected bipartite graph G whose nodes are examples and features. There is an edge between example x and feature f just in case f is a feature of x.

3.4 Algorithm DL-1

We consider two variants of DL-0, called DL-1-R and DL-1-VS. They differ from DL-0 in two ways. First, the DL-1 algorithms assume the "mean" definition of p_x given in equation (12) rather than the "max" definition of equation (11). This is not actually a difference in the induction algorithm itself, but in the way the decision list is used to construct a prediction distribution p_x.

Second, the DL-1 algorithms use update rules that differ from the smoothed precision of DL-0. DL-1-R (Table 8) uses raw precision instead of smoothed precision. DL-1-VS (Table 9) uses smoothed precision, but unlike DL-0, DL-1-VS does not use a fixed smoothing constant ε; rather, ε varies from feature to feature. Specifically, in computing the score θ_fj, DL-1-VS uses |V_f| / L as its value for ε, where V_f is the set of unlabeled examples possessing feature f.

The value of ε used by DL-1-VS can be expressed in another way that will prove useful. Let us define

$$p(L \mid f) = \frac{|L_f|}{|L_f| + |V_f|}, \qquad p(V \mid f) = \frac{|V_f|}{|L_f| + |V_f|}$$

the proportions of examples possessing feature f that are labeled and unlabeled, respectively. Then, with ε = |V_f| / L, smoothed precision is the convex combination

$$\tilde q_f(j) = p(L \mid f)\, q_f(j) + p(V \mid f)\, u(j)$$

where u is the uniform distribution over labels. First we show that smoothed precision can be expressed as a convex combination of raw precision (9) and the uniform distribution. Define δ = ε / |L_f|; then

$$\tilde q_f(j) = \frac{|L_{fj}| + \varepsilon}{|L_f| + L\varepsilon} = \frac{q_f(j) + \delta}{1 + L\delta} = \frac{1}{1 + L\delta}\, q_f(j) + \left(1 - \frac{1}{1 + L\delta}\right) u(j) \tag{22}$$

Now we show that the mixing coefficient 1/(1 + Lδ) of (22) is the same as the mixing coefficient p(L | f) of the lemma, when ε = |V_f| / L. In that case δ = |V_f| / (L |L_f|), so

$$\frac{1}{1 + L\delta} = \frac{|L_f|}{|L_f| + |V_f|} = p(L \mid f)$$

The main theorem of this section (Theorem 5) is that the specific Yarowsky algorithm Y-1/DL-1 decreases K in each iteration until it reaches a critical point. It is proved as a corollary of two theorems. The first (Theorem 3) shows that DL-1 minimizes K as a function of θ, holding φ constant, and the second (Theorem 4) shows that Y-1 decreases K as a function of φ, holding θ constant; the labeling rule in question is the one used by Y-1. More precisely, DL-1-R minimizes K over the labeled examples L, and DL-1-VS minimizes K over all examples X. If the base learner minimizes over L only, rather than X, it can be shown that any increase in K on unlabeled examples is compensated for in the labeling step, as in the proof of Theorem 1.

Theorem 5
The specific Yarowsky algorithms Y-1/DL-1-R and Y-1/DL-1-VS decrease K at each iteration until they reach a critical point.

A sketch of the DL-1-VS score computation follows.
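A sketch of the DL-1-VS score for a single feature f, with counts passed in directly (n_fj[j] = |L_fj|, n_f = |L_f|, v_f = |V_f|); the assertion checks the mixture identity just derived.

```python
def dl1_vs_scores(n_fj, n_f, v_f, label_set):
    """Score rules for one feature f: smoothed precision with
    eps = |V_f| / L, equivalently p(L|f) * q_f(j) + p(V|f) * (1/L)."""
    L = len(label_set)
    eps = v_f / L
    theta = {j: (n_fj.get(j, 0) + eps) / (n_f + L * eps) for j in label_set}
    # Equivalent mixture form (sanity check of equation (22)):
    p_l = n_f / (n_f + v_f)   # p(L|f): proportion of f's examples labeled
    mix = {j: p_l * (n_fj.get(j, 0) / n_f) + (1 - p_l) / L for j in label_set}
    assert all(abs(theta[j] - mix[j]) < 1e-12 for j in label_set)
    return theta
```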
4.1 The Family YS

The Yarowsky algorithm variants we have considered up to now do "parallel" updates, in the sense that the parameters {θ_fj} are completely recomputed at each iteration. In this section, we consider a family YS of "sequential" variants of the Yarowsky algorithm, in which a single feature is selected for update at each iteration. The YS algorithms resemble the "Yarowsky-Cautious" algorithm of Collins & Singer (1999), though they differ from that algorithm in that they update a single feature in each iteration, rather than a small set of features, as in Yarowsky-Cautious.

The YS algorithms are intended to be as close to the Y-1/DL-1 algorithm as is consonant with single-feature updates. The YS algorithms differ from one another, and from Y-1/DL-1, in the choice of update rule. An interesting range of update rules work in the sequential setting. In particular, smoothed precision with fixed ε, as in the original algorithm Y-0/DL-0, works in the sequential setting, though with a proviso that will be spelled out later.

Instead of an initial labeled set, there is an initial classifier, consisting of a set S^(0) of selected features together with their prediction distributions. At each iteration, one feature is selected to be added to the selected set. A feature, once selected, remains in the selected set. It is permissible for a feature to be selected more than once; this permits us to continue reducing K even after all features have been selected. In short, there is a sequence of selected features f_1, f_2, ..., with S^(t+1) = S^(t) ∪ {f_t}.

The parameters for the selected feature are also updated. At iteration t, the parameters θ_fj for f = f_t are recomputed, and the parameters for all other features are carried over unchanged; it follows that, for all t, θ_fj is the value computed at the most recent iteration in which f was selected. However, parameters for features in the initial set S^(0) may not be modified, inasmuch as they play the role of manually labeled data.

In each iteration, one selects a feature f_t and computes (or recomputes) the prediction distribution θ_{f_t}(j) for the selected feature f_t. Then labels are recomputed as ŷ = arg max_j p_x(j), where we continue to assume p_x(j) to have the "mixture" definition (equation (12)). The label of example x is set to ŷ if any feature of x belongs to S^(t+1). In particular, all previously labeled examples continue to be labeled (though their labels may change), and any unlabeled examples possessing feature f_t become labeled.

The algorithm is summarized in Table 10. It is actually an algorithm schema; the definition for "update" needs to be supplied. We consider three different update functions: one that uses raw precision as its prediction distribution, one that uses smoothed precision, and one that goes in the opposite direction, using what we might call "peaked precision." As we have seen, smoothed precision can be expressed as a mixture of raw precision and the uniform (i.e., maximum-entropy) distribution (22). Peaked precision q̂_f(j) instead mixes in a certain amount of the point (i.e., minimum-entropy) distribution that has all its mass on the label that maximizes raw precision:

$$\hat q_f(j) \equiv p(L \mid f)\, q_f(j) + p(V \mid f)\, [[j = j^+]] \tag{25}$$

where

$$j^+ \equiv \arg\max_j q_f(j) \tag{26}$$

Note that peaked precision involves a variable amount of "peaking"; the mixing parameters depend on the relative proportions of labeled and unlabeled examples. Note also that j^+ is a function of f, though we do not explicitly represent that dependence.

The three instantiations of algorithm YS that we consider are YS-P ("peaked"), which uses peaked precision; YS-R ("raw"), which uses raw precision; and YS-FS ("fixed smoothing"), which uses smoothed precision with fixed ε, as in the original Yarowsky algorithm (see Table 1). We will show that the first two algorithms reduce K in each iteration. We will show that the third algorithm, YS-FS, reduces K in iterations in which f_t is a new feature, not previously selected. Unfortunately, we are unable to show that YS-FS reduces K when f_t is a previously selected feature. This suggests employing a mixed algorithm in which smoothed precision is used for new features but raw or peaked precision is used for previously selected features. The three update rules are sketched in code below.
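The three update rules side by side, under the same count conventions as earlier sketches; peaked_precision implements (25) and (26) as reconstructed above, moving the unlabeled mass toward the majority label j+ rather than toward the uniform distribution.

```python
def raw_precision(n_fj, n_f, label_set):            # YS-R
    return {j: n_fj.get(j, 0) / n_f for j in label_set}

def smoothed_precision(n_fj, n_f, label_set, eps):  # YS-FS (fixed eps)
    L = len(label_set)
    return {j: (n_fj.get(j, 0) + eps) / (n_f + L * eps) for j in label_set}

def peaked_precision(n_fj, n_f, v_f, label_set):    # YS-P ("antismoothing")
    q = raw_precision(n_fj, n_f, label_set)
    j_plus = max(q, key=q.get)                      # equation (26)
    p_l = n_f / (n_f + v_f)                         # p(L|f)
    return {j: p_l * q[j] + (1 - p_l) * (1.0 if j == j_plus else 0.0)
            for j in label_set}                     # equation (25)
```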
A final issue with the algorithm schema YS concerns the selection of features in step 2.1. The schema as stated does not specify which feature is to be selected. In essence, the manner in which rules are selected does not matter, as long as one selects rules that have room for improvement, in the sense that the current prediction distribution θ_f differs from raw precision q_f. (The justification for this choice is given in Theorem 9.) The theorems in the following sections show that K decreases in each iteration, so long as any such rule can be found. One could choose greedily, by selecting the feature that maximizes gain G (equation (27)), though in the next section we give lower bounds for G that are rather more easily computed (Theorems 6 and 7).

4.2 Gain

From this point on, we consider a single iteration of the YS algorithm and discard the variable t. We write θ^old and φ^old for the parameter set and labeling at the beginning of the iteration, and we write simply θ and φ for the new parameter set and new labeling. The set L (respectively, V) represents the examples that are labeled (respectively, unlabeled) at the beginning of the iteration. The selected feature is f. We wish to choose a prediction distribution for f so as to guarantee that K decreases in each iteration. The gain in the current iteration is

$$G \equiv K(\phi^{\mathrm{old}}, \theta^{\mathrm{old}}) - K(\phi, \theta) \tag{27}$$

Gain is the negative change in K; it is positive when K decreases.

In considering the reduction in K from (φ^old, θ^old) to (φ, θ), it will be convenient to consider an intermediate labeling ψ. For examples in V_f, the only selected feature at t + 1 is f, hence j∗ = ŷ for such examples. It follows that ψ and φ agree on examples in V_f. They also agree on examples that are unlabeled at t + 1, assigning them the uniform label distribution. If ψ and φ differ, it is only on old labeled examples (L) that need to be relabeled, given the addition of f.

The gain G can be represented as the sum of three intermediate gains, corresponding to the intermediate values just defined:

$$G = G_V + G_\theta + G_L$$

G_V intuitively represents the gain that is attributable to labeling previously unlabeled examples in accordance with the predictions of θ. The gain G_θ represents the gain that is attributable to changing the values θ_fj, where f is the selected feature. The gain G_L represents the gain that is attributable to changing the labels of previously labeled examples to make labels agree with the predictions of the new model θ. The gain G_θ corresponds to step 2.3 of algorithm YS, in which θ is changed but φ is held constant; and the combined G_V and G_L gains correspond to step 2.4 of algorithm YS, in which φ is changed while holding θ constant. In the remainder of this section, we derive two lower bounds for G. In following sections, we show that the updates YS-P, YS-R, and YS-FS guarantee that the lower bounds given below are non-negative, and hence that G is non-negative. A greedy selection sketch in code follows.
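The greedy variant of the selection rule can be sketched directly from (27); the callable `gain(f)` is assumed to evaluate K before and after the candidate update of θ_f (for instance, with objective_K above).

```python
def select_feature(candidates, gain):
    """Greedy selection for the YS schema: pick the feature with the largest
    gain G = K_old - K_new (equation (27)). Returns None when no candidate
    yields positive gain, i.e., no rule has room for improvement."""
    best_f, best_gain = None, 0.0
    for f in candidates:
        g = gain(f)
        if g > best_gain:
            best_f, best_gain = f, g
    return best_f
```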
4.3 Algorithm YS-P

We now use the results of the previous section to show that the algorithm YS-P is correct in the sense that it reduces K in every iteration.

Theorem 10
In each iteration of algorithm YS-P, K decreases.

Proof
We wish to show that G > 0. By Theorem 6, that is true if expression (31) is positive. By Theorem 9, there exist choices for θ that make it so. The relevant expression, (37), takes the same value for any distribution that concentrates all its mass in a single label j∗; it is symmetric in all choices of j∗ and decreases monotonically as θ_{fj∗} approaches one. Hence, the minimum of (37) will have j∗ equal to the mode of q_f, though it may be more peaked than q_f, at the cost of an increase in the first term, offset by a decrease in the second term.

Let j^+ = arg max_j q_f(j). By the reasoning of the previous paragraph, we know that j^+ = j∗ at the minimum of (37). Hence we can minimize (37) by minimizing over distributions that concentrate their mass in j^+.

4.4 Algorithm YS-R

We now show that YS-R also decreases K in each iteration. In fact, this has essentially already been proven: the lower bounds of the previous section show that setting θ_f equal to raw precision q_f yields strictly positive gain. This is the update rule used by YS-R.