<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1003"> <Title>An Incremental Decision List Learner</Title> <Section position="3" start_page="0" end_page="2" type="metho"> <SectionTitle> 2 The Algorithms </SectionTitle> <Paragraph position="0"> In this section, we first describe the traditional algorithm for decision list learning in more detail, then motivate our new algorithm, and finally describe the new algorithm and its variations in detail. For simplicity only, we state all algorithms for the binary output case; it should be clear how to extend all of them to the general case.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Traditional Algorithm </SectionTitle> <Paragraph position="0"> Decision list learners attempt to find models that work well on test data. The test data consists of a series of pairs x_1, y_1, ..., x_n, y_n. For instance, in a word sense disambiguation task, a given x_i could represent the set of words near the ambiguous word, and y_i could represent the correct sense of the word. Given a model D which predicts probabilities P_D(y|x), the standard way of defining how well D works is the entropy of the model on the test data, defined as -sum_i log_2 P_D(y_i|x_i).</Paragraph> <Paragraph position="2"> Lower entropy is better.</Paragraph> <Paragraph position="3"> There are many justifications for minimizing entropy. Among others, the &quot;true&quot; probability distribution has the lowest possible entropy. Also, minimizing training entropy corresponds to maximizing the probability of the training data.</Paragraph> <Paragraph position="4"> Now, consider trying to learn a decision list. Assume we are given a list of possible questions, q_1, ..., q_m.</Paragraph> <Paragraph position="6">
In our word sense disambiguation example, the questions might include &quot;Does the word 'water' occur nearby,&quot; or more complex ones, such as &quot;does the word 'Charles' occur nearby and is the word before 'river.&quot;' Let us assume that we have some training data, and that the system has two outputs (values for y), 0 and 1. Let C(q_i, 0) be the number of times that, when q_i was true in the training data, the output was 0, and similarly for C(q_i, 1).</Paragraph> <Paragraph position="10"> Let C(q_i) = C(q_i, 0) + C(q_i, 1) be the total number of times that q_i was true.</Paragraph> <Paragraph position="12"> Now, given a test instance x, y for which q_i is true, what probability would we assign to y = 1? The simplest answer is to just use the probability in the training data, C(q_i, 1)/C(q_i).</Paragraph> <Paragraph position="16"> Unfortunately, this tends to overfit the training data. For instance, if q_i was true only once in the training data, then, depending on the value for y that time, we would assign a probability of 1 or 0. The former is clearly an overestimate, and the latter is clearly an underestimate. Therefore, we smooth our estimates (Chen and Goodman, 1999). In particular, we used the interpolated absolute discounting method. Since both the traditional algorithm and the new algorithm use the same smoothing method, the exact smoothing technique will not typically affect the relative performance of the algorithms. Let C(0) be the total number of ys that were zero in the training data, and let C(1) be the total number that were one. Then, the &quot;unigram&quot; probability is P(y) = C(y)/(C(0) + C(1)).</Paragraph> <Paragraph position="18"> Let N(q_i) be the number of output values with non-zero counts for a given question. In particular, in the two class case, N(q_i) is 1 if y always had the same value when q_i was true, and 2 if both 1 and 0 values occurred. Now, we pick some value d (using heldout data) and discount all counts by d.
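As a concrete sketch, the discounting scheme just described can be implemented as follows. This is our own minimal illustration, not the paper's code; the function name and the dictionary representation of the counts are assumptions.

```python
def smoothed_prob(y, q_counts, uni_counts, d):
    # q_counts[y] holds C(q_i, y); uni_counts[y] holds the global C(y).
    # Interpolated absolute discounting: subtract the discount d from each
    # observed count and redistribute the freed mass (d times N(q_i), as a
    # fraction of C(q_i)) according to the "unigram" probability P(y).
    c_q = sum(q_counts.values())                      # C(q_i)
    n_q = sum(1 for c in q_counts.values() if c > 0)  # N(q_i)
    p_uni = uni_counts[y] / sum(uni_counts.values())  # P(y)
    return max(q_counts.get(y, 0) - d, 0) / c_q + (d * n_q / c_q) * p_uni
```

For example, with C(q_i, 0) = 3, C(q_i, 1) = 1, d = 0.5, and a 75/25 unigram split, the smoothed estimates are 0.8125 and 0.1875, which sum to one as required.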
Then, our probability distribution is P(y|q_i) = max(C(q_i, y) - d, 0)/C(q_i) + (d N(q_i)/C(q_i)) P(y), where P(y) is the unigram probability.</Paragraph> <Paragraph position="22"> Now, the predicted entropy for a question q_i is -sum_y P(y|q_i) log_2 P(y|q_i).</Paragraph> <Paragraph position="24"> The typical training algorithm for decision lists is very simple. Given the training data, compute the predicted entropy for each question. Then, sort the questions by their predicted entropy, and output a decision list with the questions in that order. One of the questions should be the special question that is always TRUE, which returns the unigram probability.</Paragraph> <Paragraph position="25"> Any question with worse entropy than TRUE will show up later in the list than TRUE, and we will never get to it, so it can be pruned away.</Paragraph> </Section> <Section position="2" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 2.2 New Algorithm </SectionTitle> <Paragraph position="0"> Consider two weathermen in Seattle in the winter.</Paragraph> <Paragraph position="1"> Assume the following (overly optimistic) model of Seattle weather. If today there is no wind, then tomorrow it rains. On one in 50 days, it is windy, and, the day after that, the clouds might have been swept away, leading to only a 50% chance of rain. So, overall, we get rain on 99 out of 100 days. The lazy weatherman simply predicts that on 99 out of 100 days it will rain, while the smart weatherman gives the true probabilities (i.e. 100% chance of rain tomorrow if no wind today, 50% chance of rain tomorrow if wind today). Consider the entropy of the two weathermen.</Paragraph> <Paragraph position="2"> The lazy weatherman always says &quot;There is a 99% chance of rain tomorrow&quot;; his average entropy is -(.99 log_2 .99 + .01 log_2 .01), about .081 bits.</Paragraph> <Paragraph position="4"> The smart weatherman, if there is no wind, says &quot;100% chance of rain tomorrow; my entropy is 0 bits.&quot; If there is wind, however, the smart weatherman says, &quot;50% chance of rain tomorrow; my entropy is 1 bit.&quot; Now, if today is windy, who should we trust?
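The arithmetic behind the parable can be checked directly; this is simply an illustration of the entropy numbers used in the text:

```python
import math

def bits(p):
    # cost in bits of an event predicted with probability p
    return -math.log2(p)

# Lazy weatherman: always predicts 99% rain. Averaged over all days
# (99 rainy, 1 dry in 100), his entropy is about .081 bits.
lazy_avg = 0.99 * bits(0.99) + 0.01 * bits(0.01)

# Smart weatherman: 0 bits on calm days; on windy days (rain half the
# time, predicted at 50%), exactly 1 bit.
smart_windy = 0.5 * bits(0.5) + 0.5 * bits(0.5)

# Restricted to windy days, however, the lazy forecast is far more
# expensive than the smart one, despite its lower overall average.
lazy_windy = 0.5 * bits(0.99) + 0.5 * bits(0.01)
```

So on windy days the lazy weatherman's true cost is over 3 bits per day, much worse than the smart weatherman's 1 bit.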
The smart weatherman, whose expected entropy is 1 bit, or the lazy weatherman, whose expected entropy is .08 bits, which looks much better?</Paragraph> <Paragraph position="5"> The decision list equivalent of this is as follows.</Paragraph> <Paragraph position="6"> Using the classic learner, we learn as follows. We have three questions: if TRUE then predict rain with probability .99 (expected entropy = .081). If NO WIND then predict rain with probability 1 (expected entropy = 0). If WIND then predict rain with probability 1/2 (expected entropy = 1). When we sort these by expected entropy, we get: IF NO WIND, output &quot;rain: 100%&quot; (entropy 0) ELSE IF TRUE, output &quot;rain: 99%&quot; (entropy .081) ELSE IF WIND, output &quot;rain: 50%&quot; (entropy 1) Of course, we never reach the third rule, and on windy days, we predict rain with probability .99! The two weathermen show what goes wrong with a naive algorithm; we can easily do much better. For the new algorithm, we start with a baseline question, the question which is always TRUE and predicts the unigram probabilities.</Paragraph> <Paragraph position="8"> Then, we find the question which, if asked before all other questions, would decrease entropy the most. This is repeated until the improvement falls below some minimum, epsilon.</Paragraph> <Paragraph position="9"> Figure 1 shows the new algorithm; the notation entropy(list) denotes the training entropy of a potential decision list, and entropy(prepend(q_i, list)) denotes the training entropy of the list with the rule for q_i prepended.</Paragraph> <Paragraph position="12"> Consider the Parable of the Two Weathermen. The new learning algorithm starts with the baseline: If TRUE then predict rain with probability 99% (entropy .081). Then it prepends the rule that reduces the entropy the most.
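The greedy prepend loop can be sketched in a few lines. This is our own minimal illustration, not the paper's Figure 1 implementation; for brevity it uses unsmoothed empirical probabilities, and the data and question names are our own encoding of the parable:

```python
import math

def entropy(dlist, data):
    # total bits the decision list assigns to data; the first true question fires
    total = 0.0
    for x, y in data:
        for q, probs in dlist:
            if q(x):
                total -= math.log2(max(probs.get(y, 0.0), 1e-12))
                break
    return total

def rule_for(q, data):
    # empirical (unsmoothed) output distribution among instances where q is true
    hits = [y for x, y in data if q(x)]
    return (q, {y: hits.count(y) / len(hits) for y in set(hits)})

def learn(questions, data, epsilon=0.0):
    # start with the always-TRUE baseline, then repeatedly prepend the
    # question that lowers training entropy the most, stopping when the
    # best improvement is no more than epsilon
    dlist = [rule_for(lambda x: True, data)]
    while True:
        best, best_ent = None, entropy(dlist, data)
        for q in questions:
            cand = [rule_for(q, data)] + dlist
            e = entropy(cand, data)
            if best_ent - e > epsilon:
                best, best_ent = cand, e
        if best is None:
            return dlist
        dlist = best

# Parable data: 98 calm rainy days, one windy rainy day, one windy dry day.
data = [("calm", "rain")] * 98 + [("wind", "rain"), ("wind", "dry")]
is_wind = lambda x: x == "wind"
is_calm = lambda x: x == "calm"
```

Running learn([is_wind, is_calm], data) first prepends the WIND rule and then the NO WIND rule, arriving at the three-rule list derived in the text, whose training entropy is 2 bits (1 bit for each of the two windy days).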
The entropy reduction from the question &quot;NO WIND&quot; is .081 x .99 = .08 bits, while the entropy for the question &quot;WIND&quot; is 1 bit for the new question, versus .5 x 1 + .5 x 6.64 = 3.82 bits for the old, for a reduction of 2.82 bits, so we prepend the &quot;WIND&quot; question.</Paragraph> <Paragraph position="14"> Finally, we learn (at the top of the list) that if &quot;NO WIND&quot;, then rain 100%, yielding the following decision list: IF NO WIND, output &quot;rain: 100%&quot; (entropy 0) ELSE IF WIND, output &quot;rain: 50%&quot; (entropy 1) ELSE IF TRUE, output &quot;rain: 99%&quot; (entropy .081) Of course, we never reach the third rule.</Paragraph> <Paragraph position="15"> Clearly, this decision list is better. Why did our entropy sorter fail us? Because sometimes a smart learner knows when it doesn't know, while a dumb rule, like our lazy weatherman who ignores the wind, doesn't know enough to know that in the current situation, the problem is harder than usual.</Paragraph> <Paragraph position="17"> (This means we are building the tree bottom up; it would be interesting to explore building the tree top-down, similar to a decision tree, which would probably also work well.)</Paragraph> <Paragraph position="18"> Unfortunately, the algorithm of Figure 1, if implemented in a straightforward way, will be extremely inefficient. The problem is the inner loop, which requires computing entropy(prepend(q_i, list)).</Paragraph> <Paragraph position="20"> The naive way of doing this is to run all of the training data through each possible decision list. In practice, the actual questions tend to be pairs or triples of simple questions.
For instance, an actual question might be &quot;Is the word before 'left' and the word after 'of'?&quot; Thus, the total number of questions can be very large, and running all the data through the possible new decision lists for each question would be extremely slow.</Paragraph> <Paragraph position="21"> Fortunately, we can precompute entropyReduce(i) and incrementally update it. In order to do so, we also need to compute, for each training instance x_j, the entropy with the current value of list.</Paragraph> <Paragraph position="23"> Furthermore, we store for each question q_i the list of instances x_j for which q_i(x_j) is true. With these changes, the algorithm runs very quickly. Figure 2 gives the efficient version of the new algorithm.</Paragraph> <Paragraph position="29"> [Figure 2 (pseudocode not reproduced here): the efficient algorithm first removes questions worse than TRUE, then incrementally updates entropyReduce over the training instances.]</Paragraph> <Paragraph position="31"> Note that this efficient version of the algorithm may consume a large amount of space, because of the need to store, for each question q_i, the instances for which the question is true. There are a number of speed-space tradeoffs one can make.</Paragraph> <Paragraph position="33"> For instance, one could restructure the update loop to exploit the form of the questions. Notice that each question q_i is actually written as a conjunction of simple questions, which we will denote Q_i,1, ..., Q_i,k.</Paragraph> <Paragraph position="37"> Assume that we store the list of instances that are true for each simple question Q_i,j.
Then we can write an update loop in which we first find the simple question with the smallest number of true instances, and loop over only these instances when finding the instances for which q_i is true.</Paragraph> <Paragraph position="41"/> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.3 Compromise Algorithm </SectionTitle> <Paragraph position="0"> Notice that the original algorithm can actually allow rules which make things worse. For instance, in our lazy weatherman example, we built this decision list:</Paragraph> </Section> </Section> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> IF NO WIND, output &quot;rain: 100%&quot; (entropy 0) ELSE IF TRUE, output &quot;rain: 99%&quot; (entropy .081) ELSE IF WIND, output &quot;rain: 50%&quot; (entropy 1) </SectionTitle> <Paragraph position="0"> Now, the second rule could simply be deleted, and the decision list would actually be much better (although in practice we never want to delete the &quot;TRUE&quot; question, to ensure that we always output some probability). Since the main reason to use decision lists is their understandability and small size, this optimization is worth doing even if the full implementation of the new algorithm is too complex.</Paragraph> <Paragraph position="1"> The compromise algorithm is displayed in Figure 3.</Paragraph> <Paragraph position="2"> When the value of epsilon is 0, only those rules that improve entropy on the training data are included. When the value of epsilon is negative infinity, all rules are included (the standard algorithm).
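Figure 3 itself is not reproduced here, but one way to realize this thresholding (our own reading of the compromise, not necessarily the paper's exact procedure) is to walk the entropy-sorted rules from last to first and keep a rule only when prepending it to the list built so far saves more than epsilon bits of training entropy:

```python
import math

def list_entropy(rules, data):
    # total bits the decision list assigns to data; the first true rule fires
    total = 0.0
    for x, y in data:
        for q, probs in rules:
            if q(x):
                total -= math.log2(max(probs.get(y, 0.0), 1e-12))
                break
    return total

def compromise(sorted_rules, fallback, data, epsilon=0.0):
    # sorted_rules: rules in predicted-entropy order (best first), excluding
    # the always-TRUE fallback, which is always kept so that the list
    # outputs some probability for every instance
    kept = [fallback]
    for rule in reversed(sorted_rules):
        cand = [rule] + kept
        if list_entropy(kept, data) - list_entropy(cand, data) > epsilon:
            kept = cand
    return kept

# Example: the parable's rules in sorted order, with the TRUE fallback.
data = [("calm", "rain")] * 98 + [("wind", "rain"), ("wind", "dry")]
rain99 = ((lambda x: True), {"rain": 0.99, "dry": 0.01})
calm_rule = ((lambda x: x == "calm"), {"rain": 1.0})
wind_rule = ((lambda x: x == "wind"), {"rain": 0.5, "dry": 0.5})
```

On this data, with epsilon = 0 both the NO WIND and WIND rules survive; with epsilon = 3 only the WIND rule does, since NO WIND saves only about 1.4 bits given that WIND is already in the list.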
Even when a benefit is predicted, this may be due to overfitting; we can get further improvements by setting the threshold to a higher value, such as 3, which means that only rules that save at least three bits - and thus are unlikely to lead to overfitting - are added.</Paragraph> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Previous Work </SectionTitle> <Paragraph position="0"> There has been a modest amount of previous work on improving probabilistic decision lists, as well as a fair amount of work in related fields, especially in transformation-based learning (Brill, 1995).</Paragraph> <Paragraph position="1"> First, we note that non-probabilistic decision lists and transformation-based learning (TBL) are actually very similar formalisms. In particular, as observed by Roth (1998), in the two-class case, they are identical. Non-probabilistic decision lists learn rules of the form &quot;If q_i then output y&quot;, while TBLs learn rules of the form &quot;If q_i then change the output to y&quot;, so that instances for which q_i is TRUE end up with value y. The other difference between decision lists and TBLs is the list ordering.</Paragraph> <Paragraph position="4"> With a two-class TBL, one goes through the rules from last-to-first, and finds the last one that applies.</Paragraph> <Paragraph position="5"> With a decision list, one goes through the list in order, and finds the first one that applies. Thus, in the two-class case, simply by changing rules of the form &quot;If q_i then change the output to y&quot; to &quot;If q_i then output y&quot;, and reversing the rule order, we can change any TBL to an equivalent non-probabilistic decision list, and vice-versa.</Paragraph> <Paragraph position="7"> Notice that our incremental algorithm is analogous to the algorithm used by TBLs: in TBLs, at each step, a rule is added that minimizes the training data error rate.
In our probabilistic decision list learner, at each step, a rule is added that minimizes the training data entropy.</Paragraph> <Paragraph position="8"> Roth notes that this equivalence does not hold in an important case: when the answers to questions are not static. For instance, in part-of-speech tagging (Brill, 1995), when the tag of one word is changed, it changes the answers to questions for nearby words.</Paragraph> <Paragraph position="9"> We call such problems &quot;dynamic.&quot; The near equivalence of TBLs and decision lists is important for two reasons. First, it shows the connection between our work and previous work. In particular, our new algorithm can be thought of as a probabilistic version of the Ramshaw and Marcus (1994) algorithm for speeding up TBLs. Just as that algorithm stores the expected error rate improvement of each question, our algorithm stores the expected entropy improvement. (Actually, the Ramshaw and Marcus algorithm is somewhat more complex, because it is able to deal with dynamic problems such as part-of-speech tagging.) Similarly, the space-efficient algorithm using compound questions at the end of Section 2.2 can be thought of as a static probabilistic version of the efficient TBL of Ngai and Florian (2001).</Paragraph> <Paragraph position="10"> The second reason that the connection to TBLs is important is that it shows us that probabilistic decision lists are a natural way to probabilize TBLs.</Paragraph> <Paragraph position="11"> Florian et al. (2000) showed one way to make probabilistic versions of TBLs, but the technique is somewhat complicated. It involved conversion to a decision tree, and then further growing of the tree. Their technique does have the advantage that it correctly handles the multi-class case. That is, by using a decision tree, it is relatively easy to incorporate the current state, while the decision list learner ignores that state.
However, this is not clearly an advantage - adding extra dependencies introduces data sparseness, and it is an empirical question whether dependencies on the current state are actually helpful. Our probabilistic decision lists can thus be thought of as a competitive way to probabilize TBLs, with the advantage of preserving the list structure and simplicity of TBL, and the possible disadvantage of losing the dependency on the current state.</Paragraph> <Paragraph position="12"> Yarowsky (1994) suggests two improvements to the standard algorithm. First, he suggests an optional, more complex smoothing algorithm than the one we applied. His technique involves estimating both a probability based on the global probability distribution for a question, and a local probability, given that no questions higher in the list were TRUE, and then interpolating between the two probabilities.</Paragraph> <Paragraph position="13"> He also suggests a pruning technique that eliminates 90% of the questions while losing 3% accuracy; as we will show in Section 4, our technique or variations eliminate an even larger percentage of questions while increasing accuracy. Yarowsky (2000) also considered changing the structure of decision lists to include a few splits at the top, thus combining the advantages of decision trees and decision lists.</Paragraph> <Paragraph position="14"> The combination of this hybrid decision list and the improved smoothing was the best performer among participating systems in the 1998 Senseval evaluation.</Paragraph> <Paragraph position="15"> Our technique could easily be combined with these techniques, presumably leading to even better results.</Paragraph> <Paragraph position="16"> However, since we build our decision lists from last to first, rather than first to last, the local probability is not available as the list is being built. But there is no reason we could not interpolate the local probability into a final list.
Similarly, in Yarowsky's technique, the local probability is also not available at the time the questions are sorted.</Paragraph> <Paragraph position="17"> Our algorithm can be thought of as a natural probabilistic version of a non-probabilistic decision list learner which prepends rules (Webb, 1994). One difficulty with that approach is ranking the rules. In the probabilistic framework, using entropy reduction and smoothing seems like a natural solution.</Paragraph> </Section> </Paper>