<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1304">
  <Title>Coaxing Confidences from an Old Friend: Probabilistic Classifications from Transformation Rule Lists</Title>
  <Section position="4" start_page="26" end_page="26" type="metho">
    <SectionTitle>
2 Transformation rule lists
</SectionTitle>
    <Paragraph position="0"> The central idea of transformation-based learning is to learn an ordered list of rules which progressively improve upon the current state of the training set. An initial assignment is made based on simple statistics, and then rules are greedily learned to correct the mistakes, until no net improvement can be made.</Paragraph>
    <Paragraph position="1"> These definitions and notation will be used throughout the paper:  to ,.9). This can be as simple as the most common class label in the training set, or it can be the output from another classifier.</Paragraph>
    <Paragraph position="2"> * A set of allowable templates for rules. These templates determine the predicates the rules will test, and they have the biggest influence over the behavior of the system.</Paragraph>
    <Paragraph position="3"> * An objective function for learning. Unlike in many other learning algorithms, the objective function for TBL will typically optimize the evaluation function. An often-used method is the difference in performance resulting from applying the rule.</Paragraph>
    <Paragraph position="4"> At the beginning of the learning phase, the training set is first given an initial class assignment. The system then iteratively executes the following steps:  1. Generate all productive rules.</Paragraph>
    <Paragraph position="5"> 2. For each rule: (a) Apply to a copy of the most recent state of the training set.</Paragraph>
    <Paragraph position="6"> (b) Score the result using the objective function. null 3. Select the rule with the best score.</Paragraph>
    <Paragraph position="7"> 4. Apply the rule to the current state of the training set, updating it to reflect this change. 5. Stop if the score is smaller than some pre-set threshold T.</Paragraph>
    <Paragraph position="8"> 6. Repeat from Step 1.</Paragraph>
    <Paragraph position="9">  The system thus learns a list of rules in a greedy fashion, according to the objective function. When no rule that improves the current state of the training set beyond the pre-set threshold can be found, the training phase ends. During the evaluation phase, the evaluation set is initialized with the same initial class assignment. Each rule is then applied, in the order it was learned, to the evaluation set. The final classification is the one attained when all rules have been applied.</Paragraph>
  </Section>
  <Section position="5" start_page="26" end_page="28" type="metho">
    <SectionTitle>
3 Probability estimation with transformation rule lists
</SectionTitle>
    <Paragraph position="0"> transformation rule lists Rule lists are infamous for making hard decisions, decisions which adhere entirely to one possibility, excluding all others. These hard decisions are often accurate and outperform other types of classifiers in terms of exact-match accuracy, but because they do not have an associated probability, they give no hint as to when they might fail. In contrast, probabilistic systems make soft decisions by assigning a probability distribution over all possible classes.</Paragraph>
    <Paragraph position="1"> There are many applications where soft decisions prove useful. In situations such as active learning, where a small number of samples are selected for annotation, the probabilities can be used to determine which examples the classifier was most unsure of, and hence should provide the most extra information. A probabilistic system can also act as a filter for a more expensive system or a human expert when it is permitted to reject samples. Soft decision-making is also useful when the system is one of the components in a larger decision-malting process, as is the case in speech recognition systems (Bald et al., 1989), or in an ensemble system like AdaBoost (Freund and Schapire, 1997). There are many other applications in which a probabilistic classifier is necessary, and a non-probabHistic classifier cannot be used instead.</Paragraph>
    <Section position="1" start_page="26" end_page="27" type="sub_section">
      <SectionTitle>
3.1 Estimation via conversion to decision tree
</SectionTitle>
      <Paragraph position="0"> tree The method we propose to obtain probabilistic classifications from a transformation rule list involves dividing the samples into equivalence classes and computing distributions over each equivalence class. At any given point in time i, each sample z in the training set has an associated state si(z) = (z,~l). Let R(z) to be the set of rules r~ that applies to the state el(z),</Paragraph>
      <Paragraph position="2"> An equivalence class consists of all the samples z that have the same R(z). Class probability assignments are then estimated using statistics computed on the equivalence classes.</Paragraph>
      <Paragraph position="3">  An illustration of the conversion from a rule list to a decision tree is shown below. Table 1 shows an example transformation rule list. It is straightforward to convert this rule list into a decision pylon (Bahl et al., 1989)~. which can be used to represent all the possible sequences of labels assigned to a sample during the application of the TBL algorithm. The decision pylon associated with this particular rule list is displayed on the left side of Figure 1. The decision tree shown on the right side of Figure 1 is constructed such that the samples stored in any leaf have the same class label sequence as in the displayed decision pylon. In the decision pylon, &amp;quot;no&amp;quot; answers go straight down; in the decision tree, &amp;quot;yes&amp;quot; answers take the right branch. Note that a one rule in the transformation rule list can often correspond to more than one node in the decision tree.</Paragraph>
      <Paragraph position="4"> Initial label = A If Q1 and label=A then label+-B If Q2 and label=A then labele-B If Q3 and label=B then label~A  from Table 1 to a decision tree, The conversion from a transformation rule list to a decision tree is presented as a recursive procedure. The set of samples in the training set is transformed to a set of states by applying the initial class assignments. A node n is created for each of the initial class label assignments c and all states labeled c are assigned to n.</Paragraph>
      <Paragraph position="5"> The following recursive procedure is invoked with an initial &amp;quot;root&amp;quot; node, the complete set of states (from the corpus) and the whole sequence of rules learned during training:  1. If 7~ is empty, the end of the rule list has been reached. Create a leaf node, n, and estimate the probability class distribution based on the true classifications of the states in 13. Return n.</Paragraph>
      <Paragraph position="6"> 2. Let rj = (Irj,yj,j) be the lowest-indexed rule in 7~. Remove it from 7~.</Paragraph>
      <Paragraph position="7"> 3. Split the data in/3 using the predicate 7rj and the current hypothesis such that samples on which 7rj returns true are on the right of the split:</Paragraph>
      <Paragraph position="9"> 4. If IBLI &gt; K and IBRI &gt; K, the split is acceptable: (a) Create a new internal node, n; (b) Set the question: q(n) = 7rj; (c) Create the left child of n using a recursive call to RLTDT(BL, 7~); (d) Create the right child of n using a recursive call to RLTDT(BR, 7~); (e) Return node n.</Paragraph>
      <Paragraph position="10">  Otherwise, no split is performed using rj. Repeat from Step 1.</Paragraph>
      <Paragraph position="11"> The parameter K is a constant that determines the minimum weight that a leaf is permitted to have, effectively pruning the tree during construction. In all the experiments, K was set to 5.</Paragraph>
    </Section>
    <Section position="2" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
3.2 Further growth of the decision tree
</SectionTitle>
      <Paragraph position="0"> When a rule list is converted into a decision tree, there are often leaves that are inordinately heavy because they contain a large number of samples.</Paragraph>
      <Paragraph position="1"> Examples of such leaves are those containing samples which were never transformed by any of the rules in the rule list. These populations exist either because they could not be split up during the rule list learning without incurring a net penalty, or because any rule that acts on them has an objective function score of less than the threshold T. This is sub-optimal for estimation because when a large portion of the corpus falls into the same equivalence class, the distribution assigned to it reflects only the mean of those samples. The undesirable consequence is that all of those samples are given the same probability distribution.</Paragraph>
      <Paragraph position="2"> To ameliorate this problem, those samples are partitioned into smaller equivalence classes by further growing the decision tree. Since a decision tree does not place all the samples with the same current label into a single equivalence class, it does not get stuck in the same situation as a rule list m in which no change in the current state of corpus can be made without incurring a net loss in performance.</Paragraph>
      <Paragraph position="3">  Continuing to grow the decision tree that was converted from a rule list can be viewed from another angle. A highly accurate prefix tree for the final decision tree is created by tying questions together during the first phase of the growth process (TBL). Unlike traditional decision trees which select splitting questions for a node by looking only at the samples contained in the local node, this decision tree selects questions by looking at samples contained in all nodes on the frontier whose paths have a suM&lt; in common. An illustration of this phenomenon can be seen in Figure 1, where the choice to split on Question 3 was made from samples which tested false on the predicate of Question 1, together with samples which tested false on the predicate of Question 2. The result of this is that questions are chosen based on a much larger population than in standard decision tree growth, and therefore have a much greater chance of being useful and generalizable. This alleviates the problem of overpartitioning of data, which is a widely-recognized concern during decision tree growth.</Paragraph>
      <Paragraph position="4"> The decision tree obtained from this conversion can be grown further. When the rule list 7~ is exhausted at Step 1, instead of creating a leaf node, continue splitting the samples contained in the node with a decision tree induction algorithm. The splitting criterion used in the experiments is the information gain measure.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>