<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3239"> <Title>A Boosting Algorithm for Classification of Semi-Structured Text</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Classifier for Trees </SectionTitle> <Paragraph position="0"> We first assume that a text to be classified is represented as a labeled ordered tree. The problem we focus on can then be formalized as a general problem, called the tree classification problem.</Paragraph> <Paragraph position="1"> The tree classification problem is to induce a mapping $f(x): \mathcal{X} \rightarrow \{\pm 1\}$ from given training examples $T = \{\langle x_i, y_i \rangle\}_{i=1}^{L}$, where $x_i \in \mathcal{X}$ is a labeled ordered tree and $y_i \in \{\pm 1\}$ is the class label associated with each training example (we focus here on the problem of binary classification). The important characteristic is that the input example $x_i$ is represented not as a numerical feature vector (bag-of-words) but as a labeled ordered tree.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Preliminaries </SectionTitle> <Paragraph position="0"> Let us first introduce the definition of a labeled ordered tree (or simply, a tree) and the notation used for it.</Paragraph> <Paragraph position="1"> Definition 1 (Labeled ordered tree) A labeled ordered tree is a tree in which each node is associated with a label and is ordered among its siblings; that is, there are a first child, a second child, a third child, and so on.</Paragraph> <Paragraph position="2"> Definition 2 (Subtree) Let $t$ and $u$ be labeled ordered trees. We say that $t$ matches $u$, or $t$ is a subtree of $u$ (written $t \subseteq u$), if there exists a one-to-one function from the nodes of $t$ to the nodes of $u$ that (1) preserves the parent-daughter relation, (2) preserves the sibling relation, and (3) preserves the labels.</Paragraph> <Paragraph position="3"> We denote the number of nodes in $t$ as $|t|$. Figure 1 shows an example of a labeled ordered tree together with one of its subtrees and a tree that is not a subtree.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Decision Stumps </SectionTitle> <Paragraph position="0"> Decision stumps are simple classifiers in which the final decision is made by only a single hypothesis or feature. Boostexter (Schapire and Singer, 2000) uses word-based decision stumps for topic-based text classification.</Paragraph>
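<Paragraph position="1"> For illustration, a minimal sketch of such a word-based decision stump is given below. It is not taken from Boostexter or from this paper; the data layout (each example as a set of word tokens with a label in {+1, -1}) and the function names are assumptions made for this example, and the confidence-rated predictions used by Boostexter are omitted.

# Simplified sketch of a word-based decision stump: a rule (word, y) predicts
# y when the word occurs in the text and -y otherwise.

def stump_predict(word, y, text_words):
    """Predict +1 or -1 from the presence of a single word."""
    return y if word in text_words else -y

def train_stump(examples):
    """Pick the (word, y) rule with the lowest training error.

    examples: list of (set_of_words, label) pairs, label in {+1, -1}.
    """
    vocabulary = set().union(*(words for words, _ in examples))
    best_rule, best_error = None, float("inf")
    for word in vocabulary:
        for y in (+1, -1):
            errors = sum(1 for words, label in examples
                         if stump_predict(word, y, words) != label)
            if best_error > errors:
                best_rule, best_error = (word, y), errors
    return best_rule, best_error / len(examples)
</Paragraph>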
<Paragraph position="2"> To classify trees, we here extend the decision stump definition as follows.</Paragraph> <Paragraph position="3"> Definition 3 (Decision stumps for trees) Let $t$ and $x$ be labeled ordered trees and let $y \in \{\pm 1\}$ be a class label. A decision stump classifier for trees is given by $$h_{\langle t, y \rangle}(x) = \begin{cases} \;\;y & t \subseteq x \\ -y & \text{otherwise.} \end{cases}$$ The parameter for classification is the tuple $\langle t, y \rangle$, hereafter referred to as the rule of the decision stump.</Paragraph> <Paragraph position="4"> The decision stumps are trained to find the rule $\langle \hat{t}, \hat{y} \rangle$ that minimizes the error rate on the given training data $T = \{\langle x_i, y_i \rangle\}_{i=1}^{L}$: $$\langle \hat{t}, \hat{y} \rangle = \operatorname*{argmin}_{t \in \mathcal{F},\, y \in \{\pm 1\}} \frac{1}{L} \sum_{i=1}^{L} I\big(y_i \neq h_{\langle t, y \rangle}(x_i)\big), \qquad (1)$$ where $\mathcal{F}$ is the set of candidate subtrees, i.e., the subtrees occurring in the training data.</Paragraph> <Paragraph position="5"> The gain function for a rule $\langle t, y \rangle$ is defined as $$gain(\langle t, y \rangle) = \sum_{i=1}^{L} y_i\, h_{\langle t, y \rangle}(x_i). \qquad (2)$$ Using the gain, the search problem given in (1) becomes equivalent to the following problem: $$\langle \hat{t}, \hat{y} \rangle = \operatorname*{argmax}_{t \in \mathcal{F},\, y \in \{\pm 1\}} gain(\langle t, y \rangle).$$ In this paper, we will use the gain instead of the error rate for clarity.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Applying Boosting </SectionTitle> <Paragraph position="0"> The decision stump classifiers for trees are too inaccurate to be applied to real applications, since the final decision relies on the existence of a single tree. However, their accuracy can be boosted by the Boosting algorithm (Freund and Schapire, 1996; Schapire and Singer, 2000). Boosting repeatedly calls a given weak learner and finally produces a hypothesis $f$, which is a linear combination of the $K$ hypotheses produced by the weak learners, i.e., $f(x) = \mathrm{sgn}\big(\sum_{k=1}^{K} \alpha_k\, h_{\langle t_k, y_k \rangle}(x)\big)$.</Paragraph> <Paragraph position="1"> A weak learner is built at each iteration $k$ with a different distribution, or weight vector, $\mathbf{d}^{(k)} = (d^{(k)}_1, \ldots, d^{(k)}_L)$, where $\sum_{i=1}^{L} d^{(k)}_i = 1$ and $d^{(k)}_i \geq 0$.</Paragraph> <Paragraph position="2"> The weights are calculated in such a way that hard examples are focused on more than easier examples.</Paragraph> <Paragraph position="3"> To use the decision stumps as the weak learner of Boosting, we redefine the gain function (2) as follows: $$gain(\langle t, y \rangle) = \sum_{i=1}^{L} y_i\, d_i\, h_{\langle t, y \rangle}(x_i).$$</Paragraph> <Paragraph position="4"> There exist many variants of the Boosting algorithm; the original and best-known algorithm is AdaBoost (Freund and Schapire, 1996). We here use Arc-GV (Breiman, 1999) instead of AdaBoost, since Arc-GV asymptotically maximizes the margin and shows faster convergence to the optimal solution than AdaBoost.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Efficient Computation </SectionTitle> <Paragraph position="0"> In this section, we introduce an efficient and practical algorithm to find the optimal rule $\langle \hat{t}, \hat{y} \rangle$ from given training data. This problem is formally defined as follows.</Paragraph> <Paragraph position="1"> Problem 1 (Find Optimal Rule) Let $T = \{\langle x_1, y_1, d_1 \rangle, \ldots, \langle x_L, y_L, d_L \rangle\}$ be training data, where $x_i$ is a labeled ordered tree, $y_i \in \{\pm 1\}$ is the class label associated with $x_i$, and $d_i$ ($\sum_{i=1}^{L} d_i = 1$, $d_i \geq 0$) is a normalized weight assigned to $x_i$. Given $T$, find the optimal rule $\langle \hat{t}, \hat{y} \rangle$ that maximizes the gain, i.e., $$\langle \hat{t}, \hat{y} \rangle = \operatorname*{argmax}_{t \in \mathcal{F},\, y \in \{\pm 1\}} gain(\langle t, y \rangle), \quad \text{where } gain(\langle t, y \rangle) = \sum_{i=1}^{L} y_i\, d_i\, h_{\langle t, y \rangle}(x_i).$$</Paragraph> <Paragraph position="2"> The most naive and exhaustive method, in which we first enumerate the entire set of subtrees $\mathcal{F}$ and then calculate the gain for every subtree, is usually impractical, since the number of subtrees is exponential in the size of the trees.</Paragraph>
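<Paragraph position="3"> To make this naive baseline concrete, the following sketch enumerates every candidate subtree and computes the weighted gain of each rule, returning the maximizer. It is an illustration only, not code from the paper; the helpers enumerate_subtrees and is_subtree, which are assumed to implement Definition 2, and the (tree, label, weight) data layout are assumptions made for this example.

# Sketch of the exhaustive search for the optimal rule.
# examples: list of (tree, label, weight), labels in {+1, -1}, weights summing to 1.

def weighted_gain(t, y, examples, is_subtree):
    """Gain of the rule (t, y): sum_i of y_i * d_i * h(x_i)."""
    return sum(label * weight * (y if is_subtree(t, tree) else -y)
               for tree, label, weight in examples)

def find_optimal_rule_naive(examples, enumerate_subtrees, is_subtree):
    """Brute force: try every subtree occurring in the data and both labels."""
    candidates = []                      # duplicates are possible here, which is
    for tree, _, _ in examples:          # exactly what rightmost extension avoids
        candidates.extend(enumerate_subtrees(tree))
    best_rule, best_gain = None, float("-inf")
    for t in candidates:
        for y in (+1, -1):
            g = weighted_gain(t, y, examples, is_subtree)
            if g > best_gain:
                best_rule, best_gain = (t, y), g
    return best_rule, best_gain
</Paragraph>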
<Paragraph position="4"> We thus adopt an alternative strategy that avoids such an exhaustive enumeration.</Paragraph> <Paragraph position="5"> The method to find the optimal rule is modeled as a variant of the branch-and-bound algorithm and is summarized in the following strategies:
1. Define a canonical search space in which the whole set of subtrees of a set of trees can be enumerated.
2. Find the optimal rule by traversing this search space.
3. Prune the search space by a criterion based on an upper bound of the gain.
We will describe these steps more precisely in the following subsections.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Efficient Enumeration of Trees </SectionTitle> <Paragraph position="0"> Abe and Zaki independently proposed an efficient method, rightmost extension, for enumerating all subtrees of a given tree (Abe et al., 2002; Zaki, 2002). The algorithm starts with a set of trees consisting of single nodes and then expands a given tree of size $(k-1)$ by attaching a new node to it to obtain trees of size $k$. However, it would be inefficient to expand nodes at arbitrary positions of the tree, as duplicated enumeration is inevitable.</Paragraph> <Paragraph position="1"> Rightmost extension avoids such duplicated enumeration by restricting the position of attachment. We here give the definition of rightmost extension to describe this restriction in detail.</Paragraph> <Paragraph position="2"> Definition 4 (Rightmost extension; Abe et al., 2002; Zaki, 2002) Let $t$ and $t'$ be labeled ordered trees. We say $t'$ is a rightmost extension of $t$ if and only if $t$ and $t'$ satisfy the following three conditions: (1) $t'$ is created by adding a single node to $t$ (i.e., $t \subseteq t'$ and $|t| + 1 = |t'|$); (2) the node is added to a node on the unique path from the root to the rightmost leaf (the rightmost path) of $t$; (3) the node is added as the rightmost sibling.</Paragraph> <Paragraph position="3"> Consider Figure 2, which illustrates an example tree $t$ with labels drawn from the set $\mathcal{L} = \{a, b, c\}$. For the sake of convenience, each node in this figure carries its original number (depth-first enumeration). The rightmost path of the tree $t$ is $(a(c(b)))$, occurring at positions 1, 4, and 6, respectively. The set of rightmost extended trees is then enumerated by simply adding a single node to a node on the rightmost path. Since there are three nodes on the rightmost path and the size of the label set is 3 ($=|\mathcal{L}|$), a total of 9 trees are enumerated from the original tree $t$. Note that rightmost extension preserves the prefix ordering of the nodes in $t$ (i.e., the nodes at positions $1 \ldots |t|$ are preserved). By repeating the process of rightmost extension recursively, we can create a search space in which all trees drawn from the label set $\mathcal{L}$ are enumerated. Figure 3 shows a snapshot of such a search space.</Paragraph>
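<Paragraph position="4"> As an illustration of this enumeration scheme, the following sketch generates all rightmost extensions of a tree over a given label set. It is not the paper's implementation; the representation of a node as [label, children] and the function names are assumptions made for this example.

import copy

def rightmost_path(tree):
    """Return the nodes on the path from the root to the rightmost leaf."""
    path, node = [], tree
    while True:
        path.append(node)
        _label, children = node
        if not children:
            return path
        node = children[-1]              # always descend into the rightmost child

def rightmost_extensions(tree, labels):
    """Enumerate every tree obtained by attaching one new rightmost child
    to a node on the rightmost path, for every label in `labels`."""
    results = []
    depth = len(rightmost_path(tree))
    for position in range(depth):        # choose a node on the rightmost path
        for label in labels:
            extended = copy.deepcopy(tree)
            node = rightmost_path(extended)[position]
            node[1].append([label, []])  # attach as the new rightmost sibling
            results.append(extended)
    return results

# A single-node tree over the label set {a, b, c} yields 3 extensions; the tree
# of Figure 2, whose rightmost path has 3 nodes, yields 3 x 3 = 9 extensions.
</Paragraph>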
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Upper Bound of the Gain </SectionTitle> <Paragraph position="0"> Rightmost extension defines a canonical search space in which one can enumerate all subtrees of a given set of trees. We here consider an upper bound of the gain that allows subspaces of this canonical search space to be pruned. The following theorem, an extension of Morishita (Morishita, 2002), gives a convenient way of computing a tight upper bound on $gain(\langle t', y \rangle)$ for any super-tree $t'$ of $t$.</Paragraph> <Paragraph position="1"> Theorem 1 (Upper bound of the gain $\mu(t)$) For any $t' \supseteq t$ and $y \in \{\pm 1\}$, the gain of $\langle t', y \rangle$ is bounded by $\mu(t)$ (i.e., $gain(\langle t', y \rangle) \leq \mu(t)$), where $\mu(t)$ is given by $$\mu(t) = \max\Big(\, 2 \sum_{\{i \,\mid\, y_i = +1,\ t \subseteq x_i\}} d_i - \sum_{i=1}^{L} y_i d_i,\;\; 2 \sum_{\{i \,\mid\, y_i = -1,\ t \subseteq x_i\}} d_i + \sum_{i=1}^{L} y_i d_i \Big).$$</Paragraph> <Paragraph position="2"> We can efficiently prune the search space spanned by rightmost extension using the upper bound of the gain $\mu(t)$. During the traversal of the subtree lattice built by the recursive process of rightmost extension, we always maintain the temporary suboptimal gain $\tau$, the largest gain calculated so far. If $\mu(t) < \tau$, the gain of any super-tree $t' \supseteq t$ is no greater than $\tau$, and therefore we can safely prune the search space spanned from the subtree $t$. If $\mu(t) \geq \tau$, in contrast, we cannot prune this space, since there might exist a super-tree $t' \supseteq t$ such that $gain(\langle t', y \rangle) \geq \tau$. We can also prune the space with respect to the expanded single node $s$: even if $\mu(t) \geq \tau$, when a node $s$ is attached to the tree $t$ we can ignore the space spanned from the resulting tree $t'$ if $\mu(s) < \tau$, since no super-tree of $s$ can then yield the optimal gain.</Paragraph> <Paragraph position="3"> [Figure: pseudo-code of the algorithm Find Optimal Rule. Its argument is the training data $T = \{\langle x_1, y_1, d_1 \rangle, \ldots, \langle x_L, y_L, d_L \rangle\}$ ($x_i$ a tree, $y_i \in \{\pm 1\}$ a class label, and $d_i$ a weight with $\sum_{i=1}^{L} d_i = 1$, $d_i \geq 0$), and it returns the optimal rule $\langle \hat{t}, \hat{y} \rangle$. Starting from single-node trees, it recursively expands each candidate tree $t$ into its rightmost extensions, updates the best rule found so far, and prunes using the bounds $\mu(t)$ and $\mu(s)$, where $s$ is the single node added by the rightmost extension.]</Paragraph>
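<Paragraph position="4"> A compact sketch of this pruned search is given below. It is an illustration of the strategy just described, not the paper's reference implementation; it assumes weighted_gain, is_subtree, and rightmost_extensions from the earlier sketches are in scope, takes the single-node seed trees and the label set as arguments, and omits the additional pruning test on the newly attached node s for brevity.

def gain_upper_bound(t, examples):
    """mu(t): bound on the gain of any rule whose tree contains t (Theorem 1)."""
    total = sum(label * weight for _tree, label, weight in examples)
    pos = sum(weight for tree, label, weight in examples
              if label == +1 and is_subtree(t, tree))
    neg = sum(weight for tree, label, weight in examples
              if label == -1 and is_subtree(t, tree))
    return max(2 * pos - total, 2 * neg + total)

def find_optimal_rule(examples, labels, single_node_trees):
    """Traverse the rightmost-extension lattice, keeping the best rule found
    so far and pruning any subtree t whose bound mu(t) falls below it."""
    best_rule, best_gain = None, float("-inf")

    def visit(t):
        nonlocal best_rule, best_gain
        if not any(is_subtree(t, tree) for tree, _, _ in examples):
            return                       # t occurs in no example, so it is not in F
        for y in (+1, -1):
            g = weighted_gain(t, y, examples, is_subtree)
            if g > best_gain:
                best_rule, best_gain = (t, y), g
        # expand only if some super-tree of t could still reach the best gain
        if gain_upper_bound(t, examples) >= best_gain:
            for extended in rightmost_extensions(t, labels):
                visit(extended)

    for seed in single_node_trees:       # trees consisting of a single node
        visit(seed)
    return best_rule, best_gain
</Paragraph>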
</Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Relation to SVMs with Tree Kernel </SectionTitle> <Paragraph position="0"> Recent studies (Breiman, 1999; Schapire et al., 1997; Rätsch et al., 2001) have shown that both Boosting and SVMs (Boser et al., 1992) follow a similar strategy: constructing an optimal hypothesis that maximizes the smallest margin between the positive and negative examples. We here describe the connection between our Boosting algorithm and SVMs with tree kernel (Collins and Duffy, 2002; Kashima and Koyanagi, 2002).</Paragraph> <Paragraph position="1"> The tree kernel is a convolution kernel that implicitly maps an example represented as a labeled ordered tree into the space of all subtrees. The implicit mapping defined by the tree kernel is given as $\Phi(x) = (I(t_1 \subseteq x), \ldots, I(t_{|\mathcal{F}|} \subseteq x))$, where $t_j \in \mathcal{F}$, $x \in \mathcal{X}$, and $I(\cdot)$ is the indicator function. (Strictly speaking, the tree kernel uses the cardinality of each substructure. However, this makes little difference, since a given tree is often sparse in NLP and the cardinality of a substructure is well approximated by its existence.)</Paragraph> <Paragraph position="2"> The final hypothesis of SVMs with tree kernel can be given by $$f(x) = \mathrm{sgn}\Big(\sum_{i=1}^{L} \alpha_i\, \Phi(x_i) \cdot \Phi(x) + b\Big).$$ Similarly, the final hypothesis of our Boosting algorithm can be reformulated as a linear classifier in the same feature space: $$f(x) = \mathrm{sgn}\Big(\sum_{j=1}^{|\mathcal{F}|} w_j\, I(t_j \subseteq x) + b\Big),$$ where the weight $w_j$ aggregates the coefficients $\alpha_k y_k$ of the decision stumps whose tree is $t_j$ and $b$ is a constant bias. We can thus see that both algorithms are essentially the same in terms of their feature space. The difference between them is the metric of the margin: the margin of Boosting is measured in the $\ell_1$-norm, while that of SVMs is measured in the $\ell_2$-norm. The question one might ask is how this difference is expressed in practice. The difference can be explained by sparseness.</Paragraph> <Paragraph position="3"> It is well known that the solution, or separating hyperplane, of SVMs is expressed as a linear combination of the training examples with some coefficients $\alpha$, i.e., $w = \sum_{i=1}^{L} \alpha_i \Phi(x_i)$. Maximizing the $\ell_2$-norm margin gives a sparse solution in the example space, i.e., most of the $\alpha_i$ become 0. The examples with non-zero coefficients, called support vectors, form the final solution. Boosting, in contrast, performs the computation explicitly in the feature space. The concept behind Boosting is that only a few hypotheses are needed to express the final solution; the $\ell_1$-norm margin allows us to realize this property. Boosting thus finds a sparse solution in the feature space.</Paragraph> <Paragraph position="4"> The accuracies of these two methods depend on the given training data. However, we argue that Boosting has the following practical advantages.</Paragraph> <Paragraph position="5"> First, sparse hypotheses allow us to build an efficient classification algorithm. The classification complexity of SVMs with tree kernel is $O(L'|N_1||N_2|)$, where $N_1$ and $N_2$ are trees and $L'$ is the number of support vectors; this is too heavy for real applications. Boosting, in contrast, runs faster, since the complexity depends only on the small number of decision stumps. Second, sparse hypotheses are useful in practice as they provide "transparent" models with which we can analyze how the model performs or what kinds of features are useful. It is difficult to give such an analysis with kernel methods, since they define the feature space implicitly.</Paragraph> </Section> </Paper>