<?xml version="1.0" standalone="yes"?>
<Paper uid="N01-1025">
  <Title>Chunking with Support Vector Machines</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Support Vector Machines
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Optimal Hyperplane
</SectionTitle>
      <Paragraph position="0"> Let us define the training samples each of which belongs either to positive or negative class as: a2a4a3a6a5a8a7a8a9a10a5a8a11a12a7a14a13a15a13a15a13a16a7a17a2a4a3a19a18a4a7a8a9a17a18a20a11a21a2a4a3a23a22a25a24a27a26a29a28a30a7a25a9a14a22a31a24a33a32a35a34a37a36a10a7a15a38a31a36a17a39a35a11a40a13 a3a41a22 is a feature vector of the a42 -th sample represented by an a43 dimensional vector. a9a14a22 is the class (positive(a34a37a36 ) or negative(a38a31a36 ) class) label of the a42 -th sample. a44 is the number of the given training sam- null ples. In the basic SVMs framework, we try to separate the positive and negative samples by a hyper-plane expressed as: a2a4a45a47a46a40a3a19a11a48a34a50a49a52a51a54a53a54a2a4a45a55a24a56a26a57a28a48a7a8a49a58a24 a26a59a11 . SVMs find an &amp;quot;optimal&amp;quot; hyperplane (i.e. an optimal parameter set for a45a60a7a61a49 ) which separates the training data into two classes. What does &amp;quot;optimal&amp;quot; mean? In order to define it, we need to consider the margin between two classes. Figure 1 illustrates this idea. Solid lines show two possible hyperplanes, each of which correctly separates the training data into two classes. Two dashed lines parallel to the separating hyperplane indicate the boundaries in which one can move the separating hyper-plane without any misclassification. We call the distance between those parallel dashed lines as margin. SVMs find the separating hyperplane which maximizes its margin. Precisely, two dashed lines and margin (a62 ) can be expressed as: a45a63a46a14a3a64a34a65a49a66a51</Paragraph>
      <Paragraph position="2"> To maximize this margin, we should minimize a73a61a45a75a73 . In other words, this problem becomes equivalent to solving the following optimization problem:</Paragraph>
      <Paragraph position="4"> The training samples which lie on either of two dashed lines are called support vectors. It is known that only the support vectors in given training data matter. This implies that we can obtain the same decision function even if we remove all training samples except for the extracted support vectors.</Paragraph>
      <Paragraph position="5"> In practice, even in the case where we cannot separate training data linearly because of some noise in the training data, etc, we can build the separating linear hyperplane by allowing some misclassifications. Though we omit the details here, we can build an optimal hyperplane by introducing a soft margin parameter a111 , which trades off between the training error and the magnitude of the margin.</Paragraph>
      <Paragraph position="6"> Furthermore, SVMs have a potential to carry out the non-linear classification. Though we leave the details to (Vapnik, 1998), the optimization problem can be rewritten into a dual form, where all feature vectors appear in their dot products. By simply substituting every dot product of a3a41a22 and a3a48a112 in dual form with a certain Kernel function a113 a2a4a3a23a22a114a7a102a3a48a112a8a11 , SVMs can handle non-linear hypotheses. Among many kinds of Kernel functions available, we will focus on the</Paragraph>
      <Paragraph position="8"> Use of a115 -th polynomial kernel functions allows us to build an optimal separating hyperplane which takes into account all combinations of features up to a115 .</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Generalization Ability of SVMs
</SectionTitle>
      <Paragraph position="0"> Statistical Learning Theory(Vapnik, 1998) states that training error (empirical risk) a124a58a125 and test error (risk) a124a127a126 hold the following theorem.</Paragraph>
      <Paragraph position="1"> Theorem 1 (Vapnik) If a128 a2 a128a78a129a54a44 a11 is the VC dimension of the class functions implemented by some machine learning algorithms, then for all functions of that class, with a probability of at least a36a130a38a132a131 , the risk is bounded by</Paragraph>
      <Paragraph position="3"> where a128 is a non-negative integer called the Vapnik Chervonenkis (VC) dimension, and is a measure of the complexity of the given decision function. The r.h.s. term of (1) is called VC bound. In order to minimize the risk, we have to minimize the empirical risk as well as VC dimension. It is known that the following theorem holds for VC dimension a128 and margin a62 (Vapnik, 1998).</Paragraph>
      <Paragraph position="4"> Theorem 2 (Vapnik) Suppose a43 as the dimension of given training samples a62 as the margin, and a139 as the smallest diameter which encloses all training sample, then VC dimension a128 of the SVMs are bounded by</Paragraph>
      <Paragraph position="6"> In order to minimize the VC dimension a128 , we have to maximize the margin a62 , which is exactly the strategy that SVMs take.</Paragraph>
      <Paragraph position="7"> Vapnik gives an alternative bound for the risk.</Paragraph>
      <Paragraph position="8"> Theorem 3 (Vapnik) Suppose a124 a18 is an error rate estimated by Leave-One-Out procedure, a124 a18 is bounded as</Paragraph>
      <Paragraph position="10"> Leave-One-Out procedure is a simple method to examine the risk of the decision function -- first by removing a single sample from the training data, we construct the decision function on the basis of the remaining training data, and then test the removed sample. In this fashion, we test all a44 samples of the training data using a44 different decision functions. (3) is a natural consequence bearing in mind that support vectors are the only factors contributing to the final decision function. Namely, when the every removed support vector becomes error in Leave-One-Out procedure, a124 a18 becomes the r.h.s. term of (3). In practice, it is known that this bound is less predictive than the VC bound.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Chunking
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Chunk representation
</SectionTitle>
      <Paragraph position="0"> There are mainly two types of representations for proper chunks. One is Inside/Outside representation, and the other is Start/End representation.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. Inside/Outside
</SectionTitle>
    <Paragraph position="0"> This representation was first introduced in (Ramshaw and Marcus, 1995), and has been applied for base NP chunking. This method uses the following set of three tags for representing proper chunks.</Paragraph>
    <Paragraph position="1"> I Current token is inside of a chunk.</Paragraph>
    <Paragraph position="2"> O Current token is outside of any chunk.</Paragraph>
    <Paragraph position="3"> B Current token is the beginning of a chunk which immediately follows another chunk.</Paragraph>
    <Paragraph position="4"> Tjong Kim Sang calls this method as IOB1 representation, and introduces three alternative versions -- IOB2,IOE1 and IOE2 (Tjong Kim Sang and Veenstra, 1999).</Paragraph>
    <Paragraph position="5"> IOB2 A B tag is given for every token which exists at the beginning of a chunk.</Paragraph>
    <Paragraph position="6"> Other tokens are the same as IOB1.</Paragraph>
    <Paragraph position="7">  IOE1 An E tag is used to mark the last token of a chunk immediately preceding another chunk.</Paragraph>
    <Paragraph position="8"> IOE2 An E tag is given for every token which exists at the end of a chunk.</Paragraph>
    <Paragraph position="9"> 2. Start/End  This method has been used for the Japanese named entity extraction task, and requires the following five tags for representing proper chunks(Uchimoto et al., 2000) 1.</Paragraph>
    <Paragraph position="10"> 1Originally, Uchimoto uses C/E/U/O/S representation. However we rename them as B/I/O/E/S for our purpose, since</Paragraph>
    <Paragraph position="12"> early I B I I B trading I I I E E in O O O O O busy I B I I B Hong I I I I I Kong I I E E E Monday B B I E S , O O O O O gold I B I E S was O O O O O  B Current token is the start of a chunk consisting of more than one token. E Current token is the end of a chunk consisting of more than one token. I Current token is a middle of a chunk consisting of more than two tokens. S Current token is a chunk consisting of only one token.</Paragraph>
    <Paragraph position="13"> O Current token is outside of any chunk. Examples of these five representations are shown in Table 1.</Paragraph>
    <Paragraph position="14"> If we have to identify the grammatical class of each chunk, we represent them by a pair of an I/O/B/E/S label and a class label. For example, in IOB2 representation, B-VP label is given to a token which represents the beginning of a verb base phrase (VP).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Chunking with SVMs
</SectionTitle>
      <Paragraph position="0"> Basically, SVMs are binary classifiers, thus we must extend SVMs to multi-class classifiers in order to classify three (B,I,O) or more (B,I,O,E,S) classes.</Paragraph>
      <Paragraph position="1"> There are two popular methods to extend a binary classification task to that of a113 classes. One is one class vs. all others. The idea is to build a113 classifiers so as to separate one class from all others. The other is pairwise classification. The idea is to build</Paragraph>
      <Paragraph position="3"> a38a33a36a15a11a102a70a68a69 classifiers considering all pairs of classes, and final decision is given by their weighted voting. There are a number of other methods to extend SVMs to multiclass classifiers. For example, Dietterich and Bakiri(Dietterich and Bakiri, 1995) and Allwein(Allwein et al., 2000) introduce a unifying framework for solving the multiclass problem we want to keep consistency with Inside/Start (B/I/O) representation. null by reducing them into binary models. However, we employ the simple pairwise classifiers because of the following reasons:  (1) In general, SVMs require a163 a2 a43 a92 a11a165a164 a163 a2 a43a72a166 a11  training cost (where a43 is the size of training data). Thus, if the size of training data for individual binary classifiers is small, we can significantly reduce the training cost. Although pairwise classifiers tend to build a larger number of binary classifiers, the training cost required for pairwise method is much more tractable compared to the one vs. all others.</Paragraph>
      <Paragraph position="4"> (2) Some experiments (Kressel, 1999) report that a combination of pairwise classifiers performs better than the one vs. all others.</Paragraph>
      <Paragraph position="5"> For the feature sets for actual training and classification of SVMs, we use all the information available in the surrounding context, such as the words, their part-of-speech tags as well as the chunk labels. More precisely, we give the following features to identify the chunk label a154a61a22 for the a42 -th word:  Here, a168 a22 is the word appearing at a42 -th position, a152a114a22 is the POS tag of a168 a22 , and a154a61a22 is the (extended) chunk label for a42 -th word. In addition, we can reverse the parsing direction (from right to left) by using two chunk tags which appear to the r.h.s. of the current token (a154 a22a80a170a41a5 a7a61a154 a22a80a170 a92 ). In this paper, we call the method which parses from left to right as forward parsing, and the method which parses from right to left as backward parsing.</Paragraph>
      <Paragraph position="6"> Since the preceding chunk labels (a154a8a22a4a169a48a5a15a7a8a154a61a22a4a169 a92 for forward parsing , a154a61a22a80a170a23a5a12a7a8a154a61a22a80a170 a92 for backward parsing) are not given in the test data, they are decided dynamically during the tagging of chunk labels. The technique can be regarded as a sort of Dynamic Programming (DP) matching, in which the best answer is searched by maximizing the total certainty score for the combination of tags. In using DP matching, we limit a number of ambiguities by applying beam  the number of votes for the class obtained through the pairwise voting is used as the certain score for beam search with width 5 (Kudo and Matsumoto, 2000a). In this paper, however, we apply deterministic method instead of applying beam search with keeping some ambiguities. The reason we apply deterministic method is that our further experiments and investigation for the selection of beam width shows that larger beam width dose not always give a significant improvement in the accuracy. Given our experiments, we conclude that satisfying accuracies can be obtained even with the deterministic parsing.</Paragraph>
      <Paragraph position="7"> Another reason for selecting the simpler setting is that the major purpose of this paper is to compare weighted voting schemes and to show an effective weighting method with the help of empirical risk estimation frameworks.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Weighted Voting
</SectionTitle>
      <Paragraph position="0"> Tjong Kim Sang et al. report that they achieve higher accuracy by applying weighted voting of systems which are trained using distinct chunk representations and different machine learning algorithms, such as MBL, ME and IGTree(Tjong Kim Sang, 2000a; Tjong Kim Sang et al., 2000). It is well-known that weighted voting scheme has a potential to maximize the margin between critical samples and the separating hyperplane, and produces a decision function with high generalization performance(Schapire et al., 1997). The boosting technique is a type of weighted voting scheme, and has been applied to many NLP problems such as parsing, part-of-speech tagging and text categorization. null In our experiments, in order to obtain higher accuracy, we also apply weighted voting of 8 SVM-based systems which are trained using distinct chunk representations. Before applying weighted voting method, first we need to decide the weights to be given to individual systems. We can obtain the best weights if we could obtain the accuracy for the &amp;quot;true&amp;quot; test data. However, it is impossible to estimate them. In boosting technique, the voting weights are given by the accuracy of the training data during the iteration of changing the frequency (distribution) of training data. However, we cannot use the accuracy of the training data for voting weights, since SVMs do not depend on the frequency (distribution) of training data, and can separate the training data without any mis-classification by selecting the appropriate kernel function and the soft margin parameter. In this paper, we introduce the following four weighting methods in our experiments: null</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. Uniform weights
</SectionTitle>
    <Paragraph position="0"> We give the same voting weight to all systems.</Paragraph>
    <Paragraph position="1"> This method is taken as the baseline for other weighting methods.</Paragraph>
    <Paragraph position="2">  By applying (1) and (2), we estimate the lower bound of accuracy for each system, and use the accuracy as a voting weight. The voting weight is calculated as: a168 a51a171a36a31a38a27a153 a111 a49a12a147</Paragraph>
    <Paragraph position="4"> The value of a139 , which represents the smallest diameter enclosing all of the training data, is approximated by the maximum distance from the origin.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. Leave-One-Out bound
</SectionTitle>
    <Paragraph position="0"> By using (3), we estimate the lower bound of the accuracy of a system. The voting weight is calculated as: a168 a51a105a36a127a38 a124 a18 .</Paragraph>
    <Paragraph position="1"> The procedure of our experiments is summarized  as follows: 1. We convert the training data into 4 representations (IOB1/IOB2/IOE1/IOE2).</Paragraph>
    <Paragraph position="2"> 2. We consider two parsing directions (Forward/Backward) for each representation, i.e.</Paragraph>
    <Paragraph position="4"> a69a130a51a54a173 systems for a single training data set.</Paragraph>
    <Paragraph position="5"> Then, we employ SVMs training using these  independent chunk representations.</Paragraph>
    <Paragraph position="6"> 3. After training, we examine the VC bound and  Leave-One-Out bound for each of 8 systems. As for cross validation, we employ the steps 1 and 2 for each divided training data, and obtain the weights.</Paragraph>
    <Paragraph position="7"> 4. We test these 8 systems with a separated test data set. Before employing weighted voting, we have to convert them into a uniform representation, since the tag sets used in individual 8 systems are different. For this purpose, we re-convert each of the estimated results into 4 representations (IOB1/IOB2/IOE2/IOE1).</Paragraph>
    <Paragraph position="8"> 5. We employ weighted voting of 8 systems with respect to the converted 4 uniform representations and the 4 voting schemes respectively. Finally, we have a172 (types of uniform representations) a161 4 (types of weights) a51a174a36a15a175 results for our experiments.</Paragraph>
    <Paragraph position="9"> Although we can use models with IOBES-F or IOBES-B representations for the committees for the weighted voting, we do not use them in our voting experiments. The reason is that the number of classes are different (3 vs. 5) and the estimated VC and LOO bound cannot straightforwardly be compared with other models that have three classes (IOB1/IOB2/IOE1/IOE2) under the same condition. We conduct experiments with IOBES-F and IOBES-B representations only to investigate how far the difference of various chunk representations would affect the actual chunking accuracies.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML