<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0730"> <Title>Use of ',Support Vector Learning for Chunk Identification</Title> <Section position="2" start_page="0" end_page="142" type="metho"> <SectionTitle> 2 Support Vector Machines </SectionTitle> <Paragraph position="0"> Support Vector Machines (SVMs), first introduced by Vapnik (Cortes and Vapnik, 1995; Vapnik, 1995), are relatively new learning approaches for solving two-class pattern recognition problems. SVMs are well-known for their good generalization performance, and have been applied to many pattern recognition problems. In the field of natural language processing, SVMs are applied to text categorization, and are reported to have achieved high accuracy without falling into over-fitting even with a large number of words taken as the features (Joachims, 1998; Taira and Haruno, 1999) First of all, let us define the training data which belongs to either positive or negative class as follows: (Xl,YX),..., (Xl,Yl) Xi 6 R n, Yi 6 {+1,-1} xi is a feature vector of the i-th sample represented by an n dimensional vector, yi is the class (positive(+l) or negative(-1) class) label of the i-th data. In basic SVMs framework, we try to separate the positive and negative examples by hyperplane written as:</Paragraph> <Paragraph position="2"> SVMs find the &quot;optimal&quot; hyperplane (optimal parameter w, b) which separates the training data into two classes precisely. What &quot;optimal&quot; means? In order to define it, we need to consider the margin between two classes.</Paragraph> <Paragraph position="3"> Figures 1 illustrates this idea. The solid lines show two possible hyperplanes, each of which correctly separates the training data into two classes. The two dashed lines parallel to the separating hyperplane show the boundaries in which one can move the separating hyperplane without misclassification. We call the distance between each parallel dashed lines as margin.</Paragraph> <Paragraph position="4"> SVMs take a simple strategy that finds the separating hyperplane which maximizes its margin. Precisely, two dashed lines and margin (d) can be written as: (w. x) + b = :kl, d = 2111wll.</Paragraph> <Paragraph position="5"> SVMs can be regarded as an optimization problem; finding w and b which minimize \[\[w\[\[ under the constraints: yi\[(w * xi) + b\] > 1.</Paragraph> <Paragraph position="6"> Furthermore, SVMs have potential to cope with the linearly unseparable training data. We leave the details to (Vapnik, 1995), the optimization problems can be rewritten into a dual form, where all feature vectors appear in their dot product. By simply substituting every dot product of xi and xj in dual form with any Kernel function K(xl, xj), SVMs can handle non-linear hypotheses. Among the many kinds of Kernel functions available, we will focus on the d-th polynomial kernel:</Paragraph> <Paragraph position="8"> Use of d-th polynomial kernel function allows us to build an optimal separating hyperplane which takes into account all combination of features up to d.</Paragraph> <Paragraph position="9"> We believe SVMs have advantage over conventional statistical learning algorithms, such as Decision Tree, and Maximum Entropy Models, from the following two aspects: * SVMs have high generalization performance independent of dimension of feature vectors. Conventional algorithms require careful feature selection, which is usually optimized heuristically, to avoid overfitting. 
<Section position="3" start_page="142" end_page="143" type="metho"> <SectionTitle> 3 Approach for Chunk Identification </SectionTitle>
<Paragraph position="0"> The chunks in the CoNLL-2000 shared task are represented with an IOB-based model, in which every word is tagged with a chunk label extended with I (inside a chunk), O (outside a chunk) or B (inside a chunk whose preceding word is in another chunk). Each chunk type is combined with the I or B tag; for example, NP gives rise to two chunk labels, I-NP and B-NP. In the training data of the CoNLL-2000 shared task, we find 22 chunk labels when considering all combinations of IOB tags and chunk types (precisely, the number of combinations is 23, but we do not consider the I-LST tag since it does not appear in the training data). We simply formulate the chunking task as a classification problem over these 22 chunk labels.</Paragraph>
<Paragraph position="1"> Basically, SVMs are binary classifiers, so we must extend them to multi-class classifiers in order to classify these 22 chunk labels.</Paragraph>
<Paragraph position="2"> It is known that there are mainly two approaches for extending a binary classifier to a task with $K$ classes. The first, often used and typical, approach is "one class vs. all others": the idea is to build $K$ classifiers, each separating one class from all the others. The second approach is pairwise classification: the idea is to build $K \times (K - 1)/2$ classifiers considering all pairs of classes, with the final class decision given by their majority voting. We decided to construct pairwise classifiers for all pairs of chunk labels, so that the total number of classifiers becomes $22 \times 21 / 2 = 231$. The reasons we use pairwise classifiers are as follows:
* Some experiments report that a combination of pairwise classifiers performs better than $K$ one-vs-rest classifiers (Kreßel, 1999).</Paragraph>
<Paragraph position="3"> * The amount of training data for a pair is less than the amount of training data for separating one class from all the others.</Paragraph>
<Paragraph position="4"> For the features, we decided to use all the information available in the surrounding context, such as the words and their POS tags, as well as the chunk labels. More precisely, we use the following features to identify the chunk label $c_i$ of the $i$-th word:</Paragraph>
<Paragraph position="5"> $w_{i-2},\ w_{i-1},\ w_i,\ w_{i+1},\ w_{i+2};\quad t_{i-2},\ t_{i-1},\ t_i,\ t_{i+1},\ t_{i+2};\quad c_{i-2},\ c_{i-1}$ </Paragraph>
<Paragraph position="6"> where $w_i$ is the word at position $i$, $t_i$ is the POS tag of $w_i$, and $c_i$ is the (extended) chunk label of the $i$-th word. Since the chunk labels are not given in the test data, they are decided dynamically during the tagging of chunk labels.</Paragraph>
<Paragraph position="7"> This technique can be regarded as a sort of Dynamic Programming (DP) matching, in which the best answer is searched for by maximizing the total certainty score for the combination of tags.</Paragraph>
<Paragraph position="8"> In using DP matching, we decided to keep not all ambiguities but only a limited number of them.</Paragraph>
<Paragraph position="9"> This means that a beam search is employed, and only the top N candidates are kept in the search for the best chunk tags. The algorithm scans the test data from left to right and calls the SVM classifiers for all pairs of chunk tags to obtain a certainty score. We defined the certainty score as the number of votes for the class (tag) obtained through pairwise voting.</Paragraph>
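The decoding procedure just described can be illustrated with the following sketch of pairwise voting combined with a beam search over chunk tags. It is a simplified, hypothetical rendering rather than the authors' implementation: classifiers, make_features and beam_width are placeholder names, and in the real system the classifiers would be the 231 trained pairwise SVMs applied to the features defined above.

from itertools import combinations

def pairwise_vote(classifiers, labels, x):
    """Certainty score of each label = number of votes it receives
    when every pairwise classifier is consulted on example x."""
    votes = {label: 0 for label in labels}
    for pair in combinations(labels, 2):
        winner = classifiers[pair](x)      # returns one label of the pair
        votes[winner] += 1
    return votes

def beam_search(sentence, labels, classifiers, make_features, beam_width=5):
    """Left-to-right decoding that keeps only the top-N partial tag
    sequences, scored by the accumulated number of pairwise votes."""
    beam = [([], 0)]                       # (tags so far, total score)
    for i in range(len(sentence)):
        candidates = []
        for tags, score in beam:
            x = make_features(sentence, i, tags)
            votes = pairwise_vote(classifiers, labels, x)
            for label, v in votes.items():
                candidates.append((tags + [label], score + v))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]     # prune to the top N candidates
    return beam[0][0]                      # best-scoring tag sequence

# Tiny demo: dummy pairwise "classifiers" that always prefer the first
# label of the pair; the real system would plug in the trained SVMs.
if __name__ == "__main__":
    labels = ["B-NP", "I-NP", "O"]
    classifiers = {pair: (lambda x, p=pair: p[0])
                   for pair in combinations(labels, 2)}
    make_features = lambda s, i, tags: (s[i], tags[-1] if tags else "BOS")
    print(beam_search(["He", "reckons", "the", "deficit"],
                      labels, classifiers, make_features, beam_width=2))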
<Paragraph position="10"> Since SVMs are vector-based classifiers, they accept only numerical values for their features. To cope with this constraint, we simply expand all features into binary values taking either 0 or 1. By taking all words and POS tags appearing in the training data as features, the total dimension of the feature vectors becomes as large as 92837. In general, handling vectors of such huge dimension requires a vast amount of computation and memory. In fact, we can reduce this complexity considerably by holding only the indices and values of the non-zero elements, since the feature vectors are usually sparse and SVMs only require the evaluation of dot products between feature vectors for their training.</Paragraph>
<Paragraph position="11"> In addition, although we could apply a cut-off threshold on the number of occurrences in the training set, we decided to use everything, not only the POS tags but also the words themselves.</Paragraph>
<Paragraph position="12"> The reasons are that we simply do not want to employ this kind of "heuristics", and that SVMs are known to have good generalization performance even with very large feature sets.</Paragraph> </Section> </Paper>
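The sparse binary feature representation described in the final paragraphs can be sketched as follows. This is an illustration under assumed names (the SparseBinaryEncoder class is hypothetical, not part of the described system): each feature string such as 'w0=deficit' or 't-1=DT' receives its own dimension, and an example is stored only as the indices of its non-zero dimensions, which is what keeps the 92837-dimensional vectors tractable.

# Illustrative sketch of the sparse binary encoding: every distinct
# feature string becomes one dimension, and each example is kept only
# as the sorted list of dimensions whose value is 1.
class SparseBinaryEncoder:
    def __init__(self):
        self.index = {}                    # feature string -> dimension id

    def encode(self, features, grow=True):
        ids = []
        for f in features:
            if f not in self.index:
                if not grow:               # unseen feature at test time
                    continue
                self.index[f] = len(self.index)
            ids.append(self.index[f])
        return sorted(ids)

enc = SparseBinaryEncoder()
x = enc.encode(["w-1=the", "w0=deficit", "t-1=DT", "t0=NN", "c-1=B-NP"])
print(x, "non-zero elements out of", len(enc.index), "dimensions so far")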