<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1020"> <Title>Extracting Word Sequence Correspondences with Support Vector Machines</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Support Vector Machines </SectionTitle> <Paragraph position="0"> SVMs are binary classifiers which linearly separate d dimension vectors to two classes. Each vector represents the sample which has d features. It is distinguished whether given sample ~x = (x1;x2;:::;xd) belongs toX1 orX2 by equation (1) :</Paragraph> <Paragraph position="2"> where g(~x) is the hyperplain which separates two classes in which ~w and b are decided by optimization. null g(~x)=~w ~x+b (2) Let supervise signals for the training samples be expressed as</Paragraph> <Paragraph position="4"> whereX1 is a set of positive samples andX2 is a set of negative samples.</Paragraph> <Paragraph position="5"> If the training samples can be separated linearly, there could exist two or more pairs of ~w and b that</Paragraph> <Paragraph position="7"/> <Paragraph position="9"> Figure 1 shows that the hyperplain which separates the samples. In this figure, solid line shows separating hyperplain ~w ~x+b = 0 and two dotted lines show hyperplains expressed by ~w ~x+b= 1.</Paragraph> <Paragraph position="10"> The constraints (3) mean that any vectors must not exist inside two dotted lines. The vectors on dotted lines are called support vectors and the distance between dotted lines is called a margin, which equals to 2=jj~wjj.</Paragraph> <Paragraph position="11"> The learning algorithm for SVMs could optimize ~w and b which maximize the margin 2=jj~wjjor minimize jj~wjj2=2 subject to constraints (3). According to Lagrange's theory, the optimization problem is transformed to minimizing the Lagrangian L :</Paragraph> <Paragraph position="13"> where i 0 (i = 1;:::;n) are the Lagrange multipliers. By di erentiating with respect to ~w and b, the following relations are obtained,</Paragraph> <Paragraph position="15"> Consequently, the optimization problem is transformed to maximizing the object function D subject to Pni=1 iyi = 0 and i 0. For the optimal parameters = arg max D, each training sample ~xi where i >0 is corresponding to support vector.</Paragraph> <Paragraph position="16"> ~w can be obtained from equation (5) and b can be obtained from b=yi ~w ~xi where ~xi is an arbitrary support vector. From equation (2) (5), the optimal hyperplain can be expressed as the following equation with optimal parameters</Paragraph> <Paragraph position="18"> The training samples could be allowed in some degree to enter the inside of the margin by changing</Paragraph> <Paragraph position="20"> where i 0 are called slack variables. At this time, the maximal margin problem is enhanced as minimizing jj~wjj2=2+CPni=1 i, where C expresses the weight of errors. As a result, the problem is to maximize the object function D subject toPni=1 iyi =0 and 0 i C.</Paragraph> <Paragraph position="21"> For the training samples which cannot be separated linearly, they might be separated linearly in higher dimension by mapping them using a non-linear function: : Rd 7!Rd0 A linear separating in Rd0 for (~x) is same as a non-linear separating in Rd for~x. Let satisfy</Paragraph> <Paragraph position="23"> where K(~x;~x0) is called kernel function. 
<Paragraph position="8"> With this kernel function, the objective function is rewritten as

  D(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\vec{x}_i, \vec{x}_j)   (11)

and the optimal hyperplane as

  g(\vec{x}) = \sum_{i=1}^{n} \alpha_i^* y_i K(\vec{x}_i, \vec{x}) + b^*   (12)
</Paragraph>
<Paragraph position="9"> Note that $\Phi$ does not appear in equations (11) and (12). Therefore, we need not calculate $\Phi$ in the higher dimension.</Paragraph>
<Paragraph position="10"> Well-known kernel functions are the polynomial kernel function (13) and the Gaussian kernel function (14):

  K(\vec{x}, \vec{x}') = (\vec{x} \cdot \vec{x}' + 1)^{p}   (13)

  K(\vec{x}, \vec{x}') = \exp\left( - \frac{\|\vec{x} - \vec{x}'\|^2}{2\sigma^2} \right)   (14)
</Paragraph>
<Paragraph position="11"> A non-linear separation using one of these kernel functions corresponds to a separation that takes the dependencies between the features in $R^d$ into account.</Paragraph>
</Section>
</Paper>
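As a companion to the kernel functions (13) and (14), the following sketch (again not from the paper; NumPy only, with illustrative multipliers, labels, bias, and data) computes the polynomial and Gaussian kernels directly and evaluates the kernelized decision function of equation (12).

  # Minimal sketch (not from the paper): kernel functions and the kernelized
  # decision function g(x) = sum_i alpha_i y_i K(x_i, x) + b, equation (12).
  # NumPy only; alpha, b, and the data below are assumed values for illustration.
  import numpy as np

  def polynomial_kernel(x, x2, p=2):
      # Equation (13): K(x, x') = (x . x' + 1)^p
      return (np.dot(x, x2) + 1.0) ** p

  def gaussian_kernel(x, x2, sigma=1.0):
      # Equation (14): K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
      return np.exp(-np.sum((x - x2) ** 2) / (2.0 * sigma ** 2))

  def decision(x, support_x, alpha, y, b, kernel):
      # Sum over the support vectors only (alpha_i = 0 for all other samples).
      return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, support_x)) + b

  # Toy support vectors with assumed multipliers and labels.
  support_x = np.array([[2.0, 2.0], [0.5, 0.0]])
  alpha = np.array([0.8, 0.8])
  y = np.array([+1, -1])
  b = -1.0

  x_new = np.array([1.8, 1.9])
  for name, k in [("polynomial", polynomial_kernel), ("gaussian", gaussian_kernel)]:
      g = decision(x_new, support_x, alpha, y, b, k)
      print(name, "g(x_new) =", g, "->", "X1" if g >= 0 else "X2")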