<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0402">
<Title>An SVM Based Voting Algorithm with Application to Parse Reranking</Title>
<Section position="3" start_page="2" end_page="2" type="metho">
<SectionTitle> 2 A New SVM-based Voting Algorithm </SectionTitle>
<Paragraph position="0"> Let $x_{ij}$ be the $j$th candidate parse for the $i$th sentence in the training data, and let $x_{i1}$ be the parse with the highest f-score among all the parses for the $i$th sentence.</Paragraph>
<Paragraph position="1"> We may take each $x_{i1}$ as a positive sample and each $x_{ij}$ ($j>1$) as a negative sample. However, experiments have shown that this is not the best way to utilize SVMs in reranking (Dijkstra, 2001). The trick used here is to take a pair of parses as a sample: for any $i$ and $j>1$, $(x_{i1},x_{ij})$ is a positive sample and $(x_{ij},x_{i1})$ is a negative sample.</Paragraph>
<Paragraph position="2"> A similar idea was employed in earlier work on parse reranking. In the boosting algorithm of (Collins, 2000), the margin of each sample (parse) $x_{ij}$ is defined as $F(x_{i1},\bar{\alpha}) - F(x_{ij},\bar{\alpha})$, where $F$ is a scoring function and $\bar{\alpha}$ is the parameter vector. In (Collins and Duffy, 2002), for each offending parse the parameter vector is updated as $w = w + h(x_{i1}) - h(x_{ij})$, where $w$ is the parameter vector and $h$ returns the feature vector of a parse. However, neither of these papers used a pair of parses as a sample or defined functions on pairs of parses. Furthermore, the advantage of using the difference between parses was not theoretically clarified; we address this in the next section.</Paragraph>
<Paragraph position="3"> As far as SVMs are concerned, using either single parses or pairs of parses as samples maximizes the margin between $x_{i1}$ and $x_{ij}$, but using a single parse as a sample imposes extra constraints on the selection of the decision function. These constraints are not necessary (see Section 3.3). Therefore the use of pairs of parses has both theoretical and practical advantages.</Paragraph>
<Paragraph position="4"> Now we need to define the kernel on pairs of parses.</Paragraph>
<Paragraph position="5"> Let $(t_1,t_2)$ and $(v_1,v_2)$ be two pairs of parses, and let $K$ be any kernel function on the space of single parses. The preference kernel $PK$ is defined in terms of $K$ as follows.</Paragraph>
<Paragraph position="7"> The preference kernel of this form was previously used in the context of ordinal regression in (Herbrich et al., 2000). Then the decision function is</Paragraph>
<Paragraph position="9"> where $x_j$ and $x_k$ are two distinct parses of a sentence, $(s_{i1},s_{i2})$ is the $i$th support vector, and $N_s$ is the total number of support vectors.</Paragraph>
<Paragraph position="10"> As we have defined them, the training samples are symmetric with respect to the origin of the feature space. Therefore, for any separating hyperplane that does not pass through the origin, we can always find a parallel hyperplane that passes through the origin and yields a larger margin. Hence the resulting separating hyperplane must pass through the origin, which means that $b=0$.</Paragraph>
<Paragraph position="11"> Therefore, for each test parse $x$, we only need to compute its score as follows.</Paragraph>
<Paragraph position="13"/>
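The displayed equations for the preference kernel, the decision function, and the score did not survive extraction. A plausible reconstruction, consistent with the derivation in Section 3.1 and with the preference kernel of (Herbrich et al., 2000), is the following; the exact notation in the original may differ.

    \[
    PK\bigl((t_1,t_2),(v_1,v_2)\bigr) = K(t_1,v_1) - K(t_1,v_2) - K(t_2,v_1) + K(t_2,v_2)
    \]
    \[
    f(x_j,x_k) = \operatorname{sgn}\Bigl(\sum_{i=1}^{N_s} \alpha_i y_i\, PK\bigl((s_{i1},s_{i2}),(x_j,x_k)\bigr) + b\Bigr)
    \]
    \[
    \operatorname{score}(x) = \sum_{i=1}^{N_s} \alpha_i y_i \bigl(K(s_{i1},x) - K(s_{i2},x)\bigr)
    \]

With $b=0$, these forms give $f(x_j,x_k) = \operatorname{sgn}(\operatorname{score}(x_j) - \operatorname{score}(x_k))$, so ranking candidate parses by their scores reproduces the pairwise decisions.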
<Section position="1" start_page="2" end_page="2" type="sub_section">
<SectionTitle> 2.1 Kernels </SectionTitle>
<Paragraph position="0"> In (6), the preference kernel $PK$ is defined in terms of a kernel $K$.</Paragraph>
<Paragraph position="1"> $K$ can be any kernel on single parses. We will show that $PK$ is well defined in the next section. In this paper, we consider two choices for $K$: the linear kernel and the tree kernel.</Paragraph>
<Paragraph position="2"> In (Collins, 2000), each parse is associated with a set of features, and a linear combination of the features is used in the decision function. In the SVM setting, we may encode the features of each parse as a vector and use the dot product as the kernel $K$. Let $u$ and $v$ be two parses. The computational complexity of the linear kernel is $O(|f_u| \cdot |f_v|)$, where $|f_u|$ and $|f_v|$ are the lengths of the feature vectors associated with parses $u$ and $v$ respectively. The advantage of the linear kernel is that it is very fast in the test phase, because the coefficients of the support vectors can be combined in advance. For a test parse $x$, the computational complexity of testing is only $O(|f_x|)$, which is independent of the number of support vectors.</Paragraph>
<Paragraph position="3"> In (Collins and Duffy, 2002), the tree kernel $Tr$ is used to count the total number of common sub-trees of two parse trees. Let $u$ and $v$ be two trees. Because $Tr$ can be computed by dynamic programming, the computational complexity of $Tr(u,v)$ is $O(|u| \cdot |v|)$, where $|u|$ and $|v|$ are the sizes of the trees $u$ and $v$ respectively. For a test parse $x$, the computational complexity of testing is $O(S \cdot |x|)$, where $S$ is the number of support vectors.</Paragraph>
</Section>
</Section>
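To make the complexity claims above concrete, here is a minimal sketch, not the authors' implementation; the helper names sv_pairs, alpha, y, feat, and K are assumptions introduced only for illustration. It scores a test parse with a general kernel and, for the linear kernel, with the support-vector coefficients combined into a single weight vector in advance.

import numpy as np

def score_general(x, sv_pairs, alpha, y, K):
    # score(x) = sum_i alpha_i * y_i * (K(s_i1, x) - K(s_i2, x));
    # every test parse needs kernel evaluations against all support-vector pairs.
    return sum(a * yi * (K(s1, x) - K(s2, x))
               for a, yi, (s1, s2) in zip(alpha, y, sv_pairs))

def linear_weight_vector(sv_pairs, alpha, y, feat):
    # For the linear kernel K(u, v) = feat(u) . feat(v), the expansion collapses:
    # w = sum_i alpha_i * y_i * (feat(s_i1) - feat(s_i2)), computed once offline.
    w = np.zeros(len(feat(sv_pairs[0][0])))
    for a, yi, (s1, s2) in zip(alpha, y, sv_pairs):
        w += a * yi * (feat(s1) - feat(s2))
    return w

def score_linear(x, w, feat):
    # With w precombined, scoring a test parse is a single dot product, O(|f_x|),
    # independent of the number of support vectors.
    return float(np.dot(w, feat(x)))

# Reranking then picks the candidate with the highest score, e.g.:
#   best = max(candidates, key=lambda x: score_linear(x, w, feat))

For the tree kernel no such precombination is possible, so each test parse must be compared against every support vector, which is the $O(S \cdot |x|)$ behaviour described above.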
<Section position="4" start_page="2" end_page="2" type="metho">
<SectionTitle> 3 Justifying the Algorithm </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="2" end_page="2" type="sub_section">
<SectionTitle> 3.1 Justifying the Kernel </SectionTitle>
<Paragraph position="0"> First, we show that the preference kernel $PK$ defined above is well defined. Suppose the kernel $K$ is defined on $T \times T$, so there exists a mapping $\Phi : T \mapsto H$ such that</Paragraph>
<Paragraph position="2"> It suffices to show that there exist a space $H'$ and a mapping $\Phi' : T \times T \mapsto H'$ such that</Paragraph>
<Paragraph position="4"> According to the definition of $PK$, we have</Paragraph>
<Paragraph position="6"/>
</Section>
<Section position="2" start_page="2" end_page="2" type="sub_section">
<SectionTitle> 3.2 Margin Bound for SVM-based Voting </SectionTitle>
<Paragraph position="0"> We will show that the expected error of voting is bounded from above in the PAC framework. The approach used here is analogous to the proof for ordinal regression (Herbrich et al., 2000). The key idea is to show the equivalence of the voting risk and the classification risk.</Paragraph>
<Paragraph position="1"> Let $X$ be the set of all parse trees. For each $x \in X$, let $\bar{x}$ be the best parse for the sentence associated with $x$. Thus the appropriate loss function for the voting problem is as follows.</Paragraph>
<Paragraph position="3"> where $f$ is a parse scoring function.</Paragraph>
<Paragraph position="4"> Let $E = \{(x,\bar{x}) \mid x \in X\} \cup \{(\bar{x},x) \mid x \in X\}$. $E$ is the event space of the classification problem.</Paragraph>
<Paragraph position="6"> For a parse scoring function $f$, let $g_f(x_1,x_2) \equiv \operatorname{sgn}(f(x_1) - f(x_2))$.</Paragraph>
<Paragraph position="7"> For the classifier $g_f$ on the space $E$, the loss function is</Paragraph>
<Paragraph position="9"> Therefore the expected risk $R_{vote}(f)$ of the voting problem is equivalent to the expected risk $R_{class}(g_f)$ of the classification problem.</Paragraph>
<Paragraph position="11"> However, the definition of the space $E$ violates the independently and identically distributed (iid) assumption.</Paragraph>
<Paragraph position="12"> Parses of the same sentence are not independent. If we suppose that no two pairs of parses come from the same sentence, then the iid assumption holds. In practice, the number of sentences is very large, i.e. more than 30,000.</Paragraph>
<Paragraph position="13"> So we may use more than one pair of parses from the same sentence and still assume the iid property approximately, because for any two arbitrary pairs of parses, in 29,999 cases out of 30,000 the two samples are independent.</Paragraph>
<Paragraph position="14"> Let $r \equiv \min_{i=1..n,\, j=2..m_i} |f(x_{i1}) - f(x_{ij})| = \min_{i=1..n,\, j=2..m_i} |g(x_{i1},x_{ij}) - 0|$. According to (11) and Theorem 1 in Section 1.2, we obtain the following theorem.</Paragraph>
<Paragraph position="15"> Theorem 2 If $g_f$ makes no error on the training data, then with confidence $1-\delta$</Paragraph>
<Paragraph position="17"/>
</Section>
<Section position="3" start_page="2" end_page="2" type="sub_section">
<SectionTitle> 3.3 Justifying Pairwise Samples </SectionTitle>
<Paragraph position="0"> An obvious way to use an SVM is to take each single parse, instead of a pair of parse trees, as a training sample. Only the best parse of each sentence is regarded as a positive sample, and all the rest are regarded as negative samples.</Paragraph>
<Paragraph position="1"> Like the pairwise system, this also maximizes the margin between the best parse of a sentence and all the incorrect parses of that sentence. Suppose $f$ is the function produced by the SVM. It is required to satisfy $y_{ij} f(x_{ij}) > 0$ for each sample $(x_{ij},y_{ij})$. However, this constraint is not necessary; we only need to guarantee that $f(x_{i1}) > f(x_{ij})$.</Paragraph>
<Paragraph position="2"> This is the reason for using pairs of parses as training samples instead of single parses.</Paragraph>
<Paragraph position="3"> We may rewrite the score function (7) as follows.</Paragraph>
<Paragraph position="5"> where $i$ is the index over sentences, $j$ is the index over parses, and $\forall i: \sum_j c_{i,j} = 0$.</Paragraph>
<Paragraph position="6"> The score in (13) has the same form as the decision function generated by an SVM trained on single parses as samples, but with the constraint that the coefficients related to the parses of the same sentence sum to 0. In this way we reduce the size of the hypothesis space, based on the prior knowledge that only the differing segments of two distinct parses determine which parse is better.</Paragraph>
</Section>
</Section>
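The rewritten score function that Section 3.3 calls (13) is also missing its displayed equation. Regrouping the support-vector expansion of the score by individual parses suggests the following form; this is a reconstruction, with $c_{i,j}$ the coefficient notation taken from the surrounding text rather than from the original equation.

    \[
    \operatorname{score}(x) = \sum_i \sum_j c_{i,j}\, K(x_{ij}, x),
    \qquad \forall i:\ \sum_j c_{i,j} = 0 .
    \]

Each support-vector pair built from sentence $i$ contributes $+\alpha y$ to the coefficient of its first parse and $-\alpha y$ to the coefficient of its second parse, so the coefficients attached to the parses of one sentence necessarily sum to zero.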
<Section position="5" start_page="2" end_page="2" type="metho">
<SectionTitle> 4 Related Work </SectionTitle>
<Paragraph position="0"> The use of pairs of parse trees in our model is analogous to the preference relation used in the ordinal regression algorithm of (Herbrich et al., 2000). In that work, pairs of objects are used as training samples. For example, let $(r_1,r_2,\ldots,r_m)$ be a list of objects in the training data, where $r_i$ has rank $i$; then the pairs of objects $(r_{i-1},r_i)$ are training samples. The preference kernel $PK$ in our paper has the same form as the preference kernel of (Herbrich et al., 2000).</Paragraph>
<Paragraph position="1"> However, the purpose of our model is different from that of the ordinal regression algorithm. Ordinal regression searches for a regression function over ordinal values, while our algorithm is designed to solve a voting problem. As a result, the two algorithms differ in the definition of the margin. In ordinal regression, the margin is $\min |f(r_i) - f(r_{i-1})|$, where $f$ is the regression function over ordinal values. In our algorithm, the margin is $\min |\operatorname{score}(x_{i1}) - \operatorname{score}(x_{ij})|$.</Paragraph>
<Paragraph position="2"> In (Kudo and Matsumoto, 2001), SVMs were employed for the NP chunking task, a typical labeling problem. However, they used a deterministic algorithm for decoding.</Paragraph>
<Paragraph position="3"> In (Collins, 2000), two reranking algorithms were proposed. In both models, the loss functions are computed directly on the feature space, and all the features are manually defined.</Paragraph>
<Paragraph position="4"> In (Collins and Duffy, 2002), the Voted Perceptron algorithm was used for parse reranking. It was shown in (Freund and Schapire, 1999; Graepel et al., 2001) that the error bound of the (voted) Perceptron is related to the margin existing in the training data, but these algorithms are not designed to maximize the margin. Variants of the Perceptron algorithm known as approximate maximal margin classifiers, such as PAM (Krauth and Mezard, 1987), ALMA (Gentile, 2001) and PAUM (Li et al., 2002), produce decision hyperplanes within a fixed ratio of the maximal margin. However, almost all of these algorithms are reported to be inferior to SVMs in accuracy, while being more efficient in training.</Paragraph>
<Paragraph position="5"> Furthermore, these variants of the Perceptron algorithm take advantage of a large margin existing in the training data. In NLP applications, however, the samples are usually not separable even when the kernel trick is used. SVMs can still be trained to maximize the margin in this case by using a soft margin.</Paragraph>
</Section>
</Paper>