<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0402">
  <Title>An SVM Based Voting Algorithm with Application to Parse Reranking</Title>
  <Section position="6" start_page="2" end_page="2" type="evalu">
    <SectionTitle>
5 Experiments and Analysis
</SectionTitle>
    <Paragraph position="0"> We use SVMlight (Joachims, 1998) as the SVM classifier. The soft margin parameter C is set to its default value in SVMlight.</Paragraph>
    <Paragraph position="1"> We use the same data set as described in (Collins, 2000; Collins and Duffy, 2002). Section 2-21 of the Penn WSJ Treebank (Marcus et al., 1994) are used as training data, and section 23 is used for final test. The training data contains around 40,000 sentences, each of which has 27 distinct parses on average. Of the 40,000 training sentences, the first 36,000 are used to train SVMs. The remaining 4,000 sentences are used as development data.</Paragraph>
    <Paragraph position="2"> The training complexity for SVMlight is roughly O(n2.1) (Joachims, 1998), where n is the number of the training samples. One solution to the scaling difficulties is to use the Kernel Fisher Discriminant as described in (Salomon et al., 2002). In this paper, we divide training data into slices to speed up training. Each slice contains two pairs of parses from each sentence. Specifically, slice i contains positive samples (( ~pk,pki),+1) and negative samples ((pki, ~pk),[?]1), where ~pk is the best parse for sentence k, pki is the parse with the ith highest log-likelihood in all the parses for sentence k and it is not the best parse. There are about 60000 parses in each slice in average. For each slice, we train an SVM. Then results of SVMs are put together with a simple combination. It takes about 2 days to train a slice on a P3 1.13GHz processor. null As a result of this subdivision of the training data into slices, we cannot take advantage of SVM's global optimization ability. This seems to nullify our effort to create this new algorithm. However, our new algorithm is still useful for the following reasons. Firstly, with the improvement in the computing resources, we will be able to use larger slices so as to utilize more global optimization. SVMs are superior to other linear classifiers in theory. On the other hand, the current size of the slice is large enough for other NLP applications like text chunking, although it is not large enough for parse reranking. The last reason is that we have achieved state-of-the-art results even with the sliced data.</Paragraph>
    <Paragraph position="3"> We have used both a linear kernel and a tree kernel.</Paragraph>
    <Paragraph position="4"> For the linear kernel test, we have used the same dataset as that in (Collins, 2000). In this experiment, we first train 22 SVMs on 22 distinct slices. In order to combine those SVMs results, we have tried mapping SVMs' results to probabilities via a Sigmoid as described in (Platt, 1999).</Paragraph>
    <Paragraph position="5"> We use the development data to estimate parameter A and B in the Sigmoid</Paragraph>
    <Paragraph position="7"> LR/LP = labeled recall/precision. CBs = average number of crossing brackets per sentence. 0 CBs, 2 CBs are the percentage of sentences with 0 or [?] 2 crossing brackets respectively. CO99 = Model 2 of (Collins, 1999). CH00  where fi is the result of the ith SVM. The parse with maximal value of producttexti Pi(y = 1|fi) is chosen as the top-most parse. Experiments on the development data shows that the result is better if Ae[?]fiB is much larger than 1.</Paragraph>
    <Paragraph position="9"> Therefore, we may use summationtexti fi directly, and there is no need to estimate A and B in (14). Then we combine SVMs' result with the log-likelihood generated by the parser (Collins, 1999). Parameter b is used as the weight of the log-likelihood. In addition, we find that our SVM has greater labeled precision than labeled recall, which means that the system prefer parses with less brackets.</Paragraph>
    <Paragraph position="10"> So we take the number of brackets as another feature to be considered, with weight a. a and b are estimated on the development data.</Paragraph>
    <Paragraph position="11"> The result is shown in Table 1. The performance of our system matches the results of (Charniak, 2000), but is a little lower than the results of the Boosting system in (Collins, 2000), except that the percentage of sentences with no crossing brackets is 1% higher than that of (Collins, 2000). Since we have to divide data into slices, we cannot take full advantage of the margin maximiza- null trol the weight of log-likelihood given by the parser. The proper value of b depends on the size of training data.</Paragraph>
    <Paragraph position="12"> The best result does not improve much after combining 7 slices of training data. We think this is due the limitation of local optimization.</Paragraph>
    <Paragraph position="13"> Our next experiment is on the tree kernel as it is used in (Collins and Duffy, 2002). We have only trained 5 slices, since each slice takes about 2 weeks to train on a P3 1.13GHz processor. In addition, the speed of test for the tree kernel is much slower than that for the linear kernel. The experimental result is shown in Table 2 The results of our SVM system match the results of the Voted Perceptron algorithm in (Collins and Duffy, 2002), although only 5 slices, amounting to less than one fourth of the whole training dataset, have been used.</Paragraph>
  </Section>
class="xml-element"></Paper>