<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1033">
  <Title>Graph Transformations in Data-Driven Dependency Parsing</Title>
  <Section position="7" start_page="260" end_page="262" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> All experiments are based on PDT 1.0, which is divided into three data sets, a training set ([?]t), a development test set ([?]d), and an evaluation test set ([?]e). Table 1 shows the size of each data set, as well as the relative frequency of the specific constructions that are in focus here. Only 1.3% of all words in the training data are identified as auxiliary verbs (A), whereas coordination (S and C) is more common in PDT. This implies that coordination transformations are more likely to have a greater impact on overall accuracy compared to the verb group transformations.</Paragraph>
    <Paragraph position="1"> In the parsing experiments reported in sections 4.1-4.2, we use [?]t for training, [?]d for tuning, and [?]e for the final evaluation. The part-of-speech tagging used (both in training and testing) is the HMM tagging distributed with the treebank, with a tagging accuracy of 94.1%, and with the tagset compressed to 61 tags as in Collins et al. (1999).</Paragraph>
    <Paragraph position="3"> compared to [?]t MaltParser is used with the parsing algorithm of Nivre (2003) together with the feature model used for parsing Czech by Nivre and Nilsson (2005). In section 4.2 we use MBL, again with the same settings as Nivre and Nilsson (2005),3 and in section 4.2 we use SVM with a polynomial kernel of degree 2.4 The metrics for evaluation are the attachment score (AS) (labeled and unlabeled), i.e., the proportion of words that are assigned the correct head, and the exact match (EM) score (labeled and unlabeled), i.e., the proportion of sentences that are assigned a completely correct analysis.</Paragraph>
    <Paragraph position="4"> All tokens, including punctuation, are included in the evaluation scores. Statistical significance is assessed using McNemar's test.</Paragraph>
    <Section position="1" start_page="260" end_page="261" type="sub_section">
      <SectionTitle>
4.1 Experiment 1: Transformations
</SectionTitle>
      <Paragraph position="0"> The algorithms are fairly simple. In addition, there will always be a small proportion of syntactic constructions that do not follow the expected pattern.</Paragraph>
      <Paragraph position="1"> Hence, the transformation and inverse transformation will inevitably result in some distortion. In order to estimate the expected reduction in parsing accuracy due to this distortion, we first consider a pure treebank transformation experiment, where we compare t[?]1(t([?]t)) to [?]t, for all the different transformations t defined in the previous section. The results are shown in table 2.</Paragraph>
      <Paragraph position="2"> We see that, even though coordination is more frequent, verb groups are easier to handle.5 The  coordination version with the least loss of information (tc+) fails to recover the correct head for 0.4% of all words in [?]t.</Paragraph>
      <Paragraph position="3"> The difference between tc+ and tc is expected.</Paragraph>
      <Paragraph position="4"> However, in the next section this will be contrasted with the increased burden on the parser for tc+, since it is also responsible for selecting the correct dependency type for each arc among as many as 2* |R |types instead of |R|.</Paragraph>
    </Section>
    <Section position="2" start_page="261" end_page="261" type="sub_section">
      <SectionTitle>
4.2 Experiment 2: Parsing
</SectionTitle>
      <Paragraph position="0"> Parsing experiments are carried out in four steps (for a given transformation t):  1. Transform the training data set into t([?]t). 2. Train a parser p on t([?]t).</Paragraph>
      <Paragraph position="1"> 3. Parse a test set [?] using p with output p([?]). 4. Transform the parser output into t[?]1(p([?])).  Table 3 presents the results for a selection of transformations using MaltParser with MBL, tested on the evaluation test set [?]e with the untransformed data as baseline. Rows 2-5 show that transforming coordinate structures to MS improves parsing accuracy compared to the baseline, regardless of which transformation and inverse transformation are used. Moreover, the parser benefits from the verb group transformation, as seen in row 6.</Paragraph>
      <Paragraph position="2"> The final row shows the best combination of a coordination transformation with the verb group transformation, which amounts to an improvement of roughly two percentage points, or a ten percent overall error reduction, for unlabeled accuracy. All improvements over the baseline are statistically significant (McNemar's test) with respect to attachment score (labeled and unlabeled) and unlabeled exact match, with p &lt; 0.01 except for the unlabeled exact match score of the verb group transformation, where 0.01 &lt; p &lt; 0.05. For the labeled exact match, no differences are significant. The experimental results indicate that MS is more suitable than PS as the target representation for deterministic data-driven dependency parsing. A relevant question is of course why this is the case. A partial explanation may be found in the &amp;quot;short-dependency preference&amp;quot; exhibited by most parsers (Eisner and Smith, 2005), with MaltParser being no exception. The first row of table 4 shows the accuracy of the parser for different arc lengths under the baseline condition (i.e., with no transformations). We see that it performs very well on  formation; AS = attachment score, EM = exact match; U = unlabeled, L = labeled  short arcs, but that accuracy drops quite rapidly as the arcs get longer. This can be related to the mean arc length in [?]t, which is 2.59 in the untransformed version, 2.40 in tc([?]t) and 2.54 in tv([?]t). Rows 3-5 in table 4 show the distribution of arcs for different arc lengths in different versions of the data set. Both tc and tv make arcs shorter on average, which may facilitate the task for the parser.</Paragraph>
      <Paragraph position="3"> Another possible explanation is that learning is facilitated if similar constructions are represented similarly. For instance, it is probable that learning is made more difficult when a unit has different heads depending on whether it is part of a coordination or not.</Paragraph>
    </Section>
    <Section position="3" start_page="261" end_page="262" type="sub_section">
      <SectionTitle>
4.3 Experiment 3: Optimization
</SectionTitle>
      <Paragraph position="0"> In this section we combine the best results from the previous section with the graph transformations proposed by Nivre and Nilsson (2005) to recover non-projective dependencies. We write tp for the projectivization of training data and t[?]1p for the inverse transformation applied to the parser's  costly to train (Sagae and Lavie, 2005).</Paragraph>
      <Paragraph position="1"> Table 5 shows the results, for both MBL and SVM, of the baseline, the pure pseudo-projective parsing, and the combination of pseudo-projective parsing with PS-to-MS transformations. We see that pseudo-projective parsing brings a very consistent increase in accuracy of at least 1.5 percentage points, which is more than that reported by Nivre and Nilsson (2005), and that the addition of the PS-to-MS transformations increases accuracy with about the same margin. We also see that SVM outperforms MBL by about two percentage points across the board, and that the positive effect of the graph transformations is most pronounced for the unlabeled exact match score, where the improvement is more than five percentage points overall for both MBL and SVM.</Paragraph>
      <Paragraph position="2"> Table 6 gives a more detailed analysis of the parsing results for SVM, comparing the optimal parser to the baseline, and considering specifically the (unlabeled) precision and recall of the categories involved in coordination (separators S and conjuncts C) and verb groups (auxiliary verbs A and main verbs M). All figures indicate, without exception, that the transformations result in higher precision and recall for all directly involved words. (All differences are significant beyond the 0.01 level.) It is worth noting that the error reduction is actually higher for A and M than for S and C, although the former are less frequent.</Paragraph>
      <Paragraph position="3"> With respect to unlabeled attachment score, the results of the optimized parser are slightly below the best published results for a single parser. Hall and Nov'ak (2005) report a score of 85.1%, applying a corrective model to the output of Charniak's parser; McDonald and Pereira (2006) achieve a score of 85.2% using a second-order spanning tree algorithm. Using ensemble methods and a pool of different parsers, Zeman and VZabokrtsk'y (2005) attain a top score of 87.0%. For unlabeled exact match, our results are better than any previously reported results, including those of McDonald and Pereira (2006). (For the labeled scores, we are not aware of any comparable results in the literature.)</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>