<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2006">
  <Title>Evaluating the Accuracy of an Unlexicalized Statistical Parser on the PARC DepBank</Title>
  <Section position="7" start_page="44" end_page="47" type="evalu">
    <SectionTitle>
4.2 Results
</SectionTitle>
    <Paragraph position="0"> Our parser produced rooted sentential analyses for a proportion of the test sentences roughly comparable to that of the XLE as reported by King et al.</Paragraph>
    <Paragraph position="1"> Grammatical coverage is higher than this since some of the test sentences are elliptical or fragmentary, but in many cases these are recognized as single complete constituents. Kaplan et al. report that the complete XLE system finds rooted analyses for 79% of section 23 of the WSJ but do not report coverage just for the test sentences. The XLE parser uses several performance optimizations which mean that processing of sub-analyses in longer sentences can be curtailed or preempted, so it is not clear what proportion of the remaining data is outside grammatical coverage. Table 1 shows accuracy results for each individual relation and feature, starting with the GR bilexical relations in the extended DepBank, followed by most DepBank features reported by Kaplan et al., and finally overall macro- and microaverages. The macroaverage is calculated by taking the average of each measure over the individual relations and features; the microaverage measures are calculated from the aggregated counts for all relations and features.4 Indentation of GRs shows the degree of specificity of the relation. Thus, mod scores are microaveraged over the counts for the five fully specified modifier relations listed immediately after it in Table 1. This allows comparison of, for instance, overall accuracy on modifiers with overall accuracy on arguments. Figures in italics to the right are discussed in the next section.</Paragraph>
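To make the two averaging schemes concrete, the following is a minimal Python sketch (not the authors' evaluation code; the relation names and counts are invented for illustration) of macroaveraged versus microaveraged precision, recall and F1 over per-relation counts. The microaverage over a group of relations corresponds to the aggregated mod-style scores described above.

    # Illustrative sketch: macro- vs micro-averaged P/R/F1 from per-relation counts.
    # Relation names and counts are hypothetical, not figures from the paper.
    def prf(matched, predicted, gold):
        p = matched / predicted if predicted else 0.0
        r = matched / gold if gold else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        return p, r, f

    # counts[relation] = (matched, predicted, gold)
    counts = {
        "ncmod": (800, 1000, 1100),
        "xmod": (40, 60, 80),
        "cmod": (50, 70, 90),
        "pmod": (30, 50, 70),
        "detmod": (900, 950, 960),
    }

    # Macroaverage: compute each measure per relation, then average the measures.
    per_relation = [prf(*c) for c in counts.values()]
    macro_p, macro_r, macro_f = (sum(vals) / len(per_relation) for vals in zip(*per_relation))

    # Microaverage: pool the raw counts first, so frequent relations dominate;
    # an aggregated score such as 'mod' is computed in the same way from the
    # counts of its fully specified subrelations.
    micro_p, micro_r, micro_f = prf(*map(sum, zip(*counts.values())))

    print(f"macro P/R/F1: {macro_p:.3f}/{macro_r:.3f}/{macro_f:.3f}")
    print(f"micro P/R/F1: {micro_p:.3f}/{micro_r:.3f}/{micro_f:.3f}")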
    <Paragraph position="2"> Kaplan et al.'s microaveraged scores for Collins' Model 3 and the cut-down and complete versions of the XLE parser are given in Table 2, along with the microaveraged scores for our parser from Table 1. Our system's accuracy results (evaluated on the reannotated DepBank) are better than those for Collins and the cut-down XLE, and very similar overall to the complete XLE (evaluated on DepBank). Speed of processing is also very competitive.5 These results demonstrate that a statistical parser with roughly state-of-the-art accuracy can be constructed without the need for large in-domain treebanks. However, the performance of the system, as measured by microaveraged F1-score on GR extraction alone, has declined by 2.7% relative to the held-out Susanne data, so even the unlexicalized parser is by no means domain-independent.</Paragraph>
    <Section position="1" start_page="45" end_page="47" type="sub_section">
      <SectionTitle>
4.3 Evaluation Issues
</SectionTitle>
      <Paragraph position="0"> The DepBank num feature on nouns is evaluated by Kaplan et al. on the grounds that it is semantically relevant for applications. There are over 5K num features in DepBank, so the overall microaveraged scores for a system will be significantly affected by accuracy on num. We expected our system, which incorporates a tagger with good empirical (97.1%) accuracy on the test data, to recover this feature with 95% accuracy or better, as it will correlate with the noun tags NNx1 and NNx2 in the tagset. (Footnote 5: reported times cover morphology and parsing, including module startup overheads; allowing for slightly different CPUs, this is 2.5-10 times faster than the Collins and XLE parsers, as reported by Kaplan et al.)</Paragraph>
      <Paragraph position="1"> However, DepBank treats the majority of prenominal modifiers as adjectives rather than nouns and therefore associates them with an adegree rather than a num feature. The PoS tag selected depends primarily on the relative lexical probabilities of each tag for a given lexical item recorded in the tagger lexicon. But, regardless of this lexical decision, the correct GR is recovered, and neither adegree(positive) nor num(sg) adds anything semantically relevant when the lexical item is a nominal premodifier. A strategy which only provided a num feature for nominal heads would be more semantically relevant and would also yield higher precision (95.2%).</Paragraph>
      <Paragraph position="2"> However, recall (48.4%) then suffers against DepBank, as noun premodifiers there do have a num feature.</Paragraph>
      <Paragraph position="3"> Therefore, in the results presented in Table 1 we have not counted cases where either DepBank or our system assigns a premodifier adegree(positive) or num(sg).</Paragraph>
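As a rough illustration of the filtering just described, the sketch below (invented tokens; not the paper's data) drops adegree(positive) and num(sg) on noun premodifiers from both the gold and system sets before scoring, which is why these cases no longer affect precision or recall.

    # Hypothetical feature tuples illustrating the premodifier num/adegree clash;
    # the words and the premodifier list are invented for this example.
    PREMODIFIERS = {"state"}          # e.g. "state" used as a noun premodifier

    gold = {                          # DepBank-style: premodifier treated as adjective
        ("state", "adegree", "positive"),
        ("parser", "num", "sg"),
    }
    system = {                        # our-system-style: premodifier tagged as a noun
        ("state", "num", "sg"),
        ("parser", "num", "sg"),
    }

    def keep(item):
        """Filtering rule from the text: ignore adegree(positive)/num(sg) on premodifiers."""
        word, feat, val = item
        return not (word in PREMODIFIERS and (feat, val) in {("adegree", "positive"), ("num", "sg")})

    gold_kept = set(filter(keep, gold))
    system_kept = set(filter(keep, system))
    matched = gold_kept & system_kept

    precision = len(matched) / len(system_kept)
    recall = len(matched) / len(gold_kept)
    print(precision, recall)          # 1.0 1.0 once the premodifier cases are excluded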
      <Paragraph position="4"> There are similar issues with other DepBank features and relations. For instance, the form of a subordinator with clausal complements is annotated as a relation between verb and subordinator, while there is a separate comp relation between verb and complement head. The GR representation adds the subordinator as a subtype of ccomp, recording essentially identical information in a single relation. So evaluation scores based on aggregated counts of correct decisions will be doubled for a system which structures this information as in DepBank. However, reproducing the exact DepBank subord form relation from the GR ccomp one is non-trivial, because DepBank treats modal auxiliaries as syntactic heads while the GR scheme treats the main verb as head in all ccomp relations. We have not attempted to compensate for any further such discrepancies other than the one discussed in the previous paragraph. However, we do believe that they collectively damage scores for our system.</Paragraph>
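To make the representational difference concrete, here is a small hypothetical sketch of the two encodings for a clausal complement: DepBank splits the information into a comp relation plus a separate subordinator-form relation, while the GR scheme folds the subordinator into a single subtyped ccomp. The example sentence, tuple layouts and conversion helper are assumptions for illustration, and the helper deliberately ignores the modal-auxiliary head mismatch that makes the real mapping non-trivial.

    # Hypothetical encodings for "He said that she left"; neither structure is
    # taken from the actual corpora.

    # DepBank-style: two relations, with the subordinator annotated separately.
    depbank = [
        ("comp", "said", "left"),           # verb -> complement head
        ("subord_form", "left", "that"),    # complement verb -> subordinator
    ]

    # GR-style: one ccomp relation subtyped by the subordinator.
    gr = [
        ("ccomp", "that", "said", "left"),  # (relation, subtype, head, dependent)
    ]

    def gr_to_depbank(gr_rels):
        """Naive conversion of subtyped ccomp GRs into DepBank-style pairs.
        Ignores the head-choice mismatch noted in the text (DepBank takes modal
        auxiliaries as heads), which is why the real mapping is non-trivial."""
        out = []
        for rel, subtype, head, dep in gr_rels:
            if rel == "ccomp":
                out.append(("comp", head, dep))
                if subtype != "_":          # '_' for "no overt subordinator" is an assumption
                    out.append(("subord_form", dep, subtype))
        return out

    print(gr_to_depbank(gr) == depbank)     # True for this simple case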
      <Paragraph position="5"> As King et al. note, it is difficult to identify such informational redundancies to avoid double counting and to eradicate all system-specific biases. However, reporting precision, recall and F1-scores for each relation and feature separately, and microaveraging these scores on the basis of a hierarchy as in our GR scheme, ameliorates many of these problems and gives a better indication of the strengths and weaknesses of a particular parser, which may also be useful when deciding on its suitability for a specific application. Unfortunately, Kaplan et al. do not report their results broken down by relation or feature, so it is not possible, for example, on the basis of the arguments made above, to choose to compare the performance of our system on ccomp to theirs for comp, ignoring subord form. King et al. do report individual results for selected features and relations from an evaluation of the complete XLE parser on all 700 DepBank sentences, with an almost identical overall microaveraged F1 score of 79.5%, suggesting that these results provide a reasonably accurate idea of the XLE parser's relative performance on different features and relations.</Paragraph>
      <Paragraph position="6"> Where we believe that the information captured by a DepBank feature or relation is roughly comparable to that expressed by a GR in our extended DepBank, we have included King et al.'s scores in the rightmost column in Table 1 for comparison purposes. Even if these features and relations were drawn from the same experiment, however, they would still not be exactly comparable. For instance, as discussed in Section 3, nearly half (just over 1K) of the DepBank subj relations include pro as one element, mostly double counting a corresponding xcomp relation. On the other hand, our ta relation syntactically underspecifies many DepBank adjunct relations. Nevertheless, it is possible to see, for instance, that while both parsers perform badly on second objects ours is worse, presumably because of lack of lexical subcategorization information.</Paragraph>
    </Section>
  </Section>
</Paper>