<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0211">
  <Title>Evaluating State-of-the-Art Treebank-style Parsers for Coh-Metrix and Other Learning Technology Environments</Title>
  <Section position="4" start_page="70" end_page="71" type="metho">
    <SectionTitle>
2 Evaluated Parsers
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="70" end_page="71" type="sub_section">
      <SectionTitle>
2.1 Apple Pie
</SectionTitle>
      <Paragraph position="0"> Apple Pie (AP) (Sekine and Grishman, 1995) extracts a grammar from PTB v.2 in which S and NP are the only true non-terminals (the others are included into the right-hand side of S and NP rules). The rules extracted from the PTB have S or NP on the left-hand side and a flat structure on the right-hand side, for instance S - NP VBX JJ. Each such rule has the most common structure in the PTB associated with it, and if the parser uses the rule it will generate its corresponding structure. The parser is a chart parser and factors grammar rules with common prefixes to reduce the number of active nodes. Although the underlying model of the parser is simple, it can't handle sentences over 40 words due to the large variety of linguistic  constructs in the PTB.</Paragraph>
    </Section>
    <Section position="2" start_page="71" end_page="71" type="sub_section">
      <SectionTitle>
2.2 Charniak's Parser
</SectionTitle>
      <Paragraph position="0"> Charniak presents a parser (CP) based on probabilities gathered from the WSJ part of the PTB (Charniak, 1997). It extracts the grammar and probabilities and with a standard context-free chart-parsing mechanism generates a set of possible parses for each sentence retaining the one with the highest probability (probabilities are not computed for all possible parses). The probabilities of an entire tree are computed bottomup. In (Charniak, 2000), he proposes a generative model based on a Markov-grammar. It uses a standard bottom-up, best-first probabilistic parser to first generate possible parses before ranking them with a probabilistic model.</Paragraph>
    </Section>
    <Section position="3" start_page="71" end_page="71" type="sub_section">
      <SectionTitle>
2.3 Collins's (Bikel's) Parser
</SectionTitle>
      <Paragraph position="0"> Collins's statistical parser (CBP; (Collins, 1997)), improved by Bikel (Bikel, 2004), is based on the probabilities between head-words in parse trees. It explicitly represents the parse probabilities in terms of basic syntactic relationships of these lexical heads. Collins defines a mapping from parse trees to sets of dependencies, on which he defines his statistical model. A set of rules defines a head-child for each node in the tree. The lexical head of the head-child of each node becomes the lexical head of the parent node. Associated with each node is a set of dependencies derived in the following way. For each non-head child, a dependency is added to the set where the dependency is identified by a triplet consisting of the non-head-child non-terminal, the parent non-terminal, and the head-child non-terminal. The parser is a CYKstyle dynamic programming chart parser.</Paragraph>
    </Section>
    <Section position="4" start_page="71" end_page="71" type="sub_section">
      <SectionTitle>
2.4 Stanford Parser
</SectionTitle>
      <Paragraph position="0"> The Stanford Parser (SP) is an unlexicalized parser that rivals state-of-the-art lexicalized ones (Klein and Manning, 2003). It uses a context-free grammar with state splits.</Paragraph>
      <Paragraph position="1"> The parsing algorithm is simpler, the grammar smaller and fewer parameters are needed for the estimation. It uses a CKY chart parser which exhaustively generates all possible parses for a sentence before it selects the highest probability tree. Here we used the default lexicalized version.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="71" end_page="74" type="metho">
    <SectionTitle>
3 Experiments and Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="71" end_page="72" type="sub_section">
      <SectionTitle>
3.1 Text Corpus
</SectionTitle>
      <Paragraph position="0"> We performed experiments on three data sets.</Paragraph>
      <Paragraph position="1"> First, we chose the norm for large scale parser evaluation, the 2416 sentences of WSJ section 23. Since parsers have different parameters that can be tuned leading to (slightly) different results we first report performance values on the standard data set and then use same parameter settings on the second data set for more reliable comparison.</Paragraph>
      <Paragraph position="2"> The second experiment is on a set of three narrative and four expository texts. The gold standard for this second data set was built manually by the authors starting from CP's as well as SP's output on those texts. The four texts used initially are two expository and two narrative texts of reasonable length for detailed evaluation:  An additional set of three texts was chosen from the Touchstone Applied Science Associates, Inc., (TASA) corpus with an average sentence length of 13.06 (overall TASA average) or higher.</Paragraph>
      <Paragraph position="3">  We also tested all four parsers for speed on a corpus of four texts chosen randomly from the Metametrix corpus of school text books, across high and low grade levels and across narrative and science texts (see Section 3.2.2).</Paragraph>
      <Paragraph position="4"> G4: 4th grade narrative text, 1,500 sentences,</Paragraph>
    </Section>
    <Section position="2" start_page="72" end_page="72" type="sub_section">
      <SectionTitle>
3.2 General Parser Evaluation Results
3.2.1 Accuracy
</SectionTitle>
      <Paragraph position="0"> The parameters file we used for evalb was the standard one that comes with the package.</Paragraph>
      <Paragraph position="1"> Some parsers are not robust, meaning that for some input they do not output anything, leading to empty lines that are not handled by the evaluator. Those parses had to be &amp;quot;aligned&amp;quot; with the gold standard files so that empty lines are eliminated from the output file together with their peers in the corresponding gold standard files.</Paragraph>
      <Paragraph position="2"> In Table 1 we report the performance values on Section 23 of WSJ. Table 2 shows the results for our own corpus. The table gives the average values of two test runs, one against the SP-based gold standard, the other against the CP-based gold standard, to counterbalance the bias of the standards. Note that CP and SP possibly still score high because of this bias. However, CBP is clearly a contender despite the bias, while AP is not.</Paragraph>
      <Paragraph position="3">  The reported metrics are Labelled Precision (LP) and Labelled Recall (LR). Let us denote by a the number of correct phrases in the output from a parser for a sentence, by b the number of incorrect phrases in the output and by c the number of phrases in the gold standard for the same sentence. LP is defined as a/(a+b) andLRisdefinedasa/c. A summary of the other dimensions of the evaluation is offered in  AP's performance is reported for sentences &lt; 40 words in length, 2,250 out of 2,416. SP is also not robust enough and the performance reported is only on 2,094 out of 2,416 sentences in section 23 of WSJ. because we were not able to find a bullet-proof parser so far, but we must recognize that some parsers are significantly more stable than others, namely CP and CBP. In terms of resources needed, the parsers are comparable, except for AP which uses less memory and processing time.</Paragraph>
      <Paragraph position="4"> The LP/LR of AP is significantly lower, partly due to its outputting partial trees for longer sentences. Overall, CP offers the best performance. Note in Table 1 that CP's tagging accuracy is worst among the three top parsers but still delivers best overall parsing results. This means that its parsing-only performance is slighstly better than the numbers in the table indicate.</Paragraph>
      <Paragraph position="5"> The numbers actually represent the tagging and parsing accuracy of the tested parsing systems.</Paragraph>
      <Paragraph position="6"> Nevertheless, this is what we would most likely want to know since one would prefer to input raw text as opposed to tagged text. If more finely grained comparisons of only the parsing aspects of the parsers are required, perfect tags extracted from PTB must be provided to measure performance.</Paragraph>
      <Paragraph position="7"> Table 4 shows average measures for each of the parsers on the PTB and seven expository and narrative texts in the second column and for expository and narrative in the fourth column. The third and fifth columns contain standard deviations for the previous columns, respectively. Here too, CP shows the best result.  All parsers ran on the same Linux Debian machine: P4 at 3.4GHz with 1.0GB of RAM.</Paragraph>
      <Paragraph position="8">  AP's and SP's high speeds can be explained to a large degree by their skipping longer sentences, the very ones that lead to the longer times for the other two candidates. Taking this into account, SP is clearly the fastest, but the large range of processing times need to be heeded.</Paragraph>
    </Section>
    <Section position="3" start_page="72" end_page="74" type="sub_section">
      <SectionTitle>
3.3 Directed Parser Evaluation Results
</SectionTitle>
      <Paragraph position="0"> This section reports the results of expert rating of texts for specific problems (see Section 1.3).</Paragraph>
      <Paragraph position="1"> The best results are produced by CP with an average of 88.69% output useable for Coh-Metrix 2.0 (Table 6). CP also produces good output  Some of the parsers also run under Windows.</Paragraph>
      <Paragraph position="2">  most consistently at a standard deviation over the seven texts of 8.86%. The other three candidates are clearly trailing behing, namely by between 5% (SP) and 11% (AP). The distribution of severe problems is comparable for all parsers.  As expected, longer sentences are more problematic for all parsers, as can be seen in Table 7. No significant trends in performance differences with respect to genre difference, narrative (Orlando, Moving, Betty03) vs. expository texts (Heat, Plants, Barron17, Olga91), were detected (cf. also speed results in Table 5). But we assume that the difference in average sentence length obscures any genre differences in our small sample.</Paragraph>
      <Paragraph position="3"> The most common non-fatal problems (type one) involved the well-documented adjunct attachment site issue, in particular for prepositional phrases ((Abney et al., 1999), (Brill and Resnik, 1994), (Collins and Brooks, 1995)) as well as adjectival phrases (Table 8)</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="74" end_page="75" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> Similar misattachment issues for adjuncts are encountered with adverbial phrases, but they were rare in our corpus. (PP = wrong attachment site for a prepositional phrase; ADV = wrong attachment site for an adverbial phrase; cNP = misparsed complex noun phrase; &amp;X = )</Paragraph>
    <Paragraph position="1"> Another common problem are deverbal nouns and denominal verbs, as well as -ing/VBG forms. They share surface forms leading to ambiguous part of speech assignments. For many Coh-Metrix 2.0 measures, most obviously temporal cohesion, it is necessary to be able to distinguish gerunds from gerundives and deverbal  ticularly detrimental in view of the important role of NPs in Coh-Metrix 2.0 measures. This pertains in particular to the mistagging/misparsing of complex NPs and the coordination of NPs. Parses with fatal problems are expected to produce useless results for algorithms operating with them. Wrong coordination is another notorious problem of parsers (cf. (Cremers, 1993), (Grootveld, 1994)). In our corpus we found 33 instances of miscoordination, of which 23 involved NPs. Postprocessing approaches that address these issues are currently under investigation.</Paragraph>
    <Paragraph position="2">  The paper presented the evaluation of freely available, Treebank-style, parsers. We offered a uniform evaluation for four parsers: Apple Pie, Charniak's, Collins/Bikel's, and the Stanford parser. A novelty of this work is the evaluation of the parsers along new dimensions such as stability and robustness and across genre, in particular narrative and expository. For the latter part we developed a gold standard for narrative and expository texts from the TASA corpus. No significant effect, not already captured by variation in sentence length, could be found here. Another novelty is the evaluation of the parsers with respect to particular error types that are anticipated to be problematic for a given use of the resulting parses. The reader is invited to have a closer look at the figures our tables provide. We lack the space in the present paper to discuss them in more detail. Overall, Charniak's parser emerged as the most succesful candidate of a parser to be integrated where learning technology requires syntactic information from real text in real time.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML