File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/98/p98-1034_evalu.xml
Size: 10,250 bytes
Last Modified: 2025-10-06 14:00:27
<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1034"> <Title>Error-Driven Pruning of Treebank Grammars for Base Noun Phrase Identification</Title> <Section position="5" start_page="221" end_page="223" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> To evaluate the treebank approach to base NP identification, we created two base NP corpora. Each is derived from the Penn Treebank WSJ. The first corpus attempts to duplicate the base NPs used in the Ramshaw & Marcus (R&M) study. The second corpus contains slightly less complicated base NPs -- base NPs that are better suited for use with our sentence analyzer, Empire.2 By evaluating on both corpora, we can measure the effect of noun phrase complexity on the treebank approach to base NP identification. In particular, we hypothesize that the treebank approach will be most appropriate when the base NPs are sufficiently simple.</Paragraph> <Paragraph position="1"> For all experiments, we derived the training, pruning, and testing sets from the 25 sections of the Wall Street Journal distributed with the Penn Treebank II. All experiments employ 5-fold cross-validation.</Paragraph> <Paragraph position="2"> More specifically, in each of five runs, a different fold is used for testing the final, pruned rule set; three of the remaining folds comprise the training corpus (to create the initial rule set); and the final partition is the pruning corpus (to prune bad rules from the initial rule set). All results are averages across the five folds. Performance is measured in terms of precision and recall. Precision was described earlier -- it is a standard measure of accuracy. Recall, on the other hand, is an attempt to measure coverage: P = (# of correct proposed NPs) / (# of proposed NPs); R = (# of correct proposed NPs) / (# of NPs in the annotated text). Table 1 summarizes the performance of the treebank approach to base NP identification on the R&M and Empire corpora using the initial and pruned rule sets. The first column of results shows the performance of the initial, unpruned base NP grammar. The next two columns show the performance of the automatically pruned rule sets. The final column indicates the performance of rule sets that had been pruned using the handcrafted pruning heuristics. As expected, the initial rule set performs quite poorly. Both automated approaches provide significant increases in both recall and precision. In addition, they outperform the rule set pruned using handcrafted pruning heuristics.</Paragraph>
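To make the cross-validation setup and the scoring step above concrete, the following Python sketch shows one way the per-fold precision/recall computation could be organized. It assumes base NPs are represented as (start, end) token spans and that exact-match scoring is used; the function names, the span representation, and the choice of which fold serves as the pruning corpus are illustrative assumptions, not the actual implementation used in these experiments.

def score_base_nps(proposed, gold):
    """Precision and recall for one test fold.

    proposed, gold: sets of (start, end) token spans marking base NPs.
    A proposed NP counts as correct only if it exactly matches a gold NP.
    """
    correct = len(proposed & gold)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

def folds_for_run(run, n_folds=5):
    """Assign folds for one cross-validation run: one test fold, one pruning
    fold, and three training folds.  Which fold acts as the pruning corpus
    in a given run is an arbitrary choice made for this sketch."""
    test = run
    prune = (run + 1) % n_folds
    train = [f for f in range(n_folds) if f not in (test, prune)]
    return train, prune, test

# Reported figures are averages of precision and recall over the five runs.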
<Paragraph position="3"> 2 Very briefly, the Empire sentence analyzer relies on partial parsing to find simple constituents like base NPs and verb groups. Machine learning algorithms then operate on the output of the partial parser to perform all attachment decisions. The ultimate output of the parser is a semantic case frame representation of the functional structure of the input sentence.</Paragraph> <Paragraph position="4"> [Table 2 caption: comparison with Ramshaw & Marcus (1998), both with and without lexical templates, on the R&M corpus.] Throughout the table, we see the effects of base NP complexity -- the base NPs of the R&M corpus are substantially more difficult for our approach to identify than the simpler NPs of the Empire corpus. For the R&M corpus, we lag the best published results (93.1P/93.5R) by approximately 3%.</Paragraph> <Paragraph position="5"> This straightforward comparison, however, is not entirely appropriate. Ramshaw & Marcus allow their learning algorithm to access word-level information in addition to part-of-speech tags. The treebank approach, on the other hand, makes use only of part-of-speech tags. Table 2 compares Ramshaw & Marcus' (in press) results with and without lexical knowledge. The first column reports their performance when using lexical templates; the second, their performance when lexical templates are not used; the third again shows the treebank approach using incremental pruning. Without lexical templates, the R&M approach and the treebank approach perform comparably (-1.1P/+0.2R). Lexicalization of our base NP finder will be addressed in Section 4.1.</Paragraph> <Paragraph position="6"> Finally, note the relatively small difference between the threshold and incremental pruning methods in Table 1. For some applications, this minor drop in performance may be worth the decrease in training time. Another effective technique to speed up training is motivated by Charniak's (1996) observation that the benefit of using rules that occurred only once in training is marginal. By discarding these rules before pruning, we reduce the size of the initial grammar -- and the time for incremental pruning -- by 60%, with a performance drop of only -0.3P/-0.1R.</Paragraph> <Section position="1" start_page="221" end_page="223" type="sub_section"> <SectionTitle> 4.1 Errors and Local Repair Heuristics </SectionTitle> <Paragraph position="0"> It is informative to consider the kinds of errors made by the treebank approach to bracketing. In particular, the errors may indicate options for incorporating lexical information into the base NP finder. Given the increases in performance achieved by Ramshaw & Marcus by including word-level cues, we would hope to see similar improvements by exploiting lexical information in the treebank approach.</Paragraph> <Paragraph position="1"> For each corpus we examined the first 100 or so errors and found that certain linguistic constructs consistently cause trouble. (In the examples that follow, the bracketing shown is the error.) * Conjunctions. Conjunctions were a major problem in the R&M corpus. For the Empire corpus, conjunctions of adjectives proved difficult: [record/NN] [third-quarter/JJ and/CC nine-month/JJ results/NNS].</Paragraph> <Paragraph position="2"> * Gerunds. Even though the most difficult VBG constructions such as manufacturing titans were removed from the Empire corpus, there were others that the bracketer did not handle, like [chief] operating [officer]. Like conjunctions, gerunds posed a major difficulty in the R&M corpus.</Paragraph> <Paragraph position="3"> * NPs Containing Punctuation. Predictably, the bracketer has difficulty with NPs containing periods, quotation marks, hyphens, and parentheses. * Adverbial Noun Phrases. Especially temporal NPs such as last month in at [83.6%] of [capacity last month].</Paragraph> <Paragraph position="4"> * Appositives. These are juxtaposed NPs such as of [colleague Michael Madden] that the bracketer mistakes for a single NP.</Paragraph> <Paragraph position="5"> * Quantified NPs. NPs that look like PPs are a problem: at/IN [least/JJS] [the/DT right/JJ jobs/NNS]; about/IN [25/CD million/CD].</Paragraph> <Paragraph position="6"> Many errors appear to stem from four underlying causes. First, close to 20% can be attributed to errors in the Treebank and in the Base NP corpus, bringing the effective performance of the algorithm to 94.2P/95.9R and 91.5P/92.7R for the Empire and R&M corpora, respectively. For example, neither corpus includes WH-phrases as base NPs.</Paragraph> <Paragraph position="7"> When the bracketer correctly recognizes these NPs, they are counted as errors. Part-of-speech tagging errors are a second cause. Third, many NPs are missed by the bracketer because it lacks the appropriate rule. For example, household products business is bracketed as [household/NN products/NNS] [business/NN]. Fourth, idiomatic and specialized expressions, especially time, date, money, and numeric phrases, also account for a substantial portion of the errors.</Paragraph> <Paragraph position="8"> These last two categories of errors can often be detected because they produce either recognizable patterns or unlikely linguistic constructs. Consecutive NPs, for example, usually denote bracketing errors, as in [household/NN products/NNS] [business/NN]. Merging consecutive NPs in the correct contexts would fix many such errors. Idiomatic and specialized expressions might be corrected by similarly local repair heuristics. Typical examples might include changing [effective/JJ Monday/NNP] to effective [Monday]; changing [the/DT balance/NN due/JJ] to [the balance] due; and changing were/VBP [n't/RB the/DT only/RB losers/NNS] to were n't [the only losers].</Paragraph> <Paragraph position="9"> Given these observations, we implemented three local repair heuristics. The first merges consecutive NPs unless either might be a time expression. The second identifies two simple date expressions. The third looks for quantifiers preceding an of NP construction. The first heuristic, for example, merges [household products] [business] to form [household products business], but leaves increased [15%] [last Friday] untouched. The second heuristic merges [June 5], [1995] into [June 5, 1995], and [June], [1995] into [June, 1995]. The third finds examples like some of [the companies] and produces [some] of [the companies]. These heuristics represent an initial exploration into the effectiveness of employing lexical information in a post-processing phase rather than during grammar induction and bracketing. While we are investigating the latter in current work, local repair heuristics have the advantage of keeping the training and bracketing algorithms both simple and fast.</Paragraph>
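As a rough illustration of how these three repair heuristics could be realized as a post-processing pass over the bracketer's output, the Python sketch below operates on (start, end) token spans and (word, tag) pairs. The span representation, the time-expression test, and the word lists are simplifying assumptions made for the example, not the exact conditions used in the experiments reported here.

# Illustrative sketch of the three local repair heuristics described above.
MONTHS = {"January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"}
TIME_WORDS = MONTHS | {"Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
                       "Saturday", "Sunday", "yesterday", "today", "tomorrow",
                       "week", "month", "quarter", "year"}
QUANTIFIERS = {"some", "many", "most", "all", "several", "none", "each", "any"}

def might_be_time(tokens):
    # Crude guess: a day/month word or a bare four-digit year anywhere in the NP.
    return any(w in TIME_WORDS or (w.isdigit() and len(w) == 4) for w, _ in tokens)

def repair(brackets, words):
    """Apply the three repair heuristics to one sentence.

    brackets: list of (start, end) half-open spans proposed as base NPs.
    words:    list of (word, POS-tag) pairs for the sentence.
    """
    out = sorted(brackets)

    # Heuristic 1: merge consecutive NPs unless either might be a time
    # expression, e.g. [household products] [business] -> [household products
    # business], but leave "increased [15%] [last Friday]" untouched.
    i = 0
    while i + 1 < len(out):
        (s1, e1), (s2, e2) = out[i], out[i + 1]
        if e1 == s2 and not might_be_time(words[s1:e1]) and not might_be_time(words[s2:e2]):
            out[i:i + 2] = [(s1, e2)]
        else:
            i += 1

    # Heuristic 2: merge simple date expressions, e.g. [June 5], [1995]
    # -> [June 5, 1995] and [June], [1995] -> [June, 1995].
    i = 0
    while i + 1 < len(out):
        (s1, e1), (s2, e2) = out[i], out[i + 1]
        if (words[s1][0] in MONTHS and s2 == e1 + 1 and words[e1][0] == ","
                and e2 - s2 == 1 and words[s2][0].isdigit() and len(words[s2][0]) == 4):
            out[i:i + 2] = [(s1, e2)]
        else:
            i += 1

    # Heuristic 3: bracket a quantifier preceding "of NP", e.g.
    # some of [the companies] -> [some] of [the companies].
    added = []
    for s, e in out:
        if s >= 2 and words[s - 1][0].lower() == "of" and words[s - 2][0].lower() in QUANTIFIERS:
            added.append((s - 2, s - 1))
    return sorted(out + added)

Because these passes run only on the bracketer's output, the grammar induction and bracketing algorithms themselves are left unchanged, which is the main attraction of the post-processing approach noted above.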
<Paragraph position="10"> The effect of these heuristics on recall and precision is shown in Table 3. We see consistent improvements for both corpora and both pruning methods, achieving approximately 94P/R for the Empire corpus and approximately 91P/R for the R&M corpus.</Paragraph> <Paragraph position="11"> Note that these are the final results reported in the introduction and conclusion. Although these experiments represent only an initial investigation into the usefulness of local repair heuristics, we are very encouraged by the results. The heuristics uniformly boost precision without harming recall; they help the R&M corpus even though they were designed in response to errors in the Empire corpus. In addition, these three heuristics alone recover 1/2 to 1/3 of the improvements we can expect to obtain from lexicalization based on the R&M results.</Paragraph> </Section> </Section> </Paper>