<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1013"> <Title>High Precision Extraction of Grammatical Relations</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Empirical Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Weight Thresholding </SectionTitle> <Paragraph position="0"> Our first experiment compared the accuracy of the parser when extracting GRs from the highest ranked analysis (the standard probabilistic parsing setup) against extracting weighted GRs from all parses in the forest. To measure accuracy we use the precision, recall and F-score measures of parser GRs against 'gold standard GR annotations in a 10,000word test corpus of in-coverage sentences derived from the SUSANNE corpus and covering a range of written genres4. GRs are in general compared using an equality test, except that in a specific, limited number of cases (described by Carroll, Minnen and Briscoe, 1998) the parser is allowed to return more generic relation types.</Paragraph> <Paragraph position="1"> When a parser GR has a weight of less than one, we proportionally discount its contribution to the precision and recall scores. Thus, given a set a47 of GRs with associated weights produced by the parser, i.e.</Paragraph> <Paragraph position="3"> and a set a95 of gold-standard (unweighted) GRs, we compute the weighted match between a95 and the elements of a47 as</Paragraph> <Paragraph position="5"> where a109 a50a23a112a113a57a114a48a115a92 if a112 is true and a0 otherwise. The weighted precision and recall are then</Paragraph> <Paragraph position="7"> respectively, expressed as percentages. We are not aware of any previous published work using weighted precision and recall measures, although there is an option for associating weights with complete parses in the distributed software implementing the PARSEVAL scheme (Harrison et al., 1991) for evaluating parser accuracy with respect to phrase structure bracketings. The weighted measures make sense for application tasks that can deal with sets of mutually-inconsistent GRs.</Paragraph> <Paragraph position="8"> In this initial experiment, precision and recall when extracting weighted GRs from all parses were both one and a half percentage points lower than when GRs were extracted from just the highest ranked analysis (see table 1)5. This decrease in accuracy might be expected, though, given that a true positive GR may be returned with weight less than one, and so will not receive full credit from the weighted precision and recall measures.</Paragraph> <Paragraph position="9"> However, these results only tell part of the story. An application using grammatical relation analyses might be interested only in GRs that the parser is fairly confident of being correct. For instance, in unsupervised acquisition of lexical information (such as subcategorisation frames for verbs) from text, the usual methodology is to (partially) analyse the text, retaining only reliable hypotheses which are then filtered based on the amount of evidence for them over the corpus as a whole. Thus, Brent (1993) only creates hypotheses on the basis of instances of verb frames that are reliably and unambiguously cued by closed class items (such as pronouns) so there can be no other attachment possibilities. 
<Paragraph position="8"> In this initial experiment, precision and recall when extracting weighted GRs from all parses were both one and a half percentage points lower than when GRs were extracted from just the highest ranked analysis (see Table 1). This decrease in accuracy might be expected, though, given that a true positive GR may be returned with weight less than one, and so will not receive full credit from the weighted precision and recall measures.</Paragraph>
<Paragraph position="9"> However, these results only tell part of the story. An application using grammatical relation analyses might be interested only in GRs that the parser is fairly confident are correct. For instance, in unsupervised acquisition of lexical information (such as subcategorisation frames for verbs) from text, the usual methodology is to (partially) analyse the text, retaining only reliable hypotheses, which are then filtered based on the amount of evidence for them over the corpus as a whole. Thus, Brent (1993) only creates hypotheses on the basis of instances of verb frames that are reliably and unambiguously cued by closed-class items (such as pronouns), so there can be no other attachment possibilities. In recent work on unsupervised learning of prepositional phrase disambiguation, Pantel and Lin (2000) derive training instances only from relevant data appearing in syntactic contexts that are guaranteed to be unambiguous. In our system, the weights on GRs indicate how certain the parser is that the associated relations are correct. We therefore investigated whether more highly weighted GRs are in fact more likely to be correct than ones with lower weights. We did this by setting a threshold on the output, such that any GR with weight lower than the threshold is discarded. Figure 2 plots weighted recall and precision as the threshold is varied between zero and one. The results are intriguing. Precision increases monotonically from 74.6% at a threshold of zero (the situation as in the previous experiment, where all GRs extracted from all parses in the forest are returned) to 90.4% at a threshold of one. (The latter threshold has the effect of allowing only those GRs that form part of every single analysis to be returned.)</Paragraph>
<Paragraph position="10"> The influence of the threshold on recall is equally dramatic, although since we have not escaped the usual trade-off with precision the results are somewhat less positive. Recall decreases from 75.3% to 45.2%, initially rising slightly, then falling at a gradually increasing rate. Between thresholds 0.99 and 1.0 there is only a two percentage point difference in precision, but recall differs by almost fourteen percentage points.6 Over the whole range, as the threshold is increased from zero, precision rises faster than recall falls until the threshold reaches 0.65; here the F-score attains its overall maximum of 77. [Footnote 6: Roughly, each percentage point increase or decrease in precision and recall is statistically significant at the 95% level. In this and all significance tests in this paper we use a one-tailed paired t-test (with 499 degrees of freedom).]</Paragraph>
<Paragraph position="11"> It turns out that the eventual figure of over 90% precision is not due to 'easier' relation types (such as the dependency between a determiner and a noun) being returned and more difficult ones (for example clausal complements) being ignored. The majority of relation types are produced with frequency consistent with the overall 45% recall figure. Exceptions are arg_mod (encoding the English passive 'by-phrase') and iobj (indirect object), for which no GRs at all are produced. The reason is that both types of relation originate from an occurrence of a prepositional phrase in contexts where it could be either a modifier or a complement of a predicate. This pervasive ambiguity means that there will always be disagreement between analyses over the relation type (but not necessarily over the identity of the head and dependent themselves).</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Parse Unpacking </SectionTitle>
<Paragraph position="0"> Schmid and Rooth's algorithm computes expected governors efficiently by using dynamic programming and processing the entire parse forest rather than individual trees. In contrast, we unpack the whole parse forest and then extract weighted GRs from each tree individually. Our implementation is certainly less elegant, but in practical terms the speed is still acceptable for sentences with relatively small numbers of parses.</Paragraph>
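As a concrete picture of this per-tree extraction and of the thresholding used in the previous section, the following sketch (our own illustration, assuming each unpacked tree is available as a probability paired with the set of GRs it yields) weights each GR by the normalised probability mass of the parses supporting it and then discards GRs below a threshold:

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple

def extract_weighted_grs(parses: List[Tuple[float, Set[Hashable]]],
                         threshold: float = 0.0) -> Dict[Hashable, float]:
    """Weight each GR by the summed, normalised probability of parses containing it.

    A GR appearing in every parse gets weight 1.0 (modulo rounding), so a
    threshold of 1.0 keeps only GRs common to all analyses.
    """
    total = sum(prob for prob, _ in parses)
    weights: Dict[Hashable, float] = defaultdict(float)
    for prob, grs in parses:
        for gr in grs:
            weights[gr] += prob / total
    return {gr: w for gr, w in weights.items() if w >= threshold}
```

In practice a small tolerance would be needed when comparing against a threshold of exactly 1.0, since the summed floating-point weights may fall marginally short of one.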
<Paragraph position="1"> However, throughput goes down linearly with the number of parses, and when there are many thousands of parses (and particularly also when the sentence is long, so each tree is large) the parsing system becomes unacceptably slow.</Paragraph>
<Paragraph position="2"> One possibility for improving the situation would be to extract GRs directly from forests. At first glance this looks feasible: although our parse forests are produced by a probabilistic LR parser using a unification-based grammar, they are similar in content to those computed by a probabilistic context-free grammar, as assumed by Schmid and Rooth's algorithm. However, there are problems. If the test for being able to pack local ambiguities in the unification-grammar parse forest is feature structure subsumption, unpacking a parse apparently encoded in the forest can fail due to non-local inconsistency in feature values (Oepen and Carroll, 2000), so every governor tuple hypothesis would have to be checked to ensure that the parse it came from was globally valid. It is likely that this verification step would cancel out the efficiency gained from using an algorithm based on dynamic programming. This problem could be side-stepped (but at the cost of less compact parse forests) by instead testing for feature structure equivalence rather than subsumption. A second, more serious problem is that some of our relation types encode more information than is present in a single governor tuple (the non-clausal subject relation, for instance, encoding whether the surface subject is the 'deep' object in a passive construction); this information can again be less local and violate the conditions required for the dynamic programming approach.</Paragraph>
<Paragraph position="3"> Another possibility is to compute only the n highest ranked parses and extract weighted GRs from just those. The basic case where n = 1 is equivalent to the standard approach of computing GRs from the highest probability parse. Table 2 shows the effect on accuracy as n is increased in stages to 1000, using a threshold for GR extraction of 1; also shown is the previous setup (labelled 'unlimited') in which all parses in the forest are considered. (All differences in precision in the table are significant to at least the 95% level, except between 1000 parses and an unlimited number.) The results demonstrate that limiting processing to a relatively small, fixed number of parses, even as low as 100, comes within a small margin of the accuracy achieved using the full parse forest. These results are striking, in view of the fact that the grammar assigns several hundred or more parses to over a third of the sentences in the test corpus, and more than a thousand parses to a fifth of them. Another interesting observation is that the relationship between precision and recall is very close to that seen when the threshold is varied (as in the previous section); there appears to be no loss in recall at a given level of precision. We therefore feel confident in unpacking a limited number of parses from the forest and extracting weighted GRs from them, rather than trying to process all parses.</Paragraph>
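Building on the extraction sketch above (again our own illustration, reusing the extract_weighted_grs function defined earlier), limiting the computation to the n highest-ranked parses amounts to truncating the parse list before weighting; probabilities are then renormalised over the retained parses:

```python
def extract_weighted_grs_nbest(parses, n=1000, threshold=1.0):
    """Extract weighted GRs from only the n highest-probability parses.

    With n = 1 this reduces to returning the GRs of the single highest-ranked
    analysis, each with weight 1.0.
    """
    top = sorted(parses, key=lambda pair: pair[0], reverse=True)[:n]
    return extract_weighted_grs(top, threshold=threshold)
```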
<Paragraph position="4"> We have tentatively set the limit to be 1000, as a reasonable compromise in our system between throughput and accuracy.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Parse Weighting </SectionTitle>
<Paragraph position="0"> The way in which the GR weighting is carried out does not matter when the weight threshold is equal to 1 (since then only GRs that are part of every analysis are returned, each with a weight of one). However, we wanted to see whether the precise method for assigning weights to GRs has an effect on accuracy, and if so, to what extent. We therefore tried an alternative approach in which each GR receives a contribution of 1 from every parse, no matter what the probability of the parse is, normalising in this case by the number of parses considered. This tends to increase the number of GRs returned for any given threshold, so when comparing the two methods we chose thresholds such that each method obtained the same precision figure (of roughly 83.38%). We then compared the recall figures (see Table 3). The recall for the probabilistic weighting scheme is 4% higher (statistically significant at the 99.95% level).</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Maximal Consistent Relation Sets </SectionTitle>
<Paragraph position="0"> It is interesting to see what happens if we compute for each sentence the maximal consistent set of weighted GRs. (We might want to do this if we need complete and coherent sentence analyses, interpreting the weights as confidence measures over sub-analysis segments.) We use a 'greedy' algorithm to compute consistent relation sets, taking GRs sorted in order of decreasing weight and adding a GR to the set if and only if there is not already a GR in the set with the same dependent. (Note, though, that the correct analysis may in fact contain more than one GR with the same dependent, such as the ncsubj ... Failure GRs in Figure 1, and in these cases this method will introduce errors.) The weighted precision, recall and F-score at threshold zero are 79.31%, 73.56% and 76.33 respectively. Precision and F-score are significantly better (at the 95.95% level) than the baseline.</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.5 Parser Bootstrapping </SectionTitle>
<Paragraph position="0"> One of our primary research goals is to explore unsupervised acquisition of lexical knowledge. The parser we use in this work is 'semi-lexicalised', using subcategorisation probabilities for verbs acquired automatically from (unlexicalised) parses. In the future we intend to acquire other types of lexicostatistical information (for example on PP attachment) which we will feed back into the parser's disambiguation procedure, bootstrapping successively more accurate versions of the parsing system. There is still plenty of scope for improvement in accuracy: compared with the number of correct GRs in top-ranked parses, there are roughly a further 20% that are correct but present only in lower-ranked parses. There appears to be less room for improvement with argument relations (ncsubj, dobj, etc.) than with modifier relations (ncmod and similar). This indicates that our next efforts should be directed to collecting information on modification.</Paragraph> </Section> </Section> </Paper>