<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1052">
  <Title>Automatic Extraction of Subcategorization from Corpora</Title>
  <Section position="5" start_page="358" end_page="361" type="metho">
    <SectionTitle>
3 Experimental Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="358" end_page="359" type="sub_section">
      <SectionTitle>
3.1 Lexicon Evaluation - Method
</SectionTitle>
      <Paragraph position="0"> In order to test the accuracy of our system (as developed so far) and to provide empirical feedback for further development, we took the Susanne, SEC (Taylor &amp; Knowles, 1988) and LOB corpora (Garside et al., 1987)--a total of 1.2 million words--and extracted all sentences containing an occurrence of one of fourteen verbs, up to a maximum of 1000 citations of each. These verbs, listed in Figure 2, were chosen at random, subject to the constraint that they exhibited multiple complementation patterns. The sentences containing these verbs were tagged and parsed automatically, and the extractor, classifier and evaluator were applied to the resulting  successful analyses. The citations from which entries were derived totaled approximately 70K words.</Paragraph>
      <Paragraph position="1"> The results were evaluated against a merged entry for these verbs from the ANLT and COMLEX Syntax dictionaries, and also against a manual analysis of the corpus data for seven of the verbs. The process of evaluating the performance of the system relative to the dictionaries could, in principle, be reduced to an automated report of type precision (percentage of correct subcategorization classes to all classes found) and recall (perCentage of correct classes found in the dictionary entry). However, since there are disagreements between the dictionaries and there are classes found in the corpus data that are not contained in either dictionary, we report results relative both to a manually merged entry from ANLT and COMLEX, and also, for seven of the verbs, to a manual analysis of the actual corpus data. The latter analysis is necessary because precision and recall measures against the merged entry will still tend to yield inaccurate results as the system cannot acquire classes not exemplified in the data, and may acquire classes incorrectly absent from the dictionaries.</Paragraph>
      <Paragraph position="2"> We illustrate these problems with reference to seem, where there is overlap, but not agreement between the COMLEX and ANLT entries. Thus, both predict that seem will occur with a sentential complement and dummy subject, but only ANLT predicts the possibility of a 'wh' complement and only COMLEX predicts the (optional) presence of a PP\[to\] argument with the sentential complement.</Paragraph>
      <Paragraph position="3"> One ANLT entry covers two COMLEX entries given the different treatment of the relevant complements but the classifier keeps them distinct. The corpus data for seem contains examples of further classes which we judge valid, in which seem can take a PP\[to\] and infinitive complement, as in he seems to me to be insane, and a passive participle, as in he seemed depressed. This comparison illustrates the problem of errors of omission common to computational lexicons constructed manually and also from machine-readable dictionaries. All classes for seem are exemplified in the corpus data, but for ask, for example, eight classes (out of a possible 27 in the merged entry) are not present, so comparison only to the merged entry would give an unreasonably low estimate of recall.</Paragraph>
    </Section>
    <Section position="2" start_page="359" end_page="360" type="sub_section">
      <SectionTitle>
3.2 Lexicon Evaluation - Results
</SectionTitle>
      <Paragraph position="0"> Figure 2 gives the raw results for the merged entries and corpus analysis on each verb. It shows the number of true positives (TP), correct classes proposed by our system, false positives (FP), incorrect classes proposed by our system, and false negatives (FN), correct classes not proposed by our system, as judged against the merged entry, and, for seven of the verbs, against the corpus analysis. It also shows, in the final column, the number of sentences from which classes were extracted.</Paragraph>
      <Paragraph position="1">  our system's recognition of subcategorization classes as evaluated against the merged dictionary entries (14 verbs) and against the manually analysed corpus data (7 verbs). The frequency distribution of the classes is highly skewed: for example for believe, there are 107 instances of the most common class in the corpus data, but only 6 instances in total of the least common four classes. More generally, for the manually analysed verbs, almost 60% of the false negatives have only one or two exemplars each in the corpus citations. None of them are returned by the system because the binomial filter always rejects classes hypothesised on the basis of such little evidence. null In Figure 4 we estimate the accuracy with which our system ranks true positive classes against the correct ranking for the seven verbs whose corpus input was manually analysed. We compute this measure by calculating the percentage of pairs of classes at positions (n, m) s.t. n &lt; m in the system ranking that are ordered the same in the correct ranking. This gives us an estimate of the accuracy of the relative frequencies of classes output by the system.</Paragraph>
      <Paragraph position="2"> For each of the seven verbs for which we undertook a corpus analysis, we calculate the token recall of our system as the percentage (over all exemplars) of true positives in the corpus. This gives us an estimate of the parsing performance that would result from providing a parser with entries built using the system, shown in Figure 5.</Paragraph>
      <Paragraph position="3"> Further evaluation of the results for these seven verbs reveals that the filtering phase is the weak link in the systerri. There are only 13 true negatives which the system failed to propose, each exemplified in the data by a mean of 4.5 examples. On the other hand, there are 67 false negatives supported by an estimated mean of 7.1 examples which should, ide- null ally, have been accepted by the filter, and 11 false positives which should have been rejected. The performance of the filter for classes with less than 10 exemplars is around chance, and a simple heuristic of accepting all classes with more than 10 exemplars would have produced broadly similar results for these verbs. The filter may well be performing poorly because the probability of generating a sub-categorization class for a given verb is often lower than the error probability for that class.</Paragraph>
    </Section>
    <Section position="3" start_page="360" end_page="360" type="sub_section">
      <SectionTitle>
3.3 Parsing Evaluation
</SectionTitle>
      <Paragraph position="0"> In addition to evaluating the acquired subcategorization information against existing lexical resources, we have also evaluated the information in the context of an actual parsing system. In particular we wanted to establish whether the subcategorization frequency information for individual verbs could be used to improve the accuracy of a parser that uses statistical techniques to rank analyses.</Paragraph>
      <Paragraph position="1"> The experiment used the same probabilistic parser and tag sequence grammar as are present in the acquisition system (see references above)--although the experiment does not in any way rely on the</Paragraph>
    </Section>
    <Section position="4" start_page="360" end_page="361" type="sub_section">
      <SectionTitle>
Susanne bracketings
</SectionTitle>
      <Paragraph position="0"> parsers or grammars being the same. We randomly selected a test set of 250 in-coverage sentences (of lengths 3-56 tokens, mean 18.2) from the Susanne treebank, retagged with possibly multiple tags per word, and measured the 'baseline' accuracy of the unlexicalized parser on the sentences using the now standard PARSEVAL/GEIG evaluation metrics of mean crossing brackets per sentence and (unlabelled) bracket recall and precision (e.g. Grishman et al., 1992); see figure 65. Next, we collected all words in the test corpus tagged as possibly being verbs (giving a total of 356 distinct lemmas) and retrieved all citations of them in the LOB corpus, plus Susanne with the 250 test sentences excluded. We acquired subcategorization and associated frequency information from the citations, in the process successfully parsing 380K words. We then parsed the test set, with each verb subcategorization possibility weighted by its raw frequency score, and using the naive add-one smoothing technique to allow for omitted possibilities. The GEIG measures for the lexicalized parser show a 7% improvement in the crossing bracket score (figure 6).</Paragraph>
      <Paragraph position="1">  cally significant at the 95% level (paired t-test, 1.21, 249 dr, p = 0.11)--although if the pattern of differences were maintained over a larger test set of 470 sentences it would be significant. We expect that a more sophisticated smoothing technique, a larger acquisition corpus, and extensions to the system to deal with nominal and adjectival predicates would improve accuracy still further. Nevertheless, this experiment demonstrates that lexicalizing a grammar/parser with subcategorization frequencies can appreciably improve the accuracy of parse ranking.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML