<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1043"> <Title>Generative Models for Statistical Parsing with Combinatory Categorial Grammar</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Evaluating a CCG parser </SectionTitle>
<Paragraph position="0"> Since CCG produces unary and binary branching trees with a very fine-grained category set, CCG Parseval scores cannot be compared with scores of standard Treebank parsers. Therefore, we also evaluate performance using a dependency evaluation reported by Collins (1999), which counts word-word dependencies as determined by local trees and their labels. According to this metric, a local tree with parent node P, head daughter H and non-head daughter S (the position of S relative to P, i.e. left or right, is implicit in CCG categories) defines a ⟨P,H,S⟩ dependency between the head word of S, w_S, and the head word of H, w_H. This measure is neutral with respect to the branching factor. Furthermore, as noted by Hockenmaier (2001), it does not penalize equivalent analyses of multiple modifiers. In the unlabeled case ⟨⟩ (where it only matters whether word a is a dependent of word b, not what the label of the local tree defining this dependency is), scores can be compared across grammars with different sets of labels and different kinds of trees. In order to compare our performance with the parser of Clark et al. (2002), we also evaluate our best model according to the dependency evaluation introduced for that parser. For further discussion we refer the reader to Clark and Hockenmaier (2002). A sketch of the ⟨P,H,S⟩ evaluation is given below.</Paragraph>
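<Paragraph position="1"> The following sketch (ours, not the authors' code; the Node representation and all names are illustrative) makes the evaluation concrete: it extracts one ⟨P,H,S⟩ dependency per binary local tree and scores a parse against a gold standard, in both the labeled and the unlabeled setting.

    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    @dataclass
    class Node:
        category: str                     # CCG category, e.g. "S[dcl]"
        head: Optional["Node"] = None     # head daughter (None at leaves)
        nonhead: Optional["Node"] = None  # non-head daughter (None if unary or leaf)
        word: Optional[str] = None        # the word itself, at leaves

    def head_word(node: Node) -> str:
        # Follow the chain of head daughters down to the lexical head.
        return node.word if node.head is None else head_word(node.head)

    def dependencies(node: Node) -> List[Tuple]:
        """One ⟨P,H,S⟩ dependency per binary local tree: the head word of the
        non-head daughter S depends on the head word of the head daughter H."""
        deps = []
        if node.head is not None:
            if node.nonhead is not None:
                label = (node.category, node.head.category, node.nonhead.category)
                deps.append((label, head_word(node.nonhead), head_word(node.head)))
                deps.extend(dependencies(node.nonhead))
            deps.extend(dependencies(node.head))
        return deps

    def precision_recall(gold: List[Tuple], test: List[Tuple], labeled: bool = True):
        # For the unlabeled case ⟨⟩, drop the (P, H, S) label triple.
        strip = (lambda d: d) if labeled else (lambda d: d[1:])
        unmatched = [strip(d) for d in gold]
        correct = 0
        for d in (strip(d) for d in test):
            if d in unmatched:            # match each gold dependency at most once
                unmatched.remove(d)
                correct += 1
        return correct / len(test), correct / len(gold)
</Paragraph>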
</Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 CCGbank--a CCG treebank </SectionTitle>
<Paragraph position="0"> CCGbank is a corpus of CCG normal-form derivations obtained by translating the Penn Treebank trees using an algorithm described by Hockenmaier and Steedman (2002). Almost all types of construction (with the exception of gapping and UCP, "Unlike Coordinate Phrases") are covered by the translation procedure, which processes 98.3% of the sentences in the training corpus (WSJ sections 02-21) and 98.5% of the sentences in the test corpus (WSJ section 23). The grammar contains a set of type-changing rules similar to the lexical rules described in Carpenter (1992). Figure 1 shows a derivation taken from CCGbank. Categories such as ((S[b]\NP)/PP)/NP encode unsaturated subcat frames. The complement-adjunct distinction is made explicit; for instance, as a nonexecutive director is marked up as PP-CLR in the Treebank, and hence treated as a PP-complement of join, whereas Nov. 29 is marked up as an NP-TMP and therefore analyzed as a VP modifier. The -CLR tag is not in fact a very reliable indicator of whether a constituent should be treated as a complement, but the translation to CCG is automatic and must do the best it can with the information in the Treebank.</Paragraph>
<Paragraph position="1"> The verbal categories in CCGbank carry features distinguishing declarative verbs (and auxiliaries) from past participles in past tense, past participles in passives, bare infinitives and ing-forms.</Paragraph>
<Paragraph position="2"> There is a separate level for nouns and noun phrases, but, like the nonterminal NP in the Penn Treebank, noun phrases do not carry any number agreement.</Paragraph>
<Paragraph position="3"> The derivations in CCGbank are "normal-form" in the sense that analyses involving the combinatory rules of type-raising and composition are only used when syntactically necessary.</Paragraph>
</Section>
<Section position="5" start_page="0" end_page="1" type="metho"> <SectionTitle> 4 Generative models of CCG derivations </SectionTitle>
<Paragraph position="0"> The models described here are all extensions of a very simple model which models derivations by a top-down tree-generating process. This model was originally described in Hockenmaier (2001), where it was applied to a preliminary version of CCGbank, and its definition is repeated here in the top row of Table 1. Given a (parent) node with category P, choose the expansion exp of P, where exp can be leaf (for lexical categories), unary (for unary expansions such as type-raising), left (for binary trees where the head daughter is left) or right (for binary trees where the head daughter is right). If P is a leaf node, generate its head word w. Otherwise, generate the category of its head daughter H. If P is binary branching, generate the category of its non-head daughter S (a complement or modifier of H). A sketch of this generative process is given at the end of this section.</Paragraph>
<Paragraph position="1"> The model itself includes no prior knowledge specific to CCG other than that it only allows unary and binary branching trees, and that the sets of nonterminals and terminals are not disjoint (hence the need to include leaf as a possible expansion, which acts as a stop probability).</Paragraph>
<Paragraph position="2"> All the experiments reported in this section were conducted using sections 02-21 of CCGbank as training corpus and section 23 as test corpus. We replace all rare words in the training data with their POS-tag. For all experiments reported here and in section 5, the frequency threshold was set to 5. Like Collins (1999), we assume that the test data is POS-tagged, and can therefore replace unknown words in the test data with their POS-tag, which is more appropriate for a formalism like CCG, with its large set of lexical categories, than one generic token for all unknown words.</Paragraph>
<Paragraph position="3"> The performance of the baseline model is shown in the top row of Table 3. For six out of the 2,379 sentences in our test corpus we do not get a parse. The reason is that a lexicon consisting of the word-category pairs observed in the training corpus does not contain all the entries required to parse the test corpus. We discuss a simple, but imperfect, solution to this problem in section 7.</Paragraph>
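<Paragraph position="4"> The following sketch (ours; it reuses the Node representation from the sketch in section 2, adds a head_is_left flag on binary nodes, and assumes a function p that returns smoothed conditional probability estimates) computes the probability the baseline model assigns to a derivation.

    def derivation_prob(node, p):
        """Baseline model: for each node, generate exp given P; then either
        the head word w (at leaves), or the head daughter H plus, for binary
        expansions, the non-head daughter S."""
        P = node.category
        if node.head is None:                   # leaf: stop and emit the word
            return p(("exp", "leaf"), (P,)) * p(("word", node.word), (P, "leaf"))
        if node.nonhead is None:
            exp = "unary"
        elif node.head_is_left:
            exp = "left"
        else:
            exp = "right"
        H = node.head.category
        prob = p(("exp", exp), (P,)) * p(("head", H), (P, exp))
        if node.nonhead is not None:            # generate the sister category S
            prob *= p(("sister", node.nonhead.category), (P, exp, H))
            prob *= derivation_prob(node.nonhead, p)
        return prob * derivation_prob(node.head, p)
</Paragraph>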
</Section>
<Section position="6" start_page="1" end_page="4" type="metho"> <SectionTitle> 5 Extending the baseline model </SectionTitle>
<Paragraph position="0"> State-of-the-art statistical parsers use many other features, or conditioning variables, such as head words, subcategorization frames, distance measures and grandparent nodes. We too can extend the baseline model described in the previous section by including more features. Like the models of Goodman (1997), the additional features in our model are generated probabilistically, whereas in the parser of Collins (1997) distance measures are assumed to be a function of the already generated structure and are not generated explicitly.</Paragraph>
<Paragraph position="1"> In order to estimate the conditional probabilities of our model, we recursively smooth empirical estimates ê_i of specific conditional distributions with (possibly smoothed) estimates of less specific distributions ẽ_{i-1}, using linear interpolation: ẽ_i = λ ê_i + (1 - λ) ẽ_{i-1}, where λ is a smoothing weight which depends on the particular distribution. We compute λ in the same way as Collins (1999), p. 185.</Paragraph>
<Paragraph position="2"> When defining models, we will indicate a back-off level with a # sign between conditioning variables; e.g. A,B # C # D means that we interpolate estimates conditioned on A,B,C,D with estimates conditioned on A,B,C, which are in turn interpolated with estimates conditioned on A,B. (We conjecture that the minor variations in coverage among the models other than Grandparent are artefacts of the beam.) A sketch of this smoothing scheme follows.</Paragraph>
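<Paragraph position="3"> The sketch below is ours; the uniform base estimate and the Witten-Bell-style weight with constant k = 5 are assumptions modelled on Collins (1999), p. 185. It implements the recursive interpolation for one conditional distribution and its back-off levels.

    from collections import Counter
    from dataclasses import dataclass, field

    @dataclass
    class Level:
        # Counts for one back-off level of a conditional distribution.
        joint: Counter = field(default_factory=Counter)  # (context, outcome) counts
        ctx: Counter = field(default_factory=Counter)    # context totals
        types: Counter = field(default_factory=Counter)  # distinct outcomes per context

        def observe(self, context, outcome):
            if self.joint[(context, outcome)] == 0:
                self.types[context] += 1
            self.joint[(context, outcome)] += 1
            self.ctx[context] += 1

    def smoothed_prob(levels, contexts, outcome, k=5.0, base=1e-5):
        """levels and contexts are ordered least specific first; for
        A,B # C # D, contexts would be [(A,B), (A,B,C), (A,B,C,D)]."""
        e = base                                 # ẽ_0: a uniform base estimate
        for level, context in zip(levels, contexts):
            n = level.ctx[context]
            u = level.types[context]
            lam = n / (n + k * u) if n else 0.0  # Witten-Bell-style weight λ
            emp = level.joint[(context, outcome)] / n if n else 0.0
            e = lam * emp + (1.0 - lam) * e      # ẽ_i = λ ê_i + (1 - λ) ẽ_{i-1}
        return e
</Paragraph>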
<Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 5.1 Adding non-lexical information </SectionTitle>
<Paragraph position="0"> The coordination feature. We define a boolean feature, conj, which is true for constituents which expand to coordinations on the head path. Table 1 shows how conj is used as a conditioning variable. This is intended to allow the model to capture the fact that a CCG derivation in which the subject is type-raised and composed with the verb is much more likely in right node raising constructions than in sentences without extraction.</Paragraph>
<Paragraph position="1"> The impact of the grandparent feature. Johnson (1998) showed that a PCFG estimated from a version of the Penn Treebank in which the label of a node's parent is attached to the node's own label yields a substantial improvement (LP/LR: from 73.5%/69.7% to 80.0%/79.2%). The inclusion of an additional grandparent feature gives Charniak (1999) a slight improvement in his Maximum-Entropy-inspired model, but a slight decrease in performance for an MLE model. Table 3 (Grandparent) shows that a grammar transformation like Johnson's does yield an improvement, but not as dramatic as in the Treebank-CFG case. At the same time, coverage is reduced (which might not be the case if this was an additional feature in the model rather than a change in the representation of the categories). Both of these results are to be expected: CCG categories encode more contextual information than Treebank labels, in particular about parents and grandparents, so the history feature might be expected to have less impact. Moreover, since our category set is much larger, appending the parent node leads to an even more fine-grained partitioning of the data, which results in sparse data problems.</Paragraph>
<Paragraph position="2"> Distance measures for CCG. Our distance measures are related to those proposed by Goodman (1997), which are appropriate for binary trees (unlike those of Collins (1997)). Every node has a left distance measure, ∆_L, measuring the distance from the head word to the left frontier of the constituent, and a similar right distance measure ∆_R; among the measures we use is ∆_Pct, which counts intervening punctuation marks (0, 1, 2 or 3 and more). These ∆s are generated by the model in the following manner: at the root of the sentence, generate ∆_L and ∆_R. Then, for each expansion, if it is a unary expansion, the daughter inherits the ∆s of its parent with a probability of 1. If it is a binary expansion, only the ∆ in the direction of the sister changes. The ∆s are then used as further conditioning variables for the other distributions, as shown in Table 1. Table 3 also gives the Parseval and dependency scores obtained with each of these measures. ∆_Pct has the smallest effect. However, our model does not yet contain anything like the hard constraint on punctuation marks in Collins (1999). A sketch of ∆_Pct is given below.</Paragraph>
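<Paragraph position="3"> The following sketch (ours; the token-level definition and the punctuation set are assumptions based on the description above) computes the bucketed ∆_Pct values for a constituent spanning tokens i..j (j exclusive) with its head word at position h.

    # Assumed punctuation set; the paper does not list it in this extract.
    PUNCT = {",", ";", ":", ".", "--", "``", "''"}

    def delta_pct(tokens, i, j, h):
        """Left/right distance from the head word at position h to the
        constituent frontier, counting intervening punctuation marks and
        bucketing the counts as 0, 1, 2 or 3-and-more."""
        left = sum(1 for t in tokens[i:h] if t in PUNCT)
        right = sum(1 for t in tokens[h + 1:j] if t in PUNCT)
        return min(left, 3), min(right, 3)
</Paragraph>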
</Section>
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 5.2 Adding lexical information </SectionTitle>
<Paragraph position="0"> Gildea (2001) shows that removing the lexical dependencies in Model 1 of Collins (1997) (that is, not conditioning w_S on w_H) decreases labeled precision and recall by only 0.5%. It can therefore be assumed that the main influence of lexical head features (words and preterminals) in Collins' Model 1 is on the structural probabilities.</Paragraph>
<Paragraph position="1"> In CCG, by contrast, preterminals are lexical categories, encoding complete subcategorization information. They therefore encode more information about the expansion of a nonterminal than Treebank POS-tags and are thus more constraining.</Paragraph>
<Paragraph position="2"> Generating a constituent's lexical category c at its maximal projection (i.e. either at the root of the tree, TOP, or when generating a non-head daughter S), and using the lexical category as conditioning variable (LexCat), increases performance of the baseline model as measured by ⟨P,H,S⟩ by almost 3%. In this model, c_S, the lexical category of S, depends on the category S and on the local tree in which S is generated. However, slightly worse performance is obtained for LexCatDep, a model which is identical to the original LexCat model except that c_S is also conditioned on c_H, the lexical category of the head node, which introduces a dependency between the lexical categories.</Paragraph>
<Paragraph position="3"> Since there is so much information in the lexical categories, one might expect this to reduce the effect of conditioning the expansion of a constituent on its head word w. However, we did find a substantial effect. Generating the head word at the maximal projection (HeadWord) increases performance by a further 2%. Finally, conditioning w_S on w_H, hence including word-word dependencies (HWDep), increases performance even more, by another 3.5%, or 8.3% overall (a sketch of these lexicalization steps is given at the end of this subsection). This is in stark contrast to Gildea's findings for Collins' Model 1.</Paragraph>
<Paragraph position="4"> We conjecture that the reason why CCG benefits more from word-word dependencies than Collins' Model 1 is that CCG allows a cleaner parametrization of these surface dependencies. In Collins' Model 1, the modifier word w_S is conditioned not only on the head word w_H, but also on the distance ∆ between the head and the modifier to be generated. However, Model 1 does not incorporate the notion of subcategorization frames. Instead, the distance measure was found to yield a good, if imperfect, approximation to subcategorization information. Using our notation, Collins' Model 1 generates w_S conditioned on w_H and the distance ∆, rather than on a subcategorization frame.</Paragraph>
<Paragraph position="5"> [Table 3 caption: LP/LR and BP/BR (the standard Parseval labeled/bracketed precision and recall scores) are not commensurate with other Treebank parsers. ⟨P,H,S⟩, ⟨S⟩ and ⟨⟩ are as defined in section 2. CM on ⟨⟩ is the percentage of sentences with complete match on ⟨⟩; the last column gives the percentage of sentences with under 2 "crossing dependencies" as defined by ⟨⟩.]</Paragraph>
<Paragraph position="6"> The ⟨P,H,S⟩ labeled dependencies we report are not directly comparable with Collins (1999), since CCG categories encode subcategorization frames. For instance, if the direct object of a verb has been recognized as such, but a PP has been mistaken for a complement (whereas the gold standard says it is an adjunct), the fully labeled dependency evaluation ⟨P,H,S⟩ will not award a point. Therefore, we also include in Table 3 a more comparable evaluation ⟨S⟩, which only takes the correctness of the non-head category into account. The reported figures are also deflated by the fact that we retain verb features like tensed/untensed; if all verb features are stripped off, an improvement of 0.6% on the ⟨P,H,S⟩ score for our best model is obtained.</Paragraph>
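<Paragraph position="7"> The following schematic sketch (ours, not the authors' exact factorization; it reuses the Node representation from section 2 and assumes each node carries precomputed lexcat and headword attributes) shows where LexCat, HeadWord and HWDep insert the lexical generation steps.

    def lexical_prob(node, w_head, p, model="HWDep"):
        """At each binary node, generate the sister's lexical category c_S
        given its category and the local tree; for HeadWord and HWDep, also
        generate its head word w_S given c_S (and, for HWDep, w_H)."""
        prob = 1.0
        if node.head is None:
            return prob                     # leaves: words already generated
        if node.nonhead is not None:
            S = node.nonhead
            local = (node.category, node.head.category, S.category)
            prob *= p(("lexcat", S.lexcat), local)        # c_S | S, local tree
            if model in ("HeadWord", "HWDep"):
                ctx = (S.lexcat, w_head) if model == "HWDep" else (S.lexcat,)
                prob *= p(("word", S.headword), ctx)      # w_S | c_S (, w_H)
            prob *= lexical_prob(S, S.headword, p, model)
        return prob * lexical_prob(node.head, w_head, p, model)
</Paragraph>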
</Section>
<Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 5.3 Combining lexical and non-lexical information </SectionTitle>
<Paragraph position="0"> When incorporating the adjacency distance measure or the coordination feature into the dependency model (HWDep∆ and HWDepConj), overall performance is lower than with the dependency model alone. We conjecture that this arises from data sparseness. It cannot be concluded from these results alone that the lexical dependencies make structural information redundant or superfluous. Instead, it is quite likely that we are facing an estimation problem similar to Charniak (1999), who reports that the inclusion of the grandparent feature worsens performance of an MLE model, but improves performance if the individual distributions are modelled using Maximum Entropy. This intuition is strengthened by the fact that, on casual inspection of the scores for individual sentences, the lexicalized models sometimes perform worse than the unlexicalized models.</Paragraph>
</Section>
<Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 5.4 The impact of tagging errors </SectionTitle>
<Paragraph position="0"> All of the experiments described above use the POS-tags given by CCGbank (which are the Treebank tags, with some corrections necessary to acquire correct features on categories). It is reasonable to assume that this input is of higher quality than can be produced by a POS-tagger. We therefore ran the dependency model on a test corpus tagged with the POS-tagger of Ratnaparkhi (1996), which is trained on the original Penn Treebank (see HWDep (+ tagger) in Table 3). Performance degrades slightly, which is to be expected, since our approach makes so much use of the POS-tag information for unknown words. However, a POS-tagger trained on CCGbank might yield slightly better results.</Paragraph>
</Section>
<Section position="5" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 5.5 Limitations of the current model </SectionTitle>
<Paragraph position="0"> Unlike Clark et al. (2002), our parser does not always model the dependencies in the logical form. For example, in the interpretation of a coordinate structure like "buy and sell shares", shares will head an object of both buy and sell. Similarly, in examples like "buy the company that wins", the relative construction makes company depend upon both buy as object and wins as subject. As is well known (Abney, 1997), DAG-like dependencies cannot in general be modeled with a generative approach of the kind taken here. (It remains to be seen whether the more restricted reentrancies of CCG will ultimately support a generative model.)</Paragraph>
</Section>
<Section position="6" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 5.6 Comparison with Clark et al. (2002) </SectionTitle>
<Paragraph position="0"> Clark et al. (2002) present another statistical CCG parser, which is based on a conditional (rather than generative) model of the derived dependency structure, including non-surface dependencies. The following table compares the two parsers according to the evaluation of surface and deep dependencies given in Clark et al. (2002). We use Clark et al.'s parser to generate these dependencies from the output of our parser (see Clark and Hockenmaier (2002)). (Due to the smaller grammar and lexicon of Clark et al., our parser can only be evaluated on slightly over 94% of the sentences in section 23, whereas the figures for Clark et al. (2002) are on 97%.)</Paragraph>
<Paragraph position="1"> One of the advantages of CCG is that it provides a simple, surface grammatical analysis of extraction and coordination. We investigate whether our best model, HWDep, predicts the correct analyses, using the development section 00.</Paragraph>
<Paragraph position="2"> Coordination. There are two instances of argument cluster coordination (constructions like cost $5,000 in July and $6,000 in August) in the development corpus. Of these, HWDep recovers none correctly. This is a shortcoming of the model, rather than of CCG: the relatively high probability both of the NP-modifier analysis of PPs like in July and of NP coordination is enough to misdirect the parser. There are 203 instances of verb phrase coordination (S[.]\NP, with [.] standing for any verbal feature) in the development corpus. On these, we obtain labeled recall and precision of 67.0%/67.3%. Interestingly, on the 24 instances of right node raising (coordination of (S[.]\NP)/NP), our parser achieves higher performance, with labeled recall and precision of 79.2% and 73.1%. Figure 2 gives an example of the output of our parser on such a sentence.</Paragraph>
<Paragraph position="3"> Extraction. Long-range dependencies are not captured by the evaluation used here.
However, the accuracy for recovering lexical categories for words with "extraction" categories, such as relative pronouns, gives some indication of how well the model detects the presence of such dependencies. The most common category for subject relative pronouns, (NP\NP)/(S[dcl]\NP), is recovered with a precision of 97.1% (232 out of 239) and a recall of 94.3% (232/246).</Paragraph>
<Paragraph position="4"> Embedded subject extraction requires the special lexical category ((S[dcl]\NP)/NP)/(S[dcl]\NP) for verbs like think. On this category, the model achieves a precision of 100% (5/5) and a recall of 83.3% (5/6). The case the parser misanalyzed is due to lexical coverage: the verb agree occurs in our lexicon, but not with this category.</Paragraph>
<Paragraph position="5"> The most common category for object relative pronouns, (NP\NP)/(S[dcl]/NP), has a recall of 76.2% (16 out of 21) and a precision of 84.2% (16/19). Free object relatives, NP/(S[dcl]/NP), have a recall of 84.6% (11/13) and a precision of 91.7% (11/12). However, object extraction appears more frequently as a reduced relative (the man John saw), and there are no lexical categories indicating this extraction. Reduced relative clauses are instead captured by a type-changing rule NP\NP → S[dcl]/NP. This rule was applied 56 times in the gold standard and 70 times by the parser, out of which 48 times it corresponded to a rule in the gold standard (or 34 times, if the exact bracketing of the S[dcl]/NP is taken into account; this lower figure is due to attachment decisions made elsewhere in the tree).</Paragraph>
<Paragraph position="6"> These figures are difficult to compare with those of standard Treebank parsers. Despite the fact that the original Treebank does contain traces for movement, none of the existing parsers try to generate these traces (with the exception of Collins' Model 3, for which he only gives an overall score of 96.3%/98.8% P/R for subject extraction and 81.4%/59.4% P/R for other cases). The only "long range" dependency for which Collins gives numbers is subject extraction, ⟨SBAR, WHNP, SG, R⟩, which has labeled precision and recall of 90.56% and 90.56%, whereas the CCG model achieves labeled precision and recall of 94.3% and 96.5% on the most frequent subject extraction dependency ⟨NP\NP, (NP\NP)/(S[dcl]\NP), S[dcl]\NP⟩, which occurs 262 times in the gold standard and was produced 256 times by our parser. Out of the 15 cases of this relation in the gold standard that our parser did not return, 8 were in fact analyzed as subject extraction of bare infinitivals, ⟨NP\NP, (NP\NP)/(S[b]\NP), S[b]\NP⟩, yielding a combined recall of 97.3%.</Paragraph>
</Section> </Section>
<Section position="7" start_page="4" end_page="6" type="metho"> <SectionTitle> 7 Lexical coverage </SectionTitle>
<Paragraph position="0"> The most serious problem facing parsers like the present one, with its large category set, is not so much the standard problem of unseen words, but rather the problem of words that have been seen, but not with the necessary category.</Paragraph>
<Paragraph position="1"> For standard Treebank parsers, the latter problem does not have much impact, if any, since the Penn Treebank tagset is fairly small, and the grammar underlying the Treebank is very permissive. However, for CCG this is a serious problem: the first three rows in Table 4 show a significant difference in performance between sentences with complete lexical coverage ("No missing") and sentences with missing lexical entries ("Missing").</Paragraph>
<Paragraph position="2"> Using the POS-tags in the corpus, we can estimate the lexical probabilities P(w | c) using a linear interpolation between the relative frequency estimates and an estimate derived from the POS-tags (a sketch of one such interpolation is given below).</Paragraph>
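<Paragraph position="3"> The sketch below is ours, and the form of the back-off estimate (routing P(w | c) through the POS-tags by summing estimates of P(w | t)·P(t | c) over tags t) is an assumption: this extract does not show the paper's actual equation.

    def lexical_word_prob(w, c, wc, cc, wt, tt, tag_given_cat, lam=0.9):
        """wc[(w, c)], cc[c]: word/lexical-category counts;
        wt[(w, t)], tt[t]: word/POS-tag counts;
        tag_given_cat[c][t]: estimate of P(t | c).
        lam is fixed here for illustration; the paper computes its
        interpolation weights as in Collins (1999), p. 185."""
        rel = wc.get((w, c), 0) / cc[c] if cc.get(c) else 0.0
        backoff = sum((wt.get((w, t), 0) / tt[t]) * p_tc
                      for t, p_tc in tag_given_cat.get(c, {}).items()
                      if tt.get(t))
        return lam * rel + (1.0 - lam) * backoff
</Paragraph>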
<Paragraph position="4"> Table 4 shows the performance of the baseline model with a frequency cutoff of 5 and of 10 for rare words, and with a smoothed and a non-smoothed lexicon. The frequency cutoff plays an important role here: smoothing with a small cutoff yields worse performance than not smoothing, whereas smoothing with a cutoff of 10 does not have a significant impact on performance. Smoothing the lexicon in this way does make the parser more robust, resulting in complete coverage of the test set. However, it does not affect overall performance, nor does it alleviate the problem of sentences with missing lexical entries for seen words.</Paragraph>
</Section> </Paper>