<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0622"> <Title>Guiding a Well-Founded Parser with Corpus Statistics</Title> <Section position="4" start_page="179" end_page="180" type="metho"> <SectionTitle> 3 Language Model </SectionTitle> <Paragraph position="0"> We noted above that we would like a more complete lexicalization than what has been used by recent models in statistical parsing. To this end, we propose a generative model which is a direct extension of a Probabilistic Context Free Grammar (PCFG). In our model, as in a PCFG, the sentence is generated via top-down expansion of its parse tree, beginning with the root node.</Paragraph> <Paragraph position="1"> The crucial difference, however, is that as we expand from a nonterminal to its children, we simultaneously generate both the syntactic category and the head word of each child. This expansion is predicated on the category and head word of the mother.</Paragraph> <Paragraph position="2"> We will also make the traditional assumption that all sentences are generated independently of each other. Then, under this assumption and the assumed model, we can write the probability of a particular parse tree T as the product of all</Paragraph> <Paragraph position="4"> where X and w are the syntactic category and head word of the mother node, and Yi and h.t are the syntactic category and head word of the ith daughter, rulename is the identifier for the rule that licenses the expansion. (Of course, all of these terms should be indexed appropriately for the expansion under consideration, but we leave that off for clarity.) Note that rulename usually determines X, and rulename and X together always determine Y. Also note that each tree is assumed to be rooted at a dummy UTT node (with a dummy WORD head word)~ which serves as the parent for the &quot;true root&quot; of the tree.</Paragraph> <Paragraph position="5"> We can expand (1) via the chain rule:</Paragraph> <Paragraph position="7"> Note we have dropped Y from the equations, since as noted above, that sequence is determined by rulename and X. This is an appealing rewriting, since the first term of (2), which we will term the syntactic expansion probability, corresponds neatly to the theory of Lexical Preference for those rules whose head constituent is a lexical category. Consider the following sentences, from Ford et al. (1982) (1) a. Mary wanted the dress on that rack b. Mary PoSitioned the dress on that rack LP predicts that the preferred interpretation for the first sentence is the (NP the dress on that rack} structure, while for the second, a reader would prefer the! flat V-NP-PP structure. This follows from the~ theory of Lexical Preference, which stipulates that the head word of a phrase selects for its &quot;most preferred&quot; syntactic expansion. This is exactly what is modeled by the first term of Equation (2). &quot;Lexical preference&quot; has been around a long time, and the (corresponding) syntactic expansion probability we use has been used by many researchers in parsing, including many of those mentioned in this article. The difficulty iwith this model, and perhaps the reason it has not been pursued to-date, is the intense data sparsity problem encountered in estimating the second term of equation (2), the lexical introduction probability. Much work in statistical parsing limits all probabilities to &quot;binary&quot; lexical statistics, where for any probability of the form P(X1,... ,Xrt I Y1, ... 
<Paragraph position="5"> We can expand (1) via the chain rule:</Paragraph> <Paragraph position="6"> $P(rulename, h_1, \ldots, h_n \mid X, w) = P(rulename \mid X, w) \cdot P(h_1, \ldots, h_n \mid X, w, rulename)$ (2) </Paragraph> <Paragraph position="7"> Note we have dropped Y from the equations, since as noted above, that sequence is determined by rulename and X. This is an appealing rewriting, since the first term of (2), which we will term the syntactic expansion probability, corresponds neatly to the theory of Lexical Preference for those rules whose head constituent is a lexical category. Consider the following sentences, from Ford et al. (1982):
(1) a. Mary wanted the dress on that rack
b. Mary positioned the dress on that rack
LP predicts that the preferred interpretation for the first sentence is the (NP the dress on that rack) structure, while for the second, a reader would prefer the flat V-NP-PP structure. This follows from the theory of Lexical Preference, which stipulates that the head word of a phrase selects for its &quot;most preferred&quot; syntactic expansion. This is exactly what is modeled by the first term of Equation (2). &quot;Lexical preference&quot; has been around a long time, and the corresponding syntactic expansion probability we use has been used by many researchers in parsing, including many of those mentioned in this article. The difficulty with this model, and perhaps the reason it has not been pursued to date, is the intense data sparsity problem encountered in estimating the second term of Equation (2), the lexical introduction probability. Much work in statistical parsing limits all probabilities to &quot;binary&quot; lexical statistics, where for any probability of the form P(X_1, ..., X_n | Y_1, ..., Y_n), at most one of the X random variables and one of the Y random variables is lexical. By allowing &quot;n-ary&quot; lexical statistics, we allow an explosion of the probability space.</Paragraph> <Paragraph position="8"> Nevertheless, we suggest that human parsing is responsive to the familiarity (or otherwise) of particular head patterns in rules. To combat the data sparsity, we have used WordNet to &quot;back off&quot; to more general semantic classes when statistics on a word are unavailable. To back off semantically, however, we need to be dealing with word senses, not just word forms.</Paragraph> <Paragraph position="9"> It is a simple refinement of our model to replace all instances of &quot;head word&quot; with &quot;head sense&quot;. Additionally, a semantic concordance of a subset of the Brown Corpus and WordNet senses is available (Landes et al., 1998). Thus, our corpus, collected as described in Section 2, can be augmented to use statistics on word senses in syntactic constructs, after alignment of this semantic concordance (of Brown) with the Treebank's labeled bracketing (of Brown). Moreover, a language model that distinguishes word senses will tend to reflect the semantic as well as syntactic and lexical patterns in language, and thus should be advantageous both in training the model and in using it for parsing.</Paragraph> </Section> <Section position="5" start_page="180" end_page="181" type="metho"> <SectionTitle> 4 Estimation </SectionTitle> <Paragraph position="0"> We have used WordNet senses so that we might combat the data sparsity we encounter when trying to calculate the probabilities in Equation (2). Specifically, we have employed a &quot;semantic backoff&quot; in estimating the probabilities, where the backing off is done by ascending in the WordNet hierarchy (following its hypernym relations). When attempting to calculate the probability of a syntactic expansion -- the probability of a category X with head sense w expanding as rulename -- we search, breadth-first, for the first hypernym of w in WordNet which occurred with X at least t times in our training data, where t is some threshold value. So the probability P(rulename | X, w) is estimated as P(rulename | X, p(w)), where p(w) denotes the hypernym we found.</Paragraph> <Paragraph position="1"> Similarly, for the probability of lexical introduction, we abstract the tuple (X, w, rulename) to a tuple (X, p'(w), rulename) which occurred sufficiently often. Once this is found, we search upward, again breadth-first,[2] for some abstraction a(h) of h which appeared at least once in the context of the adequate conditioning information. Each a(h_i) is some parent of the word sense h_i. The probability of each original word h_i is then conditioned on the appropriate hypernym of the found abstraction. So we approximate:</Paragraph> <Paragraph position="2"> $P(h_1, \ldots, h_n \mid X, w, rulename) \approx P(a(h_1), \ldots, a(h_n) \mid X, p'(w), rulename) \cdot \prod_i P(h_i \mid a(h_i))$ </Paragraph> <Paragraph position="3"> Note that p(w) in the first estimation may not equal p'(w) in the second. Also note that backing off completely, to the TOP of the ontology, for the word in the conditioning information, is equivalent to dropping it from the conditioning information. Backing off completely in the search for the abstraction when calculating the probability of lexical introduction effectively reduces that probability to $\prod_i P(h_i)$.[3]</Paragraph>
[Footnote 2: Breadth-first is a first approximation as the search mechanism; we intend to pursue this issue in future work.]
[Footnote 3: We actually stop short of this in our estimations. We search upward for the top-most nodes in WordNet, but we do not continue to the synthetic TOP node. Instead, we drop the lexeme from the conditioning information and restart the search.]
</Section>
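As an illustration of the semantic backoff just described, the sketch below walks up the hypernym hierarchy breadth-first until it finds a sense that co-occurred with the category at least t times. The count dictionaries and the `hypernyms` accessor are placeholders assumed for this sketch; they are not the authors' code.

```python
# Illustrative sketch of the semantic backoff for the syntactic expansion
# probability P(rulename | X, w).  pair_count, rule_count, rule_given_cat
# and hypernyms() stand in for counts gathered from the training corpus and
# for the WordNet hypernym hierarchy.
from collections import deque

def backed_off_sense(X, w, pair_count, hypernyms, t=10):
    """Breadth-first search up the hypernym hierarchy for the first sense
    p(w) that occurred with category X at least t times in training."""
    queue, seen = deque([w]), {w}
    while queue:
        sense = queue.popleft()
        if pair_count.get((X, sense), 0) >= t:
            return sense                      # this is p(w)
        for parent in hypernyms(sense):       # ascend one hypernym level
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None                               # ran out of hypernyms

def expansion_probability(rulename, X, w, rule_count, pair_count,
                          rule_given_cat, hypernyms, t=10):
    """Estimate P(rulename | X, w) as P(rulename | X, p(w)); if no usable
    hypernym exists, drop the lexical conditioning and use P(rulename | X)."""
    p_w = backed_off_sense(X, w, pair_count, hypernyms, t)
    if p_w is None:
        return rule_given_cat.get((rulename, X), 0.0)
    return rule_count.get((rulename, X, p_w), 0) / pair_count[(X, p_w)]
```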
<Section position="6" start_page="181" end_page="182" type="metho"> <SectionTitle> 5 Experimental Results </SectionTitle> <Paragraph position="0"> We sequestered 421 sentences from our corpus of 4892 sentences (with trees and sense-tags), and used the balance for training the probabilities in Equation (2). These 4892 are the parseable segment of the 16,374 trees for which we were able to &quot;match up&quot; the Treebank syntactic annotation with the semantic concordance.</Paragraph> <Paragraph position="1"> (Random errors and inconsistencies seem to account for why not all 19,843 trees align. In fact, these 19,843 themselves exclude all trees which appear to be headlines or some other irregular text. We do not, however, exclude any trees on the basis of the type of their root category.</Paragraph> <Paragraph position="2"> The corpus contains sentences as well as verb phrases, noun phrases, etc.) We then tested the parser varying two binary parameters:
* whether or not the semantic backoff procedure was used -- if not, an unobserved conditioning event would immediately have us drop the lexical information. For example, (X, w) would immediately be backed off to simply (X).</Paragraph> <Paragraph position="3"> * whether or not we simply estimated the joint probability P(h_1, ..., h_n | X, w, rulename) as $\prod_i P(h_i \mid X, w, rulename)$. This we will call the &quot;binary&quot; assumption, as opposed to &quot;n-ary&quot;. Effectively, it means that each daughter's head word sense is introduced independently of the others.</Paragraph> <Paragraph position="4"> Tables 1 and 2 display the results for the four different settings, along with the results for a straight PCFG model (as a baseline). Note that t, our threshold parameter from above, was set to 10 for these experiments. Labeled precision and recall (Table 1) are the same as in other reports on statistical parsing: they measure how often a particular syntactic category was correctly calculated to span a particular portion of the input. Recall that our corpus was derived using a hand-crafted grammar. It makes sense, then, to add an additional criterion for correctness: we can check the actual expansions (rulenames) used and see if they were correct. This metric speaks to an issue raised by Charniak (1997b) when he notes that the rule NP -> NP NP has (at least) two different interpretations: one for appositive NPs and one for &quot;unit&quot; phrases like &quot;5 dollars a share&quot;.[4] A hand-written grammar will differentiate these two constructions. Thus Table 2 shows precision and recall figures for this more strict criterion, for the four models in question plus PCFG again as a baseline. Note also that since Table 2 is for syntactic expansions, it does not include lexical level bracketings.</Paragraph>
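A minimal sketch of the two scoring criteria just described: standard labeled bracketing precision/recall, and the stricter variant that also requires the licensing rulename to match. The (start, end, category, rulename) constituent tuples are an assumed representation for illustration only.

```python
# Illustrative sketch of the two evaluation criteria.  Constituents are
# assumed to be (start, end, category, rulename) tuples; duplicate spans
# are ignored in this simplified, set-based version.
def precision_recall(gold, test, strict=False):
    def key(constituent):
        start, end, category, rulename = constituent
        # strict=True adds the rulename check (the Table 2 criterion)
        return (start, end, category, rulename) if strict else (start, end, category)
    gold_set = {key(c) for c in gold}
    test_set = {key(c) for c in test}
    matched = len(gold_set & test_set)
    precision = matched / len(test_set) if test_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    return precision, recall
```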
<Section position="1" start_page="181" end_page="182" type="sub_section"> <SectionTitle> Results </SectionTitle> <Paragraph position="0"> First note that the degree of improvement over baseline of even the most minimal model is approximately what other researchers, using purely corpus-driven techniques, have reported (Charniak, 1997a).</Paragraph> <Paragraph position="1"> Also note that the &quot;full&quot; model, using both n-ary lexical statistics and semantic backoff, performs (statistically) significantly better than both of the models which do not use semantic backoff. The lone exception is that the precision of the labeled bracketings is not significantly different for the &quot;full&quot; model and the &quot;minimal&quot; model.[5]</Paragraph>
[Footnote 4: In fact there should be syntactic differences for these two constructions, since phrases like &quot;the dollars the share&quot; are syntactically ill-formed unit noun phrases.]
[Footnote 5: Two-sided tests were used, with α = 0.05.]
<Paragraph position="2"> Interestingly, the &quot;minimal&quot; model is not significantly different from either of the two models gotten by adding only one of n-ary statistics or semantic backoff. The improvement is only significant when both features are added.</Paragraph> <Paragraph position="3"> Our results for word sense disambiguation (obtained as a by-product of parsing) are shown in Table 3. Backing off semantically enables the parser to do a better job at getting senses right. The sense recall figures for the two models which use semantic backoff are significantly better than for those models which do not. Additionally, the improvement over baseline is significantly better for those models which use semantic backoff (11 percentage points improvement) than for those which do not (4 points better).[6] The baseline results are gotten by choosing the most frequent sense for the word, given the part of speech assigned by the parser. (Hence it may be different across different models for the parser.)</Paragraph>
[Footnote 6: The improvement gotten for moving from binary to n-ary relations, when using WordNet, is not significant. This is most likely due to the small percentage of expansions which are likely to be helped by n-ary statistics (less than 1%). In fact, there were only seven instances, over the 421-sentence test set, where an n-ary rule was correctly selected by the parser and the head of that phrase was also correctly selected. Given such small numbers, we would not expect to see a significant improvement, when using n-ary statistics, for word sense disambiguation.]
</Section> </Section>
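As a concrete illustration of the most-frequent-sense baseline described above, the sketch below picks, for each word, the sense seen most often in training under the part of speech the parser assigned. The `sense_freq` table is an assumed structure for this sketch, not the authors' code.

```python
# Illustrative sketch of the word sense baseline: choose the most frequent
# training sense of a word given the part of speech the parser assigned.
# sense_freq[(word, pos)] is assumed to be a Counter over sense labels.
from collections import Counter
from typing import Dict, Optional, Tuple

def baseline_sense(word: str, pos: str,
                   sense_freq: Dict[Tuple[str, str], Counter]) -> Optional[str]:
    counts = sense_freq.get((word, pos))
    if not counts:
        return None                     # unseen (word, pos) pair
    return counts.most_common(1)[0][0]  # most frequent sense in training

def sense_recall(gold_senses, predicted_senses) -> float:
    """Fraction of tokens whose predicted sense matches the gold sense."""
    correct = sum(1 for g, p in zip(gold_senses, predicted_senses) if g == p)
    return correct / len(gold_senses) if gold_senses else 0.0
```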
<Section position="7" start_page="182" end_page="183" type="metho"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> As our framework and corpus are rather different from other work on parsing and sense disambiguation, it is difficult to make quantitative comparisons. Many researchers have achieved sense disambiguation rates above 90% (e.g. Gale et al. (1992)), but this work has typically focused on disambiguating a few polysemous words with &quot;coarse&quot; sense distinctions using a large corpus. Here, we are disambiguating all words with WordNet senses and not very much data.</Paragraph> <Paragraph position="1"> Ng and Lee (1996) report results on disambiguating among WordNet senses for the most frequent 191 nouns and verbs (together, they account for 20% of all nouns and verbs we expect to encounter in a random selection of text).</Paragraph> <Paragraph position="3"> They get an improvement of 6.9 percentage points (54.0 over 47.1 percent) in disambiguating instances of these words in the Brown Corpus.</Paragraph> <Paragraph position="4"> Since the most frequent words are typically the most polysemous, the ambiguity problem is more severe for this subset, but there is also more data: we have about 24,000 instances of 10,000 distinct senses in our corpus, and Ng and Lee (1996) use 192,800 occurrences of their 191 words.</Paragraph> <Paragraph position="5"> Carroll et al. (1998) report results on a parser, similarly based on linguistically well-founded resources, using corpus-derived subcategorization probabilities (the first term in Equation (2)). They report a significant increase in parsing accuracy, measured using a system of grammatical relations. Their corpus is annotated with grammatical relations like subj and ccomp, and the parser can then output these relations as a component of a parse. Carroll et al. (1998) argue that these relations enable a more accurate metric for parsing than labeled bracketing precision and recall. Our evaluation of phrase structure rules used in a parse is a crude attempt at this higher-level evaluation.</Paragraph> <Paragraph position="6"> As mentioned above, much recent work on lexicalizing parsers has focused on binary lexical relations, specifically head-head relations of mother and daughter constituents, e.g. (Carroll and Rooth, 1998; Collins, 1996). Some have used word classes to combat the sparsity problem (Charniak, 1997a). Link grammars allow for a probabilistic model with ternary head-head-head relations (Lafferty et al., 1992). The link grammar website reports that, on a test of their parser on 100 sentences (average length 25 words) of Wall Street Journal text, over 82% of the labeled constituents were correctly calculated. Some limited work has been done using n-ary lexical statistics. Hogenhout and Matsumoto (1996) describe a lexicalization of context free grammars very similar to ours, but without presenting a generative model. The probabilities used, as a result, ignore valuable conditioning information, such as the head word of a constituent helping to predict its syntactic expansion. Nevertheless, they are able to achieve approximately 95% labeled bracketing precision and recall on their corpus. Note that they use a small finite number of word classes, rather than lexical items, in their statistics.</Paragraph> <Paragraph position="7"> Utsuro and Matsumoto (1997) present a very interesting mechanism for learning semantic case frames for Japanese verbs: each case frame is a tuple of independent component frames (each of which may have an n-tuple of slots). Moreover, they use an ontology rather than simply word classes when finding the case frames. In this way, the work is essentially a generalization of the work of Resnik (1993). They report results on disambiguating whether a nominal argument in a complex Japanese sentence belongs to the subordinate clause verb or the matrix clause verb.
Their evaluation covers three Japanese verbs, and achieves accuracy of 96% on this disambiguation task.</Paragraph> <Paragraph position="8"> Chang et al. (1992) describe a model for machine translation which can accommodate n-ary lexical statistics. They report no improvement in parsing accuracy for n > 2. Their results most likely suffer from sparse data (they had only about 1000 sentences), although they did use semantic classes rather than lexical items.</Paragraph> <Paragraph position="9"> They report that their total sentence accuracy (percent of test sentences whose calculated bracketing is completely correct) is approximately 58%.</Paragraph> </Section> <Section position="8" start_page="183" end_page="183" type="metho"> <SectionTitle> 7 Future Work </SectionTitle> <Paragraph position="0"> There are many directions to take this work.</Paragraph> <Paragraph position="1"> One advantage of our well-founded framework is that it allows more linguistic information, e.g. features like tense and agreement, to be used in the language model. For example, a verb phrase in the imperfect may often be modified by an adjunctive, durative PP headed by &quot;for&quot;. We would like to use the techniques of corpus-based parsing to extract these statistical patterns automatically. The model easily extends to incorporate a host of syntactic features (Seagull and Schubert, 1998).</Paragraph> <Paragraph position="3"> [Footnote: Note that these particular features are in theory available to a purely corpus-based parser, as part-of-speech tags in the Penn Treebank are marked for tense and agreement. But that information is not available to the phrase-level constituent unless a notion of heads and feature passing is added to the mechanism. It seems that foot features, unless explicitly realized at the phrase level (e.g. WHPP), would be even more difficult to percolate without an a priori notion of features and grammar.]</Paragraph> <Paragraph position="4"> Currently the parser uses a pruning scheme that filters as it creates the parse bottom-up.</Paragraph> <Paragraph position="5"> The filtering is done based on the probability of the individual nodes, irrespective of the global context. The pruning procedure needs refinement, as our full model was not able to arrive at a parse for eight of the 421 sentences in the test set.</Paragraph> <Paragraph position="6"> We would certainly like to expand our corpus by increasing the coverage of our grammar.</Paragraph> <Paragraph position="7"> Also, adding a constituent size/distance effect, as described by Schubert (1986) and as used by some researchers in parsing (e.g. Lesmo and Torasso (1985) and Collins (1997)) would almost certainly improve parsing.</Paragraph> <Paragraph position="8"> Most likely, WordNet senses are more fine-grained than we need for syntactic disambiguation. We may investigate methods of automatically collapsing senses which are similar. Also, we may use more data on word sense frequencies, outside of the data we get from our &quot;parseable bracketings&quot;. We used WordNet for these experiments both because WordNet provides an ontology, and because there was an extant corpus which was annotated with both syntactic and word sense information. Using a corpus that is tagged with &quot;coarser&quot; senses will almost certainly yield better results, on both sense disambiguation and parsing.</Paragraph> </Section> </Paper>