<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1320">
  <Title>A Statistical Model for Parsing and Word-Sense Disambiguation</Title>
  <Section position="4" start_page="155" end_page="157" type="metho">
    <SectionTitle>
3 The Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="155" end_page="156" type="sub_section">
      <SectionTitle>
3.1 Overview
</SectionTitle>
      <Paragraph position="0"> The parsing model we started with was extracted from BBN's SIFT system (Miller et al., 1998), which we briefly present again here, using examples from Figure 1 to illustrate the model's parameters. 3 The model generates the head of a constituent first, then each of the left- and rightmodifiers, generating from the head outward, using a bigram model of node labels. Here are the first few elements generated by the model for the tree of Figure 1:  1. S and its head word and part of speech, caught- VB D.</Paragraph>
      <Paragraph position="1"> 2. The head constituent of S, VP.</Paragraph>
      <Paragraph position="2"> 3. The head word of the VP, caught-VBD.</Paragraph>
      <Paragraph position="3"> 4. The premodifier constituent ADVP.</Paragraph>
      <Paragraph position="4"> 3We began with the BBN parser because its authors  were kind enough to allow us to extent it, and because its design allowed easy integration with our existing WordNet code.</Paragraph>
      <Paragraph position="6"> stituent of the VP.</Paragraph>
      <Paragraph position="7"> This process recurses on each of the modifier constituents (in this case, the subject NP and the VP) until all words have been generated. (Note that many words effectively get generated high up in the tree; in this example sentence, the last words to get generated are the two the's ) More formally, the lexicalized PCFG that sits behind the parsing model has rules of the form Figure 1. For brevity, we omit the smoothing details of BBN's model (see (Miller et al., 1998) for a complete description); we note that all smoothing weights are computed via the technique described in (Bikel et al., 1997). The probability of generating p as the root label is predicted conditioning on only +TOP+, which is the hidden root of all parse trees: P(Pl +TOP+), e.g., P(S I + TOP+). (2) The probability of generating a head node h with a parent p is</Paragraph>
      <Paragraph position="9"> The probability of generating a left-modifier l~</Paragraph>
      <Paragraph position="11"> where P, H, L, and .P~ are all lexicalized nonterminals, i.e., of the form X(w, t, fl, where X is a traditional CFG nonterminal and (w, t, f/ is the word-part-of.-speech-word-feature triple that is the head of the phrase denoted by X. 4 The lexicalized nonterminal H is so named because it is the head constztuent, where P inherits its head triple from this head constituent.</Paragraph>
      <Paragraph position="12"> The constituents labeled L~ and .R~ are leftand right-modifier constituents, respectively.</Paragraph>
    </Section>
    <Section position="2" start_page="156" end_page="157" type="sub_section">
      <SectionTitle>
3.2 Probability structure of the original model
</SectionTitle>
      <Paragraph position="0"> original model We use p to denote the unlexicalized nonterminal corresponding to P, and similarly for li, ri and h. We now present the top-level generation probabilities, along with examples from 4The inclusion of the word feature in the BBN model was due to the work described in (Weischedel et al., 1993), where word features helped reduce part of speech ambiguity for unknown words.</Paragraph>
      <Paragraph position="2"> when generating the NP for NP(boy-NN), and the probability of generating a right modifier</Paragraph>
      <Paragraph position="4"> when generating the NP for NP(ball-NN). 5 The probabilities for generating lexical elements (part-of-speech tags, words and word features) are as follows. The part of speech tag of the head of the entire sentence, th, is 5The bidden nonterminal +BEGIN+ is used to provide a convenient mechanism for determining the initial probability of the underlying Markov process generating the modifying nonterminals; the hidden non-terminal +END+ is used to provide consistency to the underlying Markov process, i.e., so that the probabilities of all possible nonterminal sequences sum to 1.  computed conditioning only on the top-most symbol p:6 P(th I P). (6) Part of speech tags of modifier constituents, tt, and tri, are predicted conditioning on the modifier constituent lz or ri, the tag of the head constituent, th, and the word of the head constituent, Wh: P(tt, Ill, th, Wh) and P(tr, \[ri, th, Wh). (7) The head word of the entire sentence, Wh, is predicted conditioning only on the top-most symbol p and th.</Paragraph>
      <Paragraph position="5"> P(Wh\[th,p). (8) Head words of modifier constituents, wl, and wr,, are predicted conditioning on all the context used for predicting parts of speech in (7), as well as the parts of speech themsleves P(wt, \]tt,, li, th, Wh) and P(wr, \] try, r~, th, Wh). (9) The word feature of the head of the entire sentence, fh, is predicted conditioning on the top-most symbol p, its head word, wh, and its head tag, th: P(fh \[Wh, th,p). (10) Finally, the word features for the head words of modifier constituents, fz, and fr,, are predicted conditioning on all the context used to predict modifier head words in (9), as well as the modifier head words themselves: P(ft, I known(wt, ), tt~, li, th, Wh) (11) and P(fr, I known(w~,), tr,, ri, th, Wh) where known(x) is a predicate returning true if the word x was observed more than 4 times in the training data.</Paragraph>
      <Paragraph position="6"> The probability of an entire parse tree is the product of the probabifities of generating all of the elements of that parse tree, where an element is either a constituent label, a part of speech tag, a word or a word feature. We obtain maximum-likelihood estimates of the parameters of this model using frequencies gathered from the training data.</Paragraph>
      <Paragraph position="7"> 6This is the one place where we have altered the original model, as the lexical components of the head of the entire sentence were all being estimated incorrectly, causing an inconsistency in the model. We have corrected the estimation of th, Wh and fh in our implementation. null</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="157" end_page="158" type="metho">
    <SectionTitle>
4 Word-sense Extensions to the
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="157" end_page="158" type="sub_section">
      <SectionTitle>
Lexical Model
</SectionTitle>
      <Paragraph position="0"> The desired output structure of our combined parser/word-sense disambiguator is a standard, Treebank-style parse tree, where the words not only have parts ef speech, but also WordNet synsets. Incorporating synsets into the lexical part of the model is fairly straightforward: a synset is yet another element to be generated. The question is when to generate it.</Paragraph>
      <Paragraph position="1"> The lexical model has decomposed the generation of the (w, t, f) triple into three steps, each conditioning on all the history of the previous step. While it is probabilistically identical to predict synsets at any of the four possible points if we continue to condition on all the history at each step, we would like to pick the point that is most well-founded both in terms of the underlying linguistic structure and in terms of what can be well-estimated. In Section 2.2.1 we mentioned the soft-clustering aspect of synsets; in fact, they have a duality.</Paragraph>
      <Paragraph position="2"> On the one hand, they serve to add specificity to what might otherwise be an ambiguous lexical item; on the other, they are sets, clustering lexical items that have similar meanings. Even further, noun and verb synsets form a concept taxonomy, the hypernym relation forming a partial ordering on the lemmas contained in WordNet. The former aspect corresponds roughly to what we as human listeners or readers do: we hear or see a sequence of words in context, and determine incrementally the particular meaning of each of those words. The latter aspect corresponds more closely to a mental model of generation: we have a desire or intention to convey, we choose the appropriate concepts with which to convey it, and we realize that desire or intention with the most felicitous syntactic structure and lexical realizations of those concepts. As this is a generative model, we generate a word's synset after generating the part of speech tag but before generating the word itself/ The synset of the head of the entire sentence, Sh is predicted conditioning only on the top-most symbol p and the head tag, th: P(Sh\[th,p). (12) We accordingly changed the probability of 7We believe that synsets and parts of speech are largely orthogonal with respect to tlieir lexical information, and thus their relative order of prediction was not a concern.</Paragraph>
      <Paragraph position="3">  generating the head word of the entire sentence to be P(Wh I Sh, th,p). (13) The probability estimates for (12) and (13) are not smoothed.</Paragraph>
      <Paragraph position="4"> The probability model for generating synsets of modifier constituents mi, complete with smoothing components, is as follows:</Paragraph>
      <Paragraph position="6"> where @'(Sh) is the i th hypernym of Sh. The WordNet hypernym relations, however, do not form a tree, but a DAG, so whenever there are multiple hypernyms, the uniformly-weighted mean is taken of the probabilities conditioning on each of the hypernyms. That is,</Paragraph>
      <Paragraph position="8"> Note that in the first level of back-off, we no longer condition on the head word, but strictly on its synset, and thereafter on hypernyms of that synset; these models, then, get at the heart of our approach, which is to abstract away from lexical head relations, and move to the more general lexico-semantic relations, here represented by synset relations.</Paragraph>
      <Paragraph position="9"> Now that we generate synsets for words using (14), we can also change the word generation model to have synsets in its history: P(w~, I sm,, t~,, m,, Wh, Sh) = (16) ),0P(wm, I sin,, t~, mi, wh)  where once again, @i(Sh) is the zth hypernym of Sh. For both the word and synset prediction models, by backing off up the hypernym chain, there is an appropriate confiation of similar head relations. For example, if in training the verb phrase \[strike the target\] had been seen, if the unseen verb phrase \[attack the target\] appeared during testing, then the training from the semantically-similar training phrase could be used, since this sense of attack is the hypernym of this sense of stroke.</Paragraph>
      <Paragraph position="10"> Finally, we note that both of these synsetand word-prediction probability estimates contain an enormous number of back-off levels for nouns and verbs, corresponding to the head word's depth in the synset hierarchy. A valid concern would be that the model might be backing off using histories that are fax too general, so we experimented with limiting the hypernym back-off to only two, three and four levels. This change produced a negligible difference in parsing performance, s</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="158" end_page="159" type="metho">
    <SectionTitle>
5 A New Approach, A New Data
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="158" end_page="159" type="sub_section">
      <SectionTitle>
Set
</SectionTitle>
      <Paragraph position="0"> Ideally, the well-established gold standard for syntax, the Penn Treebank, would have a parallel word-sense-annotated corpus; unfortunately, no such word-sense corpus exists.</Paragraph>
      <Paragraph position="1"> However, we do have SemCor (Miller et al., 1994), where every noun, verb, adjective and adverb from a 455k word portion of the Brown Corpus has been assigned a WordNet synset.</Paragraph>
      <Paragraph position="2"> While all of the Brown Corpus was annotated in the style of Treebank I, a great deal was also more recently annotated in Tree-bank II format, and this corpus has recently been released by the Linguistic Data Consortium. 9 As it happens, the intersection between the Treebank-II-annotated Brown and SemCor comprises some 220k words, most of which is fiction, with some nonfiction and humor writing as well.</Paragraph>
      <Paragraph position="3"> We went through all 220k words of the corpora, synchronizing them. That is, we made sure that the corpora were identical up to the spelling of individual tokens, correcting all 8We aim to investigate the precise effects of our back-off strategy in the next version of our combined parsing/WSD model.</Paragraph>
      <Paragraph position="4"> 9We were given permission to use a pre-release version of this Treebank II-style corpus.</Paragraph>
      <Paragraph position="5">  tokenization and sentence-breaking discrepancies. This correcton task ranged from the simple, such as connecting two sentences in one corpus that were erroneously broken, to the middling, such as joining two tokens in SemCor that comprised a hyphenate in Brown, to the difficult, such as correcting egregious parse annotation errors, or annotating entire sentences that were omitted from SemCor. In particular, the case of hyphenates was quite frequent, as it was the default in SemCor to split up all such words and assign them their individual word senses (synsets). In general, we attempted to make SemCor look as much as possible like the Treebank II-annotated Brown, and we used the following guidelines for assigning word senses to hyphenates:  1. Assign the word sense of the head of the hyphenate. E.g., both twelve-foot and ten-foot get the word sense of foot_l (the unit of measure equal to 12 inches).</Paragraph>
      <Paragraph position="6"> 2. If there is no clear head, then attempt to annotate with the word sense of the hypernym of the senses of the hyphenate components. E.g., U.S.-Soviet gets the word sense of country_2 (a state or nation). null 3. If options 1 and 2 are not possible, the hyphenate is split in the Treebank II file.</Paragraph>
      <Paragraph position="7"> 4. If the hyphenate has the prefix non- or  anti-, annotate with the word sense of that which follows, with the understanding that a post-processing step could recover the antonymous word sense, if necessary. null After three passes through the corpora, they were perfectly synchronized. We are seeking permission to make this data set available to any who already have access to both SemCor and the Treebank II version of Brown.</Paragraph>
      <Paragraph position="8"> After this synchronization process, we merged the word-sense annotations of our corrected SemCor with the tokens of our corrected version of the Treebank II Brown data. Here we were forced to make two decisions.</Paragraph>
      <Paragraph position="9"> First, SemCor allows multiple synsets to be assigned to a particular word; in these cases, we simply discard all but the first assigned synset. Second, WordNet has collocations, whereas Treebank does not. To deal with this disparity, we re-analyze annotated collocations as a sequence of separate words that have all been assigned the same synset as was assigned the collocation as a whole. This is not as unreasonable as it may sound; for example, vice_president is a lemma in WordNet and appears in SemCor, so the merged corpus has instances where the word president has the synset vice president l, but only when preceded by the word vice. The cost of this decision is an increase in average polysemy.</Paragraph>
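      <Paragraph> A small sketch of the collocation re-analysis just described; the input format (underscore-joined collocations paired with a synset label) and the synset names are assumptions made for illustration only:
# Illustrative only: re-analyze a WordNet collocation as separate tokens
# that all carry the collocation's synset.
def spread_collocation_synsets(semcor_tokens):
    """semcor_tokens: list of (surface, synset) pairs; a surface form may be
    an underscore-joined collocation such as 'vice_president'."""
    out = []
    for surface, synset in semcor_tokens:
        for word in surface.split("_"):
            out.append((word, synset))
    return out

print(spread_collocation_synsets([("vice_president", "vice_president-sense-1"),
                                  ("spoke", "speak-sense-2")]))
# [('vice', 'vice_president-sense-1'), ('president', 'vice_president-sense-1'),
#  ('spoke', 'speak-sense-2')]
</Paragraph>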
    </Section>
  </Section>
  <Section position="7" start_page="159" end_page="159" type="metho">
    <SectionTitle>
6 Training and Decoding
</SectionTitle>
    <Paragraph position="0"> Using this merged corpus, actual training of our model proceeds in an identical fashion to training the non-WordNet-extended model, except that for each lexical relation, the hypernym chain of the parent head is followed to derive counts for the various back-off levels described in Section 4. We also developed a &amp;quot;plug-'n'-play&amp;quot; lexical model system to facilitate experimentation with various word- and synset-prediction models and back-off strategies. null Even though the model is a top-down, generative one, parsing proceeds bottom-up. The model is searched via a modified version of CKY, where candidate parse trees that cover the same span of words are ranked against each other. In the unextended parsing model, the cells corresponding to spans of length one are seeded with (w,t,f) triples, with every possible tag t for a given word zv (the word-feature f is computed deterministically for w); this step introduces the first degree of ambiguity in the decoding process. Our WordNet-extended model adds to this initial ambiguity, for each cell is seeded with (w, t, f, s) quadruples, with every possible synset s for a given word-tag pair.</Paragraph>
    <Paragraph position="1"> During decoding, two forms of pruning are employed: a beam is applied to each cell in the chart, pruning away all parses whose ranking score is not within a factor of e -k of the top-ranked parse, and only the top-ranked n sub-trees are maintained, and the rest are pruned away. The &amp;quot;out-of-the-box&amp;quot; BBN program uses values of-5 and 25 for k and n, respectively.</Paragraph>
    <Paragraph position="2"> We changed these to default to -9 and 50, because generating additional unseen items (in our case, synsets) will necessarily lower intermediate ranking scores.</Paragraph>
  </Section>
  <Section position="8" start_page="159" end_page="161" type="metho">
    <SectionTitle>
7 Experiments and Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="159" end_page="161" type="sub_section">
      <SectionTitle>
7.1 Parsing
</SectionTitle>
      <Paragraph position="0"> Initially, we created a small test set, blindly choosing the last 117 sentences, or 1%, of  our 220k word corpus, sentences which were, as it happens, from section &amp;quot;r&amp;quot; of the Brown Corpus. After some disappointing parsing results using both the regular parser and our WordNet-extended version, we peeked in (Francis and Ku~era, 1979) and discovered this was the humor writing section; our initial test corpus was literally a joke. To create a more representative test set, we sampled every 100th sentence to create a new liTsentence test set that spanned the entire range of styles in the 220k words; we put all other sentences in the training set. 1deg For the sake of comparison, we present results for both test sets (from section &amp;quot;r&amp;quot; and the balanced test set) and both the standard model (Norm) and our WN-extended model (WN-ext) in Table 1.11 We note that after we switched to the balanced test set, we did not use the &amp;quot;out-of-the-box&amp;quot; version of the BBN parser, as its default settings for pruning away low-count items and the threshold at which to count a word as &amp;quot;unknown&amp;quot; were too high to yield decent results. Instead, we used precisely the same settings as for our WordNet-extended version, complete with the larger beam width discussed in the previous section32 , The reader will note that our extended model performs at roughly the same level as the unextended version with respect to parsing--a shave better with the &amp;quot;r&amp;quot; test set, and slightly worse on the balanced test set.</Paragraph>
      <Paragraph position="1"> Recall, however, that this is in spite of adding more intermediate ambiguity during the decoding process, and yet using the same beam width. Furthermore, our extensions have occurred strictly within the framework of the original model, but we believe that for the true advantages of synsets to become apparent, we must use trilexical or even tetralex~degWe realize these are very small test sets, but we presume they are large enough to at least give a good indicator of performance on the tasks evaluated. They were kept small to allow for a rapid train-test-analyze cycle, z.e., they were actually used as development test sets. With the completion of these initial experiments, we are going to designate a proper three-way divsion of training, devtest and test set of this new merged corpus.</Paragraph>
      <Paragraph position="2"> UThe scores in the rows labeled Norm, &amp;quot;r&amp;quot;, indicating the performance of the standard BBN model on the &amp;quot;r&amp;quot; test set, are actually scores based on 116 of the 117 sentences, as one sentence did not get parsed due to a timeout in the program.</Paragraph>
      <Paragraph position="3">  both test sets. All results are percentages, except for those in the CB column. *See footnote  dependencies might cripple a standard generative model, the soft-clustering aspects of synsets should offset the sparse data problem.</Paragraph>
      <Paragraph position="4"> As an example of the lack of such dependencies, in the current model when predicting the attachment of \[bought company \[for million\]\], there is no current dependence between the verb bought and the object of the preposition million--a dependence shown to be useful in virtually all the PP attachment work, and particularly in (Stetina and Nagao, 1997). Related to this issue, we note that the head rules, which were nearly identical to those used in (Collins, 1997), have not been tuned at all to this task. For example, in the sentence in Figure 2, the subject Jane is predicted conditioning on the head of the VP, which is the modal wdl, as opposed to the more semanticallycontent-rich kill. So, while the head relations provide a very useful structure for many syntactic decisions the parser needs to make, it is quite possible that the synset relations of this model would require additional or different de- null balanced test set.</Paragraph>
      <Paragraph position="5"> pendencies that would help in the prediction of correct synsets, and in turn help further reduce certain syntactic ambiguities, such as PP attachment. This is because the &amp;quot;lightweight semantics&amp;quot; offered by synset relations can provide selectional and world-knowledge restrictions that simple lexicalized nonterminal relations cannot.</Paragraph>
    </Section>
    <Section position="2" start_page="161" end_page="161" type="sub_section">
      <SectionTitle>
7.2 Word-sense disambiguation
</SectionTitle>
      <Paragraph position="0"> The WSD results on the balanced test set are shown in Table 2. A few important points must be made when evaluating these results. First, almost all other WSD approaches are aimed at distinguishing homonyms, as opposed to the type of fine-grained distinctions that can be made by WordNet. Second, almost all other WSD approaches attempt to disambiguate a small set of such homonymous terms, whereas here we are attacking the generahzed word-sense disambiguation problem. Third, we call attention to the fact that SemCor has a reported inter-annotator agreement of 78.6% overall, and as low as 70% for words with polysemy of 8 or above (Fellbaum et al., 1998), so it is with this upper bound in mind that one must consider the precision of any generalized WSD system. Finally, we note that the scores in Table 2 are for exact synset matches; that is, if our program delivers a synset that is, say, the hypernym or sibling of the correct answer, no credit is given.</Paragraph>
      <Paragraph position="1"> While it is tempting to compare these results to those of (Stetina et al., 1998), who reported 79.4% overall accuracy on a different, larger test set using their non-discourse model, we note that that was more of an upper-bound study, examining how well a WSD algorithm could perform if it had access to goldstandard-perfect parse trees33 By way of further comparison, that algorithm has a feature space similar to the synset-prediction compo1nit is not clear how or why the results of (Stetina et al., 1998) exceeded the reported inter-annotator agreement of the entire corpus.</Paragraph>
      <Paragraph position="2"> nents of our model, but the steps used to rank possible answers are based largely on heuristics; in contrast, our model is based entirely on maximum-likelihood probability estimates.</Paragraph>
      <Paragraph position="3"> A final note on the scores of Table 2: given the fact that there is not a deterministic mapping between the 50-odd Treebank and</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="161" end_page="161" type="metho">
    <SectionTitle>
4 WordNet parts of speech, when our pro-
</SectionTitle>
    <Paragraph position="0"> gram delivers a synset for a WordNet part of speech that is different from our gold file, we have called this a recall error, as this is consistent with all other WSD work, where part of speech ambiguity is not a component of an algorithm's precision.</Paragraph>
  </Section>
</Paper>