<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2410">
  <Title>Thesauruses for Prepositional Phrase Attachment</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Previous work
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 PP attachment
</SectionTitle>
      <Paragraph position="0"> Early work on PP attachment disambiguation used strictly syntactic or high-level pragmatic rules to decide on an attachment (Frazier, 1979; Altman and Steedman, 1988). However, work by Whittemore et al. (1990) and Hindle and Rooth (1993) showed that simple lexical preferences alone can deliver reasonable accuracy. Hindle and Rooth's approach was to use mostly unambiguous (v,n1,p) triples extracted from automatically parsed text to train a maximum likelihood classifier. This achieved around 80% accuracy on ambiguous samples.</Paragraph>
      <Paragraph position="1"> This marked a flowering in the field of PP attachment, with a succession of papers bringing the whole armoury of machine learning techniques to bear on the problem.</Paragraph>
      <Paragraph position="2"> Ratnaparkhi et al. (1994) trained a maximum entropy model on (v,n1,p,n2) quadruples extracted from the Wall Street Journal corpus and achieved 81.6% accuracy.</Paragraph>
      <Paragraph position="3"> The Collins and Brooks (1995) model scores 84.5% accuracy on this task, and is one of the most accurate models that do not use additional supervision. The current state of the art is 88% reported by Stetina and Nagao (1997) using the WSJ text in conjunction with WordNet. The next section discusses other specific approaches that incorporate smoothing techniques.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Similarity-based smoothing
</SectionTitle>
      <Paragraph position="0"> Smoothing for statistical models involves adjusting probability estimates away from the maximum likelihood estimates to avoid the low probabilities caused by sparse data. Typically this involves mixing in probability distributions that have less context and are less likely to suffer from sparse data problems. For example, if the probability of an attachment given a PP p(a|v,n1,p,n2) is undefined because that quadruple was not seen in the training data, then a less specific distribution such as p(a|v,n1,p) can be used instead. A wide range of different techniques have been proposed (Chen and Goodman, 1996) including the backing-off technique used by Collins' model (see Section 3).</Paragraph>
      <Paragraph position="1"> An alternative but complementary approach is to mix in probabilities from distributions over &amp;quot;similar&amp;quot; contexts. This is the idea behind both similarity-based and class-based smoothing. Class-based methods cluster similar words into classes which are then used in place of actual words. For example the class-based language model of (Brown et al., 1992) is defined as:</Paragraph>
      <Paragraph position="3"> This helps solve the sparse data problem since the number of classes is usually much smaller than the number of words.</Paragraph>
      <Paragraph position="4"> Class-based methods have been applied to the PP attachment task in several guises, using both automatic clustering and hand-crafted classes such as WordNet. Li and Abe (1998) use both WordNet and an automatic clustering algorithm to achieve 85.2% accuracy on the WSJ dataset. The maximum entropy approach of Ratnaparkhi et al. (1994) uses the mutual information clustering algorithm described in (Brown et al., 1992). Although class-based smoothing is shown to improve the model in both cases, some researchers have suggested that clustering words is counterproductive since the information lost by conflating words into broader classes outweighs the benefits derived from reducing data sparseness. This remains to be proven conclusively (Dagan et al., 1999).</Paragraph>
      <Paragraph position="5"> In contrast, similarity-based techniques do not discard any data. Instead the smoothed probability of a word is defined as the total probability of all similar words S(w) as drawn from a thesaurus, weighted by their similarity a(w,wprime). For example, the similarity-based language model of (Dagan et al., 1999) is defined as:</Paragraph>
      <Paragraph position="7"> tion reflects how often the two words appear in the same context. For example, Lin's similarity metric (Lin, 1998b) used in this paper is based on an information-theoretic comparison between a pair of co-occurrence probability distributions.</Paragraph>
      <Paragraph position="8"> This language model was incorporated into a speech recognition system with some success (Dagan et al., 1999). Similarity-based methods have also been successfully applied word sense disambiguation (Dagan et al., 1997) and extraction of grammatical relations (Grishman and Sterling, 1994). Similarity-based smoothing techniques of the kind described here have not yet been applied to probabilistic PP attachment models. The memory-based learning approach of (Zavrel et al., 1997) is the closest point of contact and shares many of the same ideas, although the details are quite different. Memory-based learning consults similar previously-seen examples to make a decision, but the similarity judgements are usually based on a strict feature matching measure rather than on co-occurrence statistics. Under this scheme pizza and pasta are as different as pizza and Paris. To overcome this Zavrel et al. also experiment with features based on a reduced-dimensionality vector of co-occurrence statistics and note a small (0.2%) increase in performance, leading to a final accuracy of 84.4%.</Paragraph>
      <Paragraph position="9"> Our use of specialist thesauruses for this task is also novel, although in they have been used in the somewhat related field of selectional preference acquisition by</Paragraph>
      <Paragraph position="11"> 1. f(a,v,n1,p,n2)f(v,n1,p,n2) 2. f(a,v,n1,p)+f(a,v,p,n2)+f(a,n1,p,n2)f(v,n1,p)+f(v,p,n2)+f(n1,p,n2) 3. f(a,v,p)+f(a,n1,p)+f(a,p,n2)f(v,p)+f(n1,p)+f(p,n2) 4. f(a,p)f(p) 5. Default: noun attachment  rithm. A less specific context is used when the denominator is zero or p(a|v,n1,p,n2) = 0.5. Takenobu et. al. (1995). Different thesauruses were created for different grammatical roles such as subject and object, and used to build a set of word clusters. Clusters based on specialist thesauruses were found to predict fillers for these roles more accurately than generic clusters. null</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Smoothing
</SectionTitle>
    <Paragraph position="0"> Our baseline model is Collins and Brooks (1995) model, which implements the popular and effective backing-off smoothing technique. The idea is to initially use p(a|v,n1,p,n2), but if there isn't enough data to support a maximum likelihood estimate of this distribution, or p(a|v,n1,p,n2) = 0.5, then the algorithm backs off and uses a distribution with less conditioning context. The backing off steps are shown in Figure 1.</Paragraph>
    <Paragraph position="1"> If we use the similarity-based language model shown in (2) as a guide, then we can create a smoothed version of Collins' model using the weighted probability of all similar PPs (for brevity we use c in to indicate the context, in this case an entire PP quadruple):</Paragraph>
    <Paragraph position="3"> In contrast to the language model shown in (2), the set of similar contexts S(c) and similarity function a(c,cprime) must be defined for multiple words (we abuse our notation slightly by using the same a and S for both PPs and words, but the meaning should be clear from the context). Thesauruses only supply neighbours and similarity scores for single words, but we can generate distributionally similar PPs by replacing each word in the phrase independently with a similar one provided by the thesaurus. For example, if eat has two neighbours: S(eat) = {drink,enjoy}, and pizza has just one: S(pizza) = {pasta}, then the following examples will be generated for eat pizza with fork: eat pasta with fork drink pizza with fork drink pasta with fork enjoy pizza with fork enjoy pasta with fork Clearly this strategy of generates some nonsensical or at least unhelpful examples. This is not necessarily a serious problem since such instances should occur at best infrequently in the training data. Unfortunately our base-line model will back off and attempt to provide a reasonable probability for them all, for example by using p(a|with) in place of p(a|enjoy,pasta,with,fork).</Paragraph>
    <Paragraph position="4"> This introduces unwanted noise into the smoothed probability estimate.</Paragraph>
    <Paragraph position="5"> Our solution is to apply smoothing to the counts used by the probability model. The smoothed frequency of a prepositional phrase fs(a,c) is the weighted average frequency of the set of similar PPs S(c):</Paragraph>
    <Paragraph position="7"> These smoothed frequencies are used to calculate the conditional probabilities for the model. For example, the probability distribution in step one is defined as:</Paragraph>
    <Paragraph position="9"> Distributionally similar triples are generated for step two using the same word replacement strategy and smoothed frequency estimates for triples are calculated in the same way as quadruples. We back off to a smaller amount of context if the smoothed denominator is less than 1. This is done for empirical reasons, since decisions based on very low frequency counts are unreliable. The distributions used in steps three and four are not smoothed. Attempting to disambiguate a PP based on just two words is risky enough; introducing similar PPs found by replacing these two words with synonyms introduces too much noise.</Paragraph>
    <Paragraph position="10"> Quadruples and triples are more reliable since the context rules out those unhelpful PPs. For example, our model automatically deals with polysemous words without the need for explicit word sense disambiguation. Although thesauruses do conflate multiple senses in their neighbour lists, implausible senses result in infrequent PPs. The similarity set for the PP open plant in Korea might contain open tree in Korea but the latter's frequency is likely to be zero. Generating triples is riskier since there is less context to rule out unlikely PPs: the triple tree in Korea is more plausible and possibly misleading. But our model does have a natural preference for the most frequent sense in the thesaurus training corpus, which is a useful heuristic for word sense disambiguation (Pedersen and Bruce, 1997). For example, if the thesaurus is trained on business text then factory will be ranked higher than tree when the thesaurus trained on a business corpus (this issue is discussed further in Section 5.2).</Paragraph>
    <Paragraph position="11"> Finally, to complete our PP attachment scheme we need to define a similarity function between PPs, expressed fully as aparenleftbig(v,n1,p,n2),(vprime,nprime1,pprime,nprime2)parenrightbig. The raw materials we have to work with are the similarity scores for matching pairs of verbs and nouns as given by the thesaurus. We do not smooth preposition counts. In this paper we compare three similarity measures: * average: The average similarity score of all word pairs in the PP using the similarity measure provided by the thesaurus. For example, a(c,cprime)</Paragraph>
    <Paragraph position="13"> The similarity score of identical words is assumed to be 1.</Paragraph>
    <Paragraph position="14"> * rank: The rank score of the nth neighbour wprime of a word w is defined as:</Paragraph>
    <Paragraph position="16"> where 0 [?] b [?] 1. The rank similarity scores for the pizza example above when b = 0.1 are rs(eat,enjoy) = 0.2 and rs(pizza,pasta) = 0.1.</Paragraph>
    <Paragraph position="17"> The combined score for a PP is found by summing the rank score for each word pair and subtracting this total from one:</Paragraph>
    <Paragraph position="19"> We impose a floor of zero on this score. Continuing with the pizza example, the rank similarity score between (eat,pizza,with,fork) and (enjoy,pasta,with,fork) is a(c,cprime) = 1 [?] 0.2 [?] 0.1 = 0.7. Note that the similarity score provided by the thesaurus is used to determine the ranking but it otherwise not used.</Paragraph>
    <Paragraph position="20"> * single best: Instead of smoothing using several similar contexts, we can set a(c,cprime) = 1 for the closest context for which f(cprime) &gt; 0 and ignore all others, thereby just replacing an unknown feature with a similar known one. This simplified form of smoothing may be appropriate for non-statistical models or situations where relative frequency estimates are hard to incorporate.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Thesauruses
</SectionTitle>
    <Paragraph position="0"> As noted above, a thesaurus is a resource that groups together words that are distributionally similar. Although we refer to such resources using the singular, a thesaurus has several parts for different word categories such as nouns, verbs and adjectives.</Paragraph>
    <Paragraph position="1"> We compare three thesauruses on this task. The first two are large-scale generic thesauruses, both constructed using the similarity metric described in (Lin, 1998b), but based on different corpora. The first, which we call Lin, is derived from 300 million words of newswire text and is available on the Internet1. The second, which we call WASPS, forms part of the WASPS lexicographical workbench developed at Brighton University 2 and is derived from the 100 million word BNC. The co-occurrence relations for both are a variety of grammatical relations such as direct object, subject and modifier. WASPS also includes prepositional phrase relations but without attempting to disambiguate them. All possible attachments are included under the assumption that correct attachments will tend to have higher frequency (Adam Kilgarriff, p.c.).</Paragraph>
    <Paragraph position="2"> These thesauruses are designed to find words that are similar in a very general sense, and are often compared against hand-crafted semantic resources such as Word-Net. However for the PP attachment task semantic similarity may be less important. We are more interested in how words behave in particular syntactic roles. For example, eat and bake are rather loosely related semantically but will be close neighbours in PP terms if they both often occur with prepositional phrase contexts such as pizza with anchovies.</Paragraph>
    <Paragraph position="3"> The third thesaurus is designed to supply such specialised, task-specific neighbours. It consists of three sub-thesauruses, one for the each of the v,n1 and n2 words in the PP (a preposition thesaurus was also constructed with plausible-looking neighbours but was found not to be useful in practice). The co-occurrence relations used in each case consist of all possible subsets of the three remaining words together with the attachment decision. For example, given eat pizza with fork the following co-occurrences will be included in the thesaurus training corpus:</Paragraph>
    <Paragraph position="5"> The training corpus is created from 3.3 million prepositional phrases extracted from the British National Corpus. These PPs are identified semi-automatically using a version of the weighted GR extraction scheme described in (Carroll and Briscoe, 2001). The raw text is parsed and any PPs that occur in a large percentage of the highly ranked candidate parses are considered reliable and added to the thesaurus training corpus. Mostly these are unambiguous (v,p,n1) or (n1,p,n2) triples from phrases such as we met in January. The dataset is rather noisy due to tagging and parsing errors, so we discarded any co-occurrence relations occurring fewer than 100 times.</Paragraph>
    <Paragraph position="6"> We use the similarity metric described in Weeds (2003). This is a parameterised measure that can be adjusted to suit different tasks, but to ensure compatibility with the two generic thesauruses we chose parameter settings that mimic Lin's measure.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML