<?xml version="1.0" standalone="yes"?>
<Paper uid="J02-3004">
  <Title>The Disambiguation of Nominalizations</Title>
  <Section position="4" start_page="363" end_page="367" type="intro">
    <SectionTitle>
3. Smoothing
</SectionTitle>
    <Paragraph position="0"> Smoothing techniques have been used in a variety of statistical NLP applications as a means of addressing data sparseness, an inherent problem for statistical methods that rely on the relative frequencies of word combinations. The problem arises when the probability of word combinations that do not occur in the training data needs to be estimated. The smoothing methods proposed in the literature (overviews are provided by Dagan, Lee, and Pereira (1999) and Lee (1999)) can be generally divided into three types: discounting (Katz 1987), class-based smoothing (Resnik 1993; Brown et al. 1992;  Computational Linguistics Volume 28, Number 3 Pereira, Tishby, and Lee 1993), and distance-weighted averaging (Grishman and Sterling 1994; Dagan, Lee, and Pereira 1999).</Paragraph>
    <Paragraph position="1"> Discounting methods decrease the probability of previously seen events so that the total probability of observed word co-occurrences is less than one, leaving some probability mass to be redistributed among unseen co-occurrences. Class-based smoothing and distance-weighted averaging both rely on an intuitively simple idea: interword dependencies are modeled by relying on the corpus evidence available for words that are similar to the words of interest. The two approaches differ in the way they measure word similarity. Distance-weighted averaging estimates word similarity from lexical co-occurrence information; namely, it finds similar words by taking into account the linguistic contexts in which they occur: two words are similar if they occur in similar contexts. In class-based smoothing, classes are used as the basis according to which the co-occurrence probability of unseen word combinations is estimated. Classes can be induced directly from the corpus using distributional clustering (Pereira, Tishby, and Lee 1993; Brown et al. 1992; Lee and Pereira 1999) or taken from a manually crafted taxonomy (Resnik 1993). In the latter case the taxonomy is used to provide a mapping from words to conceptual classes.</Paragraph>
    <Paragraph position="2"> Distance-weighted averaging differs from distributional clustering in that it does not explicitly cluster words. Although both methods make use of the evidence of words similar to the words of interest, distributional clustering assigns to each word a probability distribution over clusters to which it may belong; co-occurrence probabilities can then be estimated on the basis of the average of the clusters to which the words in the co-occurrence belong. This means that word co-occurrences are modeled by taking general word clusters into account and that the same set of clusters is used for different co-occurrences. Distance-weighted averaging does not explicitly create general word clusters. Instead, unseen co-occurrences are estimated by averaging the set of co-occurrences most similar to the target unseen co-occurrence, and a different set of similar neighbors (i.e., distributionally similar words) is used for different co-occurrences.</Paragraph>
    <Paragraph position="3"> In language modeling, smoothing techniques are typically evaluated by showing that a language model that uses smoothed estimates incurs a reduction in perplexity on test data over a model that does not employ smoothed estimates (Katz 1987).</Paragraph>
    <Paragraph position="4"> Dagan, Lee, and Pereira (1999) use perplexity to compare back-off smoothing against distance-weighted averaging methods within the context of language modeling for speech recognition and show that the latter outperform the former. They also compare different distance-weighted averaging methods on a pseudoword disambiguation task in which the language model decides which of two verbs v  and v  is more likely to take a noun n as its object. The method being tested must reconstruct which of the</Paragraph>
    <Paragraph position="6"> Lee and Pereira (1999) in a detailed comparison between distributional clustering and distance-weighted averaging that demonstrates that the two methods yield comparable results.</Paragraph>
    <Paragraph position="7"> In our experiments we re-created co-occurrence frequencies for unseen verb-subject and verb-object pairs using three maximally different approaches: back-off smoothing, class-based smoothing using a predefined taxonomy, and distance-weighted averaging. We preferred taxonomic class-based methods over distributional clustering mainly because we wanted to compare directly methods that use distributional information inherent in the corpus without making external assumptions with regard to how concepts and their similarity are represented with methods that quantify similarity relationships based on information present in a hand-crafted taxonomy. Furthermore, as Lee and Pereira's (1999) results indicate that distributional clustering  Lapata The Disambiguation of Nominalizations and distance-weighted averaging obtain similar levels of performance, we restricted ourselves to the latter.</Paragraph>
    <Paragraph position="8"> We evaluated the contribution of the different smoothing methods on the nominalization task by exploring how each method and their combination influences disambiguation performance. Sections 3.1-3.3 review discounting, class-based smoothing, and distance-weighted averaging. Section 4 introduces an algorithm that uses smoothed verb-argument tuples to arrive at the interpretation of nominalizations.</Paragraph>
    <Section position="1" start_page="365" end_page="365" type="sub_section">
      <SectionTitle>
3.1 Back-Off Smoothing
</SectionTitle>
      <Paragraph position="0"> Back-off n-gram models were initially proposed by Katz (1987) for speech recognition but have also been successfully used to disambiguate the attachment site of structurally ambiguous prepositional phrases (Collins and Brooks 1995). The main idea behind back-off smoothing is to adjust maximum likelihood estimates like (8) so that the total probability of observed word co-occurrences is less than one, leaving some probability mass to be redistributed among unseen co-occurrences. In general the frequency of observed word sequences is discounted using the Good-Turing estimate (see Katz (1987) and Church and Gale (1991) for details on Good-Turing estimation), and the probability of unseen sequences is estimated by using lower-level conditional distributions. Assuming that the numerator f(v</Paragraph>
    </Section>
    <Section position="2" start_page="365" end_page="366" type="sub_section">
      <SectionTitle>
3.2 Class-Based Smoothing
</SectionTitle>
      <Paragraph position="0"> Generally speaking, taxonomic class-based smoothing re-creates co-occurrence frequencies based on information provided by lexical resources such as WordNet (Miller et al. 1990) or Roget's publicly available thesaurus. In the case of verb-argument tuples, we use taxonomic information to estimate the frequencies f(v  occurring in an argument position the concept with which it is represented in the taxonomy (Resnik 1993). So f(v</Paragraph>
      <Paragraph position="2"> ) can be estimated by counting the number of times the concept corresponding to n  was observed as the argument of the verb v n  in the corpus.</Paragraph>
      <Paragraph position="3"> This would be a straightforward task if each word was always represented in the taxonomy by a single concept or if we had a corpus of verb-argument tuples labeled explicitly with taxonomic information. Lacking such a corpus we need to take into consideration the fact that words in a taxonomy may belong to more than one conceptual class: counts of verb-argument configurations are reconstructed for each conceptual class by dividing the contribution from the argument by the number of classes to which it belongs (Resnik 1993; Lauer 1995):  Consider, for example, the tuple register group (derived from the compound group registration), which is not attested in the BNC. The word group has two senses in WordNet and belongs to five conceptual classes (&lt;abstraction&gt; , &lt;entity&gt; , &lt;object&gt; , &lt;set&gt; , and &lt;substance&gt; ). This means that the frequency f(v n  , rel, c) will be constructed for each of the five classes, as shown in Table 4. Suppose now that we see the tuple register patient in the corpus. The word patient has two senses in WordNet and belongs to seven conceptual classes (&lt;case&gt; , &lt;person&gt; , &lt;life form&gt; , &lt;entity&gt; , &lt;causal agent&gt; , &lt;sick person&gt; , &lt;unfortunate&gt; ), one of which is &lt;entity&gt; . This means that we will increment the observed co-occurrence count of register and &lt;entity&gt; by  ) (see equations (13) and (14)) crucially relies on the simplifying assumption that the argument of a verb is distributed evenly across its conceptual classes. This simplification is necessary unless we have a corpus of verb-argument pairs labeled explicitly with taxonomic information. The task of finding the right class for representing the argument of a given predicate is a research issue on its own (Clark and Weir 2001; Li and  Abe 1998; Carroll and McCarthy 2000), and a detailed comparison between different methods for accomplishing this task is beyond the scope of the present study.</Paragraph>
    </Section>
    <Section position="3" start_page="366" end_page="367" type="sub_section">
      <SectionTitle>
3.3 Distance-Weighted Averaging
</SectionTitle>
      <Paragraph position="0"> Distance-weighted averaging induces classes of similar words from word co-occurrences without making reference to a taxonomy. Instead, it is based on the assumption that if a word w  can provide information about the frequency of unseen word pairs involving w  (Dagan, Lee, and Pereira 1999). A key feature of this type of smoothing is the function that measures distributional similarity from co-occurrence frequencies.</Paragraph>
      <Paragraph position="1"> Several measures of distributional similarity have been proposed in the literature (Dagan, Lee, and Pereira 1999; Lee 1999). We used two measures, the Jensen-Shannon divergence and the confusion probability. The choice of these two measures was motivated by work described in Dagan, Lee, and Pereira (1999), in which the Jensen-Shannon divergence outperforms related similarity measures (such as the confusion probability or the L  norm) on a pseudodisambiguation task that uses verb-object pairs. The confusion probability has been used by several authors to smooth word co- null Lapata The Disambiguation of Nominalizations occurrence probabilities (Essen and Steinbiss 1992; Grishman and Sterling 1994) and shown to give promising performance. Grishman and Sterling (1994) in particular employ the confusion probability to re-create the frequencies of verb-noun co-occurrences in which the noun is the object or the subject of the verb in question. In the following we describe these two similarity measures and show how they can be used to re-create the frequencies for unseen verb-argument tuples (for a more detailed description see Dagan, Lee, and Pereira (1999)).</Paragraph>
      <Paragraph position="2">  occurs. A large confusion probability value indicates that the two words w  as context and smooth over the verb w  . We opted for the latter for two reasons. Theoretically speaking, it is the verb that imposes the semantic restrictions on its arguments and not vice versa. The idea that semantically similar verbs have similar subcategorizational and selectional patterns is by no means new and has been extensively argued for by Levin (1993). Computational efficiency considerations also favor an approach that treats rel, w  as context: the nouns w  outnumber the verbs w  by a factor of four (see Table 1). When verb-argument tuples are taken into consideration, (8) can be rewritten as follows:  The confusion probability can be computed efficiently, since it involves summation only over the common contexts rel, w  .</Paragraph>
      <Paragraph position="3">  tion-theoretic measure. It recasts the concept of distributional similarity into a measure of the &amp;quot;distance&amp;quot; between two probability distributions. The value of the Jensen-Shannon divergence ranges from zero for identical distributions to log 2 for maximally different distributions. J is defined as:  Disambiguation algorithm for nominalizations.</Paragraph>
      <Paragraph position="4"> Similarly to the confusion probability, the computation of J depends only on the common contexts rel, w  . Recall that the Jensen-Shannon divergence is a dissimilarity measure. The dissimilarity measure is transformed into a similarity measure using a</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>