<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1005">
  <Title>Distributional Similarity Models: Clustering vs. Nearest Neighbors</Title>
  <Section position="4" start_page="34" end_page="35" type="metho">
    <SectionTitle>
2 Two models
</SectionTitle>
    <Paragraph position="0"> We now survey the distributional clustering (section 2.1) and nearest-neighbors averaging (section 2.2) models. Section 2.3 examines the relationships between these two methods.</Paragraph>
    <Section position="1" start_page="34" end_page="35" type="sub_section">
      <SectionTitle>
2.1 Clustering
</SectionTitle>
      <Paragraph position="0"> The distributional clustering model that we evaluate in this paper is a refinement of our earlier model (Pereira et al., 1993). The new model has important theoretical advantages over the earlier one and interesting mathematical properties, which will be discussed elsewhere. Here, we will outline the main motivation for the model, the iterative equations that implement it, and their practical use in clustering.</Paragraph>
      <Paragraph position="1"> The model involves two discreterandom variables N (nouns) and V (verbs) whose joint distribution we have sampled, and a new unobserved discrete random variable C representing probabilistic clusters of elements of N. The role of the hidden variable C is specified by the conditional distribution p(cln), which can be thought of as the probability that n belongs to cluster c. We want to preserve in C as much as possible of the information that N has about V, that is, maximize the mutual information 2 I(V, C). On the other hand, we would also</Paragraph>
      <Paragraph position="3"> the same cluster (dotted ellipse), the two nearest neighbors to A are not the nearest two neighbors to B.</Paragraph>
      <Paragraph position="4"> like to control the degree of compression of C relative to N, that is, the mutual information I(C,N). Furthermore, since C is intended to summarize N in its role as a predictor of V, it should carry no information about V that N does not already have. That is, V should be conditionally independent of C given N, which allows us to write</Paragraph>
      <Paragraph position="6"> The distribution p(VIc ) is the centroid for cluster c.</Paragraph>
      <Paragraph position="7"> It can be shown that I(V, C) is maximized subject to fixed I(C, N) and the above conditional independence assumption when</Paragraph>
      <Paragraph position="9"> where /3 is the Lagrange multiplier associated with fixed I(C, N), Zn is the normalization</Paragraph>
      <Paragraph position="11"> and D is the KuUback-Leiber (KL) divergence, which measures the distance, in an information-theoretic sense, between two distributions q and</Paragraph>
      <Paragraph position="13"> The main behavioral difference between this model and our previous one is the p(c) factor in (2), which tends to sharpen cluster membership distributions. In addition, our earlier experiments used a uniform marginal distribution for the nouns instead of the marginal distribution in the actual data, in order to make clustering more sensitive to informative but relatively rare  nouns. While neither difference leads to major changes in clustering results, we prefer the current model for its better theoretical foundation. For fixed /3, equations (2) and (1) together with Bayes rule and marginalization can be used in a provably convergent iterative reestimation process for p(glc), p(YlC ) and p(C). These distributions form the model for the given/3.</Paragraph>
      <Paragraph position="14"> It is easy to see that for/3 = 0, p(nlc ) does not depend on the cluster distribution p(VIc), so the natural number of clusters (distinct values of C) is one. At the other extreme, for very large /3 the natural number of clusters is the same as the number of nouns. In general, a higher value of/3 corresponds to a larger number of clusters. The natural number of clusters k and the probabilistic model for different values of/3 are estimated as follows. We specify an increasing sequence {/3i} of/3 values (the &amp;quot;annealing&amp;quot; schedule), starting with a very low value/30 and increasing slowly (in our experiments, /30 = 1 and/3i+1 = 1-1/30. Assuming that the natural number of clusters and model for/3i have been computed, we set/3 =/3i+1 and split each cluster into two twins by taking small random perturbations of the original cluster centroids. We then apply the iterative reestimation procedure until convergence. If two twins end up with significantly different centroids, we conclude that they are now separate clusters. Thus, for each i we have a number of clusters ki and a model relating those clusters to the data variables N and V.</Paragraph>
      <Paragraph position="15"> A cluster model can be used to estimate p(vln ) when v and n have not occurred together in training. We consider two heuristic ways of doing this estimation: * all-cluster weighted average:</Paragraph>
      <Paragraph position="17"> where c* maximizes p(c*ln).</Paragraph>
    </Section>
    <Section position="2" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
2.2 Nearest-neighbors averaging
</SectionTitle>
      <Paragraph position="0"> As noted earlier, the nearest-neighbors averaging method is an alternative to clustering for estimating the probabilities of unseen cooccurfences. Given an unseen pair (n, v), we calculate an estimate 15(vln ) as an appropriate average of p(vln I) where n I is distributionally similar to n. Many distributional similarity measures can be considered (Lee, 1999). In this paper, we focus on the one that gave the best results in our earlier work (Dagan et al., 1999), the Jensen-Shannon divergence (Rao, 1982; Lin, 1991). The Jensen-Shannon divergence of two discrete distributions p and q over the same domain is defined as</Paragraph>
      <Paragraph position="2"> It is easy to see that JS(p, q) is always defined.</Paragraph>
      <Paragraph position="3"> In previous work, we used the estimate</Paragraph>
      <Paragraph position="5"> k are tunable parameters, S(n, k) is the set of k nouns with the smallest Jensen-Shannon divergence to n, and an is a normalization term.</Paragraph>
      <Paragraph position="6"> However, in the present work we use the simpler unweighted average</Paragraph>
      <Paragraph position="8"> and examine the effect of the choice of k on modeling performance. By eliminating extra parameters, this restricted formulation allows a more direct comparison of nearest-neighbors averaging to distributional clustering, as discussed in the next section. Furthermore, our earlier experiments showed that an exponentially decreasing weight has much the same effect on performance as a bound on the number of nearest neighbors participating in the estimate.</Paragraph>
    </Section>
    <Section position="3" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
2.3 Discussion
</SectionTitle>
      <Paragraph position="0"> In the previous two sections, we presented two complementary paradigms for incorporating distributional similarity information into cooccurrence probability estimates. Now, one cannot always draw conclusions about the relative fitness of two methods simply from head-to-head performance comparisons; for instance, one method might actually make use of inherently more informative statistics but produce worse results because the authors chose a sub-optimal weighting scheme. In the present case, however, we are working with two models which, while representing opposite extremes in terms of generalization, share enough features to make the comparison meaningful.</Paragraph>
      <Paragraph position="1"> First, both models use linear combinations of cooccurrence probabilities for similar entities. Second, each has a single free parameter k, and the two k's enjoy a natural inverse correspondence: a large number of clusters in the distributional clustering case results in only the closest centroids contributing significantly to the cooccurrence probability estimate, whereas a large number of neighbors in the nearest-neighbors averaging case means that relatively distant words are consulted. And finally, the two distance functions are similar in spirit: both are based on the KL divergence to some type of averaged distribution. We have thus attempted to eliminate functional form, number and type of parameters, and choice of distance function from playing a role in the comparison, increasing our confidence that we are truly comparing paradigms and not implementation details.</Paragraph>
      <Paragraph position="2"> What are the fundamental differences between the two methods? From the foregoing discussion it is clear that distributional clustering is theoretically more satisfying and depends on a single model complexity parameter.</Paragraph>
      <Paragraph position="3"> On the other hand, nearest-neighbors averaging in its most general form offers more flexibility in defining the set of most similar words and their relative weights (Dagan et al., 1999). Also, the training phase requires little computation, as opposed to the iterative re-estimation procedure employed to build the cluster model. But the key difference is the amount of data compression, or equivalently the amount of generalization, produced by the two models. Cluster- null ing yields a far more compact representation of the data when k, the model size parameter, is smaller than INf. As noted above, various authors have conjectured that this data reduction must inevitably result in lower performance in comparison to nearest-neighbor methods, which store the most specific information for each individual word. Our experiments aim to explore this hypothesized generalization-accuracy tradeoff.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>