<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1005"> <Title>Distributional Similarity Models: Clustering vs. Nearest Neighbors</Title> <Section position="3" start_page="0" end_page="34" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In many statistical language-processing problems, it is necessary to estimate the joint probability or cooeeurrence probability of events drawn from two prescribed sets. Data sparseness can make such estimates difficult when the events under consideration are sufficiently fine-grained, for instance, when they correspond to occurrences of specific words in given configurations.</Paragraph> <Paragraph position="1"> In particular, in many practical modeling tasks, a substantial fraction of the cooccurrences of interest have never been seen in training data. In most previous work (Jelinek and Mercer, 1980; Katz, 1987; Church and Gale, 1991; Ney and Essen, 1993), this lack of information is addressed by reserving some mass in the probability model for unseen joint events, and then assigning that mass to those events as a function of their marginal frequencies.</Paragraph> <Paragraph position="2"> An intuitively appealing alternative to relying on marginal frequencies alone is to combine estimates of the probabilities of &quot;similar&quot; events. More specifically, a joint event (x, y) would be considered similar to another (x t, y) if the distributions of Y given x and Y given x' (the cooccurrence distributions of x and x ') meet an appropriate definition of distributional similarity.</Paragraph> <Paragraph position="3"> For example, one can infer that the bigram &quot;after ACL-99&quot; is plausible -- even if it has never occurred before -- from the fact that the bigram &quot;after ACL-95&quot; has occurred, if &quot;ACL-99&quot; and &quot;ACL-95&quot; have similar cooccurrence distributions. null For concreteness and experimental evaluation, we focus in this paper on a particular type of cooccurrence, that of a main verb and the head noun of its direct object in English text.</Paragraph> <Paragraph position="4"> Our main goal is to obtain estimates ~(vln ) of the conditional probability of a main verb v given a direct object head noun n, which can then be used in particular prediction tasks.</Paragraph> <Paragraph position="5"> In previous work, we and our co-authors have proposed two different probability estimation methods that incorporate word similarity information: distributional clustering and nearest-neighbors averaging. Distributional clustering (Pereira et al., 1993) assigns to each word a probability distribution over clusters to which it may belong, and characterizes each cluster by a centroid, which is an average of cooccurrence distributions of words weighted according to cluster membership probabilities. Cooccurrence probabilities can then be derived from either a membership-weighted average of the clusters to which the words in the cooccurrence belong, or just from the highest-probability cluster. null In contrast, nearest-neighbors averaging 1 (Dagan et al., 1999) does not explicitly cluster words. Rather, a given cooccurrence probability is estimated by averaging probabilities for the set of cooccurrences most similar to the target cooccurrence. 
<Paragraph position="7"> We thus see that distributional clustering and nearest-neighbors averaging are complementary approaches. Distributional clustering generally creates a compact representation of the data, namely, the cluster membership probability tables and the cluster centroids. Nearest-neighbors averaging, on the other hand, associates a specific set of similar words with each word and thus typically increases the amount of storage required. In a way, it is clustering taken to the limit -- each word forms its own cluster.</Paragraph>
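To give a rough sense of that storage contrast, here is a back-of-the-envelope comparison; the vocabulary, cluster, and neighbor counts below are hypothetical, not those of the experiments reported later.

```python
# Illustrative sizes only (hypothetical, not the experimental vocabularies).
N = 10_000   # distinct direct-object head nouns
V = 1_000    # distinct main verbs
K = 100      # clusters
k = 50       # nearest neighbors stored per noun

# Distributional clustering: membership table p(c|n) plus centroid table p(v|c).
clustering_cells = N * K + K * V      # 1,000,000 + 100,000 = 1,100,000

# Nearest-neighbors averaging: every noun keeps its own cooccurrence
# distribution plus an explicit list of its k most similar nouns.
nn_cells = N * V + N * k              # 10,000,000 + 500,000 = 10,500,000

print(clustering_cells, nn_cells)
```

Under these made-up numbers the clustered representation is roughly an order of magnitude smaller, which is the memory versus estimation-accuracy tradeoff examined later in the paper.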
<Paragraph position="8"> In previous work, we have shown that both distributional clustering and nearest-neighbors averaging can yield improvements of up to 40% with respect to Katz's (1987) state-of-the-art backoff method in the prediction of unseen cooccurrences. In the case of nearest-neighbors averaging, we have also demonstrated perplexity reductions of 20% and statistically significant improvement in speech recognition error rate. Furthermore, each method has generated some discussion in the literature (Hofmann et al., 1999; Baker and McCallum, 1998; Ide and Véronis, 1998). Given the relative success of these methods and their complementarity, it is natural to wonder how they compare in practice.</Paragraph>
<Paragraph position="9"> Several authors (Schütze, 1993; Dagan et al., 1995; Ide and Véronis, 1998) have suggested that clustering methods, by reducing data to a small set of representatives, might perform less well than nearest-neighbors averaging-type methods. For instance, Dagan et al. (1995, p. 124) argue: This [class-based] approach, which follows long traditions in semantic classification, is very appealing, as it attempts to capture &quot;typical&quot; properties of classes of words. However ... it is not clear that word co-occurrence patterns can be generalized to class co-occurrence parameters without losing too much information.</Paragraph>
<Paragraph position="10"> Furthermore, early work on class-based language models was inconclusive (Brown et al., 1992).</Paragraph>
<Paragraph position="11"> In this paper, we present a detailed comparison of distributional clustering and nearest-neighbors averaging on several large datasets, exploring the tradeoff in similarity-based modeling between memory usage on the one hand and estimation accuracy on the other. We find that the performances of the two methods are in general very similar: with respect to Katz's backoff, they both provide average error reductions of up to 40% on one task and up to 7% on a related, but somewhat more difficult, task. Only in a fairly unrealistic setting did nearest-neighbors averaging clearly beat distributional clustering, but even in this case, both methods were able to achieve average error reductions of at least 18% in comparison to backoff. Therefore, previous claims that clustering methods are necessarily inferior are not strongly supported by the evidence of these experiments, although it is of course possible that the situation may be different for other tasks.</Paragraph>
</Section>
</Paper>