<?xml version="1.0" standalone="yes"?>
<Paper uid="P93-1024">
  <Title>DISTRIBUTIONAL CLUSTERING OF ENGLISH WORDS</Title>
  <Section position="3" start_page="0" end_page="183" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Methods for automatically classifying words according to their contexts of use have both scientific and practical interest. The scientific questions arise in connection to distributional views of linguistic (particularly lexical) structure and also in relation to the question of lexical acquisition both from psychological and computational learning perspectives. From the practical point of view, word classification addresses questions of data sparseness and generalization in statistical language models, particularly models for deciding among alternative analyses proposed by a grammar. null It is well known that a simple tabulation of frequencies of certain words participating in certain configurations, for example of frequencies of pairs of a transitive main verb and the head noun of its direct object, cannot be reliably used for comparing the likelihoods of different alternative configurations. The problemis that for large enough corpora the number of possible joint events is much larger than the number of event occurrences in the corpus, so many events are seen rarely or never, making their frequency counts unreliable estimates of their probabilities.</Paragraph>
    <Paragraph position="1"> Hindle (1990) proposed dealing with the sparseness problem by estimating the likelihood of unseen events from that of &amp;quot;similar&amp;quot; events that have been seen. For instance, one may estimate the likelihood of a particular direct object for a verb from the likelihoods of that direct object for similar verbs. This requires a reasonable definition of verb similarity and a similarity estimation method. In Hindle's proposal, words are similar if we have strong statistical evidence that they tend to participate in the same events. His notion of similarity seems to agree with our intuitions in many cases, but it is not clear how it can be used directly to construct word classes and corresponding models of association.</Paragraph>
    <Paragraph position="2"> Our research addresses some of the same questions and uses similar raw data, but we investigate how to factor word association tendencies into associations of words to certain hidden senses classes and associations between the classes themselves.</Paragraph>
    <Paragraph position="3"> While it may be worth basing such a model on pre-existing sense classes (Resnik, 1992), in the work described here we look at how to derive the classes directly from distributional data. More specifically, we model senses as probabilistic concepts or clusters c with corresponding cluster membership probabilities p(clw ) for each word w. Most other class-based modeling techniques for natural language rely instead on &amp;quot;hard&amp;quot; Boolean classes (Brown et al., 1990). Class construction is then combinatorially very demanding and depends on frequency counts for joint events involving particular words, a potentially unreliable source of information as noted above. Our approach avoids both problems.</Paragraph>
    <Section position="1" start_page="0" end_page="183" type="sub_section">
      <SectionTitle>
Problem Setting
</SectionTitle>
      <Paragraph position="0"> In what follows, we will consider two major word classes, 12 and Af, for the verbs and nouns in our experiments, and a single relation between them, in our experiments the relation between a transitive main verb and the head noun of its direct object. Our raw knowledge about the relation consists of the frequencies f~n of occurrence of particular pairs (v,n) in the required configuration in a training corpus. Some form of text analysis is required to collect such a collection of pairs. The corpus used in our first experiment was derived from newswire text automatically parsed by  Hindle's parser Fidditch (Hindle, 1993). More recently, we have constructed similar tables with the help of a statistical part-of-speech tagger (Church, 1988) and of tools for regular expression pattern matching on tagged corpora (Yarowsky, 1992). We have not yet compared the accuracy and coverage of the two methods, or what systematic biases they might introduce, although we took care to filter out certain systematic errors, for instance the misparsing of the subject of a complement clause as the direct object of a main verb for report verbs like &amp;quot;say&amp;quot;.</Paragraph>
      <Paragraph position="1"> We will consider here only the problem of classifying nouns according to their distribution as direct objects of verbs; the converse problem is formally similar. More generally, the theoretical basis for our method supports the use of clustering to build models for any n-ary relation in terms of associations between elements in each coordinate and appropriate hidden units (cluster centroids) and associations between thosehidden units.</Paragraph>
      <Paragraph position="2"> For the noun classification problem, the empirical distribution of a noun n is then given by the conditional distribution p,~(v) = f~./ ~v f&amp;quot;~&amp;quot; The problem we study is how to use the Pn to classify the n EAf. Our classification method will construct a set C of clusters and cluster membership probabilities p(c\]n). Each cluster c is associated to a cluster centroid Pc, which is a distribution over l; obtained by averaging appropriately the pn.</Paragraph>
    </Section>
    <Section position="2" start_page="183" end_page="183" type="sub_section">
      <SectionTitle>
Distributional Similarity
</SectionTitle>
      <Paragraph position="0"> To cluster nouns n according to their conditional verb distributions Pn, we need a measure of similarity between distributions. We use for this purpose the relative entropy or Kullback-Leibler (KL) distance between two distributions O(p I\[ q) = ZP(x) log p(x) : q(x) This is a natural choice for a variety of reasons, which we will just sketch here) First of all, D(p I\[ q) is zero just when p = q, and it increases as the probability decreases that p is the relative frequency distribution of a random sample drawn according to q. More formally, the probability mass given by q to the set of all samples of length n with relative frequency distribution p is bounded by exp-nn(p I\] q) (Cover and Thomas, 1991). Therefore, if we are trying to distinguish among hypotheses qi when p is the relative frequency distribution of observations, D(p II ql) gives the relative weight of evidence in favor of qi. Furthermore, a similar relation holds between D(p IIP') for two empirical distributions p and p' and the probability that p and p~ are drawn from the same distribution q. We can thus use the relative entropy between the context distributions for two words to measure how likely they are to be instances of the same cluster centroid.</Paragraph>
      <Paragraph position="1"> aA more formal discussion will appear in our paper Distributional Clustering, in preparation. From an information theoretic perspective D(p \]1 q) measures how inefficient on average it would be to use a code based on q to encode a variable distributed according to p. With respect to our problem, D(pn H Pc) thus gives us the information loss in using cluster centroid Pc instead of the actual distribution pn for word n when modeling the distributional properties of n.</Paragraph>
      <Paragraph position="2"> Finally, relative entropy is a natural measure of similarity between distributions for clustering because its minimization leads to cluster centroids that are a simple weighted average of member distributions. null One technical difficulty is that D(p \[1 p') is not defined when p'(x) = 0 but p(x) &gt; 0. We could sidestep this problem (as we did initially) by smoothing zero frequencies appropriately (Church and Gale, 1991). However, this is not very satisfactory because one of the goals of our work is precisely to avoid the problems of data sparseness by grouping words into classes. It turns out that the problem is avoided by our clustering technique, since it does not need to compute the KL distance between individual word distributions, but only between a word distribution and average distributions, the current cluster centroids, which are guaranteed to be nonzero whenever the word distributions are. This is a useful advantage of our method compared with agglomerative clustering techniques that need to compare individual objects being considered for grouping.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>