<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-4007">
  <Title>Automatic Cluster Stopping with Criterion Functions and the Gap Statistic</Title>
  <Section position="3" start_page="276" end_page="278" type="intro">
    <SectionTitle>
2 Methodology
</SectionTitle>
    <Paragraph position="0"> In word sense or name discrimination, the number of contexts (N) to cluster is usually very large, and considering all possible values of k from 1...N would be inefficient. As the value of k increases, the criterion function will reach a plateau, indicating that dividing the contexts into more and more clusters does not improve the quality of the solution.</Paragraph>
    <Paragraph position="1"> Thus, we identify an upper bound to k that we refer to as deltaK by finding the point at which the criterion function only changes to a small degree as k increases.</Paragraph>
    <Paragraph position="2"> According to the H2 criterion function, the higher its ratio of within-cluster similarity to between-cluster similarity, the better the clustering. A large value indicates that the clusters have high internal similarity, and are clearly separated from each other.</Paragraph>
    <Paragraph position="3"> Intuitively then, one solution to selecting k might be to examine the trend of H2 scores, and look for the smallest k that results in a nearly maximum H2 value.</Paragraph>
    <Paragraph position="4"> However, a graph of H2 values for a clustering of the 2 sense name conflation Sonia Gandhi and Leonid Kuchma as shown in Figure 1 (top) reveals the difficulties of such an approach. There is a gradual curve in this graph and there is no obvious knee point (i.e., sharp increase) that indicates the appro- null the name conflate pair Sonia Gandhi and Leonid Kuchma. The predicted number of senses is 2 for all the measures.</Paragraph>
    <Section position="1" start_page="277" end_page="277" type="sub_section">
      <SectionTitle>
2.1 PK1
</SectionTitle>
      <Paragraph position="0"> The PK1 measure is based on (Mojena, 1977), which finds clustering solutions for all values of k from 1..N, and then determines the mean and standard deviation of the criterion function. Then, a score is computed for each value of k by subtracting the mean from the criterion function, and dividing by the standard deviation. We adapt this technique by using the H2 criterion function, and limit k from</Paragraph>
      <Paragraph position="2"> To select a value of k, a threshold must be set.</Paragraph>
      <Paragraph position="3"> Then, as soon as PK1(k) exceeds this threshold, k-1 is selected as the appropriate number of clusters. Mojena suggests values of 2.75 to 3.50, but also states they would need to be adjusted for different data sets. We have arrived at an empirically determined value of -0.70, which coincides with the point in the standard normal distribution where 75% of the probability mass is associated with values greater than this.</Paragraph>
      <Paragraph position="4"> We observe that the distribution of PK1 scores tends to change with different data sets, making it hard to apply a single threshold. The graph of the PK1 scores shown in Figure 1 illustrates the difficulty : the slope of these scores is nearly linear, and as such any threshold is a somewhat arbitrary cutoff.</Paragraph>
    </Section>
    <Section position="2" start_page="277" end_page="277" type="sub_section">
      <SectionTitle>
2.2 PK2
</SectionTitle>
      <Paragraph position="0"> PK2 is similar to (Hartigan, 1975), in that both take the ratio of a criterion function at k and k-1, in order to assess the relative improvement when increasing the number of clusters.</Paragraph>
      <Paragraph position="2"> When this ratio approaches 1, the clustering has reached a plateau, and increasing k will have no benefit. If PK2 is greater than 1, then we should increase k. We compute the standard deviation of PK2 and use that to establish a boundary as to what it means to be &amp;quot;close enough&amp;quot; to 1 to consider that we have reached a plateau. Thus, PK2 will select k where PK2(k) is the closest to (but not less than) 1 + standard deviation(PK2[1...deltaK]).</Paragraph>
      <Paragraph position="3"> The graph of PK2 in Figure 1 shows an elbow that is near the actual number of senses. The critical region defined by the standard deviation is shaded, and note that PK2 selected the value of k that was outside of (but closest to) that region. This is interpreted as being the last value of k that resulted in a significant improvement in clustering quality. Note that here PK2 predicts 2 senses, which corresponds to the number of underlying entities.</Paragraph>
    </Section>
    <Section position="3" start_page="277" end_page="278" type="sub_section">
      <SectionTitle>
2.3 PK3
</SectionTitle>
      <Paragraph position="0"> PK3 utilizes three k values, in an attempt to find a point at which the criterion function increases and then suddenly decreases. Thus, for a given value of k we compare its criterion function to the preceding and following value of k:</Paragraph>
      <Paragraph position="2"> The form of this measure is identical to that of the Dice Coefficient, although in set theoretic or probabilistic applications Dice tends to be used to compare two variables or sets with each other.</Paragraph>
      <Paragraph position="3"> PK3 is close to 1 if the H2 values form a line, meaning that they are either ascending, or they are on the plateau. However, our use of deltaK eliminates the plateau, so in our case values of 1 show that k is resulting in consistent improvements to clustering quality, and that we should continue. When PK3 rises significantly above 1, we know that k+1 is not climbing as quickly, and we have reached a point where additional clustering may not be helpful. To select k we select the largest value of PK3(k) that is closest to (but still greater than) the critical region defined by the standard deviation of PK3.</Paragraph>
      <Paragraph position="4"> PK3 is similar in spirit to (Salvador and Chan, 2004), which introduces the L measure. This tries to find the point of maximum curvature in the criterion function graph, by fitting a pair of lines to the curve (where the intersection of these lines represents the selected k).</Paragraph>
    </Section>
    <Section position="4" start_page="278" end_page="278" type="sub_section">
      <SectionTitle>
2.4 The Gap Statistic
</SectionTitle>
      <Paragraph position="0"> SenseClusters includes an adaptation of the Gap Statistic (Tibshirani et al., 2001). It is distinct from the measures PK1, PK2, and PK3 since it does not attempt to directly find a knee point in the graph of a criterion function. Rather, it creates a sample of reference data that represents the observed data as if it had no meaningful clusters in it and was simply made up of noise. The criterion function of the reference data is then compared to that of the observed data, in order to identify the value of k in the observed data that is least like noise, and therefore represents the best clustering of the data.</Paragraph>
      <Paragraph position="1"> To do this, it generates a null reference distribution by sampling from a distribution where the marginal totals are fixed to the observed marginal values. Then some number of replicates of the reference distribution are created by sampling from it with replacement, and each of these replicates is clustered just like the observed data (for successive values of k using a given criterion function).</Paragraph>
      <Paragraph position="2"> The criterion function scores for the observed and reference data are compared, and the point at which the distance between them is greatest is taken to provide the appropriate value of k. An example of this is seen in Figure 2. The reference distribution represents the noise in the observed data, so the value of k where the distance between the reference and observed data is greatest represents the most effective clustering of the data.</Paragraph>
      <Paragraph position="3"> Our adaption of the Gap Statistic allows us to use any clustering criterion function to make the comparison of the observed and reference data, whereas the original formulation is based on using the within-cluster dispersion.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>