<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0322"> <Title>Distinguishing Word Senses in Untagged Text</Title> <Section position="3" start_page="0" end_page="197" type="intro"> <SectionTitle> 2 Agglomerative Clustering </SectionTitle> <Paragraph position="0"> In general, clustering methods rely on the assumption that classes occupy distinct regions in the feature space. The distance between two points in a multi-dimensional space can be measured using any of a wide variety of metrics (see, e.g., Devijver and Kittler (1982)). Observations are grouped in the manner that minimizes the distance between the members of each class.</Paragraph> <Paragraph position="1"> Ward's and McQuitty's methods are agglomerative clustering algorithms that differ primarily in how they compute the distance between clusters. All such algorithms begin by placing each observation in a unique cluster, i.e., a cluster of one. The two closest clusters are merged to form a new cluster that replaces the two merged clusters. Merging of the two closest clusters continues until only a specified number of clusters remains.</Paragraph> <Paragraph position="2"> However, our data does not immediately lend itself to a distance-based interpretation. Our features represent part-of-speech (POS) tags, morphological characteristics, and word co-occurrence; such features are nominal, and their values do not have scale. Given a POS feature, for example, we could choose noun = 1, verb = 2, adjective = 3, and adverb = 4. The fact that adverb is represented by a larger number than noun is purely arbitrary and implies nothing about the relationship between nouns and adverbs.</Paragraph> <Paragraph position="3"> Thus, before we employ either clustering algorithm, we represent our data sample in terms of a dissimilarity matrix. Suppose that we have N observations in a sample, where each observation has q features.
This data is represented in an N x N dissimilarity matrix such that the value in cell (i, j), where i represents the row number and j represents the column number, is equal to the number of features in observations i and j that do not match.</Paragraph> <Paragraph position="6"> For example, in Figure 1 we have four observations. We record the values of three nominal features for each observation. This sample can be represented by the 4 x 4 dissimilarity matrix shown in Figure 2. In the dissimilarity matrix, cells (1, 2) and (2, 1) have the value 2, indicating that the first and second observations in Figure 1 have different values for two of the three features. A value of 0 indicates that observations i and j are identical.</Paragraph> <Paragraph position="7"> When clustering our data, each observation is represented by its corresponding row (or column) in the dissimilarity matrix. Using this representation, observations that fall close together in feature space are likely to belong to the same class and are grouped together into clusters. In this paper, we use Ward's and McQuitty's methods to form clusters of observations, where each observation is represented by a row in a dissimilarity matrix.</Paragraph> <Section position="1" start_page="197" end_page="197" type="sub_section"> <SectionTitle> 2.1 Ward's minimum-variance method </SectionTitle> <Paragraph position="0"> In Ward's method, the internal variance of a cluster is the sum of squared distances between each observation in the cluster and the mean observation for that cluster (i.e., the average of all the observations in the cluster). At each step in Ward's method, a new cluster, CKL, with the smallest possible internal variance, is created by merging the two clusters, CK and CL, that have the minimum variance between them. The variance between CK and CL is computed as follows:</Paragraph> </Section> </Section> </Paper>
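The between-cluster variance formula that the final sentence of Section 2.1 introduces did not survive this extraction. For reference, the textbook form of Ward's merging criterion (this is the standard formulation, not recovered from the paper; here |C_K| denotes the size of cluster C_K and \bar{x}_K its mean observation) is:

\[
D(C_K, C_L) \;=\; \frac{|C_K|\,|C_L|}{|C_K| + |C_L|}\,\bigl\lVert \bar{x}_K - \bar{x}_L \bigr\rVert^2
\]

Merging the pair with the smallest D produces the new cluster C_{KL} with the smallest possible increase in internal variance.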
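The dissimilarity matrix and the merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the four observations and their feature values are invented stand-ins for the paper's Figure 1, and the cluster-distance update is the standard WPGMA formulation of McQuitty's method (the merged cluster's distance to any third cluster is the simple average of the two old distances).

```python
def dissimilarity_matrix(observations):
    """Cell (i, j) holds the number of features on which
    observations i and j disagree (0 means identical)."""
    n = len(observations)
    return [[sum(a != b for a, b in zip(observations[i], observations[j]))
             for j in range(n)]
            for i in range(n)]


def mcquitty_cluster(dist, k):
    """Agglomerate until k clusters remain, using McQuitty's (WPGMA)
    update: the distance from a merged cluster to any other cluster is
    the simple average of the two merged clusters' distances to it."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}        # each observation starts alone
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}

    def get(a, b):                               # distances are stored unordered
        return d[(a, b)] if (a, b) in d else d[(b, a)]

    next_id = n
    while len(clusters) > k:
        # find and merge the two closest current clusters
        a, b = min(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda p: get(*p))
        merged = clusters.pop(a) + clusters.pop(b)
        for c in clusters:
            d[(next_id, c)] = (get(a, c) + get(b, c)) / 2
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values())


# Hypothetical stand-in for the paper's Figure 1: four observations,
# three nominal features each (feature values are invented).
obs = [("noun", "plural",   "yes"),
       ("verb", "plural",   "no"),
       ("noun", "singular", "yes"),
       ("verb", "singular", "no")]
D = dissimilarity_matrix(obs)
print(D[0][1])                 # observations 1 and 2 disagree on 2 features
print(mcquitty_cluster(D, 2))  # groups of row indices
```

Note that the merge loop never touches the raw feature values; exactly as in the paper, each observation enters the clustering only through its row of mismatch counts.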