<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2406">
  <Title>Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 First Order Context Vectors
</SectionTitle>
    <Paragraph position="0"> First order context vectors directly indicate which features make up a context. In all of our experiments, the context of the target word is limited to 20 surrounding content words on either side. This is true both when we are selecting features from a set of training data and when we are converting test instances into vectors for clustering. The particular features we are interested in are bigrams and co-occurrences.</Paragraph>
    <Paragraph position="1"> Co-occurrences are words that occur within five positions of the target word (i.e., up to three intervening words are allowed). Bigrams are ordered pairs of words that co-occur within five positions of each other. Thus, co-occurrences are unordered word pairs that include the target word, whereas bigrams are ordered pairs that may or may not include the target. Both the co-occurrences and the bigrams must occur in at least two instances in the training data, and the two words must have a log-likelihood ratio in excess of 3.841, the critical value of the chi-squared distribution with one degree of freedom at p = 0.05. This has the effect of removing co-occurrences and bigrams whose two words cannot be shown to be associated at the 95% confidence level.</Paragraph>
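    <Paragraph> The selection test above can be sketched as follows. This is a minimal illustration rather than the code used in our experiments: the function names, the 2x2 contingency-table layout and the argument `instance_count` are assumptions introduced for the example.

```python
import math

def log_likelihood_ratio(n11, n12, n21, n22):
    """G^2 statistic for a 2x2 contingency table of word-pair counts.

    n11: both words together; n12: first word without the second;
    n21: second word without the first; n22: neither word.
    Assumes the table is not empty (total count above zero).
    """
    total = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    observed = ((n11, n12), (n21, n22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            if observed[i][j] > 0:  # a zero cell contributes nothing
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2.0 * g2

def keep_feature(n11, n12, n21, n22, instance_count, threshold=3.841):
    # Mirrors the two cut-offs in the text: at least two training
    # instances and a log-likelihood ratio above 3.841.
    enough_instances = instance_count >= 2
    return enough_instances and log_likelihood_ratio(n11, n12, n21, n22) > threshold
```

A pair whose counts match what independence predicts scores 0 and is dropped, while a pair that always occurs together scores well above the threshold.</Paragraph>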
    <Paragraph position="2"> After selecting a set of co-occurrences or bigrams from a corpus of training data, a first order context representation is created for each test instance. This shows how many times each feature occurs in the context of the target word (i.e., within 20 positions from the target word) in that instance.</Paragraph>
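    <Paragraph> As an illustration, a first order context vector for single-word (co-occurrence) features might be built as below. The function name, the toy sentence and the feature list are hypothetical, and the sketch assumes the instance has already been reduced to a list of content words; bigram features would need pair matching instead of a simple count.

```python
def first_order_vector(tokens, target_index, features, window=20):
    """Count how often each selected feature word occurs near the target."""
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    # Context is everything inside the window except the target itself.
    context = tokens[start:target_index] + tokens[target_index + 1:end]
    return [context.count(feature) for feature in features]

# Hypothetical instance of the target word "line".
tokens = "he cast the fishing line into the water and the line held".split()
features = ["fishing", "water", "phone"]
vector = first_order_vector(tokens, tokens.index("line"), features)
```

Each position of `vector` corresponds to one selected feature, so all instances of a target word share the same dimensions.</Paragraph>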
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Second Order Context Vectors
</SectionTitle>
    <Paragraph position="0"> A test instance can be represented by a second order context vector by finding the average of the first order context vectors that are associated with the words that occur near the target word. Thus, the second order context representation relies on the first order context vectors of feature words. The second order experiments in this paper use two different types of features, co-occurrences and bigrams, defined as they are in the first order experiments.</Paragraph>
    <Paragraph position="1"> Each co-occurrence identified in training data is assigned a unique index and occupies the corresponding row/column in a word co-occurrence matrix. This is constructed from the co-occurrence pairs, and is a symmetric adjacency matrix whose cell values show the log-likelihood ratio for the pair of words representing the corresponding row and column. Each row of the co-occurrence matrix can be seen as a first order context vector of the word represented by that row. The set of words forming the rows/columns of the co-occurrence matrix are treated as the feature words.</Paragraph>
    <Paragraph position="2"> Bigram features lead to a bigram matrix such that for each selected bigram WORDi&lt;&gt;WORDj, WORDi represents a single row, say the i-th row, and WORDj represents a single column, say the j-th column, of the bigram matrix. Then the value of cell (i,j) indicates the log-likelihood ratio of the words in the bigram WORDi&lt;&gt;WORDj. Each row of the bigram matrix can be seen as a bigram vector that shows the scores of all bigrams in which the word represented by that row occurs as the first word. Thus, the words representing the rows of the bigram matrix make the feature set while the words representing the columns form the dimensions of the feature space.</Paragraph>
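    <Paragraph> The averaging step that turns matrix rows into a second order context vector can be sketched as follows. The matrix rows and scores here are invented for illustration, and the function name is an assumption; the same code applies whether the rows come from the co-occurrence matrix or the bigram matrix.

```python
def second_order_vector(context_words, word_vectors):
    """Average the first order vectors of the feature words found in a context."""
    rows = [word_vectors[w] for w in context_words if w in word_vectors]
    if not rows:
        return None  # no feature word occurs in this context
    dims = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dims)]

# Hypothetical rows of a matrix holding log-likelihood scores.
word_vectors = {
    "fishing": [0.0, 6.0, 2.0],
    "water": [6.0, 0.0, 4.0],
}
vector = second_order_vector(["fishing", "the", "water"], word_vectors)
```

Words that are not feature words (here "the") have no row in the matrix and are simply skipped.</Paragraph>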
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Clustering
</SectionTitle>
    <Paragraph position="0"> The objective of clustering is to take a set of instances represented as either a similarity matrix or context vectors and cluster together instances that are more like each other than they are to the instances that belong to other clusters.</Paragraph>
    <Paragraph position="1"> Clustering algorithms are classified into three main categories: hierarchical, partitional, and hybrid methods that incorporate ideas from both. The algorithm acts as a search strategy that dictates how to proceed through the instances. The actual choice of which clusters to split or merge is decided by a criteria function. This section describes the clustering algorithms and criteria functions that have been employed in our experiments.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Hierarchical
</SectionTitle>
      <Paragraph position="0"> Hierarchical algorithms are either agglomerative or divisive. They both proceed iteratively, and merge or divide clusters at each step. Agglomerative algorithms start with each instance in a separate cluster and merge a pair of clusters at each iteration until there is only a single cluster remaining. Divisive methods start with all instances in the same cluster and split one cluster into two during each iteration until all instances are in their own cluster.</Paragraph>
      <Paragraph position="1"> The most widely known criteria functions used with hierarchical agglomerative algorithms are single link, complete link, and average link, also known as UPGMA.</Paragraph>
      <Paragraph position="2"> (Schütze, 1998) points out that single link clustering tends to place all instances into a single elongated cluster, whereas (Pedersen and Bruce, 1997) and (Purandare, 2003) show that hierarchical agglomerative clustering using average link (via McQuitty's method) fares well.</Paragraph>
      <Paragraph position="3"> Thus, we have chosen to use average link/UPGMA as our criteria function for the agglomerative experiments.</Paragraph>
      <Paragraph position="4"> In similarity space, each instance can be viewed as a node in a weighted graph. The weights on edges joining two nodes indicate their pairwise similarity as measured by the cosine between the context vectors that represent the pair of instances.</Paragraph>
      <Paragraph position="5"> When agglomerative clustering starts, each node is in its own cluster and is considered to be the centroid of that cluster. At each iteration, average link selects the pair of clusters whose centroids are most similar and merges them into a single cluster. For example, suppose the clusters I and J are to be merged into a single cluster IJ. The weights on all other edges that connect existing nodes to the new node IJ must now be revised. Suppose that Q is such a node. The new weight in the graph is computed by averaging the weight on the edge between nodes I and Q and that on the edge between J and Q. In other words: $W(IJ, Q) = \frac{W(I, Q) + W(J, Q)}{2}$, where $W(X, Y)$ denotes the weight on the edge between nodes X and Y. In vector space, average link starts by assigning each vector to its own cluster. The centroid of each cluster is found by calculating the average of all the context vectors that make up the cluster. At each iteration, average link selects the pair of clusters whose centroids are closest with respect to their cosines. The selected pair of clusters is merged and a centroid is computed for this newly created cluster.</Paragraph>
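      <Paragraph> One merge step in similarity space, including the revision of edge weights for the new node IJ, might look like the sketch below. The dictionary-of-edges representation, the string node labels and the function name are assumptions made for the example, and the graph is assumed to be fully connected.

```python
def merge_most_similar(sim):
    """One agglomerative step on a similarity graph.

    sim maps frozenset({a, b}) to the edge weight between clusters a and b.
    Returns the merged cluster label and the revised weights.
    """
    pair = max(sim, key=sim.get)  # most similar pair of clusters
    i, j = sorted(pair)
    merged = i + j
    others = {node for edge in sim for node in edge} - {i, j}
    # Edges that touch neither I nor J are carried over unchanged.
    revised = {edge: w for edge, w in sim.items() if i not in edge and j not in edge}
    for q in others:
        # New weight is the mean of the old I-Q and J-Q weights.
        old_sum = sim[frozenset((i, q))] + sim[frozenset((j, q))]
        revised[frozenset((merged, q))] = old_sum / 2
    return merged, revised

sim = {
    frozenset(("I", "J")): 0.9,
    frozenset(("I", "Q")): 0.4,
    frozenset(("J", "Q")): 0.2,
}
merged, revised = merge_most_similar(sim)
```

Here I and J are merged first because their edge carries the highest weight, and the edge from IJ to Q receives the mean of 0.4 and 0.2.</Paragraph>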
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Partitional
</SectionTitle>
      <Paragraph position="0"> Partitional algorithms divide an entire set of instances into a predetermined number of clusters (K) without going through a series of pairwise comparisons. As such these methods are somewhat faster than hierarchical algorithms. For example, the well known K-means algorithm is partitional. In vector space, each instance is represented by a context vector. K-means initially selects K random vectors to serve as centroids of these initial K clusters. It then assigns every other vector to one of the K clusters whose centroid is closest to that vector. After all vectors are assigned, it recomputes the cluster centroids by averaging all of the vectors assigned to that cluster. This repeats until convergence, that is until no vector changes its cluster across iterations and the centroids stabilize.</Paragraph>
      <Paragraph position="1"> In similarity space, each instance can be viewed as a node of a fully connected weighted graph whose edges indicate the similarity between the instances they connect.</Paragraph>
      <Paragraph position="2"> K-means will first select K random nodes that represent the centroids of the initial K clusters. It will then assign every other node I to one of the K clusters such that the edge joining I and the centroid of that cluster has maximum weight among the edges joining I to all centroids.</Paragraph>
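      <Paragraph> The vector space variant of K-means described above can be sketched as follows. This is an illustrative version only: the helper names, the fixed random seed and the iteration cap are assumptions, and closeness is measured by cosine as elsewhere in this paper.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, k, seed=0, max_iter=100):
    """Cluster vectors into k groups by cosine similarity to centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # K random vectors as initial centroids
    labels = None
    for _ in range(max_iter):
        # Assign each vector to the cluster with the most similar centroid.
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:  # no vector changed cluster: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean(members)
    return labels

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = kmeans(vectors, 2)
```

Which cluster receives which index depends on the random initial centroids, so only the grouping of instances, not the label values, is meaningful.</Paragraph>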
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Hybrid Methods
</SectionTitle>
      <Paragraph position="0"> It is generally believed that the quality of clustering by partitional algorithms is inferior to that of the agglomerative methods. However, a recent study (Zhao and Karypis, 2002) has suggested that these conclusions are based on experiments conducted with smaller data sets, and that with larger data sets partitional algorithms are not only faster but lead to better results.</Paragraph>
      <Paragraph position="1"> In particular, Zhao and Karypis recommend a hybrid approach known as Repeated Bisections. This overcomes the main weakness with partitional approaches, which is the instability in clustering solutions due to the choice of the initial random centroids. Repeated Bisections starts with all instances in a single cluster. At each iteration it selects one cluster whose bisection optimizes the chosen criteria function. The cluster is bisected using standard K-means method with K=2, while the criteria function maximizes the similarity between each instance and the centroid of the cluster to which it is assigned. As such this is a hybrid method that combines a hierarchical divisive approach with partitioning.</Paragraph>
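      <Paragraph> A simplified sketch of Repeated Bisections is given below. It is not the implementation used in our experiments: the function names are hypothetical, the criteria function is taken to be the sum of cosines between each instance and its cluster centroid, and every candidate cluster is bisected once per iteration to decide which split to keep.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cohesion(cluster):
    """Sum of cosines between each vector and its cluster centroid."""
    c = centroid(cluster)
    return sum(cosine(v, c) for v in cluster)

def bisect(cluster, rng, max_iter=50):
    """Split one cluster in two with K-means, K=2. Returns None on failure."""
    centers = rng.sample(cluster, 2)
    halves = None
    for _ in range(max_iter):
        new_halves = ([], [])
        for v in cluster:
            side = 0 if cosine(v, centers[0]) >= cosine(v, centers[1]) else 1
            new_halves[side].append(v)
        if new_halves == halves or not new_halves[0] or not new_halves[1]:
            break
        halves = new_halves
        centers = [centroid(h) for h in halves]
    return halves

def repeated_bisections(vectors, k, seed=0):
    rng = random.Random(seed)
    clusters = [list(vectors)]  # start with all instances in one cluster
    while k > len(clusters):
        best = None
        for idx, cluster in enumerate(clusters):
            if 2 > len(cluster):
                continue
            halves = bisect(cluster, rng)
            if halves is None:
                continue
            # Improvement in the overall criterion if this split is kept.
            gain = cohesion(halves[0]) + cohesion(halves[1]) - cohesion(cluster)
            if best is None or gain > best[0]:
                best = (gain, idx, halves)
        if best is None:
            break  # nothing left to split
        _, idx, halves = best
        clusters[idx:idx + 1] = [halves[0], halves[1]]
    return clusters

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = repeated_bisections(vectors, 2)
```

Because each bisection is only a two-way K-means run, a poor random start affects a single split rather than the whole solution.</Paragraph>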
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Experimental Data
</SectionTitle>
    <Paragraph position="0"> We use 24 of the 73 words in the SENSEVAL-2 sense-tagged corpus, and the Line, Hard and Serve sense-tagged corpora. Each of these corpora is made up of instances that consist of 2 or 3 sentences that include a single target word that has a manually assigned sense tag.</Paragraph>
    <Paragraph position="1"> However, we ignore the sense tags at all times except during evaluation. At no point do the sense tags enter into the clustering or feature selection processes. To be clear, we do not believe that unsupervised word sense discrimination needs to be carried out relative to a pre-existing set of senses. In fact, one of the great advantages of an unsupervised technique is that it doesn't need manually annotated text. However, here we employ sense-tagged text in order to evaluate the clusters that we discover.</Paragraph>
    <Paragraph position="2"> The SENSEVAL-2 data is already divided into training and test sets, and those splits were retained for these experiments. The SENSEVAL-2 data is relatively small, in that each word has approximately 50-200 training and test instances. The data is particularly challenging for unsupervised algorithms due to the large number of fine grained senses, generally 8 to 12 per word. The small volume of data combined with the large number of possible senses leads to a very small set of examples for most of the senses.</Paragraph>
    <Paragraph position="3"> As a result, prior to clustering we filter the training and test data independently such that any instance that uses a sense that occurs in less than 10% of the available instances for a given word is removed. We then eliminate any words that have less than 90 training instances after filtering. This process leaves us with a set of 24 SENSEVAL-2 words, which includes the 14 nouns, 6 adjectives and 4 verbs that are shown in Table 1.</Paragraph>
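    <Paragraph> The sense filter can be illustrated with the sketch below. The function name and the representation of an instance as an (identifier, sense) pair are assumptions; the sense tags are used here only to prepare the evaluation data, never for clustering.

```python
from collections import Counter

def filter_rare_senses(instances, min_share=0.10):
    """Drop instances whose sense covers less than min_share of the data.

    instances: list of (instance_id, sense) pairs for one target word.
    """
    counts = Counter(sense for _, sense in instances)
    total = len(instances)
    return [(i, s) for i, s in instances if counts[s] / total >= min_share]

# Hypothetical word with one dominant sense and one rare sense.
instances = [(n, "a") for n in range(19)] + [(19, "b")]
kept = filter_rare_senses(instances)
```

In this toy example sense "b" accounts for 5% of the instances and is removed, leaving the 19 instances of sense "a".</Paragraph>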
    <Paragraph position="4"> In creating our evaluation standard, we assume that each instance will be assigned to at most a single cluster. Therefore if an instance has multiple correct senses associated with it, we treat the most frequent of these as the desired tag, and ignore the others as possible correct answers in the test data.</Paragraph>
    <Paragraph position="5"> The Line, Hard and Serve corpora do not have a standard training-test split, so these were randomly divided into 60-40 training-test splits. Due to the large number of training and test instances for these words, we filtered out instances associated with any sense that occurred in less than 5% of the training or test instances.</Paragraph>
    <Paragraph position="6"> We also randomly selected five pairs of words from the SENSEVAL-2 data and mixed their instances together (while retaining the training and test distinction that already existed in the data). After mixing, the data was filtered such that any sense that made up less than 10% in the training or test data of the new mixed sample was removed; this is why the total number of instances for the mixed pairs is not the same as the sum of those for the individual words. These mix-words were created in order to provide data that included both fine grained and coarse grained distinctions.</Paragraph>
    <Paragraph position="7"> Table 1 shows all words that were used in our experiments along with their parts of speech. Thereafter we show the number of training (TRN) and test instances (TST) that remain after filtering, and the number of senses found in the test data (S). We also show the percentage of the majority sense in the test data (MAJ). This is particularly useful, since this is the accuracy that would be attained by a baseline clustering algorithm that puts all test instances into a single cluster.</Paragraph>
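    <Paragraph> The majority sense percentage (MAJ) is simple to compute; a minimal sketch follows, with a hypothetical function name and toy sense labels.

```python
from collections import Counter

def majority_baseline(senses):
    """Share of instances that carry the most frequent sense."""
    return max(Counter(senses).values()) / len(senses)

baseline = majority_baseline(["a", "a", "b", "c"])
```

A clustering that places every test instance in one cluster would be scored at exactly this value.</Paragraph>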
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Evaluation Technique
</SectionTitle>
    <Paragraph position="0"> When we cluster test instances, we specify an upper limit on the number of clusters that can be discovered. In these experiments that value is 7. This reflects the fact that we do not know a-priori the number of possible senses a word will have. This also allows us to verify the hypothesis that a good clustering approach will automatically discover approximately the same number of clusters as senses for that word, and the extra clusters (7-#actual senses) will contain very few instances. As can be seen from column S in Table 1, most of the words have 2 to 4 senses on average. Of the 7 clusters created by an algorithm, we detect the significant clusters by ignoring (throwing out) clusters that contain less than 2% of the total instances.</Paragraph>
    <Paragraph position="1"> The instances in the discarded clusters are counted as unclustered instances and are subtracted from the total number of instances.</Paragraph>
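    <Paragraph> The removal of insignificant clusters can be sketched as follows. The function name and the list-of-labels input are assumptions made for the example.

```python
from collections import Counter

def significant_clusters(labels, min_share=0.02):
    """Return the kept cluster labels and the number of unclustered instances."""
    sizes = Counter(labels)
    total = len(labels)
    kept = {c for c, n in sizes.items() if n / total >= min_share}
    unclustered = sum(n for c, n in sizes.items() if c not in kept)
    return kept, unclustered

# Hypothetical clustering of 100 instances into three clusters.
labels = [0] * 60 + [1] * 39 + [2] * 1
kept, unclustered = significant_clusters(labels)
```

Cluster 2 holds 1% of the instances, so it is discarded and its single instance is counted as unclustered.</Paragraph>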
    <Paragraph position="2"> Our basic strategy for evaluation is to assign available sense tags to the discovered clusters such that the assignment leads to a maximally accurate mapping of senses to clusters. The problem of assigning senses to clusters becomes one of reordering the columns of a confusion matrix that shows how senses and clusters align such that the diagonal sum is maximized. This corresponds to several well known problems, among them the Assignment Problem in Operations Research, or determining the maximal matching of a bipartite graph in Graph Theory.</Paragraph>
    <Paragraph position="3"> During evaluation we assign one sense to at most one cluster, and vice versa. When the number of discovered clusters is the same as the number of senses, then there is a one to one mapping between them. When the number of clusters is greater than the number of actual senses, then some clusters will be left unassigned. And when the number of senses is greater than the number of clusters, some senses will not be assigned to any cluster. The reason for not assigning a single sense to multiple clusters or multiple senses to one cluster is that we are assuming one sense per instance and one sense per cluster.</Paragraph>
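    <Paragraph> With at most 7 clusters, the best one to one mapping can be found by exhaustive search, as in the sketch below. The function name and the row and column orientation of the confusion matrix are assumptions, and exhaustive search is used only because the matrices are small; it is not necessarily how our evaluation was implemented.

```python
from itertools import permutations

def best_mapping(confusion):
    """Map clusters (rows) to senses (columns) to maximise correct instances.

    confusion[c][s] is the number of instances of sense s in cluster c.
    Returns (correct_count, mapping) where mapping[c] is a sense index or None.
    """
    n_clusters = len(confusion)
    n_senses = len(confusion[0])
    size = max(n_clusters, n_senses)
    # Pad to a square matrix with zeros so every row and column can be paired.
    padded = [[confusion[c][s] if n_clusters > c and n_senses > s else 0
               for s in range(size)] for c in range(size)]
    best_score, best_perm = -1, None
    for perm in permutations(range(size)):
        # Sum of the diagonal after reordering the columns by perm.
        score = sum(padded[c][perm[c]] for c in range(size))
        if score > best_score:
            best_score, best_perm = score, perm
    # A cluster paired with a padding column is left unassigned.
    mapping = [best_perm[c] if n_senses > best_perm[c] else None
               for c in range(n_clusters)]
    return best_score, mapping

# Hypothetical confusion matrix: three clusters, two senses.
confusion = [[30, 5], [4, 20], [1, 2]]
correct, mapping = best_mapping(confusion)
```

In this example the third cluster is left unassigned. For larger matrices, an implementation of the Hungarian algorithm such as `scipy.optimize.linear_sum_assignment` could replace the exhaustive search.</Paragraph>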
    <Paragraph position="4"> We measure the precision and recall based on this maximally accurate assignment of sense tags to clusters. Precision is defined as the number of instances that are clustered correctly divided by the number of instances clustered, while recall is the number of instances clustered correctly over the total number of instances. From that we compute the F-measure, which is two times the product of precision and recall, divided by the sum of precision and recall.</Paragraph>
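    <Paragraph> These definitions translate directly into code. In the sketch below the function name and the example counts are hypothetical, and `clustered` and `total` are assumed to be greater than zero.

```python
def f_measure(correct, clustered, total):
    """Precision, recall and F-measure as defined in the text."""
    precision = correct / clustered  # correct over instances clustered
    recall = correct / total         # correct over all instances
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

precision, recall, f = f_measure(correct=50, clustered=80, total=100)
```

Here 20 of the 100 instances fell into discarded clusters, so precision (0.625) is higher than recall (0.5).</Paragraph>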
  </Section>
  <Section position="10" start_page="0" end_page="0" type="metho">
    <SectionTitle>
8 Experimental Results
</SectionTitle>
    <Paragraph position="0"> We present the discrimination results for six configurations of features, context representations and clustering algorithms. These were run on each of the 27 target words, and also on the five mixed words. What follows is a concise description of each configuration.</Paragraph>
    <Paragraph position="1"> PB1 : First order context vectors, using co-occurrence features, are clustered in similarity space using the UPGMA technique.</Paragraph>
    <Paragraph position="2"> PB2 : Same as PB1, except that the first order context vectors are clustered in vector space using Repeated Bisections.</Paragraph>
    <Paragraph position="3"> PB3: Same as PB1, except the first order context vectors used bigram features instead of co-occurrences. All of the PB experiments use first order context representations that correspond to the approach suggested by Pedersen and Bruce.</Paragraph>
    <Paragraph position="4"> SC1: Second order context vectors of instances were clustered in vector space using the Repeated Bisections technique. The context vectors were created from the word co-occurrence matrix whose dimensions were reduced using SVD.</Paragraph>
    <Paragraph position="5"> SC2: Same as SC1 except that the second order context vectors are converted to a similarity matrix and clustered using the UPGMA method.</Paragraph>
    <Paragraph position="6"> SC3: Same as SC1, except the second order context vectors were created from the bigram matrix.</Paragraph>
    <Paragraph position="7"> All of the SC experiments use second order context vectors and hence follow the approach suggested by Schütze.</Paragraph>
    <Paragraph position="8"> Experiment PB2 clusters the Pedersen and Bruce style (first order) context vectors using the Schütze-like clustering scheme, while SC2 tries to see the effect of using the Pedersen and Bruce style clustering method on Schütze style (second order) context vectors. The motivation behind experiments PB3 and SC3 is to try bigram features in both PB and SC style context vectors.</Paragraph>
    <Paragraph position="9"> The F-measure associated with the discrimination of each word is shown in Table 1. Any score that is significantly greater than the majority sense (according to a paired t-test) is shown in bold face.</Paragraph>
  </Section>
  <Section position="11" start_page="0" end_page="0" type="metho">
    <SectionTitle>
9 Analysis and Discussion
</SectionTitle>
    <Paragraph position="0"> We employ three different types of data in our experiments. The SENSEVAL-2 words have a relatively small number of training and test instances (around 50-200).</Paragraph>
    <Paragraph position="1"> However, the Line, Hard and Serve data is much larger, where each contains around 4200 training and test instances combined. Mixed words are unique because they combine the instances of multiple target words and thereby have a larger number of senses to discriminate. Each type of data brings with it unique characteristics, and sheds light on different aspects of our experiments.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
9.1 Senseval-2 data
</SectionTitle>
      <Paragraph position="0"> Table 2 compares PB1 against PB3, and SC1 against SC3, when these methods are used to discriminate the 24 SENSEVAL-2 words. Our objective is to study the effect of using bigram features against co-occurrences in first (PB) and second (SC) order context vectors while using relatively small amounts of training data per word. Note that PB1 and SC1 use co-occurrence features, while PB3 and SC3 rely on bigram features.</Paragraph>
      <Paragraph position="1"> This table shows the number of nouns (N), adjectives (A) and verbs (V) where bigrams were more effective than co-occurrences (bigram&gt;co-occur), less effective (bigram&lt;co-occur), and had no effect (bigram=co-occur). Table 2 shows that there is no clear advantage to using either bigrams or co-occurrence features in first order context vectors (PB). However, bigram features show clear improvement in the results of second order context vectors (SC).</Paragraph>
      <Paragraph position="2"> Our hypothesis is that first order context vectors (PB) represent a small set of bigram features since they are selected from the relatively small amount of training data available for the SENSEVAL-2 words.</Paragraph>
      <Paragraph position="3"> These features are very sparse, and as such most instances do not share many common features with other instances, making first order clustering difficult.</Paragraph>
      <Paragraph position="4">  However, second order context vectors indirectly represent bigram features, and do not require an exact match between vectors in order to establish similarity. Thus, the poor performance of bigrams in the case of first order context vectors suggests that when dealing with small amounts of data, we need to boost or enrich our bigram feature set by using some other larger training source like a corpus drawn from the Web.</Paragraph>
      <Paragraph position="5"> Table 3 shows the results of using the Repeated Bisections algorithm in vector space against those of using the UPGMA method in similarity space. This table shows the number of SENSEVAL-2 nouns, adjectives and verbs that performed better (rbr&gt;upgma), worse (rbr&lt;upgma), and equal (rbr=upgma) when using Repeated Bisections clustering versus the UPGMA technique, on first (PB) and second (SC) order vectors.</Paragraph>
      <Paragraph position="6"> In short, Table 3 compares PB1 against PB2 and SC1 against SC2. From this, we observe that with both first order and second order context vectors Repeated Bisections is more effective than UPGMA. This suggests that it is better suited to deal with very small amounts of sparse data.</Paragraph>
      <Paragraph position="7"> Table 4 summarizes the overall performance of each of these experiments compared with the majority class. This table shows the number of words for which an experiment performed better than the majority class, broken down by part of speech. Note that SC3 and SC1 are most often better than the majority class, followed closely by PB2 and SC2. This suggests that the second order context vectors (SC) have an advantage over the first order vectors for small training data as is found among the 24 SENSEVAL-2 words.</Paragraph>
      <Paragraph position="8"> We believe that second order methods work better on  smaller amounts of data, in that the feature spaces are quite small, and are not able to support the degree of exact matching of features between instances that first order vectors require. Second order context vectors succeed in such cases because they find indirect second order co-occurrences of feature words and hence describe the context more extensively than the first order representations. With smaller quantities of data, there is less possibility of finding instances that use exactly the same set of words. Semantically related instances use words that are conceptually the same but perhaps not lexically. Second order context vectors are designed to identify such relationships, in that exact matching is not required, but rather words that occur in similar contexts will have similar vectors.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
9.2 Line, Hard and Serve data
</SectionTitle>
      <Paragraph position="0"> The comparatively good performance of PB1 and PB3 in the case of the Line, Hard and Serve data (see Table 1) suggests that first order context vectors when clustered with UPGMA perform relatively well on larger samples of data.</Paragraph>
      <Paragraph position="1"> Moreover, among the SC experiments on this data, the performance of SC2 is relatively high. This further suggests that UPGMA performs much better than Repeated Bisections with larger amounts of training data.</Paragraph>
      <Paragraph position="2"> These observations correspond with the hypothesis drawn from the SENSEVAL-2 results. That is, a large amount of training data will lead to a larger feature space and hence there is a greater chance of matching more features directly in the context of the test instances. Hence, the first order context vectors that rely on the immediate context of the target word succeed as the contexts are more likely to use similar sets of words that in turn are selected from a large feature collection.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
9.3 Mix-Word Results
</SectionTitle>
      <Paragraph position="0"> Nearly all of the experiments carried out with the 6 different methods perform better than the majority sense in the case of the mix-words. This is partially due to the fact that these words have a large number of senses, and therefore have low majority classifiers. In addition, recall that this data is created by mixing instances of distinct target words, which leads to a subset of coarse grained (distinct) senses within the data that are easier to discover than the senses of a single word.</Paragraph>
      <Paragraph position="1"> Table 1 shows that the top 3 experiments for each of the mixed-words are all second order vectors (SC). We believe that this is due to the sparsity of the feature spaces of this data. Since there are so many different senses, the number of first order features that would be required to correctly discriminate them is very high, leading to better results for second order vectors.</Paragraph>
    </Section>
  </Section>
  <Section position="12" start_page="0" end_page="0" type="metho">
    <SectionTitle>
10 Future Directions
</SectionTitle>
    <Paragraph position="0"> We plan to conduct experiments that compare the effect of using very large amounts of training data versus smaller amounts where each instance includes the target word (as is the case in this paper). We will draw our large corpora from a variety of sources, including the British National Corpus, the English GigaWord Corpus, and the Web. Our motivation is that the larger corpora will provide more generic co-occurrence information about words without regard to a particular target word. However, the data specific to a given target word will capture the word usages in the immediate context of the target word. Thus, we will test the hypothesis that a smaller sample of data where each instance includes the target word is more effective for sense discrimination than a more general corpus of training data.</Paragraph>
    <Paragraph position="1"> We are also planning to automatically attach descriptive labels to the discovered clusters that capture the underlying word sense. These labels will be created from the most characteristic features used by the instances belonging to the same cluster. By comparing such descriptive features of each cluster with the words that occur in actual dictionary definitions of the target word, we plan to carry out fully automated word sense disambiguation that does not rely on any manually annotated text.</Paragraph>
  </Section>
</Paper>