<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-3008">
  <Title>SenseClusters - Finding Clusters that Represent Word Senses</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Feature Selection
</SectionTitle>
    <Paragraph position="0"> SenseClusters distinguishes among the different contexts in which a target word occurs based on a set of features that are identified from raw corpora. SenseClusters uses the Ngram Statistics Package (Banerjee and Pedersen, 2003), which is able to extract surface lexical features from large corpora using frequency cutoffs and various measures of association, including the log-likelihood ratio, Pearson's Chi-Squared test, Fisher's Exact test, the Dice Coefficient, Pointwise Mutual Information, etc.</Paragraph>
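    <Paragraph> To make the notion of an association score concrete, the following Python sketch computes three of the measures named above from the counts of a 2x2 contingency table for a word pair. It is only an illustration and is not the Ngram Statistics Package's own code; the function names and the count arguments (n11, n1p, np1, npp) are assumptions made for this example.

```python
import math

def pmi(n11: int, n1p: int, np1: int, npp: int) -> float:
    """Pointwise Mutual Information (base 2) of a word pair.

    n11: times the pair occurs; n1p: times word 1 occurs first in any pair;
    np1: times word 2 occurs second in any pair; npp: total number of pairs.
    """
    expected = n1p * np1 / npp  # count expected if the two words were independent
    return math.log2(n11 / expected)

def dice(n11: int, n1p: int, np1: int) -> float:
    """Dice Coefficient: twice the joint count over the sum of the marginals."""
    return 2 * n11 / (n1p + np1)

def log_likelihood(n11: int, n1p: int, np1: int, npp: int) -> float:
    """Log-likelihood ratio G^2 = 2 * sum(observed * ln(observed / expected))."""
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    expected = [
        n1p * np1 / npp,
        n1p * (npp - np1) / npp,
        (npp - n1p) * np1 / npp,
        (npp - n1p) * (npp - np1) / npp,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
```

    A pair would then be kept as a feature when its score exceeds a user-chosen threshold; the threshold itself is not specified here.</Paragraph>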
    <Paragraph position="1"> SenseClusters currently supports the use of unigram, bigram, and co-occurrence features. Unigrams are individual words that occur above a certain frequency cutoff.</Paragraph>
    <Paragraph position="2"> These can be effective discriminating features if they are shared by a minimum of 2 contexts, but not shared by all contexts. Very common non-content words are excluded by providing a stop-list.</Paragraph>
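    <Paragraph> The unigram criteria just described can be sketched in a few lines of Python. This is a hedged illustration rather than the SenseClusters implementation: the function name select_unigrams, its parameters, and the representation of contexts as lists of tokens are all assumptions.

```python
from collections import Counter

def select_unigrams(contexts, stoplist, min_freq=2):
    """Return unigram features from tokenized contexts.

    A word is kept if it is not in the stoplist, occurs at least min_freq
    times overall, appears in at least 2 contexts, and does not appear in
    every context.
    """
    total = Counter(w for ctx in contexts for w in ctx)
    # Count each word once per context to get the number of contexts sharing it.
    context_freq = Counter(w for ctx in contexts for w in set(ctx))
    return sorted(
        w for w in total
        if w not in stoplist
        and total[w] >= min_freq
        and len(contexts) > context_freq[w] >= 2
    )

contexts = [
    ["the", "bank", "river"],
    ["the", "bank", "money"],
    ["the", "money", "loan"],
]
features = select_unigrams(contexts, stoplist={"the"})  # ['bank', 'money']
```

    In this toy example "river" and "loan" are dropped because each occurs in a single context, and "the" is excluded by the stoplist.</Paragraph>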
    <Paragraph position="3"> Bigrams are pairs of words that occur above a given frequency cutoff and that have a statistically significant score on a test of association. There may optionally be intervening words between them that are ignored. Co-occurrences are bigrams that include the target word. In effect co-occurrences localize the scope of the unigram features by selecting only those words that occur within some number of positions from the target word.</Paragraph>
    <Paragraph position="4"> SenseClusters allows for the selection of lexical features either from a held out corpus of training data, or from the same data that is to be clustered, which we refer to as the test data. Selecting features from separate training data is particularly useful when the amount of the test data to be clustered is too small to identify interesting features.</Paragraph>
    <Paragraph position="5"> The following is a summary of some of the options provided by SenseClusters that make it possible for a user to customize feature selection to their needs:  -training FILE A held out file of training data to be used to select features. Otherwise, features will be selected from the data to be clustered.</Paragraph>
    <Paragraph position="6"> -token FILE A file containing Perl regular expressions that define the tokenization scheme.</Paragraph>
    <Paragraph position="7"> -stop FILE A file containing a user provided stoplist.</Paragraph>
    <Paragraph position="8"> -feature STRING The feature type to be selected.</Paragraph>
    <Paragraph position="9">  Valid options include unigrams, bigrams, and co-occurrences. -remove N Ignore features that occur fewer than N times.</Paragraph>
    <Paragraph position="10"> -window M Allow up to M-2 words to intervene between pairs of words when identifying bigram and co-occurrence features.</Paragraph>
    <Paragraph position="11"> -stat STRING The statistical test of association to identify bigram and co-occurrence features. Valid values include any of the tests supported by the Ngram Statistics Package.</Paragraph>
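    <Paragraph> The effect of the window option on bigram and co-occurrence features can be illustrated with the following Python sketch. It assumes that a window of size M pairs each word with the next M-1 words to its right, so that at most M-2 words intervene; the function names are hypothetical and the code is not taken from SenseClusters or the Ngram Statistics Package.

```python
from collections import Counter

def window_bigrams(tokens, window=2):
    """Count ordered word pairs with at most window-2 intervening words."""
    counts = Counter()
    for i, first in enumerate(tokens):
        # The second word lies 1 to window-1 positions to the right of the first.
        for second in tokens[i + 1 : i + window]:
            counts[(first, second)] += 1
    return counts

def cooccurrences(tokens, target, window=2):
    """Keep only the windowed pairs that include the target word."""
    pairs = window_bigrams(tokens, window)
    return Counter({pair: n for pair, n in pairs.items() if target in pair})

tokens = "the river bank was muddy".split()
adjacent = window_bigrams(tokens, window=2)
near_bank = cooccurrences(tokens, "bank", window=3)
```

    With a window of 2 only adjacent pairs such as (river, bank) are counted, while a window of 3 also admits (the, bank) and (bank, muddy). A frequency cutoff or a test of association would then be applied to these counts.</Paragraph>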
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Context Representation
</SectionTitle>
    <Paragraph position="0"> Once features are selected, SenseClusters creates a vector for each test instance to be discriminated where each selected feature is represented by an entry/index. Each vector shows if the feature represented by the corresponding index occurs or not in the context of the instance (binary vectors), or how often the feature occurs in the context (frequency vectors). This is referred to as a first order context vector, since this representation directly indicates which features make up the contexts. Here we are following (Pedersen and Bruce, 1997), who likewise took this approach to feature representation.</Paragraph>
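    <Paragraph> A first order context vector of this kind can be sketched as follows. The function name and arguments are assumptions made for illustration; a context is taken to be a list of tokens and the selected features an ordered list of words.

```python
def first_order_vector(context, features, binary=True):
    """Build a first order context vector over an ordered feature list.

    With binary=True each entry records whether the feature occurs in the
    context; otherwise it records how often the feature occurs.
    """
    if binary:
        present = set(context)
        return [1 if f in present else 0 for f in features]
    return [context.count(f) for f in features]

features = ["bank", "money", "river"]
context = ["money", "bank", "money"]
binary_vec = first_order_vector(context, features)                # [1, 1, 0]
freq_vec = first_order_vector(context, features, binary=False)    # [1, 2, 0]
```
    </Paragraph>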
    <Paragraph position="1"> (Schütze, 1998) utilized second order context vectors that represent the context of a target word to be discriminated by taking the average of the first order vectors associated with the unigrams that occur in that context. In SenseClusters we have extended this idea such that these first order vectors can also be based on co-occurrence or bigram features from the training corpus.</Paragraph>
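    <Paragraph> The averaging step behind a second order context vector can be sketched as below. This is a minimal illustration under stated assumptions: word_vectors is a hypothetical mapping from a word to its first order vector, and words without an entry are simply skipped.

```python
def second_order_vector(context, word_vectors):
    """Average the first order vectors of the context words that have one.

    word_vectors maps a word to its feature vector (a list of numbers).
    Returns an empty list when no word in the context has a vector.
    """
    rows = [word_vectors[w] for w in context if w in word_vectors]
    if not rows:
        return []
    # zip(*rows) walks the vectors dimension by dimension.
    return [sum(column) / len(rows) for column in zip(*rows)]

word_vectors = {"money": [2, 0], "loan": [0, 4]}
averaged = second_order_vector(["money", "loan", "the"], word_vectors)  # [1.0, 2.0]
```
    </Paragraph>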
    <Paragraph position="2"> Both the first and second order context vectors represent the given instances as vectors in a high dimensional word space. This approach suffers from two limitations.</Paragraph>
    <Paragraph position="3"> First, there may be synonyms represented by separate dimensions in the space. Second, and conversely, a single dimension in the space might be polysemous and associated with several different underlying concepts. To combat these problems, SenseClusters follows the lead of LSI (Deerwester et al., 1990) and LSA (Landauer et al., 1998) and allows for the conversion of word level feature spaces into a concept level semantic space by carrying out dimensionality reduction with Singular Value Decomposition (SVD). In particular, the package SVDPACK (Berry et al., 1993) is integrated into SenseClusters to allow for fast and efficient SVD.</Paragraph>
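    <Paragraph> For reference, the standard truncated SVD that underlies this kind of dimensionality reduction can be written as follows. The notation is ours and is not taken from SenseClusters or SVDPACK: A is the context-by-feature matrix, k is the number of dimensions retained, and the rows of the reduced matrix are the k-dimensional representations of the contexts.

```latex
\[
A = U \Sigma V^{\top}, \qquad
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{A} = A V_k = U_k \Sigma_k
\]
```

    Here U_k and V_k hold the first k left and right singular vectors and Sigma_k the k largest singular values, so that A_k is the best rank-k approximation of A in the least-squares sense.</Paragraph>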
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Clustering
</SectionTitle>
    <Paragraph position="0"> Clustering can be carried out using either a first or second order vector representation of instances. SenseClusters provides a seamless interface to CLUTO, a Clustering Toolkit (Karypis, 2002), which implements a range of clustering techniques suitable for both representations, including repeated bisections, direct, nearest neighbor, agglomerative, and biased agglomerative.</Paragraph>
    <Paragraph position="1"> The first or second order vector representations of contexts can be directly clustered using vector space methods provided in CLUTO. As an alternative, each context vector can be represented as a point in similarity space such that the distance between it and any other context vector reflects the pairwise similarity of the underlying instances.</Paragraph>
    <Paragraph position="2"> SenseClusters provides support for a number of similarity measures, such as simple matching, the cosine, the Jaccard coefficient, and the Dice coefficient. A similarity matrix created by determining all pairwise measures of similarity between contexts can be used as an input to CLUTO's clustering algorithms, or to SenseClusters' own agglomerative clustering implementation.</Paragraph>
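    <Paragraph> The similarity measures named above can be sketched for binary context vectors as follows. These are the textbook definitions and may differ in detail from what SenseClusters computes; in particular, simple matching is taken here to be the number of shared features, which is an assumption. All function names are illustrative.

```python
import math

def simple_matching(a, b):
    """Number of features present in both binary vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return simple_matching(a, b) / norm if norm else 0.0

def jaccard(a, b):
    """Shared features divided by features present in either vector."""
    shared = simple_matching(a, b)
    union = sum(a) + sum(b) - shared
    return shared / union if union else 0.0

def dice(a, b):
    """Twice the shared features divided by the total feature count."""
    total = sum(a) + sum(b)
    return 2 * simple_matching(a, b) / total if total else 0.0

def similarity_matrix(vectors, measure):
    """All pairwise similarities between context vectors."""
    return [[measure(u, v) for v in vectors] for u in vectors]

a, b = [1, 1, 0, 1], [1, 0, 0, 1]
matrix = similarity_matrix([a, b], jaccard)
```

    The resulting square matrix is the kind of input that a similarity-space clustering method would consume.</Paragraph>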
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> SenseClusters produces clusters of instances where each cluster refers to a particular sense of the given target word. SenseClusters supports evaluation of these clusters in two ways. First, SenseClusters provides external evaluation techniques that require knowledge of correct senses or clusters of the given instances. Second, there are internal evaluation methods provided by CLUTO that report the intra-cluster and inter-cluster similarity.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 External Evaluation
</SectionTitle>
      <Paragraph position="0"> When a gold standard clustering of the instances is available, SenseClusters builds a confusion matrix that shows the distribution of the gold standard senses across the discovered clusters. A gold standard typically takes the form of sense-tagged text, where each sense tag can be considered to represent a different cluster that could be discovered.</Paragraph>
      <Paragraph position="1"> In Figure 1, the rows C0 - C9 represent ten discovered clusters while the columns represent six gold-standard senses. The value of cell (i,j) shows the number of instances in the i th discovered cluster that actually belong to the gold standard sense represented by the j th column.</Paragraph>
      <Paragraph position="2"> Note that the bottom row represents the true distribution of the instances across the senses, while the right hand column shows the distribution of the discovered clusters. To carry out evaluation of the discovered clusters, SenseClusters finds the mapping of gold standard senses to discovered clusters that would result in maximally accurate discrimination. The problem of assigning senses to clusters becomes one of re-ordering the columns of the confusion matrix to maximize the diagonal sum. Thus, each possible re-ordering shows one assignment scheme and the sum of the diagonal entries indicates the total number of instances in the discovered clusters that would be in their correct sense given that alignment. This corresponds to several well known problems, among them the Assignment Problem in Operations Research and finding the maximal matching of a bipartite graph.</Paragraph>
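      <Paragraph> The mapping step can be illustrated with an exhaustive search over assignments of senses to distinct clusters. This is a sketch for small matrices only and is not necessarily how SenseClusters solves the problem; the function name best_mapping and the nested-list representation of the confusion matrix are assumptions.

```python
from itertools import permutations

def best_mapping(confusion):
    """Exhaustively find the sense-to-cluster assignment with the most hits.

    confusion[i][j] is the number of instances in cluster i with gold sense j.
    Assumes there are at least as many clusters as senses; each sense is
    mapped to a distinct cluster. Returns (correct, mapping), where
    mapping[j] is the cluster assigned to sense j.
    """
    n_clusters, n_senses = len(confusion), len(confusion[0])
    best = (-1, ())
    for clusters in permutations(range(n_clusters), n_senses):
        # Sum of the diagonal after re-ordering rows according to this mapping.
        correct = sum(confusion[c][s] for s, c in enumerate(clusters))
        if correct > best[0]:
            best = (correct, clusters)
    return best

confusion = [
    [1, 9],   # cluster C0
    [8, 2],   # cluster C1
    [3, 3],   # cluster C2, left unassigned in the best mapping
]
correct, mapping = best_mapping(confusion)  # (17, (1, 0))
```

      The number of candidate assignments grows factorially, so for larger matrices a polynomial-time method for the Assignment Problem, such as the Hungarian algorithm, would be the usual choice.</Paragraph>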
      <Paragraph position="3"> Figure 2 shows that cluster C1 maps most closely to sense S3, while discovered cluster C2 corresponds best to sense S5, and so forth. The clusters marked with * are not assigned to any sense. The accuracy of discrimination is simply the sum of the diagonal entries of the row/column re-ordered confusion matrix divided by the total number of instances clustered (435/1659 = 26%). Precision can also be computed by dividing the total number of correctly discriminated instances by the number of instances in the six clusters mapped to gold standard senses (435/1060 = 41%).</Paragraph>
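      <Paragraph> The two figures quoted above follow directly from the counts, as this small Python sketch shows. The function names are illustrative; the numbers are those given in the text.

```python
def accuracy(correct, total_instances):
    """Correctly discriminated instances over all clustered instances."""
    return correct / total_instances

def precision(correct, instances_in_mapped_clusters):
    """Correctly discriminated instances over instances in mapped clusters."""
    return correct / instances_in_mapped_clusters

acc = accuracy(435, 1659)      # about 0.262
prec = precision(435, 1060)    # about 0.410
summary = f"accuracy {acc:.0%}, precision {prec:.0%}"  # 'accuracy 26%, precision 41%'
```
      </Paragraph>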
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Internal Evaluation
</SectionTitle>
      <Paragraph position="0"> When gold-standard sense tags of the test instances are not available, SenseClusters relies on CLUTO's internal evaluation metrics to report the intra-cluster and inter-cluster similarity. There is also a graphical component to CLUTO known as gCLUTO that provides a visualization tool. An example of gCLUTO's output is provided in Figure 3, which displays a mountain view of the clusters shown in Figures 1 and 2.</Paragraph>
      <Paragraph position="1"> This particular visualization illustrates the case when the gold-standard data has fewer senses (6) than the number of clusters requested (10). CLUTO and SenseClusters both require that the desired number of clusters be specified prior to clustering. In this example we requested 10, and the mountain view reveals that there were really only 5 to 7 distinct senses. In unsupervised word sense discrimination, the user will usually not know the actual number of senses ahead of time. One possible solution to this problem is to request an arbitrarily large number of clusters and rely on such visualizations to discover the true number of senses. In future work, we plan to support mechanisms that automatically determine the optimal number of clusters/senses to be found.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Summary of Unique Features
</SectionTitle>
    <Paragraph position="0"> The following are some of the distinguishing characteristics of SenseClusters.</Paragraph>
    <Paragraph position="1"> Feature Types SenseClusters supports the flexible selection of a variety of lexical features, including unigrams, bigrams, and co-occurrences. These are selected by the Ngram Statistics Package using statistical tests of association or frequency cutoffs.</Paragraph>
    <Paragraph position="2"> Context Representations SenseClusters supports two different representations of context, first order context vectors as used by (Pedersen and Bruce, 1997) and second order context vectors as suggested by (Schütze, 1998). The former is a direct representation of the instances to be clustered in terms of their features, while the latter is an indirect representation that averages the first order vector representations of the features that make up the context.</Paragraph>
    <Paragraph position="3"> Clustering SenseClusters seamlessly integrates CLUTO, a clustering package that provides a wide range of clustering algorithms and criteria functions. CLUTO also provides evaluation functions that report the inter-cluster and intra-cluster similarity, the most discriminating features characterizing each cluster, a dendrogram tree view, and a 3D mountain view of clusters. SenseClusters also provides a native implementation of single link, complete link, and average link clustering.</Paragraph>
    <Paragraph position="4"> Evaluation SenseClusters supports the evaluation of discovered clusters relative to an existing gold standard. If sense-tagged text is available, this can be immediately used as such a gold standard. This evaluation reports precision and recall relative to the gold standard.</Paragraph>
    <Paragraph position="5"> LSA Support SenseClusters provides all of the functionality needed to carry out Latent Semantic Analysis. LSA converts a word level feature space into a concept level semantic space that smooths over differences due to polysemy and synonymy among words.</Paragraph>
    <Paragraph position="6"> Efficiency SenseClusters is optimized to deal with a large amount of data both in terms of the number of text instances being clustered and the number of features used to represent the contexts.</Paragraph>
    <Paragraph position="7"> Integration SenseClusters transparently incorporates several specialized tools, including CLUTO, the Ngram Statistics Package, and SVDPACK. This provides a wide range of options and high efficiency at various steps, such as feature selection, feature space dimensionality reduction, clustering, and evaluation.</Paragraph>
    <Paragraph position="8"> Availability SenseClusters is an open source software project that is freely distributed under the GNU General Public License (GPL) via http://senseclusters.sourceforge.net/. SenseClusters is an ongoing project, and there are already a number of published papers based on its use (e.g., (Purandare, 2003), (Purandare and Pedersen, 2004)).</Paragraph>
  </Section>
</Paper>