<?xml version="1.0" standalone="yes"?> <Paper uid="P05-3027"> <Title>SenseClusters: Unsupervised Clustering and Labeling of Similar Contexts</Title>
<Section position="4" start_page="0" end_page="105" type="metho"> <SectionTitle> 2 Clustering Methodology </SectionTitle>
<Paragraph position="0"> We begin with the collection of contexts to be clustered, referred to as the test data. These may all include a given target word, or they may be headless contexts. We can select the lexical features from the test data or from a separate source of data. In either case, the methodology proceeds in exactly the same way.</Paragraph>
<Paragraph position="1"> SenseClusters is based on lexical features, in particular unigrams, bigrams, co-occurrences, and target co-occurrences. Unigrams are single words that occur more than five times; bigrams are ordered pairs of words that may have intervening words between them; co-occurrences are simply unordered bigrams. Target co-occurrences are those co-occurrences that include the given target word. We select bigrams and co-occurrences that occur more than five times and that have a log-likelihood ratio greater than 3.841, which signifies a 95% level of certainty that the two words are not independent. We do not allow unigrams to be stop words, and we eliminate any bigram or co-occurrence feature that includes one or more stop words.</Paragraph>
<Paragraph position="2"> Previous work in word sense discrimination has shown that contexts of an ambiguous word can be effectively represented using first order (Pedersen and Bruce, 1997) or second order (Schütze, 1998) representations. SenseClusters provides extensive support for both, and allows them to be applied to a wider range of problems.</Paragraph>
<Paragraph position="3"> In the first order case, we create a context (rows) by lexical features (columns) matrix, where the features may be any of the types mentioned above. The cell values in this matrix record the frequency of each feature in the context represented by a given row. Since most lexical features occur only a small number of times (if at all) in each context, the resulting matrix tends to be very sparse and nearly binary. Each row in this matrix forms a vector that represents a context. We can optionally use Singular Value Decomposition (SVD) to reduce the dimensionality of this matrix. SVD has the effect of compressing a sparse matrix by combining redundant columns and eliminating noisy ones. This allows the rows to be represented with a smaller number of hopefully more informative columns.</Paragraph>
<Paragraph position="4"> In the second order context representation we start by creating a word by word co-occurrence matrix, where each row represents the first word and each column represents the second word of the bigram or co-occurrence features previously identified. If the features are bigrams the matrix is asymmetric, whereas for co-occurrences it is symmetric and the rows and columns do not imply any ordering. In either case, the cell values indicate how often the two words occur together, or contain their log-likelihood scores of association. This matrix is large and sparse, since most words do not co-occur with each other. We may optionally apply SVD to this co-occurrence matrix to reduce its dimensionality. Each row of this matrix is a vector that represents the word associated with that row via its co-occurrence characteristics. We create a second order representation of a context by replacing each word in that context with its associated vector, and then averaging all of these word vectors. This results in a single vector that represents the overall context.</Paragraph>
<Paragraph position="5"> For contexts with target words we can restrict the number of words around the target word that are averaged when creating the context vector. In our name discrimination experiments we limit this scope to five words on either side of the target word, based on the intuition that words nearer to the target word are more closely related to it than those farther away.</Paragraph>
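The bigram and co-occurrence selection criterion above (frequency above five, log-likelihood ratio above 3.841) can be made concrete with a short sketch. The following Python is illustrative only, not the SenseClusters implementation; the function names and the 2x2 contingency-table inputs are our own assumptions.

```python
import math

def g2(n11, n1p, np1, npp):
    """Log-likelihood ratio (G^2) for a word pair, computed from its
    2x2 contingency table: n11 = count of the pair, n1p = count of
    word 1 in the first position, np1 = count of word 2 in the second
    position, npp = total number of pairs in the corpus."""
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    expected = [n1p * np1 / npp,
                n1p * (npp - np1) / npp,
                (npp - n1p) * np1 / npp,
                (npp - n1p) * (npp - np1) / npp]
    # Zero cells contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

def keep(n11, n1p, np1, npp):
    """Retain a pair only if it is frequent enough and G^2 exceeds the
    chi-squared critical value at p = 0.05 with 1 degree of freedom."""
    return n11 > 5 and g2(n11, n1p, np1, npp) > 3.841
```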
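To make the two context representations concrete, here is a minimal NumPy sketch of both: a first order context-by-feature frequency matrix with an optional SVD step, and second order context vectors obtained by averaging word co-occurrence vectors within a window of the target. All identifiers (`build_first_order`, `word_vectors`, and so on) are hypothetical, and the first order features are restricted to unigrams for brevity; this is a sketch of the technique, not the SenseClusters code.

```python
from collections import Counter
import numpy as np

def build_first_order(contexts, features):
    """Context (rows) by lexical feature (columns) frequency matrix.
    `contexts` is a list of token lists; `features` a list of unigrams."""
    col = {f: j for j, f in enumerate(features)}
    m = np.zeros((len(contexts), len(features)))
    for i, words in enumerate(contexts):
        for w, n in Counter(words).items():
            if w in col:
                m[i, col[w]] = n  # mostly zeros: sparse, nearly binary
    return m

def svd_reduce(m, k):
    """Optional SVD step: keep the k strongest latent dimensions
    (k must not exceed the smaller dimension of m)."""
    u, s, _ = np.linalg.svd(m, full_matrices=False)
    return u[:, :k] * s[:k]

def second_order_vector(words, word_vectors, target=None, window=5):
    """Average the co-occurrence vectors of the words in one context,
    optionally restricted to `window` words on either side of `target`.
    `word_vectors` maps a word to its row of the co-occurrence matrix."""
    if target is not None and target in words:
        t = words.index(target)
        words = words[max(0, t - window):t + window + 1]
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    dim = len(next(iter(word_vectors.values())))
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```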
<Paragraph position="6"> The goal of the second order context representation is to capture indirect relationships between words. For example, if the word Dictionary occurs with Words but not with Meanings, while Words occurs with Meanings, then Dictionary and Meanings are second order co-occurrences via the first order co-occurrence of Words.</Paragraph>
<Paragraph position="7"> In either the first or second order case, once each context is represented as a vector we proceed with clustering. We employ the hybrid clustering method known as Repeated Bisections, which offers nearly the quality of agglomerative clustering at the speed of partitional clustering.</Paragraph>
</Section>
<Section position="5" start_page="105" end_page="106" type="metho"> <SectionTitle> 3 Labeling Methodology </SectionTitle>
<Paragraph position="0"> For each discovered cluster we create a descriptive label and a discriminating label, each of which is made up of some number of bigram features. These are identified by treating the contexts in each cluster as a separate corpus, and applying our bigram feature selection methods, as described previously, to each of them.</Paragraph>
<Paragraph position="1"> Descriptive labels are the top N bigrams according to the log-likelihood ratio. Our goal is that these labels will provide clues as to the general nature of the contents of a cluster. The discriminating labels are any descriptive labels for a cluster that are not descriptive labels of another cluster. Thus, the discriminating labels may capture the content that separates one cluster from another and provide a more detailed level of information.</Paragraph>
</Section>
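The relationship between the two label types reduces to a set difference over the per-cluster bigram rankings. Below is a minimal sketch, assuming a hypothetical `top_bigrams` helper that applies the bigram selection method of Section 2 to one cluster's contexts and returns its top N bigrams; none of these names come from SenseClusters itself.

```python
def label_clusters(clusters, top_bigrams, n=10):
    """clusters: mapping of cluster id -> list of contexts.
    Returns, per cluster, its (descriptive, discriminating) label sets."""
    # Descriptive labels: top N bigrams by log-likelihood ratio, found by
    # treating each cluster's contexts as a separate corpus.
    descriptive = {c: set(top_bigrams(ctxs, n))
                   for c, ctxs in clusters.items()}
    labels = {}
    for c in clusters:
        others = set().union(*(descriptive[o] for o in clusters if o != c))
        # Discriminating labels: descriptive bigrams unique to this cluster.
        labels[c] = (descriptive[c], descriptive[c] - others)
    return labels
```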
<Section position="6" start_page="106" end_page="106" type="metho"> <SectionTitle> 4 Experimental Data </SectionTitle>
<Paragraph position="0"> We evaluate these methods on proper name discrimination and email (newsgroup) categorization.</Paragraph>
<Paragraph position="1"> For name discrimination we use the 700 million word New York Times portion of the English Gigaword corpus as the source of contexts. While there are many ambiguous names in this data, it is difficult to evaluate the results of our approach given the absence of a disambiguated version of the text. Thus, we automatically create ambiguous names by conflating the occurrences associated with two or three relatively unambiguous names into a single obfuscated name.</Paragraph>
<Paragraph position="2"> For example, we combine Britney Spears and George Bush into the ambiguous name Britney Bush, and then see how well SenseClusters is able to create clusters that reflect the true underlying identities behind the conflated name.</Paragraph>
<Paragraph position="3"> Our email experiments are based on the 20-NewsGroup Corpus of USENET articles. This is a collection of approximately 20,000 articles taken from 20 different newsgroups. As such they are already classified, but since our methods are unsupervised we ignore this information until it is time to evaluate our approach. We present results that make two-way distinctions between selected pairs of newsgroups.</Paragraph>
</Section> </Paper>