<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2004"> <Title>Improving Name Discrimination: A Language Salad Approach</Title> <Section position="3" start_page="0" end_page="26" type="intro"> <SectionTitle> 2 Discrimination by Clustering Contexts </SectionTitle> <Paragraph position="0"> Our method of name discrimination is described in more detail in (Pedersen et al., 2005), but in general is based on an unsupervised approach to word sense discrimination introduced by (Purandare and Pedersen, 2004), which builds upon earlier work in word sense discrimination, including (Sch&quot;utze, 1998) and (Pedersen and Bruce, 1997).</Paragraph> <Paragraph position="1"> Our method treats each occurrence of an ambiguous name as a context that is to be clustered with other contexts that also include the same name. In this paper, each context consists of about 50 words, where the ambiguous name is generally in the middle of the context. The goal is to cluster similar contexts together, based on the presumption that the occurrences of a name that appear in similar contexts will refer to the same underlying entity. This approach is motivated by both the distributional hypothesis (Harris, 1968) and the strong contextual hypothesis (Miller and Charles, 1991).</Paragraph> <Section position="1" start_page="25" end_page="25" type="sub_section"> <SectionTitle> 2.1 Feature Selection </SectionTitle> <Paragraph position="0"> The contexts to be clustered are represented by lexical features which may be selected from either the contexts being clustered, or from a separate corpus. In this paper we use both approaches. We cluster the contexts based on features identified in those very same contexts, and we also cluster the contexts based on features identified in a separate set of data (in this case English). We explore the use of a mixed feature selection strategy where we identify features both from the data to be clustered and the separate corpus of English text. Thus, our feature selection data may come from one of three sources: the contexts to be clustered (which we will refer to as the evaluation contexts), English contexts which include the same name but are not to be clustered, and the combination of these two (our so-called Language Salad or Mix).</Paragraph> <Paragraph position="1"> The lexical features we employ are bigrams, that is consecutive words that occur together in the corpora from which we are identifying features. In this work we identify bigram features using Point-wise Mutual Information (PMI). This is defined as the log of the ratio of the observed frequency with which the two words occur together in the feature selection data, to the expected number of times the two words would occur together in a corpus if they were independent. This expected value is estimated simply by taking the product of the number of times the two words occur individually, and dividing this by the total number of bigrams in the feature selection data. Thus, larger values of PMI indicate that the observed frequency of the bigram is greater than would be expected if the two words were independent.</Paragraph> <Paragraph position="2"> In these experiments we take the top 500 ranked bigrams that occur five or more times in the feature selection data. 
<Paragraph position="2">In these experiments we take the top 500 ranked bigrams that occur five or more times in the feature selection data. We also exclude any bigram that is made up of one or two stop words, which are high frequency function words specified in a manually created list.</Paragraph>
<Paragraph position="3">Note that with smaller numbers of contexts (usually 200 or fewer), we lower the frequency threshold to two or more.</Paragraph>
<Paragraph position="4">In general PMI is known to have a bias towards pairs of words (bigrams) that occur a small number of times and only with each other. In this work that is a desirable quality, since it tends to identify pairs of words that are very strongly associated with each other and also provide unique discriminating information.</Paragraph>
</Section>
<Section position="2" start_page="25" end_page="26" type="sub_section">
<SectionTitle>2.2 Context Representation</SectionTitle>
<Paragraph position="0">Once the bigram features have been identified, the contexts to be clustered are represented using second order co-occurrences derived from those bigrams. In general, a second order co-occurrence is a pair of words that may not occur with each other, but that both occur frequently with a third word. For example, garden and fire may not occur together often, but both commonly occur with hose. Thus, garden hose and fire hose represent first order co-occurrences, and garden and fire represent a second order co-occurrence. The process of creating the second order representation has several steps. First, the bigram features identified by PMI (the top ranked 500 bigrams that occur 5 or more times in the feature selection data) are used to create a word by word co-occurrence matrix. The first word in each bigram represents a row in the matrix, and the second word in each bigram represents a column.</Paragraph>
<Paragraph position="1">The cells in the matrix contain the PMI scores.</Paragraph>
<Paragraph position="2">Note that this matrix is not symmetric, and that many words occur only in a row or only in a column (not both), because they tend to occur as either the first or the second word in a bigram.</Paragraph>
<Paragraph position="3">For example, President might tend to be the first word in a bigram (e.g., President Clinton, President Putin), whereas last names will tend to be the second word.</Paragraph>
<Paragraph position="4">Once the co-occurrence matrix is created, the contexts to be clustered can be represented.</Paragraph>
<Paragraph position="5">Each word in the context is checked to see if it has a corresponding row (i.e., vector) in the co-occurrence matrix. If it does, that word is replaced in the context by the row from the matrix, so that the word is now represented by the vector of words with which it occurred in the feature selection data. If a word does not have a corresponding entry in the co-occurrence matrix, it is simply removed from the context. After all the words in the context have been checked, the selected vectors are averaged together to create a vector representation of the context.</Paragraph>
<Paragraph position="6">These contexts are then clustered into a pre-specified number of clusters using the k-means algorithm. Note that we are currently developing methods to automatically select the number of clusters in the data (e.g., (Pedersen and Kulkarni, 2006)), although we have not yet applied them to this particular work.</Paragraph>
</Section>
</Section>
</Paper>
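A minimal sketch of the second order representation and clustering step described in Section 2.2, assuming the scored bigram list produced by the earlier sketch. The function names, matrix layout helpers, and the use of scikit-learn's KMeans are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_cooccurrence_matrix(scored_bigrams):
    """Build the (asymmetric) word-by-word matrix of PMI scores.

    Rows are the first words of the selected bigrams, columns are the
    second words, and each cell holds that bigram's PMI score.
    """
    row_words = sorted({w1 for (w1, _), _ in scored_bigrams})
    col_words = sorted({w2 for (_, w2), _ in scored_bigrams})
    row_index = {w: i for i, w in enumerate(row_words)}
    col_index = {w: i for i, w in enumerate(col_words)}
    matrix = np.zeros((len(row_words), len(col_words)))
    for (w1, w2), pmi in scored_bigrams:
        matrix[row_index[w1], col_index[w2]] = pmi
    return matrix, row_index

def context_vector(context_tokens, matrix, row_index):
    """Replace each context word that has a row in the matrix by that row,
    drop the remaining words, and average the surviving vectors."""
    vectors = [matrix[row_index[w]] for w in context_tokens if w in row_index]
    if not vectors:
        return np.zeros(matrix.shape[1])
    return np.mean(vectors, axis=0)

def cluster_contexts(contexts, matrix, row_index, k):
    """Cluster the second order context vectors into k clusters with k-means."""
    X = np.vstack([context_vector(c, matrix, row_index) for c in contexts])
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)
```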