<?xml version="1.0" standalone="yes"?> <Paper uid="N03-3004"> <Title>Discriminating Among Word Senses Using McQuitty's Similarity Analysis</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Previous Work </SectionTitle> <Paragraph position="0"> The work in this paper builds upon two previous approaches to word sense discrimination, those of (Pedersen and Bruce, 1997) and (Schütze, 1998). Pedersen and Bruce developed a method based on agglomerative clustering using McQuitty's Similarity Analysis (McQuitty, 1966), where the context of a target word is represented using localized contextual features such as collocations and part-of-speech tags that occur within one or two positions of the target word. Pedersen and Bruce demonstrated that despite its simplicity, McQuitty's method was more accurate than Ward's Method of Minimum Variance and the EM Algorithm for word sense discrimination. McQuitty's method starts by assuming that each instance is a separate cluster. It then merges the pair of clusters that has the highest average similarity value, and continues until a specified number of clusters is found or until the similarity between every pair of clusters falls below a predefined cutoff. Pedersen and Bruce used a relatively small number of features and employed the matching coefficient as the similarity measure. Since we use a much larger number of features, we are experimenting with the cosine measure, which scales similarity based on the number of non-zero features in each instance.</Paragraph> <Paragraph position="1"> By way of contrast, (Schütze, 1998) performs discrimination through the use of two different kinds of context vectors. The first is a word vector based on co-occurrence counts from a separate training corpus. Each word in this corpus is represented by a vector made up of the words it co-occurs with.
Then, each instance in a test or evaluation corpus is represented by a vector that is the average of the vectors of all the words that make up that instance. The context in which a target word occurs is thereby represented by second-order co-occurrences, that is, words that co-occur with the co-occurrences of the target word. Discrimination is carried out by clustering instance vectors using the EM Algorithm.</Paragraph> <Paragraph position="2"> The approach described in this paper proceeds as follows. Surface lexical features are identified in a training corpus made up of instances consisting of a sentence containing a given target word, plus one or two sentences to its left or right. Similarly defined instances in the test data are converted into vectors based on this feature set, and a similarity matrix is constructed using either the matching coefficient or the cosine measure. McQuitty's Similarity Analysis is then used to group instances according to the similarity of their contexts, and the resulting clusters are evaluated against a manually created gold standard.</Paragraph> </Section> </Paper>
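The clustering procedure described above (each instance starts as its own cluster; the pair with the highest average similarity is merged; stop at a target number of clusters or a similarity cutoff) can be sketched as follows. This is a minimal illustration, not Pedersen and Bruce's implementation: the binary feature-set representation, the function names, and the definition of the matching coefficient as the count of shared features are assumptions, and ties are broken arbitrarily.

```python
from itertools import combinations
from math import sqrt

def matching(a, b):
    # Matching coefficient, here taken as the number of features active
    # in both instances (one common definition; the paper does not spell
    # this out).
    return len(a & b)

def cosine(a, b):
    # Cosine over binary feature sets: overlap scaled by the number of
    # non-zero features in each instance.
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def mcquitty(instances, sim=cosine, k=1, cutoff=0.0):
    # Agglomerative clustering with McQuitty's (WPGMA) update rule:
    # the similarity of a merged cluster to any other cluster is the
    # simple average of its two parts' similarities to that cluster.
    clusters = {i: [i] for i in range(len(instances))}
    pairsim = {(i, j): sim(instances[i], instances[j])
               for i, j in combinations(clusters, 2)}
    next_id = len(instances)
    while len(clusters) > k and pairsim:
        (a, b), best = max(pairsim.items(), key=lambda kv: kv[1])
        if best < cutoff:
            break  # no remaining pair exceeds the similarity cutoff
        merged = clusters.pop(a) + clusters.pop(b)
        newsims = {c: (pairsim[tuple(sorted((a, c)))]
                       + pairsim[tuple(sorted((b, c)))]) / 2.0
                   for c in clusters}
        pairsim = {p: s for p, s in pairsim.items()
                   if a not in p and b not in p}
        clusters[next_id] = merged
        for c, s in newsims.items():
            pairsim[tuple(sorted((next_id, c)))] = s
        next_id += 1
    return list(clusters.values())
```

With four toy instances represented as sets of localized features (e.g. part-of-speech tags and collocations), `mcquitty(instances, sim=cosine, k=2)` groups the instances whose feature sets overlap.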
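Schütze's second-order representation can likewise be sketched: first-order word vectors are built from co-occurrence counts, and an instance is represented by the average of the vectors of its words. The sentence-level co-occurrence window, the function names, and the toy corpus below are assumptions for illustration, not Schütze's actual setup.

```python
def word_vectors(corpus, vocab):
    # First-order co-occurrence counts: row w records how often each
    # vocabulary word appears in the same sentence as w (a toy stand-in
    # for the counts Schütze derives from a separate training corpus).
    index = {w: i for i, w in enumerate(vocab)}
    M = [[0] * len(vocab) for _ in vocab]
    for sentence in corpus:
        for w in sentence:
            for c in sentence:
                if w != c:
                    M[index[w]][index[c]] += 1
    return index, M

def instance_vector(instance, index, M):
    # Second-order context vector: the average of the first-order
    # vectors of the words occurring in the instance.
    rows = [M[index[w]] for w in instance if w in index]
    if not rows:
        return [0.0] * len(M)
    return [sum(col) / len(rows) for col in zip(*rows)]
```

Instance vectors built this way capture second-order co-occurrence: two instances can be similar even if they share no words, so long as their words co-occur with the same words. These vectors would then be clustered, e.g. with the EM Algorithm.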