<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2020">
  <Title>Evaluation of Utility of LSA for Word Sense Discrimination</Title>
  <Section position="3" start_page="77" end_page="77" type="metho">
    <SectionTitle>
2 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> The basic idea of the context-group discrimination paradigm adopted in this investigation is to induce the senses of an ambiguous word from the contextual similarity of its occurrences. The occurrences of an ambiguous word, represented by their context vectors, are grouped into clusters, where each cluster consists of contextually similar occurrences. The context vectors in our experiments are LSA-based representations of the documents in which the ambiguous word appears.</Paragraph>
    <Paragraph position="1"> Context vectors from the training portion of the corpus are grouped into clusters, and the centroid of each cluster (the sense vector) is computed. Each ambiguous word in the test portion of the corpus is disambiguated by finding the sense vector (cluster centroid) closest to its context vector. If sense labels are available for the ambiguous words in the corpus, each sense vector is given the label of the majority sense in its cluster, and sense discrimination accuracy can be evaluated as the percentage of ambiguous words in the test portion that are mapped to a sense vector whose label matches the word's own sense label.</Paragraph>
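    <Paragraph> The procedure described above can be illustrated with a minimal sketch. It assumes that cluster assignments for the training vectors are already available, and it uses cosine similarity as the notion of "closest"; the text does not name a specific measure, so that choice is an assumption, as are all function names (build_sense_vectors, discriminate, accuracy).

```python
from collections import Counter, defaultdict
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_sense_vectors(train_vectors, cluster_ids, train_labels):
    """Return one (centroid, majority_label) pair per training cluster."""
    members = defaultdict(list)
    for vec, cid, label in zip(train_vectors, cluster_ids, train_labels):
        members[cid].append((vec, label))
    sense_vectors = []
    for cid in sorted(members):
        vecs = [vec for vec, _ in members[cid]]
        labels = Counter(label for _, label in members[cid])
        majority = labels.most_common(1)[0][0]
        sense_vectors.append((centroid(vecs), majority))
    return sense_vectors


def discriminate(context_vector, sense_vectors):
    """Label of the sense vector most similar to the context vector.

    Assumption: similarity is cosine; the source only says "closest".
    """
    best = max(sense_vectors, key=lambda sv: cosine(context_vector, sv[0]))
    return best[1]


def accuracy(test_vectors, test_labels, sense_vectors):
    """Fraction of test occurrences mapped to a matching sense label."""
    hits = sum(
        discriminate(vec, sense_vectors) == gold
        for vec, gold in zip(test_vectors, test_labels)
    )
    return hits / len(test_labels)
```

In this sketch, accuracy would be called on the test portion only, with sense vectors built solely from the training portion, mirroring the train/test split described in the paragraph.</Paragraph>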
    <Paragraph position="2"> Our goal is to investigate how well the different senses of ambiguous words are separated in the LSA-based vector space. With an ideal representation, the clusters of context vectors would be tight (the vectors in a cluster close to each other and to the cluster centroid) and far away from each other, and each cluster would be pure, i.e., consisting of vectors corresponding to words with the same sense. Since we do not want the evaluation of the LSA-based representation to be influenced by the choice of clustering algorithm, or by the algorithm's initialization and parameter settings, which determine the resulting grouping, we took an orthogonal approach to the problem: instead of evaluating the purity of clusters formed from the geometrical position of the vectors, we evaluate how well-formed the clusters defined by sense labels are, that is, how tight they are and how well separated from each other. As will be discussed below, performance evaluation of such sense-based clusters results in an upper bound on the performance that can be obtained by clustering algorithms such as EM or K-means.</Paragraph>
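    <Paragraph> As a hedged illustration of the sense-based clusters described above, the sketch below groups context vectors by their sense label rather than by a clustering algorithm, then reports one possible tightness score and one possible separation score. The specific measures (mean cosine similarity to the cluster centroid, and cosine similarity between centroids) are assumptions chosen for illustration, not the measures defined in the paper, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sense_based_clusters(vectors, labels):
    """Group context vectors by sense label; no clustering algorithm is run."""
    groups = defaultdict(list)
    for vec, label in zip(vectors, labels):
        groups[label].append(vec)
    return dict(groups)


def tightness(groups):
    """Assumed measure: mean cosine similarity of members to their centroid."""
    scores = {}
    for label, vecs in groups.items():
        center = centroid(vecs)
        scores[label] = sum(cosine(vec, center) for vec in vecs) / len(vecs)
    return scores


def separation(groups):
    """Assumed measure: cosine similarity between each pair of centroids.

    Values near zero (or negative) indicate well-separated sense clusters.
    """
    centers = {label: centroid(vecs) for label, vecs in groups.items()}
    return {
        (a, b): cosine(centers[a], centers[b])
        for a, b in combinations(sorted(centers), 2)
    }
```

Because the groups are fixed by the labels, the scores depend only on the vector representation and not on any clustering initialization or parameter setting, which is the motivation stated in the paragraph.</Paragraph>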
  </Section>
</Paper>