<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2110">
  <Title>Word Vectors and Two Kinds of Similarity</Title>
  <Section position="3" start_page="0" end_page="858" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Recently, geometric models have been used to represent words and their meanings, and they have proven highly useful both for many NLP applications involving semantic processing (Widdows, 2004) and for human modeling in cognitive science (Gärdenfors, 2000; Landauer and Dumais, 1997). There are also good reasons for studying geometric models in the field of computational linguistics. First, geometric models are cost-effective: constructing large-scale geometric representations of word meanings takes far less time and effort than constructing dictionaries or thesauri. Second, they can represent implicit knowledge of word meanings that dictionaries and thesauri cannot capture. Finally, geometric representations are easy to revise and extend.</Paragraph>
    <Paragraph position="1"> A vector space model is the most commonly used geometric model for the meanings of words.</Paragraph>
    <Paragraph position="2"> The basic idea of a vector space model is that words are represented by high-dimensional vectors, i.e., word vectors, and the degree of semantic similarity between any two words can be easily computed as the cosine of the angle formed by their vectors.</Paragraph>
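As a minimal illustration of the computation described above (not part of the original paper), cosine similarity between word vectors can be sketched in Python with NumPy; the three-dimensional "word vectors" here are invented purely for the example, since real word vectors are high-dimensional:

```python
import numpy as np

def cosine(u, v):
    # Cosine of the angle between two vectors:
    # dot product divided by the product of their norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical low-dimensional word vectors, for illustration only.
cat = np.array([1.0, 2.0, 0.0])
dog = np.array([2.0, 3.0, 1.0])
car = np.array([0.0, 0.0, 5.0])

sim_cat_dog = cosine(cat, dog)  # close to 1.0: similar directions
sim_cat_car = cosine(cat, car)  # 0.0: orthogonal vectors
```

Under this measure, two words are maximally similar (cosine 1) when their vectors point in the same direction, regardless of vector length.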
    <Paragraph position="3"> A number of methods have been proposed for constructing word vectors. Latent semantic analysis (LSA) is the best-known method: it uses the frequency of each word across a collection of documents to determine the coordinates of word vectors, and singular value decomposition (SVD) to reduce their dimension. LSA was originally put forward as a document indexing technique for automatic information retrieval (Deerwester et al., 1990), but several studies (Landauer and Dumais, 1997) have shown that LSA successfully mimics many human behaviors associated with semantic processing. Other methods use a variety of other information: cooccurrence of two words (Burgess, 1998; Schütze, 1998), occurrence of a word in the sense definitions of a dictionary (Kasahara et al., 1997; Niwa and Nitta, 1994), or word association norms (Steyvers et al., 2004).</Paragraph>
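The LSA construction described above can be sketched as follows (again, not from the original paper): a toy word-by-document frequency matrix is factored by SVD, and the top-k singular dimensions give reduced word vectors. The counts are invented; real LSA operates on a large corpus:

```python
import numpy as np

# Toy word-by-document frequency matrix.
# Rows: words; columns: documents. Counts are invented for illustration.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],   # "cat"
    [1.0, 2.0, 0.0, 0.0],   # "dog"
    [0.0, 0.0, 2.0, 1.0],   # "car"
    [0.0, 0.0, 1.0, 2.0],   # "truck"
])

# SVD factors X into U * diag(s) * Vt. Truncating to the top-k
# singular values yields low-dimensional word vectors, as in LSA.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k] * s[:k]   # one k-dimensional vector per word
```

In this toy example the reduced vectors for "cat" and "dog" (which share documents) end up nearly parallel, while "cat" and "car" (which never cooccur) end up nearly orthogonal under the cosine measure.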
    <Paragraph position="4"> However, despite the fact that there are different kinds of similarity between words, or different relations underlying word similarity, such as synonymous relations and associative relations, no studies have systematically examined the relationship between methods for constructing word vectors and the type of similarity the resulting vectors capture. Some studies have compared the performance of different methods on specific tasks such as semantic disambiguation (Niwa and Nitta, 1994) and cued/free recall (Steyvers et al., 2004), but it is not at all clear whether there are essential differences in the quality of similarity among word vectors constructed by different methods, and if so, what kind of similarity is involved in what kind of word vectors. Even in the field of cognitive psychology, although geometric models of similarity such as multidimensional scaling have long been studied and debated (Nosofsky, 1992), the possibility that different methods for constructing word vectors may capture different kinds of word similarity has never been addressed.</Paragraph>
    <Paragraph position="5"> This study, therefore, aims to examine the relationship between methods for constructing word vectors and the type of similarity in a systematic way. In particular, this study addresses three methods, LSA-based, cooccurrence-based, and dictionary-based, and two kinds of similarity, taxonomic similarity and associative similarity. Word vectors constructed by these methods are compared on two tasks, a multiple-choice synonym test and a word association test, which measure the degree to which they reflect these two kinds of similarity.</Paragraph>
  </Section>
</Paper>