<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2110"> <Title>Word Vectors and Two Kinds of Similarity</Title>
<Section position="4" start_page="858" end_page="858" type="metho"> <SectionTitle> 2 Two Kinds of Similarity </SectionTitle>
<Paragraph position="0"> In this study, we divide word similarity into two categories: taxonomic similarity and associative similarity. Taxonomic similarity, or categorical similarity, is a kind of semantic similarity that holds between words at the same level of the categories or clusters of a thesaurus, in particular between synonyms, antonyms, and other coordinates. Associative similarity, on the other hand, is a similarity between words that are associated with each other through semantic relations other than taxonomic ones, such as collocational and proximity relations. For example, the word writer and the word author are taxonomically similar because they are synonyms, while the word writer and the word book are associatively similar because they are associated by virtue of an agent-subject relation.</Paragraph>
<Paragraph position="1"> This dichotomy of similarity is practically important. Some tasks, such as automatic thesaurus updating and paraphrasing, require assessing taxonomic similarity, while others, such as affective Web search and semantic disambiguation, require assessing associative similarity rather than taxonomic similarity. The dichotomy is also psychologically motivated. Many empirical studies on word searches and speech disorders have revealed that words in the mind (i.e., the mental lexicon) are organized by these two kinds of similarity (Aitchison, 2003). The dichotomy is also essential to some cognitive processes. For example, metaphors are perceived as more apt when their constituent words are associatively more similar but categorically dissimilar (Utsumi et al., 1998). These psychological findings suggest that people distinguish between the two kinds of similarity in certain cognitive processes.</Paragraph> </Section>
<Section position="5" start_page="858" end_page="859" type="metho"> <SectionTitle> 3 Constructing Word Vectors </SectionTitle>
<Section position="1" start_page="858" end_page="858" type="sub_section"> <SectionTitle> 3.1 Overview </SectionTitle>
<Paragraph position="0"> In this study, word vectors (or word spaces) are constructed in the following way. First, every content word $t_i$ in a corpus is represented as an $m$-dimensional feature vector $\boldsymbol{w}_i$:

$$\boldsymbol{w}_i = (w_{i1}, w_{i2}, \ldots, w_{im}) \quad (1)$$

Each element $w_{ij}$ is determined by statistical analysis of the corpus; the methods are described in Section 3.3. A matrix $M$ is then constructed using the $n$ feature vectors as rows:

$$M = \begin{pmatrix} \boldsymbol{w}_1 \\ \vdots \\ \boldsymbol{w}_n \end{pmatrix} \quad (2)$$

Finally, the dimension of the row vectors $\boldsymbol{w}_i$ is reduced from $m$ to $k$ by means of an SVD technique. As a result, every word is represented as a $k$-dimensional vector.</Paragraph> </Section>
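As an illustration of this construction (and of the SVD reduction detailed in Section 3.4), the following is a minimal sketch in Python with NumPy. The matrix values, corpus size, and the choice of k are placeholders for illustration, not the settings used in the paper.

```python
import numpy as np

# Minimal sketch of Sections 3.1 and 3.4: n feature vectors form the rows of M
# (Equation (2)), and a truncated SVD reduces them to k dimensions.
n, m, k = 6, 10, 3                     # n words, m raw features, k reduced dimensions (toy sizes)
rng = np.random.default_rng(0)
M = rng.random((n, m))                 # stand-in for the n x m matrix of feature vectors

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U S V^T, singular values nonincreasing
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

reduced_vectors = U_k                  # i-th row: k-dimensional "reduced word vector" for word t_i
M_approx = U_k @ S_k @ Vt_k            # rank-k approximation U_k S_k V_k^T of M
print(reduced_vectors.shape, M_approx.shape)       # (6, 3) (6, 10)
```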
<Section position="2" start_page="858" end_page="858" type="sub_section"> <SectionTitle> 3.2 Corpus </SectionTitle>
<Paragraph position="0"> In this study, we employ three kinds of Japanese corpora: newspaper articles, novels, and a dictionary. As a newspaper corpus, we use 4 months' worth of Mainichi newspaper articles published in 1999. They consist of 500,182 sentences in 251,287 paragraphs, and word vectors are constructed for the 53,512 words that occur three times or more in these articles. As a corpus of novels, we use a collection of 100 Japanese novels, "Shincho Bunko No 100 Satsu", consisting of 475,782 sentences and 230,392 paragraphs. Word vectors are constructed for the 46,666 words that occur at least three times. As a Japanese dictionary, we use "Super Nihongo Daijiten", published by Gakken, from which 89,007 words are extracted for word vectors.</Paragraph> </Section>
<Section position="3" start_page="858" end_page="859" type="sub_section"> <SectionTitle> 3.3 Methods for Computing Vector Elements </SectionTitle>
<Paragraph position="0"> LSA-based method (LSA) In the LSA-based method, a vector element $w_{ij}$ is assessed as the tf-idf score of the word $t_i$ in a text piece $s_j$. Here, $tf_{ij}$ denotes the number of times the word $t_i$ occurs in the text piece $s_j$, and $df_i$ denotes the number of pieces in which the word $t_i$ occurs. As the unit of a text piece $s_j$, we consider a sentence and a paragraph. Hence, for example, when a sentence is used as the unit, the dimension of the feature vectors $\boldsymbol{w}_i$ is equal to the number of sentences in the corpus. We also use two corpora, i.e., newspapers and novels, and thus we obtain four different word spaces with the LSA-based method.</Paragraph>
<Paragraph position="3"> Cooccurrence-based method (COO) In the cooccurrence-based method, a vector element $w_{ij}$ is assessed as the number of times the words $t_i$ and $t_j$ occur in the same piece of text, and thus $M$ is an $n \times n$ symmetric matrix. As in the LSA-based method, we use two units of text piece (a sentence or a paragraph) and two corpora (newspapers or novels), again obtaining four different word spaces.</Paragraph>
<Paragraph position="4"> Note that this method is similar to Schütze's (1998) method for constructing a semantic space in that both are based on word cooccurrence, not on word frequency. They differ, however, in that Schütze's method uses cooccurrence with frequent content words chosen as indicators of primitive meanings. Burgess's (1998) "Hyperspace Analogue to Language (HAL)" is also based on word cooccurrence but does not use any technique of dimensionality reduction.</Paragraph>
<Paragraph position="5"> Dictionary-based method (DIC) In the dictionary-based method, a vector element $w_{ij}$ is assessed by a weighting formula adapted from Kasahara et al. (1997), in which $f_{ij}$ denotes the number of times the word $t_j$ occurs in the sense definitions of the word $t_i$, and $df_j$ denotes the number of words whose sense definitions contain the word $t_j$. The second term in parentheses of that formula is the square root of the number of times the word $t_j$ occurs in the collected sense definitions of all words that are included in the sense definitions of the word $t_i$, while the third term is the number of times $t_i$ occurs in the sense definitions of $t_j$. The parameters $a$ and $b$ are positive real constants weighting these terms. (Following Kasahara et al. (1997), both parameters are set to 0.2 in this paper.) The formula was originally put forward by Kasahara et al. (1997), but our dictionary-based method differs from theirs in how the dimensions are reduced: their method groups together the dimensions for words in the same category of a thesaurus, whereas our method uses SVD, as described next.</Paragraph> </Section>
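To make the first two weighting schemes concrete, here is a small Python sketch. The paper's exact tf-idf formula is not reproduced above, so the standard tf x log(N/df) form below is an assumption, and the toy sentences, the tokenization, and the piece-level cooccurrence counting are illustrative only.

```python
from collections import Counter
from itertools import combinations
import math

# Illustrative sketch (not the authors' implementation) of the LSA-based and
# cooccurrence-based weightings of Section 3.3. Text pieces are sentences given
# as pre-tokenized word lists.
pieces = [
    ["writer", "book", "publish"],
    ["author", "book", "read"],
    ["writer", "author", "novel"],
]
N = len(pieces)                                         # number of text pieces
df = Counter(w for piece in pieces for w in set(piece)) # pieces containing each word

# LSA-based method: word-by-piece matrix of tf-idf scores (assumed tf * log(N/df) form).
tfidf = {
    (w, j): piece.count(w) * math.log(N / df[w])
    for j, piece in enumerate(pieces)
    for w in set(piece)
}

# Cooccurrence-based method: symmetric word-by-word matrix; here each cell counts
# the number of text pieces that contain both words.
cooc = Counter()
for piece in pieces:
    for w1, w2 in combinations(sorted(set(piece)), 2):
        cooc[(w1, w2)] += 1
        cooc[(w2, w1)] += 1

print(round(tfidf[("writer", 0)], 3), cooc[("writer", "book")])
```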
<Section position="4" start_page="859" end_page="859" type="sub_section"> <SectionTitle> 3.4 Reducing Dimensions </SectionTitle>
<Paragraph position="0"> Using an SVD technique, the matrix $M$ is factorized as the product of three matrices $USV^{\mathsf{T}}$, where the diagonal matrix $S$ consists of the $r$ singular values of $M$ arranged in nonincreasing order, $r$ being the rank of $M$. When we use the $k \times k$ matrix $S_k$ consisting of the $k$ largest singular values, the matrix $M$ is approximated by $U_k S_k V_k^{\mathsf{T}}$, where the $i$-th row of $U_k$ corresponds to the $k$-dimensional "reduced word vector" for the word $t_i$.</Paragraph> </Section> </Section>
<Section position="6" start_page="859" end_page="861" type="metho"> <SectionTitle> 4 Experiment 1: Synonym Judgment </SectionTitle>
<Section position="1" start_page="859" end_page="859" type="sub_section"> <SectionTitle> 4.1 Method </SectionTitle>
<Paragraph position="0"> In order to compare the different word vectors in terms of their ability to judge taxonomic similarity between words, we conducted a synonym judgment experiment using a standard multiple-choice synonym test. Each test item consisted of a stem word and five alternative words, from which the test-taker was asked to choose the one whose meaning is most similar to the stem word.</Paragraph>
<Paragraph position="1"> In the experiment, we used 32 items from the synonym portion of the Synthetic Personality Inventory (SPI) test, which has been widely used for employment selection in Japanese companies. These items were selected so that all the vector spaces contained the stem word and at least four of the five alternative words. For comparison, we also used 38 antonym test items chosen from the same SPI test. Furthermore, in order to obtain a more reliable, unbiased result, we automatically constructed 200 test items by choosing the stem word randomly, one correct alternative randomly from the words in the same deepest category of a Japanese thesaurus as the stem word, and the other four alternatives from words in other categories. As the Japanese thesaurus, we used "Goi-Taikei" (Ikehara et al., 1999).</Paragraph>
<Paragraph position="2"> In the computer simulation, the computer's choice was determined by computing the cosine similarity between the stem word and each of the five alternative words in a given vector space and choosing the word with the highest similarity.</Paragraph> </Section>
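The multiple-choice simulation just described amounts to taking the argmax of cosine similarity over the alternatives. A minimal sketch follows; the vectors and item words are made-up placeholders, not the actual word spaces or SPI items.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def choose_synonym(stem, alternatives, vectors):
    """Pick the alternative whose vector is most similar to the stem's vector (Section 4.1)."""
    return max(alternatives, key=lambda w: cosine(vectors[stem], vectors[w]))

# Toy example with made-up 3-dimensional vectors (placeholders, not real word spaces).
vectors = {
    "stem": np.array([0.9, 0.1, 0.0]),
    "alt1": np.array([0.8, 0.2, 0.1]),
    "alt2": np.array([0.0, 1.0, 0.0]),
    "alt3": np.array([0.1, 0.0, 1.0]),
}
print(choose_synonym("stem", ["alt1", "alt2", "alt3"], vectors))  # -> "alt1"
```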
<Section position="2" start_page="859" end_page="861" type="sub_section"> <SectionTitle> 4.2 Results and Discussion </SectionTitle>
<Paragraph position="0"> For each of the nine vector spaces, the synonym judgment simulation described above was conducted and the percentage of correct choices was calculated. This process was repeated for 20 different numbers of dimensions, i.e., every 50 dimensions from 50 to 1000.</Paragraph>
<Paragraph position="1"> Figure 1 shows the percentage of correct choices for the three methods of matrix construction. For the LSA-based method (denoted by LSA) and the cooccurrence-based method (denoted by COO), Figure 1 plots the correct rates for the word vectors derived from the paragraphs of the newspaper corpus. (This combination of corpus and text unit was optimal among all combinations, as will be justified later in this section.) The most important result shown in Figure 1 is that, regardless of the number of dimensions, the dictionary-based word vectors outperformed the other kinds of vectors on both the SPI and the computer-generated test items. This result suggests that the dictionary-based vector space reflects taxonomic similarity between words better than the LSA-based and cooccurrence-based spaces.</Paragraph>
<Paragraph position="2"> Another interesting finding is that there is no clear peak in the graphs of Figure 1. For the SPI test items, the correct rates of the three methods increased linearly with the number of dimensions (r = .86 for the LSA-based method, r = .72 for the cooccurrence-based method, and r = .93 for the dictionary-based method; all ps < .0001), while the correct rates for the computer-generated test items were steady. The absence of any obvious optimal dimensionality is in sharp contrast to Landauer and Dumais's (1997) finding that LSA word vectors with 300 dimensions achieved the maximum performance of a 53% correct rate on a similar multiple-choice synonym test. Note that their maximum performance was a little better than that of our LSA vectors, but still worse than that of our dictionary-based vectors.</Paragraph>
<Paragraph position="3"> Figure 2 shows the performance of the LSA-based and dictionary-based methods in antonym judgment, together with the results of synonym judgment. (Since the performance of the cooccurrence-based method did not differ from that of the LSA-based method, its correct rates are not plotted in this figure.) The dictionary-based method also outperformed the LSA-based method in antonym judgment, but the difference was much smaller than in synonym judgment; at 200 or fewer dimensions the LSA-based method was better than the dictionary-based method. Interestingly, the dictionary-based word vectors yielded better performance in synonym judgment than in antonym judgment, while the LSA-based vectors showed better performance in antonym judgment. These contrasting results may be attributed to differences in corpus characteristics. A dictionary's definitions of antonymous words are likely to involve different words, so that the differences between their meanings are made clear. In newspaper articles (or literary texts), on the other hand, the context words with which antonymous words occur are likely to overlap, because their meanings concern the same domain.</Paragraph>
<Paragraph position="4"> Finally, we show the results of the comparison among the four combinations of corpora and text units for the LSA-based and cooccurrence-based methods. Table 1 lists the mean correct rates on the SPI test and the computer-generated test, averaged over all numbers of dimensions. Regardless of construction method and test items, the word vectors constructed from newspaper paragraphs achieved the best performance (shown in boldface in the table). Concerning the effect of the corpus, the newspaper corpus was superior to the literary corpus. The difference of text units did not have a clear influence on the performance of the word spaces.</Paragraph> </Section> </Section>
<Section position="7" start_page="861" end_page="863" type="metho"> <SectionTitle> 5 Experiment 2: Word Association </SectionTitle>
<Section position="1" start_page="861" end_page="862" type="sub_section"> <SectionTitle> 5.1 Method </SectionTitle>
<Paragraph position="0"> In order to compare the ability of the word spaces to judge associative similarity, we conducted a word association experiment using a Japanese word association norm, "Renso Kijunhyo" (Umemoto, 1969). This free-association database was developed from the responses of 1,000 students to 210 stimulus words. For example, when given the word writer as a stimulus, students listed the words shown in Table 2. (Table 2 also shows the original words in Japanese.) For the simulation experiment, we selected 176 stimulus words that all three corpora contained. These stimuli had 27 associate words on average. We then removed any associate words that were synonymous with the stimulus word (e.g., author in Table 2), since the purpose of this experiment was to examine the ability to assess associative similarity between words. Whether each associate is synonymous with the stimulus was determined according to whether they belong to the same deepest category of the Japanese thesaurus "Goi-Taikei" (Ikehara et al., 1999). In the computer simulation, the cosine similarity between the stimulus word and every other word in the vector space was computed, and all the words were sorted in descending order of similarity. The top i words were then chosen as associates.</Paragraph>
<Paragraph position="1"> The ability of the word spaces to mimic human word association was evaluated by mean precision. Precision is the ratio of the number of human-produced associates among the computer-chosen associates to the number i of computer-chosen associates. A precision score was calculated every time a new human-produced associate was found in the top i words as i was incremented by 1, and the mean precision was then calculated as the average of all these precision scores. Note that, in order to eliminate the bias caused by the difference in the number of words contained in the different word spaces, we conducted the simulation using 46,000 words randomly chosen for each corpus such that they included all the human-produced associates.</Paragraph>
<Paragraph position="2"> Although this computational method of producing associates is sufficient for the present purpose, it may be inadequate as a model of the psychological process of free association. Empirical studies of word association (Nelson et al., 1998) revealed that frequent or familiar words are highly likely to be produced as associates, but our methods for constructing word vectors may not directly capture such frequency effects on word association. Hence, we conducted an additional experiment in which only familiar words were used for computing similarity to a given stimulus word, i.e., less familiar words were not used as candidates for associates. As a measure of word familiarity, we used the word familiarity scores (ranging from 1 to 7) provided by "Nihongo no Goitaikei" (Amano and Kondo, 2003). Using these scores, we treated the words whose familiarity score is 5 or higher as familiar.</Paragraph> </Section>
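The mean-precision measure described above can be sketched as follows; this is a minimal illustration under the stated definition, with made-up vectors and associate sets rather than the actual word spaces or association norms, and without the familiarity filtering (which would simply restrict the candidate list before ranking).

```python
import numpy as np

def mean_precision(stimulus, associates, vectors):
    """Rank every other word by cosine similarity to the stimulus and average the
    precision values obtained each time a new human-produced associate is reached
    (the evaluation measure described in Section 5.1)."""
    stim = vectors[stimulus]
    candidates = [w for w in vectors if w != stimulus]
    ranked = sorted(
        candidates,
        key=lambda w: float(vectors[w] @ stim / (np.linalg.norm(vectors[w]) * np.linalg.norm(stim))),
        reverse=True,
    )
    precisions, hits = [], 0
    for i, w in enumerate(ranked, start=1):
        if w in associates:
            hits += 1
            precisions.append(hits / i)   # precision at the rank where a new associate is found
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy example with made-up 2-dimensional vectors (placeholders, not real word spaces).
vectors = {
    "writer": np.array([1.0, 0.0]),
    "book":   np.array([0.9, 0.1]),
    "novel":  np.array([0.7, 0.3]),
    "stone":  np.array([0.0, 1.0]),
}
print(mean_precision("writer", {"book", "novel"}, vectors))  # -> 1.0
```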
<Section position="2" start_page="862" end_page="863" type="sub_section"> <SectionTitle> 5.2 Results and Discussion </SectionTitle>
<Paragraph position="0"> For each of the nine vector spaces, the association judgment simulation was conducted and the mean precision was calculated. As in the synonym judgment experiment, this process was repeated every 50 dimensions from 50 to 1000.</Paragraph>
<Paragraph position="1"> Figure 3 shows the results of the word association experiment. For the LSA-based and cooccurrence-based methods, two kinds of mean precision are plotted: the average of the mean precision scores over the four word spaces and the maximum score among them. (As will be shown in Table 3, the LSA-based method achieved its maximum precision when sentences of the newspaper corpus were used, while the performance of the cooccurrence-based method was maximal when paragraphs of the newspaper corpus were used.) The overall result is that the dictionary-based word vectors yielded the worst performance, in contrast to the result of synonym judgment. There was no large difference in performance between the LSA-based and cooccurrence-based methods, but the best cooccurrence-based vectors (constructed from newspaper paragraphs) considerably outperformed the other kinds of word vectors. (These results were replicated even when all the human-produced associates, including synonymous ones, were used for assessing the precision scores.) These results clearly show that the LSA-based and cooccurrence-based vector spaces reflect associative similarity between words better than the dictionary-based space.</Paragraph>
<Paragraph position="2"> The relation between the number of dimensions and the performance in association judgment was quite different from that observed in the synonym judgment experiment. Although the score of the dictionary-based vectors increased with the number of dimensions, as in synonym judgment, the scores of both the LSA-based and cooccurrence-based vectors peaked at around 200 dimensions, as Landauer and Dumais (1997) demonstrated. This finding suggests that a few hundred dimensions may be enough to represent the knowledge of associative similarity.</Paragraph>
<Paragraph position="3"> Figure 4 shows the result of the additional experiment in which familiarity effects were taken into account. Compared to the result without familiarity filtering, the performance of the dictionary-based method improved remarkably; it outperformed the LSA-based method at 350 or more dimensions and the cooccurrence-based method at 800 or more dimensions. This may be because word occurrence in the sense definitions of a dictionary does not reflect the actual frequency or familiarity of words, and thus the dictionary-based method may overestimate the similarity of infrequent or unfamiliar words. On the other hand, since a corpus of newspaper articles or novels is likely to reflect actual word frequency, the vector spaces derived from these corpora represent the similarity of infrequent words as appropriately as that of familiar words. (Indeed, the dictionary-based vector space contained a larger proportion of unfamiliar words than the other spaces: 63% of the words in the dictionary were judged as unfamiliar, while only 42% and 50% of the words in the newspapers and the novels, respectively, were judged as unfamiliar.) The result that the cooccurrence-based word vectors constructed from newspaper paragraphs achieved the best performance was again obtained in the additional experiment.</Paragraph>
<Paragraph position="4"> This consistent result indicates that the cooccurrence-based method is particularly useful for representing the knowledge of associative similarity between words. The relation between the number of dimensions and mean precision was unchanged even when the familiarity effect was taken into account.</Paragraph>
<Paragraph position="5"> Finally, Table 3 shows the comparison among the four kinds of word vectors constructed from different corpora and text units, in the experiments with and without familiarity filtering. The listed values are mean precisions averaged over all 20 numbers of dimensions. As in the synonym judgment experiment, the word vectors constructed from newspaper paragraphs achieved the best performance, although the LSA-based vectors alone attained their highest precision when derived from the sentences of newspaper articles. As in synonym judgment, the newspaper corpus showed better performance than the novel corpus, and the cooccurrence-based method in particular showed a fairly large difference in performance between the two corpora. This finding suggests that word cooccurrence in a newspaper corpus is more likely to reflect associative similarity.</Paragraph> </Section> </Section>
<Section position="8" start_page="863" end_page="863" type="metho"> <SectionTitle> 6 Semantic Network and Similarity </SectionTitle>
<Paragraph position="0"> As related work, Steyvers and Tenenbaum (2005) examined the properties of the semantic network, another important geometric model of word meanings. They found that three kinds of semantic networks (WordNet, Roget's thesaurus, and word associations) have a small-world structure and a scale-free pattern of connectivity, but that semantic networks constructed from LSA-based vector spaces do not have these properties. They interpreted this finding as indicating a limitation of vector space models such as LSA in modeling human knowledge of word meanings.</Paragraph>
<Paragraph position="1"> However, we can interpret their finding differently by considering the possibility that different semantic networks capture different kinds of word similarity. Scale-free networks share the characteristic that a small number of nodes are connected to a very large number of other nodes (Barabási and Albert, 1999). In semantic networks, such "hub" nodes correspond to basic and highly polysemous words such as make and money, and these words are likely to be taxonomically similar to many other words. Hence, if semantic networks largely reflect taxonomic similarity between words, they are likely to have a scale-free structure. On the other hand, since it is less plausible that only a few words are associatively similar to a large number of other words, semantic networks reflecting associative similarity may not have a scale-free structure. Taken together, Steyvers and Tenenbaum's (2005) finding can be reinterpreted as suggesting that WordNet and Roget's thesaurus better reflect taxonomic similarity, while the LSA-based word vectors better reflect associative similarity, which is consistent with our finding.</Paragraph> </Section> </Paper>