<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0604"> <Title>New Experiments in Distributional Representations of Synonymy</Title> <Section position="4" start_page="25" end_page="26" type="metho"> <SectionTitle> 2 The Test </SectionTitle> <Paragraph position="0"> To generate a TOEFL-like test from WordNet, we perform the following procedure once each for nouns, verbs, adjectives, and adverbs. Given a list of candidate words, we produce one test question for every ordered pair of words appearing together in any synset in the respective WordNet part-of-speech database. Decoy words are chosen at random from among other words in the database that do not have a synonymy relation with either word in the pair.</Paragraph> <Paragraph position="1"> For convenience, we will call the resulting test the WordNet-based synonymy test (WBST). We take a few additional steps in order to increase the resemblance between the WBST and the TOEFL. First, we remove from consideration any stop words or inflected forms. Note that whether a particular wordform is inflected is a function of its presumed part of speech. The word &quot;indicted&quot; is either an inflected verb (so would not be used as a word in a question involving verbs) or an uninflected adjective. Second, we rule out pairs of words that are too similar under the string edit distance. Morphological variants often share a synset in WordNet. For example, &quot;group&quot; and &quot;grouping&quot; share a nominal sense. Questions using such pairs appear trivial to human test takers and allow stemming shortcuts. In the experiments reported in this paper, we used WordNet 1.7.1. Our experimental corpus is the North American News corpus, which is also used by Ehlert (2003). We include as a candidate test word any word occurring at least 1000 times in the corpus (about 15,000 words when restricted to those appearing in WordNet). Table 1 shows four sample questions generated from this list out of the noun database.

Table 1: Sample questions from the noun test (answers: A, D, C, and A).
  technology:  A. engineering   B. difference   C. department   D. west
  stadium:     A. miss          B. hockey       C. wife         D. bowl
  string:      A. giant         B. ballet       C. chain        D. hat
  trial:       A. run           B. one-third    C. drove        D. form

In total, this procedure yields 9887 noun, 7398 verb, 5824 adjective, and 461 adverb questions, a total of 23,570 questions.1 This procedure yields questions that differ in some interesting ways from those in the TOEFL. Most notable is a bias in favor of polysemous terms. The number of times a word appears as either the target or the answer word is proportional to the number of synonyms it has in the candidate list. In contrast, decoy words are chosen at random, so they are less polysemous on average.</Paragraph> </Section> <Section position="5" start_page="26" end_page="28" type="metho"> <SectionTitle> 3 The Space of Solutions </SectionTitle> <Paragraph position="0"> Given that we have a large number of test questions composed of words with high corpus frequencies, we now seek to optimize performance on the WBST. The solutions we consider all start with a word-conditional context frequency vector, usually normalized to form a probability distribution. We answer a question by comparing the target term vector with each of the response term vectors and choosing the &quot;closest.&quot; This problem definition excludes a common class of solutions in which the closeness of a pair of terms is a statistic of the co-occurrence patterns of the specific terms in question.</Paragraph>
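To make the test-generation procedure of Section 2 concrete, here is a minimal sketch using NLTK's WordNet interface. It is an illustration under stated assumptions, not the authors' implementation: NLTK ships a later WordNet than the 1.7.1 used in the paper, the stop-word and inflection filters are omitted, and the helper names (wbst_questions, edit_distance) and the min_edit threshold are ours.

```python
import random
from itertools import permutations
from nltk.corpus import wordnet as wn  # assumption: NLTK's WordNet, not 1.7.1

def edit_distance(a, b):
    # Plain Levenshtein distance, used to rule out near-identical pairs.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def wbst_questions(candidates, pos=wn.NOUN, min_edit=3, n_decoys=3, seed=0):
    # Yield (target, answer, decoys) triples for one part of speech.
    rng = random.Random(seed)
    cands = [w for w in candidates if wn.synsets(w, pos)]
    # Synset-mates of each candidate within this part of speech.
    syns = {w: {l.name() for s in wn.synsets(w, pos) for l in s.lemmas()}
            for w in cands}
    for target, answer in permutations(cands, 2):       # every ordered pair
        if answer not in syns[target]:
            continue                                    # must share a synset
        if edit_distance(target, answer) < min_edit:
            continue                                    # skip trivial variants
        pool = [w for w in cands
                if w not in syns[target] and w not in syns[answer]]
        decoys = rng.sample(pool, n_decoys)             # random non-synonyms
        yield target, answer, decoys
```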
It has been shown that measures based on the pointwise mutual information (PMI) between question words yield good results on the TOEFL (Turney, 2001; Terra and Clarke, 2003). However, Ehlert (2003) shows convincingly that, for a fixed amount of data, the distributional model performs better than what we might call the pointwise co-occurrence model.</Paragraph> <Paragraph position="1"> Terra and Clarke (2003) report a top score of 81.3% on an 80-word version of the TOEFL, which compares favorably with Ehlert's best of 82% on a 300-word version, but their corpus is approximately 200 times as large as Ehlert's.</Paragraph> <Paragraph position="2"> Note that these two approaches are complementary and can be combined in a supervised setting, along with static resources, to yield truly strong performance (97.5%) on the TOEFL (Turney et al., 2003). While impressive, this work raises an important question: where do we obtain the training data when moving to a less commonly taught language, to say nothing of the comprehensive thesauri and Web resources? In this paper, we focus on shallow methods that use only the text corpus. We are interested less in optimizing performance on the TOEFL than in investigating the validity and limits of the distributional hypothesis, and in illuminating the barriers to automated human-level lexical similarity judgments.</Paragraph> <Section position="1" start_page="26" end_page="26" type="sub_section"> <SectionTitle> 3.1 Definitions of Context </SectionTitle> <Paragraph position="0"> As in previous work, we form our context distributions by recording word-conditional counts of feature occurrences within some fixed window of a reference token. In this study, features are just unnormalized tokens, possibly augmented with direction and distance information. In other words, we do not investigate the utility of stemming. Similarly, except where noted, we do not remove stop words.</Paragraph> <Paragraph position="1"> All context definitions involve a window size, which specifies the number of tokens to consider on either side of an occurrence of a reference term. The window is always symmetric; thus, a window size of one indicates that only the immediately adjacent tokens on either side should be considered. By default, we bracket a token sequence with the pseudo-tokens &quot;<bos>&quot; and &quot;<eos>&quot;.2 Contextual tokens in the window may be either observed or disregarded, and the policy governing which to admit is one of the dimensions we explore here. The decision whether or not to observe a particular contextual token is made before counting commences, and is not sensitive to the circumstances of a particular occurrence (e.g., its participation in some syntactic relation (Lin, 1997; Lee, 1999)). When a contextual token is observed, it is always counted as a single occurrence. Thus, in contrast with earlier approaches (Sahlgren, 2001; Ehlert, 2003), we do not use a weighting scheme that is a function of distance from the reference token.</Paragraph> <Paragraph position="2"> Once we have chosen to observe a contextual token, additional parameters govern whether counting should be sensitive to the side of the reference token on which it occurs and to its distance from the reference token. If the strict direction parameter is true, a left occurrence is distinguished from a right occurrence. If strict distance is true, occurrences at distinct removes (in number of tokens) are recorded as distinct event types.</Paragraph>
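The context policies just described (window size, segment bracketing, strict direction, strict distance) map onto a small counting routine. The sketch below is a hedged rendering; the names context_counts, strict_direction, and strict_distance are ours, and vocab stands in for the candidate-word list of Section 2.

```python
from collections import Counter, defaultdict

def context_counts(sequences, vocab, window=2,
                   strict_direction=False, strict_distance=False):
    # Returns {word: Counter of context features} under the given policy.
    counts = defaultdict(Counter)
    for seq in sequences:
        toks = ["<bos>"] + list(seq) + ["<eos>"]     # bracket each segment
        for i, ref in enumerate(toks):
            if ref not in vocab:
                continue
            lo, hi = max(0, i - window), min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                feat = toks[j]                       # unnormalized token
                if strict_direction:                 # left vs. right events
                    feat = ("L" if j < i else "R", feat)
                if strict_distance:                  # distance-typed events
                    feat = (abs(j - i), feat)
                counts[ref][feat] += 1               # each occurrence counts 1
    return counts
```

Normalizing each word's Counter by its total yields the word-conditional context distribution used in the next subsection.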
</Section> <Section position="2" start_page="26" end_page="28" type="sub_section"> <SectionTitle> 3.2 Distance Measures </SectionTitle> <Paragraph position="0"> The product of a particular context policy is a co-occurrence matrix $M$, where the contents of a cell $M_{w,c}$ is the number of times context $c$ is observed to occur with word $w$. A row of this matrix ($M_w$) is therefore a word-conditional context frequency vector. (Footnote 2: In this paper, a sequence is a North American News segment delimited by the <p> tag. Nominally paragraphs, most of these segments are single sentences.)</Paragraph> <Paragraph position="1"> In comparing two of these vectors, we typically normalize counts so that all cells in a row sum to one, yielding a word-conditional distribution over contexts $p(c|w)$ (but see the Cosine measure below). We investigate some of the distance measures commonly employed in comparing term vectors:</Paragraph> <Paragraph position="3"> $$\text{Manhattan: } \sum_c |p(c|w_1) - p(c|w_2)| \qquad \text{Euclidean: } \sqrt{\sum_c \big(p(c|w_1) - p(c|w_2)\big)^2} \qquad \text{Cosine: } \frac{M_{w_1} \cdot M_{w_2}}{\|M_{w_1}\|\,\|M_{w_2}\|}$$ Note that whereas we use probabilities in calculating the Manhattan and Euclidean distances, in order to avoid magnitude effects, the Cosine, which defines a different kind of normalization, is applied to raw counts.</Paragraph> <Paragraph position="4"> We also avail ourselves of measures suggested by probability theory. For $\alpha \in (0,1)$ and word-conditional context distributions $p$ and $q$, we have the so-called $\alpha$-divergences (Zhu and Rohwer, 1995), measuring the divergence of $p$ from $q$: $$d_\alpha(p \,\|\, q) = \frac{1}{\alpha(1-\alpha)} \Big( 1 - \sum_c p(c)^\alpha \, q(c)^{1-\alpha} \Big) \qquad (1)$$ Members of this divergence family are in some sense preferred by theory to alternative measures. It can be shown that the $\alpha$-divergences (or divergences defined by combinations of them, such as the Jensen-Shannon or &quot;skew&quot; divergences (Lee, 1999)) are the only ones that are robust to redundant contexts (i.e., only divergences in this family are invariant) (Csiszár, 1975).</Paragraph> <Paragraph position="5"> Several notions of lexical similarity have been based on the KL-divergence. Note that if any context $c$ has $q(c) = 0$ where $p(c) > 0$, then $\mathrm{KL}(p \,\|\, q)$ is infinite; in general, the KL-divergence is very sensitive to small probabilities, and careful attention must be paid to smoothing if it is to be used with text co-occurrence data. The Jensen-Shannon divergence--an average of the divergences of $p$ and $q$ from their mean distribution--does not share this sensitivity and has previously been used in tests of lexical similarity (Lee, 1999). Furthermore, unlike the KL-divergence, it is symmetric, presumably a desirable property in this setting, since synonymy is a symmetric relation, and our test design exploits this symmetry.</Paragraph> <Paragraph position="8"> However, the Hellinger distance, $$H(p, q) = \sum_c \big( \sqrt{p(c)} - \sqrt{q(c)} \big)^2,$$ which corresponds to $d_{1/2}$ up to a constant factor, is also symmetric and robust to small or zero estimates. To our knowledge, the Hellinger distance has not previously been assessed as a measure of lexical similarity. We experimented with both the Hellinger distance and the Jensen-Shannon (JS) divergence, and obtained close scores across a wide range of parameter settings, with the Hellinger yielding a slightly better top score. We report results only for the Hellinger distance below. As will be seen, neither the Hellinger nor the JS divergence is optimal for this task.</Paragraph>
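For reference, the measures of this subsection can be written compactly over dense NumPy arrays. This is a minimal sketch under our reconstruction of the formulas above; smoothing for the KL-based Jensen-Shannon measure is left to the caller and handled here only by a small eps.

```python
import numpy as np

def manhattan(p, q):
    return np.abs(p - q).sum()

def euclidean(p, q):
    return np.sqrt(((p - q) ** 2).sum())

def cosine(u, v):
    # Applied to raw count vectors, per the note above.
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def alpha_divergence(p, q, alpha=0.5):
    # Equation 1; alpha = 0.5 gives the Hellinger distance up to a constant.
    return (1.0 - (p ** alpha * q ** (1.0 - alpha)).sum()) / (alpha * (1.0 - alpha))

def hellinger(p, q):
    return ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()

def jensen_shannon(p, q, eps=1e-12):
    # Average KL-divergence of p and q from their mean distribution.
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * np.log((a + eps) / (b + eps))).sum()
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```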
<Paragraph position="12"> In pursuit of synonymy, Ehlert (2003) derives a formula for the probability of the target word given a response word: $$p(w_1 \,|\, w_2) = \sum_c p(w_1 \,|\, c)\, p(c \,|\, w_2) \qquad (2)$$ $$= p(w_1) \sum_c \frac{p(c \,|\, w_1)\, p(c \,|\, w_2)}{p(c)} \qquad (3)$$ The second line, which fits more conveniently into our framework, follows from the first (Ehlert's expression) through an application of Bayes' theorem. While this measure falls outside the class of $\alpha$-divergences, our experiments confirm its relative strength on synonymy tests.</Paragraph> <Paragraph position="15"> It is possible to unify the $\alpha$-divergences with Ehlert's expression by defining a broader class of measures: $$m_{\alpha,\beta,\gamma}(w_1, w_2) = \sum_c \frac{p(c \,|\, w_1)^\alpha \, p(c \,|\, w_2)^\beta}{p(c)^\gamma} \qquad (4)$$ where $p(c)$ denotes the marginal context probabilities. Setting $\beta = 1 - \alpha$ and $\gamma = 0$ recovers the core of the $\alpha$-divergences, while $\alpha = \beta = \gamma = 1$ corresponds to Ehlert's expression. Since, in the context of a given question, $p(w_1)$ does not change, maximizing the expression in Equation 3 is the same as minimizing $-m_{1,1,1}(w_1, w_2)$.</Paragraph> </Section> </Section></Paper>
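Finally, a sketch of Ehlert's measure and the broader family of Equation 4, as reconstructed above. The names family_score and ehlert are ours; p_c stands for the marginal context distribution $p(c)$, and the target prior is exposed only to show that it can be dropped when ranking the responses to a single question.

```python
import numpy as np

def family_score(p, q, p_c, alpha=1.0, beta=1.0, gamma=1.0, eps=1e-12):
    # Equation 4: sum_c p(c|w1)^alpha * p(c|w2)^beta / p(c)^gamma
    # (a similarity: higher means closer).
    return (p ** alpha * q ** beta / (p_c + eps) ** gamma).sum()

def ehlert(p_target, p_resp, p_c, target_prior=1.0):
    # Equation 3: p(w1|w2); the prior p(w1) is constant within a question.
    return target_prior * family_score(p_target, p_resp, p_c, 1.0, 1.0, 1.0)

def answer(p_target, p_responses, p_c):
    # Choose the response distribution maximizing the Ehlert score,
    # i.e., minimizing -m_{1,1,1}.
    return int(np.argmax([ehlert(p_target, r, p_c) for r in p_responses]))
```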