<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1049">
<Title>CO-OCCURRENCE VECTORS FROM CORPORA VS. DISTANCE VECTORS FROM DICTIONARIES</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Word vectors reflecting word meanings are expected to enable numerical approaches to semantics. Some early attempts at vector representation in psycholinguistics were the semantic differential approach (Osgood et al. 1957) and the associative distribution approach (Deese 1962). However, they were derived manually through psychological experiments. An early attempt at automation was made by Wilks et al. (1990) using co-occurrence statistics. Since then, there have been some promising results from using co-occurrence vectors, such as word sense disambiguation (Schütze 1993) and word clustering (Pereira et al. 1993).</Paragraph>
<Paragraph position="1"> However, using co-occurrence statistics requires a huge corpus that covers even most rare words.</Paragraph>
<Paragraph position="2"> We recently developed word vectors that are derived from an ordinary dictionary by measuring the inter-word distances in the word definitions (Niwa and Nitta 1993). This method, by its nature, has no problem handling rare words. In this paper we examine the usefulness of these distance vectors as semantic representations by comparing them with co-occurrence vectors.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="304" type="metho">
<SectionTitle> 2 Distance Vectors </SectionTitle>
<Paragraph position="0"> A reference network of the words in a dictionary (Fig. 1) is used to measure the distance between words. The network is a graph that shows which words are used in the definition of each word (Nitta 1988). The network shown in Fig. 1 is a very small portion of the reference network for the Collins English Dictionary (1979 edition) in the CD-ROM I (Liberman 1991), with 60K head words + 1.6M definition words.</Paragraph>
<Paragraph position="1"> For example, the definition for dictionary is "a book in which the words of a language are listed alphabetically ...". The word dictionary is thus linked to the words book, word, language, and alphabetical.</Paragraph>
<Paragraph position="2"> A word vector is defined as the list of distances from a word to a certain set of selected words, which we call origins. The words in Fig. 1 marked with Oi (unit, book, and people) are assumed to be origin words. In principle, origin words can be freely chosen. In our experiments we used middle-frequency words: the 51st to 1050th most frequent words in the reference Collins English Dictionary (CED).</Paragraph>
<Paragraph position="3"> The distance vector for dictionary is derived as follows: v(dictionary) = (d1, d2, ..., dn), where the i-th element di is the distance (the length of the shortest path) between dictionary and the i-th origin, Oi. To begin with, we assume every link has a constant length of 1. The actual definition of link length will be given later.</Paragraph>
<Paragraph position="4"> If word A is used in the definition of word B, these words are expected to be strongly related. This is the basis of our hypothesis that the distances in the reference network reflect the associative distances between words (Nitta 1993).</Paragraph>
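For illustration, here is a minimal Python sketch (ours, not from the paper) of how such distance vectors could be computed by breadth-first search over a reference network, assuming unit link lengths and a tiny hypothetical dictionary graph; all names are illustrative.

from collections import deque

def shortest_distances(graph, source, origins, max_dist=10):
    """Breadth-first search over a reference network.

    graph   : dict mapping each head word to the set of words used in its definition
    source  : the word whose distance vector we want
    origins : ordered list of origin words (the vector's coordinates)
    Unit link length is assumed, as in the paper's first approximation.
    """
    # Treat definition links as bidirectional for distance measurement.
    neighbors = {}
    for head, defs in graph.items():
        neighbors.setdefault(head, set()).update(defs)
        for d in defs:
            neighbors.setdefault(d, set()).add(head)

    dist = {source: 0}
    queue = deque([source])
    while queue:
        w = queue.popleft()
        if dist[w] >= max_dist:
            continue
        for nxt in neighbors.get(w, ()):
            if nxt not in dist:
                dist[nxt] = dist[w] + 1
                queue.append(nxt)
    # Unreachable origins are capped at the cutoff distance.
    return [dist.get(o, max_dist) for o in origins]

# Toy example (hypothetical definitions, not the CED):
graph = {
    "dictionary": {"book", "word", "language", "alphabetical"},
    "book": {"page", "word"},
    "word": {"language", "unit"},
    "people": {"person"},
}
print(shortest_distances(graph, "dictionary", ["unit", "book", "people"]))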
<Section position="1" start_page="304" end_page="304" type="sub_section">
<SectionTitle> Use of Reference Networks </SectionTitle>
<Paragraph position="0"> Reference networks have been successfully used as neural networks (by Véronis and Ide (1990) for word sense disambiguation) and as fields for artificial association, such as spreading activation (by Kozima and Furugori (1993) for context-coherence measurement). The distance vector of a word can be considered to be a list of the activation strengths at the origin nodes when the word node is activated. Therefore, distance vectors can be expected to convey almost the same information as the entire network, and they are clearly much easier to handle.</Paragraph>
<Paragraph position="1"> Dependence on Dictionaries. As a semantic representation of words, distance vectors are expected to depend only weakly on the particular source dictionary. We compared two sets of distance vectors, one from LDOCE (Procter 1978) and the other from COBUILD (Sinclair 1987), and verified that their difference is at least smaller than the difference between the word definitions themselves (Niwa and Nitta 1993).</Paragraph>
<Paragraph position="2"> We will now describe some technical details about the derivation of distance vectors.</Paragraph>
<Paragraph position="3"> Link Length. Distance measurement in a reference network depends on the definition of link length. Previously, we assumed for simplicity that every link has a constant length. However, this simple definition seems unnatural because it does not reflect word frequency. Because a path through low-frequency words (rare words) implies a strong relation, it should be measured as a shorter path. Therefore, we use the following definition of link length, which takes account of word frequency: length(W1, W2) := -log( n / sqrt(N1 N2) ). This is the length of the links between words Wi (i = 1, 2) in Fig. 2, where Ni denotes the total number of links from and to Wi and n denotes the number of direct links between these two words.</Paragraph>
<Paragraph position="4"> Fig. 2 Links between two words.</Paragraph>
<Paragraph position="5"> Normalization. Distance vectors are normalized by first changing each coordinate into its deviation within that coordinate: v = (v_i) -> v' = (v'_i), with v'_i = (v_i - a_i) / σ_i, where a_i and σ_i are the average and the standard deviation of the distances from the i-th origin. Next, each coordinate is changed into its deviation within the vector: v''_i = (v'_i - m') / σ', where m' and σ' are the average and the standard deviation of the v'_i (i = 1, ...).</Paragraph>
</Section>
</Section>
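The link-length weighting and the two-step normalization above are straightforward to make concrete. The following Python sketch (ours, not the authors' code) implements them under the stated assumptions: the weighted length uses the -log(n / sqrt(N1 N2)) form reconstructed above, and the distance vectors are assumed to be stored as rows of a NumPy matrix.

import math
import numpy as np

def link_length(n_direct, n_links_w1, n_links_w2):
    """Weighted link length: -log(n / sqrt(N1 * N2)).
    Links through rare words (small N) come out shorter, i.e. stronger."""
    return -math.log(n_direct / math.sqrt(n_links_w1 * n_links_w2))

def normalize_distance_vectors(D):
    """Two-step normalization of distance vectors.
    D[w, i] is the distance from word w to the i-th origin."""
    # Step 1: deviation within each coordinate (column-wise z-score).
    col_mean, col_std = D.mean(axis=0), D.std(axis=0)
    col_std[col_std == 0] = 1.0          # guard against constant columns
    V = (D - col_mean) / col_std
    # Step 2: deviation within each vector (row-wise z-score).
    row_mean = V.mean(axis=1, keepdims=True)
    row_std = V.std(axis=1, keepdims=True)
    row_std[row_std == 0] = 1.0
    return (V - row_mean) / row_std

# Toy usage: two direct links between words with 10 and 40 total links,
# and a small 4-words-by-3-origins distance matrix.
print(link_length(n_direct=2, n_links_w1=10, n_links_w2=40))
D = np.array([[2., 1., 4.],
              [3., 2., 2.],
              [5., 4., 3.],
              [1., 1., 5.]])
print(normalize_distance_vectors(D))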
<Section position="5" start_page="304" end_page="305" type="metho">
<SectionTitle> 3 Co-occurrence Vectors </SectionTitle>
<Paragraph position="0"> We use ordinary co-occurrence statistics and measure the co-occurrence likelihood between two words, X and Y, by the mutual information estimate (Church and Hanks 1989): I(X, Y) = log+ ( P(X|Y) / P(X) ), where P(X) is the occurrence density of word X in the whole corpus, and the conditional probability P(X|Y) is the density of X in a neighborhood of word Y. Here the neighborhood is defined as the 50 words before or after any appearance of word Y. (There is a variety of neighborhood definitions, such as "100 surrounding words" (Yarowsky 1992) and "within a distance of no more than 3 words ignoring function words" (Dagan et al. 1993).) The logarithm with '+' is defined to be 0 for an argument less than 1. Negative estimates were neglected because they are mostly accidental, except when X and Y are frequent enough (Church and Hanks 1989).</Paragraph>
<Paragraph position="1"> A co-occurrence vector of a word X is defined as the list of co-occurrence likelihoods of the word with a certain set of origin words: v(X) = (I(X, O1), I(X, O2), ..., I(X, On)). We used the same set of origin words as for the distance vectors.</Paragraph>
<Paragraph position="2"> When the frequency of X or Y is zero, we cannot measure their co-occurrence likelihood, and such cases are not exceptional. This sparseness problem is well known and serious in co-occurrence statistics. We used as a corpus the 1987 Wall Street Journal in the CD-ROM I (1991), which has a total of 20M words. The number of words that appeared at least once was about 50% of the total 62K head words of the CED, and the percentage of word-origin pairs that appeared at least once was about 16% of the total 62K x 1000 pairs.</Paragraph>
</Section>
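As a concrete illustration of this estimate, the sketch below (ours, not the authors' code) computes a mutual-information co-occurrence vector from a tokenized corpus, assuming a symmetric window of 50 tokens and the log+ truncation described above; unseen word-origin pairs simply receive 0.

import math
from collections import Counter

def cooccurrence_vector(tokens, word, origins, window=50):
    """Co-occurrence vector of `word` against the origin words.

    Each element is the mutual information estimate
    log+ ( P(origin | word) / P(origin) ), where the neighborhood of `word`
    is the `window` tokens before and after each of its occurrences."""
    total = len(tokens)
    corpus_counts = Counter(tokens)

    # Collect all tokens in the neighborhoods of `word`.
    neighborhood = []
    for i, t in enumerate(tokens):
        if t == word:
            lo, hi = max(0, i - window), min(total, i + window + 1)
            neighborhood.extend(tokens[lo:i] + tokens[i + 1:hi])

    neigh_counts = Counter(neighborhood)
    neigh_total = len(neighborhood)

    vector = []
    for o in origins:
        if neigh_total == 0 or corpus_counts[o] == 0:
            vector.append(0.0)           # sparseness: unseen pairs get 0
            continue
        p_o = corpus_counts[o] / total               # P(origin)
        p_o_given_w = neigh_counts[o] / neigh_total  # P(origin | word)
        ratio = p_o_given_w / p_o
        vector.append(math.log(ratio) if ratio > 1 else 0.0)  # log+
    return vector

# Toy usage with a tiny, made-up token list:
tokens = "the bank raised the interest rate while the river bank flooded".split()
print(cooccurrence_vector(tokens, "bank", ["interest", "river", "rate"], window=3))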
<Section position="6" start_page="305" end_page="307" type="metho">
<SectionTitle> 4 Experimental Results </SectionTitle>
<Paragraph position="0"> We compared the two vector representations by using them for the following two semantic tasks. The first is word sense disambiguation (WSD) based on the similarity of context vectors; the second is the learning of positive or negative meanings from example words.</Paragraph>
<Paragraph position="1"> With WSD, the precision obtained by using co-occurrence vectors from a 20M-word corpus was higher than that obtained by using distance vectors from the CED.</Paragraph>
<Section position="1" start_page="305" end_page="305" type="sub_section">
<SectionTitle> 4.1 Word Sense Disambiguation </SectionTitle>
<Paragraph position="0"> Word sense disambiguation is a serious semantic problem. A variety of approaches have been proposed for solving it. For example, Véronis and Ide (1990) used reference networks as neural networks, Hearst (1991) used (shallow) syntactic similarity between contexts, Cowie et al. (1992) used simulated annealing for quick parallel disambiguation, and Yarowsky (1992) used co-occurrence statistics between words and thesaurus categories.</Paragraph>
<Paragraph position="1"> Our disambiguation method is based on the similarity of context vectors, which originated with Wilks et al. (1990). In this method, a context vector is the sum of its constituent word vectors (except the target word itself). That is, the context vector for context C: W_-N ... W_-1 W W_1 ... W_N is V(C) = sum of V(W_i) over i = -N, ..., -1, 1, ..., N, where V(W_i) is the word vector of W_i.</Paragraph>
<Paragraph position="2"> The similarity of contexts is measured by the angle of their vectors (or, in practice, the inner product of their normalized vectors): sim(C1, C2) = V(C1)·V(C2) / (|V(C1)| |V(C2)|).</Paragraph>
<Paragraph position="3"> Let word w have senses s1, s2, ..., sm, and let each sense si have the context examples Ci1, Ci2, ..., Cini. We infer that the sense of word w in an arbitrary context C is si if for some j the similarity sim(C, Cij) is maximum among all the context examples. Another possible way to infer the sense is to choose the sense si such that the average of sim(C, Cij) over j = 1, 2, ..., ni is maximum. We selected the first method because a peculiarly similar example is more important than the average similarity.</Paragraph>
<Paragraph position="4"> Figure 3 (next page) shows the disambiguation precision for 9 words. For each word, we selected the two senses shown over each graph. These senses were chosen because they are clearly different and we could collect a sufficient number (more than 20) of context examples. The names of the senses were chosen from the category names in Roget's International Thesaurus, except organ's. The results using distance vectors are shown by dots, and those using co-occurrence vectors from the 1987 WSJ (20M words) by circles. A context size (x-axis) of, for example, 10 means 10 words before the target word and 10 words after the target word. We used 20 examples per sense; they were taken from the 1988 WSJ. The test contexts were from the 1987 WSJ; the number of test contexts varies from word to word (100 to 1000). The precision is the simple average of the respective precisions for the two senses.</Paragraph>
<Paragraph position="5"> The results in Fig. 3 show that the precision obtained with co-occurrence vectors is higher than that obtained with distance vectors, except in two cases, interest and customs. And we have not yet found a case where the distance vectors give higher precision. Therefore we conclude that co-occurrence vectors are advantageous over distance vectors for WSD based on context similarity. The sparseness problem for co-occurrence vectors is not serious in this case because each context consists of plural words.</Paragraph>
</Section>
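The nearest-example decision rule described above can be sketched as follows (illustrative Python, not the authors' implementation); it assumes word vectors are available as a dictionary of NumPy arrays and that each sense is paired with a list of example contexts.

import numpy as np

def context_vector(context_words, word_vectors):
    """Sum of the constituent word vectors (words without a vector are skipped)."""
    dim = len(next(iter(word_vectors.values())))
    v = np.zeros(dim)
    for w in context_words:
        if w in word_vectors:
            v += word_vectors[w]
    return v

def similarity(v1, v2):
    """Inner product of the normalized vectors (cosine of the angle)."""
    n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
    return 0.0 if n1 == 0 or n2 == 0 else float(np.dot(v1, v2) / (n1 * n2))

def disambiguate(context, sense_examples, word_vectors):
    """Choose the sense whose single most similar example context wins
    (the paper prefers the best example over the per-sense average)."""
    vc = context_vector(context, word_vectors)
    best_sense, best_sim = None, float("-inf")
    for sense, examples in sense_examples.items():
        for ex in examples:
            s = similarity(vc, context_vector(ex, word_vectors))
            if s > best_sim:
                best_sense, best_sim = sense, s
    return best_sense

# Toy usage with made-up 3-dimensional word vectors; the target word itself
# is excluded from the context, as in the paper.
word_vectors = {
    "money": np.array([1.0, 0.2, 0.0]), "rate": np.array([0.9, 0.1, 0.0]),
    "river": np.array([0.0, 0.1, 1.0]), "water": np.array([0.1, 0.0, 0.9]),
}
senses = {"finance": [["money", "rate"]], "geography": [["river", "water"]]}
print(disambiguate(["money", "rate"], senses, word_vectors))  # -> "finance"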
<Section position="3" start_page="305" end_page="307" type="sub_section">
<SectionTitle> 4.2 Learning of positive-or-negative </SectionTitle>
<Paragraph position="0"> Another experiment using the same two vector representations was done to measure the learning of positive or negative meanings from example words. Figure 4 shows the changes in the precision (the percentage of agreement with the authors' combined judgement). The x-axis indicates the number of example words for each positive or negative pair. Judgement was again done by using the nearest example. The example and test words are shown in Tables 1 and 2, respectively.</Paragraph>
<Paragraph position="1"> In this case, the distance vectors were advantageous. The precision obtained with distance vectors increased to about 80% and then leveled off, while the precision obtained with co-occurrence vectors stayed around 60%. We can therefore conclude that the property of positive-or-negative is reflected in distance vectors more strongly than in co-occurrence vectors. The sparseness problem is supposed to be a major factor in this case.</Paragraph>
<Paragraph position="2"> positive (20 words): balanced, elaborate, elation, eligible, enjoy, fluent, honorary, honourable, hopeful, hopefully, influential, interested, legible, lustre, normal, recreation, replete, resilient, restorative, sincere. negative (30 words): confusion, cuckold, dally, damnation, dull, ferocious, flaw, hesitate, hostage, huddle, inattentive, liverish, lowly, mock, neglect, queer, rape, ridiculous, savage, scanty, sceptical, schizophrenia, scoff, scruffy, shipwreck, superstition, sycophant, trouble, wicked, worthless.</Paragraph>
</Section>
<Section position="4" start_page="307" end_page="307" type="sub_section">
<SectionTitle> 4.3 Supplementary Data </SectionTitle>
<Paragraph position="0"> In the experiments discussed above, the corpus size for co-occurrence vectors was set to 20M words ('87 WSJ) and the vector dimension for both co-occurrence and distance vectors was set to 1000. Here we show some supplementary data that support these parameter settings.</Paragraph>
<Paragraph position="1"> a. Corpus size (for co-occurrence vectors). Figure 5 shows the change in disambiguation precision as the corpus size for co-occurrence statistics increases from 200 words to 20M words. (The words are suit, issue and race, the context size is 10, and the number of examples per sense is 10.) These three graphs level off after around 1M words. Therefore, a corpus size of 20M words is not too small.</Paragraph>
<Paragraph position="2"> Fig. 5 Dependence of disambiguation precision on the corpus size for co-occurrence vectors (context size: 10, number of examples: 10/sense, vector dimension: 1000).</Paragraph>
<Paragraph position="3"> b. Vector dimension. Figure 6 (next page) shows the dependence of disambiguation precision on the vector dimension for (i) co-occurrence and (ii) distance vectors. For co-occurrence vectors, the precision levels off near a dimension of 100; therefore, a dimension of 1000 is sufficient or even redundant. However, in the case of distance vectors, it is not clear whether the precision has leveled off or is still increasing around a dimension of 1000.</Paragraph>
</Section>
</Section>
</Paper>