<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1146"> <Title>Characterising Measures of Lexical Distributional Similarity</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Overlap of neighbour sets </SectionTitle> <Paragraph position="0"> We have described a number of ways of calculating distributional similarity. We now consider whether there is substantial variation in a word's distributionally nearest neighbours according to the chosen measure. We do this by calculating the overlap between neighbour sets for 2000 nouns generated using different measures from direct-object data extracted from the British National Corpus (BNC).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Experimental set-up </SectionTitle> <Paragraph position="0"> The data from which sets of nearest neighbours are derived is direct-object data for 2000 nouns extracted from the BNC using a robust accurate statistical parser (RASP) (Briscoe and Carroll, 2002). For reasons of computational efficiency, we limit ourselves to 2000 nouns and direct-object relation data. Given the goal of comparing neighbour sets generated by different measures, we would not expect these restrictions to affect our findings. The complete set of 2000 nouns (WScomp) is the union of two sets WShigh and WSlow for which nouns were selected on the basis of frequency: WShigh contains the 1000 most frequently occurring nouns (frequency > 500), and WSlow contains the nouns ranked 3001-4000 (frequency ~ 100). By excluding mid-frequency nouns, we obtain a clear separation between high and low frequency nouns.</Paragraph> <Paragraph position="1"> The complete data-set consists of 1,596,798 co-occurrence tokens distributed over 331,079 co-occurrence types. From this data, we computed the similarity between every pair of nouns according to each distributional similarity measure.
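For illustration, one of the simpler measures, the cosine over co-occurrence counts, might be computed roughly as follows. This is a minimal sketch, not the paper's implementation; the toy direct-object profiles are invented.

```python
import math

def cosine_sim(u, v):
    # u, v: sparse co-occurrence vectors, i.e. dicts mapping a co-occurrence
    # type (here: a governing verb in the direct-object relation) to a count.
    shared = set(u).intersection(v)
    dot = sum(u[f] * v[f] for f in shared)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented toy direct-object profiles (noun: verb counts).
dog = {"feed": 4, "walk": 6, "see": 2}
cat = {"feed": 5, "see": 3, "stroke": 2}
print(round(cosine_sim(dog, cat), 3))
```

Each of the measures under study would play this role: a function from two co-occurrence distributions to a similarity score.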
We then generated ranked sets of nearest neighbours (of size k = 200 and where a word is excluded from being a neighbour of itself) for each word and each measure.</Paragraph> <Paragraph position="2"> For a given word, we compute the overlap between neighbour sets using a comparison technique adapted from Lin (1998). Given a word w, each word w0 in WScomp is assigned a rank score of k - rank if it is one of the k nearest neighbours of w using measure m and zero otherwise. If NS(w;m) is the vector of such scores for word w and measure m, then the overlap, C(NS(w;m1);NS(w;m2)), of two neighbour sets is the cosine between the two vectors:</Paragraph> <Paragraph position="3"> C(NS(w;m1);NS(w;m2)) = NS(w;m1) · NS(w;m2) / (|NS(w;m1)| |NS(w;m2)|) </Paragraph> <Paragraph position="4"> The overlap score indicates the extent to which sets share members and the extent to which they are in the same order. To achieve an overlap score of 1, the sets must contain exactly the same items in exactly the same order. An overlap score of 0 is obtained if the sets do not contain any common items. If two sets share roughly half their items and these shared items are dispersed throughout the sets in a roughly similar order, we would expect the overlap between sets to be around 0.5.</Paragraph> <Paragraph position="6"> [Table 3 caption: overlap of neighbour sets generated by the similarity measures with precision, recall and the harmonic mean in the AMCRM.]</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Results </SectionTitle> <Paragraph position="0"> Table 2 shows the mean overlap score between every pair of the first seven measures in Table 1 calculated over WScomp. Table 3 shows the mean overlap score between each of these measures and precision, recall and the harmonic mean in the AMCRM.
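The rank-score overlap behind these tables can be sketched as follows. This is a minimal illustration of the Lin-style comparison described in Section 3.1; the vocabulary and neighbour lists are invented.

```python
import math

def rank_vector(neighbours, vocab, k):
    # Rank score: a word at rank r (1-based) among the k nearest neighbours
    # gets k - r; every other word in the vocabulary gets 0.
    scores = dict.fromkeys(vocab, 0)
    for r, word in enumerate(neighbours[:k], start=1):
        scores[word] = k - r
    return [scores[w] for w in sorted(vocab)]

def overlap(ns1, ns2):
    # Cosine between two rank-score vectors.
    dot = sum(a * b for a, b in zip(ns1, ns2))
    n1 = math.sqrt(sum(a * a for a in ns1))
    n2 = math.sqrt(sum(b * b for b in ns2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

vocab = ["cat", "dog", "horse", "mouse", "rat"]
m1 = ["cat", "dog", "horse", "rat"]   # neighbours of w under measure 1
m2 = ["cat", "dog", "mouse", "rat"]   # neighbours of w under measure 2
print(round(overlap(rank_vector(m1, vocab, 4), rank_vector(m2, vocab, 4)), 3))
```

Because the score decays with rank, two sets that agree on their top neighbours but diverge lower down still obtain high overlap, as in this toy case.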
In both tables, standard deviations are given in brackets and boldface denotes the highest levels of overlap for each measure.</Paragraph> <Paragraph position="1"> For compactness, each measure is denoted by its subscript from Table 1.</Paragraph> <Paragraph position="2"> Although overlap between most pairs of measures is greater than expected if sets of 200 neighbours were generated randomly from WScomp (in this case, average overlap would be 0.08 and only the overlap between the pairs (α, P) and (cp, P) is not significantly greater than this at the 1% level), there are substantial differences between the neighbour sets generated by different measures. For example, for many pairs, neighbour sets do not appear to have even half their members in common.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Frequency analysis </SectionTitle> <Paragraph position="0"> We have seen that there is a large variation in neighbours selected by different similarity measures. In this section, we analyse how neighbour sets vary with respect to one fundamental statistical property: word frequency. To do this, we measure the bias in neighbour sets towards high frequency nouns and consider how this varies depending on whether the target noun is itself a high frequency noun or a low frequency noun.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Measuring bias </SectionTitle> <Paragraph position="0"> If a measure is biased towards selecting high frequency words as neighbours, then we would expect that neighbour sets for this measure would be made up mainly of words from WShigh. Further, the more biased the measure is, the more highly ranked these high frequency words will tend to be.
In other words, there will be high overlap between neighbour sets generated considering all 2000 nouns as potential neighbours and neighbour sets generated considering just the nouns in WShigh as potential neighbours. In the extreme case, where all of a noun's k nearest neighbours are high frequency nouns, the overlap with the high frequency noun neighbour set will be 1 and the overlap with the low frequency noun neighbour set will be 0. The inverse is, of course, true if a measure is biased towards selecting low frequency words as neighbours.</Paragraph> <Paragraph position="1"> If NSwordset is the vector of neighbours (and associated rank scores) for a given word, w, and similarity measure, m, generated considering just the words in wordset as potential neighbours, then the overlap between two neighbour sets can be computed using a cosine (as before). If Chigh = C(NScomp;NShigh) and Clow = C(NScomp;NSlow), then we compute the bias towards high frequency neighbours for word w using measure m as: biashigh,m(w) = Chigh / (Chigh + Clow). The value of this normalised score lies in the range [0,1], where 1 indicates a neighbour set completely made up of high frequency words, 0 indicates a neighbour set completely made up of low frequency words and 0.5 indicates a neighbour set with no bias towards high or low frequency words. This score is more informative than simply calculating the proportion of high and low frequency words in each neighbour set because it weights the importance of neighbours by their rank in the set.</Paragraph> <Paragraph position="2"> [Table 4 caption: mean biashigh score by measure and frequency of target noun; columns give high freq. and low freq. target nouns.]</Paragraph> <Paragraph position="3">
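The bias score can be sketched as follows, reusing a cosine overlap over rank-score dicts. This is an illustrative sketch only; the neighbour lists and word pools are invented.

```python
import math

def rank_scores(neighbours, k):
    # Maps each of the k nearest neighbours (closest first) to k - rank.
    return {w: k - r for r, w in enumerate(neighbours[:k], start=1)}

def overlap(ns1, ns2):
    # Cosine between two rank-score dicts.
    shared = set(ns1).intersection(ns2)
    dot = sum(ns1[w] * ns2[w] for w in shared)
    n1 = math.sqrt(sum(v * v for v in ns1.values()))
    n2 = math.sqrt(sum(v * v for v in ns2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def bias_high(ns_comp, ns_high, ns_low):
    # ns_comp: neighbours drawn from all of WScomp; ns_high and ns_low:
    # neighbours drawn from WShigh only and from WSlow only.
    c_high = overlap(ns_comp, ns_high)
    c_low = overlap(ns_comp, ns_low)
    total = c_high + c_low
    return c_high / total if total else 0.5

k = 4
comp = rank_scores(["bank", "firm", "shop", "fund"], k)    # mixed pool
high = rank_scores(["bank", "firm", "fund", "group"], k)   # high-frequency pool
low = rank_scores(["shop", "stall", "kiosk", "hut"], k)    # low-frequency pool
print(round(bias_high(comp, high, low), 4))
```

Here most of the top-ranked neighbours in the mixed set come from the high-frequency pool, so the score lands well above 0.5.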
Thus, a large number of high frequency words in the positions closest to the target word is considered more biased than a large number of high frequency words distributed throughout the neighbour set.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4.2 Results </SectionTitle> <Paragraph position="0"> Table 4 shows the mean value of the biashigh score for every measure calculated over the set of high frequency nouns and over the set of low frequency nouns. The standard deviations (not shown) all lie in the range [0,0.2]. Any deviation from 0.5 of greater than 0.0234 is significant at the 1% level.</Paragraph> <Paragraph position="1"> For all measures and both sets of target nouns, there appear to be strong tendencies to select neighbours of particular frequencies. Further, there appear to be three classes of measures: those that select high frequency nouns as neighbours regardless of the frequency of the target noun (cm, js, α, cp and R); those that select low frequency nouns as neighbours regardless of the frequency of the target noun (P); and those that select nouns of a similar frequency to the target noun (ja, ja+mi, lin and hm).</Paragraph> <Paragraph position="2"> This can also be considered in terms of distributional generality. By definition, recall prefers words that have occurred in more of the contexts that the target noun has, regardless of whether they occur in other contexts as well, i.e., it prefers distributionally more general words.</Paragraph> <Paragraph position="3"> The probability of this being the case increases as the frequency of the potential neighbour increases and so recall tends to select high frequency words. In contrast, precision prefers words that have occurred in very few contexts that the target word has not, i.e., it prefers distributionally more specific words.
The probability of this being the case increases as the frequency of the potential neighbour decreases and so precision tends to select low frequency words. The harmonic mean of precision and recall prefers words that have both high precision and high recall. The probability of this being the case is highest when the words are of similar frequency and so the harmonic mean will tend to select words of a similar frequency.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Relative frequency and hyponymy </SectionTitle> <Paragraph position="0"> In this section, we consider the observed frequency effects from a semantic perspective.</Paragraph> <Paragraph position="1"> The concept of distributional generality introduced in the previous section has parallels with the linguistic relation of hyponymy, where a hypernym is a semantically more general term and a hyponym is a semantically more specific term. For example, animal is an (indirect1) hypernym of dog and conversely dog is an (indirect) hyponym of animal. Although one can obviously think of counter-examples, we would generally expect that the more specific term dog can only be used in contexts where animal can be used and that the more general term animal might be used in all of the contexts where dog is used and possibly others. Thus, we might expect that distributional generality is correlated with semantic generality: a word has high recall/low precision retrieval of its hyponyms' co-occurrences and high precision/low recall retrieval of its hypernyms' co-occurrences.</Paragraph> <Paragraph position="2"> Thus, if n1 and n2 are related and P(n2;n1) > R(n2;n1), we might expect that n2 is a hyponym of n1 and vice versa. However, having discussed a connection between frequency and distributional generality, we might also expect to find that the frequency of the hypernymic term is greater than that of the hyponymic term.
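This prediction can be illustrated on an unweighted, set-based simplification of co-occurrence retrieval. Note this is only a sketch of the idea: the paper's AMCRM weights co-occurrences, and the context sets below are invented.

```python
def precision(n1, n2):
    # Proportion of n1's co-occurrence types that n2 also exhibits.
    return len(n1.intersection(n2)) / len(n1)

def recall(n1, n2):
    # Proportion of n2's co-occurrence types that n1 also exhibits.
    return len(n1.intersection(n2)) / len(n2)

# Invented direct-object contexts: the general term "animal" occurs in a
# superset of the contexts of the specific term "dog".
animal = {"feed", "see", "protect", "hunt", "train"}
dog = {"feed", "train", "see"}

# precision(dog, animal) exceeds recall(dog, animal), so under the
# hypothesis above dog would be predicted to be the hyponym of animal.
print(precision(dog, animal), recall(dog, animal))
```

Comparing the two retrieval scores in this way gives a direction for each candidate pair, which is what the WordNet-based evaluation in the next paragraph tests.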
In order to test these hypotheses, we extracted all of the possible hyponym-hypernym pairs (20,415 pairs in total) from our list of 2000 nouns (using WordNet 1.6). We then calculated the proportion for which the direction of the hyponymy relation could be accurately predicted by the relative values of precision and recall and the proportion for which the direction of the hyponymy relation could be accurately predicted by relative frequency. We found that the direction of the hyponymy relation is correlated in the predicted direction with the precision-recall values in 71% of cases and correlated in the predicted direction with relative frequency in 70% of cases. [Footnote 1: There may be other concepts in the hypernym chain between dog and animal, e.g. carnivore and mammal.]</Paragraph> <Paragraph position="3"> This supports the idea of a three-way linking between distributional generality, relative frequency and semantic generality. We now consider the impact that this has on a potential application of distributional similarity methods.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Compositionality of collocations </SectionTitle> <Paragraph position="0"> In its most general sense, a collocation is a habitual or lexicalised word combination. However, some collocations, such as strong tea, are compositional, i.e., their meaning can be determined from their constituents, whereas others, such as hot dog, are not. Both types are important in language generation, since a system must choose between alternatives, but only non-compositional collocations are of interest in language understanding, since only these need to be listed in the dictionary.</Paragraph> <Paragraph position="1"> Baldwin et al. (2003) explore empirical models of compositionality for noun-noun compounds and verb-particle constructions.
Based on the observation (Haspelmath, 2002) that compositional collocations tend to be hyponyms of their head constituent, they propose a model which considers the semantic similarity between a collocation and its constituent words.</Paragraph> <Paragraph position="2"> McCarthy et al. (2003) also investigate several tests for compositionality, including one (simplexscore) based on the observation that compositional collocations tend to be similar in meaning to their constituent parts. They extract co-occurrence data for 111 phrasal verbs (e.g. rip off) and their simplex constituents (e.g. rip) from the BNC using RASP and calculate the value of simlin between each phrasal verb and its simplex constituent. The test simplexscore is used to rank the phrasal verbs according to their similarity with their simplex constituent. This ranking is correlated with human judgements of the compositionality of the phrasal verbs using Spearman's rank correlation coefficient. The value obtained (0.0525) is disappointing since it is not statistically significant (the probability of this value under the null hypothesis of &quot;no correlation&quot; is 0.3).2 However, Haspelmath (2002) notes that a compositional collocation is not just similar to one of its constituents: it can be considered to be a hyponym of its head constituent. For example, &quot;strong tea&quot; is a type of &quot;tea&quot; and &quot;to rip up&quot; is a way of &quot;ripping&quot;.</Paragraph> <Paragraph position="3"> Thus, we hypothesised that a distributional measure which tends to select more general terms as neighbours of the phrasal verb (e.g. recall) would do better than measures that tend to select more specific terms (e.g. precision) or measures that tend to select terms of a similar specificity (e.g. simlin or the harmonic mean of precision and recall).</Paragraph> <Paragraph position="4"> Table 5 shows the results of using different similarity measures with the simplexscore test and data of McCarthy et al.
(2003). We now see significant correlation between compositionality judgements and distributional similarity of the phrasal verb and its head constituent. The correlation using the recall measure is significant at the 5% level; thus we can conclude that if the simplex verb has high recall retrieval of the phrasal verb's co-occurrences, then the phrasal verb is likely to be compositional. The correlation score using the precision measure is negative, since we would not expect the simplex verb to be a hyponym of the phrasal verb and thus, if the simplex verb does have high precision retrieval of the phrasal verb's co-occurrences, it is less likely to be compositional.</Paragraph> <Paragraph position="5"> Finally, we obtained a very similar result (0.217) by ranking phrasals according to their inverse relative frequency with their simplex constituent (i.e., freq(simplex)/freq(phrasal)). Thus, it would seem that the three-way connection between distributional generality, hyponymy and relative frequency exists for verbs as well as nouns.</Paragraph> </Section> </Paper>