<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2411"> <Title>Calculating Semantic Distance between Word Sense Probability Distributions</Title> <Section position="4" start_page="0" end_page="3" type="metho"> <SectionTitle> 3 Sense Profile Distance </SectionTitle> <Paragraph position="0"> Our measure of sense profile distance (SPD) is designed to meet three criteria. First, it should capture fine-grained semantic similarity between profiles. Second, it should allow comparison between any sense profiles expressed as probability scores spread throughout a hierarchical ontology (such as WordNet), not just between profiles in a particular format such as tree cuts. Third, it should be a symmetric measure, making it more appropriate for a wide range of applications of sense profile comparison. To achieve these goals, we measure the distance as a tree distance between the two profiles within the hierarchy, weighted by the probability scores.</Paragraph> <Paragraph position="1"> (Note that we formulate a distance measure, while referring to a component of semantic similarity. We assume throughout the paper that WordNet node distance is the inverse of WordNet similarity, and indeed the similarity measures we use are directly invertible.) We illustrate with an example the differences between our measure and both McCarthy's (2000) method and general vector distance measures. Consider the two sense profiles in Figure 1, with Profile1 in square boxes and Profile2 in ovals. [Figure 1: two example sense profiles (figure not preserved). The italicized labels on nodes are WordNet classes; the single-letter labels are for reference in the text.]</Paragraph> <Paragraph position="2"> To calculate the vector distance between Profile1 and Profile2, we need two vectors of equal dimension. In McCarthy (2000), the distributions are propagated to the lowest common subsumers (i.e., the nodes labelled B, C, and D). (Footnote 2: Note that these are both tree cuts, so that we can compare McCarthy's method, but keep in mind that our method--as well as traditional vector distances--will apply to any probability distribution over a tree.) The vectors representing the two profiles become:</Paragraph> <Paragraph position="3"> Profile1 = Profile2 = (B: 0.5, C: 0.2, D: 0.3)</Paragraph> <Paragraph position="4"> Alternately, one can also increase the dimension of each profile to include all nodes in the hierarchy (or just the union of the profile nodes). The two profiles become:</Paragraph> <Paragraph position="5"> Profile1 = (B: 0.5, C: 0.2, D: 0, E: 0, F: 0, G: 0, H: 0.1, I: 0.2); Profile2 = (B: 0, C: 0, D: 0.3, E: 0.3, F: 0.2, G: 0.2, H: 0, I: 0)</Paragraph> <Paragraph position="6"> In the first method (that of McCarthy, 2000), the two profiles become identical. By generalizing the profiles to the lowest common subsumers, we lose information about the semantic specificity of the profile nodes and can no longer distinguish the semantic distance between the nodes across profiles. In the second method, the information about the hierarchical structure (of WordNet) is lost by treating each profile as a vector of nodes. Hence, vector distance measures fail to capture any semantic similarity across different nodes (e.g., the value of node B in Profile1 is not directly compared to the value of its child nodes E and F in Profile2).</Paragraph>
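<Paragraph> To make the second shortcoming concrete, here is a minimal Python sketch, assuming the illustrative Figure 1 scores given above (the node names and helper function are ours), that computes cosine over the union-of-nodes vectors:

```python
from math import sqrt

# Illustrative Figure 1 scores: Profile1 in square boxes, Profile2 in ovals.
profile1 = {"B": 0.5, "C": 0.2, "H": 0.1, "I": 0.2}
profile2 = {"D": 0.3, "E": 0.3, "F": 0.2, "G": 0.2}

def cosine_similarity(p, q):
    """Cosine over the union of the two profiles' nodes, treating each
    WordNet node as an independent dimension (no hierarchy information)."""
    nodes = set(p) | set(q)
    dot = sum(p.get(n, 0.0) * q.get(n, 0.0) for n in nodes)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

# The two profiles share no non-zero node, so cosine judges them maximally
# dissimilar (similarity 0), even though E and F are children of B.
print(cosine_similarity(profile1, profile2))  # 0.0
```
</Paragraph>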
<Paragraph position="7"> To remedy such shortcomings, our goal is to design a new distance measure that (i) compares the distributional differences between two profiles (somewhat similar to existing vector distances), and also (ii) captures the semantic distance between profiles. Intuitively, we can think of the profile distance as how far one profile (source) needs to "travel" to reach the other profile (destination). Formally, we define SPD as:</Paragraph> <Paragraph position="8"> SPD(Profile_{src}, Profile_{dest}) = \sum_{n \in Profile_{src}} \sum_{m \in Profile_{dest}} amount(n, m) \times distance(n, m) (1)</Paragraph> <Paragraph position="9"> where amount(n, m) is the portion of the profile score at node n in Profile_{src} that travels to node m in Profile_{dest}, and distance(n, m) is the semantic distance between node n and node m in the hierarchy. For now, it can be assumed that amount(n, m) is score(n), the entire probability score at node n. Note that we design the distance to be symmetric, so that the distance remains the same regardless of which profile is source and which is destination. (We present our distance measures below.) In the current example, we can propagate Profile2 (source) to Profile1 (destination) by moving its probabilities in this manner: (1) the probabilities at nodes E and F move to node B; (2) the probability at node G moves to node C; (3) the probability at node D moves to nodes H and I. The first two steps are straightforward--whenever there is one destination node in a propagation path, we simply multiply the amount moved by the distance of the path (distance(n, m)). For example, step 1 yields a contribution to SPD(Profile_{src}, Profile_{dest}) of score(E) \times distance(E, B) + score(F) \times distance(F, B). However, the last step, step 3, has multiple destination nodes (H and I), and the probability of the source node, D, must be appropriately apportioned between them. We take this into account in the amount function, by including a weight component:</Paragraph> <Paragraph position="10"> amount(n, m) = weight(m) \times portion(n) (2)</Paragraph> <Paragraph position="11"> where weight(m) is the weight of the destination node m and portion(n) is the portion of score(n) that we are moving. (For this example, we continue to assume that the full amount of score(n) is moved; we discuss portion(n) further below.) The weight of each destination node m is calculated as the proportion of its score in the sum of the scores of its sibling destination nodes.</Paragraph>
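<Paragraph> A small Python sketch of equation 2, assuming the illustrative Figure 1 scores (the helper names and the explicit sibling-set argument are ours):

```python
def weight(dest, sibling_dests, score):
    """Equation 2's weight: the proportion of dest's score within the
    scores of the destination nodes that split a common source."""
    return score[dest] / sum(score[d] for d in sibling_dests)

def amount(src, dest, sibling_dests, score):
    """Portion of the score at src that travels to dest; here the full
    score(src) is moved, as assumed in the running example."""
    return weight(dest, sibling_dests, score) * score[src]

# Step 3 of the example: D's probability is split between H and I.
score = {"D": 0.3, "H": 0.1, "I": 0.2}
print(amount("D", "H", ["H", "I"], score))  # 0.3 * (1/3) = 0.1
print(amount("D", "I", ["H", "I"], score))  # 0.3 * (2/3) = 0.2
```
</Paragraph>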
<Paragraph position="12"> Thus, in step 1 above, weight(B) and weight(C) are both 1, and the full amounts at E, F, and G are moved up. In the last step, however, the sibling nodes H and I have to split the input from node D: node H has weight score(H) / (score(H) + score(I)) = 0.1 / (0.1 + 0.2) = 1/3, and node I analogously has weight 2/3.</Paragraph> <Paragraph position="13"> Hence, the SPD propagating from Profile2 to Profile1 can be calculated as:</Paragraph> <Paragraph position="14"> SPD(Profile2, Profile1) = score(E) \times distance(E, B) + score(F) \times distance(F, B) + score(G) \times distance(G, C) + (1/3) score(D) \times distance(D, H) + (2/3) score(D) \times distance(D, I)</Paragraph> <Paragraph position="15"> For simplicity, we designed this example such that the two profiles are very similar. As a result, we end up propagating the entire source profile by propagating the full score of each of its nodes. (Footnote 3: We have described the algorithm as moving one profile to another. Conceptually, there are cases, as illustrated in the example, where we are propagating profile scores downwards in the hierarchy. Moving scores downwards can be computationally expensive because one may need to search through the whole subtree rooted at the source node for destination nodes. We implemented an alternative that moves all the scores upwards. Since we keep track of the source and destination nodes, the two methods are equivalent.)</Paragraph> <Paragraph position="17"> In practice, for most profile comparisons, we move only the portion of the score at each node necessary to make one profile resemble the other. Hence, portion(n) in the formula for amount(n, m) in equation 2 captures the difference between the probabilities at node n across the source and destination profiles.</Paragraph> <Paragraph position="18"> So far we have said little about the calculation of semantic distance between profile nodes (i.e., distance(n, m) in equation 1). Recall that one important goal in designing SPD is to capture semantic similarity between WordNet nodes. Naturally, we look to the current research comparing semantic similarity between word senses (e.g., Budanitsky and Hirst, 2001).</Paragraph> <Paragraph position="19"> We choose to implement two straightforward methods.</Paragraph> <Paragraph position="20"> For one, we invert (to obtain distance) the WordNet similarity measure of Wu and Palmer (1994), yielding:</Paragraph> <Paragraph position="21"> dist_{WP}(c_1, c_2) = \frac{depth(c_1) + depth(c_2)}{2 \times depth(lcs(c_1, c_2))} (3)</Paragraph> <Paragraph position="22"> where lcs(c_1, c_2) is the lowest common subsumer of c_1 and c_2. The other method we use is the simple edge distance between nodes, dist_edge. (Footnote 4: We also implemented the WordNet edge distance measure of Leacock and Chodorow (1998). Since it did not influence our results, we omit discussion of it here.)</Paragraph>
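<Paragraph> Both node distances reduce to depth bookkeeping in the hierarchy. A minimal sketch, assuming each node's parent link is known (the parent-dict representation and the root-depth-of-1 convention are our assumptions):

```python
def ancestors(node, parent):
    """Path from node up to the root, inclusive, via parent links."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lcs(n1, n2, parent):
    """Lowest common subsumer: first ancestor of n1 that is also above n2."""
    above_n2 = set(ancestors(n2, parent))
    return next(n for n in ancestors(n1, parent) if n in above_n2)

def depth(node, parent):
    return len(ancestors(node, parent))  # root has depth 1

def dist_wp(n1, n2, parent):
    """Inverted Wu-Palmer similarity (equation 3)."""
    d1, d2 = depth(n1, parent), depth(n2, parent)
    return (d1 + d2) / (2.0 * depth(lcs(n1, n2, parent), parent))

def dist_edge(n1, n2, parent):
    """Edge count on the path between n1 and n2 through their lcs."""
    c = depth(lcs(n1, n2, parent), parent)
    return (depth(n1, parent) - c) + (depth(n2, parent) - c)

# Toy fragment of the example tree: E and F under B, B under a root A.
parent = {"E": "B", "F": "B", "B": "A"}
print(dist_edge("E", "F", parent))  # 2
print(dist_wp("E", "B", parent))    # (3 + 2) / (2 * 2) = 1.25
```
</Paragraph>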
<Paragraph position="23"> Thus far, we have defined SPD as a sum of propagated profile scores multiplied by the distance "travelled" (equation 1). We have also considered propagating other values as a function of profile scores. Let us return to the same example, but redistribute some of the probability mass of Profile2: node E goes from a probability of 0.3 to 0.45, and node F goes from 0.2 to 0.05. As a result, the distribution of the scores in the node B subtree is more skewed towards node E than in the original Profile2. For both the original and the modified Profile2, SPD has the same value, because we are moving a total probability mass of 0.5 from E and F to B, over the same semantic distance (since E and F are at the same level in the tree). However, we consider that, at the node B subtree, Profile1 is less similar to the skewed Profile2 than to the original, more evenly distributed Profile2. To reflect this observation, we can propagate the "inverse entropy" in order to capture how evenly distributed the probabilities are in a subtree. We define an alternative version of SPD:</Paragraph> <Paragraph position="24"> SPD_{H}(Profile_{src}, Profile_{dest}) = \sum_{n} \sum_{m} weight(m) \times H^{-1}(n) \times distance(n, m) (4)</Paragraph> <Paragraph position="25"> where we replace portion(n) with the inverse entropy,</Paragraph> <Paragraph position="26"> H^{-1}(n) = 1 / ( - \sum_{n'} p(n') \log p(n') )</Paragraph> <Paragraph position="27"> where the sum ranges over the source nodes n' moved to the same destination as n, and p(n') are their normalized scores. By propagating inverse entropy, we penalize cases where the distribution of source scores is "skewed." In this work, we experiment with both methods of propagation (with and without inverse entropy).</Paragraph>
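<Paragraph> A minimal sketch of the inverse-entropy weighting, assuming the normalization over the scores moved together and the log base (both are our assumptions; the qualitative effect matches the skewed-versus-even example above):

```python
from math import log2

def inverse_entropy(scores):
    """1/H of the normalized scores moved from a set of sibling source
    nodes; a more skewed distribution has lower entropy H, hence a larger
    inverse entropy and a larger contribution to SPD_H."""
    total = sum(scores)
    probs = [s / total for s in scores if s > 0]
    h = -sum(p * log2(p) for p in probs)
    return float("inf") if h == 0 else 1.0 / h

# Original Profile2 subtree under B: E = 0.3, F = 0.2 (fairly even).
print(inverse_entropy([0.3, 0.2]))    # ~1.03
# Skewed variant: E = 0.45, F = 0.05 -- penalized by a larger weight.
print(inverse_entropy([0.45, 0.05]))  # ~2.13
```
</Paragraph>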
</Section> <Section position="5" start_page="3" end_page="3" type="metho"> <SectionTitle> 4 Materials and Methods </SectionTitle> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.1 Corpus Data </SectionTitle> <Paragraph position="0"> Our materials are drawn from a 35M-word portion of the British National Corpus (BNC). The text is parsed using the RASP parser (Briscoe and Carroll, 2002), and subcategorizations are extracted using the system of Briscoe and Carroll (1997). The subcategorization frame entry for each verb includes the frequency count and a list of argument heads per slot. The target slots in this work are the subject of the intransitive and the object of the transitive.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.2 Verb Selection </SectionTitle> <Paragraph position="0"> We evaluate our method on the causative alternation to allow comparison with the earlier method of McCarthy (2000). We selected target verbs by choosing semantic classes (not individual verbs) from Levin (1993) that are expected to undergo the causative alternation, and then selecting verbs randomly from these classes. We refer to these as causative verbs. For both our development and test sets, we chose filler verbs randomly, requiring that the verb classes they belong to do not allow a subject/object alternation such as the causative. Verbs must occur a minimum of 10 times in a target slot to be chosen.</Paragraph> <Paragraph position="1"> Note that we did not hand-verify that individual verbs allowed or disallowed the alternation, as McCarthy (2000) had done, because we wanted to evaluate our method in the presence of noise of this kind.</Paragraph> <Paragraph position="2"> In a pilot experiment on a smaller, domain-specific corpus (6M words, medical domain) (Tsang and Stevenson, 2004), we randomly picked 18 causative and 18 filler verbs for development, and 20 causative and 20 filler verbs for testing. In this pilot experiment, SPD was consistently the best performer in both development and testing, achieving a best accuracy of 69% in development and 65% in testing (chance accuracy of 50%).</Paragraph> <Paragraph position="3"> Given more data (35M words) in our current experiments, we randomly select additional verbs to make up a total of 60 causative verbs and 60 filler verbs, half for development and half for testing. Each set of verbs is further divided into a high frequency band (at least 450 instances of one target slot), a medium frequency band (between 150 and 400 instances of one target slot), and a low frequency band (between 20 and 100 instances of one target slot). Each band has 20 verbs (10 causative and 10 non-causative). For each of the development and testing phases, we experiment with the individual frequency bands (i.e., high, medium, and low, separately), and with mixed frequencies (i.e., all verbs). To compare with our earlier results, we also experiment on the pilot development verbs (36 verbs). Note that in the BNC these verbs are not evenly distributed across the bands; hence we can only use them in the mixed frequencies condition.</Paragraph> </Section> <Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.3 Experimental Set-Up </SectionTitle> <Paragraph position="0"> Using (verb, slot, noun) tuples from the corpus, we experimented with several ways of building sense profiles of each verb's target argument slots (Resnik, 1993; Li and Abe, 1998; Clark and Weir, 2002). (Footnote 5: Although Resnik's measure is not a probability distribution, his method for populating the WordNet hierarchy from corpus counts does yield a probability distribution.) In both our pilot experiment and current development work, we found that the method of Clark and Weir (2002) gave better overall performance, so we limit our discussion here to the results on their model. Briefly, Clark and Weir (2002) populate the WordNet hierarchy based on corpus frequencies (of all nouns for a verb/slot pair), and then determine the appropriate probability estimate at each node in the hierarchy by using a chi-squared test to decide whether to generalize an estimate to a parent node.</Paragraph> <Paragraph position="1"> We compare SPD to other measures applied directly to the (unpropagated) probability profiles given by the Clark-Weir method: the probability distribution distance given by skew divergence (skew) (Lee, 1999), and the general vector distance given by cosine (cos). These are the measures (aside from SPD) that performed best in our pilot experiments.</Paragraph> <Paragraph position="2"> It is worth noting that the method of Clark and Weir (2002) does not yield a tree cut, but instead generally populates the WordNet hierarchy with non-zero probabilities. This means that the kind of straightforward propagation method used by McCarthy (2000) is not applicable to sense profiles of this type.</Paragraph>
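<Paragraph> For reference, a minimal sketch of the skew baseline above, following Lee's (1999) definition (the smoothing constant alpha = 0.99 is the conventional setting and an assumption here, as is representing a profile as a dict over WordNet nodes):

```python
from math import log2

def skew_divergence(q, r, alpha=0.99):
    """s_alpha(q, r) = KL(r || alpha*q + (1 - alpha)*r) (Lee, 1999).
    Mixing a little of r into q keeps the divergence finite when q
    assigns zero probability to a node that r covers."""
    d = 0.0
    for node, r_n in r.items():
        if r_n > 0:
            mix = alpha * q.get(node, 0.0) + (1 - alpha) * r_n
            d += r_n * log2(r_n / mix)
    return d

# Identical profiles diverge by 0; disjoint profiles diverge sharply.
p = {"B": 0.5, "C": 0.2, "H": 0.1, "I": 0.2}
print(skew_divergence(p, p))            # 0.0
print(skew_divergence({"D": 1.0}, p))   # ~6.64 bits: no shared nodes
```
</Paragraph>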
<Paragraph position="3"> To determine whether a verb participates in the causative alternation, we adopt McCarthy's method of using a threshold over the calculated distance measures, testing both the mean and median distances as possible thresholds. In our case, verbs with slot-distances below the threshold (smaller distances) are classified as causative, and those above the threshold as non-causative. In both our pilot and development work, median thresholds consistently fare better than mean thresholds; hence we narrow our discussion here to the median only. Using the median also has the advantage of yielding a consistent 50% baseline. Accuracy is used as the performance measure.</Paragraph> <Paragraph position="4"> [Table 1 caption (table not preserved): ... the measure(s) that produce that result, using a median threshold. SPD refers to SPD without entropy, using either dist_WP or dist_edge. "all", "high", "high-med", "med", and "low" refer to the different frequency bands.]</Paragraph> </Section> </Section> </Paper>