<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1202">
  <Title>The Distributional Similarity of Sub-Parses</Title>
  <Section position="4" start_page="7" end_page="8" type="metho">
    <SectionTitle>
3 Proposal
</SectionTitle>
    <Paragraph position="0"> Recently, there has been much interest in finding words which are distributionally similar e.g., Lin (1998), Lee (1999), Curran and Moens (2002), Weeds (2003) and Geffet and Dagan (2004). Two words are said to be distributionally similar if they appear in similar contexts. For example, the two words apple and pear are likely to be seen as the objects of the verbs eat and peel, and this adds to their distributional similarity. The Distributional Hypothesis (Harris, 1968) proposes a connection between distributional similarity and semantic similarity, which is the basis for a large body of work on automatic thesaurus construction using distributional similarity methods (Curran and Moens, 2002; Weeds, 2003; Geffet and Dagan, 2004).</Paragraph>
    <Paragraph position="1"> Our proposal is that just as words have distributional similarity which can be used, with at least some success, to estimate semantic similarity, so do larger units of expression. We propose that the unit of interest is a sub-parse, i.e., a fragment (connected subgraph) of a parse tree, which can range in size from a single word to the parse for the entire sentence. Figure 1 shows the parses for the clauses &quot;my mobile phone needs charging&quot; and &quot;my mobile phone battery is low&quot; and highlights the fragments (&quot;needs charging&quot; and &quot;battery is low&quot;) for which we might be interested in finding similarity.</Paragraph>
    <Paragraph position="2"> In our model, we define the features or contexts of a sub-parse to be the grammatical relations between any component of the sub-parse and any word outside of the sub-parse. In the example above, both sub-parses would have features based on their grammatical relation with the word phone. The level of granularity at which to consider grammatical relations remains a matter for investigation. For example, it might turn out to be better to distinguish between all types of dependent or, alternatively, it might be better to have a single class which covers all dependents. We also consider the parents of the sub-parse as features. In the example &quot;Send me an email if my mobile phone battery is low,&quot; this would be that the sub-parse modifies the verb send, i.e., it has the feature &lt;mod-of, send&gt;.</Paragraph>
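    <Paragraph> The feature definition above can be illustrated with a minimal sketch. It is not the extractor used in our experiments: the triple representation of a parse, the function name subparse_features, the relation labels and the analysis of the example sentence are all assumptions made for illustration.

```python
# Hypothetical representation: a parse is a list of (label, head, dependent)
# triples over word positions, so repeated word forms stay distinct.
def subparse_features(relations, words, inside):
    """Return (relation, outside_word) features for a sub-parse.

    relations: iterable of (label, head_index, dependent_index)
    words:     list of word forms (or lemmas), indexed by position
    inside:    set of positions belonging to the sub-parse
    """
    features = []
    for label, head, dep in relations:
        head_in, dep_in = head in inside, dep in inside
        if head_in and not dep_in:
            # An outside word depends on a component of the sub-parse.
            features.append((label, words[dep]))
        elif dep_in and not head_in:
            # The sub-parse depends on an outside parent.
            features.append((label + "-of", words[head]))
    return features


# "Send me an email if my mobile phone battery is low"
words = ["send", "me", "an", "email", "if", "my",
         "mobile", "phone", "battery", "is", "low"]
relations = [
    ("mod", 0, 9),     # assumed: the clause headed by "is" modifies "send"
    ("ncsubj", 9, 8),  # assumed: "battery" is the subject of "is"
    ("xcomp", 9, 10),  # assumed: "low" complements "is"
    ("ncmod", 8, 7),   # assumed: "phone" modifies "battery"
]
inside = {8, 9, 10}    # positions of the sub-parse "battery is low"
features = subparse_features(relations, words, inside)
```

    Relations whose two ends both lie inside the sub-parse are skipped, since they describe the sub-parse itself rather than its context.</Paragraph>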
    <Paragraph position="3"> Having defined these models for the unit of interest, the sub-parse, and for the context of a sub-parse, we can build up co-occurrence vectors for sub-parses in the same way as for words. A co-occurrence vector is a conglomeration (with frequency counts) of all of the co-occurrences of the target unit found in a corpus. The similarity between two such vectors or descriptions can then be found using a standard distributional similarity measure (see Weeds (2003)).</Paragraph>
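    <Paragraph> A co-occurrence vector of this kind can be sketched as a simple frequency count. The function name cooccurrence_vector and the feature occurrences below are invented for illustration and are not data from our corpora.

```python
from collections import Counter


def cooccurrence_vector(feature_occurrences):
    """Count how often each (relation, word) feature was seen with a target."""
    return Counter(feature_occurrences)


# Made-up feature occurrences for a single target sub-parse.
occurrences = [
    ("ncmod", "phone"),
    ("mod-of", "send"),
    ("ncmod", "phone"),
    ("ncmod", "laptop"),
]
vector = cooccurrence_vector(occurrences)
num_types = len(vector)            # number of distinct features
num_tokens = sum(vector.values())  # total number of feature occurrences
```

    The same structure serves for single words and for larger sub-parses, since only the way the feature occurrences are collected differs.</Paragraph>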
    <Paragraph position="4"> The use of distributional evidence for larger units than words is not new. Szpektor et al. (2004) automatically identify anchors in web corpus data. Anchors are lexical elements that describe the context of a sentence and, if words are found to occur with the same set of anchors, they are assumed to be paraphrases. For example, the anchor set {Mozart, 1756} is a known anchor set for verbs with the meaning &quot;born in&quot;. However, this use of distributional evidence requires both anchors, or contexts, to occur simultaneously with the target word. This differs from the standard notion of distributional similarity, which involves finding similarity between co-occurrence vectors, where there is no requirement for two features or contexts to occur simultaneously.</Paragraph>
    <Paragraph position="5"> Our work with distributional similarity is a generalisation of the approach taken by Lin and Pantel (2001). These authors apply the distributional similarity principle to paths in a parse tree. A path exists between two words if there are grammatical relations connecting them in a sentence. For example, in the sentence &quot;John found a solution to the problem,&quot; there is a path between &quot;found&quot; and &quot;solution&quot; because solution is the direct object of found. Contexts of this path, in this sentence, are then the grammatical relations &lt;ncsubj, John&gt; and &lt;iobj, problem&gt; because these are grammatical relations associated with either end of the path. In their work on QA, Lin and Pantel restrict the grammatical relations considered to two &quot;slots&quot; at either end of the path where the word occupying the slot is a noun.</Paragraph>
    <Paragraph position="6"> Co-occurrence vectors for paths are then built up using evidence from multiple occurrences of the paths in corpus data, for which similarity can then be calculated using a standard metric (e.g., Lin (1998)).</Paragraph>
    <Paragraph position="7"> In our work, we extend the notion of distributional similarity from linear paths to trees. This allows us to compute distributional similarity for any part of an expression, of arbitrary length and complexity (although, in practice, we are still limited by data sparseness). Further, we do not make any restrictions as to the number or types of the grammatical relation contexts associated with a tree.</Paragraph>
  </Section>
  <Section position="5" start_page="8" end_page="10" type="metho">
    <SectionTitle>
4 Empirical Evidence
</SectionTitle>
    <Paragraph position="0"> Practically demonstrating our proposal requires a source of paraphrases. We first looked at the MSR paraphrase corpus (Dolan et al., 2004) since it contains a large number of sentences close enough in meaning to be considered paraphrases. However, inspection of the data revealed that the lexical overlap between the pairs of paraphrasing sentences in this corpus is very high. The average word overlap (i.e., the proportion of exactly identical word forms) calculated over the sentences paired by humans in the training set is 0.70, and the lowest overlap for such sentences is 0.3. This high word overlap makes this a poor source of examples for us, since we wish to study similarity between phrases which do not share semantically similar words.</Paragraph>
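    <Paragraph> As a rough illustration of a word overlap score, the sketch below compares the sets of exactly identical word forms in two sentences. The normalisation is an assumption: the shared forms are divided by the size of the union of the two sets, which may differ from the calculation behind the figures reported here. Whitespace tokenisation and the function name word_overlap are likewise assumptions.

```python
def word_overlap(sentence_a, sentence_b):
    """Proportion of word forms shared by two sentences.

    Assumption: shared forms are divided by the union of the two
    sets of forms; tokens are split on whitespace, without lowercasing.
    """
    forms_a = set(sentence_a.split())
    forms_b = set(sentence_b.split())
    if not forms_a or not forms_b:
        return 0.0
    shared = forms_a.intersection(forms_b)
    return len(shared) / len(forms_a.union(forms_b))


overlap = word_overlap(
    "my mobile phone needs charging",
    "my mobile phone battery is low",
)
```

    A score of 0 means that the two sentences share no word form at all.</Paragraph>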
    <Paragraph position="1"> Consequently, for our purposes, the Pascal Textual Entailment Recognition Challenge dataset is a more suitable source of paraphrase data. Here the average word overlap between textually entailing sentences is 0.39 and the lowest overlap is 0. This allows us to easily find pairs of sub-parses which do not share similar words. For example, in paraphrase pair id.19, we can see that &quot;reduce the risk of diseases&quot; entails &quot;has health benefits&quot;. Similarly, in pair id.20, &quot;may keep your blood glucose from rising too fast&quot; entails &quot;improves blood sugar control,&quot; and in id.570, &quot;charged in the death of&quot; entails &quot;accused of having killed.&quot; In this last example there is semantic similarity between the words used. The word charged is semantically similar to accused. However, it is not possible to swap the two words in these contexts since we do not say &quot;charged of having killed.&quot; Further, there is an obvious semantic connection between the words death and killed, but being different parts of speech this would be easily missed by traditional distributional methods.</Paragraph>
    <Paragraph position="2"> Consequently, in order to demonstrate the potential of our method, we have taken the phrases &quot;reduce the risk of diseases&quot;, &quot;has health benefits&quot;, &quot;charged in the death of&quot; and &quot;accused of having killed&quot;, constructed corpora for the phrases and their components and then computed distributional similarity between pairs of phrases and their respective components. Under our hypotheses, paraphrases will be more similar than non-paraphrases and there will be no clear relation between the similarity of phrases as a whole and the similarity of their components.</Paragraph>
    <Paragraph position="3"> We now discuss corpus construction and distributional similarity calculation in more detail.</Paragraph>
    <Section position="1" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
4.1 Corpus Construction
</SectionTitle>
      <Paragraph position="0"> In order to compute distributional similarity between sub-parses, we need to have seen a large number of occurrences of each sub-parse. Since data sparseness rules out using traditional corpora, such as the British National Corpus (BNC), we constructed a corpus for each phrase by mining the web. We also constructed a similar corpus for each component of each phrase. For example, for phrase 1, we constructed corpora for &quot;reduce the risk of diseases&quot;, &quot;reduce&quot; and &quot;the risk of diseases&quot;. We do this in order to avoid only having occurrences of the components in the context of the larger phrase. Each corpus was constructed by sending the phrase as a quoted string to Altavista. We took the returned list of URLs (up to the top 1000 where more than 1000 could be returned), removed duplicates and then downloaded the associated files. We then searched the files for the lines containing the relevant string and added each of these to the corpus file for that phrase. Each corpus file was then parsed using the RASP parser (version 3.b) ready for feature extraction.</Paragraph>
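      <Paragraph> The filtering steps of this procedure can be sketched as follows. The sketch leaves out the search engine query and the downloads; the function names, the example URLs and the example text are hypothetical, and case-sensitive substring matching is an assumption.

```python
def dedupe_urls(urls, limit=1000):
    """Keep at most `limit` returned URLs, then drop duplicates in order."""
    return list(dict.fromkeys(urls[:limit]))


def matching_lines(document_text, phrase):
    """Return the lines of a downloaded file that contain the phrase."""
    return [
        line.strip()
        for line in document_text.splitlines()
        if phrase in line  # assumption: exact, case-sensitive match
    ]


# Hypothetical search results and file contents.
urls = [
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/a",
]
unique_urls = dedupe_urls(urls)

text = (
    "Fruit may reduce the risk of diseases.\n"
    "An unrelated line.\n"
    "  Exercise can reduce the risk of diseases too.\n"
)
corpus_lines = matching_lines(text, "reduce the risk of diseases")
```

      Each retained line would then be appended to the corpus file for the phrase before parsing.</Paragraph>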
    </Section>
    <Section position="2" start_page="9" end_page="10" type="sub_section">
      <SectionTitle>
4.2 Computing Distributional Similarity
</SectionTitle>
      <Paragraph position="0"> First, a feature extractor is run over each parsed corpus file to extract occurrences of the sub-parse and their features. The feature extractor reads in a template for each phrase in the form of dependency relations over lemmas. It checks each sentence parse against the template (taking care that the same word form is indeed the same occurrence of the word in the sentence). When a match is found, the other grammatical relations for each word in the sub-parse are output as features. When the sub-parse is only a word, the process is simplified to finding grammatical relations containing that word.</Paragraph>
      <Paragraph position="1"> The raw feature file is then converted into a co-occurrence vector by counting the occurrences of each feature type. Table 1 shows the number of feature types and tokens extracted for each phrase. This shows that we have extracted a reasonable number of features for each phrase, since distributional similarity techniques have been shown to work well for words which occur more than 100 times in a given corpus (Lin, 1998; Weeds and Weir, 2003).</Paragraph>
      <Paragraph position="2"> We then computed the distributional similarity between each pair of co-occurrence vectors using the α-skew divergence measure (Lee, 1999). The α-skew divergence measure is an approximation to the Kullback-Leibler (KL) divergence measure between two distributions p and q:</Paragraph>
      <Paragraph position="3"> $$D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$</Paragraph>
      <Paragraph position="4"> The α-skew divergence measure is designed to be used when unreliable maximum likelihood estimates (MLE) of probabilities would result in the KL divergence being equal to ∞. It is defined as:</Paragraph>
      <Paragraph position="5"> $$dist_{\alpha}(q, r) = D(r \,\|\, \alpha \cdot q + (1 - \alpha) \cdot r)$$</Paragraph>
      <Paragraph position="6"> where 0 ≤ α ≤ 1. We use α = 0.99, since this provides a close approximation to the KL divergence measure. The result is a number greater than or equal to 0, where 0 indicates that the two distributions are identical. In other words, a smaller distance indicates greater similarity.</Paragraph>
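      <Paragraph> A minimal sketch of this computation is given below. It assumes the formulation of Lee (1999), in which the divergence of r from the mixture α.q + (1 - α).r is taken, and it assumes base-2 logarithms, which the text does not specify. The function names and the feature counts are invented for illustration.

```python
import math


def to_distribution(counts):
    """Turn raw feature counts into maximum likelihood probabilities."""
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}


def alpha_skew(q, r, alpha=0.99):
    """D(r || alpha*q + (1 - alpha)*r), with base-2 logs (assumed)."""
    distance = 0.0
    for feature, r_x in r.items():
        if r_x == 0:
            continue  # zero-probability events contribute nothing
        # The mixture is never zero where r_x is positive, so the
        # result stays finite even when q has not seen the feature.
        mixed = alpha * q.get(feature, 0.0) + (1 - alpha) * r_x
        distance += r_x * math.log2(r_x / mixed)
    return distance


# Invented counts for two targets that share one feature.
q = to_distribution({("dobj", "risk"): 3, ("ncsubj", "diet"): 1})
r = to_distribution({("dobj", "risk"): 1, ("ncsubj", "fruit"): 1})
distance = alpha_skew(q, r)

# Two targets with no shared features still get a finite distance.
disjoint = alpha_skew({"a": 1.0}, {"b": 1.0})
```

      The measure is asymmetric, so alpha_skew(q, r) and alpha_skew(r, q) will generally differ.</Paragraph>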
      <Paragraph position="7"> The reason for choosing this measure is that it can be used to compute the distance between any two co-occurrence vectors independent of any information about other words. This is in contrast to many other measures, e.g., Lin (1998), which use the co-occurrences of features with other words to compute a weighting function such as mutual information (MI) (Church and Hanks, 1989). Since we only have corpus data for the target phrases, it is not possible for us to use such a measure. However, the α-skew divergence measure has been shown (Weeds, 2003) to perform comparably with measures which use MI, particularly for lower frequency target words.</Paragraph>
    </Section>
    <Section position="3" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
4.3 Results
</SectionTitle>
      <Paragraph position="0"> The results, in terms of α-skew divergence scores between pairs of phrases, are shown in Table 2. Each set of three lines shows the similarity score between a pair of phrases and then between respective pairs of components. In the first two sets, the phrases are paraphrases whereas in the last two sets, the phrases are not.</Paragraph>
      <Paragraph position="1"> From the table, there does appear to be some potential in the use of distributional similarity between sub-parses to identify potential paraphrases. In the final two examples, the paired phrases are not semantically similar, and as we would expect, their respective distributional similarities are less (i.e., they are further apart) than in the first two examples.</Paragraph>
      <Paragraph position="2"> Further, we can see that there is no clear relation between the similarity of two phrases and the similarity of respective components. However, in 3 out of 4 cases, the similarity between the phrases lies between that of their components. In every case, the similarity of the phrases is less than the similarity of the verbal components. This might be what one would expect for the second example since the components &quot;charged in&quot; and &quot;accused of&quot; are semantically similar. However, in the first example, we would have expected to see that the similarity between &quot;reduce the risk of diseases&quot; and &quot;has health benefits&quot; to be greater than either pair of components, which it is not. The reason for this is not clear from just these examples. However, possibilities include the distributional similarity measure used, the features selected from the corpus data and a combination of both. It may be that single words tend to exhibit greater similarity than phrases due to their greater relative frequencies. As a result, it may be necessary to factor the length or frequency of a sub-parse into distributional similarity calculations or comparisons thereof.</Paragraph>
      <Paragraph position="3"> Table 2: α-skew divergence scores between pairs of phrases
        Phrase 1 | Phrase 2 | Dist.
        reduce the risk of diseases | has health benefits | 5.28
        reduce | has | 4.95
        the risk of diseases | health benefits | 5.58
        charged in the death of | accused of having killed | 5.07
        charged in | accused of | 4.86
        the death of | having killed | 6.16
        charged in the death of | has health benefits | 6.04
        charged in | has | 5.54
        the death of | health benefits | 4.70
        reduce the risk of diseases | accused of having killed | 6.09
        reduce | accused of | 5.77
        the risk of diseases | having killed | 6.31
      </Paragraph>
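      <Paragraph> The two observations about Table 2 can be checked mechanically. The sketch below copies the scores from the table; the variable and function names are our own and purely illustrative.

```python
# (phrase distance, verbal-component distance, nominal-component distance),
# copied from Table 2.
rows = {
    "reduce the risk of diseases / has health benefits": (5.28, 4.95, 5.58),
    "charged in the death of / accused of having killed": (5.07, 4.86, 6.16),
    "charged in the death of / has health benefits": (6.04, 5.54, 4.70),
    "reduce the risk of diseases / accused of having killed": (6.09, 5.77, 6.31),
}


def lies_between(phrase, first, second):
    """True when the phrase distance is the middle of the three values."""
    return sorted([phrase, first, second])[1] == phrase


# Number of pairs whose phrase score falls between the component scores.
between_count = sum(lies_between(*scores) for scores in rows.values())

# A larger distance means lower similarity, so this checks that every
# phrase pair is less similar than its verbal components.
phrase_further_than_verbs = all(p > v for p, v, _ in rows.values())
```

      The only pair that falls outside the range of its components is &quot;charged in the death of&quot; with &quot;has health benefits&quot;.</Paragraph>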
    </Section>
  </Section>
</Paper>