<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1203">
  <Title>Automatic Identification of Non-Compositional Multi-Word Expressions using Latent Semantic Analysis</Title>
  <Section position="4" start_page="12" end_page="13" type="metho">
    <SectionTitle>
2 Previous work
</SectionTitle>
    <Paragraph position="0"> Recent work which attempts to discriminate between compositional and non-compositional MWEs includes Lin (1999), who used mutual-information measures to identify such phrases, Baldwin et al. (2003), who compare the distribution of the head of the MWE with the distribution of the entire MWE, and Villada Moirón &amp; Tiedemann (2006), who use a word-alignment strategy over parallel texts to identify non-compositional MWEs. Schone &amp; Jurafsky (2001) applied LSA to MWE identification, although they did not focus on distinguishing compositional from non-compositional MWEs.</Paragraph>
    <Paragraph position="1"> Lin's goal, like ours, was to discriminate non-compositional MWEs from compositional MWEs.</Paragraph>
    <Paragraph position="2"> His method was to compare the mutual information measure of the constituent parts of an MWE with the mutual information of similar expressions obtained by substituting one of the constituents with a related word obtained by thesaurus lookup.</Paragraph>
    <Paragraph position="3"> The hope was that a significant difference between these measures, as in the case of red tape (mutual information: 5.87) compared to yellow tape (3.75) or orange tape (2.64), would be characteristic of non-compositional MWEs. Although intuitively appealing, Lin's algorithm achieves precision and recall of only 15.7% and 13.7%, respectively (as compared to a gold standard generated from an idiom dictionary--but see below for discussion).</Paragraph>
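Lin's measure can be illustrated with pointwise mutual information; a minimal sketch (the probabilities below are invented for illustration, and Lin's actual computation used dependency triples from a parsed corpus):

```python
import math

def pmi(p_joint, p_x, p_y):
    """Pointwise mutual information of a two-word combination."""
    return math.log2(p_joint / (p_x * p_y))

# Invented corpus probabilities illustrating the substitution test:
# the original phrase scores much higher than a thesaurus substitute.
print(round(pmi(1e-06, 1e-04, 1e-04), 2))  # original phrase, e.g. "red tape" -> 6.64
print(round(pmi(2e-08, 1e-04, 1e-04), 2))  # substitute, e.g. "yellow tape" -> 1.0
```

A large drop under substitution is what the method takes as a sign of non-compositionality.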
    <Paragraph position="4"> Schone &amp; Jurafsky (2001) evaluated a number of co-occurrence-based metrics for identifying MWEs, showing that, as suggested by Lin's results, there was need for improvement in this area. Since LSA has been used to good effect in a number of meaning-related language tasks (Landauer and Dumais, 1997; Landauer and Psotka, 2000; Cederberg and Widdows, 2003), they hoped to improve their results by identifying non-compositional expressions using a method similar to the one we explore here. Although they do not demonstrate that this method actually identifies non-compositional expressions, they do show that the LSA similarity technique improves MWE identification only minimally.</Paragraph>
    <Paragraph position="5"> Baldwin et al. (2003) focus more narrowly on distinguishing English noun-noun compounds and verb-particle constructions which are compositional from those which are not. Their approach is methodologically similar to ours, in that they compute similarity on the basis of contexts of occurrence, making use of LSA.</Paragraph>
    <Paragraph position="6"> Their hypothesis is that high LSA-based similarity between the MWE and each of its constituent parts is indicative of compositionality. They evaluate their technique by assessing the correlation between high semantic similarity of the constituents of an MWE to the MWE as a whole and the likelihood that the MWE appears in WordNet as a hyponym of one of its constituents. While the expected correlation was not attested, we suspect this to be more an indication of the inappropriateness of the evaluation used than of the faultiness of the general approach.</Paragraph>
    <Paragraph position="7"> Lin, Baldwin et al., and Schone &amp; Jurafsky all use as their gold standard either idiom dictionaries or WordNet (Fellbaum, 1998). While Schone &amp; Jurafsky show that WordNet is as good a standard as any of a number of machine-readable dictionaries, none of these authors shows that the MWEs that appear in WordNet (or in the MRDs) are generally non-compositional in the relevant sense. As noted by Sag et al. (2002), many MWEs are simply &amp;quot;institutionalized phrases&amp;quot; whose meanings are perfectly compositional, but whose frequency of use (or other non-linguistic factors) makes them highly salient. It is certainly clear that many MWEs that appear in WordNet--examples being law student, medical student, and college man--are perfectly compositional semantically.</Paragraph>
    <Paragraph position="8"> Zhai (1997), in an early attempt to apply statistical methods to the extraction of non-compositional MWEs, made use of what we take to be a more appropriate evaluation metric. In his comparison of a number of different heuristics for identifying non-compositional noun-noun compounds, Zhai evaluated each heuristic by applying it to a corpus of items hand-classified as to their compositionality. Although Zhai's classification appears to be problematic, we take this to be the appropriate paradigm for evaluation in this domain, and we adopt it here.</Paragraph>
  </Section>
  <Section position="5" start_page="13" end_page="16" type="metho">
    <SectionTitle>
3 Procedure
</SectionTitle>
    <Paragraph position="0"> In our work we made use of the Word Space model of (semantic) similarity (Schütze, 1998) and extended it slightly to MWEs. In this framework, &amp;quot;meaning&amp;quot; is modeled as an n-dimensional vector, derived via singular value decomposition (Deerwester et al., 1990) from word co-occurrence counts for the expression in question, a technique frequently referred to as Latent Semantic Analysis (LSA). This kind of dimensionality reduction has been shown to improve performance in a number of text-based domains (Berry et al., 1999).</Paragraph>
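The dimensionality-reduction step can be sketched as follows; this is a minimal illustration using numpy's SVD, not the software used in the paper, and the toy matrix and k are invented for the example:

```python
import numpy as np

def lsa_vectors(cooc, k):
    """Reduce a (words x context-dimensions) co-occurrence matrix
    to k-dimensional "meaning" vectors via truncated SVD."""
    U, S, Vt = np.linalg.svd(cooc, full_matrices=False)
    # Keep only the k strongest latent dimensions, scaled by their
    # singular values.
    return U[:, :k] * S[:k]

# Toy matrix: 4 target words counted against 3 context dimensions.
cooc = np.array([[2.0, 0.0, 1.0],
                 [3.0, 0.0, 1.0],
                 [0.0, 4.0, 2.0],
                 [0.0, 3.0, 2.0]])

vectors = lsa_vectors(cooc, k=2)
print(vectors.shape)  # (4, 2)
```

Each row of the result is the reduced meaning vector for one target word or MWE.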
    <Paragraph position="1"> For our experiments we used a local German newspaper corpus.2 We built our LSA model with the Infomap software package,3 using the 1000 most frequent words, excluding those on a hand-generated stop list, as the content-bearing dimension words (the columns of the matrix). The 20,000 most frequent content words were assigned row values by counting occurrences within a 30-word window. SVD was used to reduce the dimensionality from 1000 to 100, resulting in 100-dimensional &amp;quot;meaning&amp;quot; vectors for each word. In our experiments, MWEs were assigned meaning vectors as a whole, using the same procedure. For meaning similarity we adopt the standard measure of the cosine of the angle between two vectors (the normalized correlation coefficient) as a metric (Schütze, 1998; Baeza-Yates and Ribeiro-Neto, 1999). On this metric, two expressions are taken to be unrelated if their meaning vectors are orthogonal (the cosine is 0) and synonymous if their vectors are parallel (the cosine is 1).</Paragraph>
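The similarity metric itself is straightforward; a minimal sketch (the two-dimensional vectors are invented for illustration, whereas the model's real vectors are 100-dimensional):

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two meaning vectors
    (the normalized correlation coefficient)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine([1, 0], [0, 1]))  # 0.0 -> orthogonal: unrelated
print(cosine([2, 0], [1, 0]))  # 1.0 -> parallel: synonymous
```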
    <Paragraph position="2"> Figure 1 illustrates such a vector space in two dimensions. Note that the meaning vector for Löffel 'spoon' is quite similar to that for essen 'to eat' but distant from sterben 'to die', while the meaning vector for the MWE den Löffel abgeben is close to that for sterben. Indeed, den Löffel abgeben, like to kick the bucket, is a non-compositional idiom meaning 'to die'.</Paragraph>
    <Paragraph position="3"> While den L&amp;quot;offel abgeben is used almost exclusively in its idiomatic sense (all four occurrences in our corpus), many MWEs are used regularly in both their idiomatic and in their literal senses. About two thirds of the uses of the MWE ins Wasser fallen in our corpus are idiomatic uses, and the remaing one third are literal uses. In our first experiment we tested the hypothesis that these uses could reliably be distinguished using distribution-based models of their meaning.</Paragraph>
    <Section position="1" start_page="14" end_page="14" type="sub_section">
      <SectionTitle>
3.1 Experiment I
</SectionTitle>
      <Paragraph position="0"> For this experiment we manually annotated the 67 occurrences of ins Wasser fallen in our corpus as to whether the expression was used compositionally (literally) or non-compositionally (idiomatically).4 Marking this distinction, we generated an LSA meaning vector for the compositional uses and an LSA meaning vector for the non-compositional uses of ins Wasser fallen. The vectors turned out, as expected, to be almost orthogonal, with a cosine of 0.02 for the angle between them. This result confirms that the linguistic contexts in which the literal and the idiomatic uses of ins Wasser fallen appear are very different, indicating--not surprisingly--that the semantic difference between the literal meaning and the idiomatic meaning is reflected in the way these phrases are used.</Paragraph>
      <Paragraph position="1"> Our next task was to investigate whether this difference could be used in particular cases to determine the intended use of an MWE in a given context. To evaluate this, we did a 10-fold cross-validation study, calculating the literal and idiomatic vectors for ins Wasser fallen on the basis of the training data and doing a simple nearest-neighbor classification of each member of the test set on the basis of the meaning vector computed from its local context (the 30-word window). Our result of an average accuracy of 72% for the LSA-based classifier far exceeds the simple maximum-likelihood baseline of 58%.</Paragraph>
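The nearest-neighbor step of each cross-validation fold can be sketched as follows (the sense centroids and token vectors are invented for illustration; in the experiment they would come from the hand-annotated training occurrences and the 30-word-window contexts):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def classify(token_vec, centroids):
    """Assign a token to the sense whose centroid it is most similar to."""
    return max(centroids, key=lambda sense: cosine(token_vec, centroids[sense]))

# Invented sense centroids (in practice: LSA vectors built from the
# training occurrences of each sense in the current fold).
centroids = {
    "literal":   np.array([1.0, 0.1]),
    "idiomatic": np.array([0.1, 1.0]),
}

print(classify(np.array([0.9, 0.2]), centroids))  # literal
print(classify(np.array([0.1, 0.8]), centroids))  # idiomatic
```

Accuracy over the 10 folds is then just the fraction of held-out tokens classified into their annotated sense.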
      <Paragraph position="2"> In the final part of this experiment we compared the meaning vector computed by summing over all uses of ins Wasser fallen with the literal and idiomatic vectors from above. Since idiomatic uses of ins Wasser fallen prevail in the corpus (2/3 vs. 1/3), it is not surprising that the similarity to the literal vector (0.0946) is much lower than the similarity to the idiomatic vector (0.3712).</Paragraph>
      <Paragraph position="3"> To summarize, Experiment I, which is a variant of a supervised phrase sense disambiguation task, demonstrates that we can use LSA to distinguish between the literal and the idiomatic usage of an MWE. [Footnote 4: The occurrences were annotated independently, with very high agreement--a kappa score of over 0.95 (Carletta, 1996). Occurrences on which the annotators disagreed were thrown out. Of the 64 occurrences we used, 37 were idiomatic and 27 were literal.]</Paragraph>
    </Section>
    <Section position="2" start_page="14" end_page="16" type="sub_section">
      <SectionTitle>
3.2 Experiment II
</SectionTitle>
      <Paragraph position="0"> In our second experiment we sought to exploit the fact that there are typically clear distributional differences between compositional and non-compositional uses of MWEs, in order to determine whether a given MWE has non-compositional uses at all. In this experiment we made use of a test set drawn from a database of German Preposition-Noun-Verb &amp;quot;collocation candidates&amp;quot; whose extraction is described by Krenn (2000) and which has been made available electronically.5 From this database only word combinations with a frequency of occurrence of more than 30 in our test corpus were considered. Our task was to classify these 81 potential MWEs according to whether or not they have an idiomatic meaning. To accomplish this task we took the following approach. We computed, on the basis of the distribution of the components of the MWE, an estimate of the compositional meaning vector for the MWE. We then compared this to the actual vector for the MWE as a whole, with the expectation that MWEs which indeed have non-compositional uses will be distinguished by a relatively low vector similarity between the estimated compositional meaning vector and the actual meaning vector.</Paragraph>
      <Paragraph position="1"> In other words, small similarity values should be diagnostic of the presence of non-compositional uses of the MWE.</Paragraph>
      <Paragraph position="2"> We calculated the estimated compositional meaning vector by taking it to be the sum of the meaning vectors of the parts, i.e., the compositional meaning of an expression w1w2 consisting of two words is taken to be the sum of the meaning vectors for the constituent words.6 In order to maximize the independent contribution of the constituent words, the meaning vectors for these words were always computed from contexts in which they appear alone (that is, not in the local context of the other constituent). We call the estimated compositional meaning vector the &amp;quot;composed&amp;quot; vector.7 The comparisons we made are illustrated in Figure 2, where vectors for the MWE auf die Strecke bleiben 'to fall by the wayside' and the words Strecke and bleiben are projected into two dimensions.8 (The words Autobahn 'highway' and eigenständig 'independent' are given for comparison.) Here we see that the linear combination of the component words of the MWE is clearly distinct from that of the MWE as a whole.</Paragraph>
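The composed-vector comparison reduces to a few lines; a minimal sketch (the three-dimensional vectors below are invented for illustration, while the model's real vectors are 100-dimensional):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def composed_vector(constituent_vecs):
    """Estimated compositional meaning: the sum of the constituents'
    meaning vectors (each built from contexts in which the word
    occurs without the other constituent)."""
    return np.sum(constituent_vecs, axis=0)

# Invented vectors for the constituent words and for the MWE as a whole.
strecke = np.array([0.8, 0.1, 0.2])
bleiben = np.array([0.2, 0.7, 0.1])
mwe     = np.array([0.1, 0.1, 0.9])

sim = cosine(composed_vector([strecke, bleiben]), mwe)
# A low similarity suggests the MWE has non-compositional uses.
print(sim < 0.5)  # True
```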
      <Paragraph position="3"> As a further illustration of the difference between the composed vector and the MWE vector, in Table 2 we list the words whose meaning vector is most similar to that of the MWE auf die Strecke bleiben, along with their similarity values, and in Table 3 we list those words whose meaning vector is most similar to the composed vector. The semantic differences between these two classes are readily apparent.</Paragraph>
      <Paragraph position="4"> We recognize that the composed vector is clearly nowhere near a perfect model of compositional meaning in the general case. This can be illustrated by considering, for example, the MWE fire breathing. This expression is clearly compositional, as it denotes the process of producing combusting exhalation, exactly what the semantic combination rules of English would predict. Nevertheless, the distribution of fire breathing is quite unrelated to that of its constituents fire and breathing (the former appears frequently with dragon and circus, while the latter appear frequently with blaze and lungs, respectively). Despite these principled objections, the composed vector provides a useful baseline for our investigation. We should note that a number of researchers in the LSA tradition have attempted to provide more compelling combinatory functions to capture the non-linearity of linguistic compositional interpretation (Kintsch, 2001; Widdows and Peters, 2003). [Footnote 8: The preposition auf and the article die are on the stop list.]</Paragraph>
      <Paragraph position="5"> As a check we chose, at random, a number of simple, clearly compositional word combinations (not from the candidate MWE list). We expected that on the whole these would evidence a very high similarity measure when compared with their associated composed vectors, and this is indeed the case, as shown in Table 1. We also compared the literal and non-literal vectors for ins Wasser fallen from the first experiment with the composed vector, computed from the meaning vectors for Wasser and for fallen.9 The difference is not large, but nevertheless the composed vector is more similar to the literal vector (cosine of 0.2937) than to the non-literal vector (cosine of 0.1733).</Paragraph>
      <Paragraph position="6"> Extending to the general case, our task was to compare the composed vector to the actual vector for all the MWEs in our test set. The resulting cosine similarity values range from 0.01 to 0.80. Our hope was that there would be a similarity threshold for distinguishing MWEs that have non-compositional interpretations from those that do not. Indeed, of the MWEs with a similarity value of under 0.1, just over half are MWEs which were hand-annotated as having non-literal uses.10 It is clear, then, that the technique described is, prima facie, capable of detecting idiomatic MWEs.</Paragraph>
    </Section>
    <Section position="3" start_page="16" end_page="16" type="sub_section">
      <SectionTitle>
3.3 Evaluation and Discussion
</SectionTitle>
      <Paragraph position="0"> To evaluate the method, we used the careful manual annotation of the PNV database described by Krenn (2000) as our gold standard. By adopting different thresholds for the classification decision, we obtained a range of results (trading off precision and recall). Table 4 illustrates this range.</Paragraph>
      <Paragraph position="1"> The F-score measure is maximized in our experiments by adopting a similarity threshold of 0.2. This means that MWEs whose meaning vector has a cosine under this value when compared with the composed vector should be classified as having a non-literal meaning.</Paragraph>
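The threshold-based classification and its evaluation can be sketched as follows (the similarity scores and gold labels below are invented; in the experiment they come from the 81 candidates and Krenn's annotation):

```python
def precision_recall_f(scores, gold, threshold):
    """Classify an MWE as non-literal when its composed-vs-actual cosine
    similarity falls below the threshold, and score against gold labels."""
    pred = [s < threshold for s in scores]
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(not p and g for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Invented cosine similarities and gold labels (True = has idiomatic uses).
scores = [0.05, 0.15, 0.25, 0.60, 0.08]
gold   = [True, True, False, False, True]

# Sweeping the threshold trades precision against recall.
for t in (0.1, 0.2, 0.3):
    print(t, precision_recall_f(scores, gold, t))
```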
      <Paragraph position="2"> To compare our method with that proposed by Baldwin et al. (2003), we applied their method to our materials, generating LSA vectors for the component content words in our candidate MWEs and comparing their semantic similarity to the LSA vector of the MWE as a whole, with the expectation that low similarity between the MWE as a whole and its component words is an indication of the non-compositionality of the MWE. The results are given in Table 5.</Paragraph>
      <Paragraph position="3"> It is clear that while Baldwin et al.'s expectation is borne out in the case of the constituent noun (the non-head), it is not in the case of the constituent verb (the head). Even in the case of the nouns, however, the results are, for the most part, markedly inferior to the results we achieved using the composed vectors.</Paragraph>
      <Paragraph position="4"> There are a number of issues that complicate the workability of the unsupervised technique described here. We rely on there being enough non-compositional uses of an idiomatic MWE in the corpus that the overall meaning vector for the MWE reflects this usage. If the literal meaning is overwhelmingly frequent, this will reduce the effectiveness of the method significantly. A second problem concerns the relationship between the literal and the non-literal meaning. Our technique relies on these meanings being highly distinct. If the meanings are similar, it is likely that local context will be inadequate to distinguish a compositional from a non-compositional use of the expression. In our investigation it became apparent, in fact, that in the newspaper genre, highly idiomatic expressions such as ins Wasser fallen were often used in their idiomatic sense (apparently for humorous effect) particularly frequently in contexts in which elements of the literal meaning were also present.11</Paragraph>
    </Section>
  </Section>
</Paper>