<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1812">
  <Title>An Empirical Model of Multiword Expression Decomposability</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Past research
</SectionTitle>
    <Paragraph position="0"> Although there has been some useful work on compositionality in statistical machine translation (e.g.</Paragraph>
    <Paragraph position="1"> Melamed (1997)), there has been little work on detecting &amp;quot;non-compositional&amp;quot; (i.e. non-decomposable and idiosyncratically decomposable) items of variable syntactic type in monolingual corpora. One interesting exception is Lin (1999), whose approach is explained as follows: The intuitive idea behind the method is that the metaphorical usage of a non-compositional expression causes it to have a different distributional characteristic than expressions that are similar to its literal meaning.</Paragraph>
    <Paragraph position="2"> The expressions he uses are taken from a collocation database (Lin, 1998b). These &amp;quot;expressions that are similar to [their] literal meaning&amp;quot; are found by substituting each of the words in the expression with the 10 most similar words according to a corpus derived thesaurus (Lin, 1998a). Lin models the distributional difference as a significant difference in mutual information. Significance here is defined as the absence of overlap between the 95% confidence interval of the mutual information scores. Lin provides some examples that suggest he has identified a successful measure of &amp;quot;compositionality&amp;quot;. He offers an evaluation where an item is said to be non-compositional if it occurs in a dictionary of idioms. This produces the unconvincing scores of 15.7% for precision and 13.7% for recall.</Paragraph>
    <Paragraph position="3"> We claim that substitution-based tests are useful in demarcating MWEs from productive word combinations (as attested by Pearce (2001a) in a MWE detection task), but not in distinguishing the different classes of decomposability. As observed above, simple decomposable MWEs such as motor car fail the substitution test not because of nondecomposability, but because the expression is institutionalised to the point of blocking alternates.</Paragraph>
    <Paragraph position="4"> Thus, we expect Lin's method to return a wide array of both decomposable and non-decomposable MWEs.</Paragraph>
    <Paragraph position="5"> Bannard (2002) focused on distributional techniques for describing the meaning of verb-particle constructions at the level of logical form. The semantic similarity between a multiword expression and its head was used as an indicator of decomposability. The assumption was that if a verb-particle was sufficiently similar to its head verb, then the verb contributed its simplex meaning. It gave empirical backing to this assumption by showing that annotator judgements for verb-particle decomposability correlate significantly with non-expert human judgements on the similarity between a verb-particle construction and its head verb. Bannard et al. (2003) extended this research in looking explicitly at the task of classifying verb-particles as being compositional or not. They successfully combined statistical and distributional techniques (including LSA) with a substitution test in analysing compositionality. McCarthy et al. (2003) also targeted verb-particles for a study on compositionality, and judged compositionality according to the degree of overlap in the N most similar words to the verb-particle and head verb, e.g., to determine compositionality. null We are not the first to consider applying LSA to MWEs. Schone and Jurafsky (2001) applied LSA to the analysis of MWEs in the task of MWE discovery, by way of rescoring MWEs extracted from a corpus. The major point of divergence from this research is that Schone and Jurafsky focused specifically on MWE extraction, whereas we are interested in the downstream task of semantically classifying attested MWEs.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Resources and Techniques
</SectionTitle>
    <Paragraph position="0"> In this section, we outline the resources used in evaluation, give an informal introduction to the LSA model, sketch how we extracted the MWEs from corpus data, and describe a number of methods for modelling decomposability within a hierarchical lexicon.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Resources and target MWEs
</SectionTitle>
      <Paragraph position="0"> The particular reference lexicon we use to evaluate our technique is WordNet 1.7 (Miller et al., 1990), due to its public availability, hierarchical structure and wide coverage. Indeed, Schone and Jurafsky (2001) provide evidence that suggests that WordNet is as effective an evaluation resource as the web for MWE detection methods, despite its inherent size limitations and static nature. Two MWE types that are particularly well represented in WordNet are compound nouns (47,000 entries) and multiword verbs (2,600 entries). Of these, we chose to specifically target two types of MWE: noun-noun (NN) compounds (e.g. computer network, work force) and verb-particles (e.g. look on, eat up) due to their frequent occurrence in both decomposable and non-decomposable configurations, and also their disparate syntactic behaviours.</Paragraph>
      <Paragraph position="1"> We extracted the NN compounds from the 1996 Wall Street Journal data (WSJ, 31m words), and the verb-particles from the British National Corpus (BNC, 90m words: Burnard (2000)). The WSJ data is more tightly domain-constrained, and thus a more suitable source for NN compounds if we are to expect sentential context to reliably predict the semantics of the compound. The BNC data, on the other hand, contains more colloquial and prosaic texts and is thus a richer source of verb-particles.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Description of the LSA model
</SectionTitle>
      <Paragraph position="0"> Our goal was to compare the distribution of different compound terms with their constituent words, to see if this indicated similarity of meaning. For this purpose, we used latent semantic analysis (LSA) to build a vector space model in which term-term similarities could be measured.</Paragraph>
      <Paragraph position="1"> LSA is a method for representing words as points in a vector space, whereby words which are related in meaning should be represented by points which are near to one another, first developed as a method for improving the vector model for information retrieval (Deerwester et al., 1990). As a technique for measuring similarity between words, LSA has been shown to capture semantic properties, and has been used successfully for recognising synonymy (Landauer and Dumais, 1997), word-sense disambiguation (Sch&amp;quot;utze, 1998) and for finding correct translations of individual terms (Widdows et al., 2002). The LSA model we built is similar to that described in (Sch&amp;quot;utze, 1998). First, 1000 frequent content words (i.e. not on the stoplist)1 were chosen as &amp;quot;content-bearing words&amp;quot;. Using these content-bearing words as column labels, the 50,000 most frequent terms in the corpus were assigned row vectors by counting the number of times they oc1A &amp;quot;stoplist&amp;quot; is a list of frequent words which have little independent semantic content, such as prepositions and determiners (Baeza-Yates and Ribiero-Neto, 1999, p167).</Paragraph>
      <Paragraph position="2"> curred within the same sentence as a content-bearing word. Singular-value decomposition (Deerwester et al., 1990) was then used to reduce the number of dimensions from 1000 to 100. Similarity between two vectors (points) was measured using the cosine of the angle between them, in the same way as the similarity between a query and a document is often measured in information retrieval (Baeza-Yates and Ribiero-Neto, 1999, p28). Effectively, we could use LSA to measure the extent to which two words or MWEs x and y usually occur in similar contexts.</Paragraph>
      <Paragraph position="3"> Since the corpora had been tagged with parts-ofspeech, we could build syntactic distinctions into the LSA models -- instead of just giving a vector for the string test we were able to build separate vectors for the nouns, verbs and adjectives test. This combination of technologies was also used to good effect by Widdows (2003): an example of the contribution of part-of-speech information to extracting semantic neighbours of the word fire is shown in Table 1. As can be seen, the noun fire (as in the substance/element) and the verb fire (mainly used to mean firing some sort of weapon) are related to quite different areas of meaning. Building a single vector for the string fire confuses this distinction -the neighbours of fire treated just as a string include words related to both the meaning of fire as a noun (more frequent in the BNC) and as a verb. The appropriate granularity of syntactic classifications is an open question for this kind of research: treating all the possible verbs categories as different (e.g. distinguishing infinitive from finite from gerund forms) led to data sparseness, and instead we considered &amp;quot;verb&amp;quot; as a single part-of-speech type.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 MWE extraction methods
</SectionTitle>
      <Paragraph position="0"> NN compounds were extracted from the WSJ by first tagging the data with fnTBL 1.0 (Ngai and Florian, 2001) and then simply taking noun bigrams (adjoined on both sides by non-nouns to assure the bigram is not part of a larger compound nominal).</Paragraph>
      <Paragraph position="1"> Out of these, we selected those compounds that are listed in WordNet, resulting in 5,405 NN compound types (208,000 tokens).</Paragraph>
      <Paragraph position="2"> Extraction of the verb-particles was considerably more involved, and drew on the method of Baldwin and Villavicencio (2002). Essentially, we used a POS tagger and chunker (both built using fnTBL 1.0 (Ngai and Florian, 2001)) to first (re)tag the BNC. This allowed us to extract verb-particle tokens through use of the particle POS and chunk tags returned by the two systems. This produces highprecision, but relatively low-recall results, so we performed the additional step of running a chunk-based grammar over the chunker output to detect candidate mistagged particles. In the case that a noun phrase followed the particle candidate, we performed attachment disambiguation to determine the transitivity of the particle candidate. These three methods produced three distinct sets of verb-particle tokens, which we carried out weighted voting over to determine the final set of verb-particle tokens. A total of 461 verb-particles attested in WordNet were extracted (160,765 tokens).</Paragraph>
      <Paragraph position="3"> For both the NN compound and verb-particle data, we replaced each token occurrence with a single-word POS-tagged token to feed into the LSA model.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Techniques for evaluating correlation with
WordNet
</SectionTitle>
      <Paragraph position="0"> In order to evaluate our approach, we employed the lexical relations as defined in the WordNet lexical hierarchy (Miller et al., 1990). WordNet groups words into sets with similar meaning (known as &amp;quot;synsets&amp;quot;), e.g. fcar, auto, automobile, machine, motorcar g . These are organised into a hierarchy employing multiple inheritance. The hierarchy is structured according to different principles for each of nouns, verbs, adjectives and adverbs. The nouns are arranged according to hyponymy or ISA relations, e.g. a car is a kind of automobile. The verbs are arranged according to troponym or &amp;quot;manner-of&amp;quot; relations, where murder is a manner of killing, so kill immediately dominates murder in the hierarchy.</Paragraph>
      <Paragraph position="1"> We used WordNet for evaluation by way of looking at: (a) hyponymy, and (b) semantic distance.</Paragraph>
      <Paragraph position="2"> Hyponymy provides the most immediate way of evaluating decomposability. With simple decomposable MWEs, we can expect the constituents (and particularly the head) to be hypernyms (ancestor nodes) or synonyms of the MWE. That is, simple decomposable MWEs are generally endocentric, although there are some exceptions to this generalisation such as vice president arguably not being a hyponym of president. No hyponymy relation holds with non-decomposable or idiosyncratically decomposable MWEs (i.e., they are exocentric), as even if the semantics of the head noun can be determined through decomposition, by definition this will not correspond to a simplex sense of the word.</Paragraph>
      <Paragraph position="3"> We deal with polysemy of the constituent words and/or MWE by simply looking for the existence of a sense of the constituent words which fire (string only) fire nn1 fire vvi  subsumes a sense of the MWE. The function hyponym(word i; mwe) thus returns a value of 1 if some sense of word i subsumes a sense of mwe, and a value of 0 otherwise.</Paragraph>
      <Paragraph position="4"> A more proactive means of utilising the WordNet hierarchy is to derive a semantic distance based on analysis of the relative location of senses in Word-Net. Budanitsky and Hirst (2001) evaluated the performance of five different methods that measure the semantic distance between words in the Word-Net Hierarchy, which Patwardhan et al. (2003) have then implemented and made available for general use as the Perl package distance-0.11.2 We focused in particular on the following three measures, the first two of which are based on information theoretic principles, and the third on sense topology: Resnik (1995) combined WordNet with corpus statistics. He defines the similarity between two words as the information content of the lowest superordinate in the hierarchy, defining the information content of a concept c (where a concept is the WordNet class containing the word) to be the negative of its log likelihood.</Paragraph>
      <Paragraph position="5"> This is calculated over a corpus of text.</Paragraph>
      <Paragraph position="6"> Lin (1998c) also employs the idea of corpus-derived information content, and defines the similarity between two concepts in the following way:</Paragraph>
      <Paragraph position="8"> where C0 is the lowest class in the hierarchy that subsumes both classes.</Paragraph>
      <Paragraph position="10"> Hirst and St-Onge (1998) use a system of &amp;quot;relations&amp;quot; of different strength to determine the similarity of word senses, conditioned on the type, direction and relative distance of edges separating them.</Paragraph>
      <Paragraph position="11"> The Patwardhan et al. (2003) implementation that we used calculates the information values from SemCor, a semantically tagged subset of the Brown corpus. Note that the first two similarity measures operate over nouns only, while the last can be applied to any word class.</Paragraph>
      <Paragraph position="12"> The similarity measures described above calculate the similarity between a pair of senses. In the case that a given constituent word and/or MWE occur with more than one sense, we calculate a similarity for sense pairing between them, and average over them to produce a consolidated similarity value.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>