<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1812"> <Title>An Empirical Model of Multiword Expression Decomposability</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> This paper is concerned with an empirical model of multiword expression decomposability. Multiword expressions (MWEs) are defined to be cohesive lexemes that cross word boundaries (Sag et al., 2002; Copestake et al., 2002; Calzolari et al., 2002). They occur in a wide variety of syntactic configurations in different languages (e.g. in the case of English, compound nouns: post office, verbal idioms: pull strings, verb-particle constructions: push on, etc.).</Paragraph> <Paragraph position="1"> Decomposability is a description of the degree to which the semantics of an MWE can be ascribed to those of its parts (Riehemann, 2001; Sag et al., 2002). Analysis of the semantic correlation between the constituent parts and whole of an MWE is perhaps more commonly discussed under the banner of compositionality (Nunberg et al., 1994; Lin, 1999).</Paragraph> <Paragraph position="2"> Our claim here is that the semantics of the MWE are deconstructed and the parts coerced into often idiosyncratic interpretations to attain semantic alignment, rather than the other way around. One idiom which illustrates this process is spill the beans, where the semantics of reveal'(secret') are decomposed such that spill is coerced into the idiosyncratic interpretation of reveal' and beans into the idiosyncratic interpretation of secret'.
Given that these senses for spill and beans are not readily available at the simplex level other than in the context of this particular MWE, it seems fallacious to talk about them composing together to form the semantics of the idiom.</Paragraph> <Paragraph position="3"> Ideally, we would like to be able to differentiate between three classes of MWEs: non-decomposable, idiosyncratically decomposable and simple decomposable (derived from Nunberg et al.'s sub-classification of idioms (1994)). With non-decomposable MWEs (e.g. kick the bucket, shoot the breeze, hot dog), no decompositional analysis is possible, and the MWE is semantically impenetrable. The only syntactic variation that non-decomposable MWEs undergo is verbal inflection (e.g. kicked the bucket, kicks the bucket) and pronominal reflexivisation (e.g. wet oneself, wet themselves). Idiosyncratically decomposable MWEs (e.g. spill the beans, let the cat out of the bag, radar footprint) are decomposable but coerce their parts into taking semantics unavailable outside the MWE. They undergo a certain degree of syntactic variation (e.g. the cat was let out of the bag). Finally, simple decomposable MWEs (also known as &quot;institutionalised&quot; MWEs, e.g. kindle excitement, traffic light) decompose into simplex senses and generally display high syntactic variability. What makes simple decomposable expressions true MWEs rather than productive word combinations is that they tend to block compositional alternates with the expected semantics (termed anti-collocations by Pearce (2001b)). For example, motor car cannot be rephrased as *engine car or *motor automobile. Note that the existence of anti-collocations is also a test for non-decomposable and idiosyncratically decomposable MWEs (e.g. hot dog vs. #warm dog or #hot canine).</Paragraph> <Paragraph position="4"> Our particular interest in decomposability stems from ongoing work on grammatical means for capturing MWEs. Nunberg et al.
(1994) observed that idiosyncratically decomposable MWEs (in particular idioms) undergo much greater syntactic variation than non-decomposable MWEs, and that the variability can be partially predicted from the decompositional analysis. We thus aim to capture the decomposability of MWEs in the grammar and use this to constrain the syntax of MWEs in parsing and generation. Note that it is arguable whether simple decomposable MWEs belong in the grammar proper, or should be described instead as lexical affinities between particular word combinations.</Paragraph> <Paragraph position="5"> As the first step down the path toward an empirical model of decomposability, we focus on demarcating simple decomposable MWEs from idiosyncratically decomposable and non-decomposable MWEs. This is largely equivalent to classifying MWEs as being endocentric (i.e., a hyponym of their head) or exocentric (i.e., not a hyponym of their head: Haspelmath (2002)).</Paragraph> <Paragraph position="6"> We attempt to achieve this by looking at the semantic similarity between an MWE and its constituent words, and hypothesising that where the similarity between the constituents of an MWE and the whole is sufficiently high, the MWE must be of simple decomposable type.</Paragraph> <Paragraph position="7"> The particular similarity method we adopt is latent semantic analysis, or LSA (Deerwester et al., 1990). LSA allows us to calculate the similarity between an arbitrary word pair, offering the advantage of being able to measure the similarity between the MWE and each of its constituent words. For MWEs such as house boat, therefore, we can expect to capture the fact that the MWE is highly similar in meaning to both constituent words (i.e. the modifier house and head noun boat). More importantly, LSA makes no assumptions about the lexical or syntactic composition of the inputs, and thus constitutes a fully construction- and language-inspecific method of modelling decomposability. 
This has clear advantages over a more conventional supervised classifier-style approach, where training data would have to be customised to a particular language and construction type.</Paragraph> <Paragraph position="8"> Evaluation is inevitably a difficulty when it comes to the analysis of MWEs, due to the lack of concise consistency checks on what MWEs should and should not be incorporated into dictionaries. While recognising the dangers associated with dictionary-based evaluation, we commit ourselves to this paradigm and focus on searching for appropriate means of demonstrating the correlation between dictionary- and corpus-based similarities.</Paragraph> <Paragraph position="9"> The remainder of this paper is structured as follows. Section 2 describes past research on MWE compositionality of relevance to this effort. Section 3 provides a basic outline of the resources used in this research, LSA, the MWE extraction methods, and measures used to evaluate our method. Section 4 then provides evaluation of the proposed method, and the paper is concluded with a brief discussion in Section 5.</Paragraph> </Section> </Paper>