<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0411"> <Title>Lexical Encoding of MWEs</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Multiword Expressions (MWEs) can be defined as idiosyncratic interpretations that cross word boundaries (or spaces) (Sag et al., 2002). They comprise a wide range of distinct but related phenomena, such as idioms, phrasal verbs, noun-noun compounds and many others, which, owing to their flexible nature, are considered a challenge for many areas of current language technology. Although some MWEs, such as ad hoc, are fixed and show no internal variation, others are much more flexible and allow different degrees of internal variability and modification, as, for instance, touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are opaque: their meaning cannot be straightforwardly inferred from the meanings of the component words (e.g. to kick the bucket, meaning to die). In other cases the meaning is more transparent and can be inferred from the words in the MWE (e.g. eat up, where the particle up adds a completive sense to eat).</Paragraph> <Paragraph position="1"> Given the flexibility and variation in form of MWEs and the complex interrelations that may be found between their components, an encoding that treats them as invariant strings (a words-with-spaces approach) cannot adequately describe any such expression, with the exception of the simplest fixed cases such as ad hoc (Sag et al., 2002; Calzolari et al., 2002). Different strategies for encoding MWEs have been employed by different lexical resources with varying degrees of success, depending on the type of MWE. 
One case is the Alvey Tools Lexicon (Carroll and Grover, 1989), which has good coverage of phrasal verbs and provides extensive information about their syntactic aspects (variation in word order, subcategorisation, etc.), but it neither distinguishes compositional from non-compositional entries nor specifies entries that can be productively formed. WordNet, on the other hand, covers a large number of MWEs (Fellbaum, 1998), but does not provide information about their variability. Neither of these resources covers idioms. The challenge in designing adequate lexical resources for MWEs is to ensure that the variability and the extra dimensions required by the different types of MWE can be captured. Such a move is called for by Calzolari et al.</Paragraph> <Paragraph position="2"> (2002) and Copestake et al. (2002). Calzolari et al.</Paragraph> <Paragraph position="3"> (2002) discuss these problems while attempting to establish standards for MWE description in the context of multilingual lexical resources. Their focus is on MWEs that are productive and that present regularities that can be generalised and applied to other classes of words that have similar properties.</Paragraph> <Paragraph position="4"> Copestake et al. (2002) present an initial schema for MWE description, and we build on these ideas here by proposing an architecture for a lexical encoding of MWEs that allows for a unified treatment of different kinds of MWE.</Paragraph> <Paragraph position="5"> In what follows, we start by laying out the minimal encoding needed for simplex (single) words.</Paragraph> <Paragraph position="6"> Then, we analyse two different types of MWE (idioms and verb-particle constructions) and discuss their requirements for a lexical encoding. Given these requirements, we present a possible encoding for MWEs that uniformly captures different types of expressions. 
This database encoding minimises the amount of information that needs to be specified for MWE entries by maximising the information that can be obtained from simplex words, while requiring only minimal modification to the encoding used for simplex words. We finish with some discussion and conclusions. Second ACL Workshop on Multiword Expressions: Integrating Processing, July 2004, pp. 80-87</Paragraph> </Section> </Paper>