<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1807"> <Title>Extracting Multiword Expressions with A Semantic Tagger</Title> <Section position="2" start_page="0" end_page="3" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Automatic extraction of multiword expressions (MWEs) is an important issue in the NLP community and corpus linguistics. An efficient tool for MWE extraction can be useful in numerous areas, including terminology extraction, machine translation, bilingual/multilingual MWE alignment, and automatic interpretation and generation of language. A number of approaches have been suggested and tested to address this problem. However, efficient extraction of MWEs remains an unsolved issue, to the extent that Sag et al.</Paragraph> <Paragraph position="1"> (2001b) call it &quot;a pain in the neck of NLP&quot;. In this paper, we present our work in which we approach the issue of MWE extraction by using a semantic field annotator. Specifically, we use the UCREL Semantic Analysis System (henceforth USAS), developed at Lancaster University, to identify multiword units that depict single semantic concepts, i.e. multiword expressions. We have drawn from the Meter Corpus (Gaizauskas et al., 2001; Clough et al., 2002) a collection of British newspaper reports on court stories to evaluate our approach. Our experiment shows that the approach is efficient in identifying MWEs, in particular MWEs of low frequency. 
In the following sections, we describe this approach to MWE extraction and its evaluation.</Paragraph> <Section position="1" start_page="2" end_page="3" type="sub_section"> <SectionTitle> Related Works </SectionTitle> <Paragraph position="0"> Generally speaking, approaches to MWE extraction proposed so far can be divided into three categories: a) statistical approaches based on frequency and co-occurrence affinity, b) knowledge-based or symbolic approaches using parsers, lexicons and language filters, and c) hybrid approaches combining different methods (Smadja 1993; Dagan and Church 1994; Daille 1995; McEnery et al.</Paragraph> <Paragraph position="1"> 1997; Wu 1997; Wermter et al. 1997; Michiels and Dufour 1998; Merkel and Andersson 2000; Piao and McEnery 2001; Sag et al. 2001a, 2001b; Biber et al. 2003).</Paragraph> <Paragraph position="2"> In practice, most statistical approaches use linguistic filters to collect candidate MWEs.</Paragraph> <Paragraph position="3"> Such approaches include Dagan and Church's (1994) Termight Tool. In this tool, they first collect candidate nominal terms with a POS syntactic pattern filter, then use concordances to identify frequently co-occurring multiword units. In his Xtract system, Smadja (1993) first extracted significant pairs of words that consistently co-occur within a single syntactic structure using statistical scores called distance, strength and spread, and then examined concordances of the bigrams to find longer frequent multiword units. Similarly, Merkel and Andersson (2000) compared frequency-based and entropy-based algorithms, each of which was combined with a language filter. They reported that the entropy-based algorithm produced better results.</Paragraph> <Paragraph position="4"> One of the main problems facing statistical approaches, however, is that they are unable to deal with low-frequency MWEs. In fact, the majority of the words in most corpora have low frequencies, occurring only once or twice. 
This means that a major part of the true multiword expressions is left out by statistical approaches. Lexical resources and parsers are used to obtain better coverage of the lexicon in MWE extraction. For example, Wu (1997) used an English-Chinese bilingual parser based on stochastic transduction grammars to identify terms, including multiword expressions. In their DEFI Project, Michiels and Dufour (1998) used dictionaries to identify English and French multiword expressions and their translations in the other language. Wehrli (1998) employed a generative grammar framework to identify compounds and idioms in the ITS-2 English-French MT system. Sag et al. (2001b) introduced Head-driven Phrase Structure Grammar for analyzing MWEs. Like purely statistical approaches, purely knowledge-based symbolic approaches also face problems. They are language dependent and not flexible enough to cope with complex structures of MWEs. As Sag et al. (2001b) suggest, it is important to find the right balance between symbolic and statistical approaches.</Paragraph> <Paragraph position="5"> In this paper, we propose a new approach to MWE extraction using semantic field information. In this approach, multiword units depicting single semantic concepts are recognized using the Lancaster USAS semantic tagger. We describe that system and the algorithms used for identifying single and multi-word units in the following section.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> Lancaster Semantic tagger </SectionTitle> <Paragraph position="0"> The USAS system has been in development at Lancaster University since 1990.</Paragraph> <Paragraph position="1"> Based on POS annotation provided by the CLAWS tagger (Garside and Smith, 1997), USAS assigns a set of semantic tags to each item in running text and then attempts to disambiguate the tags in order to choose the most likely candidate in each context. 
Items can be single words or multiword expressions. The semantic tags indicate semantic fields which group together word senses that are related by virtue of their being connected at some level of generality with the same mental concept. The groups include not only synonyms and antonyms but also hypernyms and hyponyms.</Paragraph> <Paragraph position="2"> The initial tagset was loosely based on Tom McArthur's Longman Lexicon of Contemporary English (McArthur, 1981), as this appeared to offer the most appropriate thesaurus-type classification of word senses for this kind of analysis. The tagset has since been considerably revised in the light of practical tagging problems met in the course of the research. The revised tagset is arranged in a hierarchy with 21 major discourse fields expanding into 232 category labels. The following list shows the 21 labels at the top level of the hierarchy (for the full tagset, see website: http://www.comp.lancs.ac.uk/ucrel/usas).</Paragraph> <Paragraph position="3"> This work is continuing to be supported by the Benedict project, EU project IST-2001-34237. A general and abstract terms; B the body and the individual; C arts and crafts</Paragraph> <Paragraph position="5"> F food and farming; G government and the public domain; H architecture, buildings, houses and the home; I money and commerce in industry; K entertainment, sports and games; L life and living things; M movement, location, travel and transport; Z names and grammatical words. Currently, the lexicon contains just over 37,000 words and the template list contains over 16,000 multiword units. These resources were created manually by extending and expanding dictionaries from the CLAWS tagger with observations from large text corpora. Generally, only the base forms of nouns and verbs are stored in the lexicon and a lemmatisation procedure is used for look-up. However, the base form is not sufficient in some cases. 
Stubbs (1996: 40) observes that &quot;meaning is not constant across the inflected forms of a lemma&quot;, and Tognini-Bonelli (2001: 92) notes that lemma variants have different senses.</Paragraph> <Paragraph position="6"> In the USAS lexicon, each entry consists of a word with one POS tag and one or more semantic tags assigned to it. At present, in cases where a word has more than one syntactic tag, it is duplicated (i.e. each syntactic tag is given a separate entry).</Paragraph> <Paragraph position="7"> The semantic tags for each entry in the lexicon are arranged in approximate rank frequency order to assist in manual post-editing, and to allow for gross automatic selection of the common tag, subject to weighting by domain of discourse.</Paragraph> <Paragraph position="8"> In the multi-word-unit list, each template consists of a pattern of words and part-of-speech tags. The semantic tags for each template are arranged in rank frequency order in the same way as in the lexicon. Various types of multiword expressions are included: phrasal verbs (e.g. stubbed out), noun phrases (e.g. ski boots), proper names (e.g. United States), and true idioms (e.g. life of Riley).</Paragraph> <Paragraph position="9"> Figure 1 below shows samples of the actual templates used to identify these MWUs. Each of these example templates has only one semantic tag associated with it, listed on the right-hand end of the template. However, the second example (ski boot) combines the clothing (B5) and sports (K5.1) fields into one tag. The pattern on the left of each template consists of a sequence of words joined to POS tags with the underscore character. The word and POS fields can include the asterisk wild-card character to allow for inflectional variants and to write more powerful templates with wider coverage. USAS templates can match discontinuous MWUs, and this is illustrated by the first example, which includes optional intervening POS items marked within curly brackets. 
Thus this template can match stubbed out and stubbed the cigarette out. 'Np' is used to match simple noun phrases identified with a noun-phrase chunker.</Paragraph> <Paragraph position="11"> Figure 1: Sample of USAS multiword templates. As in the case of grammatical tagging, the task of semantic tagging subdivides broadly into two phases: Phase I (Tag assignment): attaching a set of potential semantic tags to each lexical unit, and Phase II (Tag disambiguation): selecting the contextually appropriate semantic tag from the set provided by Phase I. USAS makes use of seven major techniques or sources of information in Phase II. We will list these only briefly here, since they are described in more detail elsewhere (Garside and Rayson, 1997).</Paragraph> <Paragraph position="12"> 1. POS tag. Some senses can be eliminated by prior POS tagging. The CLAWS part-of-speech tagger is run prior to semantic tagging.</Paragraph> <Paragraph position="13"> 2. General likelihood ranking for single-word and MWU tags. In the lexicon and MWU list, senses are ranked in terms of frequency, even though at present such ranking is derived from limited or unverified sources such as frequency-based dictionaries, past tagging experience and intuition.</Paragraph> <Paragraph position="14"> 3. Overlapping MWU resolution. Normally, semantic multi-word units take priority over single word tagging, but in some cases a set of templates will produce overlapping candidate taggings for the same set of words. A set of heuristics is applied to enable the most likely template to be treated as the preferred one for tag assignment.</Paragraph> <Paragraph position="15"> 4. Domain of discourse. Knowledge of the current domain or topic of discourse is used to alter rank ordering of semantic tags in the lexicon and template list for a particular domain.</Paragraph> <Paragraph position="16"> 5. Text-based disambiguation. 
It has been claimed (by Gale et al., 1992) on the basis of corpus analysis that to a very large extent a word keeps the same meaning throughout a text.</Paragraph> <Paragraph position="17"> 6. Contextual rules. The template mechanism is also used in identifying regular contexts in which a word is constrained to occur in a particular sense.</Paragraph> <Paragraph position="18"> 7. Local probabilistic disambiguation. It is generally supposed that the correct semantic tag for a given word is substantially determined by the local surrounding context. After automatic tag assignment has been carried out, manual post-editing can take place, if desired, to ensure that each word and idiom carries the correct semantic classification. Of these seven disambiguation methods, our main interest in this paper is the third technique, overlapping MWU resolution.</Paragraph> <Paragraph position="19"> When two or more template matches overlap in a sentence, the following heuristics are applied in sequence: 1. Prefer longer templates over shorter templates. 2. For templates of the same length, prefer shorter span matches over longer span matches (a longer span indicates more intervening items for discontinuous templates). 3. If the templates do not apply to the same sequence of words, prefer the one that begins earlier in the sentence. 4. For templates matching the same sequence of words, prefer the one which contains the more fully defined template pattern (with fewer wild-cards in the word fields). 5. Prefer templates with a more fully defined first word in the template. 6. Prefer templates with fewer wild-cards in the POS tags. These six rules were found to differentiate all cases of overlapping MWU templates. Cases which failed to be differentiated indicated that two (or more) templates in our MWU list were in fact identical, apart from the semantic tags, and required merging.</Paragraph> </Section> </Section> </Paper>