<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2403"> <Title>Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool</Title> <Section position="3" start_page="17" end_page="18" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> The issue of multiword expression (MWE) processing has attracted much attention from the Natural Language Processing (NLP) community (Smadja, 1993; Dagan and Church, 1994; Daille, 1995; McEnery et al., 1997; Wu, 1997; Michiels and Dufour, 1998; Maynard and Ananiadou, 2000; Merkel and Andersson, 2000; Piao and McEnery, 2001; Sag et al., 2001; Tanaka and Baldwin, 2003; Dias, 2003; Baldwin et al., 2003; Nivre and Nilsson, 2004; Pereira et al., 2004; Piao et al., 2005).</Paragraph> <Paragraph position="1"> Research in this area covers a wide range of sub-issues, including MWE identification and extraction from monolingual and multilingual corpora, classification of MWEs according to a variety of viewpoints such as type and compositionality, and alignment of MWEs across different languages. However, studies in this area on the Chinese language are limited.</Paragraph> <Paragraph position="2"> A number of approaches have been suggested, including rule-based and statistical approaches, and have achieved varying degrees of success.</Paragraph> <Paragraph position="3"> Despite this research, MWE processing remains a tough challenge and has been receiving increasing attention, as exemplified by recent MWE-related ACL workshops.</Paragraph> <Paragraph position="4"> Directly related to our work is the development of a statistical MWE tool at Lancaster for searching and identifying English MWEs in running text (Piao et al., 2003, 2005). Trained on corpus data in a given domain or genre, this tool can automatically identify MWEs in running text or extract MWEs from corpus data from a similar domain/genre (see Section 3.1 for further information about this tool).
It has been tested and compared with an English semantic tagger (Rayson et al., 2004) and was found to be efficient in identifying domain-specific MWEs in English corpora, and complementary to the semantic tagger, which relies on a large manually compiled lexicon.</Paragraph> <Paragraph position="5"> Other directly related work includes the development of the HYT MT system at CCID in Beijing, China. The system has been under development since 1991 (Sun, 2004) and is one of the most successful MT systems in China. However, because it is a mainly rule-based system, its performance degrades when processing texts from domains previously unknown to its knowledge database. Recently, a corpus-based approach has been adopted for its improvement, and efforts are being made to improve its capability of processing MWEs.</Paragraph> <Paragraph position="6"> Our main interest in this study is in the application of an MWE identification tool to the improvement of an MT system. As far as we know, there has not been a satisfactory solution to the efficient handling of Chinese MWEs in MT systems, and our experiment contributes to a deeper understanding of this problem.</Paragraph> </Section> </Paper>