<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2103"> <Title>Linguistic Preprocessing for Distributional Classification of Words</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> With the rapid development of text mining technologies, automated management of lexical resources has become an important research issue. A particular text mining task often requires a lexical database (e.g., a thesaurus, a dictionary, or a terminology) with a specific size, topic coverage, and granularity of encoded meaning. That is why much recent NLP and AI research has focused on ways to quickly build or extend a lexical resource ad hoc for an application.</Paragraph> <Paragraph position="1"> One attractive approach to this problem is to elicit the meanings of new words automatically from a corpus relevant to the application domain.</Paragraph> <Paragraph position="2"> To do this, many approaches to lexical acquisition employ the distributional model of word meaning, in which the meaning of a word is induced from its distribution across the lexical contexts in which it occurs in the corpus. The approach is now being actively explored for a wide range of semantics-related tasks, including automatic construction of thesauri (Lin, 1998; Caraballo, 1999), their enrichment (Alfonseca and Manandhar, 2002; Pekar and Staab, 2002), acquisition of bilingual lexica from non-aligned (Kay and Roscheisen, 1993) and non-parallel corpora (Fung and Yee, 1998), and learning of information extraction patterns from un-annotated text (Riloff and Schmelzenbach, 1998).</Paragraph> <Paragraph position="3"> However, because of irregularities in corpus data, corpus statistics cannot guarantee optimal performance, notably for rare lexical items. In order to improve robustness, recent research has attempted a variety of ways to incorporate external knowledge into the distributional model. 
In this paper we investigate the impact of introducing different types of linguistic knowledge into the model.</Paragraph> <Paragraph position="4"> Linguistic knowledge, i.e., knowledge about linguistically relevant units of text and the relations holding between them, is a particularly convenient way to enhance the distributional model. On the one hand, although they describe the &quot;surface&quot; properties of the language, linguistic notions contain conceptual information about the units of text they describe. It is therefore reasonable to expect that linguistic analysis of the context of a word yields additional evidence about its meaning. On the other hand, linguistic knowledge is relatively easy to obtain: linguistic analyzers (lemmatizers, PoS-taggers, parsers, etc.) do not require expensive hand-encoded resources, their application is not restricted to particular domains, and their performance does not depend on the amount of textual data. All these characteristics fit very well with the strengths of the distributional approach: while enhancing it with external knowledge, linguistic analyzers do not limit its coverage and portability.</Paragraph> <Paragraph position="5"> Some form of linguistic preprocessing is carried out in many previous applications of the approach. However, these studies seldom motivate the choice of a particular preprocessing procedure, concentrating instead on optimizing other parameters of the methodology. Very few studies analyze and compare different techniques for linguistically motivated extraction of distributional data. The goal of this paper is to explore in detail a range of variables in the morphological and syntactic processing of the context information and to reveal the merits and drawbacks of their particular settings.</Paragraph> <Paragraph position="6"> The outline of the paper is as follows. 
Section 2 describes the preprocessing methods under study.</Paragraph> <Paragraph position="7"> Section 3 describes the settings for their empirical evaluation. Section 4 details the experimental results. Section 5 discusses related work. Section 6 summarizes the results and presents the conclusions from the study.</Paragraph> </Section> </Paper>