<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1090">
<Title>Taxonomy learning - factoring the structure of a taxonomy into a semantic classification decision</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1. Introduction </SectionTitle>
<Paragraph position="0"> Machine-readable thesauri are now an indispensable part of a wide range of NLP applications such as information extraction or semantics-sensitive information retrieval. Since their manual construction is very expensive, much recent NLP research has aimed to develop ways to acquire lexical knowledge automatically from corpus data.</Paragraph>
<Paragraph position="1"> In this paper we address the problem of augmenting a thesaurus with new lexical items on a large scale. The task is characterized by a large number of classes into which new words need to be classified and, hence, by many poorly predictable semantic distinctions that have to be taken into account. For this reason, knowledge-poor approaches such as the distributional approach are particularly well suited to this task. Its previous applications (e.g., Grefenstette 1993, Hearst and Schuetze 1993, Tokunaga et al. 1997, Lin 1998, Caraballo 1999) demonstrated that co-occurrence statistics on a target word are often sufficient for its automatic classification into one of numerous classes, such as the synsets of WordNet.</Paragraph>
<Paragraph position="2"> Distributional techniques, however, are poorly applicable to rare words, i.e., words for which a corpus does not contain enough co-occurrence data to judge their meaning. Such words are the primary concern of many practical NLP applications: as a rule, they are semantically focused words and carry a lot of important information. When one deals with a domain-specific lexicon, sparse data is a particularly difficult problem to overcome.</Paragraph>
<Paragraph position="3"> The major challenge for the application of the distributional approach in this area is, therefore, to develop ways of minimizing the amount of corpus data required to carry out the task successfully. In this study we focus on possibilities for optimizing an important phase of the process of automatically augmenting a thesaurus: the classification algorithm. The main hypothesis we test here is that the accuracy of semantic classification can be improved by taking advantage of the information about taxonomic relations between word classes that is contained in the thesaurus.</Paragraph>
<Paragraph position="4"> Using a domain-specific thesaurus as an example, we compare the performance of three state-of-the-art classifiers that presume a flat organization of thesaurus classes with two classification algorithms that make use of the taxonomic organization of the thesaurus: the "tree descending" and the "tree ascending" algorithms. We find that a version of the tree ascending algorithm, though not improving on the other methods overall, is much better at choosing a superconcept for the correct class of the new word. We then propose to use this algorithm first to narrow down the search space and then to apply the kNN method to determine the correct class among the fewer remaining candidates.</Paragraph>
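<Paragraph> To make this two-stage procedure concrete, the following is a minimal sketch, assuming cosine similarity over sparse co-occurrence vectors; the superconcept scoring is a deliberate simplification rather than the paper's actual tree ascending algorithm, and all function, variable, and data-structure names are illustrative assumptions, not taken from the paper.
# Hedged sketch of the proposed two stages, not the paper's own code:
# stage 1 narrows the search space to one superconcept of the taxonomy,
# stage 2 runs kNN over the classes under that superconcept only.
from collections import Counter
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse co-occurrence vectors (dicts)."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def classify(new_vec, taxonomy, class_vectors, k=5):
    """taxonomy: superconcept id mapped to the list of class ids below it.
    class_vectors: class id mapped to a list of (word, co-occurrence dict) pairs."""
    # Stage 1 ("tree ascending" flavour, simplified): score every superconcept
    # by the best similarity of the new word to any member word beneath it.
    def super_score(sup):
        return max(cosine(new_vec, vec)
                   for cls in taxonomy[sup]
                   for _word, vec in class_vectors[cls])
    best_sup = max(taxonomy, key=super_score)

    # Stage 2: plain kNN, restricted to the candidate classes under best_sup.
    neighbours = sorted(
        ((cosine(new_vec, vec), cls)
         for cls in taxonomy[best_sup]
         for _word, vec in class_vectors[cls]),
        reverse=True)[:k]
    votes = Counter(cls for _score, cls in neighbours)
    return votes.most_common(1)[0][0]
</Paragraph>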
<Paragraph position="5"> The paper is organized as follows. Sections 2 and 3 describe the classification algorithms under study. Section 4 describes the settings and data of the experiments. Section 5 details the evaluation method. Section 6 presents the results of the experiments. Section 7 concludes.</Paragraph>
</Section>
</Paper>