File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/w04-3249_intro.xml
Size: 4,059 bytes
Last Modified: 2025-10-06 14:02:51
<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3249">
<Title>Unsupervised Domain Relevance Estimation for Word Sense Disambiguation</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> A fundamental issue in text processing and understanding is the ability to detect the topic (i.e. the domain) of a text or of a portion of it. Indeed, domain detection allows a number of useful simplifications in text processing applications such as, for instance, Word Sense Disambiguation (WSD).</Paragraph>
<Paragraph position="1"> In this paper we introduce Domain Relevance Estimation (DRE), a fully unsupervised technique for domain detection. Roughly speaking, DRE can be viewed as a text categorization (TC) problem (Sebastiani, 2002), even though we do not approach the problem in the standard supervised setting, which requires category-labeled training data. In fact, unsupervised approaches to TC have recently received more and more attention in the literature (see, for example, (Ko and Seo, 2000)).</Paragraph>
<Paragraph position="2"> We assume a pre-defined set of categories, each defined by means of a list of related terms. We call such categories domains and we consider them as a set of general topics (e.g. SPORT, MEDICINE, POLITICS) that cover the main disciplines and areas of human activity. For each domain, the list of related words is extracted from WORDNET DOMAINS (Magnini and Cavaglià, 2000), an extension of WORDNET in which synsets are annotated with domain labels. We have identified about 40 domains (out of the 200 present in WORDNET DOMAINS) and we will use them for the experiments throughout the paper (see Table 1).</Paragraph>
<Paragraph position="3"> DRE focuses on the problem of estimating the degree of relatedness of a given text with respect to the domains in WORDNET DOMAINS.</Paragraph>
<Paragraph position="4"> The basic idea underlying DRE is to combine the knowledge in WORDNET DOMAINS with a probabilistic framework that makes use of a large-scale corpus to induce domain frequency distributions.</Paragraph>
<Paragraph position="5"> Specifically, given a certain domain, DRE considers frequency scores for both relevant and non-relevant texts (i.e. texts which introduce noise) and represents them by means of a Gaussian Mixture model. An Expectation Maximization algorithm then computes the parameters that maximize the likelihood of the empirical data.</Paragraph>
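To make the estimation step concrete, the following is a minimal sketch in Python of fitting a two-component one-dimensional Gaussian mixture with EM to the frequency scores observed for a single domain, one component intended for relevant texts and one for noisy, non-relevant texts. Function names such as em_gaussian_mixture and domain_relevance are hypothetical; this illustrates the general technique, not the authors' implementation.

# Minimal sketch, not the authors' code: fit a two-component 1-D Gaussian
# mixture to per-text frequency scores for one domain via EM, so that one
# component captures relevant texts and the other captures noise.
import math

def em_gaussian_mixture(scores, iterations=50):
    """Return (means, variances, priors) of a 2-component mixture."""
    # Crude initialisation: place the components at the extremes of the data.
    mu = [min(scores), max(scores)]
    var = [1.0, 1.0]
    prior = [0.5, 0.5]

    def gauss(x, m, v):
        return math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

    for _ in range(iterations):
        # E-step: posterior responsibility of each component for each score.
        resp = []
        for x in scores:
            p = [prior[k] * gauss(x, mu[k], var[k]) for k in range(2)]
            z = sum(p) or 1e-300
            resp.append([pk / z for pk in p])
        # M-step: re-estimate parameters from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            mu[k] = sum(r[k] * x for r, x in zip(resp, scores)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, scores)) / nk, 1e-6)
            prior[k] = nk / len(scores)
    return mu, var, prior

def domain_relevance(score, mu, var, prior):
    """Posterior probability that a text with this frequency score is relevant,
    taking the higher-mean component as the 'relevant' one (an assumption)."""
    def gauss(x, m, v):
        return math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
    rel = 1 if mu[1] >= mu[0] else 0
    num = prior[rel] * gauss(score, mu[rel], var[rel])
    den = sum(prior[k] * gauss(score, mu[k], var[k]) for k in range(2))
    return num / den if den > 0 else 0.0

Under this view, the relevance of a new text for a domain can be read off as the posterior probability of the "relevant" component given the text's frequency score.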
<Paragraph position="6"> The DRE methodology originated from the effort to improve the performance of the Domain Driven Disambiguation (DDD) system (Magnini et al., 2002).</Paragraph>
<Paragraph position="7"> DDD is an unsupervised WSD methodology that makes use of domain information only. DDD assigns the correct sense to a word in its context by comparing the domain of the context to the domain of each sense of the word. This methodology exploits WORDNET DOMAINS information to estimate both the domain of the textual context and the domain of the senses of the word to disambiguate. The former operation is intrinsically an unsupervised TC task, and the category set used has to be the same as the one used to represent the domains of word senses.</Paragraph>
<Paragraph position="8"> Since DRE makes use of a fixed set of target categories (i.e. domains) and since a document collection annotated with such categories is not available, evaluating the performance of the approach is a problem in itself. We have decided to perform an indirect evaluation using the DDD system, where unsupervised TC plays a crucial role.</Paragraph>
<Paragraph position="9"> The paper is structured as follows. Section 2 introduces WORDNET DOMAINS, the lexical resource that provides the underlying knowledge for the DRE technique. Section 3 introduces the problem of estimating the domain relevance of a text. Section 4 briefly sketches the WSD system used for evaluation. Finally, Section 5 describes a number of evaluation experiments we have carried out.</Paragraph>
</Section>
</Paper>