<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1055"> <Title>Automatic Selection of Class Labels from a Thesaurus for an Effective Semantic Tagging of Corpora.</Title> <Section position="2" start_page="0" end_page="381" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> It is well known that statistically-based approaches to lexical knowledge acquisition are faced with the problem of low counts. Many language patterns (from simple co-occurrences to more complex syntactic associations among words) occur very rarely, or are never encountered, in the learning corpus. Since rare patterns are the majority, the quality and coverage of lexical learning may be severely affected.</Paragraph> <Paragraph position="1"> The obvious strategy to reduce this problem is to generalise word patterns according to some clustering technique. In the literature, two generalisation strategies have been adopted: Distributional approaches: Several papers adopt distributional techniques to identify clusters of words according to some defined measure of similarity. Among these, in (Grishman and Sterling, 1994) a method is proposed to cluster syntactic triples, while in (Pereira and Tishby 1992, 1993) and (Dagan et al., 1994) pure bigrams are analysed. The most intuitive evaluation of the effectiveness of distributional approaches to the problem of word generalization is presented in (Grishman and Sterling, 1994). That paper argues that distributional (also called smoothing) techniques introduce a certain degree of additional error, because co-occurrences may be erroneously conflated in a cluster, and some of the co-occurrences being generalized are themselves incorrect. In general, the effect is a higher recall at the price of a lower precision.
Another drawback of these methods is that, since clusters have only a numeric description, they are often hard to evaluate on linguistic grounds.</Paragraph> <Paragraph position="2"> Semantic tagging: Another adopted solution is to generalise the observed word patterns by grouping patterns in which words have the same semantic tag. Semantic tags are assigned from on-line thesauri like WordNet (Basili et al, 1996) (Resnik, 1995), Roget's categories (Yarowsky 1992) (Chen and Chen, 1996), the Japanese BGH (Utsuro et al, 1993), or assigned manually (Basili et al, 1992) 1. The obvious advantage of semantic tags is that words are clustered according to an intuitive principle (they belong to the same concept) rather than to some probabilistic measure. Semantic tagging has proven useful for learning and categorising interesting relations among words, and for systematic lexical learning in sublanguages, as shown in (Basili et al, 1996) and (Basili et al, 1996b).</Paragraph> <Paragraph position="3"> On the other hand, semantic tagging has a serious drawback, which is due not solely to the limited availability of on-line resources, but rather to the entangled structure of thesauri. WordNet and Roget's thesaurus were not conceived, despite their success among researchers in lexical statistics, as tools for automatic language processing. Their purpose was rather to provide linguists with a very refined, general-purpose, linguistically motivated source of taxonomic knowledge.</Paragraph> <Paragraph position="4"> As a consequence, in most on-line thesauri words are extremely ambiguous, with very subtle distinctions among senses.</Paragraph> <Paragraph position="5"> (Dolan, 1994) and (Krovetz and Croft, 1992) claim that fine-grained semantic distinctions are unlikely to be of practical value for many applications.
Our experience supports this claim: often, what matters is to be able to distinguish among contrastive (Pustejovsky, 1995) ambiguities of the bank_river vs. bank_organisation flavour. High ambiguity, entangled nodes, and asymmetry have already been emphasised in (Hearst and Schütze, 1993) as obstacles to the effective use of on-line thesauri in corpus linguistics. In most cases, the noise introduced by overambiguity almost overrides the positive effect of semantic clustering. For example, in (Brill and Resnik, 1994) clustering PP heads according to WordNet synsets produced only a 1% improvement in a PP disambiguation task, with respect to the non-clustered method. There are reported cases in which the use of WordNet worsened the performance of an automatic indexing method. Even context-based sense disambiguation becomes a prohibitive task on a wide-scale basis, because when words in the context of an ambiguous word are replaced by their synsets, there is a multiplication of possible contexts, rather than a generalization. 1 Manually assigning semantic tags is of course rather time-consuming; however, on-line thesauri are not available in many languages, such as Italian.</Paragraph> <Paragraph position="6"> In (Agirre and Rigau, 1996) a method called Conceptual Distance is proposed to reduce this problem, but the reported performance in disambiguation still does not reach 50%.</Paragraph> <Paragraph position="7"> A possible alternative is to manually select a set of high-level tags from the thesaurus. This approach is adopted in (Chen and Chen, 1996) and in (Basili et al, 1996), where only a dozen categories are used. As discussed in the latter paper, high-level tags reduce the problem of overambiguity and allow the detection of more regular behaviours in the analysis of lexical patterns.
On the other hand, high-level tags may be overgeneral, and the acquired lexical rules, while they usually perform well in the task of selecting the correct word associations (for example in PP disambiguation, or sense interpretation), are less capable of filtering out the noise. Overgeneral categories may even fail to capture contrastive ambiguities of words.</Paragraph> <Paragraph position="8"> So far the manual selection of an appropriate set of semantic tags has been a matter of personal intuition, but we believe that this task should be performed in a more principled, and automatic, way.</Paragraph> <Paragraph position="9"> In this paper, we present a method for the selection of the &quot;best-set&quot; of WordNet categories for an effective, domain-tailored, semantic tagging of a corpus. The purpose of the method is to automatically select: * A domain-appropriate set of categories that represents the semantics of the domain well.</Paragraph> <Paragraph position="10"> * A &quot;right&quot; level of abstraction, so as to strike the best balance between overambiguity and overgenerality.</Paragraph> <Paragraph position="11"> * A balanced (for the domain) set of categories, i.e. words should be evenly distributed among categories.</Paragraph> <Paragraph position="12"> The second feature is the most important since, as remarked above, assigning semantic characteristics to words is very useful in lexical learning tasks, but overambiguity is the major obstacle to an effective use of thesauri in semantic tagging.</Paragraph> <Paragraph position="13"> In the following sections, we define a method for the automatic selection of the &quot;best-set&quot; of WordNet categories for nouns, given an application corpus.</Paragraph> <Paragraph position="14"> First, an iterative method is used to create alternative sets of balanced categories. Sets have an increasing level of generality. Second, a scoring function is applied to alternative sets to identify the &quot;best&quot; set.
The score of a set is modelled as a linear function of four performance factors: generality, coverage of the domain, average ambiguity, and discrimination power. An interpolation method is adopted to estimate the parameters of the model against a reference, correctly tagged, corpus (SEMCOR). The performance of the selected set of categories is evaluated in terms of effective reduction of overambiguity.</Paragraph> <Paragraph position="15"> The described method only requires a medium-range (stemmed) application corpus and a thesaurus. The model parameters are tuned against a reference correctly tagged corpus, but this is not strictly necessary if correctly tagged corpora are not available.</Paragraph> </Section> </Paper>