<?xml version="1.0" standalone="yes"?> <Paper uid="W98-0711"> <Title>Automatic Adaptation of WordNet to Sublanguages and to Computational Tasks</Title> <Section position="4" start_page="83" end_page="84" type="metho"> <SectionTitle> 3 Redistribution of words among core categories </SectionTitle> <Paragraph position="0"> The purpose of the method described hereafter is twofold:
* The first is to attempt a reclassification of words that are unclassified, or appear misclassified, with respect to the &quot;original&quot; WordNet.</Paragraph>
<Paragraph position="1"> * The second is to further reduce the ambiguity of words that are still very ambiguous with respect to the &quot;pruned&quot; WordNet. The general idea is that, first, the ambiguity of words is reduced in a specific domain, and the enumeration of all their senses is unnecessary. Second, some words function as sense primers for others. Third, raw contexts of words provide a significant amount of information to guide disambiguation.</Paragraph>
<Paragraph position="2"> To verify this hypothesis systematically, we need to acquire from the corpus a contextual model of the core categories, and then verify to what extent certain &quot;interesting&quot; words (for example, unclassified words) adhere to the contextual model of one of these categories.</Paragraph>
<Paragraph position="3"> Our method, inspired by (Yarowsky, 1992), works as follows (see (Basili et al., 1997) for details):
* Step 1. Select the most typical words in each core category.
* Step 2. Acquire the collective contexts of these words and use them as a (distributional) description of each category.
* Step 3. Use the distributional descriptions to evaluate the (corpus-dependent) membership of each word in the different categories.</Paragraph>
<Paragraph position="4"> Step 1 is carried out by detecting the most significant (and least ambiguous) words in each of the core classes: these sets are called the kernel of the corresponding class. Rather than training the classifier on all the nouns in the learning corpus, as in (Yarowsky, 1992), we select only a subset of prototypical words for each category. We call these words w the salient words of a category C.</Paragraph>
<Paragraph position="5"> The typicality $T_w(C)$ of w in C is modeled by the following ratio:
$$T_w(C) = \frac{N_{w,C}}{N_w} \quad (2)$$</Paragraph>
<Paragraph position="6"> where: $N_w$ is the total number of synsets of a word w, i.e. all the WordNet synonymy sets including w; $N_{w,C}$ is the number of synsets of w that belong to the semantic category C, i.e. synsets indexed with C in WordNet.</Paragraph>
<Paragraph position="7"> The typicality depends only on WordNet. A typical noun for a category C is one that is either unambiguously assigned to C in WordNet, or that has most of its senses (synsets) in C.</Paragraph>
<Paragraph position="8"> The synonymy $S_w(C)$ of w in C, i.e. the degree of synonymy shown by words other than w in the synsets of the class C in which w appears, is modeled by the following ratio:
$$S_w(C) = \frac{O_{w,C}}{O_w} \quad (3)$$
where: $O_w$ is the number of words in the corpus that appear in at least one of the synsets of w; $O_{w,C}$ is the number of words in the corpus appearing in at least one of the synsets of w that belong to C. The synonymy depends both on WordNet and on the corpus. A noun with a high degree of synonymy in C is one with a high number of synonyms in the corpus, with reference to a specific sense (synset) belonging to C. Salient nouns for C are frequent, typical, and have a high synonymy in C.</Paragraph>
<Paragraph position="10"> The salient words w for a semantic category C are thus identified by maximizing the following function, which we call the Score of w in C:
$$Score_w(C) = OA_w \cdot T_w(C) \cdot S_w(C) \quad (4)$$</Paragraph>
<Paragraph position="11"> where $OA_w$ is the number of absolute occurrences of w in the corpus. The value of Score depends both on the corpus and on WordNet; $OA_w$ obviously depends on the corpus.</Paragraph>
<Paragraph position="12"> The kernel of a category, kernel(C), is the set of salient words w with a &quot;high&quot; $Score_w(C)$. In Table 1, some kernel words for the class &quot;gathering, assemblage&quot; are reported.</Paragraph>
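<Paragraph> The selection of kernel words in Step 1 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the dictionary-based WordNet and corpus lookups (synsets_of, categories_of, members_of, occurrences, corpus_words), the multiplicative combination of the three factors following equation (4), and the kernel threshold are all assumptions made for illustration.

# Step 1 sketch: compute typicality, synonymy and Score, then select kernel(C).
def typicality(w, C, synsets_of, categories_of):
    """T_w(C) = N_{w,C} / N_w: share of w's synsets indexed with category C."""
    synsets = synsets_of.get(w, set())
    if not synsets:
        return 0.0
    n_wc = sum(1 for s in synsets if C in categories_of.get(s, set()))
    return n_wc / len(synsets)

def synonymy(w, C, synsets_of, categories_of, members_of, corpus_words):
    """S_w(C) = O_{w,C} / O_w: corpus words sharing a synset with w,
    restricted to the synsets of w that belong to C."""
    o_w, o_wc = set(), set()
    for s in synsets_of.get(w, set()):
        in_corpus = {x for x in members_of.get(s, set()) if x != w and x in corpus_words}
        o_w |= in_corpus
        if C in categories_of.get(s, set()):
            o_wc |= in_corpus
    return len(o_wc) / len(o_w) if o_w else 0.0

def score(w, C, occurrences, synsets_of, categories_of, members_of, corpus_words):
    """Score_w(C): high for words that are frequent, typical and synonym-rich in C."""
    return (occurrences.get(w, 0)
            * typicality(w, C, synsets_of, categories_of)
            * synonymy(w, C, synsets_of, categories_of, members_of, corpus_words))

def kernel(C, candidates, threshold, occurrences,
           synsets_of, categories_of, members_of, corpus_words):
    """kernel(C): candidate words whose Score_w(C) exceeds a chosen threshold."""
    return {w for w in candidates
            if score(w, C, occurrences, synsets_of, categories_of,
                     members_of, corpus_words) > threshold}

In practice the threshold (or an equivalent top-n cut) controls how &quot;high&quot; a Score must be for a word to enter the kernel.</Paragraph>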
<Paragraph position="13"> Step 2 uses the kernel words to build (as in (Yarowsky, 1992)) a probabilistic model of a class: this model is based on the distribution of class relevance of the surrounding terms in typical contexts.</Paragraph>
<Paragraph position="14"> In Step 3 a word is assigned to one, or more, classes according to the contexts in which it appears. Many contexts may enforce the selection of a given class, or multiple classifications are possible when different contexts suggest independent classes. For a given word w, and for each category C, we evaluate the following function, that we call Domain Sense (DSense(w, C)):
$$DSense(w, C) = \frac{1}{|K_w|} \sum_{k \in K_w} Ev(k, C) \quad (5)$$</Paragraph>
<Paragraph position="15"> where
$$Ev(k, C) = \sum_{x \in k} \log \frac{Pr(x|C) \cdot Pr(C)}{Pr(x)} \quad (6)$$</Paragraph>
<Paragraph position="16"> where the k's are the contexts of w ($K_w$ is the set of such contexts), and x is a generic word in k.</Paragraph>
<Paragraph position="17"> In (6), Pr(C) is the (not uniform) probability of a class C, given by the ratio between the number of collective contexts for C (i.e. those collected around the kernel words of C) and the total number of collective contexts.</Paragraph>
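<Paragraph> Steps 2 and 3 can likewise be sketched in Python. This is a simplified, Yarowsky-style rendering rather than the authors' exact formulation: the add-k smoothing, the bag-of-words context representation, and the averaging of the log-evidence of equation (6) over the contexts of w are illustrative assumptions.

import math
from collections import Counter

def build_class_model(contexts_by_category, smoothing=0.1):
    """Step 2 sketch: estimate Pr(x|C), Pr(C) and Pr(x) from the collective
    contexts (lists of co-occurring nouns) gathered around kernel(C)."""
    class_counts = {C: Counter(x for k in ctxs for x in k)
                    for C, ctxs in contexts_by_category.items()}
    vocab = {x for counts in class_counts.values() for x in counts}
    total_ctx = sum(len(ctxs) for ctxs in contexts_by_category.values())
    prior = {C: len(ctxs) / total_ctx               # Pr(C): share of collective contexts
             for C, ctxs in contexts_by_category.items()}
    global_counts = sum(class_counts.values(), Counter())
    global_total = sum(global_counts.values())

    def p_x(x):                                     # Pr(x), smoothed
        return (global_counts[x] + smoothing) / (global_total + smoothing * len(vocab))

    def p_x_given_c(x, C):                          # Pr(x|C), smoothed
        total_c = sum(class_counts[C].values())
        return (class_counts[C][x] + smoothing) / (total_c + smoothing * len(vocab))

    return prior, p_x_given_c, p_x, vocab

def dsense(contexts_of_w, C, prior, p_x_given_c, p_x, vocab):
    """Step 3 sketch: average over the contexts k of w the evidence
    sum_x log(Pr(x|C) * Pr(C) / Pr(x)), as in equation (6)."""
    if not contexts_of_w:
        return float("-inf")
    evidence = sum(sum(math.log(p_x_given_c(x, C) * prior[C] / p_x(x))
                       for x in k if x in vocab)
                   for k in contexts_of_w)
    return evidence / len(contexts_of_w)

A word w can then be assigned to every category C whose DSense(w, C) exceeds a reference value, for instance the average DSense of the kernel words of C, as done in the experiments below.</Paragraph>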
</Section> <Section position="5" start_page="84" end_page="85" type="metho"> <SectionTitle> 4 Discussion of the experiment </SectionTitle> <Paragraph position="0"> In this section we describe some preliminary results of an experiment conducted on the Wall Street Journal. We used 21 categories, including 14 core categories plus 7 additional categories obtained with automatic extension of the best core set (see Section 2). In experiment 1, we selected the 6 most frequent unclassified words in the corpus, and attempted a reclassification according to the contextual description of the 21 categories. In experiment 2, we selected the 6 most frequent and still very ambiguous (according to the pruned WordNet) words, and attempted a reduction of ambiguity. For each word w and each category C, we compute DSense(w, C) and then select only those senses that exhibit a membership value higher than the average membership of the kernel words of C. The assignment of a word to a category is performed regardless of the current classification of w in the pruned WordNet.</Paragraph>
<Paragraph position="1"> Table 2 summarizes the results of experiment 1; Table 3 reports on experiment 2. In column 3, the selected categories are reported in decreasing order of class membership evidence.</Paragraph>
<Paragraph position="2"> In Table 2, notice the apparently &quot;strange&quot; classification of wall. The problem is that, in the current version of our system, proper nouns are not correctly detected (this problem will be fixed shortly), since in the Wall Street Journal there is no special syntactic tag for proper names. Erroneously, several proper names, such as Wall Street, Wall Street Journal, Bush, Delta, Apple, etc., were initially classified as common nouns, therefore causing some noise in the data that we now need to eliminate (for example, the additional category natural_object was created because of the high frequency of spurious nouns such as apple, delta, bush, etc.).</Paragraph>
<Paragraph position="3"> The word wall is in fact part of the complex nominals Wall Street and Wall Street Journal, and it is very interesting that, based on the context, the system classifies it correctly in the three categories gathering, written_communication, and organization.</Paragraph>
<Paragraph position="4"> Notice that the category &quot;gathering, assemblage&quot; has a somewhat unintuitive label, but in the WSJ domain this class includes rather uniform words, most of which refer to political organizations, as shown in Table 1.</Paragraph>
<Paragraph position="5"> Table 3 shows that some reduction of ambiguity is often possible. However, some spurious senses survive, for example the progenitor (person) sense of stock. It is very important that, in all the analyzed cases, the selected classes are a subset of the initial WordNet classes: remember that the assignment of a word to a category is performed only on the basis of its computed membership in that category. There is one example of an additionally detected sense (not included in the pruned WordNet), i.e. the sense creation for the word hood. Typical (for the domain) words in this class are: plan, yield, software, magazine, journal, issue, etc.; therefore, the creation sense seems appropriate.</Paragraph>
<Paragraph position="6"> Clearly, we need to perform a larger-scale experimentation, but the first results seem encouraging. A large-scale experiment requires, besides a better tuning of the statistical parameters and the fixing of some obvious bugs (e.g. the identification of proper nouns), the preparation of a test set in which the correct classification of a large number of words is verified manually in the actual corpus contexts. Finally, experiments should be extended to domains other than the Wall Street Journal. We have already experimented with the algorithm for core category selection on a UNIX corpus and on a small Natural Science corpus, but extending the complete experiment to other corpora is not trivial because of the intensive linguistic and statistical corpus processing required.</Paragraph> </Section> </Paper>