<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1055"> <Title>Automatic Selection of Class Labels from a Thesaurus for an Effective Semantic Tagging of Corpora.</Title> <Section position="3" start_page="381" end_page="381" type="metho"> <SectionTitle> 2 Selection of Alternative Sets of Semantic Categories from WordNet </SectionTitle> <Paragraph position="0"> The first step of the method is generating alternative sets of WordNet categories. Alternative sets are selected according to the following principles:
* Balanced categories: words must be uniformly distributed among the categories of a set;
* Increasing level of generality: alternative sets are selected by uniformly increasing the level of generality of the categories belonging to a set;
* Domain-appropriateness: the categories selected in a set are those pointed to by (an increasingly large number of) words of the application domain, weighted by their frequency in the corpus.</Paragraph> <Paragraph position="1"> The set-generation algorithm is an iterative application of the algorithm proposed in (Hearst and Schütze, 1993) for creating WordNet categories of a fixed average size. In its modified version, the algorithm is as follows2. Let S be a set of WordNet synsets s, W the set of different words (nouns) in the corpus, P(s) the number of words in W that are instances of s, weighted by their frequency, UB and LB the upper and lower bounds for P(s), and N, h and k constant values.</Paragraph> <Paragraph position="2">
i=1; UB=N; LB=UB*h;
repeat {
    initialise the set of categories Ci with the empty set;
    new_cat(S);
    if i=1 or Ci != Ci-1 then add Ci to the set of alternatives;
    i=i+1; UB=UB+k; LB=UB*h;
} until Ci is an empty set;
where:
new_cat(S): for any category s of S {
    if s does not belong to Ci then {
        if LB <= P(s) <= UB
            then put s in the set Ci
        else if P(s) > UB
            then new_cat(children(s))
        else add s to SCT(Ci)
    }
}</Paragraph> <Paragraph position="3"> 2The procedure new_cat(S) is almost the same as in (Hearst and Schütze, 1993). For the sake of brevity, the algorithm is not further explained here. N, h and k are the initial parameters of the algorithm. We experimentally observed that only h (the ratio between the lower and upper bound) significantly modifies the resulting sets of categories (Ci): we established that a good compromise is h=0.4. SCT(Ci) is the set of "smaller" WordNet categories with P(s)<LB that do not belong to the Ci set (see next section).</Paragraph> </Section>
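As an illustration, the set-generation loop above can be sketched in Python. The data layout (a table P of frequency-weighted populations and a child map over the noun hierarchy) is an assumption made for illustration; the paper operates directly on WordNet.

def new_cat(synsets, P, children, UB, LB, Ci, SCT):
    # examine each candidate category not already accepted
    for s in synsets:
        if s in Ci:
            continue
        if LB <= P.get(s, 0) <= UB:
            Ci.add(s)                           # balanced category: accept it
        elif P.get(s, 0) > UB:
            # overpopulated: branch down to the descendants
            new_cat(children.get(s, []), P, children, UB, LB, Ci, SCT)
        else:
            SCT.add(s)                          # "smaller" category: P(s) < LB

def generate_alternative_sets(tops, P, children, N=2000, h=0.4, k=1000):
    alternatives = []
    i, UB, prev = 1, N, None
    while True:
        Ci, SCT = set(), set()
        new_cat(tops, P, children, UB, UB * h, Ci, SCT)
        if not Ci:
            break                               # loop ends when Ci is empty
        if i == 1 or Ci != prev:
            alternatives.append((UB, frozenset(Ci), frozenset(SCT)))
        prev, i, UB = Ci, i + 1, UB + k
    return alternatives

The termination condition mirrors the paper's loop: once LB = UB*h exceeds the population of even the topmost synset, no category fits and the iteration stops.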
<Section position="4" start_page="381" end_page="382" type="metho"> <SectionTitle> 3 Scoring Alternative Sets of Categories </SectionTitle> <Paragraph position="0"> The algorithm of section 2 creates alternative sets of balanced and increasingly general categories Ci. We now need a scoring function to evaluate these alternatives. The following performance factors have been selected to express the scoring function:
Generality: in principle, we would like to represent the semantics of the domain using the highest possible level of generalisation. We can express the generality G'(Ci) as 1/DM(Ci), DM(Ci) being the average distance between the categories of Ci and the WordNet topmost synsets. Due to the graph structure of WordNet, different paths may connect each element cij of Ci with different topmosts; therefore we compute DM(Ci) as:
DM(Ci) = (1/card(Ci)) * SUM_j dm(cij)
where dm(cij) is the average distance of each cij from the topmosts.
[Figure 1 - a possible synset hierarchy in which, for Ci = {ci1, ci2}, dm(ci1) = (4+3)/2, the average of the two path lengths connecting ci1 to the topmosts.]
As defined, G'(Ci) is a linear function (low values for low generality, high values for high generality), whilst our goal is to mediate at best between overspecificity and overgenerality. Therefore, we model the generality as G(Ci) = G'(Ci)*Gauss(G'(Ci)), where Gauss(G'(Ci)) is a Gaussian distribution function computed using the mean and the variance of the G'(Ci) values over all the sets of categories Ci selected by the algorithm in section 2, normalised in the [0,1] interval.</Paragraph> <Paragraph position="1"> Coverage: for any set Ci, the algorithm of section 2 does not guarantee full coverage of the nouns in the domain. Given a selected pair <UB, LB>, it may well be the case that several words are not assigned to any category, because when branching from an overpopulated category to its descendants, some of the descendants may be underpopulated. Each iterative step that creates a Ci also creates a set of underpopulated categories SCT(Ci). To ensure full coverage, these categories may be added to Ci, or alternatively they can be replaced by their direct ancestors, but clearly a "good" selection of Ci is one that minimises this problem. The coverage CO(Ci) is therefore defined as the ratio Nc(Ci)/card(W), where Nc(Ci) is the number of words that reach at least one category of Ci.</Paragraph> <Paragraph position="2"> Discrimination Power: a certain selection of categories may not allow a full discrimination of the lowest-level senses of a word (leaf-synsets hereafter). Figure 2 illustrates an example. If Ci = {ci1, ci2, ci3, ci4}, w2 cannot be fully disambiguated by any sense selection algorithm, because two of its leaf-synsets belong to the same category ci2. With respect to w2, ci2 is overgeneral (though nothing can be said about the actual importance of discriminating between these two synsets). We measure the discrimination power DP(Ci) as the ratio (Nc(Ci)-Npc(Ci))/Nc(Ci), where Nc(Ci) is the number of words that reach at least one category of Ci, and Npc(Ci) is the number of words that have at least two leaf-synsets that reach the same category cij of Ci. For the example of figure 2, DP(Ci) = (3-1)/3 = 0.66.</Paragraph> <Paragraph position="3"> Ambiguity: tagging the corpus with a set Ci reduces the initial ambiguity of the corpus. In part, this is because there are leaf-synsets that converge into a single category of the set; in part, because there are leaf-synsets of a word that do not reach any of these categories. This phenomenon is accounted for by the inverse of the average ambiguity A(Ci), measured as:
A(Ci) = (1/Nc(Ci)) * SUM_j Cwj(Ci)
where Nc(Ci) is the number of words that reach at least one category of Ci and, for each word wj in this set, Cwj(Ci) is the number of categories of Ci reached. In figure 2, the average ambiguity is 2 for the set Ci = {ci1, ci2, ci3, ci4}.</Paragraph>
[Figure 3 - Performance factors G, CO, DP and 1/A, for each generated set of categories.]
<Paragraph position="4"> The scoring function for a set of categories Ci is defined as the linear combination of the performance factors described above:
(1) Score(Ci) = αG(Ci) + βCO(Ci) + γDP(Ci) + δ(1/A(Ci))
Notice that we assigned a positive effect on the score (modelled by 1/A) to the ability of eliminating certain leaf-synsets, and a negative effect (modelled by DP) to the inability of discriminating among certain other leaf-synsets. This is reasonable in general, because our aim is to control overgenerality while reducing overambiguity. However, nothing can be said about the appropriateness of a specific sense aggregation and/or sense elimination for a word. It may well be the case that merging two senses into a single category is a reasonable thing to do, if the senses do not draw distinctions that are interesting for the domain; conversely, eliminating a priori a sense of a word may be inappropriate in the domain.</Paragraph> <Paragraph position="5"> Formula (1) is computed for all the generated sets of categories Ci, and then normalised in the [0,1] interval. The effectiveness of this model is estimated in the following section.</Paragraph> </Section>
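A minimal sketch of the four performance factors and of the scoring function (1), under the assumption that for each word we know which categories of Ci it reaches and into which category (if any) each of its leaf-synsets falls; the names and data layout are illustrative, not the paper's implementation.

import math

def performance_factors(reach, leaf_cat, n_words, dm_avg):
    # reach[w]: set of categories of Ci reached by word w
    # leaf_cat[w]: category reached by each leaf-synset of w (None if none)
    covered = [w for w in reach if reach[w]]
    nc = len(covered)
    if nc == 0:
        return 1.0 / dm_avg, 0.0, 0.0, float('inf')
    co = nc / n_words                                   # coverage CO(Ci)
    # Npc: words with two leaf-synsets falling into the same category
    npc = sum(1 for w in covered
              if len([c for c in leaf_cat[w] if c is not None]) >
                 len({c for c in leaf_cat[w] if c is not None}))
    dp = (nc - npc) / nc                                # discrimination power DP(Ci)
    a = sum(len(reach[w]) for w in covered) / nc        # average ambiguity A(Ci)
    g_prime = 1.0 / dm_avg                              # G'(Ci) = 1/DM(Ci)
    return g_prime, co, dp, a

def score(g_prime, co, dp, a, mu, var, alpha=1.0, beta=0.5, gamma=1.0, delta=1.0):
    # G(Ci) = G'(Ci) * Gauss(G'(Ci)); mu and var are the mean and variance
    # of the G'(Ci) values over all candidate sets, as in section 3
    gauss = math.exp(-((g_prime - mu) ** 2) / (2.0 * var))
    return alpha * g_prime * gauss + beta * co + gamma * dp + delta / a

The Gaussian factor implements the stated design choice: G(Ci) is damped for sets that are far more (or far less) general than the average candidate, steering the score away from both overspecific and overgeneral selections.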
<Section position="5" start_page="382" end_page="386" type="metho"> <SectionTitle> 4 Evaluation Experiments and Discussion of the Data </SectionTitle> <Paragraph position="0"> The algorithm was applied to the 10,235 different nouns of the Wall Street Journal (hereafter WSJ) corpus that are classified in WordNet. Categories were generated with h=0.4 and k=1,000. The cardinality of each set varies, though not uniformly, from 456 categories for UB=2,000 (remember that words are frequency-weighted) to 1 category (i.e. the topmost entity) for UB=264,000. Medium-high level categories (those between 50,000 and 100,000 maximum words) range between 10 and 20 members per set.</Paragraph> <Paragraph position="1"> Figure 3 plots the values of G, CO, DP and 1/A for the different sets of categories generated by the algorithm of section 2. Alternative sets of categories are identified by their upper bound3. The figure shows that DP(Ci) has a regularly decreasing behaviour, while 1/A(Ci) is less regular. The coverage CO(Ci) has a rather unstable behaviour, due to the entangled structure of WordNet. We attempted slight changes in the definitions and computation of CO, DP and 1/A (for example, weighting words by their frequency), but globally the behaviours remain as in figure 3.
3Remember that words are weighted by their frequency in the corpus. This seems reasonable, but in any case we observed that our results do not vary when counting each word only once.</Paragraph> <Paragraph position="2"> To compute the score of each set Ci, the parameters α, β, γ and δ in (1) must be estimated. To perform this task, we adopted a linear interpolation method, using SEMCOR (the semantically tagged Brown Corpus) as a reference corpus. In SEMCOR every word is unambiguously tagged with its leaf-synset.</Paragraph> <Paragraph position="3"> To build a reference scoring function against which to evaluate our model parameters, we proceeded as follows:
* Since our categories are generated for an economic domain (WSJ) while SEMCOR is a tagged balanced corpus (the Brown Corpus), we extracted only the fragment of the corpus dealing with economic and financial texts. We obtained a reference corpus including 475 of the 10,235 nouns of the WSJ corpus.
* For each set of categories Ci generated by the algorithm in section 2, we computed on the reference corpus the following two performance figures:
Precision: For each Ci, let W(Ci) be the set of words in the reference corpus covered by the set Ci. For each wk in W(Ci), let S(wk) be the total set of leaf-synsets of wk in WordNet, SR(wk) the subset of leaf-synsets of wk found in the reference corpus, and SC(wk) the subset of leaf-synsets that reach some of the categories of Ci.</Paragraph>
<Paragraph position="4"> Let WR(Ci) ⊆ W(Ci) be the set of wk having SC(wk) ⊂ S(wk). Following the algorithm:
for any wk in WR(Ci) {
    Ntot = Ntot + freq(wk);
    if SR(wk) ⊆ SC(wk) then N+ = N+ + freq(wk)
}
where freq(wk) is the number of occurrences of wk in the reference corpus, the precision Precision(Ci) is then defined as N+/Ntot. The precision measures the ability of each set Ci to correctly prune out some of the senses of the words in W(Ci).</Paragraph> <Paragraph position="5"> Global reduction of ambiguity: For each Ci, let S(Wi) be the total set of WordNet leaf-synsets reached by the words in WR(Ci), and SC(Wi) ⊆ S(Wi) the set of these synsets that reach some category in Ci. By tagging the corpus with Ci, we obtain a reduction of ambiguity measured by:
GRAmb(Ci) = 1 - card(SC(Wi))/card(S(Wi))
where card(X) is the number of elements in the set X.</Paragraph> <Paragraph position="6"> Starting from these two performance figures, the global performance function Perf(Ci) is measured by:
(2) Perf(Ci) = Precision(Ci) + GRAmb(Ci)
Formula (2) is computed for all the generated sets of categories Ci, and then normalised in the [0,1] interval. The obtained plot is the reference against which we apply a standard linear interpolation method to estimate the values of the model parameters α, β, γ and δ that minimise the difference between the values of the two functions for each Ci. In figure 4a the (not normalised) Precision and GRAmb are plotted for the test corpus. In figure 4b the normalised reference performance function and the "best fitting" scoring function are shown, with the estimated values of α, β, γ and δ.</Paragraph> <Paragraph position="7"> While the reference function has a peak on the class set Cj with UB=55,000 and the score function assigns the maximum value to the class set Ck with UB=62,000, the performance of the sets in the range j-k is very similar.
* The reference corpus is much smaller than our learning corpus. This may well cause a shift of the reference scoring function, as compared with the "real" scoring function.
* In any case, figure 4a shows that the sets Ci have peak performances in the range 50,000-100,000. In this range, the precision is around 73-76%, and the reduction of ambiguity is around 35%, which are both valuable results. We also verified that, by slightly changing the model parameters and/or the definitions of the four performance figures in (1), the peak performance of the obtained scoring function still falls in the 50,000-100,000 interval, and the function stays high around the peak, with local maxima.
* In other domains (see a brief summary in the concluding remarks) for which we did not have a reference tagged corpus, we used (α=1, β=0.5, γ=1, δ=1) as model parameters in (1), and still observed a scoring function similar in shape to that of figure 4b. The selected categories vary according to the domain, but the size of the best set stays around 10-20 categories. Evaluation is of course more problematic due to the absence of a tagged reference corpus.</Paragraph> <Paragraph position="8"> Therefore, we may conclude that the method is "robust", in the sense that it correctly identifies a range of reasonable choices for the set of categories to be used, eventually leaving the final choice to a linguist.</Paragraph>
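The parameter estimation described above can be realised, for instance, as an ordinary least-squares fit. The sketch below uses numpy's lstsq as one concrete stand-in for the linear interpolation method; the factor values and Perf numbers are placeholders, not the paper's data.

import numpy as np

# one row per candidate set Ci: [G(Ci), CO(Ci), DP(Ci), 1/A(Ci)]
X = np.array([[0.31, 0.88, 0.72, 0.45],    # placeholder factor values
              [0.42, 0.91, 0.66, 0.52],
              [0.55, 0.85, 0.60, 0.61]])
perf = np.array([0.58, 0.71, 0.66])        # normalised Perf(Ci) on SEMCOR

# estimate (alpha, beta, gamma, delta) minimising ||X @ params - perf||;
# in practice there are many more candidate sets than parameters
params, *_ = np.linalg.lstsq(X, perf, rcond=None)
scores = X @ params                        # fitted Score(Ci) for each set
best = int(np.argmax(scores))              # index of the best-scoring set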
<Paragraph position="9"> As for the WSJ corpus, a short analysis of the linguistic data may be useful. In figure 5 the 14 "best" selected categories for nouns are listed.</Paragraph> <Paragraph position="10"> Figure 6 shows four very frequent and very ambiguous words in the domain: bank, business, market and stock, with the attached list of synsets as generated by WordNet, ordered from left to right by increasing level of generality (leaf-synset leftmost). The senses marked with '*' are those that reach some of the categories (marked in bold in the figure) of the best-performing set, selected by the scoring function (1). For bank and market, we observed that the less plausible (for the domain) senses ridge and market_grocery_store are pruned out. The word stock retains only 5 out of 16 senses. Of these, the gunstock and progenitor senses should have been dropped as well, but 11 senses are correctly pruned, like liquid, caudex, plant, etc. The word business still keeps its ambiguity, but the 9 subtle distinctions of WordNet are reduced to 7 senses.</Paragraph> <Paragraph position="11"> Figure 6 (excerpt) - the WordNet sense listing for stock:
Sense n.3: stock, inventory -> merchandise, wares, product -> ...
Sense n.4 (*): stock_certificate, stock -> security, certificate -> legal_document, legal_instrument, official_document, instrument -> document, written_document, papers -> writing, written_material -> written_communication, written_language -> ...
Sense n.5 (*): store, stock, fund -> accumulation -> asset -> possession
Sense n.6 (*): stock -> progenitor, primogenitor -> ancestor, ascendant, ascendent, antecedent -> relative, relation -> person, individual, someone, mortal, human, soul -> ...
Sense n.7: broth, stock -> soup -> ...
Sense n.8: stock, caudex -> stalk, stem -> ...
Sense n.9: stock -> plant_part -> ...
Sense n.10: stock, gillyflower -> flower -> ...
Sense n.11: Malcolm_stock, stock -> flower -> ...
Sense n.12: lineage, line, line_of_descent, descent, bloodline, blood_line, blood, pedigree, ancestry, origin, parentage, stock -> genealogy, family_tree -> ...
Sense n.13: breed, strain, stock, variety -> animal_group -> ...
Sense n.14: stock -> lumber, timber -> ...
Sense n.15: stock -> handle, grip, hold -> ...
Sense n.16: neckcloth, stock -> cravat -> ...</Paragraph> </Section>
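The pruning illustrated in figure 6 can be reproduced in spirit with NLTK's WordNet interface: a sense survives if one of its hypernym paths passes through a selected category. The category synsets below are hypothetical stand-ins for the best-performing set, and modern WordNet versions number and group senses differently from the version used in the paper.

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

# hypothetical stand-ins for the selected categories, not the paper's set
categories = {wn.synset('person.n.01'), wn.synset('food.n.01')}

def surviving_senses(word, categories):
    kept = []
    for sense in wn.synsets(word, pos=wn.NOUN):
        # a sense survives if any hypernym path reaches a selected category
        if any(cat in path
               for path in sense.hypernym_paths()
               for cat in categories):
            kept.append(sense)
    return kept

for s in surviving_senses('stock', categories):
    print(s.name(), '-', s.definition())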
<Section position="6" start_page="386" end_page="386" type="metho"> <SectionTitle> 5 Concluding Remarks </SectionTitle> <Paragraph position="0"> It has already been demonstrated in (Basili et al., 1996) that tagging a corpus with semantic categories triggers a more effective lexical learning. However, the overambiguity of on-line thesauri is known to be the major obstacle to automatic semantic tagging of corpora. The method presented in this paper allows an efficient and simple selection of a flat set of domain-tuned categories that dramatically reduces the initial overambiguity of the thesaurus. We measured a 73% precision in reducing the initial ambiguity, and a 37% global reduction of ambiguity. Significantly, our method selects a limited number of categories (10-20, depending upon the learning corpus and the model parameters) out of the initial 47,110 leaf-synsets of WordNet.</Paragraph> <Paragraph position="1"> We remark that our experiment is large-scale, meaning that we automatically evaluated the performance of the model on a large set of nouns taken from the Wall Street Journal. Most sense disambiguation or semantic tagging methods evaluate their performance manually, against a few very ambiguous cases with clear distinctions among senses. Instead, WordNet draws very subtle and fine-grained distinctions among words. We believe that our results are very encouraging.</Paragraph> <Paragraph position="2"> The model parameters for category selection have been tuned on SEMCOR, but a correctly tagged corpus is not strictly necessary. In our experiments, we applied a scoring function similar to that obtained for the Wall Street Journal to two other domains, a corpus of airline reservations and the Unix handbook. We do not discuss the data here for the sake of space. The method consistently selects a set of categories at the medium-high level of generality, different for each domain. The selection "seems" good according to our linguistic intuition of the domains, but the absence of a correctly tagged corpus does not allow a large-scale evaluation.</Paragraph> <Paragraph position="3"> In the future, we plan to demonstrate that the method proposed in this paper, besides reducing the overambiguity of on-line thesauri, improves the performance of lexical learning methods that are based on semantic tagging, such as PP disambiguation, case frame acquisition and sense selection, with respect to a non-optimal choice of semantic categories.</Paragraph> </Section> </Paper>