<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0106"> <Title>Customizing a Lexicon to Better Suit a Computational Task</Title> <Section position="5" start_page="55" end_page="57" type="metho"> <SectionTitle> 2 Creating Categories from WordNet </SectionTitle> <Paragraph position="0"> We would like to decompose the WordNet noun hierarchy into a set of disjoint categories, each consisting of a relatively large number of synsets. (This is necessary for the text-labeling task, because each topic must be represented by many different terms.) The goal is to create categories of a particular average size with as small a variance as possible.</Paragraph> <Paragraph position="1"> There is some limit as to how small this variance can be, because there are several synsets 3Actually, the hyponymy relation is a directed acyclic graph, in that a minority of the nodes are children of more than one parent. We will at times refer to it as a hierarchy nonetheless.</Paragraph> <Paragraph position="2"> [Figure 1: pseudocode for the category-creation procedure. Only its opening lines survived extraction: &quot;for each synset N in the noun hierarchy: a_cat(N)&quot; and &quot;a_cat(N): if N has not been entered in a category ...&quot;.]</Paragraph> <Paragraph position="4"> that have a very large number of children (there are sixteen nodes with a branching factor greater than 100). This primarily occurs with synsets of a taxonomic flavor, e.g., mushroom species and the languages of the world. There are two other reasons why it is not straightforward to find uniformly sized, meaningful categories: (i) There is no explicit measure of semantic distance among the children of a synset.</Paragraph> <Paragraph position="5"> (ii) The hierarchy is not balanced, i.e., the depth from root to leaf varies dramatically throughout the hierarchy, as does the branching factor. (The hierarchy has ten root nodes; on average their maximum depth is 10.5 and their minimum depth is 2.) Reason (ii) rules out a strategy of traveling down a uniform depth from the root or up a uniform height from the leaves in order to achieve uniform category sizes.</Paragraph> <Paragraph position="6"> The algorithm used here is controlled by two parameters: upper and lower bounds on the category size (see Figure 1). For example, setting the lower bound to 25 and the upper bound to 60 yields categories with an average size of 58 members.</Paragraph> <Paragraph position="7"> An arbitrary node N in the hierarchy is chosen, and if it has not yet been registered as a member of a category, the algorithm checks to see how many unregistered descendants it has. In every case, if the number of descendants is too small, the assignment to a category is deferred until a node higher in the hierarchy is examined (unless the node has no parents). This helps avoid extremely small categories, which are especially undesirable.</Paragraph> <Paragraph position="8"> If the number of descendants of N falls within the boundaries, the node and its unregistered descendants are bundled into a new category, marked, and assigned a label which is derived from the synset at N. If N has too many descendants, that is, the count of its unmarked descendants exceeds the upper bound, then each of its immediate children is checked in turn: if the child's descendant count falls between the boundaries, then the child and its descendants are bundled into a category. If the child and its unmarked descendants exceed the upper bound, then the procedure is called recursively on the child. Otherwise, the child is too small and is left alone.
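To make this concrete, the following is a minimal Python sketch of the bundling procedure described above and completed in the next paragraph. It is a reconstruction under stated assumptions, not the authors' code: WordNet access is abstracted as parent/child dictionaries over synset identifiers, and the names LOWER, UPPER, bundle, and unmarked_descendants are ours.

    # Bounds on category size, as in Section 2 (lower 25, upper 60).
    LOWER, UPPER = 25, 60
    category_of = {}   # synset id -> category label, once assigned ("marked")

    def unmarked_descendants(node, children):
        """All descendants of node not yet entered into a category."""
        found, stack = set(), list(children.get(node, ()))
        while stack:
            n = stack.pop()
            if n not in found and n not in category_of:
                found.add(n)
                stack.extend(children.get(n, ()))
        return found

    def bundle(node, children):
        """The 'mark' step: put node and its unmarked descendants into a
        new category whose label derives from the synset at node."""
        for m in unmarked_descendants(node, children) | {node}:
            category_of[m] = node

    def a_cat(node, children, parents):
        if node in category_of:
            return
        count = len(unmarked_descendants(node, children))
        if count < LOWER:                     # too small: defer to an ancestor,
            if not parents.get(node):         # unless there is no ancestor
                bundle(node, children)
            return
        if count <= UPPER:                    # within bounds: make a category
            bundle(node, children)
            return
        for child in children.get(node, ()):  # too large: split among children
            if child in category_of:
                continue
            c = len(unmarked_descendants(child, children))
            if LOWER <= c <= UPPER:
                bundle(child, children)
            elif c > UPPER:
                a_cat(child, children, parents)
            # else: the child is too small and is left alone
        # Recount, since children may now have been bundled; keep the leftover
        # only if it is big enough or nothing above can subsume it.
        if len(unmarked_descendants(node, children)) >= LOWER or not parents.get(node):
            bundle(node, children)

    # Driver, as in Figure 1: for each synset N in the noun hierarchy, a_cat(N).
    # for n in all_synsets: a_cat(n, children, parents)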
After all of N's children have been processed, the category that N will participate in has been made as small as the algorithm will allow. There is a chance that N and its unmarked descendants will now make a category that is too small, and if this is the case, N is left alone, and a higher-up node will eventually subsume it (unless N has no parents remaining). Otherwise, N and its remaining unmarked descendants are bundled into a category.</Paragraph> <Paragraph position="9"> If N has more than one parent, N can end up assigned to the category of any of its parents (or none), depending on which parent was accessed first and how many unmarked children it had at the time, but each synset is assigned to only one category.</Paragraph> <Paragraph position="10"> The function &quot;mark&quot; places the synset and all its descendants that have not yet been entered into a category into a new category. Note that #descendants is recalculated in the third-to-last line in case any of the children of N have been entered into categories.</Paragraph> <Paragraph position="11"> In the end there may be isolated small pieces of hierarchy that aren't stored in any category, but this can be fixed by a cleanup pass, if desired.</Paragraph> </Section> <Section position="6" start_page="57" end_page="58" type="metho"> <SectionTitle> 3 A Topic Labeler </SectionTitle> <Paragraph position="0"> We are using a version of the disambiguation algorithm described in \[21\] to assign topic labels to coherent passages of text. Yarowsky defines word senses as the categories listed for a word in Roget's Thesaurus (Fourth Edition), where a category is something like TOOLS/MACHINERY. For each category, the algorithm * Collects contexts that are representative of the category.</Paragraph> <Paragraph position="1"> * Identifies salient words in the collective contexts and determines the weight for each word.</Paragraph> <Paragraph position="2"> * Uses the resulting weights to predict the appropriate category for a word occurring in a novel context.</Paragraph> <Paragraph position="3"> The proper use of this algorithm is to choose among the categories to which a particular ambiguous word can belong, based on the lexical context that surrounds a particular instance of the word.</Paragraph> <Paragraph position="4"> In our implementation of the algorithm, the 726 categories derived from WordNet, as described in the previous section, are used instead of Roget's categories, because the latter are not publicly available online. Training is performed on Grolier's American Academic Encyclopedia (~ 8.7M words).</Paragraph> <Paragraph position="5"> The labeling is done as follows: instead of using the algorithm in the intended way, we place probes in the text at evenly spaced intervals and accumulate the scores for each category all the way through the text. The intention is that at the end the highest-scoring categories correspond to the main topics of the text. Below we show the output of the labeler on two well-known texts (made available online by Project Gutenberg).
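Before turning to that output, here is a minimal sketch of the probe-and-accumulate loop just described. It assumes training has already produced a table of salient-word weights per category, in the style of \[21\]; the names weights, probe_step, and window are ours, and the parameter values are illustrative rather than taken from the paper.

    from collections import Counter

    def score_topics(tokens, weights, probe_step=100, window=50, top_n=5):
        """Place probes at evenly spaced positions in the text; at each probe,
        score every category on the surrounding context, and accumulate the
        scores all the way through. The top scorers approximate the main topics.
        weights: {category: {word: weight}} from the training phase."""
        totals = Counter()
        for pos in range(0, len(tokens), probe_step):
            context = tokens[max(0, pos - window): pos + window]
            for category, w in weights.items():
                totals[category] += sum(w.get(t, 0.0) for t in context)
        return totals.most_common(top_n)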
The first column indicates the rank of the category, the second column indicates the score for comparison purposes, and the third column shows the words in the synset at the top-most node of the category (these are not always entirely descriptive, so some glosses are provided in parentheses).</Paragraph> <Paragraph position="6"> Note that although most of the categories are appropriate (with the glaring exception of &quot;professional&quot; in Genesis), there is some redundancy among them, and in some cases they are too fine-grained to indicate main-topic information.</Paragraph> <Paragraph position="7"> In an earlier implementation of this algorithm, the categories were in general larger but less coherent than in the current set. The larger categories resulted in better-trained classifications, but the classes often conflated quite disparate terms. The current implementation produces smaller, more coherent categories. The advantage is that a more distinct meaning can be associated with a particular label, but the disadvantage is that in many cases so few of the words in a category appear in the training data that a weak model is formed. The categories with little distinguishing training data then dominate the labeling scores inappropriately.</Paragraph> <Paragraph position="8"> In the category-derivation algorithm described above, in order to increase the size of a given category, terms must be taken from nodes adjacent in the hierarchy (either descendants or siblings). However, adjacent terms are not necessarily closely related semantically, and so after a point, expanding a category via adjacent terms introduces noise. To remedy this problem, we have experimented with increasing the size of the categories in two different ways: (1) The first approach is to retain the categories in their current form and add semantically similar terms, extracted from corpora independent of WordNet, thus improving the training of the labeling algorithm.</Paragraph> <Paragraph position="9"> (2) The second approach is to determine which categories are semantically related to one another, despite the fact that they come from quite different parts of the hierarchy, and to combine them so that they form schema-like associations.</Paragraph> <Paragraph position="10"> These are described in the next two sections, respectively.</Paragraph> </Section> <Section position="7" start_page="58" end_page="63" type="metho"> <SectionTitle> 4 Augmenting Categories with Relevant Terms </SectionTitle> <Paragraph position="0"> As mentioned above, one way to improve the categories is to expand them with related relevant terms. In this section we show how comparing WordSpace vectors to the derived categories allows us to expand the categories. The first subsection describes the WordSpace algorithm, and the subsequent subsections show how it can be used to augment the derived categories.</Paragraph> <Section position="1" start_page="59" end_page="59" type="sub_section"> <SectionTitle> 4.1 Creating WordSpace from Free Text </SectionTitle> <Paragraph position="0"> WordSpace \[19\] is a corpus-based method for inducing semantic representations for a large number of words (50,000) from lexical cooccurrence statistics. The representations are derived from free text and are therefore highly specific to the text type in question.</Paragraph> <Paragraph position="1"> The medium of representation is a multi-dimensional, real-valued vector space.
The cosine of the angle between two vectors in the space is a continuous measure of their semantic relatedness.</Paragraph> <Paragraph position="2"> Lexical cooccurrence, which is the basis for creating the WordSpace vectors, can be easily measured. However, for a vocabulary of 50,000 words, there are 2,500,000,000 possible cooccurrence counts, a number too large to be computationally tractable. Therefore, letter fourgrams are used here to bootstrap the representations. Cooccurrence statistics are collected for 5,000 selected fourgrams. The 5000-by-5000 matrix used for this purpose is manageable. A vector for a lexical item is then computed as the sum of fourgram vectors that occur close to it in the text.</Paragraph> <Paragraph position="3"> The first step in the creation of WordSpace consists of deriving fourgram vectors that reflect semantic similarity, in the sense of being used to describe the same contexts. Consequently, one needs to be able to compare fourgrams' contexts pairwise. For this purpose, a collocation matrix for fourgrams was collected such that the entry aij counts the number of times that fourgram i occurs at most 200 fourgrams to the left of fourgram j. Two columns in this matrix are similar if the contexts the corresponding fourgrams are used in are similar. The counts were determined using five months of the New York Times (June - October 1990). The resulting collocation matrix is dense: only 2% of the entries are zeros, because almost any two fourgrams cooccur. Only 10% of the entries are smaller than 10, so culling small counts would not increase the sparseness of the matrix. Consequently, any computation that employs the fourgram vectors directly would be inefficient. For this reason, a singular value decomposition was performed and 97 singular values extracted (cf. \[5\]) using an algorithm from SVDPACK \[3\]. Each fourgram can then be represented by a vector of 97 real values. Since the singular value decomposition finds the best least-squares approximation of the original space in 97 dimensions, two fourgram vectors will be similar if their original vectors in the collocation matrix are similar. The reduced fourgram vectors can be used efficiently in the following computations.</Paragraph> <Paragraph position="4"> Cooccurrence information was used a second time to compute word representations from the fourgram vectors: in this case, cooccurrence of a target word with any of the 5,000 fourgrams. 50,000 words that occurred at least 20 times in 50,000,000 words of the New York Times newswire were selected. For each of the words, a context vector was computed for every position at which it occurred in the text. A context vector was defined as the sum of all defined fourgram vectors in a window of 1001 fourgrams centered around the target word. The context vectors were then normalized and summed. This sum of vectors is the vector representation of the target word. It is the confusion of all its uses in the corpus.
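A minimal Python sketch of this construction, assuming the 5000-by-5000 collocation matrix has already been counted: the dimensions (5,000 fourgrams, 97 singular values, a 1001-fourgram window) come from the text, while the function names and the use of numpy's SVD in place of SVDPACK are our own.

    import numpy as np

    def reduce_fourgram_vectors(collocation, k=97):
        """Least-squares reduction of the dense collocation matrix: each of
        the 5,000 fourgrams is represented by k = 97 real values."""
        U, s, _ = np.linalg.svd(collocation, full_matrices=False)
        return U[:, :k] * s[:k]            # one 97-dimensional row per fourgram

    def word_vector(positions, corpus_fourgrams, fourgram_vecs, half_window=500):
        """Vector for one word: the normalized context vectors at each of its
        corpus positions, summed. A context vector is the sum of the fourgram
        vectors in a window of 1001 fourgrams centered on the occurrence.
        corpus_fourgrams: the corpus as a sequence of fourgram row indices."""
        total = np.zeros(fourgram_vecs.shape[1])
        for p in positions:
            ids = corpus_fourgrams[max(0, p - half_window): p + half_window + 1]
            ctx = fourgram_vecs[ids].sum(axis=0)
            norm = np.linalg.norm(ctx)
            if norm > 0:
                total += ctx / norm        # the "dot" in the formula: normalization
        return total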
More formally, if C(w) is the set of positions in the corpus at which w occurs and if \varphi(f) is the vector representation for fourgram f, then the vector representation \tau(w) of w is defined as (the dot stands for normalization): \tau(w) = \sum_{i \in C(w)} \Big( \sum_{f \text{ close to } i} \varphi(f) \Big)^{\cdot}</Paragraph> <Paragraph position="6"> [Table 1: a random sample of 10 words and their nearest neighbors in WordSpace. Only fragments of the neighbor lists survived extraction, e.g., &quot;burglars thief rob mugging stray robbing lookout&quot;, &quot;Confessions Jill Julie biography Judith Novak Lois Learned Pulitzer&quot;, and &quot;jobs employ/-s/-ed/-ing attrition workers clerical labor hourly&quot;.] Table 1 shows a random sample of 10 words and their nearest neighbors in WordSpace. As can be seen from the table, proximity in the space corresponds closely to semantic similarity in the corpus. (N'Dour is a Senegalese jazz musician. In the 1989/90 New York Times, S.O.B. mainly occurs in the book title &quot;Confessions of an S.O.B.&quot;, and Ste. in the name &quot;Ste.-Marguerite&quot;, a Quebec river that is popular for salmon fishing.)</Paragraph> </Section> <Section position="2" start_page="59" end_page="63" type="sub_section"> <SectionTitle> 4.2 Augmenting WordNet Categories using WordSpace </SectionTitle> <Paragraph position="0"> We chose the following simple mapping from the derived WordNet categories to WordSpace: * for each category i from Section 2: * collect the vectors of all the words in i that are covered by WordSpace * compute the centroid of these vectors This centroid defines an area in WordSpace that corresponds to the WordNet category. Using these centroids we can now assign a word in WordSpace to a derived category by examining the nearest neighbors of the word. The assignment algorithm we use is: * for each word w in WordSpace * collect the 20 nearest neighbors of w in the space * compute the score si of category i for w as the number of nearest neighbors that are in i * assign w to the highest scoring category or categories In order to test this algorithm, we selected 1000 words from the medium-frequency words in WordSpace.4 These turned out to be the medium-frequency words from deforestation to downed. The following subsections describe the application of the assignment algorithm to classifying proper names, reassigning words in the categories, and assigning words that are not covered by the categories.</Paragraph> <Paragraph position="1"> A deficiency of WordNet for our text-labeling task and for many other applications is that it omits many proper names (and since the set of important proper names changes over time, it cannot be expected to contain an exhaustive list). We tested the performance of our assignment algorithm by searching for proper names that had high scores for the categories in Table 2. For each category on the left-hand side we show all of the proper names that received high scores for that category. The proper names assigned to &quot;artist&quot; are painters, the proper names assigned to &quot;European country&quot; are European politicians, &quot;performer&quot; contains actors, dancers and roles, writers and titles of movies, &quot;music&quot; has musicians and titles of musical performances (the Pasadena Doo Dah Parade, Purcell's &quot;Dido and Aeneas&quot;), &quot;athlete jock&quot; contains players of various sports, and &quot;process of law&quot; contains lawyers, judges and defendants.
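A minimal sketch of the assignment algorithm given in the bullets above, before we turn to the proper-name results: it assumes the WordSpace vectors are stored in a row-normalized matrix, with category membership taken from the Section 2 categories; the function and variable names are ours.

    import numpy as np
    from collections import Counter

    def assign_categories(w_vec, vocab_vecs, vocab_words, category_of, k=20):
        """vocab_vecs: row-normalized WordSpace matrix; vocab_words: the word
        for each row; category_of: word -> derived WordNet category, if any.
        Returns the highest-scoring category (or categories, on a tie)."""
        sims = vocab_vecs @ (w_vec / np.linalg.norm(w_vec))   # cosines
        neighbors = [vocab_words[i] for i in np.argsort(-sims)[:k]]
        scores = Counter(category_of[n] for n in neighbors if n in category_of)
        if not scores:
            return []
        best = max(scores.values())
        return [c for c, s in scores.items() if s == best]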
We checked the referents of all proper names in Table 2 in the New York Times and found only one possible error (although a few names like &quot;DePalma&quot; and &quot;Delancey&quot; had several referents, only one of whom pertained to the assigned category): the President of Michigan State University, John DiBiaggio, was assigned to the &quot;athlete&quot; category because his name is mainly mentioned in articles dealing with a conflict he had with his athletic department.</Paragraph> <Paragraph position="2"> [Table 2: categories and their highest-scoring proper names]
artist creative_person: degas delacroix
European_country European_nation: delors dienstbier diestel
performer performing_artist; dramatic composition: deniro dennehy depalma delancey depardieu dern desi devito dewhurst dey diaghilev doogie dourif
musical_organization musical_group; musician player; music: depeche(mode) deville diddley dido dire(straits) doo doobie (N')Dour
athlete jock: dehere delpino demarco deleon deshaies detmer dibiaggio dinah doleman doughty doran dowis
due_process due_process_of_law: degeorge depetris devita dichiara dicicco diles dilorenzo dougan
The assignment algorithm can also be employed to adjust the assignments of individual words in the WordNet hierarchy by matching against the derived categories. Two kinds of adjustments are possible: specializing senses and adding senses that are not covered. Two examples of each case from the 1000-word test set are given in Table 3. [Table 3: the categories automatically assigned to dosage, dissertation, derby, and dl. Only the category labels survived extraction, among them &quot;person individual man mortal human soul&quot;, &quot;feeling emotion&quot;, &quot;medicine medication medicament&quot;, &quot;message content subject_matter substance&quot;, &quot;time_period period period_of_time amount_of_time&quot;, and &quot;athlete jock&quot;.] [Table 4: the WordNet entries for these words]
dosage: dose, dosage - (the quantity of an active agent (substance or radiation) taken in or absorbed at any one time)
dissertation: dissertation, thesis => treatise - (a formal exposition)
derby: bowler hat, bowler, derby, plug hat - (round and black and hard with a narrow brim; worn by some British businessmen)
dl: deciliter, decilitre, dl
dosage and dissertation are defined in a very general way in WordNet (see Table 4). While they can be used with the general sense given in WordNet, almost all uses of dissertation in the New York Times are for doctoral dissertations that report on scientific work. Similarly, non-medical contexts are conceivable for dosage, but the dosages that the New York Times mentions are exclusively dosages of radiation or medicine in a medical context. The automatically found labelings in Table 3 indicate the need for specialization and can be used as the basis for reassignment.</Paragraph> <Paragraph position="3"> In some cases, the WordNet hierarchy is also incomplete. The two senses &quot;horse race&quot; and &quot;Disabled List&quot; for derby and dl are missing from WordNet, although they are the dominant uses in the New York Times. Again, the classification algorithm finds the right topic area for the two words, which can be used as the basis for reassignment. Unfortunately, the algorithm also labels some correctly assigned words with incorrect categories. We are working on an improved version that will not give &quot;false positives&quot; in the detection of misassignments.</Paragraph> <Paragraph position="4"> We would like to be able to handle unknown words, since they are often highly specific and excellent indicators of the topical structure of a document. Table 5 shows the automatic assignments for all words in the 1000-word test set that were not found in WordNet.
The results are mixed. 63% (17/27) of the words are assigned to a correct topic (+), an additional 19% (5/27) are assigned to topics they are related to (o), and 19% (5/27) are misassigned (-). We are considering several ways of improving the assignment algorithm. For instance, there are &quot;diluted&quot; categories such as &quot;speech_act&quot; and &quot;trait character feature&quot; whose members are mostly words that are poorly characterized collocationally. If we ignore them in assigning categories (hoping that most unknown words will be topic-specific special terms) we can correct some of the errors; e.g., disunity would be assigned to &quot;group_action interaction social_activity&quot;, which seems correct. We expect that we can improve the results in Table 5 as we gain more experience in combining WordSpace and WordNet. These results are encouraging; we have not yet tested whether they improve the particular task of interest.</Paragraph> </Section> </Section> <Section position="8" start_page="63" end_page="65" type="metho"> <SectionTitle> 5 Combining Distant Categories </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="63" end_page="64" type="sub_section"> <SectionTitle> 5.1 The Algorithm </SectionTitle> <Paragraph position="0"> To find which categories should be considered closest to one another, we first determined how close they are in WordSpace and then grouped together categories that mutually ranked one another highly.</Paragraph> <Paragraph position="1"> To compute the first-degree closeness of two categories c_i and c_j we used the formula 1 - \frac{1}{2 |c_i| |c_j|} \sum_{v \in c_i} \sum_{w \in c_j} d(v, w), where d is the Euclidean distance. The primary rank of category i for category j indicates how closely related i is to j. For instance, rank 1 means that i is the closest category to j, and rank 3 means that only two categories are closer to j than i.</Paragraph> <Paragraph position="2"> The second-degree closeness is computed from the ranks of the primary ranks. To determine that close association is mutual between two categories, we check for mutual high ranking. Thus categories i and j are grouped together if and only if i ranks j highly and j ranks i highly (where &quot;highly&quot; was determined by a cutoff value: i and j had to be ranked k or above with respect to each other, for a threshold k). Secondary ranking is needed because some categories are especially &quot;popular,&quot; attracting many other categories to them; the secondary rank enables the popular categories to retain only those categories that they mutually rank highly.</Paragraph> <Paragraph position="3"> The results of this algorithm were difficult to interpret until we displayed them graphically. The graph layout problem is notoriously difficult, but \[2\] describes a presentation tool, based on theoretical work by \[6\], which uses a force-directed placement model to lay out complex networks (edges are modeled as springs; nodes linked by edges are attracted to each other, but all other pairs of nodes repel one another). Figure 2 shows a piece of the network. In these networks only connectivity has meaning; distance between nodes does not connote semantic distance.</Paragraph> <Paragraph position="4"> Looking at Figure 2 in more detail, we see that categories associated with the notion &quot;sports&quot;, such as &quot;athletic_game&quot;, &quot;race&quot;, &quot;sports_equipment&quot;, and &quot;sports_implement&quot;, have been grouped together.
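Returning to the grouping computation itself, here is a minimal sketch under two stated assumptions: each category is given as a matrix of its members' normalized WordSpace vectors, and first-degree closeness is as reconstructed above. The threshold name K and all function names are ours.

    import numpy as np

    def closeness(ci, cj):
        """1 minus the average pairwise Euclidean distance, halved: for
        normalized vectors d(v, w) lies in [0, 2], so closeness lies in [0, 1]."""
        d = np.linalg.norm(ci[:, None, :] - cj[None, :, :], axis=-1)
        return 1.0 - d.sum() / (2.0 * len(ci) * len(cj))

    def mutual_groups(categories, K=3):
        """Pair categories i and j iff each ranks the other K-th closest or
        better (the mutual high-ranking test described above)."""
        n = len(categories)
        close = np.array([[closeness(categories[i], categories[j])
                           for j in range(n)] for i in range(n)])
        np.fill_diagonal(close, -np.inf)     # a category never ranks itself
        order = np.argsort(-close, axis=1)   # order[i]: categories by closeness to i
        rank = np.empty((n, n), dtype=int)
        for i in range(n):
            rank[i, order[i]] = np.arange(1, n + 1)   # rank 1 = closest to i
        return [(i, j) for i in range(n) for j in range(i + 1, n)
                if rank[i, j] <= K and rank[j, i] <= K]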
The network also shows that categories that are specified to be near one another in WordNet, such as the categories related to &quot;bread&quot;, are found to be closely interrelated. This is useful in case we would like to begin with smaller categories, in order to eliminate some of the large, broad categories that we are currently working with.</Paragraph> <Paragraph position="5"> The connectivity of the network is also interesting because it indicates how categories interrelate. Athletics is linked to vehicle and competition categories; these in turn link to military vehicles and weaponry categories, which then lead into legal categories.</Paragraph> <Paragraph position="6"> Most of the connectivity information suggested by the network was used to create the new categories. However, many of the desirable relationships do not appear in the network, perhaps because of the requirement for highly mutual co-ranking. If we were to relax this requirement we might find better coverage, but perhaps at the cost of more misleading links. The remaining associations were determined by hand, so that the original 726 categories were combined into 106 new super-categories.</Paragraph> </Section> <Section position="2" start_page="64" end_page="65" type="sub_section"> <SectionTitle> 5.2 Improving the Topic Labeler </SectionTitle> <Paragraph position="0"> The super-categories are intended to group together related categories in order to eliminate topical redundancy in the labeler and to help eliminate inappropriate labels (since the categories are larger and so have more lexical items serving as evidence). Thus the top four or five super-categories should suffice to indicate the main topics of documents. We have not yet rigorously analyzed the performance of the labeler with the original categories or with the super-categories. In the future we plan to obtain reader judgements about which categories are the best labels for various texts. Here we show some example output and discuss its characteristics.</Paragraph> <Paragraph position="1"> The table below compares the results of the labeler using the original categories against the super-categories. The numbers beside the category names are the scores assigned by the algorithm; the scores in both cases are roughly similar. It is important to realize that only the top four or five labels are to be used from the super-categories; since each super-category subsumes many categories, only a few super-categories should be expected to contain the most relevant information. The first article is a 31-sentence magazine article, published in 1987, taken from \[15\]. It describes how Soviet women have little political power, discusses their role as working women, and describes the benefits of college life. The second article is a 77-sentence popular science magazine article about the Magellan space probe exploring Venus. When using the super-categories, the labeler avoids grossly inappropriate labels such as &quot;mollusk_genus&quot; and &quot;goddess&quot; in the Magellan article, and combines categories such as &quot;layer&quot;, &quot;natural_depression&quot;, and &quot;rock stone&quot; into the single super-category &quot;land terra_firma&quot;.</Paragraph> <Paragraph position="2"> Looking again at the longer texts of the United States Constitution and Genesis, we see that the super-categories are more general and less redundant than the categories shown in Section 2.
(The high score for the &quot;breads&quot; category seems incorrect, even though the term &quot;bread&quot; occurs 25 times.) In some cases the user might desire more specific categories; this experiment suggests that the labeler can generate topic labels at multiple levels of granularity.</Paragraph> </Section> </Section> </Paper>