<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1090"> <Title>Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision</Title> <Section position="3" start_page="0" end_page="3" type="metho"> <SectionTitle> 2. Classification methods </SectionTitle> <Paragraph position="0"> Classification techniques previously applied to distributional data can be summarized according to the following methods: the k nearest neighbor (kNN) method, the category-based method and the centroid-based method. They all operate on vector-based semantic representations, which describe the meaning of a word of interest (target word) in terms of counts of its coocurrence with context words, i.e., words appearing within some delineation around the target word. The key differences between the methods stem from different underlying ideas about how a semantic class of words is represented, i.e. how it is derived from the original cooccurrence counts, and, correspondingly, what defines membership in a class.</Paragraph> <Paragraph position="1"> The kNN method is based on the assumption that membership in a class is defined by the new instance's similarity to one or more individual members of the class. Thereby, similarity is defined by a similarity score as, for instance, by the cosine between cooccurrence vectors. To classify a new instance, one determines the set of k training instances that are most similar to the new instance. The new instance is assigned to the class that has the biggest number of its members in the set of nearest neighbors. In addition, the classification decision can be based on the similarity measure between the new instance and its neighbors: each neighbor may vote for its class with a weight proportional to its closeness to the new instance. When the method is applied to augment a thesaurus, a class of training instances is typically taken to be constituted by words belonging to the same synonym set, i.e.</Paragraph> <Paragraph position="2"> lexicalizing the same concept (e.g., Hearst and Schuetze 1993). A new word is assigned to that synonym set that has the biggest number of its members among nearest neighbors.</Paragraph> <Paragraph position="3"> Or, probabilities determined via Maximum Likelihood Estimation.</Paragraph> <Paragraph position="4"> The major disadvantage of the kNN method that is often pointed out is that it involves significant computational expenses to calculate similarity between the new instance and every instance of the training set. A less expensive alternative is the category-based method (e.g., Resnik 1992).</Paragraph> <Paragraph position="5"> Here the assumption is that membership in a class is defined by the closeness of the new item to a generalized representation of the class. The generalized representation is built by adding up all the vectors constituting a class and normalising the resulting vector to unit length, thus computing a probabilistic vector representing the class. To determine the class of a new word, its unit vector is compared to each class vector.</Paragraph> <Paragraph position="6"> Thus the number of calculations is reduced to the number of classes. Thereby, a class representation may be derived from a set of vectors corresponding to one synonym set (as is done by Takunaga et al. 
<Paragraph position="7"> Another way to prepare a representation of a word class is what may be called the centroid-based approach (e.g., Pereira et al. 1993). It differs from the category-based method only in how the class vector is computed: all n vectors corresponding to class members are added up and the resulting vector is divided by n, yielding the centroid of the n vectors.</Paragraph> <Paragraph position="8"> 3. Making use of the structure of the thesaurus The classification methods described above presuppose that the semantic classes being augmented exist independently of each other. For most existing thesauri this is not the case: they typically encode taxonomic relations between word classes. It seems worthwhile to employ this information to enhance the performance of the classifiers.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 Tree descending algorithm </SectionTitle> <Paragraph position="0"> One way to factor the taxonomic information into the classification decision is to employ the &quot;tree-descending&quot; classification algorithm, a familiar technique in text categorization. The principle behind this approach is that the semantics of every concept in the thesaurus tree retains some of the semantics of all its hyponyms, in such a way that the higher the concept, the more relevant the semantic characteristics of its hyponyms it reflects. It is thus feasible to determine the class of a new word by descending the tree from the root down to a leaf. The semantics of concepts in the thesaurus tree can be represented by means of one of the three class representation methods described in Section 2. At every tree node, the decision about which path to follow is made by choosing the child concept with the greatest distributional similarity to the new word.</Paragraph> <Paragraph position="1"> After the search has reached a leaf, the new word is assigned to the synonym set lexicalizing the concept most similar to the new word. This manner of search offers two advantages. First, it gradually narrows down the search space and thus saves on computational expense. Second, it ensures that, in a classification decision, more relevant semantic distinctions between potential classes are given preference over less relevant ones. As with the category-based and the centroid-based representations, the performance of the method may depend greatly on the number of subordinate synonym sets included to represent a concept.</Paragraph> </Section>
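The tree-descending search just described can be sketched roughly as follows (an illustration, not the authors' code; it assumes that each thesaurus node carries a class vector built by one of the methods of Section 2, and it reuses the cosine and vector conventions of the previous sketch).

```python
import numpy as np

class Concept:
    """A thesaurus node: the synonym set lexicalizing the concept,
    a class vector representing its semantics, and its child concepts."""
    def __init__(self, synset, vector, children=None):
        self.synset = synset
        self.vector = vector
        self.children = children or []

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def tree_descend(root, target_vec):
    """Descend from the root, at each node following the child whose class
    vector is most similar to the target word's cooccurrence vector; the
    synonym set of the leaf finally reached is returned as the predicted class."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: cosine(target_vec, child.vector))
    return node.synset
```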
<Section position="2" start_page="1" end_page="3" type="sub_section"> <SectionTitle> 3.2 Tree ascending algorithm </SectionTitle> <Paragraph position="0"> Another way to use the information about inter-class relations contained in a thesaurus is to base the classification decision on combined measures of distributional similarity and taxonomic similarity (i.e., semantic similarity induced from the relative positions of the words in the thesaurus) between nearest neighbors. Suppose the words in the nearest-neighbor set for a given new word, e.g., trailer, all belong to different classes, as in the following classification scenario: box (similarity score to trailer: 0.8), house (0.7), barn (0.6), villa (0.5) (Figure 1). In this case, kNN will classify trailer into the class CONTAINER, since it has the greatest similarity to box. However, it is obvious that the most likely class of trailer is in a different part of the thesaurus: in the nearest-neighbor set there are three words which, though not belonging to one class, are semantically close to each other. It would thus be safer to assign the new word to a concept that subsumes one or all of the three semantically similar neighbors. For example, the concepts DWELLING or BUILDING could be feasible candidates in this situation.</Paragraph> <Paragraph position="1"> The crucial question here is how to calculate the total of votes for these two concepts in order to decide which of them to choose, or whether to prefer CONTAINER instead. Clearly, one cannot simply sum or average the distributional similarity scores of the neighbors below a candidate concept: if the scores are summed, the root will always be the best-scoring concept; if they are averaged, the score of a candidate concept can never exceed the score of its highest-scoring hyponym.</Paragraph> <Paragraph position="2"> We propose to estimate the total of votes for such candidate concepts based on the taxonomic similarity between the relevant nodes. The taxonomic similarity between two concepts is measured according to the procedure elaborated in (Maedche & Staab, 2000). Assuming that a taxonomy is given as a tree with a set of nodes N, a set of edges E ⊆ N × N, and a unique root ROOT ∈ N, one first determines the least common superconcept of the pair of concepts a, b being compared. It is defined by</Paragraph> <Paragraph position="3"> lcs(a,b) = argmin_{c ∈ N} ( d(a,c) + d(b,c) + d(ROOT,c) ),</Paragraph> <Paragraph position="4"> where d(a,b) is the number of edges on the shortest path between a and b. The taxonomic similarity between a and b is then given by</Paragraph> <Paragraph position="5"> T(a,b) = d(ROOT,c) / ( d(a,c) + d(b,c) + d(ROOT,c) ),</Paragraph> <Paragraph position="6"> where c = lcs(a,b). T is such that 0 ≤ T ≤ 1, with 1 standing for the maximum taxonomic similarity.</Paragraph> <Paragraph position="7"> T is directly proportional to the number of edges from the least common superconcept to the root, which agrees with the intuition that a given number of edges between two concrete concepts signifies greater similarity than the same number of edges between two abstract concepts.</Paragraph> <Paragraph position="8"> We calculate the total of votes for a candidate concept by summing the distributional similarities of its hyponyms to the target word t, each weighted by the taxonomic similarity between the hyponym and the candidate node:</Paragraph> <Paragraph position="9"> v(n) = Σ_{h ∈ H_n} sim(t,h) · T(n,h),</Paragraph> <Paragraph position="10"> where H_n is the set of hyponyms below the candidate concept n, sim(t,h) is the distributional similarity between a hyponym h and the word t to be classified, and T(n,h) is the taxonomic similarity between the candidate concept and the hyponym h.</Paragraph>
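A rough Python rendering of the taxonomic similarity and of the vote total follows, under the assumption that the taxonomy is stored as a dictionary mapping each node to its parent (with the root mapping to None); the function and variable names are ours, not the paper's.

```python
def path_lengths(parent, node):
    """Map each ancestor of node (including node itself) to its distance
    from node, measured in edges."""
    dists, steps = {}, 0
    while node is not None:
        dists[node] = steps
        node, steps = parent[node], steps + 1
    return dists

def d(parent, a, b):
    """Number of edges on the shortest path between a and b in the tree."""
    da, db = path_lengths(parent, a), path_lengths(parent, b)
    lca = min((n for n in da if n in db), key=lambda n: da[n] + db[n])
    return da[lca] + db[lca]

def taxonomic_similarity(parent, root, a, b):
    """T(a,b) = d(ROOT,c) / (d(a,c) + d(b,c) + d(ROOT,c)),
    where c is the least common superconcept of a and b."""
    c = min(parent, key=lambda n: d(parent, a, n) + d(parent, b, n) + d(parent, root, n))
    num = d(parent, root, c)
    den = d(parent, a, c) + d(parent, b, c) + num
    return num / den if den else 1.0

def vote_total(parent, root, candidate, hyponyms, sim_to_target):
    """Sum each hyponym's distributional similarity to the target word,
    weighted by its taxonomic similarity to the candidate concept."""
    return sum(sim_to_target[h] * taxonomic_similarity(parent, root, candidate, h)
               for h in hyponyms)
```

In a tree, the argmin defining the least common superconcept is simply the lowest common ancestor of a and b, so the explicit search over all nodes could be replaced by the ancestor computation already performed in d().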
<Paragraph position="11"> 4. Data and settings of the experiments The machine-readable thesaurus we used in this study was derived from GETESS, an ontology for the tourism domain. Each concept in the ontology is associated with one lexical item, which expresses this concept. From this ontology, word classes were derived in the following manner. A class was formed by the words lexicalizing all child concepts of a given concept. For example, the concept CULTURAL_EVENT in the ontology has the successor concepts PERFORMANCE, OPERA, and FESTIVAL, associated with the words performance, opera, and festival, respectively. Though these words are not synonyms in the traditional sense, they are taken to constitute one semantic class, since their meanings are the closest among all words of the ontology's lexicon. The thesaurus thus derived contained 1052 words and phrases (the corpus used in the study had data on 756 of them). Out of the 756 concepts, 182 were non-final (i.e., had child concepts); correspondingly, 182 word classes were formed. The average depth of the thesaurus is 5.615; the maximum number of levels is 9. The corpus from which the distributional data were obtained was extracted from a web site advertising hotels around the world. It contained around 1 million words.</Paragraph> <Paragraph position="12"> Collection of distributional data was carried out with the following settings. During preprocessing, common inflections were chopped off, and irregular forms of verbs, adjectives, and nouns were changed to their base forms. The context of usage was delineated by a window of 3 words on either side of the target word, without transgressing sentence boundaries. If a stop word other than a proper noun appeared inside the window, the window was expanded accordingly. The stoplist included the 50 most frequent words of the British National Corpus, words listed as function words in the BNC, and proper nouns not appearing in sentence-initial position. The obtained cooccurrence frequencies were weighted by the 1+log weight function.</Paragraph> <Paragraph position="13"> The distributional similarity was measured by means of three different similarity measures: Jaccard's coefficient, the L1 distance, and the skew divergence. This choice of similarity measures was motivated by the results of studies by (Levy et al. 1998) and (Lee 1999), which compared several well-known measures on similar tasks and found these three to be superior to many others. Another reason for this choice is that different ideas underlie these measures: while Jaccard's coefficient is a binary measure, L1 and the skew divergence are probabilistic, the former being geometrically motivated and the latter being a version of the information-theoretic Kullback-Leibler divergence (cf. Lee 1999).</Paragraph> </Section> </Section>
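The three measures can be written down as follows (illustrative definitions only; cooccurrence vectors are assumed to be NumPy arrays, and the smoothing parameter alpha and the direction of the skew divergence follow common usage after Lee (1999) rather than anything stated in this paper).

```python
import numpy as np

def jaccard(u, v):
    """Binary Jaccard's coefficient: overlap of the sets of context words
    that cooccur with each of the two targets (nonzero components)."""
    a, b = u > 0, v > 0
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def l1_distance(u, v):
    """L1 distance between the two cooccurrence probability distributions
    (lower means more similar)."""
    p, q = u / u.sum(), v / v.sum()
    return float(np.abs(p - q).sum())

def skew_divergence(u, v, alpha=0.99):
    """Skew divergence: KL divergence of one distribution from a mixture
    that is mostly the other, avoiding the zero-denominator problem of
    plain KL divergence (lower means more similar)."""
    p, q = u / u.sum(), v / v.sum()
    mix = alpha * p + (1.0 - alpha) * q
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / mix[mask])))
```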
<Section position="4" start_page="3" end_page="3" type="metho"> <SectionTitle> 5. Evaluation method </SectionTitle> <Paragraph position="0"> The performance of the algorithms was assessed in the following manner. For each algorithm, we held out a single word of the thesaurus as the test case and trained the system on the remaining 755 words. We then tested the algorithm on the held-out vector, observing whether the assigned class for that word coincided with its original class in the thesaurus, and counting the number of correct classifications (&quot;direct hits&quot;). This was repeated for each of the words of the thesaurus.</Paragraph> <Paragraph position="1"> However, given the intuition that a semantic classification may not be simply either right or wrong, but rather of varying degrees of appropriateness, we believe that a clearer idea of the quality of the classifiers would be given by an evaluation method that takes &quot;near misses&quot; into account as well. We therefore also evaluated the performance of the algorithms in terms of Learning Accuracy (Hahn & Schnattinger 1998), i.e., in terms of how close, on average, the proposed class for a test word was to the correct class. For this purpose, the taxonomic similarity between the assigned and the correct class is measured, so that the appropriateness of a particular classification is estimated on a scale between 0 and 1, with 1 signifying assignment to the correct class. Thus Learning Accuracy is compatible with the counting of direct hits, which, as will be shown later, may be useful for evaluating the methods.</Paragraph> <Paragraph position="2"> In the following, the evaluation of the classification algorithms is reported in terms of both the average of direct hits and the average Learning Accuracy (&quot;direct+near hits&quot;) over all words in the thesaurus. To provide a benchmark for the evaluation of the algorithms, a baseline was calculated: the average hit value a given word receives when its class label is chosen at random. The baseline for direct hits was estimated at 0.012; for direct+near hits, it was 0.15741.</Paragraph> </Section>
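As a final illustration, a minimal sketch of this leave-one-out protocol, assuming a placeholder classify(word, training) function (any of the classifiers above) and a taxonomic similarity callable such as the one sketched for Section 3.2; none of these names come from the paper.

```python
def evaluate(words, classify, correct_class, tax_sim):
    """Leave-one-out evaluation: hold out each word in turn, classify it
    using the remaining words as training data, and score the result both
    as a direct hit (exact class match) and by Learning Accuracy, here
    taken to be the taxonomic similarity tax_sim between the assigned
    class and the correct class."""
    direct, learning_acc = 0.0, 0.0
    for w in words:
        training = [x for x in words if x != w]
        assigned = classify(w, training)      # placeholder classifier
        gold = correct_class[w]
        direct += 1.0 if assigned == gold else 0.0
        learning_acc += tax_sim(assigned, gold)
    n = len(words)
    return direct / n, learning_acc / n       # direct hits, direct+near hits
```
</Paper>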