<?xml version="1.0" standalone="yes"?> <Paper uid="N04-3012"> <Title>WordNet::Similarity - Measuring the Relatedness of Concepts</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Similarity Measures </SectionTitle> <Paragraph position="0"> Three of the six measures of similarity are based on the information content of the least common subsumer (LCS) of concepts A and B. Information content is a measure of the specificity of a concept, and the LCS of concepts A and B is the most specific concept that is an ancestor of both A and B. These measures include res (Resnik, 1995), lin (Lin, 1998), and jcn (Jiang and Conrath, 1997).</Paragraph> <Paragraph position="1"> The lin and jcn measures augment the information content of the LCS with the sum of the information content of concepts A and B themselves. The lin measure scales the information content of the LCS by this sum, while jcn takes the difference of this sum and the information content of the LCS.</Paragraph> <Paragraph position="2"> The default source of information content for concepts is the sense-tagged corpus SemCor. However, utility programs included with WordNet::Similarity allow a user to compute information content values from the Brown Corpus, the Penn Treebank, or the British National Corpus.</Paragraph> <Paragraph position="3"> Three similarity measures are based on path lengths between a pair of concepts: lch (Leacock and Chodorow, 1998), wup (Wu and Palmer, 1994), and path. lch finds the shortest path between two concepts, and scales that value by the maximum path length found in the is-a hierarchy in which they occur. wup finds the depth of the LCS of the concepts, and then scales that by the sum of the depths of the individual concepts. The depth of a concept is simply its distance to the root node. The path measure is a baseline that is equal to the inverse of the shortest path between two concepts.</Paragraph> <Paragraph position="4"> WordNet::Similarity supports two hypothetical root nodes that can be turned on and off. When on, one root node subsumes all of the noun concepts, and another subsumes all of the verb concepts. This allows similarity measures to be applied to any pair of nouns or any pair of verbs. If the hypothetical root nodes are off, then concepts must be in the same physical hierarchy for a measurement to be taken.</Paragraph>
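For reference, the standard formulations behind these descriptions are shown below, where IC(c) = -log P(c) is the information content of concept c, lcs(A, B) is the least common subsumer, len(A, B) is the shortest is-a path length between A and B, depth(c) is the distance of c from the root, and D is the maximum depth of the taxonomy. These are the usual definitions from the cited papers; the package's exact normalization details may differ.

```latex
\begin{align*}
  \mathrm{sim}_{\mathrm{res}}(A,B)  &= \mathrm{IC}(\mathrm{lcs}(A,B)) &
  \mathrm{sim}_{\mathrm{lch}}(A,B)  &= -\log\frac{\mathrm{len}(A,B)}{2D} \\
  \mathrm{sim}_{\mathrm{lin}}(A,B)  &= \frac{2\,\mathrm{IC}(\mathrm{lcs}(A,B))}{\mathrm{IC}(A)+\mathrm{IC}(B)} &
  \mathrm{sim}_{\mathrm{wup}}(A,B)  &= \frac{2\,\mathrm{depth}(\mathrm{lcs}(A,B))}{\mathrm{depth}(A)+\mathrm{depth}(B)} \\
  \mathrm{sim}_{\mathrm{jcn}}(A,B)  &= \frac{1}{\mathrm{IC}(A)+\mathrm{IC}(B)-2\,\mathrm{IC}(\mathrm{lcs}(A,B))} &
  \mathrm{sim}_{\mathrm{path}}(A,B) &= \frac{1}{\mathrm{len}(A,B)}
\end{align*}
```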
</Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Measures of Relatedness </SectionTitle> <Paragraph position="0"> Measures of relatedness are more general in that they can be made across part of speech boundaries, and they are not limited to considering is-a relations. There are three such measures in the package: hso (Hirst and St-Onge, 1998), lesk (Banerjee and Pedersen, 2003), and vector (Patwardhan, 2003).</Paragraph> <Paragraph position="1"> The hso measure classifies relations in WordNet as having direction, and then establishes the relatedness between two concepts A and B by finding a path that is neither too long nor changes direction too often.</Paragraph> <Paragraph position="2"> The lesk and vector measures incorporate information from WordNet glosses. The lesk measure finds overlaps between the glosses of concepts A and B, as well as concepts that are directly linked to A and B. The vector measure builds a co-occurrence matrix from a given corpus for each word used in the WordNet glosses, and then represents each gloss/concept with a vector that is the average of these co-occurrence vectors.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Using WordNet::Similarity </SectionTitle> <Paragraph position="0"> WordNet::Similarity can be used via a command line interface provided by the utility program similarity.pl, which allows a user to run the measures interactively. In addition, there is a web interface based on this utility. WordNet::Similarity can also be embedded within Perl programs by including it as a module and calling its methods.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Command Line </SectionTitle> <Paragraph position="0"> The utility similarity.pl allows a user to measure specific pairs of concepts given in word#pos#sense form. For example, car#n#3 refers to the third WordNet noun sense of car. It also allows the specification of all the possible senses associated with a word or word#pos combination.</Paragraph> <Paragraph position="1"> For example, in Figure 1, the first command requests the value of the lin measure of similarity for the second noun sense of car (railway car) and the first noun sense of bus (motor coach). The second command returns the score of the pair of concepts that has the highest similarity value for the nouns car and bus. In the third command, the -allsenses switch causes the similarity measurements of all the noun senses of car to be calculated relative to the first noun sense of bus.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Programming Interface </SectionTitle> <Paragraph position="0"> WordNet::Similarity is implemented with Perl's object-oriented features. It uses the WordNet::QueryData package (Rennie, 2000) to create an object representing WordNet. A number of methods allow existing measures to be included in Perl source code, and also support the development of new measures.</Paragraph> <Paragraph position="1"> When an existing measure is to be used, an object of that measure must be created via the new() method. Then the getRelatedness() method can be called on a pair of word senses, and it returns the relatedness value.</Paragraph> <Paragraph position="2"> For example, the program in Figure 2 creates an object of the lin measure, and then finds the similarity between the first sense of the noun car (automobile) and the second sense of the noun bus (network bus).</Paragraph>
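Figure 2 itself is not reproduced in this extraction. The following is a minimal sketch of the documented usage pattern (WordNet::QueryData plus a measure module such as WordNet::Similarity::lin); the error-handling details are based on the package's standard interface rather than copied from Figure 2.

```perl
#!/usr/bin/perl
# Minimal sketch: load WordNet, create a lin measure object, and query
# the relatedness of two word senses given in word#pos#sense form.
use strict;
use warnings;

use WordNet::QueryData;
use WordNet::Similarity::lin;

# WordNet::QueryData gives the measure access to a local WordNet installation.
my $wn = WordNet::QueryData->new;

# Each measure is its own module; new() takes the WordNet object
# (and optionally a configuration file).
my $measure = WordNet::Similarity::lin->new($wn);

# getRelatedness() is called on a pair of word senses.
my $value = $measure->getRelatedness('car#n#1', 'bus#n#2');

# Measures report problems through getError().
my ($error, $errorString) = $measure->getError();
die "$errorString\n" if $error;

print "lin(car#n#1, bus#n#2) = $value\n";
```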
<Paragraph position="3"> WordNet::Similarity enables detailed tracing that shows a variety of diagnostic information specific to each kind of measure. For example, for the measures that rely on path lengths (lch, wup, path), the tracing shows all the paths found between the concepts.</Paragraph> <Paragraph position="4"> Tracing for the information content measures (res, lin, jcn) includes both the paths between concepts and the least common subsumer. Tracing for the hso measure shows the actual paths found through WordNet, while tracing for lesk shows the gloss overlaps found in WordNet for the two concepts and their nearby relatives. The vector tracing shows the word vectors that are used to create the gloss vector of a concept.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Software Architecture </SectionTitle> <Paragraph position="0"> Similarity.pm is the superclass of all modules, and provides general services used by all of the measures, such as validation of synset identifier input, tracing, and caching of results. Four modules provide all of the functionality required by any of the supported measures: PathFinder.pm, ICFinder.pm, DepthFinder.pm, and LCSFinder.pm.</Paragraph> <Paragraph position="1"> PathFinder.pm provides getAllPaths(), which finds all of the paths and their lengths between two input synsets, and getShortestPath(), which determines the length of the shortest path between two concepts.</Paragraph> <Paragraph position="2"> ICFinder.pm includes the method IC(), which gets the information content value of a synset. probability() and getFrequency() find the probability and frequency count of a synset based on whatever corpus has been used to compute information content. Note that these values are pre-computed, so these methods simply read from an information content file.</Paragraph> <Paragraph position="3"> DepthFinder.pm provides methods that read values pre-computed by the wnDepths.pl utility. This program finds the depth of every synset in WordNet, and also shows the is-a hierarchy in which a synset occurs. If a synset has multiple parents, then each possible depth and home hierarchy is returned. The depth of a synset is returned by getDepthOfSynset(), and getTaxonomyDepth() provides the maximum depth for a given is-a hierarchy.</Paragraph> <Paragraph position="4"> LCSFinder.pm provides methods that find the least common subsumer of two concepts using three different criteria. These are necessary because WordNet allows multiple inheritance, so a different LCS can be selected for a pair of concepts if one or both of them have multiple parents in an is-a hierarchy. getLCSbyIC() chooses the LCS with the highest information content, getLCSbyDepth() selects the LCS with the greatest depth, and getLCSbyPath() selects the LCS that results in the shortest path.</Paragraph>
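To illustrate how these helper modules fit together, the sketch below shows roughly how an information-content measure in the style of res could be assembled from them. The method names come from this section, but the package name, inheritance, signatures, and return values shown here are assumptions for illustration, not the package's documented API.

```perl
# Illustrative only: approximate shape of an IC-based measure built on the
# helper modules described above. Inheritance and signatures are assumed.
package WordNet::Similarity::ResnikSketch;

use strict;
use warnings;
use parent 'WordNet::Similarity::ICFinder';   # assumed to supply IC() and getLCSbyIC()

sub getRelatedness {
    my ($self, $synset1, $synset2) = @_;

    # LCSFinder-style selection: pick the least common subsumer with the
    # highest information content (assumed to return a synset identifier).
    my $lcs = $self->getLCSbyIC($synset1, $synset2);
    return undef unless defined $lcs;

    # ICFinder: the information content of that subsumer, read from the
    # pre-computed information content file, is the res score.
    return $self->IC($lcs);
}

1;
```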
</Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> Our work with measures of semantic similarity and relatedness began while adapting the Lesk Algorithm for word sense disambiguation to WordNet (Banerjee and Pedersen, 2002). That work evolved into a generalized approach to disambiguation based on semantic relatedness (Patwardhan et al., 2003), implemented in the SenseRelate package (http://senserelate.sourceforge.net), which uses WordNet::Similarity. The premise behind this algorithm is that the sense of a word can be determined by finding which of its senses is most related to the possible senses of its neighbors.</Paragraph> <Paragraph position="1"> WordNet::Similarity has been used by a number of other researchers in an interesting array of domains. (Zhang et al., 2003) use it as a source of semantic features for identifying cross-document structural relationships between pairs of sentences found in related documents. (McCarthy et al., 2004) use it in conjunction with a thesaurus derived from raw text in order to automatically identify the predominant sense of a word. (Jarmasz and Szpakowicz, 2003) compare measures of similarity derived from WordNet and Roget's Thesaurus. The comparisons are based on correlation with human relatedness judgments, as well as TOEFL synonym identification tasks. (Baldwin et al., 2003) use WordNet::Similarity as an evaluation tool for multiword expressions identified via Latent Semantic Analysis. (Diab, 2003) combines a number of similarity measures that are then used as a feature in the disambiguation of verb senses.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Availability </SectionTitle> <Paragraph position="0"> WordNet::Similarity is written in Perl and is freely distributed under the GNU Public License. It is available from the Comprehensive Perl Archive Network (http://search.cpan.org/dist/WordNet-Similarity) and via SourceForge, an Open Source development platform (http://wn-similarity.sourceforge.net).</Paragraph> </Section> </Paper>