<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1173"> <Title>Word Sense Disambiguation using a dictionary for sense similarity measure</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Various tasks dealing with natural language data have to cope with the numerous different senses possessed by every lexical item: machine translation, information retrieval, information extraction, and so on. This very old issue is far from being solved, and the evaluation of methods addressing it is far from obvious (Resnik and Yarowsky, 2000). This problem has been tackled in a number of ways1: by looking at contexts of use (with supervised learning or unsupervised sense clustering) or by using lexical resources such as dictionaries or thesauri.</Paragraph> <Paragraph position="1"> The first kind of approach relies on data that are hard to collect (supervised) or very sensitive to the type of corpus (unsupervised). The second kind of approach tries to exploit the lexical knowledge represented in dictionaries or thesauri, with varying results from its inception up to now (Lesk, 1986; Banerjee and Pedersen, 2003). In all cases, a distance between words or word senses is used as a way to find the right sense in a given context. Dictionary-based approaches usually rely on a comparison of the set of words used in sense definitions and 1 A good introduction is (Ide and Véronis, 1998), or (Manning and Schütze, 1999), chap. 7.</Paragraph> <Paragraph position="2"> in the context to disambiguate2.</Paragraph> <Paragraph position="3"> This paper presents an algorithm which uses a dictionary as a network of lexical items (cf.</Paragraph> <Paragraph position="4"> sections 2 and 3) to compute a semantic similarity measure between words and word senses.</Paragraph> <Paragraph position="5"> It takes into account the whole topology of the dictionary instead of just the entries of the target words.
This arguably makes the results somewhat robust with respect to the particular dictionary used. We have begun testing this approach on word sense disambiguation on definitions of the dictionary itself (section 5), but the method is expected to be more general, although this has not been tested yet. Preliminary results are quite encouraging considering that the method does not require any prior annotated data, while operating on an unconstrained vocabulary.</Paragraph> <Paragraph position="6"> 2 Building the graph of a dictionary The experiment we describe here has been carried out on a dictionary restricted to nouns and verbs only: we considered dictionary entries classified as nouns and verbs, and noun and verb lemmas occurring within those entries. The basic idea is to consider the dictionary as an undirected graph whose nodes are entries, and an edge exists between two nodes whenever one of them occurs in the definition of the other. More precisely, the graph of the dictionary encodes two types of lexicographical information: (1) the definitions of the entries' sub-senses and (2) the structure of the entries, that is, the hierarchical organisation of their sub-senses. The graph then includes two types of nodes: w-nodes used for the words that occur 2 With the exception of the methods of (Kozima and Furugori, 1993; Ide and Véronis, 1990), both based on models of activation of lexical relations, but which present no quantified results.</Paragraph> <Paragraph position="7"> in the definitions and σ-nodes used for the definitions of the sub-senses of the entries. The graph is created in three phases: 1. For each dictionary entry, there is a σ-node for the entry as a whole and one σ-node for each of the sub-senses of the entry. Then an edge is added between each σ-node and the σ-nodes which represent the sub-senses of the next lower level.
In other words, the graph includes a tree of σ-nodes which encodes the hierarchical structure of each entry.</Paragraph> <Paragraph position="8"> 2. A w-node is created in the graph for each word occurring in a definition and an edge is added between the w-node and the σ-node of that definition.</Paragraph> <Paragraph position="9"> 3. An edge is added between each w-node and the top σ-node representing the dictionary entry for that word.</Paragraph> <Paragraph position="10"> For instance, given the entry for &quot;tooth&quot;3: 1. (Anat.) One of the hard, bony appendages which are borne on the jaws, or on other bones in the walls of the mouth or pharynx of most vertebrates, and which usually aid in the prehension and mastication of food.</Paragraph> <Paragraph position="11"> 2. Fig.: Taste; palate.</Paragraph> <Paragraph position="12"> These are not dishes for thy dainty tooth.</Paragraph> <Paragraph position="13"> -Dryden.</Paragraph> <Paragraph position="14"> 3. Any projection corresponding to the tooth of an animal, in shape, position, or office; as, the teeth, or cogs, of a cogwheel; a tooth, prong, or tine, of a fork; a tooth, or the teeth, of a rake, a saw, a file, a card.</Paragraph> <Paragraph position="15"> 4. (a) A projecting member resembling a tenon, but fitting into a mortise that is only sunk, not pierced through.</Paragraph> <Paragraph position="16"> (b) One of several steps, or offsets, in a tusk. See Tusk.</Paragraph> <Paragraph position="17"> We would consider one σ-node for tooth as the top-level entry, let us call it σ0. σ0 is connected with an edge to the σ-nodes σ1, σ2, σ3 and σ4 corresponding to the senses 1., 2., 3.</Paragraph> <Paragraph position="18"> 3 Source: Webster's Revised Unabridged Dictionary, 1996. The experiment has actually been done on a French dictionary, Le Robert.</Paragraph> <Paragraph position="19"> and 4.; the latter will have an edge towards the two σ-nodes σ4.1 and σ4.2 for the sub-senses 4.a.
and 4.b.; σ4.1 will have an edge to each w-node built for the nouns and verbs occurring in its definition (member, resemble, tenon, fit, mortise, sink, pierce). Then the w-node for tenon will have an edge to the σ-node of the top-level entry of tenon. We do not directly connect σ4.1 to the σ-nodes of the top-level entries because these may have both w- and σ-node daughters.</Paragraph> <Paragraph position="20"> In the graph, σ-nodes carry tags which indicate their homograph number and their location in the hierarchical structure of the entry. These tags are sequences of integers where the first one gives the homograph number and the next ones indicate the rank of the sub-sense at each level. For instance, the node σ4.1 above is tagged (0, 4, 1).</Paragraph> <Paragraph position="21"> 3 Prox, a distance between graph nodes We describe here our method (dubbed Prox) to compute a distance between nodes in the kind of graph described in the previous section. It is a stochastic method for the study of so-called hierarchical small-world graphs (Gaume et al., 2002) (see also the next section). The idea is to see the graph as a Markov chain whose states are the graph nodes and whose transitions are its edges, taken with equal probabilities. We then send random particles walking through this graph; their trajectories, and the dynamics of those trajectories, reveal structural properties of the graph. In short, we assume that how likely a particle is to reach one node from another within a given time is an indication of the semantic distance between these nodes.
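To make the random-walk picture concrete, here is a minimal Python sketch of the idea (ours, for illustration only, on a hypothetical toy graph; it is not the implementation used for the experiments): the adjacency matrix is row-normalised into a Markov transition matrix, and the i-step probabilities are read off its i-th power.

```python
# Sketch of the Prox idea on a toy graph (our illustration, not the
# paper's implementation).  The graph is seen as a Markov chain: from
# each node a particle moves to one of its neighbours with equal
# probability, and PROX(G, i, r, s) is the probability of being at
# node s after i steps starting from node r.

def markov_matrix(adj):
    """Row-normalise a 0/1 adjacency matrix so that each row sums to 1."""
    return [[cell / sum(row) for cell in row] for row in adj]

def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[r][k] * b[k][s] for k in range(n)) for s in range(n)]
            for r in range(n)]

def prox(adj, i, r, s):
    """Entry (r, s) of the i-th power of the transition matrix (i >= 1)."""
    g = markov_matrix(adj)
    power = g
    for _ in range(i - 1):
        power = mat_mul(power, g)
    return power[r][s]

# Toy undirected graph: a triangle 0-1-2 plus a pendant node 3 attached to 2.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

print(prox(adj, 4, 0, 1), prox(adj, 4, 0, 3))
```

In this toy graph, nodes 0 and 1 belong to a densely connected triangle, so after a few steps a particle starting at 0 is more likely to be found at node 1 than at the pendant node 3, which is the clustering effect the measure exploits.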
Obviously, nodes located in highly clustered areas will tend to be separated by smaller distances.</Paragraph> <Paragraph position="22"> Formally, if G = (V, E) is an irreflexive graph with |V| = n, we note [G] the n × n adjacency matrix of G, that is, such that [G]_{i,j} (the ith row and jth column) is 1 if there is an edge between node i and node j, and 0 otherwise.</Paragraph> <Paragraph position="23"> We note [Ĝ] the Markovian matrix of G, such that [Ĝ]_{r,s} = [G]_{r,s} / Σ_t [G]_{r,t}, i.e. each row of [G] is divided by its sum.</Paragraph> <Paragraph position="25"> In the case of graphs built from a dictionary as above, [Ĝ]_{r,s} is 0 if there is no edge between nodes r and s, and 1/D otherwise, where D is the number of neighbours of r. This is indeed a Markovian transition matrix since the sum of each row is one (the graph considered being connected).</Paragraph> <Paragraph position="26"> We note [Ĝ]^i the matrix [Ĝ] multiplied i times by itself.</Paragraph> <Paragraph position="27"> Let now PROX(G, i, r, s) be ([Ĝ]^i)_{r,s}. This is thus the probability that a random particle leaving node r will be in node s after i time steps. This is the measure we will use to determine whether a node s is closer to a node r than another node t. Now we still have to find a proper value for i. The next section explains the choice we have made.</Paragraph> </Section> </Paper>