<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3811"> <Title>Synonym Extraction Using a Semantic Distance on a Dictionary</Title> <Section position="2" start_page="0" end_page="65" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Thesauri are an important resource in many natural language processing tasks. They are used in information retrieval (Zukerman et al., 2003), machine or semi-automated translation (Ploux and Ji, 2003; Barzilay and McKeown, 2001; Edmonds and Hirst, 2002), and generation (Langkilde and Knight, 1998).</Paragraph> <Paragraph position="1"> Since gathering such lexical information is a delicate and time-consuming endeavour, some effort has been devoted to the automatic building of sets of synonym words or expressions.</Paragraph> <Paragraph position="2"> Synonym extraction suffers from a variety of methodological problems, however. Synonymy itself is not an easily definable notion. Totally equivalent words (in meaning and use) arguably do not exist, and some people prefer to talk about near-synonyms (Edmonds and Hirst, 2002). A near-synonym is a word that can be used instead of another one, in some contexts, without too much change in meaning. 
This leaves a lot of freedom in the degree of synonymy one is ready to accept.</Paragraph> <Paragraph position="3"> Other authors include &quot;related&quot; terms, such as hyponyms and hypernyms, in the building of thesauri (Blondel et al., 2004), in a somewhat arbitrary way.</Paragraph> <Paragraph position="4"> More generally, paraphrase is the preferred term for alternative formulations of words or expressions in the context of information retrieval or machine translation.</Paragraph> <Paragraph position="5"> Then there is the question of evaluating the results.</Paragraph> <Paragraph position="6"> Comparison with an already existing thesaurus is a debatable method when automatic construction is supposed to complement that thesaurus, when a specific domain is targeted, or simply when the automatic procedure is supposed to fill a void. Manual verification of a sample of extracted synonyms is a common practice, carried out either by the authors of a study or by independent lexicographers. This of course does not solve the problems related to the definition of synonymy in the &quot;manual&quot; design of a thesaurus, but it can help evaluate the relevance of automatically extracted synonyms that might otherwise have been overlooked. At best, one can hope for a semi-automatic procedure where lexicographers weed out bad candidates from a set of proposals that is hopefully not too noisy.</Paragraph> <Paragraph position="7"> A few studies have tried to use the lexical information available in a general dictionary and find patterns that would indicate synonymy relations (Blondel et al., 2004; Ho and Cedrick, 2004). 
The general idea is that words are related by the definitions they appear in, forming a complex network that must be semantic in nature (this has also been applied to word sense disambiguation, albeit with limited success (Veronis and Ide, 1990; Kozima and Furugori, 1993)).</Paragraph> <Paragraph position="8"> We present here a method exploiting the graph structure of a dictionary, where words are related by the definitions they appear in, to compute a distance between words. This distance is used to isolate candidate synonyms for a given word. We present an evaluation of the relevance of the candidates on a sample of the lexicon.</Paragraph> <Paragraph position="9"> 2 Semantic distance on a dictionary graph We describe here our method (dubbed Prox) for computing a distance between nodes in a graph. Basically, nodes are derived from dictionary entries or from words appearing in definitions, and there is an edge between an entry and each word in its definition (more in section 3). Such graphs are &quot;small world&quot; networks with distinguishing features, and we hypothesize that these features reflect a linguistic and semantic organisation that can be exploited (Gaume et al., 2005).</Paragraph> <Paragraph position="10"> The idea is to see the graph as a Markov chain whose states are the graph nodes and whose transitions are its edges, weighted with probabilities. We then send random particles walking through this graph; the dynamics of their trajectories reveal its structural properties. In short, we assume that the average distance a particle has travelled between two nodes after a given time is an indication of the semantic distance between these nodes. 
Obviously, nodes located in highly clustered areas will tend to be separated by smaller distances.</Paragraph> <Paragraph position="11"> Formally, if G = (V,E) is a reflexive graph (each node is connected to itself) with |V| = n, we note [G] the n x n adjacency matrix of G, such that [G]i,j (the entry in the ith row and jth column) is non-zero if there is an edge between node i and node j, and 0 otherwise. We can have different weights for the edges between nodes (cf. next section), but the method remains the same.</Paragraph> <Paragraph position="12"> The first step is to turn this matrix into a Markovian matrix. We note [Ĝ] the Markovian matrix of G, such that</Paragraph> <Paragraph position="13"> [Ĝ]i,j = [G]i,j / Σk [G]i,k </Paragraph> <Paragraph position="14"> The sum of each row of [G] is different from 0 since the graph is reflexive.</Paragraph> <Paragraph position="15"> We note [Ĝ]i the matrix [Ĝ] multiplied i times by itself. Let now PROX(G,i,r,s) be [Ĝ]i r,s, the entry in row r and column s of [Ĝ]i. This is thus the probability that a random particle leaving node r will be in node s after i time steps. This is the measure we use to determine whether a node s is closer to a node r than another node t. The choice of i depends on the graph and is explained later (cf. section 4).</Paragraph> </Section></Paper>
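The PROX computation described above (row-normalize the reflexive adjacency matrix, raise it to the ith power, and read off one entry) can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: the toy adjacency matrix, node indices, and choice of i are invented for the example.

```python
import numpy as np

def prox(adjacency: np.ndarray, i: int, r: int, s: int) -> float:
    """PROX(G, i, r, s): probability that a random particle leaving node r
    is at node s after i steps on the reflexive graph given by `adjacency`."""
    G = adjacency.astype(float)
    # Make the graph reflexive (self-loops), so every row sum is non-zero.
    np.fill_diagonal(G, np.maximum(G.diagonal(), 1.0))
    # Row-normalize into the Markovian matrix [G-hat].
    G_hat = G / G.sum(axis=1, keepdims=True)
    # Entry (r, s) of [G-hat] raised to the ith power.
    return float(np.linalg.matrix_power(G_hat, i)[r, s])

# Toy dictionary graph (hypothetical): node 0 is an entry whose definition
# uses words 1 and 2; word 1's definition uses word 2; word 2's uses word 0.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]])
print(prox(A, 3, 0, 2))
```

Each row of the normalized matrix sums to 1, so every PROX value is a genuine probability in [0, 1]; nodes in densely interconnected neighbourhoods accumulate higher values, matching the clustering intuition in the text.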