<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1173"> <Title>Word Sense Disambiguation using a dictionary for sense similarity measure</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Dictionaries as hierarchical </SectionTitle> <Paragraph position="0"> small-worlds Recent work in graph theory has revealed a set of features shared by many graphs observed &quot;in the field&quot; These features define the class of &quot;hierarchical small world&quot; networks (henceforth HSW) (Watts and Strogatz, 1998; Newman, 2003). The relevant features of a graph in this respect are the following: D the density of the network. HSWs typically have a low D, i.e. they have rather few edges compared to their number of vertices. null L the average shortest path between two nodes. It is also low in a HSW.</Paragraph> <Paragraph position="1"> C the clustering rate. This is a measure of how often neighbours of a vertex are also connected in the graph. In a HSW, this feature is typically high.</Paragraph> <Paragraph position="2"> I the distribution of incidence degrees (i.e. the number of neighbours) of vertices according to the frequency of nodes (how many nodes are there that have an incidence degree of 1, 2, ... n). In a HSW network, this distribution follows a power law: the probability P(k) that a given node has k neighbours decreases as k , with lambda > 0.</Paragraph> <Paragraph position="3"> It means also that there are very few nodes with a lot of neighbours, and a lot more nodes with very few neighbours.</Paragraph> <Paragraph position="4"> As a mean of comparison, table 1 shows the differences between randoms graphs (nodes are given, edges are drawn randomly between nodes), regular graphs and HSWs.</Paragraph> <Paragraph position="5"> The graph of a dictionary belongs to the class of HSW. For instance, on the dictionary we used, D=7, C=0.183, L=3.3. Table 2 gives a few characteristics of the graph of nouns only on the dictionary we used (starred columns indicate values for the maximal self-connected component).</Paragraph> <Paragraph position="6"> We also think that the hierarchical aspect of dictionaries (reflected in the distribution of incidence degrees) is a consequence of the role of hypernymy associated to the high polysemy of some entries, while the high clustering rate define local domains that are useful for disambiguation. We think these two aspects determine the dynamics of random walks in the graph as explained above, and we assume they are what makes our method interesting for sense disambiguation.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Word sense disambiguation </SectionTitle> <Paragraph position="0"> using Prox semantic distance We will now present a method for disambiguating dictionary entries using the semantic distance introduced section (3).</Paragraph> <Paragraph position="1"> The task can be defined as follows: we consider a word lemma occurring in the definition of one of the senses of a word . We want to tag with the most likely sense it has in this context. Each dictionary entry is coded as a tree of senses in the graph of the dictionary, with a number list associated to each sub-entry.</Paragraph> <Paragraph position="2"> For instance for a given word sense of word W, listed as sub-sense II.3.1 in the dictionary, we would record that sense as a node W(2,3,1) in the graph. 
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Word sense disambiguation using Prox semantic distance </SectionTitle>
<Paragraph position="0"> We will now present a method for disambiguating dictionary entries using the semantic distance introduced in section 3.</Paragraph>
<Paragraph position="1"> The task can be defined as follows: we consider a word lemma β occurring in the definition of one of the senses of a word α. We want to tag β with the most likely sense it has in this context. Each dictionary entry is coded as a tree of senses in the graph of the dictionary, with a number list associated with each sub-entry.</Paragraph>
<Paragraph position="2"> For instance, for a given word sense of word W, listed as sub-sense II.3.1 in the dictionary, we would record that sense as a node W(2,3,1) in the graph. In fact, to take homography into account we had to add another level to this: for instance, W(1,1,2) is sense 1.2 of the first homograph of word W. In the absence of a homograph, the first number for a word sense is conventionally 0.</Paragraph>
<Paragraph position="3"> Let G=(V,E) be the graph of words built as explained in section 2, let [G] be the adjacency matrix of G, and let [Ĝ] be the corresponding Markovian matrix. The following algorithm has then been applied: 1. Delete from G the neighbours that are sub-senses of α (see note 1 below). 2. Choose a length i for the random walk and compute [Ĝ]^i (see note 2 below). 3. Let L be the line in the result corresponding to α: for all k, L[k] = ([Ĝ]^i)[α, k]. 4. Let E = {x1, x2, ..., xn} be the nodes corresponding to all the sub-senses induced by the definition of β.</Paragraph>
<Paragraph position="4"> Then take xk = argmax_{x ∈ E} L[x]. Then xk is the sub-sense with the best rank according to the Prox distance.</Paragraph>
<Paragraph position="5"> The first two steps need a little explanation: 1. These neighbours are deleted because otherwise there is a bias towards the sub-senses of α, which then form a sort of &quot;artificial&quot; cluster with respect to the given task. This is done to allow the random walk to really go into the larger network of senses.</Paragraph>
<Paragraph position="6"> 2. Choosing a good value for the length of the random walk through the graph is not a simple matter. If it is too small, only local relations appear (near synonyms, etc.), which might not appear in the contexts to disambiguate (this is the main problem of Lesk's method); if it is too large, the distances between words converge to a constant value. So it has to be related in some way to the average path length between two senses in the graph. A reasonable assumption is therefore to stay close to this average length. Hence we took i = 6, since the average length is 5.21 (in the graph with a node for every sub-sense; the graph with a node for each entry has L=3.3).</Paragraph> </Section>
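For concreteness, here is a minimal sketch of steps 2-4 in Python with NumPy, under our reading of the algorithm: the walk starts from the row of α, step 1 is assumed to have already been applied to the adjacency matrix, and all node indices and names are illustrative, not from the paper:

```python
import numpy as np

def markov_matrix(adjacency):
    """Turn the adjacency matrix [G] into the Markovian matrix [G-hat]
    by normalizing each non-null line so that it sums to 1."""
    degrees = adjacency.sum(axis=1, keepdims=True)
    return adjacency / np.where(degrees == 0, 1.0, degrees)

def prox_best_sense(adjacency, alpha, candidate_senses, i=6):
    """Steps 2-4: compute [G-hat]^i, read the line L corresponding to
    alpha, and return the candidate sub-sense with the best rank."""
    walk = np.linalg.matrix_power(markov_matrix(adjacency), i)
    L = walk[alpha]                       # step 3: L[k] = ([G-hat]^i)[alpha, k]
    return max(candidate_senses, key=lambda x: L[x])  # step 4: argmax over E

# Toy graph: node 0 stands for alpha, nodes 3 and 4 for the two
# sub-senses of the word beta to disambiguate.
A = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (2, 3), (1, 4)]:
    A[u, v] = A[v, u] = 1.0               # the graph is assumed undirected here
print(prox_best_sense(A, alpha=0, candidate_senses=[3, 4]))
```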
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Evaluating the results </SectionTitle>
<Paragraph position="0"> The following methodology has been followed to evaluate the process.</Paragraph>
<Paragraph position="1"> We have randomly taken about a hundred sub-entries in the chosen dictionary (out of about 140,000 sub-entries), and have hand-tagged all nouns and verbs in the entries with their sub-senses (and homograph number), except those with only one sense, for a total of about 350 words to disambiguate. For each (context, target) pair, the algorithm gives a ranked list of all the sub-senses of the target. Although we have used both nouns and verbs to build the graph of senses, we have tested disambiguation first on nouns only, for a total of 237 nouns. We have then looked at how the algorithm behaves when we use both nouns and verbs in the graph of senses.</Paragraph>
<Paragraph position="2"> To compare the human annotation to the automated one, we have applied the following measures, where (h1, h2, ...) is the human tag and (s1, s2, ...) is the top-ranked system output for a context i, defined as the entry and the target word to disambiguate: 1. if h1 = 0 then do nothing; otherwise the homograph score is 1 if h1 = s1, and 0 otherwise; 2. in all cases, the coarse polysemy count is 1 if the two tags agree on the main sense (h1 = s1 and h2 = s2), and 0 otherwise; 3. the fine polysemy count is 1 if the two tags are identical, and 0 otherwise.</Paragraph>
<Paragraph position="3"> Thus, the coarse polysemy score computes how many times the algorithm gives a sub-sense that has the same &quot;main&quot; sense as the human tag (the main sense corresponds to the first level in the hierarchy of senses as defined above). The fine polysemy score gives the number of times the algorithm gives exactly the same sense as the human annotator. A few examples of system outputs compared to human tags (correct/error is counted at the coarse polysemy level):
result  | entry                | target     | system output | human tag
correct | bal#n._m.*0_3        | lieu#n.    | 1_1_3         | 1_1_1
correct | van#n._m.*2_0_0_0_0  | voiture#n. | 0_2           | 0_2_3
error   | phonetisme#n._m.*0   | moyen#n.   | 1_1_1         | 2_1
error   | creativite#n._f.*0   | pouvoir#n. | 2_3           | 2_1
error   | acme#n._m._ou_f.*0_1 | phase#n.   | 0_1           | 0_4</Paragraph>
<Paragraph position="4"> To give an idea of the difficulty of the task, we have computed the average number of main sub-senses and the number of all senses, for each target word. This corresponds to a random algorithm choosing between all senses of a given word. The expected value of this baseline is thus:
homograph score = Σx 1/(number of homographs of x)
coarse polysemy = Σx 1/(number of main sub-senses of x)
fine polysemy = Σx 1/(number of all sub-senses of x)</Paragraph>
<Paragraph position="5"> A second baseline consists in always answering the first sense of the target word, since this is often (but not always) the most common usage of the word. We did not do this for homographs, since the order in which they are given in the dictionary does not seem to reflect their importance.</Paragraph>
<Paragraph position="6"> Table 4 sums up the results.</Paragraph> </Section> </Paper>