<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1046"> <Title>Lexical Disambiguation using Simulated Annealing</Title> <Section position="3" start_page="238" end_page="238" type="metho"> <SectionTitle> 2. Simulated Annealing </SectionTitle> <Paragraph position="0"> The method of simulated annealing [Metropolis et al., 1953; Kirkpatrick et al., 1983] is a technique for solving large-scale problems of combinatorial minimization.</Paragraph>
<Paragraph position="1"> It has been successfully applied to the famous Traveling Salesman problem of finding the shortest route for a salesman who must visit a number of cities in turn, and is now a standard method for optimizing the placement of circuit elements on large-scale integrated circuits. Simulated annealing was applied to parsing by Sampson [1986], but since the method has not yet been widely applied to Computational Linguistics or Natural Language Processing, we describe it briefly.</Paragraph>
<Paragraph position="2"> The name of the algorithm is an analogy to the process by which metals cool and anneal. A feature of this phenomenon is that slow cooling usually allows the metal to reach a uniform composition and a minimum energy state, while fast cooling leads to an amorphous state with higher energy. In simulated annealing, a parameter T which corresponds to temperature is decreased slowly enough to allow the system to find its minimum.</Paragraph>
<Paragraph position="3"> The process requires a function E of configurations of the system which corresponds to the energy. It is E that we seek to minimize. From a starting point, a new configuration is randomly chosen, and a new value of E is computed. If the new E is less than the old one, the new configuration is chosen to replace the old one.
An essential feature of simulated annealing is that even if the new E is larger than the old (indicating that this configuration is farther away from the desired minimum than the last choice), the new configuration may be chosen. The decision of whether or not to replace the old configuration with the new inferior one is made probabilistically. This feature of allowing the algorithm to &quot;go uphill&quot; helps it to avoid settling on a local minimum which is not the actual minimum. In succeeding trials, it becomes more difficult for configurations which increase E to be chosen, and finally, when the method has retained the same configuration for long enough, that configuration is chosen as the solution. In the Traveling Salesman example, the configurations are the different paths through the cities, and E is the total length of the trip. The final configuration is an approximation to the shortest path through the cities. The next section describes how the algorithm may be applied to word-sense disambiguation.</Paragraph> </Section> <Section position="4" start_page="238" end_page="239" type="metho"> <SectionTitle> 3. Word-Sense Disambiguation </SectionTitle> <Paragraph position="0"> Given a sentence with N words, we may represent the senses of the ith word as s_i1, s_i2, ..., s_ik_i, where k_i is the number of senses of the ith word which appear in LDOCE. A configuration of the system is obtained by choosing a sense for each word in the sentence. Our goal is to choose that configuration which a human disambiguator would choose. To that end, we must define a function E whose minimum we may reasonably expect to correspond to the correct choice of the word senses.</Paragraph>
<Paragraph position="1"> The value of E for a given configuration is calculated in terms of the definitions of the N senses which make it up. All words in these definitions are stemmed, and the results stored in a list.
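Since the annealing loop of Section 2 is described only in prose, a minimal sketch may help. The quadratic objective and the plus-or-minus-one neighbor move below are illustrative assumptions, not anything from the paper; the initial temperature of 1, the 0.9 cooling factor, and the 1000 trials per temperature echo the constants given later for the word-sense case.

```python
import math
import random

def anneal(energy, neighbor, start, t=1.0, cooling=0.9,
           steps_per_temp=1000, t_min=1e-3):
    """Generic simulated annealing: always accept a configuration that
    lowers E; accept one that raises E with probability exp(-dE / T),
    which shrinks as the temperature T is lowered."""
    current, e_current = start, energy(start)
    while t > t_min:
        for _ in range(steps_per_temp):
            candidate = neighbor(current)
            d_e = energy(candidate) - e_current
            # Downhill moves are always taken; uphill moves survive
            # only probabilistically, which lets the search escape
            # local minima without wandering forever.
            if d_e <= 0 or random.random() < math.exp(-d_e / t):
                current, e_current = candidate, e_current + d_e
        t *= cooling  # slow cooling, as in the metallurgical analogy
    return current

# Toy use: minimize (x - 3)^2 over the integers from a distant start.
random.seed(0)
best = anneal(lambda x: (x - 3) ** 2,
              lambda x: x + random.choice((-1, 1)),
              start=50)
```

For the Traveling Salesman example, current would instead be a tour and energy its total length; only those two arguments change.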
The redundancy R is computed by giving a stemmed word form which appears n times a score of n - 1 and adding up the scores. Finally, E is defined to be 1/(1 + R). The rationale behind this choice of E is that word senses which belong together in a sentence will have more words in common in their definitions (larger values of R) than senses which do not belong together. Minimizing E will maximize R and determine our choice of word senses.</Paragraph>
<Paragraph position="2"> The starting configuration C is chosen to be that in which sense number one of each word is chosen. Since the senses in LDOCE are generally listed with the most frequently used sense first, this is a likely starting point. The value of E is computed for this configuration. The next step is to choose at random a word number i and a sense S_ij of that ith word. The configuration C′ is constructed by replacing the old sense of the ith word by the sense S_ij. Let ΔE be the change from E to the value computed for C′. If ΔE < 0, then C′ replaces C, and we make a new random change in C′. If ΔE >= 0, we change to C′ with probability P = e^(-ΔE/T). In this expression, T is a constant whose initial value is 1, and the decision of whether or not to adopt C′ is made by calling a random number generator. If the number generated is less than P, C is replaced by C′. Otherwise, C is retained. This process of generating new configurations and checking to see whether or not to choose them is repeated on the order of 1000 times, T is replaced by 0.9T, and the loop entered again. Once the loop is executed with no change in the configuration, the routine ends, and this final configuration tells which word senses are to be selected.</Paragraph>
<Paragraph position="3"> these choices of word senses with the output of the program.
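As a concrete illustration of the procedure just described, the sketch below implements the redundancy score, the energy E = 1/(1 + R), and the annealing loop with the stated constants. The sense definitions are invented toy stand-ins for LDOCE entries (already stemmed), not real dictionary data.

```python
import math
import random
from collections import Counter

def redundancy(definitions):
    """R: a stemmed form appearing n times across the chosen
    definitions contributes n - 1 to the score."""
    counts = Counter(word for d in definitions for word in d)
    return sum(n - 1 for n in counts.values())

def energy(config, senses):
    """E = 1 / (1 + R); minimizing E maximizes the redundancy R."""
    chosen = [senses[i][s] for i, s in enumerate(config)]
    return 1.0 / (1.0 + redundancy(chosen))

def disambiguate(senses, t=1.0, trials=1000):
    config = [0] * len(senses)      # sense number one of every word
    e = energy(config, senses)
    while True:
        changed = False
        for _ in range(trials):
            i = random.randrange(len(senses))
            new = config[:]
            new[i] = random.randrange(len(senses[i]))
            d_e = energy(new, senses) - e
            # Keep C' if dE < 0, or with probability e^(-dE/T) otherwise.
            if d_e < 0 or random.random() < math.exp(-d_e / t):
                changed = changed or new != config
                config, e = new, e + d_e
        if not changed:             # a full pass with no change: done
            return config
        t *= 0.9                    # cool and enter the loop again

# Invented stemmed definitions: two senses of "bank", one of "river".
senses = [
    [["money", "keep", "lend"], ["land", "side", "river", "pile"]],
    [["water", "stream", "flow", "side", "land"]],
]
random.seed(1)
result = disambiguate(senses)
```

On this toy input, the riverside sense of &quot;bank&quot; wins because its definition shares the stems land and side with the definition of &quot;river&quot;, giving R = 2 and E = 1/3 against E = 1 for the money sense.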
Using the human choices as the standard, the algorithm correctly disambiguated 47% of the words to the sense level, and 72% to the homograph level.</Paragraph>
<Paragraph position="4"> Direct comparisons of these success rates with those of other methods are difficult. None of the other methods was used to disambiguate the same text, and while we have attempted to tag every ambiguous word in a sentence, other methods were applied to one, or at most a few, highly ambiguous words. It appears that in some cases the fact that our success rates include not only highly ambiguous words but also some words with only a few senses is offset by the fact that other researchers have used a broader definition of word sense. For example, the four senses of &quot;interest&quot; used by Zernick and Jacobs [1990] may correspond more closely to our two homographs than to our ten senses of &quot;interest.&quot; Their success rate in tagging the three words &quot;interest&quot;, &quot;stock&quot;, and &quot;bond&quot; was 70%. Thus it appears that the method we propose is comparable in effectiveness to the other computational methods of word-sense disambiguation, and has the advantages of being automatically applicable to all 28,000 words in LDOCE and of being computationally practical.</Paragraph>
<Paragraph position="5"> Below we give two examples of the results of the technique. The words following the arrow are the stemmed words selected from the definitions and used to calculate the redundancy. The headword and sense numbers are those used in the machine-readable version of LDOCE.</Paragraph>
<Paragraph position="6"> Finally, we show two graphs (figure ??) which illustrate the convergence of the simulated annealing technique to the minimum energy (E) level. The second graph is a close-up of the final cycles of the complete process shown in the first graph.</Paragraph> </Section> <Section position="5" start_page="239" end_page="239" type="metho"> <SectionTitle> 4.
An Experiment </SectionTitle> <Paragraph position="0"> The algorithm described above was used to disambiguate 50 example sentences from LDOCE. Words on a stop list of very common words, such as &quot;the&quot;, &quot;as&quot;, and &quot;of&quot;, were removed from each sentence. The sentences then contained from two to fifteen words, with an average of 5.5 ambiguous words per sentence. Definitions in LDOCE are broken down first into broad senses which we call &quot;homographs&quot;, and then into individual senses which distinguish among the various meanings. For example, one homograph of &quot;bank&quot; means roughly &quot;something piled up.&quot; There are five senses in this homograph which distinguish whether the thing piled up is snow, clouds, earth by a river, etc.</Paragraph>
<Paragraph position="1"> Results of the algorithm were evaluated by having a literate human disambiguate the sentences and comparing</Paragraph> </Section> <Section position="6" start_page="239" end_page="240" type="metho"> <SectionTitle> 5. Conclusion </SectionTitle> <Paragraph position="0"> This paper describes a method for word-sense disambiguation based on the simple technique of choosing senses of the words in a sentence so that their definitions in a machine-readable dictionary have the most words in common. The amount of computation necessary to find this optimal choice exactly quickly becomes prohibitive as the number of ambiguous words and the number of senses increase. The computational technique of simulated annealing allows a good approximation to be computed quickly. Advantages of this technique over previous work are that all the words in a sentence are disambiguated simultaneously, in a reasonable time, and automatically (with no hand disambiguation of training text).
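The two evaluation granularities reported above (sense level and homograph level) can be scored with a small helper. Representing each tag as a (homograph, sense) pair is an assumed encoding chosen for illustration, not the paper's or LDOCE's own format.

```python
def score(predicted, gold):
    """Accuracy to the sense level (exact pair match) and to the
    homograph level (first component of the pair only)."""
    n = len(gold)
    sense_acc = sum(p == g for p, g in zip(predicted, gold)) / n
    homograph_acc = sum(p[0] == g[0] for p, g in zip(predicted, gold)) / n
    return sense_acc, homograph_acc

# Hypothetical three-word sentence: one exact hit, one homograph-only
# hit (right homograph, wrong sense), and one complete miss.
predicted = [(1, 2), (1, 1), (2, 4)]
gold      = [(1, 2), (1, 3), (1, 1)]
sense_acc, homograph_acc = score(predicted, gold)
```

A homograph-level hit never scores worse than a sense-level one, which is why the 72% homograph figure exceeds the 47% sense figure.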
Results using this technique are comparable to those of other computational techniques, and enhancements in-</Paragraph> </Section> <Section position="7" start_page="240" end_page="240" type="metho"> <SectionTitle> SENTENCE </SectionTitle> <Paragraph position="0"> The fish floundered on the river bank, struggling to breathe</Paragraph> </Section> </Paper>