<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1046"> <Title>USING A SEMANTIC CONCORDANCE FOR SENSE IDENTIFICATION</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. THE GUESSING HEURISTIC </SectionTitle> <Paragraph position="0"> A guessing strategy presumes the existence of a standard list of words and their senses, but it does not assume any knowledge of the relative frequencies of different senses of polysemous words.</Paragraph> <Paragraph position="1"> We adopted the lexical database WordNet \[2\] as a convenient on-line list of open-class words and their senses. Whenever a word is encountered that has more than one sense in WordNet, a system with no other information could do no better than to select a sense at random.</Paragraph> <Paragraph position="2"> The guessing heuristic that we evaluated was defined as follows: on encountering a noun (other than a proper noun), verb, adjective, or adverb in the test material, look it up in WordNet. If the word is monosemous (has a single sense in WordNet), assign that sense to it. If the word is polysemous (has more than one sense in WordNet), choose a sense at random with a probability of 1/n, where n is the number of different senses of that word.</Paragraph> <Paragraph position="3"> This guessing heuristic was then used with the sample of 103 passages from the Brown Corpus. Given the distribution of open-class words in those passages and the number of senses of each word in WordNet, estimating the probability of a correct sense identification is a straightforward calculation. The result was that 45.0% of the 101,284 guesses would be correct. When the percent correct was calculated for just the 76,067 polysemous word tokens, it was 26.8%.</Paragraph> </Section> <Section position="5" start_page="0" end_page="241" type="metho"> <SectionTitle> 3. THE MOST-FREQUENT HEURISTIC </SectionTitle> <Paragraph position="0"> Data on sense frequencies do exist. 
During the 1930s, Lorge \[4\] hired students at Columbia University to count how often each of the senses in the Oxford English Dictionary occurred in some 4,500,000 running words of prose taken from magazines of the day.</Paragraph> <Paragraph position="1"> These and other word counts were used by Thorndike in writing the Thorndike-Barnhart Junior Dictionary \[5\], a dictionary for children that first appeared in 1935 and that was widely used in the public schools for many years. Not only was Thorndike able to limit his dictionary to words in common use, but he was also able to list senses in the order of their frequency, thus insuring that the senses he included would be the ones that children were most likely to encounter in their reading. The Lorge-Thorndike data, however, do not seem to be available today in a computer-readable form.</Paragraph> <Paragraph position="2"> More recently, the editors of Collins COBUILD Dictionary of the English Language \[6\] made use of the 20,000,000-word COBUILD corpus of written English to insure that the most commonly used words were included. Entries in this dictionary are organized in such a way that, whenever possible, the first sense of a polysemous word is both common and central to the meaning of the word.</Paragraph> <Paragraph position="3"> Again, however, sense-frequencies do not seem to be generally available in a computer-readable form.</Paragraph> <Paragraph position="4"> At the ARPA Human Language Technology Workshop in March 1993, Miller, Leacock, Tengi, and Bunker \[7\] described a semantic concordance that combines passages from the Brown Corpus \[3\] with the WordNet lexical database \[2\] in such a way that every open-class word in the text (every noun, verb, adjective, or adverb) carries both a syntactic tag and a semantic tag pointing to the appropriate sense of that word in WordNet. 
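A minimal sketch of how text tagged this way can be represented and tallied per word/pos key (the tuple layout, tokens, and sense numbers are invented for illustration and do not follow real WordNet numbering):

```python
from collections import Counter, defaultdict

# Each open-class token carries a syntactic tag and a WordNet sense
# pointer; (word, pos, sense_number) tuples are a toy stand-in.
tagged_tokens = [("board", "n", 2), ("board", "n", 2),
                 ("board", "n", 1), ("board", "v", 3)]

def sense_frequencies(tokens):
    """Tally sense counts separately for each word/pos key, so that
    board/n and board/v get independent distributions."""
    freq = defaultdict(Counter)
    for word, pos, sense in tokens:
        freq[f"{word}/{pos}"][sense] += 1
    return freq
```

With these counts, the most frequent sense of board/n is simply `freq["board/n"].most_common(1)`.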
The version of this semantic concordance that existed in August 1993, incorporating 103 of the 500 passages in the Brown Corpus, was made publicly available, along with version 1.4 of WordNet to which the passages were tagged. 1 Passages in the Brown Corpus are approximately 2,000 words long, and average approximately 1,000 open-class words each. Although this sample is much smaller than one would like, this semantic concordance does provide a basis for estimating sense frequencies for open-class words broken down by part of speech (word/pos). For example, there are seven senses of the word &quot;board&quot; as a noun (board/n1, board/n2, ..., board/n7), and four senses as a verb (board/v1, board/v2, ..., board/v4); the frequencies of all eleven senses in the semantic concordance can be tabulated separately to determine the most frequent board/n and the most frequent board/v.</Paragraph> <Paragraph position="5"> The fact that the words that occur most frequently in standard English tend to be the words that are most polysemous creates a bad news, good news situation. The bad news is that most of the content words in textual corpora require disambiguation. The good news is that polysemous words occur frequently enough that statistical estimates are possible on the basis of relatively small samples. It is possible, therefore, to pose the question: on the basis of the available sample, how often would the most frequent sense be correct? A larger semantic concordance would undoubtedly yield a more precise lower bound, but at least an approximate estimate can be obtained.</Paragraph> <Paragraph position="6"> The most-frequent heuristic was defined as follows: on encountering a noun, verb, adjective, or adverb in the test material, look it up in WordNet. If the word is monosemous, assign that sense to it. 
If the syntactically tagged word (word/pos) has more than one sense in WordNet, consult the semantic concordance to determine which sense occurred most often in that corpus and assign that sense to it; if there is a tie, select one of the equally frequent senses at random. If the word is polysemous but does not occur in the semantic concordance, choose a sense at random with a probability of 1/n, where n is the number of different senses of that word in WordNet.</Paragraph> <Paragraph position="7"> In short, when there are data indicating the most frequent sense of a polysemous word, use it; otherwise, guess.</Paragraph> <Paragraph position="8"> 1 Via anonymous ftp from clarity.princeton.edu.</Paragraph> <Section position="1" start_page="240" end_page="240" type="sub_section"> <SectionTitle> 3.1 A Preliminary Experiment </SectionTitle> <Paragraph position="0"> In order to obtain a preliminary estimate of the accuracy of the most-frequent heuristic, a new passage from the Brown Corpus (passage P7, an excerpt from a novel that was classified by Francis and Kučera \[3\] as &quot;Imaginative Prose: Romance and Love Story&quot;) was semantically tagged to use as the test material. The training material was the 103 other passages from the Brown Corpus (not including P7) that made up the semantic concordance. The semantic tags assigned by a human reader were then compared, one word at a time, with the sense assigned by the most-frequent heuristic.</Paragraph> <Paragraph position="1"> For this particular passage, only 62.5% of the open-class words were correctly tagged by the most-frequent heuristic. This estimate is generous, however, since 24% of the open-class words were monosemous. When the average is taken solely over polysemous words, the most frequent sense was right only 50.8% of the time.</Paragraph> <Paragraph position="2"> These results were lower than expected, so we asked whether passage P7 might be unusual in some way. 
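The decision procedure defined above can be sketched as follows (the lexicon and frequency-table interfaces are assumptions for illustration, not the paper's actual code):

```python
import random
from collections import Counter

def most_frequent_heuristic(word, pos, n_senses, freq, rng=random.Random(0)):
    """Assign a sense number to word/pos.
    n_senses -- number of senses of word/pos in WordNet
    freq     -- dict mapping 'word/pos' to a Counter of training-corpus
                sense counts (missing key for unseen words)."""
    if n_senses == 1:
        return 1                                   # monosemous: only sense
    counts = freq.get(f"{word}/{pos}")
    if counts:
        top = max(counts.values())                 # most frequent sense...
        ties = [s for s, c in counts.items() if c == top]
        return rng.choice(ties)                    # ...ties broken at random
    return rng.randint(1, n_senses)                # unseen word: guess, prob 1/n
```

The three branches mirror the text: monosemous assignment, most-frequent with random tie-breaking, and the 1/n guess for words absent from the training material.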
For example, the sentences were relatively short and there were fewer monosemous words than in an average passage in the training material. However, an inspection of these data did not reveal any trend as a function of sentence length; short sentences were no harder than long ones. And the lower frequency of monosemous words is consistent with the non-technical nature of the passage; there is no obvious reason why that should influence the results for polysemous words. Without comparable data for other passages, there is no way to know whether these results for P7 are representative or not.</Paragraph> </Section> <Section position="2" start_page="240" end_page="241" type="sub_section"> <SectionTitle> 3.2 A Larger Sample </SectionTitle> <Paragraph position="0"> Rather than tag other new passages to use as test material, we decided to use passages that were already tagged semantically.</Paragraph> <Paragraph position="1"> That is to say, any tagged passage in the semantic concordance can be made to serve as a test passage by simply eliminating it from the training material. For example, in order to use passage X as a test passage, we can delete it from the semantic concordance; then, using this diminished training material, the most-frequent heuristic is evaluated for passage X. Next, X is restored, Y is deleted, and the procedure repeats. Since there are 103 tagged passages in the semantic concordance, this produces 103 data points in addition to the one we already have for P7.</Paragraph> <Paragraph position="2"> Using this procedure, the average number of correct sense identifications produced by the most-frequent heuristic is 66.9% (standard deviation, σ = 3.7%) when all of the open-class words, both monosemous and polysemous, are included. When only polysemous words are considered, the average drops to 56.4% (σ = 4.3%). 
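The hold-one-out procedure just described can be sketched generically; the `evaluate` function, which trains on the remaining passages and scores the heuristic on the held-out one, is a hypothetical stand-in:

```python
from statistics import mean, stdev

def leave_one_out(passages, evaluate):
    """For each tagged passage, hold it out, treat the rest as training
    material, and record the heuristic's score on the held-out passage."""
    scores = []
    for i, held_out in enumerate(passages):
        training = passages[:i] + passages[i + 1:]
        scores.append(evaluate(held_out, training))
    return scores

def summarize(scores):
    """Mean and (sample) standard deviation, as reported in the text."""
    return mean(scores), stdev(scores)
```

Each passage thus yields one data point, giving the 104 scores whose mean and σ are quoted above.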
This larger sample shows that the results obtained from the preliminary experiment with passage P7 were indeed low, more than a standard deviation below the mean.</Paragraph> <Paragraph position="3"> The scores obtained when the most-frequent heuristic is applied to these 2,000-word passages appear to be normally distributed.</Paragraph> <Paragraph position="4"> Cumulative distributions of the scores for all 104 passages are shown in Figure 1. Separate distributions are shown for all open-class words (both monosemous and polysemous) and for the polysemous open-class words alone.</Paragraph> <Paragraph position="5"> No doubt some of this variation is attributable to differences in genre between passages. Table 1 lists the 15 categories of prose sampled by Francis and Kučera \[3\], along with the number of passages of each type in the semantic concordance and the average percentage correct according to the most-frequent heuristic.</Paragraph> <Paragraph position="6"> The passages of &quot;Informative Prose&quot; (A through J) tend to give lower scores than the passages of &quot;Imaginative Prose&quot; (K through R), suggesting that fiction writers are slightly more likely to use words in their commonest senses. But the differences are small.</Paragraph> </Section> <Section position="3" start_page="241" end_page="241" type="sub_section"> <SectionTitle> 3.3 Effects of Guessing </SectionTitle> <Paragraph position="0"> As the most-frequent heuristic is defined above, when a polysemous open-class word is encountered in the test material that has not occurred anywhere in the training material, a random guess at its sense is used. 
Such cases, which lower the average scores, are a necessary but unfortunate consequence of the relatively small sample of tagged text that is available; with a large sample we should have sense frequencies for all of the polysemous words.</Paragraph> <Paragraph position="1"> However, we can get some idea of how significant this effect is by simply omitting all instances of guessing, i.e., by basing the percentage correct only on those words for which there are data available in the training material.</Paragraph> <Paragraph position="2"> When guesses are dropped out, an improvement of approximately 2% is obtained. That is to say, the mean for all substantive words increases from 66.9% to 69.0% (σ = 3.8%), and the mean for polysemous words alone increases from 56.4% to 58.2% (σ = 4.5%).</Paragraph> <Paragraph position="3"> We take these values to be our current best estimates of the performance of a most-frequent heuristic when a large database is available. Stated differently: any sense identification system that does no better than 69% (or 58% for polysemous words) is no improvement over a most-frequent heuristic.</Paragraph> </Section> </Section> <Section position="6" start_page="241" end_page="242" type="metho"> <SectionTitle> 4. THE CO-OCCURRENCE HEURISTIC </SectionTitle> <Paragraph position="0"> The criterion of correctness in these studies is agreement with the judgment of a human reader, so it should be instructive to consider how readers do it. A reader's judgments are made on the basis of whole phrases or sentences; senses of co-occurring words are allowed to determine one another and are identified together. The general rule is that only senses that suit all of the words in a sentence can co-occur; not only does word W1 constrain the sense of another word W2 in the same sentence, but W2 also constrains the sense of W1. 
That is what is meant when we say that context guides a reader in determining the senses of individual words.</Paragraph> <Paragraph position="1"> Given the importance of co-occurring senses, therefore, we undertook to determine whether, on the basis of the available data, co-occurrences could be exploited for sense identification. In addition to information about the most frequent senses, a semantic concordance also contains information about senses that tend to occur together in the same sentences. It is possible to compile a semantic co-occurrence matrix: a matrix showing how often the senses of each word co-occur in sentences in the semantic concordance. For example, if the test sentence is &quot;The horses and men were saved,&quot; we search the semantic co-occurrence matrix for co-occurrences of horse/n and man/n, horse/n and save/v, and man/n and save/v. This search reveals that the fifth sense of the noun horse, horse/n5, co-occurred twice in the same sentence with man/n2 and four times with man/n6, but neither horse/n nor man/n co-occurred in the same sentence with save/v. If we then take the more frequent of the two co-occurring senses of man/n, we select man/n2. But no co-occurrence information is provided as to which one of the 7 senses of save/v should be chosen; for save/v it is necessary to resort to the most frequent sense, as described above.</Paragraph> <Paragraph position="2"> The co-occurrence heuristic was defined as follows. First, compile a semantic co-occurrence matrix. That is to say, for every word-sense in the semantic concordance, compile a list of all the other word-senses that co-occur with it in any sentence. Then, on encountering a noun, verb, adjective, or adverb in the test material, look it up in WordNet. If the word is monosemous, assign that sense to it. 
If the word has more than one sense in WordNet, consult the semantic co-occurrence matrix to determine what senses of the word co-occur in the training material with other words in the test sentence. If only one sense of the polysemous word co-occurs in the training material with other words in the test sentence, assign that sense to it. If more than one sense of the polysemous word co-occurs in the training material with other words in the test sentence, select from among the co-occurring senses the sense that is most frequent in the training material; break ties by a random choice. If the polysemous word does not co-occur in the training material with other words in the test sentence, select the sense that is most frequent in the training material; break ties by a random choice.</Paragraph> <Paragraph position="3"> And if the polysemous word does not occur at all in the training material, choose a sense at random with a probability of 1/n.</Paragraph> <Paragraph position="4"> In short, where there are data indicating co-occurrences of senses of polysemous words, use them; if not, use the most-frequent heuristic; otherwise, guess.</Paragraph> <Paragraph position="5"> When this co-occurrence heuristic was applied to the 104 semantically tagged passages, the results were almost identical to those for the most-frequent heuristic. Means using the co-occurrence heuristic were perhaps a half percent lower than those obtained with the most-frequent heuristic. And when the effects of guessing were removed, an improvement of approximately 2% was obtained, as before. 
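The matrix compilation and the candidate-filtering step of this heuristic can be sketched as follows, under assumed data structures (word-sense keys such as 'horse/n5'; the sentence data are invented):

```python
from collections import Counter, defaultdict
from itertools import permutations

def cooccurrence_matrix(tagged_sentences):
    """Count, for every word-sense, how often each other word-sense
    appears in the same sentence anywhere in the training material."""
    matrix = defaultdict(Counter)
    for senses in tagged_sentences:
        for a, b in permutations(set(senses), 2):
            matrix[a][b] += 1
    return matrix

def cooccurring_candidates(candidates, context_senses, matrix):
    """Keep only the candidate senses that co-occurred in training with
    some sense of another word in the test sentence; when the result is
    empty, the caller falls back to the most-frequent heuristic."""
    return [s for s in candidates
            if any(matrix[s][c] for c in context_senses)]
```

For &quot;The horses and men were saved,&quot; the candidates for horse/n would be filtered against all training senses of man/n and save/v, as in the example above.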
This similarity can be attributed to the limited size of the semantic concordance: no co-occurrence data were available for 28% of the polysemous words, so the most-frequent heuristic had to be used; moreover, those words for which co-occurrence data were available tended to occur in their most frequent senses.</Paragraph> <Paragraph position="6"> On the basis of results obtained with the available sample of semantically tagged text, therefore, there is nothing to be gained by using the more complex co-occurrence heuristic. Since context is so important in sense identification, however, we concluded that our semantic concordance is still too small to estimate the potential limits of a co-occurrence heuristic.</Paragraph> </Section> </Paper>