<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1045">
  <Title>Word-Sense Disambiguation Using Statistical Models of Roget's Categories</Title>
  <Section position="2" start_page="0" end_page="234" type="metho">
    <SectionTitle>
2. OUR PREVIOUS WORK ON
WORD-SENSE DISAMBIGUATION
2.1. Data Deprivation
</SectionTitle>
    <Paragraph position="0"> Although there has been a long history of work on word-sense disambiguation, much of the work has been stymied by difficulties in acquiring appropriate testing and training materials. AI approaches have tended to focus on &amp;quot;toy&amp;quot; domains because of the difficulty in acquiring large lexicons. So too, statistical approaches, e.g., Kelly and Stone (1975), Black (1988), have tended to focus on a relatively small set of polysemous words because they have depended on extremely scarce hand-tagged materials for use in testing and training.</Paragraph>
    <Paragraph position="1"> We have achieved considerable progress recently by using a new source of testing and training materials and the application of Bayesian discrimination methods. Rather than depending on small amounts of hand-tagged text, we have been making use of relatively large amounts of parallel text, text such as the Canadian Hansards, which are available in multiple languages. The translation can often be used in lieu of hand-labeling. For example, consider the polysemous word sentence, which has two major senses: (1) a judicial sentence, and (2), a syntactic sentence. We can collect a number of sense (1) examples by extracting instances that are translated as peine, and we can collect a number of sense (2) examples by extracting instances that are translated as phrase. In this way, we have been able to acquire a considerable amount of testing and training material for developing and testing our disambiguation algorithms.</Paragraph>
    <Paragraph position="2"> The use of bilingual materials for discrimination decisions in machine tranlation has been discussed by Brown and others (1991), and by Dagan, Itai, and Schwall (1991). The use of bilingual materials for an essentially monolingual purpose, sense disambiguation, is similar in method, but differs in purpose.</Paragraph>
    <Section position="1" start_page="0" end_page="233" type="sub_section">
      <SectionTitle>
2.2. Bayesian Discrimination
</SectionTitle>
      <Paragraph position="0"> Surprisingly good results can be achieved using Bayesian discrimination methods which have been used very successfully in many other applications, especially author identification (Mosteller and Wallace, 1964) and information retrieval (IR) (Salton, 1989, section 10.3). Our word-sense disambiguation algorithm uses the words in a 100-word context 1 surrounding the polysemous word very much like the other two applications use the words in a test document.</Paragraph>
      <Paragraph position="1"> Information Retreival (IR): Pr(wlret) I1 'r(wlirrd) w in doe lit is common to use very small contexts (e.g., 5-words) based on the observation that people do not need very much context in order to performance the disambiguation task. In contrast, we use much larger contexts (e.g., 100 words). Although people may be able to make do with much less context, we believe the machine needs all the help it can get, and we have found that the larger context makes the task much easier. In fact, we have been able to measure information at extremely large distances (10,000 words away from the polysemous word in question), though obviously most of the useful information appears relatively near the polysemous word (e.g., within the first 100 words or so). Needless to say, our 100-word contexts are considerably larger than the smaller 5-word windows that one normally finds in the literature.</Paragraph>
      <Paragraph position="2">  This model treats the context as a bag of words and ignores a number of important linguistic factors such as word order and collocations (correlations among words in the context). Nevertheless, even with these oversimplifications, the model still contains an extremely large number of parameters: 2V ~ 200,000. It is a non-trivial task to estimate such a large number of parameters, especially given the sparseness of the training data. The training material typically consists of approximately 12,000 words of text (100 words words of context for 60 instances of each of two senses). Thus, there are more than 15 parameters to be estimated from each data point. Clearly, we need to be fairly careful given that we have so many parameters and so little evidence.</Paragraph>
    </Section>
    <Section position="2" start_page="233" end_page="234" type="sub_section">
      <SectionTitle>
2.3. Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> In principle, the conditional probabilities Pr(toklsense ) can be estimated by selecting those parts of the entire corpus which satisfy the required conditions (e.g., 100-word contexts surrounding instances of one sense of senfence, counting the frequency of each word, and dividing the counts by the total number of words satisfying the conditions. However, this estimate, which is known as the maximum likelihood estimate (MLE), has a number of well-known problems. In particular, it will assign zero probability to words that do not happen to appear in the sample, which is obviously unacceptable.</Paragraph>
      <Paragraph position="1"> We will estimate Pr(toklsense ) by interpolating between local probabilities, probabilities computed over the 100-word contexts, and global probabilities, probabilities computed over the entire corpus. There is a trade-off between measurement error and bias error. The local probabilities tend to be more prone to measurement error whereas the global probabilities tend to be more prone to bias error. We seek to determine the relevance of the larger corpus to the conditional sample in order to find the optimal trade-off between bias and measurement error. null The interpolation procedure makes use of a prior expectation of how much we expect the local probabilities to differ from the global probabilities. In their author identification work Mosteller and Wallace &amp;quot;expect\[ed\] both authors to have nearly identical rates for almost any word&amp;quot; (p. 61). In fact, just as they had anticipated, we have found that only 2% of the vocabulary in the Federalist corpus has significantly different probabilities depending on the author. In contrast, we expect fairly large differences in the sense disambiguation application.</Paragraph>
      <Paragraph position="2"> Approximately 20% of the vocabulary in the Hansards has a local probability that is significantly different from its global probability. Since the prior expectation depends so much on the application, we set the prior for a particular application by estimating the fraction of the vocabulary whose local probabilities differ significantly from the global probabilities.</Paragraph>
      <Paragraph position="3"> 2.4. A Small Study We have looked at six polysemous nouns in some detail: duty, drug, land, language, position and sentence, as shown in Table 1. The final column shows that performance is quite encouraging.</Paragraph>
      <Paragraph position="4">  These nouns were selected because they could be disambiguated by looking at their French translation in the Canadian Hansards (unlike a polysemous word such as interest whose French translation inter~1 is just as ambiguous as the English source). In addition, for testing methodology, it is helpful that the corpus contain plenty of instances of each sense. The second condition, for example, would exclude bank, perhaps the canonical example of a polysemous noun, since there are very few instances of the &amp;quot;river&amp;quot; sense of bank in our corpus of Canadian Hansards. We were somewhat surprised to discover that these two conditions are actually fairly stringent, and that there are a remarkably small number of polysemous words which (1) can be disambiguated by looking at the French translation, and (2) appear 150 or more times in two or more senses.</Paragraph>
    </Section>
    <Section position="3" start_page="234" end_page="234" type="sub_section">
      <SectionTitle>
Input → Output
</SectionTitle>
      <Paragraph position="0"> Treadmills attached to cranes were used to lift heavy objects TOOLS/MACHINERY (348) and for supplying power for cranes, hoists, and lifts. The centrifug TOOLS/MACHINERY (348) Above this height, a tower crane is often used. This comprises TOOLS/MACHINERY (348) elaborate courtship rituals cranes build a nest of vegetation on ANIMALS/INSECTS (414) are more closely related to cranes and rails. They range in length ANIMALS/INSECTS (414) low trees..PP At least five crane species are in danger of extincti ANIMALS/INSECTS (414) 2.5. Training on Monolingual Material At first, we thought that the method was completely dependent on the availability of parallel corpora for training. This has been a problem since parallel text remains somewhat difficult to obtain in large quantity, and what little is available is often fairly unbalanced and unrepresentative of general language. Moreover, the assumption that differences in translation correspond to differences in word-sense has always been somewhat suspect.</Paragraph>
      <Paragraph position="1"> Recently, Yarowsky (1991) has found a way to train on the Roget's Thesaurus (Chapman, 1977) and Grolier's Encyclopedia (1991) instead of the Hansards, thus circumventing many of the objections to our use of the Hansards. Yarowsky's method inputs a 100-word context surrounding a polysemous word and scores each of  Table 2 shows some results for the polysemous noun crane.</Paragraph>
      <Paragraph position="2"> Each of the 1042 models, Pr(wlRoget Categoryl), is trained by interpolating between local probabilities and global probabilities just as before. However, the local probabilities are somewhat more difficult to obtain in this case since we do not have a corpus tagged with Roget Categories, and therefore, it may not be obvious how to extract subsections of the corpus meeting the local conditions. Consider the Roget Category: TOOLS/MACHINERY (348). Ideally, we would extract 100-word contexts in the 10 million word Grolier Encyclopedia surrounding words in category 348 and use these to compute the local probabilities. Since we don't have a tagged corpus, Yarowsky suggested extracting contexts around all words in category 348 and weighting appropriately in order to compensate for the fact that some of these contexts should not have been included in the training set. Table 3 below shows a sample of the 30,924 concordances for the words in category 348.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="234" end_page="234" type="metho">
    <SectionTitle>
CARVING .SB The gutter
</SectionTitle>
    <Paragraph position="0"> equipment such as a hydraulic</Paragraph>
    <Section position="1" start_page="234" end_page="234" type="sub_section">
      <SectionTitle>
Resembling a power
</SectionTitle>
      <Paragraph position="0"> adz has a concave blade for form shovel capable of lifting 26 cubic shovel mounted on a floating hull, generators, oil-refinery turbines, sickles were used to gather wild drills forced manufacturers to find Drills live in the forests of drill were unchanged, and crane is an assembly of fabricated crane, however, occasionally nests Note that some of the words in category 348 are polysemous (e.g., drill and crane), and consequently, not all of their contexts should be included in the training set for category 348. In particular, lines 7, 8 and 10 in Table 3 illustrate the problem. If one of these spurious senses was frequent and dominated the set of examples, the situation could be disastrous. An attempt is made to weight the concordance data to minimize this effect and to make the sample representative of all tools and machinery, not just the more common ones. If a word such as drill occurs k times in the corpus, all words in the context of drill contribute weight 1/k to frequency sums. Although the training materials still contain a substantial level of noise, we have found that the resulting models work remarkably well, nontheless. Yarowsky (1991) reports 93% correct disambiguation, averaged over the following words selected from the word-sense disambiguation literature: bow, bass, galley, mole, sentence, slug, star, duty, issue, taste, cone, interest.</Paragraph>
    </Section>
  </Section>
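A minimal sketch of the 1/k weighting described above, assuming the (keyword, context) pairs produced by the extraction sketch earlier and a precomputed corpus frequency table (`corpus_keyword_counts`, an assumed input):

```python
from collections import Counter

def weighted_word_frequencies(contexts, corpus_keyword_counts):
    """Accumulate weighted frequency sums for a category's training contexts.

    Each context was collected around some keyword; if that keyword occurs
    k times in the corpus, every word in its contexts contributes 1/k, so
    frequent (and possibly polysemous) members like "drill" cannot
    dominate the category model.
    """
    sums = Counter()
    for keyword, context_words in contexts:
        k = corpus_keyword_counts[keyword]
        for w in context_words:
            sums[w] += 1.0 / k
    return sums
```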
  <Section position="4" start_page="234" end_page="235" type="metho">
    <SectionTitle>
3. A HYPOTHESIS: ONE SENSE
PER DISCOURSE
</SectionTitle>
    <Paragraph position="0"> As this work was nearing completion, we observed that senses tend to appear in clumps. In particular, it appeared to be extremely unusual to find two or more senses of a polysemous word in the same discourse. 2 A simple (but non-blind) preliminary experiment provided some suggestive evidence confirming the hypothesis. A random sample of 108 nouns was extracted for further  study. A panel of three judges (the authors of this paper) were given 100 sets of concordance lines. Each set showed all of the instances of one of the test words in a particular Grolier's article. The judges were asked to indicate if the set of concordance lines used the same sense or not. Only 6 of 300 w,-ticle-judgements were judged to contain multiple senses of one of the test words. All three judges were convinced after grading 100 articles that there was considerable validity to the hypothesis.</Paragraph>
    <Paragraph position="1"> With this promising preliminary verification, the following blind test was devised. Five subjects (the three authors and two of their colleagues) were given a questionnaire starting with a set of definitions selected from OALD (Crowie et al., 1989) and followed by a number of pairs of concordance lines, randomly selected from Grolier's Encyclopedia (1991). The subjects were asked to decide for each pair, whether the two concordance lines corresponded to the same sense or not.</Paragraph>
    <Paragraph position="2"> antenna 1. jointed organ found in pairs on the heads of insects and crustaceans, used for feeling, etc. ~ the illus at insect.</Paragraph>
    <Paragraph position="3"> 2. radio or TV aerial.</Paragraph>
    <Paragraph position="4"> lack eyes, legs, wings, antennae and distinct mouthparts The Brachycera have short antennae and include the more evol The questionnaire contained a total of 82 pairs of concordance lines for 9 polysemous words: antenna, campaign, deposit, drum, hull, interior, knife, landscape, and marine. 54 of the 82 pairs were selected from the same discourse. The remaining 28 pairs were introduced as a control to force the judges to say that some pairs were different; they were selected from different discourses, and were checked by hand as an attempt to assure that they did not happen to use the same sense. The judges found it quite easy to decide whether the pair used the same sense or not. Table 4 shows that there was very high agreement among the judges. With the exception of judge 2, all of the judges agreed with the majority opinion in all but one or two of the 82 cases. The agreement rate was 96.8%, averaged over all judges, or 99.1%, averaged over the four best judges.</Paragraph>
    <Paragraph position="5">  As we had hoped, the experiment did, in fact, confirm the one-sense-per-discourse hypothesis. Of 54 pairs selected from the same article, the majority opinion found that 51 shared the same sense, and 3 did not. ~ We conclude that with probability about 94% (51/54), two polysemous nouns drawn from the same article will have the same sense. In fact, the experiment tested a particularly difficult case, since it did not include any unambiguous words. If we assume a mixture of 60% un-ambiguous words and 40% polysemous words, then the probability moves from 94% to 100% x .60 + 94% x .40 98%. In other words, there is a very strong tendency (98%) for multiple uses of a word to share the same sense in well-written coherent discourse.</Paragraph>
    <Paragraph position="6"> One might ask if this result is specific to Grolier's or to good writing or some other factor. The first author looked at the usage of these same nine words in the Brown Corpus, which is believed to be a more balanced sample of general language and which is also more widely available than Grolier's and is therefore more amenable to replication. The Brown Corpus consists of 500 discourse fragments of 2000 words, each. We were able to find 259 concordance lines like the ones above, showing two instances of one of the nine test words selected from the same discourse fragment. However, four of the nine test words are not very interesting in the Brown Corpus antenna, drum, hull, and knife, since only one sense is observed. There were 106 pairs for the remaining five words: campaign, deposit, interior, landscape, and marine. The first author found that 102 of the 106 pairs were used in the same sense. Thus, it appears that one-sense-per-discourse tendency is also fairly strong in the Brown Corpus (102/106 ~ 96%), as well as in the Grolier's Encyclopedia.</Paragraph>
  </Section>
  <Section position="5" start_page="235" end_page="236" type="metho">
    <SectionTitle>
4. IMPLICATIONS
</SectionTitle>
    <Paragraph position="0"> There seem to be two applications for the one-sense-per-discourse observation: first it can be used as an additional source of constraint for improving the performance of the word-sense disambiguation algorithm, and 3In contrast, of the 28 control pairs, the majority opinion found that only 1 share the same sense, and 27 did not.</Paragraph>
    <Paragraph position="1">  secondly, it could be used to help evaluate disambiguation algorithms that did not make use of the discourse constraint. Thus far, we have been more interested in the second use: establishing a group of examples for which we had an approximate ground truth. Rather than tagging each instance of a polysemous word one-by-one, we can select discourses with large numbers of the polysemous word of interest and tag all of the instances in one fell swoop. Admittedly, this procedure will introduce a small error rate since the one-sense-per-discourse tendency is not quite perfect, but thus far, the error rate has not been much of a problem. This procedure has enabled us to tag a much larger test set than we would have been able to do otherwise.</Paragraph>
    <Paragraph position="2"> Having tagged as many words as we have (all instances of 97 words in the Grolier's Encyclopedia), we are now in a position to question some widely held assumptions about the distribution of polysemy. In particular, it is commonly believed that most words are highly polysemous, but in fact, most words (both by token and by type) have only one sense, as indicated in Table 5 below.</Paragraph>
    <Paragraph position="3"> Even for those words that do have more than one possible sense, it rarely takes anywhere near log2senses bits to select the appropriate sense since the distribution of senses is generally quite skewed. Perhaps the word-sense disambiguation problem is not as difficult as we might have thought.</Paragraph>
  </Section>
class="xml-element"></Paper>