<?xml version="1.0" standalone="yes"?> <Paper uid="P92-1032"> <Title>Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs</Title> <Section position="4" start_page="254" end_page="255" type="metho"> <SectionTitle> 6. A Discrimination Experiment </SectionTitle> <Paragraph position="0"> For evaluation purposes, it is important to find a task that is somewhat easier for the judges. If the task is too hard (as Jorgensen's classification task may be), then there will be almost no room between the limits of the measurement and the baseline. In other words, there won't be enough dynamic range to measure differences between better systems and worse systems. In contrast, if we focus on easier tasks, then we might have enough dynamic range to show some interesting differences.</Paragraph> <Paragraph position="1"> Therefore, unlike Jorgensen, who was interested in highlighting differences among judgments, we are much more interested in highlighting agreements. Fortunately, we have found in (Gale et al., 1992) that the agreement rate can be very high (96.8%), which is well above the baseline, under very different experimental conditions.</Paragraph> <Paragraph position="2"> Of course, it is a fairly major step to redefine the problem from a classification task to a discrimination one, as we are proposing. One might have preferred not to do so, but we simply don't know how one could establish enough dynamic range in that case to show any interesting differences. It has been our experience that it is very hard to design an experiment of any kind which will produce the desired agreement among judges. 
We are very happy with the 96.8% agreement that we were able to show, even if it is limited to a much easier task than the one that Jorgensen was interested in.</Paragraph> <Paragraph position="3"> We originally designed the experiment in Gale et al.</Paragraph> <Paragraph position="4"> (1992) to test the hypothesis that multiple uses of a polysemous word tend to have the same sense within a common discourse. A simple (but non-blind) pilot experiment provided some suggestive evidence confirming the hypothesis. A random sample of 108 nouns (which included the 97 words previously mentioned) was extracted for further study. A panel of three judges (the three authors of this paper) were given 100 sets of concordance lines containing one of the test words selected from a single article in Grolier's. The judges were asked to indicate if the set of concordance lines used the same sense or not. Only 6 of 300 article judgements were judged to contain multiple senses of one of the test words. All three judges were convinced after grading 100 articles that there was considerable validity to the hypothesis.</Paragraph> <Paragraph position="5"> With this promising preliminary verification, the following blind test was devised. Five subjects (the three authors and two of their colleagues) were given a questionnaire starting with a set of definitions selected from OALD (Crowie et al., 1989) and followed by a number of pairs of concordance lines, randomly selected from Grolier's Encyclopedia (1991). The subjects were asked to decide, for each pair, whether the two concordance lines corresponded to the same sense or not.</Paragraph> <Paragraph position="6"> antenna 1. jointed organ found in pairs on the heads of insects and crustaceans, used for feeling, etc. ---> the illus at insect.</Paragraph> <Paragraph position="7"> 2. 
radio or TV aerial.</Paragraph> <Paragraph position="8"> "lack eyes, legs, wings, antennae, and distinct mouthparts and"; "The Brachycera have short antennae and include the more evolved"; "silk moths passes over the antennae .SB Only males that detect"; "relatively simple form of antenna is the dipole, or doublet". The questionnaire contained a total of 82 pairs of concordance lines for 9 polysemous words: antenna, campaign, deposit, drum, hull, interior, knife, landscape, and marine. The results of the experiment are shown below in Table 3. With the exception of judge 2, all of the judges agreed with the majority opinion in all but one or two of the 82 cases. The agreement rate was 96.8%, averaged over all judges, or 99.1%, averaged over the four best judges. In either case, the agreement rate is well above the previously described ceiling.</Paragraph> <Paragraph position="9"> Incidentally, the experiment did, in fact, confirm the hypothesis that multiple uses of a polysemous word will generally take on the same sense within a discourse. Of the 82 judgments, 54 were selected from the same discourse and were judged to have the same sense by the majority in 96.9% of the cases. (The remaining 28 of the 82 judgments were used as a control to force the judges to say that some pairs were different.) Note that the tendency for multiple uses of a polysemous word to have the same sense is extremely strong; 96.9% is much greater than the baseline, and indeed, it is considerably above the level of performance that might be expected from state-of-the-art word-sense disambiguation systems. Since it is so reliable and so easy to compute, it might be used as a quick-and-dirty measure for testing such systems. 
Unfortunately, we also need a complementary measure that would penalize a system like the baseline system that simply assigned all instances of a polysemous word to the same sense.</Paragraph> <Paragraph position="10"> At present, we have yet to identify a quick-and-dirty measure that accomplishes this control, and consequently, we are forced to continue to depend on the relatively expensive panel of judges. But, at least, we have been able to establish that it is possible to design a discrimination experiment such that the judges on the panel agree among themselves often enough to be useful. In addition, we have established that the discourse constraint on polysemy is extremely strong, much stronger than our ability to tag word-senses automatically. Consequently, it ought to be possible to use this constraint in our next word-sense tagging algorithm to produce even better performance.</Paragraph> </Section> </Paper>