<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0322">
<Title>Distinguishing Word Senses in Untagged Text</Title>
<Section position="9" start_page="203" end_page="204" type="relat">
<SectionTitle> 7 Related Work </SectionTitle>
<Paragraph position="0"> Word-sense disambiguation has more commonly been cast as a problem in supervised learning (e.g., (Black, 1988), (Yarowsky, 1992), (Yarowsky, 1993), (Leacock, Towell, and Voorhees, 1993), (Bruce and Wiebe, 1994), (Mooney, 1996), (Ng and Lee, 1996), (Pedersen, Bruce, and Wiebe, 1997), (Pedersen and Bruce, 1997a)). However, all of these methods require that manually sense-tagged text be available to train the algorithm. For most domains such text is not available and is expensive to create. It seems more reasonable to assume that such text will not usually be available and to pursue unsupervised approaches that rely only on those features of a text that can be identified automatically.</Paragraph>
<Section position="1" start_page="203" end_page="204" type="sub_section">
<SectionTitle> 7.1 Bootstrapping </SectionTitle>
<Paragraph position="0"> Bootstrapping approaches require a small amount of disambiguated text in order to initialize the unsupervised learning algorithm. An early example of such an approach is described in (Hearst, 1991). A supervised learning algorithm is trained with a small amount of manually sense-tagged text and applied to a held-out test set. Those examples in the test set that are most confidently disambiguated are added to the training sample.</Paragraph>
<Paragraph position="1"> A more recent bootstrapping approach is described in (Yarowsky, 1995). This algorithm requires a small number of training examples to serve as a seed. A variety of options for automatically selecting seeds are discussed; one is to identify collocations that uniquely distinguish between senses. For plant, the collocations manufacturing plant and living plant make such a distinction. Based on 106 examples of manufacturing plant and 82 examples of living plant, this algorithm is able to distinguish between two senses of plant for 7,350 examples with 97 percent accuracy. Experiments with 11 other words using collocation seeds result in an average accuracy of 96 percent.</Paragraph>
<Paragraph position="2"> While (Yarowsky, 1995) does not discuss distinguishing more than two senses of a word, there is no immediate reason to doubt that the "one sense per collocation" rule (Yarowsky, 1993) would still hold for a larger number of senses. In future work we will evaluate using the "one sense per collocation" rule to seed our various methods. This may help in dealing with very skewed distributions of senses, since we currently select collocations based simply on frequency.</Paragraph>
</Section>
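To make the bootstrapping procedure concrete, the following is a minimal sketch of collocation-seeded bootstrapping with a decision list, in the spirit of (Yarowsky, 1995). It is not the published implementation: the add-alpha smoothing, the log-likelihood threshold, the fixed iteration count, and all function names are illustrative assumptions.

```python
# Minimal sketch of collocation-seeded bootstrapping with a decision list,
# in the spirit of (Yarowsky, 1995). The smoothing constant, threshold,
# iteration count, and names are assumptions, not details from the paper.
import math
from collections import defaultdict

def build_decision_list(labeled, alpha=0.1):
    """labeled: list of (context_words, sense) pairs.
    Returns rules (word, sense, log_likelihood_ratio) sorted by strength."""
    counts = defaultdict(lambda: defaultdict(float))
    for words, sense in labeled:
        for w in set(words):
            counts[w][sense] += 1.0
    rules = []
    for w, by_sense in counts.items():
        best = max(by_sense, key=by_sense.get)
        rest = sum(c for s, c in by_sense.items() if s != best) + alpha
        rules.append((w, best, math.log((by_sense[best] + alpha) / rest)))
    return sorted(rules, key=lambda r: r[2], reverse=True)

def apply_list(rules, words, min_llr=2.0):
    """Return the sense of the strongest matching rule, or None if none fires."""
    for w, sense, llr in rules:
        if llr >= min_llr and w in words:
            return sense
    return None

def bootstrap(seed_rules, contexts, iterations=10):
    """seed_rules: e.g. {'manufacturing': 'industrial', 'living': 'botanical'}.
    contexts: list of context-word lists around the ambiguous word."""
    # Label only the occurrences that contain a seed collocation.
    labeled = [(ctx, seed_rules[w]) for ctx in contexts
               for w in ctx if w in seed_rules]
    rules = []
    for _ in range(iterations):
        rules = build_decision_list(labeled)
        # Keep only occurrences the current list classifies confidently.
        labeled = [(ctx, sense) for ctx in contexts
                   if (sense := apply_list(rules, ctx)) is not None]
    return rules
```

As in (Hearst, 1991), the examples the current model classifies most confidently are folded back into the training data on each pass.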
<Section position="2" start_page="204" end_page="204" type="sub_section">
<SectionTitle> 7.2 Clustering </SectionTitle>
<Paragraph position="0"> Clustering has most often been applied in natural language processing as a method for inducing syntactically or semantically related groupings of words (e.g., (Rosenfeld, Huang, and Schneider, 1969), (Kiss, 1973), (Ritter and Kohonen, 1989), (Pereira, Tishby, and Lee, 1993), (Schütze, 1993), (Resnik, 1995a)).</Paragraph>
<Paragraph position="1"> An early application of clustering to word-sense disambiguation is described in (Schütze, 1992).</Paragraph>
<Paragraph position="2"> There, words are represented in terms of the co-occurrence statistics of four-letter sequences. This representation uses 97 features to characterize a word, where each feature is a linear combination of letter four-grams formulated by a singular value decomposition of a 5000 by 5000 matrix of letter four-gram co-occurrence frequencies. The weight associated with each feature reflects all usages of the word in the sample. A context vector is formed for each occurrence of an ambiguous word by summing the vectors of the contextual words (the number of contextual words considered in the sum is unspecified). The set of context vectors for the word to be disambiguated is then clustered, and the clusters are manually sense tagged.</Paragraph>
<Paragraph position="3"> The features used in this work are complex and difficult to interpret, and it is not clear that this complexity is required. (Yarowsky, 1995) compares his method to (Schütze, 1992) and shows that for four words the former performs significantly better in distinguishing between two senses.</Paragraph>
<Paragraph position="4"> Other clustering approaches to word-sense disambiguation have been based on measures of semantic distance defined with respect to a semantic network such as WordNet. Measures of semantic distance are based on the path length between concepts in the network and are used to group semantically similar concepts (e.g., (Li, Szpakowicz, and Matwin, 1995)).</Paragraph>
<Paragraph position="5"> (Resnik, 1995b) provides an information-theoretic definition of semantic distance based on WordNet.</Paragraph>
<Paragraph position="6"> (McDonald et al., 1990) apply another clustering approach to word-sense disambiguation (also see (Wilks et al., 1990)). They use co-occurrence data gathered from the machine-readable version of LDOCE to define neighborhoods of related words.</Paragraph>
<Paragraph position="7"> Conceptually, the neighborhood of a word is a type of equivalence class. It is composed of all other words that co-occur with the designated word a significant number of times in the LDOCE sense definitions.</Paragraph>
<Paragraph position="8"> These neighborhoods are used to increase the number of words in the LDOCE sense definitions, while still maintaining some measure of lexical cohesion.</Paragraph>
<Paragraph position="9"> The "expanded" sense definitions are then compared to the context of an ambiguous word, and the sense definition with the greatest number of word overlaps with the context is selected as correct. (Guthrie et al., 1991) propose that neighborhoods be subject dependent. They suggest that a word should potentially have different neighborhoods corresponding to the different LDOCE subject codes. Subject-specific neighborhoods are composed of words having at least one sense marked with that subject code.</Paragraph>
</Section>
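As a rough illustration of the neighborhood-expansion idea just described, the sketch below expands LDOCE sense definitions with precomputed neighborhoods and scores each sense by raw word overlap with the context. The data structures, the raw-overlap scoring rule, and all names are assumptions made for illustration; the cited work is not specified at this level of detail here.

```python
# Rough sketch of sense selection with neighborhood-expanded LDOCE sense
# definitions, following the description of (McDonald et al., 1990) above.
# Neighborhoods are assumed to be precomputed from LDOCE co-occurrence
# counts; the scoring rule and names are illustrative assumptions.

def expand_definition(definition_words, neighborhoods):
    """Add each definition word's neighborhood (words that co-occur with it
    significantly often in LDOCE sense definitions) to the definition."""
    expanded = set(definition_words)
    for w in definition_words:
        expanded.update(neighborhoods.get(w, ()))
    return expanded

def choose_sense(context_words, sense_definitions, neighborhoods):
    """sense_definitions maps a sense id to its LDOCE definition words.
    Pick the sense whose expanded definition overlaps the context most."""
    context = set(context_words)
    def overlap(sense):
        expanded = expand_definition(sense_definitions[sense], neighborhoods)
        return len(context & expanded)
    return max(sense_definitions, key=overlap)
```

The subject-dependent variant of (Guthrie et al., 1991) could be accommodated in this sketch by passing a neighborhoods dictionary restricted to words with a sense marked with the relevant LDOCE subject code.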
<Section position="3" start_page="204" end_page="204" type="sub_section">
<SectionTitle> 7.3 EM algorithm </SectionTitle>
<Paragraph position="0"> The only other application of the EM algorithm to word-sense disambiguation is described in (Gale, Church, and Yarowsky, 1995). There the EM algorithm is used as part of a supervised learning algorithm to distinguish city names from people's names.</Paragraph>
<Paragraph position="1"> A narrow window of context, one or two words to either side, was found to perform better than wider windows. The results presented are preliminary but show accuracy percentages in the mid-nineties when the method is applied to Dixon, a name found to be quite ambiguous.</Paragraph>
<Paragraph position="2"> It should be noted that the EM algorithm relates to a large body of work in speech processing. The Baum-Welch forward-backward algorithm (Baum, 1972) is a specialized form of the EM algorithm that assumes the underlying parametric model is a hidden Markov model. The Baum-Welch forward-backward algorithm has been used extensively in speech recognition (e.g., (Levinson, Rabiner, and Sondhi, 1983), (Kupiec, 1992), (Jelinek, 1990)).</Paragraph>
</Section>
</Section>
</Paper>