<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1036"> <Title>Finding Predominant Word Senses in Untagged Text</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> Most research in WSD concentrates on using contextual features, typically neighbouring words, to help determine the correct sense of a target word. In contrast, our work is aimed at discovering the predominant senses from raw text, because the first sense heuristic is such a useful one and because hand-tagged data is not always available.</Paragraph> <Paragraph position="1"> A major benefit of our method, compared with relying on hand-tagged training data such as SemCor, is that it permits us to produce predominant senses for the domain and text type required. Buitelaar and Sacaleanu (2001) have previously explored ranking and selection of synsets in GermaNet for specific domains using the words in a given synset, those related by hyponymy, and a term relevance measure taken from information retrieval. Buitelaar and Sacaleanu evaluated their method on identifying domain-specific concepts using human judgements on 100 items. We have evaluated our method using publicly available resources, for both balanced and domain-specific text. Magnini and Cavaglià (2000) have identified WordNet word senses with particular domains, and this has proven useful for high-precision WSD (Magnini et al., 2001); indeed, in section 5 we used these domain labels for evaluation. Identification of these domain labels for word senses was semi-automatic and required a considerable amount of hand-labelling. Our approach is complementary to this.
It requires only raw text from the given domain, and because of this it can easily be applied to a new domain, or sense inventory, given sufficient text.</Paragraph> <Paragraph position="2"> Lapata and Brew (2004) have recently also highlighted the importance of a good prior in WSD. They used syntactic evidence to find a prior distribution for verb classes, based on Levin (1993), and incorporated this in a WSD system. Lapata and Brew obtain their priors for verb classes directly from subcategorisation evidence in a parsed corpus, whereas we use parsed data to find words that are distributionally similar to the target word (its nearest neighbours). These neighbours reflect the different senses of the word, and their associated distributional similarity scores can be used for ranking the senses according to prevalence. There has been some related work on using automatic thesauruses for discovering word senses from corpora (Pantel and Lin, 2002). In this work the lists of neighbours are themselves clustered to bring out the various senses of the word. They evaluate using the lin measure described above in section 2.2 to determine the precision and recall of these discovered classes with respect to WordNet synsets. This method obtains a precision of 61% and a recall of 51%.</Paragraph> <Paragraph position="3"> If WordNet sense distinctions are not ultimately required, then discovering the senses directly from the neighbours list is useful because the sense distinctions discovered are relevant to the corpus data and new senses can be found. In contrast, we use the neighbours lists and WordNet similarity measures to impose a prevalence ranking on the WordNet senses.</Paragraph> <Paragraph position="4"> We believe automatic ranking techniques such as ours will be useful for systems that rely on WordNet, for example those that use it for lexical acquisition or WSD.
It would be useful, however, to combine our method of finding predominant senses with one which can automatically find new senses within text and relate these to WordNet synsets, as Ciaramita and Johnson (2003) do with unknown nouns.</Paragraph> <Paragraph position="5"> We have restricted ourselves to nouns in this work, since this PoS is perhaps most affected by domain. We are currently investigating the performance of the first sense heuristic, and of this method, for other PoS on SENSEVAL-3 data (McCarthy et al., 2004), although not yet with rankings from domain-specific corpora. The lesk measure can be used when ranking adjectives and adverbs as well as nouns and verbs (which can also be ranked using jcn). Another major advantage of lesk is that it is applicable to lexical resources which do not have the hierarchical structure that WordNet does, but do have definitions associated with word senses.</Paragraph> </Section> </Paper>