<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1177"> <Title>Automatic Identification of Infrequent Word Senses</Title> <Section position="5" start_page="0" end_page="0" type="relat"> <SectionTitle> 5 Related Work </SectionTitle> <Paragraph position="0"> WordNet is an extensive resource: as new versions are created, new senses are added, but for backwards compatibility previous senses are not deleted. For many NLP applications, word sense ambiguity is a significant problem. One way to cope with the large number of senses for a word is to work at a coarser granularity, so that related senses are grouped together. Useful work is being done to cluster WordNet senses automatically (Agirre and Lopez de Lacalle, 2003). Pantel and Lin (2002) work with automatically constructed thesauruses and identify senses directly from the nearest neighbours, where the granularity depends on the parameters of the clustering process. In contrast, we use the nearest neighbours to indicate the frequency of the senses of the target word, drawing on semantic similarity between the neighbours and the word senses listed in WordNet.</Paragraph> <Paragraph position="1"> We do so here in order to identify the senses of the word which are rare in corpus data.</Paragraph> <Paragraph position="2"> Lapata and Brew (2004) have recently used syntactic evidence to produce a prior distribution for verb senses and incorporate this in a WSD system.</Paragraph> <Paragraph position="3"> The work presented here focusses on using a prevalence ranking for word senses to identify and remove rare senses from a generic resource such as WordNet. We believe that this method will be useful for systems using such a resource that can incorporate prior distributions over word senses, or that wish to identify and remove rare word senses. 
Systems requiring sense frequency distributions currently rely on available hand-tagged training data, and for WordNet the most extensive all-words resource is SemCor. Whilst SemCor is extremely useful, it comprises only 250,000 words, taken from a subset of the Brown corpus and a novel. Because of its size, and the Zipfian distribution of words, there are many words which do not occur in this resource at all, for example embryo, fridge, pancake and wheelbarrow, and many words which occur only once or twice.</Paragraph> <Paragraph position="4"> Our method, using raw text, permits us to obtain a sense ranking for any word in our corpus, subject to the constraint that the word occurs often enough in the corpus. Given the increasing amount of data on the web, this constraint is not likely to be problematic. Another major benefit of the work here, rather than reliance on hand-tagged training data such as SemCor, is that this method permits us to produce a ranking for the domain and text type required.</Paragraph> <Paragraph position="5"> The sense distributions of many words depend on the domain, and filtering out senses that are rare in a specific domain permits a generic resource such as WordNet to be tailored to that domain. Buitelaar and Sacaleanu (2001) previously explored ranking and selection of synsets in GermaNet for specific domains, using the words in a given synset, those related to it by hyponymy, and a term relevance measure taken from information retrieval. They evaluated their method on identifying domain-specific concepts using human judgements on 100 items.</Paragraph> <Paragraph position="6"> Magnini and Cavaglià (2000) have identified WordNet word senses with particular domains, and this has proven useful for high-precision WSD (Magnini et al., 2001); indeed, in section 4 we used these domain labels to evaluate our automatic filtering of senses using domain-specific corpora. 
Identification of these domain labels for word senses was semi-automatic and required a considerable amount of hand-labelling. Our approach is complementary to this: it provides a ranking of the senses of a word for a given domain so that manual work is not necessary, and it can therefore easily be applied to a new domain, or sense inventory, given sufficient text.</Paragraph> </Section></Paper>