<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1016"> <Title>Determining Word Sense Dominance Using a Thesaurus</Title> <Section position="3" start_page="121" end_page="122" type="intro"> <SectionTitle> 2 Thesauri </SectionTitle> <Paragraph position="0"> Published thesauri, such as Roget's and Macquarie, divide the English vocabulary into around a thousand categories. Each category has a list of semantically related words, which we will call category terms or c-terms for short. Words with multiple meanings may be listed in more than one category. For every word type in the vocabulary of the thesaurus, the index lists the categories that include it as a c-term. Categories roughly correspond to coarse senses of a word (Yarowsky, 1992), and the two terms will be used interchangeably. For example, in the Macquarie Thesaurus, bark is a c-term in the categories 'animal noises' and 'membrane'. These categories represent the coarse senses of bark. Note that published thesauri are structurally quite different from the &quot;thesaurus&quot; automatically generated by Lin (1998), wherein a word has exactly one entry, and its neighbors may be semantically related to it in any of its senses. All future mentions of thesaurus will refer to a published thesaurus.</Paragraph> <Paragraph position="1"> While other sense inventories such as WordNet exist, use of a published thesaurus has three distinct advantages: (i) coarse senses--it is widely believed that the sense distinctions of WordNet are far too fine-grained (Agirre and Lopez de Lacalle Lekuona (2003) and citations therein); (ii) computational ease--with just around a thousand categories, the word-category matrix has a manageable size; (iii) widespread availability--thesauri are available (or can be created with relatively less effort) in numerous languages, while WordNet is available only for English and a few Romance languages. We use the Macquarie Thesaurus (Bernard, 1986) for our experiments. 
It consists of 812 categories with around 176,000 c-terms and 98,000 word types. Note, however, that using a sense inventory other than WordNet means that we cannot directly compare performance with McCarthy et al. (2004), as that would require knowing exactly how thesaurus senses map to WordNet. Further, it has been argued that such a mapping across sense inventories is difficult at best and perhaps impossible (Kilgarriff and Yallop (2001) and citations therein).</Paragraph> </Section> </Paper>