<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1036"> <Title>Finding Predominant Word Senses in Untagged Text</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Method </SectionTitle> <Paragraph position="0"> In order to find the predominant sense of a target word we use a thesaurus acquired from automatically parsed text, based on the method of Lin (1998). This provides the k nearest neighbours to each target word, along with the distributional similarity score between the target word and each neighbour. We then use the WordNet Similarity package (Patwardhan and Pedersen, 2003) to give us a semantic similarity measure (hereafter referred to as the WordNet similarity measure) to weight the contribution that each neighbour makes to the various senses of the target word.</Paragraph> <Paragraph position="1"> To find the first sense of a word (w) we take each sense in turn and obtain a score reflecting its prevalence, which is used for ranking. Let N_w = {n_1, n_2, ..., n_k} be the ordered set of the top-scoring k neighbours of w from the thesaurus, with associated distributional similarity scores {dss(w, n_1), dss(w, n_2), ..., dss(w, n_k)}. Let senses(w) be the set of senses of w. For each sense of w (ws_i ∈ senses(w)) we obtain a ranking score by summing over the dss(w, n_j) of each neighbour (n_j ∈ N_w) multiplied by a weight. 
This weight is the WordNet similarity score (wnss) between the target sense (ws_i) and the sense of n_j (ns_x ∈ senses(n_j)) that maximises this score, divided by the sum of all such WordNet similarity scores between senses(w) and n_j. Thus we rank each sense ws_i ∈ senses(w) using:</Paragraph> <Paragraph position="3"> Prevalence Score(w, ws_i) = Σ_{n_j ∈ N_w} [ dss(w, n_j) × wnss(ws_i, n_j) / Σ_{ws_i' ∈ senses(w)} wnss(ws_i', n_j) ] (1), where wnss(ws_i, n_j) = max_{ns_x ∈ senses(n_j)} wnss(ws_i, ns_x).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Acquiring the Automatic Thesaurus </SectionTitle> <Paragraph position="0"> The thesaurus was acquired using the method described by Lin (1998). For input we used grammatical relation data extracted using an automatic parser (Briscoe and Carroll, 2002). For the experiments in sections 3 and 4 we used the 90 million words of written English from the BNC. For each noun we considered the co-occurring verbs in the direct object and subject relations, the modifying nouns in noun-noun relations and the modifying adjectives in adjective-noun relations. We could easily extend the set of relations in the future. A noun, w, is thus described by a set of co-occurrence triples ⟨w, r, x⟩, where r is a grammatical relation and x is a possible co-occurrence with w in that relation. For every pair of nouns, where each noun had a total frequency in the triple data of 10 or more, we computed their distributional similarity using the measure given by Lin (1998). 
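To make the ranking in equation (1) concrete, here is a minimal Python sketch; the neighbour list, the sense inventory, and all similarity numbers below are invented stand-ins for real thesaurus (dss) and WordNet similarity (wnss) values.

```python
def prevalence_scores(target_senses, neighbours, pairwise_wnss, senses_of):
    """Rank the senses of a target word w by the prevalence score:
    sum over neighbours n_j of dss(w, n_j) * wnss(ws_i, n_j),
    normalised by the sum of wnss over all senses of w."""
    def wnss(ws, n):
        # Best WordNet similarity between sense ws and any sense of neighbour n.
        return max(pairwise_wnss(ws, ns) for ns in senses_of(n))

    scores = {}
    for ws in target_senses:
        score = 0.0
        for n, dss in neighbours:          # dss(w, n_j) from the thesaurus
            norm = sum(wnss(other, n) for other in target_senses)
            if norm > 0:
                score += dss * wnss(ws, n) / norm
        scores[ws] = score
    return scores

# Toy example (all numbers hypothetical): two senses of "plant",
# two thesaurus neighbours with their dss scores.
pair_sim = {("plant#factory", "refinery#1"): 0.9,
            ("plant#flora", "refinery#1"): 0.1,
            ("plant#factory", "tree#1"): 0.1,
            ("plant#flora", "tree#1"): 0.9}
inventory = {"refinery": ["refinery#1"], "tree": ["tree#1"]}
scores = prevalence_scores(["plant#factory", "plant#flora"],
                           [("refinery", 0.8), ("tree", 0.6)],
                           lambda a, b: pair_sim[(a, b)],
                           inventory.get)
```

With these toy numbers the factory sense outscores the flora sense, so it would be selected as the predominant sense.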
If T(w) is the set of co-occurrence types ⟨r, x⟩ such that I(w, r, x) is positive, then the similarity between two nouns, w and n, can be computed as:</Paragraph> <Paragraph position="2"> dss(w, n) = Σ_{⟨r,x⟩ ∈ T(w) ∩ T(n)} (I(w, r, x) + I(n, r, x)) / (Σ_{⟨r,x⟩ ∈ T(w)} I(w, r, x) + Σ_{⟨r,x⟩ ∈ T(n)} I(n, r, x))</Paragraph> <Paragraph position="4"> A thesaurus entry of size k for a target noun w is then defined as the k most similar nouns to w.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 The WordNet Similarity Package </SectionTitle> <Paragraph position="0"> We use the WordNet Similarity Package 0.05 and WordNet version 1.6 (see footnote 2). The WordNet Similarity package supports a range of WordNet similarity scores. We experimented with six of these to provide the wnss in equation 1 above and obtained results well over our baseline, but because of space limitations give results for the two which performed best. We briefly summarise the two measures here; for a more detailed summary see Patwardhan et al. (2003). The measures provide a similarity score between two WordNet senses (s1 and s2), these being synsets within WordNet. (Footnote 2: We use this version of WordNet since it allows us to map information to WordNets of other languages more accurately. We are of course able to apply the method to other versions of WordNet.)</Paragraph> <Paragraph position="1"> lesk (Banerjee and Pedersen, 2002) This score maximises the number of overlapping words in the gloss, or definition, of the senses. It also uses the glosses of semantically related (according to WordNet) senses.</Paragraph> <Paragraph position="2"> jcn (Jiang and Conrath, 1997) This score uses corpus data to populate classes (synsets) in the WordNet hierarchy with frequency counts.</Paragraph> <Paragraph position="3"> Each synset is incremented with the frequency counts from the corpus of all words belonging to that synset, directly or via the hyponymy relation. 
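Lin's distributional similarity measure from section 2.1 above can be sketched in Python as follows; the mutual-information values here are hypothetical stand-ins for I(w, r, x) computed from parsed corpus data.

```python
def lin_dss(I_w, I_n):
    """Lin (1998) distributional similarity between nouns w and n.
    I_w and I_n map co-occurrence types (r, x) to positive mutual
    information values I(word, r, x)."""
    shared = set(I_w) & set(I_n)                     # T(w) intersected with T(n)
    numer = sum(I_w[t] + I_n[t] for t in shared)
    denom = sum(I_w.values()) + sum(I_n.values())    # sums over T(w) and T(n)
    return numer / denom if denom else 0.0

# Invented mutual-information values for two nouns:
I_plant = {("dobj", "water"): 2.0, ("nn", "pot"): 1.0}
I_tree = {("dobj", "water"): 1.5, ("amod", "tall"): 0.5}
sim = lin_dss(I_plant, I_tree)   # only the ("dobj", "water") type is shared
```

A noun compared with itself scores 1.0, the maximum of the measure, and two nouns with no shared co-occurrence types score 0.0.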
The frequency data are used to calculate the &quot;information content&quot; (IC) of a class, IC(s) = -log(p(s)). Jiang and Conrath specify a distance measure:</Paragraph> <Paragraph position="5"> D_jcn(s1, s2) = IC(s1) + IC(s2) - 2 × IC(s3), where the third class (s3) is the most informative, or most specific, superordinate synset of the two senses s1 and s2. This is transformed from a distance measure to a similarity score in the WN-Similarity package by taking the reciprocal:</Paragraph> <Paragraph position="7"> jcn(s1, s2) = 1 / D_jcn(s1, s2).</Paragraph> </Section> </Section> </Paper>