<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0908">
<Title>Improvements in Automatic Thesaurus Extraction</Title>
<Section position="5" start_page="0" end_page="0" type="intro">
<SectionTitle> 4 Evaluation </SectionTitle>
<Paragraph position="0"> For the purposes of evaluation, we selected 70 single-word noun terms for thesaurus extraction. To avoid sample bias, the words were randomly selected from WordNet such that they covered a range of values for the following word properties: number of senses (WordNet and Macquarie senses); specificity (depth in the WordNet hierarchy); and concreteness (distribution across WordNet subtrees). Table 3 lists some example terms with frequency and frequency rank data from the PTB, BNC and REUTERS, as well as the number of senses in WordNet and Macquarie, and their maximum and minimum depth in the WordNet hierarchy. For each term we extracted a thesaurus entry with 200 potential synonyms and their similarity scores.</Paragraph>
<Paragraph position="1"> The simplest method of evaluation is direct comparison of the extracted thesaurus with a manually-created gold standard (Grefenstette, 1994). However, on small corpora, rare direct matches provide limited information for evaluation, and thesaurus coverage is a problem. Our evaluation uses a combination of three electronic thesauri: the Macquarie (Bernard, 1990), Roget's (Roget, 1911) and Moby (Ward, 1996) thesauri. Roget's and Macquarie are topic ordered and the Moby thesaurus is head ordered. As the extracted thesauri do not distinguish between senses, we transform Roget's and Macquarie into head-ordered format by conflating the sense sets containing each term. For the 70 terms we create a gold standard from the union of the synonyms from the three thesauri.</Paragraph>
<Paragraph position="2"> With this gold standard in place, it is possible to use precision and recall measures to evaluate the quality of the extracted thesaurus. To help overcome the problems of direct comparisons we use several measures of system performance: direct matches (DIRECT), inverse rank (INVR), and precision of the top n synonyms (P(n)), for n = 1, 5 and 10.</Paragraph>
<Paragraph position="3"> INVR is the sum of the inverse rank of each matching synonym, e.g. matching synonyms at ranks 3, 5 and 28 give an inverse rank score of 1/3 + 1/5 + 1/28, and with at most 200 synonyms, the maximum INVR score is 5.878. Precision of the top n is the percentage of matching synonyms in the top n extracted synonyms. There are a total of 23,207 synonyms for the 70 terms in the gold standard. Each measure is averaged over the extracted synonym lists for all 70 thesaurus terms.</Paragraph>
</Section>
</Paper>
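To make the gold-standard construction concrete, the following is a minimal sketch of conflating topic-ordered sense sets into head-ordered synonym lists and then taking the union across thesauri, as described in the section above. The function names (`conflate_to_head_ordered`, `build_gold_standard`) and the toy sense sets are illustrative assumptions, not the paper's code or any real thesaurus data.

```python
from collections import defaultdict

def conflate_to_head_ordered(sense_sets):
    """Collapse a topic-ordered thesaurus into head-ordered form.

    sense_sets: iterable of synonym sets, each containing the terms that
    share one sense/topic. Returns a dict mapping each term (head) to the
    union of all synonyms from every sense set it appears in.
    """
    head_ordered = defaultdict(set)
    for sense in sense_sets:
        for term in sense:
            # every other member of the sense set becomes a synonym of term
            head_ordered[term].update(s for s in sense if s != term)
    return head_ordered

def build_gold_standard(term, thesauri):
    """Union of the synonyms for `term` across several head-ordered thesauri."""
    gold = set()
    for thesaurus in thesauri:
        gold.update(thesaurus.get(term, set()))
    return gold

# Toy usage with invented sense sets (not real Roget's/Macquarie/Moby entries):
roget_like = conflate_to_head_ordered([{"company", "firm", "business"},
                                       {"company", "troupe", "ensemble"}])
moby_like = {"company": {"firm", "corporation", "guests"}}
print(sorted(build_gold_standard("company", [roget_like, moby_like])))
```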
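The evaluation measures themselves are simple to compute from a ranked synonym list and a gold-standard set. The sketch below implements DIRECT, INVR and P(n) following the definitions in the section; the name `evaluate_term` and the toy ranked list are our own assumptions, and averaging over the 70 terms is left to the caller.

```python
def evaluate_term(extracted, gold, ns=(1, 5, 10)):
    """Score one ranked list of extracted synonyms against a gold-standard set.

    extracted: candidate synonyms, best first (at most 200 in the paper's setup).
    gold: set of gold-standard synonyms for the same term.
    Returns DIRECT (number of matches), INVR (sum of 1/rank over matches) and
    P(n) (percentage of the top n that are matches) for each n in ns.
    """
    matches = [i + 1 for i, syn in enumerate(extracted) if syn in gold]  # 1-based ranks
    direct = len(matches)
    invr = sum(1.0 / rank for rank in matches)
    precision_at = {n: 100.0 * sum(1 for r in matches if r <= n) / n for n in ns}
    return direct, invr, precision_at

# Worked check of the INVR example from the text:
# matches at ranks 3, 5 and 28 give 1/3 + 1/5 + 1/28 ~= 0.569
_, invr, _ = evaluate_term(
    ["x"] * 2 + ["a"] + ["x"] + ["b"] + ["x"] * 22 + ["c"],
    {"a", "b", "c"})
print(round(invr, 3))  # 0.569

# With at most 200 synonyms, the maximum INVR is the 200th harmonic number:
print(round(sum(1.0 / r for r in range(1, 201)), 3))  # 5.878
```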