<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1177">
  <Title>Automatic Identification of Infrequent Word Senses</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments Filtering Senses from
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Domain Specific Texts
</SectionTitle>
      <Paragraph position="0"> A major motivation for our work is to tailor a sense inventory to the text at hand. In this section we apply our filtering method to two domain-specific corpora and demonstrate that the senses filtered from these corpora are determined by the domain. The Reuters corpus (Rose et al., 2002) is a collection of about 810,000 Reuters English-language news stories covering the period August 1996 to August 1997. Many of the news stories are economy related, but several other topics are included too. We have selected documents from the SPORTS domain (topic code: GSPO) and a limited number of documents from the FINANCE domain (topic codes: ECAT (ECONOMICS) and MCAT (MARKETS)). We chose the domains of SPORTS and FINANCE since there is sufficient material for these domains in this publicly available corpus.</Paragraph>
      <Paragraph position="1"> The SPORT corpus consists of 35,317 documents (about 9.1 million words). The FINANCE corpus consists of 117,734 documents (about 32.5 million words). We acquired thesauruses for these corpora using the procedure described in section 2.2.</Paragraph>
      <Paragraph position="2"> There is no existing gold standard that we could use to determine the frequency of word senses within these domain-specific corpora. Instead we evaluate our method using the Subject Field Codes (SFC) resource, which annotates WordNet synsets with domain labels. The SFC contains an economy label and a sports label. For this domain label experiment we selected all the words in WordNet that have at least one synset labelled economy and at least one synset labelled sports. The resulting set consisted of 38 words. The relative frequency of the domain labels for all the sense types of the 38 words is shown in figure 1. The three main domain labels for these 38 words are of course sports, economy and factotum (domain independent). In figure 2 we contrast the relative frequency distribution of domain labels for filtered senses (at a fixed value of the filtering threshold) of these 38 words in i) the BNC, ii) the FINANCE corpus and iii) the SPORT corpus.</Paragraph>
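      <Paragraph> The word selection and label counting described above can be sketched roughly as follows. This is only an illustration: the toy annotation table, the function names words_with_both and label_distribution, and the example words are all assumptions made for this sketch, not the SFC data or the code used in the experiments.

```python
from collections import Counter

# Toy stand-in for the SFC annotation (hypothetical data): each word maps to
# a list of senses, and each sense carries a set of domain labels.
TOY_SENSE_LABELS = {
    "score": [{"sports"}, {"economy"}, {"factotum"}],
    "bank": [{"economy"}, {"geography"}],
    "club": [{"sports"}, {"factotum"}],
}


def words_with_both(sense_labels, first="economy", second="sports"):
    """Return the words with at least one sense labelled `first` and one labelled `second`."""
    selected = []
    for word, senses in sense_labels.items():
        has_first = any(first in labels for labels in senses)
        has_second = any(second in labels for labels in senses)
        if has_first and has_second:
            selected.append(word)
    return sorted(selected)


def label_distribution(sense_labels, words):
    """Relative frequency of each domain label over all sense types of `words`."""
    counts = Counter()
    for word in words:
        for labels in sense_labels[word]:
            counts.update(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


selected_words = words_with_both(TOY_SENSE_LABELS)
distribution = label_distribution(TOY_SENSE_LABELS, selected_words)
```

Applying label_distribution only to the senses that a filter removes, once per corpus, would give the kind of per-corpus comparison that figure 2 reports.</Paragraph>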
      <Paragraph position="3"> From this figure one can see that more economy and commerce senses are removed from the SPORT corpus, with no filtered sports labels. The FINANCE and BNC corpora do have some filtered economy and commerce labels, but these are only a small percentage of the filtered senses, and there are fewer for FINANCE than for the BNC.</Paragraph>
      <Paragraph position="4"> Table 5 shows the percentage of sense types filtered at different values of the filtering threshold. A relatively larger percentage of sense types is filtered in the BNC than in the FINANCE corpus, which in turn has a larger percentage than the SPORT corpus.</Paragraph>
      <Paragraph position="5"> This is particularly noticeable at lower values of the filtering threshold. It occurs because, for these 38 words, the ranking scores are less spread out in the FINANCE and SPORT corpora, which arises from the relative size of the corpora and the spread of the distributional similarity scores. We conclude from these experiments that the threshold value should be selected according to the corpus as well as the requirements of the application. There is also scope for investigating other distributional similarity scores and other filtering thresholds, for example, taking into account the variance of the ranking scores in the corpus.</Paragraph>
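      <Paragraph> To illustrate why the spread of the ranking scores matters, the following minimal sketch computes the percentage of sense types filtered at a given threshold. It rests on an assumption that is not stated in this section: a sense is treated as filtered when its ranking score, expressed as a percentage of the total ranking score for its word, falls below the threshold. The function name percent_filtered and all the scores are invented for illustration.

```python
def percent_filtered(scores_by_word, threshold):
    """Percentage of sense types whose share of the word's total score is below `threshold`.

    Assumed criterion (not taken from the paper): a sense is filtered when
    100 * score / sum(scores for the word) is smaller than `threshold`.
    Every word is assumed to have a positive total score.
    """
    total_senses = 0
    filtered = 0
    for scores in scores_by_word.values():
        word_total = sum(scores)
        for score in scores:
            total_senses += 1
            share = 100.0 * score / word_total
            if threshold > share:
                filtered += 1
    return 100.0 * filtered / total_senses


# Invented ranking scores: one list of per-sense scores for each word.
SPREAD_OUT = {"w1": [8.0, 1.5, 0.5], "w2": [6.0, 4.0]}
BUNCHED = {"w1": [3.6, 3.3, 3.1], "w2": [5.2, 4.8]}

spread_at_10 = percent_filtered(SPREAD_OUT, 10)
spread_at_20 = percent_filtered(SPREAD_OUT, 20)
bunched_at_10 = percent_filtered(BUNCHED, 10)
bunched_at_20 = percent_filtered(BUNCHED, 20)
```

With these toy numbers the spread-out scores lose some senses at both thresholds, while the bunched scores lose none, which mirrors the qualitative pattern described for the BNC against the FINANCE and SPORT corpora.</Paragraph>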
    </Section>
  </Section>
</Paper>