<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2023">
  <Title>Category-Based Pseudowords</Title>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 Pseudoword Generation
</SectionTitle>
    <Paragraph position="0"> To create pseudowords with two-sense ambiguities, we first determined which ambiguous words fall into exactly two MeSH categories and built a list L of category pairs (see Table 1). We then generated pseudowords with the following characteristics:
      * The two possible pseudoword categories form a pair that actually occurs in the testing corpus and thus needs to be disambiguated;
      * The number of pseudowords drawn from a particular pair is proportional to its frequency;
      * Multi-word concepts can be used as pseudoword elements: e.g., ion-exchange chromatography and long-term effects can be conflated as ion-exchange chromatography long-term effects;
      * Only unambiguous words are used as pseudoword constituents.</Paragraph>
    <Paragraph position="1"> An important aspect of pseudoword creation is the relative frequency of the underlying words. Since the standard baseline for a WSD algorithm is to always choose the most frequent sense, a baseline evaluated on words whose senses are evenly balanced is expected to perform worse than one tested on words that are heavily skewed towards one sense (Sanderson &amp; van Rijsbergen, 1999).</Paragraph>
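    <Paragraph> The effect of sense balance on the most-frequent-sense baseline can be illustrated with a minimal sketch. The function name and the toy sense labels below are invented for illustration and are not taken from the experiments; the 92/8 split only echoes the average skew reported for naturally occurring text.</Paragraph>

```python
from collections import Counter


def most_frequent_sense_accuracy(sense_labels):
    """Accuracy obtained by always predicting the most frequent sense."""
    counts = Counter(sense_labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(sense_labels)


# Toy data (hypothetical): one balanced and one skewed pseudoword.
balanced = ["A"] * 50 + ["B"] * 50
skewed = ["A"] * 92 + ["B"] * 8

balanced_acc = most_frequent_sense_accuracy(balanced)
skewed_acc = most_frequent_sense_accuracy(skewed)
print(balanced_acc, skewed_acc)
```

    <Paragraph> On the balanced word the baseline is right half of the time, while on the skewed word it is right for every instance of the dominant sense, which is why the two settings have to be compared separately.</Paragraph>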
    <Paragraph position="2"> In naturally occurring text, the more frequent sense for the two-sense distinction is reported to occur 92% of the time on average; this result has been found both on the CACM collection and on the WordNet SEMCOR sense-tagged corpus (Sanderson &amp; van Rijsbergen, 1999). However, the challenge for WSD programs is to work on the harder cases, and the artificially constructed SENSEVAL1 corpus has more evenly distributed senses (Gaustad, 2001).</Paragraph>
    <Paragraph position="3"> In these experiments, we explicitly compare pseudowords whose underlying word frequencies are even against those that are skewed. To generate pseudowords with more uniform underlying distributions, we first calculate the expected testing corpus frequency of those words w</Paragraph>
    <Paragraph position="5"> that have been unambiguously mapped to MeSH and whose class is used in at least one pair in L. In this collection the expected frequency was E = 45.21, with a standard deviation of 451.19. We then built a list W of all MeSH concepts mapped in the text that have a class used in a pair in L and whose frequency is in the interval [E/2;3E/2], i.e., [34;56]. This yields a list of concepts that could potentially be combined into 64,596 pseudowords for evaluating the performance of the WSD algorithm over the classes in L.</Paragraph>
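    <Paragraph> The frequency filter described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the concept names, and the counts are hypothetical, and the interval bounds are passed as multiples of E (0.5 and 1.5 mirror the [E/2;3E/2] form quoted in the text).</Paragraph>

```python
from statistics import mean


def select_by_frequency(freqs, low_factor, high_factor):
    """Return the mean frequency E and the concepts whose frequency lies in
    the inclusive interval [low_factor * E, high_factor * E]."""
    expected = mean(list(freqs.values()))
    low = low_factor * expected
    high = high_factor * expected
    selected = sorted(c for c, f in freqs.items() if f >= low and high >= f)
    return expected, selected


# Toy corpus frequencies (hypothetical) for unambiguously mapped concepts.
toy_freqs = {"aspirin": 40, "femur": 10, "insulin": 50, "liver": 60, "protein": 40}

expected, candidates = select_by_frequency(toy_freqs, 0.5, 1.5)
print(expected, candidates)
```

    <Paragraph> Restricting constituents to a band around E keeps the two underlying words of a pseudoword at comparable frequencies, which is what makes the resulting sense distribution close to even.</Paragraph>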
    <Paragraph position="6"> We then generated a random subset of 1,000 pseudowords (88,758 instances) out of the possible 64,596 by applying the following importance sampling procedure:  from L by sampling from a multinomial distribution whose parameters are proportional to the frequencies of the elements of L.  has been sampled already, go to step 1) and try again.</Paragraph>
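    <Paragraph> A rough sketch of a sampling loop of this kind is given below. Only two steps are recoverable from the text: drawing a pair from L with probability proportional to its frequency, and retrying when a pseudoword has already been sampled. The intermediate step (drawing one constituent per class uniformly at random), the function name, the max_tries guard, and all toy data are assumptions made for illustration, not the procedure actually used.</Paragraph>

```python
import random


def sample_pseudowords(pair_freqs, words_by_class, n, seed=0, max_tries=10000):
    """Draw up to n distinct pseudowords by weighted sampling with rejection."""
    rng = random.Random(seed)
    pairs = list(pair_freqs)
    weights = [pair_freqs[p] for p in pairs]
    seen = set()
    result = []
    tries = 0
    while n > len(result) and max_tries > tries:
        tries += 1
        # Step 1: draw a category pair with probability proportional to its frequency.
        class_a, class_b = rng.choices(pairs, weights=weights, k=1)[0]
        # Assumed step: draw one constituent per class uniformly at random.
        word_a = rng.choice(words_by_class[class_a])
        word_b = rng.choice(words_by_class[class_b])
        pseudoword = (word_a, word_b)
        # Rejection step: if this pseudoword was sampled already, go back to step 1.
        if pseudoword in seen:
            continue
        seen.add(pseudoword)
        result.append((pseudoword, (class_a, class_b)))
    return result


# Toy inputs (hypothetical): category-pair frequencies and candidate words per class.
toy_pairs = {("A", "B"): 8, ("A", "C"): 2}
toy_words = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1"]}

sampled = sample_pseudowords(toy_pairs, toy_words, n=4)
for pseudoword, classes in sampled:
    print(" ".join(pseudoword), classes)
```

    <Paragraph> Because pairs are drawn in proportion to their frequency, frequent category pairs contribute more pseudowords, while the rejection step guarantees that no pseudoword appears twice in the sample.</Paragraph>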
    <Paragraph position="7"> Table 2 shows a random selection of pseudowords generated by the algorithm. Note that the more unusual pairings come from the less frequent category pairs, whereas those in which word senses are closer in meaning are drawn from more common category pairs.</Paragraph>
  </Section>
</Paper>