<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2023">
  <Title>Category-Based Pseudowords</Title>
  <Section position="5" start_page="2" end_page="4" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> For the experiments reported below, we trained an unsupervised Naive Bayes classifier using the categories as both targets and as context features. For example, an occurrence of the word haptoglobin in the context surrounding the word to be disambiguated will be replaced by its category label D12. Only unambiguous context words were used. The result of the disambiguation step is a category name, standing as a proxy for the word sense.</Paragraph>
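The feature-mapping step described above can be illustrated with a minimal sketch. It assumes a lexicon that maps each word to the set of categories it belongs to; the function name, the lexicon layout, and every entry except the haptoglobin to D12 mapping are hypothetical and are not taken from the paper.

```python
def categorize_context(context_words, word_to_categories):
    """Replace each context word by its category label.

    Words that are missing from the lexicon, or that map to more than
    one category (ambiguous words), are dropped.
    """
    features = []
    for word in context_words:
        categories = word_to_categories.get(word, set())
        # A word counts as unambiguous when it maps to exactly one category.
        if len(categories) == 1:
            features.append(next(iter(categories)))
    return features


# Hypothetical toy lexicon; only haptoglobin -> D12 comes from the text.
toy_lexicon = {
    "haptoglobin": {"D12"},
    "cold": {"C01", "G03"},  # made-up ambiguous entry
}

print(categorize_context(["serum", "haptoglobin", "cold"], toy_lexicon))
```

The resulting category labels would then serve as the context features of the classifier, in place of the raw words.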
    <Paragraph position="1"> Table 3 reports accuracies for several experiments in terms of macroaverages (the average over the individual accuracies for each pseudoword). Baseline refers to choos[...] (Table 3 column headers: CW, Base., Pess., Real., Abbrev., Opt.)</Paragraph>
    <Paragraph position="2"> Pessimistic refers to the evenly distributed category-based pseudowords, generated by requiring the word frequency to fall in the interval [E/2;3E/2]. In the column labeled Realistic, the requirement for evenly distributed senses is dropped, although the component words must have a frequency of at least 5. The column labeled Optimistic refers to the results when the pseudowords are generated the standard way: the words are selected at random rather than according to the category sets.</Paragraph>
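The two frequency constraints can be sketched as simple predicates. This is only an illustration: E is treated as a given reference frequency (it is not defined in this section), the interval is assumed to be closed at both ends, and the word names and counts are made up.

```python
def eligible_pessimistic(freq, e):
    """True when freq lies in the interval [E/2, 3E/2] (assumed closed)."""
    return 3 * e / 2 >= freq >= e / 2


def eligible_realistic(freq, min_freq=5):
    """True when the word occurs at least min_freq times."""
    return freq >= min_freq


# Made-up frequencies for four hypothetical candidate words.
freqs = {"w1": 4, "w2": 60, "w3": 100, "w4": 160}
e = 100  # hypothetical reference frequency

pessimistic = [w for w, f in freqs.items() if eligible_pessimistic(f, e)]
realistic = [w for w, f in freqs.items() if eligible_realistic(f)]
print(pessimistic)
print(realistic)
```

Under these assumptions the Pessimistic filter is the stricter of the two: every word it admits for E of at least 10 also passes the Realistic threshold of 5.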
    <Paragraph position="3"> We expected the Realistic pseudowords to produce a better lower-bound estimate of the performance of a WSD algorithm on real word senses than the Optimistic ones. To test this hypothesis, we followed a method suggested by Liu et al. (2002) and evaluated the classifier on a set of 217 two-sense abbreviations (see Table 4).</Paragraph>
    <Paragraph position="4"> Abbreviations are real ambiguous words, but they are also artificial in a sense. Many homonyms are similar in meaning as well as spelling because they derive etymologically from the same root. By contrast, similar spelling in abbreviations is often simply an accident of shared initial characters in compound nouns. Thus abbreviations occupy an intermediate position between entirely random pseudowords and standard real ambiguous words.</Paragraph>
    <Paragraph position="5"> We extracted 98,841 unique abbreviation-expansion pairs using code developed by Schwartz &amp; Hearst (2003), and retained only those abbreviations whose expansions could be fully and unambiguously mapped to a single truncated MeSH category. The different expansions of each abbreviation were required to correspond to exactly two distinct categories (with overlap allowed when there were more than two expansions for a given abbreviation).</Paragraph>
    <Paragraph position="6"> Footnotes: (1) The baseline is dependent on the (pseudo)words used. The one shown is the baseline for the abbreviations collection. (2) From med-line03n0210.xml to med-line03n0229.xml.</Paragraph>
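The selection of two-sense abbreviations can be sketched as follows. This is one reading of the criteria stated in the text, not the authors' code: it assumes that every expansion of an abbreviation must map to a category, and that an abbreviation is kept when its expansions cover exactly two distinct categories. All names and data below are hypothetical.

```python
def two_category_abbreviations(abbrev_expansions, expansion_category):
    """Keep abbreviations whose expansions cover exactly two categories.

    abbrev_expansions: dict mapping an abbreviation to a list of expansions.
    expansion_category: dict mapping an expansion to its single category.
    """
    kept = {}
    for abbrev, expansions in abbrev_expansions.items():
        # Assumption: drop the abbreviation if any expansion is unmapped.
        if not all(exp in expansion_category for exp in expansions):
            continue
        categories = {expansion_category[exp] for exp in expansions}
        # Several expansions may share a category (overlap is allowed),
        # as long as exactly two distinct categories remain.
        if len(categories) == 2:
            kept[abbrev] = sorted(categories)
    return kept


# Made-up abbreviations, expansions, and category labels.
abbrevs = {
    "AB": ["alpha beta", "acid base", "alpha blocker"],
    "CD": ["cell division"],
    "EF": ["ejection fraction", "elongation factor"],
}
categories = {
    "alpha beta": "X1",
    "acid base": "X2",
    "alpha blocker": "X1",
    "cell division": "X3",
    "ejection fraction": "X4",
    # "elongation factor" is deliberately left unmapped
}

print(two_category_abbreviations(abbrevs, categories))
```

In this toy data only AB survives: CD has a single category and EF has an expansion that cannot be mapped.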
    <Paragraph position="7"> The question we wanted to explore is how well the classifier performs on category-based pseudowords compared with abbreviations. As can be seen from Table 3, the accuracies for the abbreviations (evaluated on 332,020 instances) fall between those for the Realistic and Optimistic pseudowords, as expected.</Paragraph>
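Because the accuracies in Table 3 are macroaverages, each (pseudo)word contributes equally regardless of how many instances it has. The sketch below contrasts this with an instance-weighted (micro) average; the counts are made up and the function names are hypothetical.

```python
def macro_average(per_word_counts):
    """Average of per-word accuracies; every word has the same weight."""
    accuracies = [correct / total for correct, total in per_word_counts.values()]
    return sum(accuracies) / len(accuracies)


def micro_average(per_word_counts):
    """Accuracy over all instances pooled together."""
    correct = sum(c for c, _ in per_word_counts.values())
    total = sum(t for _, t in per_word_counts.values())
    return correct / total


# Made-up (correct, total) counts for two hypothetical pseudowords.
counts = {"pw1": (90, 100), "pw2": (1, 2)}

print(round(macro_average(counts), 3))
print(round(micro_average(counts), 3))
```

With these toy counts the rare word pulls the macroaverage well below the pooled accuracy, which is why the two figures should not be compared directly.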
  </Section>
</Paper>