<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1029">
  <Title>Ensemble Methods for Automatic Thesaurus Extraction</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Automatic Thesaurus Extraction
</SectionTitle>
    <Paragraph position="0"> The development of large thesauri and semantic resources, such as WordNet (Fellbaum, 1998), has allowed lexical semantic information to be leveraged to solve NLP tasks, including collocation discovery (Pearce, 2001), model estimation (Brown et al., 1992; Clark and Weir, 2001) and text classification (Baker and McCallum, 1998).</Paragraph>
    <Paragraph position="1"> Unfortunately, thesauri are expensive and time-consuming to create manually, and tend to suffer from problems of bias, inconsistency, and limited coverage. In addition, thesaurus compilers cannot keep up with constantly evolving language use and cannot afford to build new thesauri for the many sub-domains that NLP techniques are being applied to.</Paragraph>
    <Paragraph position="2"> There is a clear need for automatic thesaurus extraction methods.</Paragraph>
    <Paragraph position="3"> Much of the existing work on thesaurus extraction and word clustering is based on the observation that related terms will appear in similar contexts. These systems differ primarily in their definition of context and the way they calculate similarity from the contexts each term appears in. Many systems extract co-occurrence and syntactic information from the words surrounding the target term, which is then converted into a vector-space representation of the contexts that each target term appears in (Pereira et al., 1993; Ruge, 1997; Lin, 1998b). Curran and Moens (2002b) evaluate thesaurus extractors based on several different models of context on large corpora. The context models used in our experiments are described in Section 3.</Paragraph>
    <Paragraph position="4"> We define a context relation instance as a tuple (w, r, w') where w is a thesaurus term, occurring in a relation of type r, with another word w' in the sentence. We refer to the tuple (r, w') as an attribute of w. The relation type may be grammatical or it may label the position of w' in a context window: e.g. the tuple (dog, direct-obj, walk) indicates that the term dog was the direct object of the verb walk. After the contexts have been extracted from the raw text, they are compiled into attribute vectors describing all of the contexts each term appears in. The thesaurus extractor then uses clustering or nearest-neighbour matching to select similar terms based on a vector similarity measure.</Paragraph>
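The compilation of context relation instances into attribute vectors can be sketched in Python; the instances below and the cosine measure are illustrative assumptions, not the paper's actual data or similarity function:

```python
from collections import Counter, defaultdict
import math

# Hypothetical context relation instances (w, r, w') extracted from raw text.
instances = [
    ("dog", "direct-obj", "walk"),
    ("dog", "subject", "bark"),
    ("cat", "direct-obj", "walk"),
    ("cat", "subject", "purr"),
]

# Compile an attribute vector per term: counts over attributes (r, w').
vectors = defaultdict(Counter)
for w, r, w2 in instances:
    vectors[w][(r, w2)] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[a] * v[a] for a in u if a in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

print(cosine(vectors["dog"], vectors["cat"]))  # → 0.5 (one shared attribute)
```

Here "dog" and "cat" share only the (direct-obj, walk) attribute, so their similarity is 0.5 under this toy measure.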
    <Paragraph position="5"> We use k-nearest-neighbour matching for thesaurus extraction, which calculates the pairwise similarity of the target term with every potential synonym. Given n terms and up to m attributes for each term, the asymptotic time complexity of the k-nearest-neighbour algorithm is O(n²m). We reduce the number of terms by introducing a minimum occurrence filter that eliminates potential synonyms with a frequency less than five.</Paragraph>
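A minimal Python sketch of this matching step, assuming toy vectors, toy frequencies, and a simple shared-attribute similarity (none of which are the paper's actual data or measure): each of the up-to-n candidates is compared against the target in O(m), giving the O(n²m) cost when repeated for all targets, and the minimum occurrence filter prunes low-frequency candidates first.

```python
from collections import Counter

def overlap(u, v):
    """Illustrative similarity: summed minimum counts of shared attributes."""
    return sum(min(u[a], v[a]) for a in u if a in v)

def nearest_neighbours(target, vectors, freq, k=2, min_freq=5, sim=overlap):
    """Rank candidate synonyms of `target` by vector similarity,
    skipping candidates below the minimum occurrence threshold."""
    scored = []
    for term, vec in vectors.items():                     # up to n terms...
        if term == target or freq[term] < min_freq:
            continue                                      # minimum occurrence filter
        scored.append((sim(vectors[target], vec), term))  # ...each compared in O(m)
    scored.sort(reverse=True)
    return [t for _, t in scored[:k]]

# Toy attribute vectors and term frequencies (illustrative only).
vectors = {
    "dog":   Counter({("direct-obj", "walk"): 2, ("subject", "bark"): 1}),
    "cat":   Counter({("direct-obj", "walk"): 1}),
    "puppy": Counter({("direct-obj", "walk"): 3}),
}
freq = {"dog": 12, "cat": 9, "puppy": 3}

print(nearest_neighbours("dog", vectors, freq, k=1))  # → ['cat']; puppy filtered (freq 3 < 5)
```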
  </Section>
</Paper>