<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1011">
  <Title>Approximate Searching for Distributional Similarity</Title>
  <Section position="3" start_page="97" end_page="97" type="intro">
    <SectionTitle>
2 Distributional Similarity
</SectionTitle>
    <Paragraph position="0"> Distributional similarity systems can be separated into two components. The first component extracts the contexts from raw text and compiles them into a statistical description of the contexts each term appears in. The second component performs nearest-neighbour search or clustering to determine which terms are similar, based on distance calculations between their context vectors. The approach used in this paper follows Curran (2004).</Paragraph>
    <Section position="1" start_page="97" end_page="97" type="sub_section">
      <SectionTitle>
2.1 Extraction Method
</SectionTitle>
      <Paragraph position="0"> A context relation is defined as a tuple (w, r, w'), where w is a term that occurs in some grammatical relation r with another word w' in some sentence. We refer to the tuple (r, w') as an attribute of w. For example, (dog, direct-obj, walk) indicates that dog was the direct object of walk in a sentence.</Paragraph>
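The tuple definition above can be illustrated with a minimal Python sketch. The class name ContextRelation, its field names, and the attribute() helper are hypothetical names chosen for illustration; they are not taken from the paper or from any released implementation.

```python
from typing import NamedTuple, Tuple


class ContextRelation(NamedTuple):
    """A context relation (w, r, w'); all names here are illustrative."""
    term: str      # w: the term being described
    relation: str  # r: the grammatical relation
    word: str      # w': the other word in the relation

    def attribute(self) -> Tuple[str, str]:
        # The attribute of w is the pair (r, w').
        return (self.relation, self.word)


# The example from the text: dog was the direct object of walk.
rel = ContextRelation("dog", "direct-obj", "walk")
print(rel.term, rel.attribute())
```

Keeping the attribute as a hashable pair makes it straightforward to use as a dictionary key when counting frequencies later.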
      <Paragraph position="1"> Context extraction begins with a Maximum Entropy POS tagger and chunker (Ratnaparkhi, 1996).</Paragraph>
      <Paragraph position="2"> The Grefenstette (1994) relation extractor produces context relations that are then lemmatised using the Minnen et al. (2000) morphological analyser. The relations for each term are collected together and counted, producing a context vector of attributes and their frequencies in the corpus.</Paragraph>
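The final collecting and counting step might look roughly like the following sketch. It assumes that tagging, chunking, relation extraction and lemmatisation have already produced (w, r, w') tuples; the function name build_context_vectors and the toy relations are hypothetical and only serve the illustration.

```python
from collections import Counter, defaultdict


def build_context_vectors(relations):
    """Group (w, r, w') tuples by term and count each attribute (r, w')."""
    vectors = defaultdict(Counter)
    for term, relation, word in relations:
        vectors[term][(relation, word)] += 1
    return vectors


# Toy, already lemmatised relations (invented for this example).
toy_relations = [
    ("dog", "direct-obj", "walk"),
    ("dog", "direct-obj", "walk"),
    ("dog", "subject", "bark"),
    ("cat", "direct-obj", "feed"),
]

vectors = build_context_vectors(toy_relations)
for term, vector in vectors.items():
    print(term, dict(vector))
```

Each term maps to a sparse vector of attribute frequencies, which is the input assumed by the weight and measure functions.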
    </Section>
    <Section position="2" start_page="97" end_page="97" type="sub_section">
      <SectionTitle>
2.2 Measures and Weights
</SectionTitle>
      <Paragraph position="0"> Both nearest-neighbour and cluster analysis methods require a distance measure that calculates the similarity between context vectors. Curran (2004) decomposes this measure into measure and weight functions. The measure function calculates the similarity between two weighted context vectors and the weight function calculates a weight from the raw frequency information for each context relation.</Paragraph>
      <Paragraph position="1"> The SASH requires a distance measure that satisfies the properties of a metric space (see Section 4.1). For these experiments we use the JACCARD (1) measure and the TTEST (2) weight, as Curran and Moens (2002a) found them to have the best performance in their comparison of many distance measures.</Paragraph>
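Equations (1) and (2) are not reproduced in this extract, so the following Python sketch rests on assumptions. It takes the weighted JACCARD measure to be the sum of attribute-wise minima divided by the sum of attribute-wise maxima, and the TTEST weight to be (p(w, r, w') - p(*, r, w') p(w, *, *)) / sqrt(p(*, r, w') p(w, *, *)), the forms usually attributed to Curran (2004). Clipping negative weights to zero is an additional simplification of this sketch, and the function names and toy counts are hypothetical.

```python
import math
from collections import Counter


def ttest_weights(vectors):
    """Weight function: raw frequencies to (assumed) t-test weights."""
    total = sum(sum(vec.values()) for vec in vectors.values())
    attr_totals = Counter()
    for vec in vectors.values():
        attr_totals.update(vec)  # frequency of (r, w') over all terms
    weighted = {}
    for term, vec in vectors.items():
        p_term = sum(vec.values()) / total  # p(w, *, *)
        weights = {}
        for attr, freq in vec.items():
            expected = (attr_totals[attr] / total) * p_term
            t = (freq / total - expected) / math.sqrt(expected)
            weights[attr] = max(0.0, t)  # assumption: clip negatives
        weighted[term] = weights
    return weighted


def jaccard(u, v):
    """Measure function: weighted Jaccard between two weighted vectors."""
    attrs = set(u) | set(v)
    num = sum(min(u.get(a, 0.0), v.get(a, 0.0)) for a in attrs)
    den = sum(max(u.get(a, 0.0), v.get(a, 0.0)) for a in attrs)
    return num / den if den > 0 else 0.0


# Toy frequency vectors (invented for this example).
counts = {
    "dog": Counter({("direct-obj", "walk"): 2, ("direct-obj", "feed"): 2}),
    "cat": Counter({("direct-obj", "feed"): 2, ("subject", "purr"): 1}),
    "car": Counter({("direct-obj", "drive"): 5}),
}
weighted = ttest_weights(counts)
sim_dog_cat = jaccard(weighted["dog"], weighted["cat"])
sim_dog_car = jaccard(weighted["dog"], weighted["car"])
print(sim_dog_cat, sim_dog_car)
```

Separating the two functions mirrors the decomposition described above: the weight function is applied once per term, and the measure function then compares any pair of weighted vectors.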
      <Paragraph position="3"/>
    </Section>
  </Section>
</Paper>