<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1046">
  <Title>Scaling Distributional Similarity to Large Corpora</Title>
  <Section position="4" start_page="0" end_page="361" type="intro">
    <SectionTitle>
2 Distributional Similarity
</SectionTitle>
    <Paragraph position="0"> Measuring distributional similarity first requires the extraction of context information for each of the vocabulary terms from raw text. These terms are then compared for similarity using a nearest-neighbour search or clustering based on distance calculations between the statistical descriptions of their contexts.</Paragraph>
    <Section position="1" start_page="361" end_page="361" type="sub_section">
      <SectionTitle>
2.1 Extraction
</SectionTitle>
      <Paragraph position="0"> A context relation is defined as a tuple (w, r, w'), where the term w occurs in some grammatical relation r with another word w' in some sentence. We refer to the tuple (r, w') as an attribute of w. For example, (dog, direct-obj, walk) indicates that dog was the direct object of walk in a sentence.</Paragraph>
      <Paragraph position="1"> In our experiments context extraction begins with a Maximum Entropy POS tagger and chunker. The SEXTANT relation extractor (Grefenstette, 1994) produces context relations that are then lemmatised. The relations for each term are collected together and counted, producing a vector of attributes and their frequencies in the corpus.</Paragraph>
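The extraction step above can be sketched in a few lines of Python. This is an illustrative stand-in, not the paper's pipeline: the relation triples are hard-coded here, whereas the paper obtains them from a POS tagger, chunker and the SEXTANT extractor.

```python
from collections import Counter

# Hypothetical lemmatised context relations: (term, grammatical relation, word).
relations = [
    ("dog", "direct-obj", "walk"),
    ("dog", "direct-obj", "walk"),
    ("dog", "subj", "bark"),
    ("cat", "subj", "sleep"),
]

# Collect, for each term w, a frequency vector over its attributes (r, w').
vectors = {}
for w, r, w2 in relations:
    vectors.setdefault(w, Counter())[(r, w2)] += 1

print(vectors["dog"])  # attribute frequencies for "dog"
```

Each term's `Counter` is the (sparse) context vector that the similarity measures below operate on.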
    </Section>
    <Section position="2" start_page="361" end_page="361" type="sub_section">
      <SectionTitle>
2.2 Measures and Weights
</SectionTitle>
      <Paragraph position="0"> Both nearest-neighbour and cluster analysis methods require a distance measure to calculate the similarity between context vectors. Curran (2004) decomposes this into measure and weight functions. The measure calculates the similarity between two weighted context vectors and the weight calculates the informativeness of each context relation from the raw frequencies.</Paragraph>
      <Paragraph position="1"> For these experiments we use the Jaccard (1) measure and the TTest (2) weight functions, found by Curran (2004) to have the best performance.</Paragraph>
      <Paragraph position="3"> (1) sim(w_m, w_n) = \frac{\sum_{(r,w')} \min(\mathrm{wgt}(w_m, *r, *w'),\ \mathrm{wgt}(w_n, *r, *w'))}{\sum_{(r,w')} \max(\mathrm{wgt}(w_m, *r, *w'),\ \mathrm{wgt}(w_n, *r, *w'))}
      (2) \mathrm{wgt}(w, r, w') = \frac{p(w, r, w') - p(*, r, w')\, p(w, *, *)}{\sqrt{p(*, r, w')\, p(w, *, *)}}</Paragraph>
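A minimal Python sketch of this measure/weight decomposition. The function and argument names are my own; the weight function takes pre-computed frequencies (joint, attribute marginal, term marginal, and corpus total) rather than extracting them itself.

```python
import math

def ttest_weight(f_wrw, f_rw, f_w, total):
    """T-Test weight: compares the joint probability p(w, r, w')
    against the independence product p(*, r, w') * p(w, *, *)."""
    p_joint = f_wrw / total  # p(w, r, w')
    p_attr = f_rw / total    # p(*, r, w')
    p_term = f_w / total     # p(w, *, *)
    return (p_joint - p_attr * p_term) / math.sqrt(p_attr * p_term)

def jaccard_measure(u, v):
    """Weighted Jaccard: sum of min weights over sum of max weights,
    taken over the union of the two terms' attributes."""
    attrs = set(u) | set(v)
    num = sum(min(u.get(a, 0.0), v.get(a, 0.0)) for a in attrs)
    den = sum(max(u.get(a, 0.0), v.get(a, 0.0)) for a in attrs)
    return num / den if den else 0.0
```

In use, each raw frequency vector is first mapped through the weight function, and the measure is then applied to pairs of weighted vectors.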
    </Section>
    <Section position="3" start_page="361" end_page="361" type="sub_section">
      <SectionTitle>
2.3 Nearest-neighbour Search
</SectionTitle>
      <Paragraph position="0"> The simplest algorithm for finding synonyms is a k-nearest-neighbour (k-NN) search, which involves pair-wise vector comparison of the target term with every term in the vocabulary. Given an n term vocabulary and up to m attributes per term, the asymptotic time complexity of nearest-neighbour search is O(n2m). This is prohibitively expensive: even a moderate vocabulary makes the use of huge datasets infeasible. Our largest experiments used a vocabulary of over 184,000 words.</Paragraph>
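The naive search can be sketched as follows. This is an illustrative toy, with a stand-in weighted-Jaccard similarity and a hard-coded three-term vocabulary; the paper's experiments run the same all-pairs scheme over 184,000 terms.

```python
import heapq

def jaccard(u, v):
    """Weighted Jaccard over two attribute-weight dicts
    (stand-in for the measure function of Section 2.2)."""
    attrs = set(u) | set(v)
    num = sum(min(u.get(a, 0.0), v.get(a, 0.0)) for a in attrs)
    den = sum(max(u.get(a, 0.0), v.get(a, 0.0)) for a in attrs)
    return num / den if den else 0.0

def nearest_neighbours(target, vectors, k=3):
    """Naive k-NN: score the target against every other term's vector.
    One query is O(nm); finding neighbours for all n terms is O(n^2 m)."""
    scored = ((jaccard(vectors[target], vec), term)
              for term, vec in vectors.items() if term != target)
    return heapq.nlargest(k, scored)

# Hypothetical weighted context vectors for a tiny vocabulary.
vectors = {
    "dog":   {("direct-obj", "walk"): 2.0, ("subj", "bark"): 1.0},
    "puppy": {("direct-obj", "walk"): 1.0, ("subj", "bark"): 1.0},
    "car":   {("direct-obj", "drive"): 3.0},
}
print(nearest_neighbours("dog", vectors, k=2))
```

The quadratic cost comes from the outer loop over query terms combined with the inner loop over the whole vocabulary, which is precisely what makes large corpora infeasible for this approach.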
    </Section>
  </Section>
</Paper>