<?xml version="1.0" standalone="yes"?>
<Paper uid="J05-4002">
  <Title>Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity</Title>
  <Section position="2" start_page="0" end_page="440" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Over recent years, approaches to a broad range of natural language processing (NLP) applications have been proposed that require knowledge about the similarity of words.</Paragraph>
    <Paragraph position="1"> The application areas in which these approaches have been proposed range from speech recognition and parse selection to information retrieval (IR) and natural language [?] Department of Informatics, University of Sussex, Falmer, Brighton, BN1 9QH, UK.</Paragraph>
    <Paragraph position="2"> Submission received: 4 May 2004; revised submission received: 16 November 2004; accepted for publication: 16 April 2005.</Paragraph>
    <Paragraph position="3"> (c) 2006 Association for Computational Linguistics Computational Linguistics Volume 31, Number 4 generation. For example, language models that incorporate substantial lexical knowledge play a key role in many statistical NLP techniques (e.g., in speech recognition and probabilistic parse selection). However, they are difficult to acquire, since many plausible combinations of events are not seen in corpus data. Brown et al. (1992) report that one can expect 14.7% of the word triples in any new English text to be unseen in a training corpus of 366 million English words. In our own experiments with grammatical relation data extracted by a Robust Accurate Statistical Parser (RASP) (Briscoe and Carroll 1995; Carroll and Briscoe 1996) from the British National Corpus (BNC), we found that 14% of noun-verb direct-object co-occurrence tokens and 49% of noun-verb direct-object co-occurrence types in one half of the data set were not seen in the other half. A statistical technique using a language model that assigns a zero probability to these previously unseen events will rule the correct parse or interpretation of the utterance impossible.</Paragraph>
    <Paragraph position="4"> Similarity-based smoothing (Hindle 1990; Brown et al. 1992; Dagan, Marcus, and Markovitch 1993; Pereira, Tishby, and Lee 1993; Dagan, Lee, and Pereira 1999) provides an intuitively appealing approach to language modeling. In order to estimate the probability of an unseen co-occurrence of events, estimates based on seen occurrences of similar events can be combined. For example, in a speech recognition task, we might predict that cat is a more likely subject of growl than the word cap, even though neither co-occurrence has been seen before, based on the fact that cat is &amp;quot;similar&amp;quot; to words that do occur as the subject of growl (e.g., dog and tiger), whereas cap is not.</Paragraph>
    <Paragraph position="5"> However, what is meant when we say that cat is &amp;quot;similar&amp;quot; to dog? Are we referring to their semantic similarity, e.g., the components of meaning they share by virtue of both being carnivorous four-legged mammals? Or are we referring to their distributional similarity, e.g., in keeping with the Firthian tradition,  the fact that these words tend to occur as the arguments of the same verbs (e.g., eat, feed, sleep) and tend to be modified by the same adjectives (e.g., hungry and playful).</Paragraph>
    <Paragraph position="6"> In some applications, the knowledge required is clearly semantic. In IR, documents might be usefully retrieved that use synonymous terms or terms subsuming those specified in a user's query (Xu and Croft 1996). In natural language generation (including text simplification), possible words for a concept should be similar in meaning rather than just in syntactic or distributional behavior. In these application areas, distributional similarity can be taken to be an approximation to semantic similarity. The underlying idea is based largely on the central claim of the distributional hypothesis (Harris 1968), that is: The meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities. This hypothesized relationship between distributional similarity and semantic similarity has given rise to a large body of work on automatic thesaurus generation (Hindle 1990; Grefenstette 1994; Lin 1998a; Curran and Moens 2002; Kilgarriff 2003). There are inherent problems in evaluating automatic thesaurus extraction techniques, and much research assumes a gold standard that does not exist (see Kilgarriff [2003] and Weeds [2003] for more discussion of this). A further problem for distributional similarity methods for automatic thesaurus generation is that they do not offer any obvious way to distinguish between linguistic relations such as synonymy, antonymy, and hyponymy (see Caraballo [1999] and Lin et al. [2003] for work on this). Thus, one may question</Paragraph>
    <Section position="1" start_page="440" end_page="440" type="sub_section">
      <SectionTitle>
Weeds and Weir Co-occurrence Retrieval
</SectionTitle>
      <Paragraph position="0"> the benefit of automatically generating a thesaurus if one has access to large-scale manually constructed thesauri (e.g., WordNet [Fellbaum 1998], Roget's [Roget 1911], the Macquarie [Bernard 1990] and Moby  ). Automatic techniques give us the opportunity to model language change over time or across domains and genres. McCarthy et al. (2004) investigate using distributional similarity methods to find predominant word senses within a corpus, making it possible to tailor an existing resource (WordNet) to specific domains. For example, in the computing domain, the word worm is more likely to be used in its 'malicious computer program' sense than in its 'earthworm' sense. This domain knowledge will be reflected in a thesaurus automatically generated from a computing-specific corpus, which will show increased similarity between worm and virus and reduced similarity between worm and caterpillar.</Paragraph>
      <Paragraph position="1"> In other application areas, however, the requirement for &amp;quot;similar&amp;quot; words to be semantically related as well as distributionally related is less clear. For example, in prepositional phrase attachment ambiguity resolution, it is necessary to decide whether the prepositional phrase attaches to the verb or the noun as in the examples (1) and (2).  1. Mary ((visited their cottage) with her brother).</Paragraph>
      <Paragraph position="2"> 2. Mary (visited (their cottage with a thatched roof)).</Paragraph>
      <Paragraph position="3">  Hindle and Rooth (1993) note that the correct decision depends on all four lexical events (the verb, the object, the preposition, and the prepositional object). However, a statistical model built on the basis of four lexical events must cope with extremely sparse data. One approach (Resnik 1993; Li and Abe 1998; Clark and Weir 2000) is to induce probability distributions over semantic classes rather than lexical items. For example, a cottage is a type of building and a brother is a type of person, and so the co-occurrence of any type of building and any type of person might increase the probability that the PP in example (1) attaches to the verb.</Paragraph>
      <Paragraph position="4"> However, it is unclear whether the classes over which probability distributions are induced need to be semantic or whether they could be purely distributional. If we know that two words tend to behave the same way with respect to prepositional phrase attachment, does it matter whether they mean similar things? Other arguments for using semantic classes over distributional classes can similarly be disputed (Weeds 2003). For example, it is not necessary for a class of objects to have a name or symbolic label for us to know that the objects are similar and to exploit that information. Distributional classes do conflate word senses, but in a task such as PP-attachment ambiguity resolution, we are unlikely to be working with sense-tagged examples and therefore it is for word forms that we will wish to estimate probabilities of different attachments. Finally, distributional classes may be over-fitted to a specific corpus, but this may be beneficial to the extent that the over-fitting reflects a specific domain or dialect. Further, recent empirical evidence suggests that techniques based on distributional similarity may perform as well on this task as those based on semantic similarity.</Paragraph>
      <Paragraph position="5"> Li (2002) shows that using a fairly small corpus (126,084 sentences from the Wall Street Journal) and a distributional similarity technique, it is possible to outperform a state-of-the-art, WordNet-based technique in terms of accuracy, although not in terms of coverage. Pantel and Lin (2000) report performance of 84.3% using an unsupervised approach to prepositional phrase attachment based on distributional similarity</Paragraph>
  </Section>
class="xml-element"></Paper>