File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/n03-4011_intro.xml

Size: 2,220 bytes

Last Modified: 2025-10-06 14:01:50

<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-4011">
  <Title>Automatically Discovering Word Senses</Title>
  <Section position="2" start_page="1" end_page="1" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Using word senses rather than word forms is useful in many applications, such as information retrieval (Voorhees 1998), machine translation (Hutchins and Somers 1992), and question answering (Pasca and Harabagiu 2001).</Paragraph>
    <Paragraph position="1"> The Distributional Hypothesis (Harris 1985) states that words that occur in the same contexts tend to be similar. Many approaches compute the similarity between words based on their distribution in a corpus (Hindle 1990; Landauer and Dumais 1997; Lin 1998). For each word, these programs output a ranked list of similar words. For example, Lin's approach outputs the following similar words for wine and suit: wine: beer, white wine, red wine, Chardonnay, champagne, fruit, food, coffee, juice, Cabernet, cognac, vinegar, Pinot noir, milk, vodka, ...</Paragraph>
    <Paragraph position="2"> suit: lawsuit, jacket, shirt, pant, dress, case, sweater, coat, trouser, claim, business suit, blouse, skirt, litigation, ...</Paragraph>
    <Paragraph position="3"> The similar words of wine represent the meaning of wine. However, the similar words of suit represent a mixture of its clothing and litigation senses. Such lists of similar words do not distinguish between the multiple senses of polysemous words.</Paragraph>
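The ranked lists discussed above can be illustrated with a minimal sketch. It is not Lin's method: it assumes a simple word-window context and cosine similarity in place of dependency-based features, and the corpus, the function names (context_vectors, cosine, similar_words) and the window size are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy corpus; a real system would use a large corpus.
CORPUS = [
    "he wore a suit to work",
    "he wore a jacket to work",
    "she filed a suit in court",
    "she filed a lawsuit in court",
    "they drank wine with dinner",
    "they drank beer with dinner",
]


def context_vectors(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            vectors[word].update(neighbours)
    return vectors


def cosine(u, v):
    """Cosine similarity between two context-count vectors."""
    dot = sum(count * v[feature] for feature, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if dot else 0.0


def similar_words(word, vectors, top_n=2):
    """Return the top_n words most similar to `word`, best first."""
    target = vectors.get(word, Counter())
    scores = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in ranked[:top_n] if score > 0]


vectors = context_vectors(CORPUS)
print("suit:", similar_words("suit", vectors))
print("wine:", similar_words("wine", vectors, top_n=1))
```

On this toy corpus the list for suit contains both a clothing word and a litigation word, which is the mixing of senses described above, while the list for wine stays within a single sense.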
    <Paragraph position="4"> We will demonstrate the output of a distributional clustering algorithm called Clustering by Committee (CBC) that discovers word senses automatically from text. The demonstration is currently available online at www.cs.ualberta.ca/~lindek/demos/wordcluster.htm. Each cluster that a word belongs to corresponds to a sense of the word. The following is a sample output from our algorithm: Each entry shows the clusters to which the headword belongs along with its similarity to each cluster. The lists of words are the four members most similar to the cluster centroid. Each cluster corresponds to a sense of the headword.</Paragraph>
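The entry format described above (clusters for a headword, the similarity to each cluster, and the members closest to the centroid) can be sketched as follows. This is not the CBC algorithm itself: the clusters are given by hand rather than discovered, and the feature counts, the cluster names, the 0.5 threshold and the function names are hypothetical values chosen only to show the shape of the output.

```python
from math import sqrt

# Hypothetical feature counts, invented for illustration only.
VECTORS = {
    "jacket": {"wear": 4, "button": 2},
    "shirt": {"wear": 4, "button": 3},
    "dress": {"wear": 5, "button": 1},
    "coat": {"wear": 3, "button": 2},
    "sweater": {"wear": 4, "file": 1},
    "lawsuit": {"file": 5, "court": 3},
    "case": {"file": 3, "court": 4},
    "claim": {"file": 4, "court": 1},
    "litigation": {"file": 2, "court": 3},
    "suit": {"wear": 4, "button": 1, "file": 4, "court": 2},
    "wine": {"drink": 5},
}

# Hand-made clusters standing in for the output of a clustering algorithm.
CLUSTERS = {
    "clothing": ["jacket", "shirt", "dress", "coat", "sweater"],
    "litigation": ["lawsuit", "case", "claim", "litigation"],
}


def cosine(u, v):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def centroid(members):
    """Average the feature vectors of the cluster members."""
    total = {}
    for member in members:
        for feature, weight in VECTORS[member].items():
            total[feature] = total.get(feature, 0.0) + weight
    return {feature: weight / len(members) for feature, weight in total.items()}


def senses(headword, threshold=0.5, top_n=4):
    """List (cluster, similarity, closest members) for each cluster the headword fits."""
    entries = []
    for name, members in CLUSTERS.items():
        centre = centroid(members)
        similarity = cosine(VECTORS[headword], centre)
        if similarity >= threshold:
            closest = sorted(members, key=lambda m: -cosine(VECTORS[m], centre))[:top_n]
            entries.append((name, round(similarity, 2), closest))
    return sorted(entries, key=lambda entry: -entry[1])


for name, similarity, members in senses("suit"):
    print("suit", name, similarity, ", ".join(members))
```

With these invented numbers, suit is listed under both clusters, one entry per sense, whereas a word such as wine that is close to neither centroid receives no entry.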
  </Section>
</Paper>