<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-4011">
  <Title>Automatically Discovering Word Senses</Title>
  <Section position="3" start_page="1" end_page="22" type="metho">
    <SectionTitle>
2 Feature Representation
</SectionTitle>
    <Paragraph position="0"> Following (Lin 1998), we represent each word by a feature vector. Each feature corresponds to a context in which the word occurs. For example, &amp;quot;sip __&amp;quot; is a verb-object context. If the word wine occurred in this context, the context is a feature of wine. These features are obtained by parsing a large corpus using Minipar (Lin 1994), a broad-coverage English parser. The value of a feature is the pointwise mutual information (Manning and Schutze 1999) between the feature and the word. Let c be a context and F_c(w) be the frequency count of a word w occurring in context c. The pointwise mutual information between w and c is defined as:</Paragraph>
    <Paragraph position="1"> mi_{w,c} = \log \frac{F_c(w)/N}{\left(\sum_i F_i(w)/N\right) \times \left(\sum_j F_c(j)/N\right)}</Paragraph>
    <Paragraph position="2"> where N is the total frequency count of all words and their contexts. We compute the similarity between two words w_1 and w_2 using the cosine coefficient (Salton and McGill 1983) of their mutual information vectors:</Paragraph>
    <Paragraph position="3"> sim(w_1, w_2) = \frac{\sum_c mi_{w_1,c} \times mi_{w_2,c}}{\sqrt{\sum_c mi_{w_1,c}^2 \times \sum_c mi_{w_2,c}^2}}</Paragraph>
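The feature representation above can be sketched in a few lines of Python: pointwise mutual information computed from (word, context) frequency counts, and the cosine coefficient of the resulting sparse vectors. The toy verb-object counts below are hypothetical, not from the paper's corpus.

```python
import math
from collections import Counter, defaultdict

def mi_vectors(pairs):
    """Build pointwise-mutual-information feature vectors.
    `pairs` maps (word, context) -> frequency count F_c(w)."""
    N = sum(pairs.values())          # total count of all words and contexts
    word_tot = Counter()             # sum_i F_i(w): total count of word w
    ctx_tot = Counter()              # sum_j F_c(j): total count of context c
    for (w, c), f in pairs.items():
        word_tot[w] += f
        ctx_tot[c] += f
    vectors = defaultdict(dict)
    for (w, c), f in pairs.items():
        # mi = log of observed joint probability over product of marginals
        vectors[w][c] = math.log((f / N) / ((word_tot[w] / N) * (ctx_tot[c] / N)))
    return vectors

def cosine(v1, v2):
    """Cosine coefficient of two sparse mutual-information vectors."""
    dot = sum(v1[c] * v2.get(c, 0.0) for c in v1)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hypothetical counts: "sip __" is a verb-object context of wine.
counts = {
    ("wine", "sip __"): 3, ("wine", "drink __"): 5,
    ("beer", "drink __"): 6, ("beer", "brew __"): 2,
    ("car",  "drive __"): 7,
}
vecs = mi_vectors(counts)
```

On these counts, wine and beer share the "drink __" context and so get a positive similarity, while wine and car share no contexts and score zero.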
  </Section>
  <Section position="4" start_page="22" end_page="22" type="metho">
    <SectionTitle>
3 Clustering by Committee
</SectionTitle>
    <Paragraph position="0"> CBC finds clusters by first discovering the underlying structure of the data. It does this by searching for sets of representative elements for each cluster, which we refer to as committees. The goal is to find committees that unambiguously describe the (unknown) target classes. By choosing committee members carefully, we ensure that the centroid's features tend to be the typical features of the target class. For example, our system chose the following committee members to compute the centroid of the state cluster: Illinois, Michigan, Minnesota, Iowa, Wisconsin, Indiana, Nebraska and Vermont. States like Washington and New York are not part of the committee because they are polysemous.</Paragraph>
    <Paragraph position="1"> The centroid of a cluster is constructed by averaging the feature vectors of the committee members.</Paragraph>
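Centroid construction by averaging can be sketched directly; the sparse member vectors below are hypothetical illustrations, not actual MI values from the paper.

```python
from collections import defaultdict

def centroid(members):
    """Average the sparse {feature: value} vectors of a committee's members
    to form the cluster centroid (a sketch of Section 3's description)."""
    sums = defaultdict(float)
    for vec in members:
        for feat, val in vec.items():
            sums[feat] += val
    # Divide by the number of members, including members lacking the feature.
    return {feat: total / len(members) for feat, total in sums.items()}

# Hypothetical MI vectors for two committee members of the "state" cluster.
illinois = {"governor of __": 1.2, "__ legislature": 0.9}
michigan = {"governor of __": 1.0, "lake __": 0.8}
c = centroid([illinois, michigan])
```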
    <Paragraph position="2"> CBC consists of three phases. Phase I computes each element's top-k most similar elements. In Phase II, we make a first pass through the data to discover the committees. The goal is to form tight committees (high intra-cluster similarity) that are dissimilar from one another (low inter-cluster similarity) and that together cover the whole similarity space. The method is based on finding sub-clusters among the top-similar elements of every element.</Paragraph>
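Phase I is a straightforward nearest-neighbor computation; a minimal sketch follows, using cosine over hypothetical toy vectors (the paper does not prescribe this exact code, only the top-k step).

```python
import math

def cosine(v1, v2):
    """Cosine of two sparse {feature: value} vectors (helper for the sketch)."""
    dot = sum(v1[f] * v2.get(f, 0.0) for f in v1)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def top_k_similar(vectors, k, sim):
    """Phase I sketch: for each element, keep its k most similar others."""
    out = {}
    for w, vw in vectors.items():
        scored = sorted(((sim(vw, vu), u) for u, vu in vectors.items() if u != w),
                        reverse=True)
        out[w] = [u for _, u in scored[:k]]
    return out

# Toy vectors (hypothetical): wine and beer share the "drink __" context.
vecs = {"wine": {"drink __": 1.0, "sip __": 1.0},
        "beer": {"drink __": 1.0, "brew __": 1.0},
        "car":  {"drive __": 1.0}}
neighbors = top_k_similar(vecs, 1, cosine)
```

Phase II then searches for tight sub-clusters within each element's neighbor list rather than clustering the full similarity matrix, which keeps the committee search local and cheap.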
    <Paragraph position="3"> In the final phase of the algorithm, each word is assigned to its most similar clusters, each represented by a committee. Suppose a word w is assigned to a cluster c. We then remove from w the features that intersect with the features of c. Intuitively, this removes the c sense from w, allowing CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses. The word w is then assigned to its next most similar cluster, and the process is repeated.</Paragraph>
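The assign-then-strip loop of the final phase can be sketched as below. The cosine scoring, the similarity threshold, and the centroid vectors are all hypothetical stand-ins; the paper specifies only the overall procedure of assigning a word and removing the overlapping features.

```python
import math

def cosine(v1, v2):
    """Cosine of two sparse {feature: value} vectors (helper for the sketch)."""
    dot = sum(v1[f] * v2.get(f, 0.0) for f in v1)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def assign_senses(word_vec, committees, threshold=0.1):
    """Final-phase sketch: repeatedly assign the word to its most similar
    remaining committee centroid, then strip the features it shares with
    that centroid so less frequent senses can surface."""
    vec = dict(word_vec)             # work on a copy of the word's vector
    remaining = dict(committees)
    senses = []
    while vec and remaining:
        name, score = max(((n, cosine(vec, c)) for n, c in remaining.items()),
                          key=lambda t: t[1])
        if score < threshold:        # hypothetical stopping criterion
            break
        senses.append(name)
        centroid_feats = remaining.pop(name)
        # Remove the features of the discovered sense from the word.
        vec = {f: v for f, v in vec.items() if f not in centroid_feats}
    return senses

# Hypothetical centroids and a polysemous word vector (cf. Washington,
# which the text notes is both a state and something else).
committees = {
    "state":  {"governor of __": 1.0, "__ legislature": 1.0},
    "person": {"__ said": 1.0, "president __": 1.0},
}
washington = {"governor of __": 2.0, "__ said": 1.0}
senses = assign_senses(washington, committees)
```

After the dominant state sense is assigned and its features removed, the remaining "__ said" feature lets the second sense emerge instead of being drowned out.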
  </Section>
</Paper>