<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3812">
  <Title>Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Clustering is the process of grouping together objects based on their similarity to each other. In the field of Natural Language Processing (NLP), there are a variety of applications for clustering.</Paragraph>
    <Paragraph position="1"> The most popular ones are document clustering in applications related to retrieval and word clustering for finding sets of similar words or concept hierarchies.</Paragraph>
    <Paragraph position="2"> Traditionally, language objects are characterized by a feature vector. These feature vectors can be interpreted as points in a multidimensional space. The clustering uses a distance metric, e.g. the cosine of the angle between two such vectors. As in NLP there are often several thousand features, of which only a few correlate with each other at a time - think about the number of different words as opposed to the number of words occurring in a sentence dimensionality reduction techniques can greatly reduce complexity without considerably losing accuracy.</Paragraph>
    <Paragraph position="3"> An alternative representation that does not deal with dimensions in space is the graph representation. A graph represents objects (as nodes) and their relations (as edges). In NLP, there are a variety of structures that can be naturally represented as graphs, e.g. lexical-semantic word nets, dependency trees, co-occurrence graphs and hyperlinked documents, just to name a few.</Paragraph>
    <Paragraph position="4"> Clustering graphs is a somewhat different task than clustering objects in a multidimensional space: There is no distance metric; the similarity between objects is encoded in the edges. Objects that do not share an edge cannot be compared, which gives rise to optimization techniques. There is no centroid or 'average cluster member' in a graph, permitting centroid-based techniques.</Paragraph>
    <Paragraph position="5"> As data sets in NLP are usually large, there is a strong need for efficient methods, i.e. of low computational complexities. In this paper, a very efficient graph-clustering algorithm is introduced that is capable of partitioning very large graphs in comparatively short time. Especially for small-world graphs (Watts, 1999), high performance is reached in quality and speed. After explaining the algorithm in the next section, experiments with synthetic graphs are reported in section 3. These give an insight about the algorithm's performance. In section 4, experiments on three NLP tasks are reported, section 5 concludes by discussing extensions and further application areas.</Paragraph>
  </Section>
class="xml-element"></Paper>