<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1063">
  <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 499-506, Vancouver, October 2005. (c) 2005 Association for Computational Linguistics. Mining Context Specific Similarity Relationships Using The World Wide Web</Title>
  <Section position="3" start_page="0" end_page="499" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Many modern information management tasks, such as document retrieval, clustering, filtering, and summarization, rely on algorithms that compute similarity between text documents. For example, clustering algorithms, by definition, place documents similar to each other into the same cluster. Topic detection algorithms attempt to detect documents or passages similar to those already presented to the user. &quot;Query by example&quot; retrieval is based on similarity between a document selected as an example and the other documents in the collection. Even the classical retrieval task can be formulated as rank ordering according to the similarity between the (typically very short) document representing the user's query and all the documents in the collection.</Paragraph>
    <Paragraph position="1"> For similarity computation, text documents are represented by the terms (words or phrases) they contain and encoded as vectors according to the predominant vector space model (Salton &amp; McGill, 1983). Each coordinate corresponds to a term (word or phrase) possibly present within a document. Within that model, high similarity between a pair of documents can only be indicated by shared terms. This approach has apparent limitations due to the notorious vocabulary problem (Furnas et al., 1997): people very often use different words to describe semantically similar objects. For example, within a classical vector space model, a similarity algorithm would treat the words car and automobile as entirely different, ignoring the semantic similarity relationship between them.</Paragraph>
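    [Editor's illustration, not part of the original paper: a minimal sketch of the bag-of-words cosine similarity described above, showing how the classical vector space model assigns zero similarity to synonyms that share no terms. All function names here are our own.]

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two bag-of-words term vectors."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# The classical model sees no overlap between synonyms:
print(cosine_similarity("car", "automobile"))  # 0.0
```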
    <Paragraph position="2"> It has been known for a long time that semantic similarity relationships between terms can be discovered from their co-occurrence in the same documents or in the vicinity of each other within documents (van Rijsbergen, 1977). Until the 1990s, studies exploring co-occurrence information for building a thesaurus and using it in automated query expansion (adding similar words to the user query) produced mixed results (Minker et al., 1972; Peat &amp; Willett, 1991). The earlier difficulties may have resulted from the following reasons: 1) The test collections were small, sometimes only a few dozen documents. Thus, there was only a small amount of data available for statistical co-occurrence analysis (mining), not enough to establish reliable associations.</Paragraph>
    <Paragraph position="3"> 2) The evaluation experiments were based on retrieval tasks with short, manually composed queries. The queries were at times ambiguous, and as a result wrong terms were frequently added to the query. For example, the initial query &quot;jaguar&quot; may be expanded with the words &quot;auto&quot;, &quot;power&quot;, and &quot;engine&quot;, since they co-occur with &quot;jaguar&quot; in auto-related documents. But if the user was actually referring to the animal, retrieval accuracy would degrade after the expansion.</Paragraph>
    <Paragraph position="4"> 3) The expansion models were overly simplistic, e.g. merely adding more keywords to Boolean queries (&quot;jaguar OR auto OR power OR car&quot;).</Paragraph>
    <Paragraph position="8"> Although more recent works removed some of these limitations and produced more encouraging results (Grefenstette, 1994; Church et al., 1991; Hearst et al., 1992; Schutze and Pedersen, 1997; Voorhees, 1994), a number of questions remain open: 1) What is the range of the improvement's magnitude? Can the effect be of practical importance? 2) What are the best mining algorithms and formulas? How crucial is choosing them correctly? 3) What is the best way to select a corpus for mining? Specifically, is it enough to mine only within the same collection that is involved in retrieval, clustering, or other processing (the target collection), or would constructing and mining a larger external corpus (such as a subset of the World Wide Web) be of much greater help? 4) Even if the techniques studied earlier are effective (or not) for query expansion within the document retrieval paradigm, are they also effective for the more general task of document similarity computation? Similarity computation underlies almost all information retrieval tasks, including text document retrieval, summarization, clustering, categorization, query by example, etc. Since documents are typically longer than user-composed queries, their vector space representations are much richer, and thus expanding them may be more reliable due to implicit disambiguation.</Paragraph>
    <Paragraph position="9"> Answering these questions constitutes the novelty of our work. We have developed a Context Specific Similarity Expansion (CSSE) technique based on word co-occurrence analysis within pages automatically harvested from the WWW (a Web corpus) and performed extensive testing with the well-known Reuters collection (Lewis, 1997). To test similarity computation accuracy, we designed a simple combinatorial metric that reflects how accurately (as compared to human judgments) the algorithm, given a document in the collection, orders all the other documents in the collection by perceived (computed) similarity. We believe that this metric is more objective and reliable than trying to include all the traditional metrics specific to each application (e.g. recall/precision for document retrieval, type I/II errors for categorization, clustering accuracy, etc.), since the latter may depend on other algorithmic and implementation details of the system. For example, most clustering algorithms rely on the notion of similarity between text documents, but each algorithm (k-means, minimum variance, single link, etc.) follows its own strategy to maximize similarity within a cluster.</Paragraph>
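    [Editor's illustration, not part of the original paper: one plausible instance of a combinatorial ordering metric like the one described above is pairwise ordering agreement, i.e. the fraction of document pairs whose relative order under the computed similarity matches the order given by human judgments. The paper does not give its exact formula; this sketch and its names are hypothetical.]

```python
def ordering_accuracy(computed_sims, human_sims):
    """Fraction of document pairs whose relative order by computed
    similarity agrees with the order given by human judgments.
    Both arguments map document ids to similarity scores for a
    fixed reference document."""
    docs = list(computed_sims)
    agree = total = 0
    for i, d1 in enumerate(docs):
        for d2 in docs[i + 1:]:
            h = human_sims[d1] - human_sims[d2]
            c = computed_sims[d1] - computed_sims[d2]
            if h == 0:
                continue  # humans see no difference; skip the pair
            total += 1
            if h * c > 0:
                agree += 1  # same relative order in both rankings
    return agree / total if total else 1.0
```

    A similarity error in this framing is any pair ordered differently from the human judgment, so error reduction can be measured directly on such pair counts.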
    <Paragraph position="10"> We have found that our CSSE technique reduces similarity errors by up to 50%, twice the improvement obtained from other known techniques such as Latent Semantic Indexing (LSI) and Pseudo Relevance Feedback (PRF) within the same experimental framework. In addition to this dramatic improvement, we have established the importance of the following for the success of the expansion: 1) using an external corpus (a constructed subset of the WWW) in addition to the target collection; 2) taking the context of the target collection into consideration; and 3) using appropriate mining formulas. We suggest that these three crucial components of our technique make it significantly distinct from those explored earlier and also explain our more encouraging results.</Paragraph>
    <Paragraph position="11"> The paper is structured as follows. Section 2 discusses previous research results that are closely related to our investigation. Section 3 presents algorithms implemented in our experiments. Section 4 describes our experiments including error reduction, sensitivity analysis, and comparison with other techniques. Finally, Section 5 concludes the paper by explaining our key contributions and outlining our future research.</Paragraph>
  </Section>
</Paper>