<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1063">
  <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 499-506, Vancouver, October 2005. (c) 2005 Association for Computational Linguistics. Mining Context Specific Similarity Relationships Using The World Wide Web</Title>
  <Section position="4" start_page="499" end_page="500" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> Most prior work mined only the target collection itself and reported results ranging from small improvements to negative effects (degraded performance). Throughout this paper, we refer to this approach as self-mining, to distinguish it from mining an external corpus, which we believe is more promising for computing similarity between documents, for the following intuitive reason. Within the self-mining paradigm, terms t1 and t2 must frequently co-occur in the collection in order to be detected as associated (synonymous). In that case, expanding the representation of document D with term t2 when the document already contains term t1 is statistically unlikely to enrich the representation, since t2 is likely to be in document D anyway. We believe that mining a larger, contextually related external corpus has the potential to discover more interesting associations, and with much higher reliability, than mining the target collection alone. For this reason, this paper focuses on constructing and mining the external corpus.</Paragraph>
    <Paragraph position="1"> Very few studies have used both an external corpus and standard evaluation collections. Grefenstette (1994) automatically built a thesaurus and applied it to query expansion, producing better results than the original queries. Gauch et al. (1998) used one standard collection for mining (TREC4) and another (TREC5) for testing, and achieved a 7.6% improvement. They also achieved a 28.5% improvement on the narrow-domain Cystic Fibrosis collection. Kwok (1998) reported similar results with non-Web TREC collections. Ballesteros and Croft (1998) used unlinked corpora to reduce the ambiguity associated with phrasal and term translation in cross-language retrieval.</Paragraph>
    <Paragraph position="2"> Even fewer studies involve semantic mining on the Web together with a methodical evaluation. Gery and Haddad (1999) used about 60,000 documents from one specific domain to mine similarity among French terms and tested the results using 4 ad hoc queries. Sugiura and Etzioni (2000) developed a tool called Q-Pilot that mined the web pages retrieved by commercial search engines and expanded the user query by adding similar terms. They reported preliminary yet encouraging results but tested only the overall system, which includes other features not directly related to mining, such as clustering, pseudo-relevance feedback, and selection of the appropriate external search engine. Furthermore, they used only the correctness of the engine selection as the evaluation metric. Some other well-known techniques do not explicitly mine for a thesaurus but still capture and use semantic similarity between terms implicitly, namely Latent Semantic Indexing (LSI) and Pseudo Relevance Feedback (PRF). Latent Semantic Indexing (Analysis) (Deerwester et al., 1998), a technique based on Singular Value Decomposition, has been studied in a number of works. It reduces the number of dimensions in the document space, thereby reducing the noise (linguistic variation) and bringing semantically similar terms together; in this way it takes the correlation between terms into account. However, the improvements reported so far have not exceeded 10-15% on standard collections and are sensitive to the choice of the semantic axes (reduced dimensions). The general idea behind Pseudo Relevance Feedback (PRF) (Croft &amp; Harper, 1979), or its more recent variation called Local Context Analysis (Xu &amp; Croft, 2000), is to assume that the top-ranked retrieved documents are relevant and to use certain terms from them for query expansion. 
This simple approach has been found to increase performance by over 23% on the TREC3 and TREC4 collections and has become an integral part of modern IR systems. Although this idea has so far been applied only to user queries, in this study we extend it to similarity computation between documents so that we can compare it with our approach. Although we believe this extension is novel, it is not the focus of this study. It is also worth mentioning that both LSI and PRF fall into the "self-mining" category, since they do not require an external corpus.</Paragraph>
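    The pseudo-relevance feedback idea described above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper or in the cited systems: the function names (tokenize, overlap_score, expand_with_prf), the whitespace tokenizer, the term-overlap ranking, and the parameters top_k and n_terms are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive, assumed tokenizer)."""
    return text.lower().split()


def overlap_score(query_terms, doc_terms):
    """Score a document by how often the distinct query terms occur in it."""
    counts = Counter(doc_terms)
    return sum(counts[t] for t in set(query_terms))


def expand_with_prf(query, docs, top_k=2, n_terms=2):
    """Return the query terms plus n_terms frequent terms from the top_k documents.

    Hypothetical helper: real systems use stronger ranking and term weighting.
    """
    query_terms = tokenize(query)
    tokenized = [tokenize(d) for d in docs]
    # First pass: rank documents by overlap with the original query.
    ranked = sorted(
        tokenized,
        key=lambda d: overlap_score(query_terms, d),
        reverse=True,
    )
    # Pseudo-relevance assumption: treat the top_k documents as relevant
    # and pool their terms, skipping terms already in the query.
    pool = Counter()
    for doc in ranked[:top_k]:
        pool.update(t for t in doc if t not in query_terms)
    # Append the most frequent pooled terms to the query.
    expansion = [term for term, _ in pool.most_common(n_terms)]
    return query_terms + expansion


if __name__ == "__main__":
    toy_docs = [
        "jaguar car engine speed",
        "jaguar car dealer engine",
        "jaguar cat jungle",
    ]
    print(expand_with_prf("jaguar car", toy_docs, top_k=2, n_terms=1))
```

    In this toy run the two car-related documents rank highest for the query "jaguar car", so the term they share ("engine") is added to the query. Passing the text of a document in place of the query gives a rough analogue of the document-level extension mentioned above, again only as an illustration.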
    <Paragraph position="3"> A manually built and maintained ontology (a thesaurus), such as WordNet, may serve as a source of similarity between terms and has been shown to be useful for retrieval tasks (Voorhees, 1994).</Paragraph>
    <Paragraph position="4"> However, one major drawback of the manual approach is the high cost of creating and maintaining the ontology. Moreover, the similarity between terms is context specific. For example, for a campus computer support center the words student, faculty, and user are almost synonyms, but for designers of educational software (e.g., Blackboard), the words student and faculty represent entirely different roles.</Paragraph>
    <Paragraph position="6"> Although the terms "mining", "web mining" and "knowledge discovery" have been used by other researchers in various contexts (Cooley, 1997), we believe it is legitimate to use them to describe our work for two major reasons: 1) we use algorithms and formulas from the data mining field, specifically the signal-to-noise ratio association metric (Church, 1989; Church, 1991); 2) our approach interacts with commercial search engines and harvests web pages contextually close to the target collection, so it involves both mining of resources (the search engine database) and discovery of content (web pages). We acknowledge that the term "mining" may also be used for more sophisticated or different kinds of processing than our approach here.</Paragraph>
  </Section>
</Paper>