<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1063"> <Title>Mining Context Specific Similarity Relationships Using The World Wide Web</Title> <Section position="5" start_page="500" end_page="502" type="metho"> <SectionTitle> 3 Algorithms And Implementations </SectionTitle> <Paragraph position="0"> The target collection (Reuters in our experiment) is indexed, and its most representative terms are used to construct a corpus from an external source (e.g., the World Wide Web). The term-to-term similarity matrix is created by co-occurrence analysis within that corpus and is subsequently used to expand document vectors in order to improve the accuracy (correctness) of similarity computation between the documents in the target collection. Although in this work we do not study the effects on individual applications of similarity computation, accurate similarity computation is crucial for tasks such as retrieval, clustering, categorization, and topic detection.</Paragraph> <Section position="1" start_page="500" end_page="501" type="sub_section"> <SectionTitle> 3.1 Building a Web Corpus </SectionTitle> <Paragraph position="0"> We designed and implemented a heuristic algorithm that takes advantage of the capabilities provided by commercial web search engines. In our study, we used AltaVista (www.altavista.com), but most other search engines would also qualify for the task.</Paragraph> <Paragraph position="1"> Ideally, we would like to obtain web pages that contain the terms from the target collection in a similar context. While constructing the Web corpus, our spider automatically sends a set of queries to AltaVista and obtains the resulting URLs. The spider creates one query for each term ti out of the 1000 most frequent terms in the target collection (stop words excluded) according to the following formula:</Paragraph> <Paragraph position="2"> query(ti) = +"ti" context_hint ,</Paragraph> <Paragraph position="3"> where quotation marks are used to represent text strings literally and context_hint is composed of the top most frequent terms in the target collection (stop words excluded), separated by spaces. Although this way of defining context may seem a bit simplistic, it still worked surprisingly well for our purpose.</Paragraph> <Paragraph position="4"> According to AltaVista, a word or phrase preceded by the '+' sign has to be present in the search results. The presence of the other words and phrases (the context hint string in our case) is desirable but not required. The total number of context hint terms (108 in this study) is limited by the maximum length of the query string that the search engine accepts.</Paragraph> <Paragraph position="5"> We chose to use only the top 1000 terms for constructing the corpus in order to keep the downloading time manageable.</Paragraph> <Paragraph position="6"> We believe that using a larger corpus would demonstrate an even larger improvement. Approximately 10% of those terms were phrases. We used only the top 200 hits from each query and only the first 20 Kbytes of HTML source from each page when converting it to plain text.
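The query construction just described is simple enough to sketch concretely. The following is a minimal illustration (ours, not the authors' code) of how such queries could be assembled, assuming top_terms holds the 1000 most frequent non-stop-word terms of the target collection and context_hint_terms the 108 most frequent ones; the search-engine call is left as a hypothetical fetch_top_urls callable, since the AltaVista interface used in the paper is no longer available.

```python
# Illustrative sketch of the corpus-building queries described above.
# Assumptions: top_terms and context_hint_terms come from indexing the target
# collection; fetch_top_urls(query, max_hits) is a hypothetical search wrapper.

MAX_HITS_PER_QUERY = 200        # top hits kept per query, as in the paper
MAX_PAGE_BYTES = 20 * 1024      # only the first 20 KB of HTML is converted to text

def build_query(term, context_hint_terms):
    """Build one query of the form  +"term" hint1 hint2 ...  where the quoted
    term is required and the context-hint terms are merely preferred."""
    hint = " ".join(context_hint_terms)
    return f'+"{term}" {hint}'

def collect_corpus_urls(top_terms, context_hint_terms, fetch_top_urls):
    """Issue one query per frequent term and pool the resulting URLs."""
    urls = set()                           # duplicates across queries are dropped
    for term in top_terms:                 # e.g. the 1000 most frequent terms
        query = build_query(term, context_hint_terms)
        urls.update(fetch_top_urls(query, max_hits=MAX_HITS_PER_QUERY))
    return urls
```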
After removing duplicate URLs and empty pages, we had 19,198 pages in the Web corpus to mine.</Paragraph> <Paragraph position="7"> Downloading took approximately 6 hours and was performed in parallel, spawning up to 20 Java processes at a time, but it still remained the largest scalability bottleneck.</Paragraph> </Section> <Section position="2" start_page="501" end_page="501" type="sub_section"> <SectionTitle> 3.2 Semantic Similarity Discovery </SectionTitle> <Paragraph position="0"> CSSE performs co-occurrence analysis at the document level and computes the following values: df(t1, t2), the joint document frequency, i.e., the number of web pages in which both terms t1 and t2 occur; and df(t), the document frequency of the term t, i.e., the number of web pages in which the term t occurs. Then, CSSE applies a well-known signal-to-noise ratio formula from data mining (Church, 1991) to establish the similarity between terms t1 and t2: sim(t1, t2) = log( N * df(t1, t2) / ( df(t1) * df(t2) ) ) / log N , (1) where N is the total number of documents in the mining collection (corpus), and log N is the normalizing factor, so that the sim value does not exceed 1 and is comparable across collections of different sizes.</Paragraph> <Paragraph position="1"> Based on suggestions from other studies using formula (1), before running our tests we decided to discard as spurious all co-occurrences that occurred in only one or two pages, as well as all similarities below a specified threshold (Thresh).</Paragraph> </Section> <Section position="3" start_page="501" end_page="502" type="sub_section"> <SectionTitle> 3.3 Vector Expansion </SectionTitle> <Paragraph position="0"> Since we were modifying document vectors (the more general case), rather than queries as in the majority of prior studies, we refer to the process as vector expansion. As noted in the literature review, there are many possible heuristic ways to perform vector expansion. After preliminary tests, we settled on the simple linear modification with post re-normalization presented below. The context of the target collection is represented by the similarity matrix sim(t1, t2), mined as described in the preceding section. Our vector expansion algorithm adds all the related terms to the vector representation of the document D, with weights proportional to the degree of the relationship and the global inverse document frequency:</Paragraph> <Paragraph position="1"> w'(t, D) = w(t, D) + a * idf(t) * SUM_{t' in D} sim(t, t') ,</Paragraph> <Paragraph position="2"> where w(t, D) is the initial, non-expanded weight of the term t in the document D (assigned according to the TF-IDF weighting scheme in our case); w'(t, D) is the modified weight of the term t in the document D; t' iterates through all (possibly repeating) terms in the document D; and a is the adjustment factor (a parameter controlled in the expansion process).</Paragraph> </Section> </Section> <Section position="6" start_page="502" end_page="504" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="502" end_page="504" type="sub_section"> <SectionTitle> 4.1 Similarity Error Reduction </SectionTitle> <Paragraph position="0"> Since in this study we were primarily concerned with improving similarity computation rather than retrieval per se, we chose the Reuters collection (Lewis, 1997), widely used for text categorization, over TREC or similar collections with relevance judgments.
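Before turning to the evaluation, the mining and expansion steps of Sections 3.2 and 3.3 can be summarized in a short sketch. This is an illustration under our assumptions, not the authors' implementation: each web page is assumed to be already reduced to a set of terms, document vectors are sparse dicts of TF-IDF weights, idf values are available, and the expansion follows the linear update reconstructed above; pairs co-occurring in fewer than three pages and similarities below Thresh are discarded, as described in Section 3.2.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def build_similarity_matrix(pages, thresh, min_pages=3):
    """Formula (1): sim(t1,t2) = log(N*df(t1,t2)/(df(t1)*df(t2))) / log N,
    keeping only pairs co-occurring in at least `min_pages` pages and with
    similarity >= thresh.  `pages` is an iterable of term sets (N > 1 assumed)."""
    pages = [set(p) for p in pages]
    n = len(pages)
    df = Counter()        # document frequency of each term
    df_pair = Counter()   # joint document frequency of each term pair
    for terms in pages:
        df.update(terms)
        df_pair.update(combinations(sorted(terms), 2))
    sim = defaultdict(dict)
    for (t1, t2), df12 in df_pair.items():
        if df12 < min_pages:
            continue                                  # spurious co-occurrence
        s = math.log(n * df12 / (df[t1] * df[t2])) / math.log(n)
        if s >= thresh:
            sim[t1][t2] = s
            sim[t2][t1] = s
    return sim

def expand_vector(doc_terms, weights, sim, idf, a):
    """Linear expansion with post re-normalization, following the update
    reconstructed above: w'(t,D) = w(t,D) + a * idf(t) * sum_{t' in D} sim(t,t')."""
    new_w = dict(weights)                  # weights: term -> TF-IDF weight w(t, D)
    for t_prime in doc_terms:              # possibly repeating terms of D
        for t, s in sim.get(t_prime, {}).items():
            new_w[t] = new_w.get(t, 0.0) + a * idf.get(t, 0.0) * s
    norm = math.sqrt(sum(w * w for w in new_w.values())) or 1.0
    return {t: w / norm for t, w in new_w.items()}    # re-normalize to unit length
```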
We used a modified version of Lewis' (1992) suggestion to derive our evaluation metric, which is similar to the metric derived from Kruskal-Goodman statistics used by Haveliwala et al. (2002) in a study of the Yahoo web directory (www.yahoo.com). Intuitively, the metric reflects the probability of the algorithm guessing the correct order (the ground truth) imposed by a manually created hierarchy (simplified to a partition in the Reuters case). Ideally, for each document D, the similarity computation algorithm should indicate that documents sharing one or more Reuters categories with D are more similar to D than documents not sharing any categories with D. We formalized this intuitive requirement into a metric in the following way. Let us define the test set Sa as the set of all document triples (D, D1, D2) such that D != D1, D != D2, D1 != D2, and furthermore D shares at least one category with D1 but no categories with D2. We defined the total error count (Ec) as the number of triples in the test set Sa such that sim(D, D1) < sim(D, D2), since it should be the other way around. Our accuracy metric reported below is the total error count normalized by the size of the test set Sa: similarity error = Ec / #Sa, computed for each Reuters topic and averaged across all of them. The metric ranges from 0 (ideal case) to .5 (random ordering). It also needed an adjustment to provide the necessary continuity, as justified in the following. Since the documents are represented by very sparse vectors, quite often (in about 5% of all triples) the documents D, D1, D2 have no terms in common, and the similarity computation results in a tie: sim(D, D1) = sim(D, D2). A tie cannot be counted as a full error: if it were, one could claim a trivial improvement to the similarity algorithm simply by breaking ties at random in either direction with equal chance, which would remove the error in 50% of all ties. This is why the metric counts half of all ties as errors, which completely removes this discontinuity.</Paragraph> <Paragraph position="1"> We used all 78 Reuters topics from the &quot;commodity code&quot; group, since they are the most &quot;semantic&quot;, and did not try the others (Economic Indicator Codes, Currency Codes, Corporate Codes).</Paragraph> <Paragraph position="2"> We discarded the topics that had only one document and used only the documents that had at least one of the topics. This reduced our test collection to 1841 documents, still statistically powerful and computationally demanding, since millions of triples had to be considered (even after some straightforward algorithmic optimizations). After indexing and stemming (Porter, 1980), the total number of unique stems used for the vector representation was 11,461.</Paragraph> <Paragraph position="3"> Table 1 shows the original (non-expanded) similarity errors for the different weighting schemes we tried first in our experiment. Since TF-IDF weighting was by far the best in this evaluation setup, we limited our expansion experiments to the TF-IDF scheme only. As the similarity measure between document vectors, we used the most common one, negative Euclidean distance, after normalizing the vectors to unit length. It can be shown that the cosine metric (dot product), the other popular measure, results in the same ordering and thus the same similarity error.
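A small sketch may make the error metric concrete (again ours, not the authors' code). It assumes docs maps document ids to unit-length sparse vectors, topics maps document ids to sets of Reuters categories, and similarity is the dot product (which, as noted above, orders pairs exactly like negative Euclidean distance on normalized vectors). As a simplification, it computes the metric over all triples at once rather than per topic, and it enumerates triples by brute force, whereas the paper relies on algorithmic optimizations.

```python
from itertools import permutations

def dot(u, v):
    """Dot product of two sparse vectors (dicts)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def similarity_error(docs, topics):
    """Fraction of triples (D, D1, D2) in which the category-sharing D1 is
    ranked below the unrelated D2; ties count as half an error.
    Returns 0 in the ideal case and about 0.5 for random ordering."""
    errors = 0.0
    total = 0
    ids = list(docs)
    for d, d1, d2 in permutations(ids, 3):        # D, D1, D2 all distinct
        if not (topics[d] & topics[d1]):          # D1 must share a category with D
            continue
        if topics[d] & topics[d2]:                # D2 must share none
            continue
        total += 1
        s1, s2 = dot(docs[d], docs[d1]), dot(docs[d], docs[d2])
        if s1 < s2:
            errors += 1.0
        elif s1 == s2:
            errors += 0.5                         # tie counted as half an error
    return errors / total if total else 0.0
```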
Without normalization or stemming, the errors were almost twice as large.</Paragraph> <Paragraph position="4"> Although we varied the adjustment parameter a in our experiment, for better interpretation we plotted our primary metric (average error reduction) as a function of Ca, the average Euclidean distance between the original and the modified document vectors when both are normalized to unit length. Ca serves as a convenient parameter controlling the degree of change in the document vectors, better than a, because the same value of a may result in different changes depending on the term-to-term similarity matrix sim(t1, t2). In theory, Ca varies from 0 (no change) to sqrt(2), the case of maximum possible change (no common terms between the initial and expanded representations). By varying the adjustment factor a from 0 to 10 and higher, we observed almost the entire theoretical range of Ca: starting from negligible change and going all the way to sqrt(2), where the added terms entirely dominated the original ones. The average number of terms in a document representation was in the 60-70 range before expansion and in the 200-300 range after expansion. This, of course, increased the computational burden. Nevertheless, even after the expansion the vector representations remained sparse, and we were able to design and implement some straightforward algorithmic improvements that take advantage of this sparsity to keep processing time manageable. The expansion of the entire Reuters collection took less than one minute on a workstation with a Pentium III 697 MHz processor and 256 MB of RAM, with all the sparse document representations and the similarity matrix stored in primary memory. This renders the expansion suitable for online processing.</Paragraph> <Paragraph position="5"> To evaluate the performance of each technique, we used the error reduction (%) relative to the baseline shown in Table 1 (TF-IDF column), averaged across all the topics; this baseline corresponds to the lowest original, non-expanded similarity error. Figure 1 shows the error reduction as a function of Ca for various values of Thresh. We stopped increasing Ca once the improvement dropped below -10% to save testing time. Several facts can be observed from the results: 1) For Thresh in the mid range (.2-.4), the error reduction is very stable and reaches 50%, which is very large compared with the other known techniques we used for comparison, as discussed below. The effect is also comparable with the difference between various weighting functions (Table 2), which we believe renders the improvement practically significant.</Paragraph> <Paragraph position="6"> 2) For small thresholds (Thresh < .1), the effect is not as stable, possibly because many non-reliable associations are involved in the expansion.</Paragraph> <Paragraph position="7"> 3) Larger thresholds (Thresh > .4) are also not very reliable, since they result in a small number of associations being created and thus require large values of the adjustment parameter a to produce substantial average changes in the document vectors (Ca), which changes some document vectors too drastically.</Paragraph> <Paragraph position="8"> 4) The error reduction curve is unimodal: it starts at 0 for small Ca, since the document vectors hardly change, and grows to a maximum for Ca somewhere in the relatively wide .1-.5 range.
Then it decreases, because the document vectors may drift too far from the original ones, and it falls below 0 for some large values of Ca.</Paragraph> <Paragraph position="9"> 5) For thresholds (Thresh) of .2 and .3, the effect stays positive even for large values of Ca, which is an interesting phenomenon, because the document vectors are then almost entirely replaced by their expanded representations.</Paragraph> <Paragraph position="10"> Some sensitivity of the results to the parameters Thresh and Ca is a limitation, but one shared by virtually all modern IR improvement techniques. Indeed, Latent Semantic Indexing (LSI) needs the number of semantic axes to be set correctly, otherwise performance may degrade. Pseudo Relevance Feedback (PRF) depends on several parameters, such as the number of documents used for feedback, the adjustment factor, etc. All previously studied expansion techniques depend on an adjustment factor as well. The specific choice of parameters for real-life applications is typically made manually by trial and error or by following a machine learning approach: splitting the data into training and testing sets. Based on the above results, a similarity threshold (Thresh) in the .2-.4 range combined with Ca in the .1-.5 range seems safe: it does not degrade performance and is likely to improve it significantly (by 20-50%). The performance curve being unimodal with respect to both Ca and Thresh also makes it easier to tune by looking for the maximum.</Paragraph> <Paragraph position="11"> Although we have involved only one test collection in this study, this collection (Reuters) varies greatly in the content and size of its documents, so we hope our results will generalize to other collections. We also verified that the effect typically diminishes when the size of the mining collection (corpus) is reduced by random sub-sampling. The results were also similar to those obtained 4 months earlier, although only 80% of the pages in the mining corpus remained.</Paragraph> <Paragraph position="12"> (Figure caption fragment: error reduction as a function of the average vector change due to Pseudo Relevance Feedback for several cut-off numbers Nc.)</Paragraph> </Section> <Section position="2" start_page="504" end_page="504" type="sub_section"> <SectionTitle> 4.2 Sensitivity Analysis </SectionTitle> <Paragraph position="0"> To test the importance of the context, we removed the &quot;context hint&quot; terms from the queries used by our agent and created another (less context specific) corpus for mining. We obtained 175,336 unique URLs, many more than when using the &quot;context hint&quot; terms, since the overlap between the results of different queries was much smaller. We randomly selected 25,000 of these URLs and downloaded the referred pages. Then, to make the comparison more objective, we randomly selected 19,198 of the non-empty downloaded pages (the same number as with the context hint). We mined the similarity relationships from the selected documents in the same way as described above. The resulting improvement (shown in Figure 2) was indeed much smaller (13% and less) than with the &quot;context hint&quot; terms. It also degrades much more quickly for larger Ca and is more sensitive to the choice of Thresh. This may explain why mixed results have been reported in the literature when the similarity thesaurus was constructed in a very general setting rather than specifically with the target collection in mind.
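For completeness, the only changes relative to the corpus-building sketch shown after Section 3.1 are the query string and a size-matching random sample; a minimal illustration, reusing the hypothetical helpers introduced there and assuming the no-hint crawl yields a list of non-empty pages:

```python
import random

def build_query_no_hint(term):
    """No-hint variant of the earlier (hypothetical) build_query sketch:
    only the quoted term itself is required."""
    return f'+"{term}"'

def size_matched_sample(pages, target_size=19198, seed=0):
    """Sample the larger no-hint corpus down to the same number of pages as
    the context-hint corpus, so that the two mining runs are comparable."""
    rng = random.Random(seed)          # fixed seed only for reproducibility (our choice)
    if len(pages) <= target_size:
        return list(pages)
    return rng.sample(pages, target_size)
```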
It is also interesting to note the similar behavior of the error reduction as a function of Ca and Thresh: it is unimodal, with the maximum in approximately the same range of arguments. This may also serve as indirect evidence of the stability of the effect (even if it is smaller in this case) with respect to the parameters involved.</Paragraph> <Paragraph position="1"> To verify the importance of using an external corpus vs. self-mining, we mined the similarity relationships from the same collection (Reuters) that we used for the tests (the target collection), using the same mining algorithms. Figure 3 shows that the effect of such &quot;self-mining&quot; is relatively modest (up to 20%), confirming that using the external corpus (the Web in our approach) was crucial. Again, the behavior of the error reduction (even smaller in this case) with respect to Ca and Thresh is similar to that of context specific web corpus mining.</Paragraph> </Section> <Section position="3" start_page="504" end_page="504" type="sub_section"> <SectionTitle> 4.3 Comparison with Other Techniques </SectionTitle> <Paragraph position="0"> Figure 4 shows the similarity error reduction as a function of the number of semantic axes when LSI is applied. The effect on the entire collection (second column) is always negative. So, in our experimental setup the Reuters collection was found not to be a good application for the LSI technique, possibly because many of the topics already have small errors even before applying LSI. To verify our implementation and the applicability of LSI to similarity computation, we applied it only to the &quot;tougher&quot; 26 topics, those in the upper half when ordered by the original similarity error. As Figure 4 reveals, LSI is effective in that case for numbers of semantic axes comparable with the number of topics in the target collection. Our findings are well in line with those reported in prior research.</Paragraph> <Paragraph position="1"> We adapted the classic Pseudo Relevance Feedback algorithm (Qiu, 1993), which has so far been applied only to document retrieval tasks, to similarity computation in a straightforward way, and we also tried several variations of it (not described here due to lack of space). Figure 5 shows the effect as a function of the adjustment factor a for various cut-off parameters Nc (the number of top-ranked documents used for feedback). The effect reaches a maximum of around 21%, consistent with the results reported in prior research. The improvement is close in magnitude to the one due to the &quot;self-mining&quot; described above. We do not claim that our approach is better than PRF, since such a comparison is not entirely meaningful given the number of parameters and implementation details involved in both. More importantly, the two techniques rely on different sources of data: PRF is a &quot;self-mining&quot; approach, while CSSE builds and mines an external corpus. Thus, CSSE can be used in addition to PRF.</Paragraph> </Section> </Section> </Paper>