<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4031"> <Title>Computational Linkuistics: word triggers across hyperlinks</Title>
<Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Given the size of the Web, it is intuitively very hard to find a given page of interest just by following links. Classic results have shown, however, that the link structure of the Web is not random. Various models have been proposed, including power-law distributions (the "rich get richer" model) and lexical models. In this paper, we investigate how the presence of a given word in a given Web document $d_0$ affects the presence of the same word in documents linked to $d_0$. We will use the term Computational Linkuistics to describe the study of hyperlinks for Document Modeling and Information Retrieval purposes.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Link structure of the Web </SectionTitle>
<Paragraph position="0"> Random graphs have been studied by Erdős and Rényi (Erdős and Rényi, 1960). In a random graph, edges are added sequentially, with both vertices of each new edge chosen at random.</Paragraph>
<Paragraph position="1"> The diameter $d$ of the Web (that is, the average number of links from any given page to another) has been found to be approximately constant at $d \approx 18.59$, whereas in a random graph $d = \log N / \log k$, where $N$ is the number of documents on the Web and $k$ is the average document out-degree (i.e., the number of pages linked from the document). This result was described in (Barabási and Albert, 1999) and is based on a corpus of 800 M web pages. This estimate of $d$ would entail that in a random graph model (where $N = k^{d}$), the size of the Web would be approximately $k^{18.59}$, which is 10 M times its actual size. Clearly, a random graph model is not an appropriate description of the Web. Instead, it has been shown that due to preferential attachment (Barabási and Albert, 1999), the out-degree distribution follows a power law. The preferential model makes it more likely that a new random edge will connect to vertices that already have a high degree. Specifically, the degree of pages is distributed according to $P(\mathrm{deg} = k) \propto 1/k^{\gamma}$, where $\gamma$ is a constant strictly greater than 0. (Note that this is different from the degree distribution of a random graph, which is Poisson and thus decays like $1/k!$.) As a result, random walks on the Web graph soon reach well-connected nodes.</Paragraph> </Section>
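The preferential-attachment process is easy to simulate. The following Python sketch is ours, not from the paper or from (Barabási and Albert, 1999); the seven-links-per-new-page growth rule, the corpus size, and the printed degree buckets are illustrative assumptions chosen only to make the heavy tail visible.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes=100_000, edges_per_node=7, seed=0):
    """Grow a graph where each new page links to existing pages with
    probability proportional to their current degree."""
    rng = random.Random(seed)
    targets = [0]          # node 0 seeds the graph
    degree = Counter()
    for new_node in range(1, n_nodes):
        # `targets` holds one entry per edge endpoint, so a uniform draw
        # from it picks an existing node proportionally to its degree.
        for old_node in {rng.choice(targets) for _ in range(edges_per_node)}:
            degree[old_node] += 1
            degree[new_node] += 1
            targets.extend([old_node, new_node])
    return degree

degree = preferential_attachment()
hist = Counter(degree.values())
# Under P(deg = k) ~ 1/k^gamma these counts fall roughly on a straight
# line in log-log coordinates, unlike the Poisson decay of a random graph.
for k in (8, 16, 32, 64, 128, 256):
    print(f"deg {k:>3}: {hist.get(k, 0)} pages")
```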
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.2 Lexical structure of the Web </SectionTitle>
<Paragraph position="0"> Davison (Davison, 2000) discusses the topical locality hypothesis, namely that new edges are more likely to connect pages that are semantically related. In Davison's experiment, semantic and link distances between pairs of pages from a 100 K page corpus were computed. Davison describes results associating TF*IDF cosine similarity (Salton and McGill, 1983) with link hop distance. He reports that the cosine similarity between pages selected at random from his corpus is 0.02, whereas that number increases significantly for topologically related pages: 0.31 for pages from the same Web domain, 0.23 for linked pages, and 0.19 for sibling pages (pages pointed to by the same page).</Paragraph>
<Paragraph position="1"> Menczer (Menczer, 2001) introduces the link-content conjecture, which states that the semantic content of a web page can be inferred from the pages that point to it.</Paragraph>
<Paragraph position="2"> Menczer uses a corpus of 373 K pages and employs a non-linear least squares fit to come up with a semantic model connecting the cosine-based semantic similarity $\sigma$ of two pages $p$ and $p'$ with $\delta(p, p')$, the shortest directed distance on the hypertext graph from $p$ to $p'$. Menczer reports that $\sigma$ and $\delta$ are connected via a power law: $\sigma(\delta) = \alpha\,\delta^{-\gamma} + \sigma_{\infty}$, where $\sigma_{\infty}$ represents the noise level in similarity. Menczer reports empirically determined values of the parameters of this fit.</Paragraph>
<Paragraph position="3"> Menczer's results further confirm Davison's observation that pages adjacent to a given page in hyperlink space are semantically connected.</Paragraph>
<Paragraph position="4"> Our idea has been to investigate the circumstances under which the semantic similarity between linked pages can be explained in terms of the presence of individual words across links.</Paragraph> </Section>
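Davison's measure is the standard TF*IDF-weighted cosine. As a concrete illustration, here is a minimal Python sketch of that computation; it is our code, not Davison's, the raw-tf times log-idf weighting is one common variant rather than his documented choice, and the toy documents are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: tf*idf} dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "the web graph has a power law degree distribution".split(),
    "linked pages share topical content on the web".split(),
    "poisson mixtures model word frequency in documents".split(),
]
vecs = tfidf_vectors(docs)
# Topically related pair vs. unrelated pair, mirroring Davison's contrast.
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```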
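Menczer's parameters come from a non-linear least squares fit of the decay curve. The sketch below performs such a fit with scipy.optimize.curve_fit; the functional form follows our hedged reconstruction $\sigma(\delta) = \alpha\,\delta^{-\gamma} + \sigma_{\infty}$ above, and the "true" parameter values and the noisy $(\delta, \sigma)$ observations are synthetic stand-ins for his 373 K page measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(delta, alpha, gamma, sigma_inf):
    # sigma(delta) = alpha * delta^-gamma + sigma_inf (noise floor)
    return alpha * delta ** -gamma + sigma_inf

rng = np.random.default_rng(0)
delta = np.arange(1.0, 11.0)                  # link hop distances 1..10
observed = decay(delta, 0.3, 1.2, 0.02) + rng.normal(0.0, 0.005, delta.size)

params, _ = curve_fit(decay, delta, observed, p0=(0.1, 1.0, 0.01))
alpha, gamma, sigma_inf = params
print(f"alpha={alpha:.3f} gamma={gamma:.3f} sigma_inf={sigma_inf:.3f}")
```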
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.3 Document modeling </SectionTitle>
<Paragraph position="0"> In the computational linguistics and speech communities, the notion of a language model is used to describe a probability distribution over words. Since a cluster of documents contains a subset of an entire language, a document model is a special case of a language model. As such, it can be expressed as a conditional probability distribution indicating how likely a word is to appear in a document given some context (e.g., other similar documents, the topic of the document, etc.). Language models are used in speech recognition (Chen and Goodman, 1996), document indexing (Bookstein and Swanson, 1974; Croft and Harper, 1979), and information retrieval (Ponte and Croft, 1998).</Paragraph>
<Paragraph position="1"> One property of document models is that they can be used to predict lexical properties of textual documents, e.g., the frequency of a certain word. Mosteller and Wallace (Mosteller and Wallace, 1984) discovered that content words are "bursty": the appearance of a content word significantly increases the probability that the same word will appear again. Church and his colleagues (Church and Gale, 1995; Church, 2000) describe document models based on the distribution of the frequencies of individual words over large document collections. In (Church and Gale, 1995), Church and Gale compare document models based on the Poisson distribution, the 2-Poisson distribution (Bookstein and Swanson, 1974), as well as generic Poisson mixtures. A Poisson mixture assigns probability $P(m) = \int_0^{\infty} \varphi(\lambda)\, e^{-\lambda} \frac{\lambda^{m}}{m!}\, d\lambda$ to a given non-negative integer frequency $m$, where $\varphi$ is the mixing density. Church and Gale empirically show that Poisson mixtures are a more accurate model for describing the distribution of words in documents within a corpus. They obtain the best fits with the Negative Binomial model and the K-mixture, both special cases of Poisson mixtures (Church and Gale, 1995). In the Negative Binomial case, the mixing density $\varphi$ is the Gamma distribution, whereas in the K-mixture, $\varphi(\lambda) = (1 - \alpha)\,\delta(\lambda) + \frac{\alpha}{\beta}\, e^{-\lambda/\beta}$, where $\delta(\cdot)$ is Dirac's delta function.</Paragraph>
<Paragraph position="2"> Our study focuses on modeling across hyperlinks. Documents linked across the web are often written by people with different backgrounds and language usage patterns.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.4 Link-based document models </SectionTitle>
<Paragraph position="0"> In Church et al.'s experiments, the documents being modeled do not have hyperlinks between them. When modeling hyperlinked corpora, it is important to decompose the document model into link-free and link-dependent components. The link-free component predicts the probability of a word $w$ appearing in a document $D$ regardless of the documents that point to $D$. The link-dependent component makes use of a particular incarnation of the link-content conjecture, namely micro link-content dependency (MLD), which we propose in this paper.</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.5 Our framework </SectionTitle>
<Paragraph position="0"> In traditional Information Retrieval, the main object that is represented, and searched, is the document. In our setup, we will instead be looking at the hyperlink between two documents as the main object to retrieve. If a page $p_1$ points to page $p_2$ via link $l$, we treat $l$ as the object to index and the two pages that it links as features describing the link.</Paragraph>
<Paragraph position="1"> For our experiments, we used the 2-Gigabyte wt2g corpus (Hawking, 2002), which contains 247,491 Web documents connected by 3,118,248 links. These documents contain 948,036 unique words (after Porter-style stemming).</Paragraph> </Section> </Section> </Paper>