<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4031">
  <Title>Computational Linkuistics: word triggers across hyperlinks</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 A link-based document model
</SectionTitle>
    <Paragraph position="0"> It is well known that the distributions of words in text depend on many factors such as genre, topic, author, etc.</Paragraph>
    <Paragraph position="1"> Certain words with high content have been found to &quot;trigger&quot; other words to appear. Interestingly, the hyperlinks that connect text on the Web may also affect the word distributions in hypertext. For example, if page $p_1$, which contains education, points to page $p_2$, then we would expect a higher probability of seeing education in page $p_2$ than in a random page. This experiment was designed to discover how the links between pages can trigger words and change the word distributions.</Paragraph>
    <Paragraph position="2"> For each stemmed word in wt2g, we compute the following numbers: PagesContainingWord = how many pages in the collection contain the word.</Paragraph>
    <Paragraph position="3"> OutgoingLinks = the total number of outgoing links in all the pages that contain the word.</Paragraph>
    <Paragraph position="4"> LinkedPagesContainingWord = how many of the linked pages contain the word.</Paragraph>
    <Paragraph position="5"> For the latter two measures, only the links inside the collection were considered.</Paragraph>
    <Paragraph position="6"> The probability of a word $w$ appearing in a random page $p_1$ is computed as $\text{Prior}(w) = \frac{\text{PagesContainingWord}}{\text{Total Pages}}$,</Paragraph>
    <Paragraph position="8"> where Total Pages = 247,491. If $p_1$ contains the word $w$ and points to a new page $p_2$, then the probability of the word $w$ appearing in $p_2$ is computed as $\text{Posterior}(w) = \frac{\text{LinkedPagesContainingWord}}{\text{OutgoingLinks}}$.</Paragraph>
    <Paragraph position="10"> We are interested in the ratio of posterior over prior probability for each stemmed word and would like to see if there is any interesting relationship between this ratio and other linguistic features.</Paragraph>
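    <Paragraph> The counts and probabilities defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name link_effect, the input layout (a dict of page ids to sets of stemmed words plus a list of link pairs), and the toy collection are all assumptions made for the example. It also assumes the prior is PagesContainingWord divided by the number of pages and the posterior is LinkedPagesContainingWord divided by OutgoingLinks.

```python
from collections import defaultdict


def link_effect(pages, links):
    """Compute (prior, posterior, posterior / prior) for every word.

    pages: dict mapping a page id to the set of stemmed words on that page.
    links: list of (source, target) page-id pairs.
    Both structures are hypothetical stand-ins for a real collection.
    """
    total_pages = len(pages)
    pages_containing = defaultdict(int)   # PagesContainingWord
    outgoing = defaultdict(int)           # OutgoingLinks
    linked_containing = defaultdict(int)  # LinkedPagesContainingWord

    for words in pages.values():
        for w in words:
            pages_containing[w] += 1

    for src, dst in links:
        # Only links inside the collection are considered.
        if src not in pages or dst not in pages:
            continue
        for w in pages[src]:
            outgoing[w] += 1
            if w in pages[dst]:
                linked_containing[w] += 1

    stats = {}
    for w, count in pages_containing.items():
        if outgoing[w] == 0:
            continue  # posterior is undefined without outgoing links
        prior = count / total_pages
        posterior = linked_containing[w] / outgoing[w]
        stats[w] = (prior, posterior, posterior / prior)
    return stats


# Toy collection (invented for illustration only).
toy_pages = {
    "a": {"educ", "school"},
    "b": {"educ"},
    "c": {"sport"},
    "d": {"sport", "school"},
}
toy_links = [("a", "b"), ("a", "c"), ("b", "a"), ("c", "d"), ("d", "zzz")]
toy_stats = link_effect(toy_pages, toy_links)
```

In the toy data, the link to the unknown page "zzz" is skipped because it leaves the collection, mirroring the restriction to links inside the collection.</Paragraph>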
    <Paragraph position="11"> We will look at the ratio $\text{Posterior}(w) / \text{Prior}(w)$, the link effect of $w$, alongside the inverse document frequency (IDF) of $w$, which is derived from $df(w)$ and $N$, where $df(w)$ is the document frequency (fraction of all documents containing $w$) and $N$ is the number of documents in the collection.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Results and Discussion
</SectionTitle>
      <Paragraph position="0"> Table 1 shows the different measures for the 2000 words with lowest IDF. Each line shows the average values on a chunk of 100 words.</Paragraph>
      <Paragraph position="1"> As one can see in the table, the posterior probabilities are always higher than the prior probabilities. Hypothesis testing shows that the difference between prior and posterior is statistically significant, which verifies our assumption. It is noticeable that the link effect has the same trend as the IDF values. The correlation coefficient of these two columns is 0.7112. It is customary to use IDF as an indicator of words' content. Low IDF usually implies a low content value. We would like to investigate whether link effect can be used instead of IDF for certain IR tasks. Consider the sample words &quot;between&quot; and &quot;american&quot; in Table 1. Intuitively, &quot;american&quot; has more content than &quot;between&quot;, but the latter has an IDF of 2.37, higher than that of the former (2.36). However, their link effects agree with intuition: &quot;american&quot; has a link effect of 1.97, one standard deviation higher than that of &quot;between&quot; (1.40).</Paragraph>
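      <Paragraph> The correlation between the IDF column and the link effect column can be illustrated with a small sketch. It assumes the reported coefficient is a Pearson correlation, which the text does not state explicitly. The function name pearson and the toy columns are invented for the example and are not values from Table 1.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("sequences must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares; the 1/n factors cancel out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy columns (invented): one value per chunk of words.
toy_idf = [1.0, 2.0, 3.0, 4.0]
toy_link_effect = [1.2, 1.9, 3.2, 3.9]
r = pearson(toy_idf, toy_link_effect)
```

A coefficient close to 1 means the two columns rise together, which is the sense in which the link effect follows the same trend as IDF.</Paragraph>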
      <Paragraph position="2"> Table 2 compares the link effects for two ranges of sample words with roughly the same IDF values within each range. It lists the words sorted by IDF and sorted by link effect. As one can see, the link effect tends to be high for content words when the IDF value alone cannot discriminate between the words.</Paragraph>
      <Paragraph position="3"> [Table 2: sample words in two IDF ranges; within each range, the words are listed sorted by IDF and sorted by link effect.] Figure 1 describes a linear fit of the link effect over the 2000 words with the lowest IDF in our corpus. A very clear trend can be observed, whereby over most words, the value of the link effect is almost a constant. When we looked only at the top 100 or 200 words, the trend was even cleaner. However, with 2000 words one cannot help but notice that a number of outliers appear in the left-hand part of the figure. We ran a K-Means clusterer (with K=2) to identify two clusters of words.</Paragraph>
      <Paragraph position="4"> The clusterer stopped after 32 iterations, having identified the two clusters (Figures 2 and 3), each with a very clear trend. Their means are 1.86 and 3.57, respectively.</Paragraph>
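      <Paragraph> The clustering step can be sketched as a plain K-Means loop with K=2. This is only an illustration under stated assumptions: the text does not say which features were clustered, so the sketch clusters one link effect value per word in one dimension. The function name kmeans_1d, the deterministic initialization, and the toy values are all hypothetical.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """One-dimensional K-Means.

    Returns (centers, clusters, iterations_used). Centers are initialized
    by spreading them evenly over the sorted values (an assumption made
    here to keep the sketch deterministic).
    """
    if not values or k > len(values):
        raise ValueError("need at least k values")
    ordered = sorted(values)
    step = max(k - 1, 1)
    centers = [ordered[(i * (len(ordered) - 1)) // step] for i in range(k)]

    clusters = []
    for iteration in range(1, max_iter + 1):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            return centers, clusters, iteration  # converged
        centers = new_centers
    return centers, clusters, max_iter


# Toy link effect values (invented for illustration).
toy_values = [1.7, 1.9, 2.0, 1.8, 3.4, 3.6, 3.7]
toy_centers, toy_clusters, toy_iterations = kmeans_1d(toy_values, k=2)
```

The loop stops as soon as an update leaves the centers unchanged, which corresponds to the clusterer stopping after a finite number of iterations.</Paragraph>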
      <Paragraph position="5">  axis represents the prior probability a58 while the Y axis corresponds to the posterior probability a58a62a196 .</Paragraph>
    </Section>
  </Section>
</Paper>