<?xml version="1.0" standalone="yes"?>
<Paper uid="W95-0110">
  <Title>Unfortunately, we believe the words in the middle are often the most important words for Information</Title>
  <Section position="3" start_page="121" end_page="126" type="intro">
    <SectionTitle>
1989 Associated Press Newswire stories (D = 85,432).
2. A Good Keyword is far from Poisson
</SectionTitle>
    <Paragraph position="0"> To get a better look at the crucial differences between IDF and f in the middle frequency range (f ≈ 10^3), we selected a set of 53 words for further investigation with 1000 &lt; f &lt; 1020 in the 1989 AP corpus. The 53 words are shown in Table 1, sorted by df. Note that the words near the top of the list tend to be more appropriate for use in an information retrieval system than the words toward the bottom of the list.</Paragraph>
    <Paragraph position="1"> Stories that mention the word boycott, for example, are likely to be about boycotts. In contrast, stories that mention the word somewhat could be about practically anything. [Figure caption, beginning lost in extraction: ... expected under a Poisson, -log2(1 - e^(-f/D)). All but 6 of the circles fall below the x = y line. The data are the same as in Figure 1.]</Paragraph>
    <Paragraph position="2"> Why is IDF such a useful quantity? One might try to answer the question in terms of information theory (Shannon, 1948). IDF can be thought of as the usefulness in bits of a keyword to a keyword retrieval system. If we tell you that the document that we are looking for has the keyword boycott, then we have narrowed the search space down to just 676 of the D documents.</Paragraph>
    <Paragraph position="3"> But this answer doesn't explain the fundamental difference between boycott and somewhat. Boycott has an IDF of -log2(676/D) = 7.0 bits, only a little more than somewhat, which has an IDF of -log2(979/D) = 6.4 bits. And yet, boycott is a reasonable keyword and somewhat is not.</Paragraph>
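    <Paragraph> As a minimal sketch of the arithmetic above, the following Python snippet computes IDF in bits as -log2(df/D). The function name idf_bits and its argument names are illustrative and not from the paper; the document counts (676 and 979) and D = 85,432 are the values quoted in the text.

```python
import math

# Number of documents in the 1989 AP corpus, as quoted in the text.
NUM_DOCS = 85432


def idf_bits(df, num_docs=NUM_DOCS):
    """Return IDF in bits, -log2(df / D), for a word found in df documents.

    Illustrative helper, not from the paper.
    """
    if df > num_docs or not df > 0:
        raise ValueError("df must be positive and no larger than num_docs")
    return -math.log2(df / num_docs)


if __name__ == "__main__":
    # Document frequencies quoted in the text for the two example words.
    for word, df in (("boycott", 676), ("somewhat", 979)):
        print(f"{word}: {idf_bits(df):.1f} bits")
```

    Rounded to one decimal place, this gives the 7.0 and 6.4 bits quoted above, which makes the point concrete: the two words differ by only about half a bit of IDF.</Paragraph>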
    <Paragraph position="4"> A good keyword, like boycott, picks out a very specific set of documents. The problem with somewhat is that it behaves almost like chance (Poisson). Under a Poisson, the 1013 instances of somewhat should be found in approximately D(1 - π(θ, 0)) = D(1 - π(1013/85432, 0)) = 1007 documents. In fact, somewhat was found in 979 documents, only a little less than what would have been expected by chance. Good keywords tend to bunch up into many fewer documents. Boycott, for example, bunches up into only 676 documents, much less than chance (D(1 - π(1009/85432, 0)) = 1003). Almost all words are more &quot;interesting&quot; in this sense than Poisson, but good keywords like boycott are a lot more interesting than Poisson, and crummy ones like somewhat are only a little more interesting than Poisson.</Paragraph>
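    <Paragraph> The chance baseline can be sketched in a few lines of Python, assuming that the Poisson probability of a document containing no instances of a word is e^(-θ) with θ = f/D, so that the expected number of documents containing the word is D(1 - e^(-f/D)). The function names below are illustrative and not from the paper; the frequencies 1013 and 1009 and D = 85,432 are taken from the text.

```python
import math

# Number of documents in the 1989 AP corpus, as quoted in the text.
NUM_DOCS = 85432


def poisson_expected_df(f, num_docs=NUM_DOCS):
    """Expected number of documents containing a word with f instances,
    if instances were scattered over documents as a Poisson with mean f/D."""
    theta = f / num_docs
    prob_zero = math.exp(-theta)  # Poisson probability of k = 0 instances
    return num_docs * (1.0 - prob_zero)


def poisson_predicted_idf_bits(f, num_docs=NUM_DOCS):
    """IDF in bits that the same Poisson assumption would predict."""
    return -math.log2(poisson_expected_df(f, num_docs) / num_docs)


if __name__ == "__main__":
    # (word, corpus frequency f, observed document frequency) from the text.
    for word, f, observed_df in (("somewhat", 1013, 979), ("boycott", 1009, 676)):
        expected = poisson_expected_df(f)
        print(f"{word}: expected {expected:.0f} documents, observed {observed_df}")
```

    The gap between the expected and the observed document count is the deviation from Poisson discussed above: small for somewhat, large for boycott.</Paragraph>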
    <Paragraph position="5"> There is a weak tendency for nouns to appear higher on the list than non-nouns, though the tendency is too weak to explain the pattern of the systematic deviations from Poisson. In addition, there are plenty of exceptions in both directions: rape, pool, grants, code and premier are not necessarily nouns, and sweeping, leads, bound and worry are not necessarily non-nouns.</Paragraph>
    <Paragraph position="7"> [Figure 3 caption, beginning lost in extraction: ... tend to be larger in the middle of the frequency range (Germans), and smaller at both ends (Fromm, which). The data are the same as in Figures 1-2.]</Paragraph>
    <Paragraph position="8"> On this account, a good keyword is one that behaves very differently from the null hypothesis (Poisson). We conjecture that the best keywords tend to be found toward the middle of the frequency range, where there are relatively large deviations from Poisson, as illustrated in Figure 3. This hypothesis runs counter to the standard practice in Information Retrieval of weighting words by IDF, favoring extremely rare words, no matter how they are distributed.</Paragraph>
    <Paragraph position="9"> Of course, IDF is but one of many ways to show deviations from chance. Figure 4 shows the distributions for boycott and somewhat. Note that somewhat is much &quot;closer&quot; to Poisson in almost any sense of closeness that one might consider. Three measures of &quot;closeness&quot; are presented in Table 2: IDF, variance (σ²), and entropy (H). Table 2 compares the top 10 words in Table 1 (labeled &quot;better keywords&quot;) with the bottom 10 words in Table 1 (labeled &quot;worse keywords&quot;). The better keywords have more IDF, more variance and less entropy than what would be expected under a Poisson with θ = f/D = 1000/85,432 = 0.012.</Paragraph>
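    <Paragraph> The three measures can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' procedure: it assumes that variance and entropy are both taken over the distribution of per-document counts k, it truncates the Poisson at a hypothetical k_max, and all function names are invented for the example. The input doc_hist is a hypothetical mapping from k to the number of documents containing exactly k instances of the word.

```python
import math


def poisson_pmf(theta, k):
    """Poisson probability of exactly k instances when the mean is theta."""
    return math.exp(-theta) * theta ** k / math.factorial(k)


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution (zero terms skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def summarize_counts(doc_hist):
    """Return (IDF, variance, entropy) for an observed histogram.

    doc_hist maps k to the number of documents with exactly k instances;
    the k = 0 entry must be included so the counts sum to D.
    """
    total_docs = sum(doc_hist.values())
    probs = {k: n / total_docs for k, n in doc_hist.items()}
    mean = sum(k * p for k, p in probs.items())
    variance = sum((k - mean) ** 2 * p for k, p in probs.items())
    df = total_docs - doc_hist.get(0, 0)  # documents with at least one instance
    idf = -math.log2(df / total_docs)
    return idf, variance, entropy_bits(probs.values())


def poisson_baseline(theta, k_max=40):
    """Return (IDF, variance, entropy) expected under a Poisson with mean theta.

    The variance of a Poisson equals its mean; the entropy sum is truncated
    at k_max, which is an assumption of this sketch.
    """
    probs = [poisson_pmf(theta, k) for k in range(k_max + 1)]
    idf = -math.log2(1.0 - probs[0])
    return idf, theta, entropy_bits(probs)
```

    For a toy histogram in which a word's instances are bunched into a few documents, summarize_counts returns a higher IDF, a higher variance and a lower entropy than poisson_baseline does for the same mean, which is the pattern described for the better keywords.</Paragraph>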
    <Paragraph position="10"> 3. How robust are these deviations from chance? We were concerned that the crucial deviations from Poisson behavior might not hold up if we looked at another corpus of similar material. Figure 5 shows the word boycott in five different years of the AP news. The &quot;fat tails&quot; show up in each of the five years. Clearly, the non-Poisson phenomenon is robust.</Paragraph>
    <Paragraph position="11"> Figures 6 and 7 compare IDF and log10 σ² for the 53 words in Table 1, and show that IDF and log10 σ² are reasonably stable across years. The correlations of IDF and log10 σ² across years are presented in Tables 3-4. All of the correlations are quite large. The correlations for IDF are perhaps somewhat larger than those for log10 σ², suggesting that IDF may be somewhat more robust, which is not surprising given that empirical estimates of variance are notoriously subject to outliers. [Table or figure caption fragment, beginning lost in extraction: ... df ... boycott, than for crummy keywords like somewhat.]</Paragraph>
    <Paragraph position="12"> None of the correlations in Tables 3 and 4 can be attributed to word frequency effects since the 53 words were all chosen with almost the same 1989 frequency.</Paragraph>
    <Paragraph position="13"> In general, the correlations in Tables 3-4 are larger near the diagonal, suggesting that estimates degrade over time. If you want to predict next year's IDF, it is better to use this year's estimate than a ten-year-old estimate.</Paragraph>
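    <Paragraph> A year-by-year correlation table of this kind could be built along the following lines. This is a sketch that assumes the correlations are ordinary Pearson coefficients, which the text does not state; the function names and the idf_by_year structure (a mapping from year to a list of IDF values, one per word, in a fixed word order) are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or 2 > len(xs):
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(idf_by_year):
    """Correlate every pair of years.

    idf_by_year maps a year to the list of per-word IDF values for that
    year; every list must use the same word order. The result maps a
    (year_a, year_b) pair to the correlation between the two years.
    """
    years = sorted(idf_by_year)
    return {
        (a, b): pearson(idf_by_year[a], idf_by_year[b])
        for a in years
        for b in years
    }
```

    Entries near the diagonal of the resulting table correspond to adjacent years, which is where the text reports the largest correlations.</Paragraph>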
    <Paragraph position="14"> [Figure 5 caption, beginning lost in extraction: ... up very clearly in the AP in 1988, 1989, 1990, 1991 and 1992 (dotted lines). Katz' K-mixture (Katz, personal communication), the solid line labelled &quot;K,&quot; fits the data better than the Poisson.]</Paragraph>
    <Paragraph position="15"> Another way to confirm that our measurements of IDF, variance and H have consequences across years in the AP data is to note that measurements of IDF, variance and H in 1989 can be used to predict word frequency in some other year. The correlations are shown in Table 5. They may not be large, but they are too large to be due to chance and they all point in the same direction. The correlations cannot be attributed to variations in frequency in 1989, since all 53 words have almost the same 1989 frequency. Clearly, there are some interesting systematic relationships between IDF/variance/H and f that hold up to replication across multiple years in the AP, measurement errors, and other sources of noise.</Paragraph>
    <Paragraph position="16"> [Figure 6 caption, beginning lost in extraction: ... (for the 53 words in Table 1). Each scatter plot compares IDF in one year with IDF in another. The fact that most of the points line up fairly well indicates that IDF values are strongly correlated across years. The correlations are shown in Table 3.]</Paragraph>
  </Section>
</Paper>