<?xml version="1.0" standalone="yes"?>
<Paper uid="W95-0110">
  <Title>Inverse Document Frequency (IDF): A Measure of Deviations from Poisson</Title>
  <Section position="5" start_page="1992" end_page="1992" type="concl">
    <SectionTitle>
5. Conclusions
</SectionTitle>
    <Paragraph position="0"> Documents are much more than just a bag of words. The Poisson distribution predicts that lightning is unlikely to strike twice in a single document. We shouldn't expect to see two or more instances of boycott in the same document (unless there is some sort of hidden dependency that goes beyond the Poisson). But when it rains, it pours. If a document is about boycotts, we shouldn't be surprised to find two boycotts, or even half a dozen, in a single document. The standard use of the Poisson in modeling the distribution of words and ngrams fails to fit the data except where there are almost no interesting hidden dependencies, as in the case of somewhat.</Paragraph>
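    The "lightning doesn't strike twice" claim can be made concrete with a small sketch (ours, not the paper's; the rate is illustrative): under a Poisson with per-document rate θ = f_w/D, the probability of seeing a word two or more times in one document is 1 − e^(−θ)(1 + θ), which is vanishingly small for a rare word.

```python
import math

def p_two_or_more(theta):
    """P(X >= 2) for X ~ Poisson(theta): 1 - P(X=0) - P(X=1)."""
    return 1.0 - math.exp(-theta) * (1.0 + theta)

# Illustrative rate: a word occurring 100 times across 100,000 documents.
theta = 100 / 100_000
print(p_two_or_more(theta))  # about 5e-7: lightning rarely strikes twice
```

    For a genuinely bursty word like boycott, the empirical rate of repeat occurrences is far above this Poisson prediction, which is exactly the deviation the paper exploits.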
    <Paragraph position="1"> Why are the deviations from Poisson more salient for "interesting" words like boycott than for "boring" words like somewhat? Many applications such as information retrieval, text categorization, author identification and word-sense disambiguation attempt to discriminate documents on the basis of certain hidden variables such as topic, author, genre, style, etc. The more a keyword (or ngram) deviates from Poisson, the stronger its dependence on hidden variables, and the more useful the keyword (or ngram) is for discriminating documents on the basis of these hidden dependencies. Similar arguments apply in a host of other important applications such as text compression and language modeling for speech recognition, where it is desirable for word and ngram probabilities to adapt appropriately to frequency changes due to various hidden dependencies.</Paragraph>
    <Paragraph position="2"> We have used document frequency, df, a concept borrowed from Information Retrieval, to find deviations from Poisson behavior. Document frequency is similar to word frequency, but different in a subtle but crucial way. Although inverse document frequency (IDF) and log f are extremely highly correlated (ρ = -0.994), it would be a mistake to try to model one with a simple transform of the other. Figure 5 showed one such attempt, where f was transformed into a predicted IDF by introducing a Poisson assumption: IDF = -log2(1 - e^(-θ)), with θ = f_w/D. Unfortunately, the prediction errors were relatively large for the most important keywords, words with moderate frequencies such as Germans.</Paragraph>
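    The two quantities being contrasted can be sketched in a few lines (our variable names and counts, not the AP data): the observed IDF comes from document frequency df, while the Poisson assumption predicts IDF from word frequency f alone, via P(word appears in a document) = 1 − e^(−θ) with θ = f_w/D. For a bursty word, the observed df is much smaller than the Poisson prediction, so the observed IDF is larger.

```python
import math

def observed_idf(df, D):
    """Observed IDF from document frequency df in a collection of D documents."""
    return -math.log2(df / D)

def poisson_predicted_idf(f, D):
    """IDF predicted from word frequency f alone under a Poisson assumption:
    P(document contains the word) = 1 - e^(-theta), theta = f/D."""
    theta = f / D
    return -math.log2(1.0 - math.exp(-theta))

# Illustrative bursty word: f = 500 occurrences bunched into only df = 100
# of D = 100,000 documents.
D = 100_000
print(observed_idf(100, D))           # about 9.97 bits
print(poisson_predicted_idf(500, D))  # about 7.65 bits: Poisson underpredicts
```

    The gap between the two numbers is the prediction error the paragraph above describes: the Poisson model expects the 500 occurrences to spread over roughly 499 documents, not bunch into 100.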
    <Paragraph position="3"> To get a better look at the subtle differences between document frequency and word frequency, we focused our attention on a set of 53 words that all had approximately the same word frequency in a corpus of 1989 AP stories. Table 1 showed that words with larger IDF tend to have more content.</Paragraph>
    <Paragraph position="4"> boycott, for example, is a better keyword than somewhat because it bunches up into a relatively small set of documents. Table 2 showed that variance and entropy can also be used as measures of content (at least among a set of words with more or less the same word frequency). A good keyword like boycott is farther from Poisson (chance) than a crummy keyword like somewhat by almost any sense of closeness that one might consider, e.g., IDF, variance, entropy. These crucial deviations from Poisson are robust. We showed in section 4 that deviations from Poisson in one year of the AP can be used to predict deviations in another year of the AP.</Paragraph>
  </Section>
</Paper>