<?xml version="1.0" standalone="yes"?>
<Paper uid="W95-0110">
  <Title>Unfortunately, we believe the words in the middle are often the most important words for Information</Title>
  <Section position="4" start_page="126" end_page="1992" type="metho">
    <SectionTitle>
4. Katz' K-mixture
</SectionTitle>
    <Paragraph position="0"> Clearly, the Poisson does not fit our data very well, especially for good keywords like boycott.</Paragraph>
    <Paragraph position="1"> This is, however, a negative result. Can we say something more constructive? Katz (personal communication) proposed the following alternative to the Poisson. $\Pr_K(k)$ is the probability of k instances of w in a document.</Paragraph>
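    <Paragraph> As an illustration, the K-mixture is commonly written as $\Pr_K(k) = (1-\alpha)\,\delta_{k,0} + \frac{\alpha}{\beta+1}\left(\frac{\beta}{\beta+1}\right)^k$, where $\delta_{k,0}$ is 1 when k = 0 and 0 otherwise. That form is assumed here rather than taken from the text above, and the function name k_mixture_pmf is hypothetical. A minimal Python sketch under that assumption:

```python
def k_mixture_pmf(k: int, alpha: float, beta: float) -> float:
    """Assumed K-mixture probability of exactly k instances of a word in a document."""
    # Reject parameters outside the range where the formula is a probability.
    if not (k >= 0 and alpha >= 0.0 and 1.0 >= alpha and beta > 0.0):
        raise ValueError("need k >= 0, alpha in [0, 1], beta > 0")
    # Geometric-like term shared by every k >= 0.
    tail = (alpha / (beta + 1.0)) * (beta / (beta + 1.0)) ** k
    # The point mass (1 - alpha) applies only at k = 0 (the delta term).
    return (1.0 - alpha) + tail if k == 0 else tail
```

 Under this form the probabilities sum to 1 over k, and the mean number of instances per document is alpha * beta.</Paragraph>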
    <Paragraph position="3"> The K-mixture can be viewed as a mixture of Poissons. Suppose that, within documents, boycott is distributed by a Poisson process, but, across documents, the Poisson parameter $\theta$ is allowed to vary from one document to another depending on how much the document is about boycotts. In other words, $\Pr_K(k)$ can be expressed as a convolution of Poissons with a density function $\phi$.</Paragraph>
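    <Paragraph> Written out, the convolution takes the following form. This is a hedged reconstruction: $\pi(\theta, k)$ is introduced here to denote the Poisson probability of k given $\theta$, and the density $\phi_K$ (a point mass at zero plus an exponential, with two parameters written $\alpha$ and $\beta$) is an assumption, chosen because it integrates to the usual K-mixture form.

```latex
\[
\Pr_K(k) = \int_0^\infty \phi(\theta)\,\pi(\theta, k)\,d\theta,
\qquad
\pi(\theta, k) = \frac{e^{-\theta}\,\theta^k}{k!}
\]
% Assumed density: a point mass at theta = 0 plus an exponential with mean beta.
\[
\phi_K(\theta) = (1-\alpha)\,\delta(\theta) + \frac{\alpha}{\beta}\,e^{-\theta/\beta}
\]
% The exponential part is a Gamma integral:
\[
\int_0^\infty \frac{\alpha}{\beta}\,e^{-\theta/\beta}\,\frac{e^{-\theta}\,\theta^k}{k!}\,d\theta
= \frac{\alpha}{\beta}\left(\frac{\beta}{\beta+1}\right)^{k+1}
= \frac{\alpha}{\beta+1}\left(\frac{\beta}{\beta+1}\right)^{k}
\]
% The point mass contributes only at k = 0, giving:
\[
\Pr_K(k) = (1-\alpha)\,\delta_{k,0} + \frac{\alpha}{\beta+1}\left(\frac{\beta}{\beta+1}\right)^{k}
\]
```
</Paragraph>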
    <Paragraph position="5"> In this way, the $\theta$s can depend on an infinite number of unknowable hidden variables, e.g., what the documents are about, who wrote them, when they were written, what was going on in the world when they were written, etc., but we don't need to know these dependencies for any particular document. All we need to know is $\phi$, the density of $\theta$s, aggregated over all possible combinations of hidden variables.</Paragraph>
    <Paragraph position="6">  maybe not as predictable as IDF (for the 53 words in Table 1). The correlations are shown in Table 4.</Paragraph>
    <Paragraph position="7">  Katz' K-mixture has two parameters, $\alpha$ and $\beta$. The $\alpha$ parameter determines the fraction of relevant and irrelevant documents. A fraction $1 - \alpha$ of the documents have no chance of mentioning boycott ($\theta = 0$) because they are totally irrelevant to boycotts. The $\beta$ parameter determines the average $\theta$ among the relevant documents.</Paragraph>
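    <Paragraph> The two-stage reading of the parameters can be illustrated by simulation. The sketch below assumes that $\theta$ among relevant documents follows an exponential density with mean $\beta$ (the text above fixes only the average), and the function name sample_k_mixture is hypothetical.

```python
import math
import random


def sample_k_mixture(alpha: float, beta: float, rng: random.Random) -> int:
    """Draw one per-document count under the two-stage reading of the K-mixture."""
    # With probability 1 - alpha the document is irrelevant: theta = 0, so the count is 0.
    if rng.random() >= alpha:
        return 0
    # Assumption: relevant documents draw theta from an exponential with mean beta.
    theta = rng.expovariate(1.0 / beta)
    # Knuth's multiplication method for a Poisson(theta) draw.
    limit = math.exp(-theta)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k
```

 With many draws, the average count should approach alpha * beta, and the share of documents with at least one instance should approach alpha * beta / (beta + 1).</Paragraph>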
    <Paragraph position="8"> The two parameters, $\alpha$ and $\beta$, can be fit from almost any pair of variables considered thus far, e.g., f, IDF, $\sigma^2$, H. We have found that f and IDF are particularly easy to work with, and are more robust than some others such as $\sigma^2$.</Paragraph>
    <Paragraph position="9"> $\beta = \frac{f}{D}\, 2^{\mathrm{IDF}} - 1$ and $\alpha = \frac{f}{D} \cdot \frac{1}{\beta}$. It has been our experience that Katz' K-mixture fits the data much better than the Poisson, as can be seen in Figure 5. Unlike the Poisson, the K-mixture has two parameters, $\alpha$ and $\beta$, and can therefore account for the fact that IDF and f are not completely predictable from one another. In related work (Church and Gale, submitted), we looked at a number of different Poisson mixtures, and found that our data can also be fit by a negative binomial, which can be viewed as a Poisson mixture where the mixing density $\phi(\theta)$ is a Gamma distribution (Johnson and Kotz, 1969). See Mosteller and Wallace (1964) for an example of how to use the negative binomial in a Bayesian discrimination task. It is straightforward to generalize the Mosteller and Wallace approach to use Katz' K-mixture or any other mixture of Poissons.</Paragraph>
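    <Paragraph> The fit from f and IDF can be sketched in a few lines. The sketch reads the fit as $\beta = (f/D)\,2^{\mathrm{IDF}} - 1$ and $\alpha = (f/D)/\beta$, and assumes $\mathrm{IDF} = -\log_2(df/D)$, where df is the number of documents containing the word and D is the total number of documents (neither definition is stated in this section). The function name fit_k_mixture is hypothetical.

```python
import math


def fit_k_mixture(f: int, df: int, num_docs: int) -> tuple:
    """Estimate (alpha, beta) from corpus frequency f, document frequency df, and D."""
    # The fit needs at least one repeated instance (f > df), otherwise beta = 0.
    if not (f > df > 0 and num_docs >= df):
        raise ValueError("need f > df > 0 and num_docs >= df")
    lam = f / num_docs                 # average instances per document, f / D
    idf = -math.log2(df / num_docs)    # assumed definition: IDF = -log2(df / D)
    beta = lam * 2.0 ** idf - 1.0      # beta = (f / D) * 2^IDF - 1
    alpha = lam / beta                 # alpha = (f / D) / beta
    return alpha, beta
```

 For example, hypothetical counts f = 200, df = 50, and D = 1000 give $\beta = 3$ and $\alpha \approx 0.067$. Equivalently, $\beta = (f - df)/df$, the average number of additional instances per document that mentions the word.</Paragraph>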
  </Section>
</Paper>