<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1037">
  <Title>Parametric Models of Linguistic Count Data</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Word Frequency in Fixed-Length Texts
</SectionTitle>
    <Paragraph position="0"> In preparation of their authorship study of The Federalist, Mosteller and Wallace (1984,SS2.3) investigated the variation of word frequency across contiguous passages of similar length, drawn from papers of known authorship. The occurrence frequencies of any in papers by Hamilton (op. cit., Table 2.3-3) are repeated here in Figure 1: out of a total of 247 passages there are 125 in which the word any does not occur; it occurs once in 88 passages, twice in 26 passages, etc. Figure 1 also shows the counts predicted by a Poisson distribution with mean 0.67. Visual inspection (&amp;quot;chi by eye&amp;quot;) indicates an acceptable fit between the model and the data, which is confirmed by a kh2 goodness-of-fit test. This demonstrates that certain words seem to be adequately modeled by a Poisson distribution, whose probability mass function is shown in (1):</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="525" type="metho">
    <SectionTitle>
Figure 1: Occurrence counts of any in Hamilton passages: raw counts and counts predicted under a Poisson model.
</SectionTitle>
    <Paragraph position="1"> For other words the Poisson distribution gives a much worse fit. Take the occurrences of were in papers by Madison, as shown in Figure 2 (ibid.). We calculate the kh2 statistic for the counts expected under a Poisson model for three bins (0, 1, and 2-5, to ensure that the expected counts are greater than 5) and obtain 6.17 at one degree of freedom (number of bins minus number of parameters minus one), which is enough to reject the null hypothesis that the data arose from a Poisson(0.45) distribution. On the other hand, the kh2 statistic for a negative binomial distribution NegBin(0.45,1.17) is only 0.013 for four bins (0, 1, 2, and 3-5), i. e., again 1 degree of freedom, as two parameters were estimated from the data. Now we are very far from rejecting the null hypothesis. This provides some quantitative backing for Mosteller and Wallace's statement that 'even the most motherly eye can scarcely make twins of the [Poisson vs. empirical] distributions' for certain words (op. cit., 31).</Paragraph>
    <Paragraph position="2"> The probability mass function of the negative binomial distribution, using Mosteller and Wallace's parameterization, is shown in (2):</Paragraph>
    <Paragraph position="4"> it is easy to see that NegBin(l,k) converges to Poisson(l) for l constant and k-[?]. On the other</Paragraph>
    <Paragraph position="6"> passages: raw counts and counts predicted under Poisson and negative binomial models.</Paragraph>
    <Paragraph position="7"> hand, small values of k drag the mode of the negative binomial distribution towards zero and increase its variance, compared with the Poisson.</Paragraph>
    <Paragraph position="8"> As more and more probability mass is concentrated at 0, the negative binomial distribution starts to depart from the empirical distribution. One can already see this tendency in Mosteller and Wallace's data, although they themselves never comment on it. The problem with a huge chunk of the probability mass at 0 is that one is forced to say that the outcome 1 is still fairly likely and that the probability should drop rapidly from 2 onwards as the term 1/x! starts to exert its influence. This is often at odds with actual data.</Paragraph>
    <Paragraph position="9"> Take the word his in papers by Hamilton and Madison (ibid., pooled from individual sections of Table 2.3-3). It is intuitively clear that his may not occur at all in texts that deal with certain aspects of the US Constitution, since many aspects of constitutional law are not concerned with any single (male) person. For example, Federalist No. 23 (The Necessity of a Government as Energetic as the One Proposed to the Preservation of the Union, approx. 1800 words, by Hamilton) does not contain a single occurrence of his, whereas Federalist No. 72 (approx. 2000 words, a continuation of No. 71 The Duration in Office of the Executive, also by Hamilton) contains 35 occurrences. The difference is that No. 23 is about the role of a federal government in the abstract, and Nos. 71/72 are about term limits for offices filled by (male) individuals. We might therefore expect the occurrences of his to vary more, de- null Madison passages (NB: y-axis is logarithmic).</Paragraph>
    <Paragraph position="10"> pending on topic, than any or were.</Paragraph>
    <Paragraph position="11"> The overall distribution of his is summarized in Figure 3; full details can be found in Table 1. Observe the huge number of passages with zero occurrences of his, which is ten times the number of passages with exactly one occurrence. Also notice how the negative binomial distribution fitted using the Method of Maximum Likelihood (MLE model, first line in Figure 3, third column in Table 1) overshoots at 1, but underestimates the number of passages with 2 and 3 occurrences.</Paragraph>
    <Paragraph position="12"> The problem cannot be solved by trying to fit the two parameters of the negative binomial based on the observed counts of two points. The second line in Figure 3 is from a distribution fitted to match the observedcountsat0and1. Althoughitfitsthosetwo points perfectly, the overall fit is worse than that of the MLE model, since it underestimates the observed counts at 2 and 3 more heavily.</Paragraph>
    <Paragraph position="13"> The solution we propose is illustrated by the third line in Figure 3. It accounts for only about a third of the data, but covers all passages with one or more occurrences of his. Visual inspection suggests that it provides a much better fit than the other two models, if we ignore the outcome 0; a quantitative comparison will follow below. This last model has relaxed the relationship between the probability of the outcome 0 and the probabilities of the other outcomes. In particular, we obtain appropriate counts for the outcome 1 by pretending that the outcome 0 occurs only about 71 times, compared with an actual 405 observed occurrences. Recall that the model accounts for only 34% of the data; the remaining  Madison passages.</Paragraph>
    <Paragraph position="14"> counts for the outcome 0 are supplied entirely by a second component whose probability mass is concentrated at zero. The expected counts under the full model are found in the rightmost column of Table 1.</Paragraph>
    <Paragraph position="15"> The general recipe for models with large counts for the zero outcome is to construe them as two-component mixtures, where one component is a degenerate distribution whose entire probability mass is assigned to the outcome 0, and the other component is a standard distribution, call itF(th). Such a nonstandard mixture model is sometimes known as a 'modified' distribution (Johnson and Kotz, 1969, SS8.4) or, more perspicuously, as a zero-inflated distribution. The probability mass function of a zero-inflated F distribution is given by equation (3), where 0[?]z[?]1 (z &lt; 0 may be allowable subject to additional constraints) and x[?]0 is the Kronecker delta dx,0.</Paragraph>
    <Paragraph position="17"> It corresponds to the following generative process: toss a z-biased coin; if it comes up heads, generate 0; if it comes up tails, generate according to F(th). If we apply this to word frequency in documents, what this is saying is, informally: whether a given word appears at all in a document is one thing; how often it appears, if it does, is another thing.</Paragraph>
    <Paragraph position="18"> This is reminiscent of Church's statement that '[t]he first mention of a word obviously depends on frequency, but surprisingly, the second does not.' (Church, 2000) However, Church was concerned with language modeling, and in particular cache-based models that overcome some of the limitations introduced by a Markov assumption. In such a setting it is natural to make a distinction between the first occurrence of a word and subsequent occurrences, which according to Church are influenced by adaptation (Church and Gale, 1995), referring to an increase in a word's chance of re-occurrence after it has been spotted for the first time. For empirically demonstrating the effects of adaptation, Church (2000) workedwith nonparametric methods.</Paragraph>
    <Paragraph position="19"> By contrast, our focus is on parametric methods, and unlike in language modeling, we are also interested in words that fail to occur in a document, so it is natural for us to distinguish between zero and nonzero occurrences.</Paragraph>
    <Paragraph position="20"> In Table 1, ZINB refers to the zero-inflated negative binomial distribution, which takes a parameter z in addition to the two parameters of its negative binomial component. Since the negative binomial itself can already accommodate large fractions oftheprobabilitymassat0, wemustaskwhetherthe ZINB model fits the data better than a simple negative binomial. The bottom row of Table 1 shows the negative log likelihood of the maximum likelihood estimate ^th for each model. Log odds of 2 in favor of ZINB are indeed sufficient (on Akaike's likelihood-based information criterion; see e. g. Pawitan 2001, SS13.5) to justify the introduction of the additional parameter. Also note that the cumulative kh2 probability of the kh2 statistic at the appropriate degrees of freedom is lower for the zero-inflated distribution.</Paragraph>
    <Paragraph position="21"> It is clear that a large amount of the observed variation of word occurrences is due to zero inflation, because virtually all words are rare and many words are simply not &amp;quot;on topic&amp;quot; for a given document. Even a seemingly innocent word like his turns out to be &amp;quot;loaded&amp;quot; (and we are not referring to gender issues), since it is not on topic for certain discussions of constitutional law. One can imagine that this effect is even more pronounced for taboo words, proper names, or technical jargon (cf. Church 2000).</Paragraph>
    <Paragraph position="22"> Our next question is whether the observed variation is best accounted for in terms of zero-inflation or overdispersion. We phrase the discussion in terms of a practical task for which it matters whether a word is on topic for a document.</Paragraph>
  </Section>
  <Section position="5" start_page="525" end_page="525" type="metho">
    <SectionTitle>
3 Word Frequency Conditional on Document Length
</SectionTitle>
    <Section position="1" start_page="525" end_page="525" type="sub_section">
      <Paragraph position="0"> Word occurrence counts play an important role in document classification under an independent feature model (commonly known as &amp;quot;Naive Bayes&amp;quot;). This is not entirely uncontroversial, as many approaches to document classification use binary indicators for the presence and absence of each word, instead of full-fledged occurrence counts (see Lewis 1998 for an overview). In fact, McCallum and Nigam (1998) claim that for small vocabulary sizes one is generally better off using Bernoulli indicator variables; however, for a sufficiently large vocabulary, classification accuracy is higher if one takes word frequency into account.</Paragraph>
      <Paragraph position="1"> Comparing different probability models in terms of their effects on classification under a Naive Bayes assumption is likely to yield very conservative results, since the Naive Bayes classifier can perform accurate classifications under many kinds of adverse conditions and even when highly inaccurate probability estimates are used (Domingos and Pazzani, 1996; Garg and Roth, 2001). On the other hand, an evaluation in terms of document classification has the advantages, compared with language modeling, of computational simplicity and the ability to benefit from information about non-occurrences of words.</Paragraph>
      <Paragraph position="2"> Making a direct comparison of overdispersed and zero-inflated models with those used by McCallum and Nigam (1998) is difficult, since McCallum and Nigam use multivariate models - for which the &amp;quot;naive&amp;quot; independence assumption is different (Lewis, 1998) - that are not as easily extended to the cases we are concerned about. For example, the natural overdispersed variant of the multinomial model is the Dirichlet-multinomial mixture, which adds just a single parameter that globally controls the overall variation of the entire vocabulary. However, Church, Gale and other have demonstrated repeatedly (Church and Gale, 1995; Church, 2000) that adaptation or &amp;quot;burstiness&amp;quot; are clearly properties of individual words (word types). Using joint independentmodels(onemodelperword)bringsusback null into the realm of standard independence assumptions, makes it easy to add parameters that control overdispersion and/or zero-inflation for each word individually, and simplifies parameter estimation.</Paragraph>
      <Paragraph position="3">  ent vocabulary sizes on the Newsgroup data set.</Paragraph>
      <Paragraph position="4"> So instead of a single multinomial distribution we use independent binomials, and instead of a multivariate Bernoulli model we use independent Bernoulli models for each word. The overall joint model is clearly wrong since it wastes probability mass on events that are known a priori to be impossible, likeobservingdocumentsforwhichthesumof the occurrences of each word is greater than the document length. On the other hand, it allows us to take the true document length into account while using only a subset of the vocabulary, whereas on McCallum and Nigam's approach one has to either completely eliminate all out-of-vocabulary words and adjust the document length accordingly, or else map out-of-vocabulary words to an unknown-word token whose observed counts could then easily dominate.</Paragraph>
      <Paragraph position="5"> In practice, using joint independent models does not cause problems. We replicated McCallum and Nigam's Newsgroup experiment1 and did not find any major discrepancies. The reader is encouraged to compare our Figure 4 with McCallum and Nigam's Figure 3. Not only are the accuracy figures comparable, we also obtained the same critical vocabulary size of 200 words below which the Bernoulli model results in higher classification accuracy. null The Newsgroup data set (Lang, 1995) is a strati1Many of the datasets used by McCallum and Nigam (1998) are available at http://www.cs.cmu.edu/~TextLearning/ datasets.html.</Paragraph>
      <Paragraph position="6"> fied sample of approximately 20,000 messages total, drawn from 20 Usenet newsgroups. The fact that 20 newsgroups are represented in equal proportions makes this data set well suited for comparing different classifiers, as class priors are uniform and baseline accuracy is low at 5%. Like McCallum and Nigam (1998) we used (Rain)bow (McCallum, 1996) for tokenization and to obtain the word/ document count matrix. Even though we followed McCallum and Nigam's tokenization recipe (skipping message headers, forming words from contiguous alphabetic characters, not using a stemmer), our total vocabulary size of 62,264 does not match Mc-CallumandNigam'sfigureof62,258, butdoescome reasonably close. Also following McCallum and Nigam (1998) we performed a 4:1 random split into training and test data. The reported results were obtained by training classification models on the training data and evaluating on the unseen test data.</Paragraph>
      <Paragraph position="7"> We compared four models of token frequency.</Paragraph>
      <Paragraph position="8"> Each model is conditional on the document length n (but assumes that the parameters of the distribution do not depend on document length), and is derived from the binomial distribution</Paragraph>
      <Paragraph position="10"> which we view as a one-parameter conditional model, ourfirstmodel: xrepresentsthetokencounts (0[?]x[?]n); andnisthelengthofthedocumentmeasured as the total number of token counts, including out-of-vocabulary items.</Paragraph>
      <Paragraph position="11"> The second model is the Bernoulli model, which is derived from the binomial distribution by replac- null Our third model is an overdispersed binomial model, a &amp;quot;natural&amp;quot; continuous mixture of binomials with the integrated binomial likelihood - i. e. the Beta density (6), whose normalizing term involves the Beta function - as the mixing distribution.</Paragraph>
      <Paragraph position="13"> The resulting mixture model (7) is known as the P'olya-Eggenberger distribution (Johnson and Kotz, 1969) or as the beta-binomial distribution. It has been used for a comparatively small range of NLP applications (Lowe, 1999) and certainly deserves more widespread attention.</Paragraph>
      <Paragraph position="14">  As was the case with the negative binomial (which is to the Poisson as the beta-binomial is to the binomial), it is convenient to reparameterize the distribution. We choose a slightly different parameterization than Lowe (1999); we follow Ennis and Bi (1998) and use the identities</Paragraph>
      <Paragraph position="16"> To avoid confusion, we will refer to the distribution parameterized in terms of p and g as BB:  Comparing this with the expectation and variance of the standard binomial model, it is obvious that the beta-binomial has greater variance when g &gt; 0, and for g = 0 the beta-binomial distribution coincides with a binomial distribution.</Paragraph>
      <Paragraph position="17"> Using the method of moments for estimation is particularly straightforward under this parameterization (Ennis and Bi, 1998). Suppose one sample consists of observing x successes in n trials (x occurrences of the target word in a document of length n), where the number of trials may vary across samples. Now we want to estimate parameters based on a sequence of s samples&lt;x1,n1&gt; ,...,&lt;xs,ns&gt; . We equate sample moments with distribution moments</Paragraph>
      <Paragraph position="19"> and solve for the unknown parameters:</Paragraph>
      <Paragraph position="21"> In our experience, the resulting estimates are sufficiently close to the maximum likelihood estimates, while method-of-moment estimation is much faster than maximum likelihood estimation, which requires gradient-based numerical optimization2 in this case. Since we estimate parameters for up to 400,000 models (for 20,000 words and 20 classes), we prefer the faster procedure. Note that the maximum likelihood estimates may be suboptimal (Lowe, 1999), but full-fledged Bayesian methods (Lee and Lio, 1997) would require even more computational resources.</Paragraph>
      <Paragraph position="22">  Thefourthandfinalmodelisazero-inflatedbinomial distribution, which is derived straightforwardly via equation (3):</Paragraph>
      <Paragraph position="24"> Since the one parameter p of a single binomial model can be estimated directly using equation (9), maximum likelihood estimation for the zero-inflated binomial model is straightforward via the EM algorithm for finite mixture models. Figure 5 shows pseudo-code for a single EM update.</Paragraph>
      <Paragraph position="25"> Accuracy results of Naive Bayes document classification using each of the four word frequency models are shown in Table 2. One can observe that the differences between the binomial models are small, 2Not that there is anything wrong with that. In fact, we calculated the MLE estimates for the negative binomial models using a multidimensional quasi-Newton algorithm.</Paragraph>
      <Paragraph position="26">  EM iteration that updates the two parameters.</Paragraph>
      <Paragraph position="27"> but even small effects can be significant on a test set of about 4,000 messages. More importantly, note that the beta-binomial and zero-inflated binomial models outperform both the simple binomial and the Bernoulli, except on unrealistically small vocabularies (intuitively, 20 words are hardly adequate for discriminating between 20 newsgroups, and those words would have to be selected much more carefully). In light of this we can revise McCallum and Nigam's McCallum and Nigam (1998) recommendation to use the Bernoulli distribution for small vocabularies. Instead we recommend that neither the Bernoulli nor the binomial distributions should be used, since in all reasonable cases they are outperformed by the more robust variants of the binomial distribution. (The case of a 20,000 word vocabulary is quickly declared unreasonable, since most of the words occur precisely once in the training data, and so any parameter estimate is bound to be unreliable.) Wewanttoknowwhetherthedifferencesbetween the three binomial models could be dismissed as a chance occurrence. The McNemar test (Dietterich, 1998) provides appropriate answers, which are summarized in Table 3. As we can see, the classification results under the zero-inflated binomial and beta-binomial models are never significantly differ- null dicates a significant difference of the classification results when comparing a pair of of models.</Paragraph>
      <Paragraph position="28"> ent, in most cases not even approaching significance at the 5% level. A classifier based on the beta-binomial model is significantly different from one based on the binomial model; the difference for a vocabulary of 20,000 words is marginally significant (the kh2 value of 3.8658 barely exceeds the critical value of 3.8416 required for significance at the 5% level). Classification based on the zero-inflated binomial distribution differs most from using a standard binomial model. We conclude that the zero-inflated binomial distribution captures the relevant extra-binomial variation just as well as the overdispersed beta-binomial distribution, since their classification results are never significantly different. The differences between the four models can be seen more visually clearly on the WebKB data set  KB data set as a function of vocabulary size.</Paragraph>
      <Paragraph position="29"> (McCallum and Nigam, 1998, Figure 4). Evaluation results for Naive Bayes text classification using the four models are displayed in Figure 6. The zero-inflated binomial model provides the overall highest classification accuracy, and clearly dominates the beta-binomial model. Either one should be preferred over the simple binomial model. The early peak and rapid decline of the Bernoulli model had already been observed by McCallum and Nigam (1998).</Paragraph>
      <Paragraph position="30"> We recommend that the zero-inflated binomial distribution should always be tried first, unless there is substantial empirical or prior evidence against it: the zero-inflated binomial model is computationally attractive (maximum likelihood estimation using EM is straightforward and numerically stable, most gradient-based methods are not), and its z parameter is independently meaningful, as it can be interpreted as the degree to which a given word is &amp;quot;on topic&amp;quot; for a given class of documents.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>