File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/p03-1037_intro.xml
Size: 1,564 bytes
Last Modified: 2025-10-06 14:01:51
<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1037"> <Title>Parametric Models of Linguistic Count Data</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Linguistic count data often violate the simplistic assumptions of standard probability models like the binomial or Poisson distribution. In particular, the inadequacy of the Poisson distribution for modeling word (token) frequency is well known, and robust alternatives have been proposed (Mosteller and Wallace, 1984; Church and Gale, 1995). In the case of the Poisson, a commonly used robust alternative is the negative binomial distribution (Pawitan, 2001, §4.5), which can capture extra-Poisson variation in the data; in other words, it is overdispersed compared with the Poisson, whose variance always equals its mean. When a small set of parameters controls all properties of a distribution, it is important to have enough parameters to model the relevant aspects of one's data. Simple models like the Poisson or binomial do not have enough parameters for many realistic applications, and we suspect that the same might be true of log-linear models. When applying robust models like the negative binomial to linguistic count data such as word occurrences in documents, it is natural to ask to what extent the model has captured the extra-Poisson variation. Answering that question is our main goal, and we begin by reviewing some of the classic results of Mosteller and Wallace (1984).</Paragraph> </Section> </Paper>