<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0106"> <Title>Relating Turing's Formula and Zipf's Law</Title> <Section position="3" start_page="0" end_page="770" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Turing's formula \[Good 1953\] and Zipf's law \[Zipf 1935\] indicate how population frequencies in general tend to behave. Turing's formula estimates locally what the frequency count of a species that occurred r times in a sample really would have been, had the sample accurately reflected the underlying population distribution. Zipf's law prescribes the asymptotic behavior of the relative frequencies of species as a function of their rank. The ranking scheme in question orders the species by frequency, with the most common species ranked first. These formulas are of interest in computational linguistics because they can be used to improve probability estimates from relative frequencies, and to predict the frequencies of unseen phenomena, e.g., the frequency of previously unseen words encountered in running text.</Paragraph> <Paragraph position="1"> Due to limitations in the amount of available training data, the so-called sparse-data problem, estimating probabilities directly from observed relative frequencies may not always be very accurate. For this reason, Turing's formula, in the form of Katz's back-off scheme \[Katz 1987\], has become a standard technique for improving parameter estimates for probabilistic language models used by speech recognizers. A more theoretical treatment of Turing's formula itself can be found in \[Nádas 1985\].</Paragraph> <Paragraph position="2"> Zipf's law is commonly regarded as an empirically accurate description of a wide variety of (linguistic) phenomena, but too general to be of any direct use. For a bit of historic controversy on Zipf's law, we refer to \[Simon 1955\], \[Mandelbrot 1959\], and subsequent articles in Information and Control.
The model presented there for the stochastic source generating the various Zipfian distributions is, however, linguistically highly dubious: a version of the monkey-with-typewriter scenario.</Paragraph> <Paragraph position="3"> The remainder of this article is organized as follows. In Section 2, we induce a recurrence equation from Turing's local reestimation formula and from this derive the asymptotic behavior of the relative frequency as a function of rank, using a continuum approximation. The resulting probability distribution is then examined, and we rederive the recurrence equation from it. In Section 3, we start with the asymptotic behavior stipulated by Zipf's law, and derive a recurrence equation similar to that associated with Turing's formula, and from this induce a corresponding reestimation formula. We then rederive the Zipfian asymptote from the established recurrence equation. In Section 4, similar techniques are used to establish the asymptotic behavior inherent in a general class of recurrence equations, parameterized by a real-valued parameter, and then to rederive the recurrence equations from their asymptotes.</Paragraph> <Paragraph position="4"> The convergence region of this parameter for the cumulative of the frequency function, as rank approaches infinity, is also investigated. In Section 5, we summarize the results, discuss how they might be used practically, and compare them with related work.</Paragraph> </Section> </Paper>