<?xml version="1.0" standalone="yes"?> <Paper uid="W02-2008"> <Title>A very very large corpus doesn't always yield reliable estimates</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Theoretical Convergence Behaviour </SectionTitle>
<Paragraph position="0"> Standard results in the theory of statistical inference govern the convergence behaviour of expectation statistics, and their deviation from that behaviour, in the limit of sample size. The intuitive Law of Averages convergence of probabilities estimated from increasingly large samples is formalised by the Law(s) of Large Numbers. The definition1 given in Theorem 1 is taken from Casella and Berger (1990): Theorem 1 (Strong Law of Large Numbers) Let X_1, X_2, X_3, ... be i.i.d. random variables with E X_i = μ and Var X_i = σ² &lt; ∞, and define the average X̄_n = (1/n) Σ_{i=1}^n X_i. Then, for every ε &gt; 0:</Paragraph>
<Paragraph position="2"> P( lim_{n→∞} |X̄_n − μ| &lt; ε ) = 1. The Law of the Iterated Logarithm relates the degree of deviation from convergent behaviour to the variance of the converging expectation estimates and the size of the sample. The definition in Theorem 2 is taken from Petrov (1995): Theorem 2 (Law of the Iterated Logarithm) Let X_1, X_2, X_3, ... be i.i.d. random variables with E X_i = μ and Var X_i = σ² &lt; ∞, and let S_n = Σ_{i=1}^n X_i. Then:</Paragraph>
<Paragraph position="4"> limsup_{n→∞} (S_n − nμ) / (σ √(2n ln ln n)) = 1 almost surely. Limit theorems codify behaviour as the sample size n approaches infinity. Thus, they can only provide an approximate guide to the finite-sample convergence behaviour of the expectation statistics, particularly for smaller samples. Moreover, the assumptions these limit theorems impose on the random variables may not be reasonable, or even approximately so. It is therefore an open question whether a billion word corpus is sufficiently large to yield reliable estimates.</Paragraph>
<Paragraph position="5"> 1There are two different standard formulations: the weak and the strong Law of Large Numbers. In the weak law, the probability converges in the limit to one (called convergence in probability).
In the strong law, the absolute difference is, with probability one, eventually less than ε (called almost sure convergence).</Paragraph> </Section> </Paper>
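The finite-sample convergence behaviour the section describes can be illustrated with a short simulation; this is a sketch, not code from the paper, and the true probability p and the sample sizes are assumptions chosen for illustration. It estimates a small Bernoulli probability (e.g. a word's relative frequency) from increasingly large samples and compares the estimation error against the Law of the Iterated Logarithm envelope σ √(2 ln ln n / n).

```python
import math
import random

random.seed(0)

# Assumed true probability of the event (e.g. a word's relative frequency).
p = 0.01
sigma = math.sqrt(p * (1 - p))  # standard deviation of a Bernoulli(p) draw

def sample_mean(n):
    """Average of n i.i.d. Bernoulli(p) draws; by the Strong Law of Large
    Numbers this converges to p almost surely as n grows."""
    return sum(random.random() < p for _ in range(n)) / n

for n in (10**3, 10**4, 10**5, 10**6):
    xbar = sample_mean(n)
    # LIL envelope for the deviation |X̄_n - p| at sample size n.
    lil = sigma * math.sqrt(2 * math.log(math.log(n)) / n)
    print(f"n={n:>8}  estimate={xbar:.5f}  "
          f"|error|={abs(xbar - p):.5f}  LIL envelope={lil:.5f}")
```

Even in this idealised i.i.d. setting the error shrinks only at roughly the √(ln ln n / n) rate, which is the paper's point: for rare events, very large n can still leave estimates unreliable.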