<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-2008">
  <Title>A very very large corpus doesn't always yield reliable estimates</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>Abstract</SectionTitle>
    <Paragraph position="0">Banko and Brill (2001) suggested that the development of very large training corpora may be more effective for progress in empirical Natural Language Processing than improving methods that use existing smaller training corpora.</Paragraph>
    <Paragraph position="1">This work tests their claim by exploring whether a very large corpus can eliminate the sparseness problems associated with estimating unigram probabilities. We do this by empirically investigating the convergence behaviour of unigram probability estimates on a one billion word corpus. When using one billion words, as expected, we do find that many of our estimates do converge to their eventual value. However, we also find that for some words, no such convergence occurs. This leads us to conclude that simply relying upon large corpora is not in itself sufficient: we must pay attention to the statistical modelling as well.</Paragraph>
  </Section>
</Paper>
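A minimal sketch, not the authors' code, of the kind of convergence check the abstract describes: track a word's unigram relative-frequency estimate at checkpoints as more of the corpus is seen, and ask whether the estimate has settled down. The checkpoint spacing, tolerance, and toy corpus below are illustrative assumptions.

    # Sketch of tracking unigram probability estimates over a growing corpus prefix.
    from collections import Counter

    def unigram_convergence(tokens, word, checkpoint=1000, tol=1e-4):
        """Record the estimate count(word)/N every `checkpoint` tokens and report
        whether the last few estimates stay within `tol` of one another."""
        counts = Counter()
        estimates = []
        for i, tok in enumerate(tokens, start=1):
            counts[tok] += 1
            if i % checkpoint == 0:
                estimates.append(counts[word] / i)
        tail = estimates[-5:]  # look at the last few checkpoints only
        converged = len(tail) == 5 and max(tail) - min(tail) <= tol
        return estimates, converged

    # Illustrative usage on a toy token stream (the paper uses ~10^9 words):
    toy_corpus = (["the", "cat", "sat"] * 4000) + (["rare"] * 7)
    estimates, converged = unigram_convergence(toy_corpus, "the")
    print(estimates[-3:], converged)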