<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-2008">
  <Title>A very very large corpus doesn't always yield reliable estimates</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> We would like to answer the question: how much training material is required to estimate the unigram probability of a given word with arbitrary con dence. This is clearly dependent on the relative frequency of the word in question. Words which appear to have similar probability estimates on small corpora can exhibit quite different convergence behaviour as the sample size increases.</Paragraph>
    <Paragraph position="1"> To demonstrate this we compiled a homogeneous corpus of 1.145 billion words of newspaper and newswire text from three existing corpora: the North American News Text Corpus, NANC (Graff, 1995), the NANC Supplement (MacIntyre, 1998) and the Reuters Corpus Volume 1, RCV1 (Rose et al., 2002). The number of words in each corpus is shown in Table 1.</Paragraph>
    <Paragraph position="2"> These corpora were concatenated together in the order given in Table 1 without randomising the individual sentence order. This emulates the process of collecting a large quantity of text and then calculating statistics based counts from the entire collection. Random shuf ing removes the discourse features and natural clustering of words which has such a signi cant in uence on the probability estimates.</Paragraph>
    <Paragraph position="3"> We investigate the large-sample convergence behaviour of words that appear at least once in a standard small training corpus, the Penn Treebank (PTB). The next section describes the convergence behaviour for words with frequency ranging from the most common down to hapax legomena.</Paragraph>
    <Paragraph position="4"> From the entire 1.145 billion word corpus we calculated the gold-standard unigram probability estimate, that is, the relative frequency for each word. We also calculated the probability estimates for each word using increasing subsets of the full corpus. These subset corpora were sampled every 5 million words up to 1.145 billion.</Paragraph>
    <Paragraph position="5"> To determine the rate of convergence to the gold-standard probability estimate as the training set increases, we plotted the ratio between the subset and gold-standard estimates. Note that the horizontal lines on all of the graphs are the same distance apart. The exception is Figure 5, where there are no lines  because there would be too many to plot within the range of the graph. The legends list the selected words with the relative frequency (as a percentage) of each word in the full corpus. Vertical lines show the boundaries between the concatenated corpora.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Empirical Convergence Behaviour
</SectionTitle>
    <Paragraph position="0"> Figure 1 shows the convergence behaviour of some very frequent closed-class words selected from the PTB. This graph shows that for most of these extremely common words, the probability estimates are accurate to within approximately 10% (a ratio of 1 0:1) of their nal value for a very small corpus of only 5 million words (the size of the rst subset sample).</Paragraph>
    <Paragraph position="1"> Some function words, for example, the and in, display much more stable probability estimates even amongst the function words, suggesting their usage is very uniform throughout the corpus. By chance, there are also some open-class words, such  as bringing,formand crucial, that also have very stable probability estimates. Examples of these are shown in Figure 2. The main difference between the convergence behaviour of these words and the function words is the ne-grained smoothness of the convergence, because the open-class words are not as uniformly distributed across each sample.</Paragraph>
    <Paragraph position="2"> Figure 3 shows the convergence behaviour of commonplace words that appear in the PTB between 30 and 100 times each. Their convergence behaviour is markedly different to the closed-class words. We can see that many of these words have very poor initial probability estimates, consistently low by up to a factor of almost 50%, ve times worse than the closed-class words.</Paragraph>
    <Paragraph position="3"> speculation is an example of convergence from a low initial estimate. After approximately 800 million words, many (but not all) of the estimates are correct to within about 10%, which is the same error as high frequency words sampled from a 5 million words corpus. This is a result of the sparse distribution of these words and their stronger context dependence. Their relative frequency is two to three orders of magnitude smaller than the relative frequencies of the closed-class words in Figure 1.</Paragraph>
    <Paragraph position="4"> What is most interesting is the convergence behaviour of rare but not necessarily unusual words, which is where using a large corpus should be most bene cial in terms of reducing sparseness. Figure 4 shows the very large corpus behaviour of selected hapax legomena from the PTB. Many of the words in this graph show similar behaviour to Figure 3, in that some words appear to converge relatively smoothly to an estimate within 20% of the nal value. This shows the improvement in stability of the estimates from using large corpora, although 20% is a considerable deviation from the gold-standard estimate.</Paragraph>
    <Paragraph position="5"> However, other words, for instance tightness, fail spectacularly to converge to their nal estimate before the in uence of the forced convergence of the ratio starts to take effect. tightness is an extreme example of the case where a word is seen very rarely, until it suddenly becomes very popular. A similar convergence behaviour can be seen for words with a very high initial estimate in Figure 5. The maximum decay ratio curve is the curve we would see if a word appeared at the very beginning of the corpus, but did not appear in the remainder of the corpus. A smooth decay with a similar gradient to the maximum decay ratio indicates that the word is extremely rare in the remainder of the corpus, after a high initial estimate. rebelled, kilometers and coward are examples of exceedingly high initial estimates, followed by very rare or no other occurrences. extremists,shellingand cricket are examples of words that were used more consistently for a period of time in the corpus, and then failed to appear later, with cricket having two periods of frequent usage.</Paragraph>
    <Paragraph position="6"> Unfortunately, if we continue to assume that a unigram model is correct, these results imply that we cannot be at all con dent about the probability estimates of some rare words even with over one billion words of material. We cannot dismiss this as an unreliable low frequency count because tightness occurs 2652 times in the full corpus. Thus we must look for an alternative explanation: and the most reasonable explanation is burstiness, the fact that word occurrence is not independent and identically distributed. So given that one billion words does not always yield reliable estimates for rare but not unusual words, it leaves us to ask if any nite number of words could accurately estimate the probability of pathologically bursty word occurrences.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> It is worth re ecting on why some words appear to have more bursty behaviour than others. As we would expect, function words are distributed most evenly throughout the corpus. There are also some content words that appear to be distributed evenly.</Paragraph>
    <Paragraph position="1"> On the other hand, some words appear often in the rst 5 million word sample but are not seen again in the remainder of the corpus.</Paragraph>
    <Paragraph position="2"> Proper names and topic-speci c nouns and verbs  exhibit the most bursty behaviour, since the newspaper articles are naturally clustered together according to the chronologically grouped events. The most obvious and expected conditioning of the random variables is the topic of the text in question.</Paragraph>
    <Paragraph position="3"> However, it is hard to envisage seemingly topicneutral words, such as tightness and newly, being conditioned strongly on topic. Other factors that apply to many different types of words include the stylistic and idiomatic expressions favoured by particular genres, authors, editors and even the in-house style guides.</Paragraph>
    <Paragraph position="4"> These large corpus experiments demonstrate the failure of simple Poisson models to account for the burstiness of words. The fact that words are not distributed by a simple Poisson process becomes even more apparent as corpus size increases, particularly as the effect of noise and sparseness on the language model is reduced, giving a clearer picture of how badly the current language models fail. With a very large corpus it is obvious that the usual independence assumptions are not always appropriate.</Paragraph>
    <Paragraph position="5"> Using very large corpora for simple probability estimation demonstrates the need for more sophisticated statistical models of language. Without better models, all that training upon large corpora can achieve is better estimates of words which are approximately i.i.d.</Paragraph>
    <Paragraph position="6"> To fully leverage the information in very large corpora, we need to introduce more dependencies into the models to capture the non-stationary nature of language data. This means that to gain a signi cant advantage from large corpora, we must develop more sophisticated statistical language models.</Paragraph>
    <Paragraph position="7"> We should also brie y mention the other main bene t of increasing corpus size: the acquisition of occurrences of otherwise unseen words. Previously unseen linguistic events are frequently presented to NLP systems. To handle these unseen events the statistical models used by the system must be smoothed. Smoothing typically adds considerable computational complexity to the system since multiple models need to be estimated and applied together, and it is often considered a black art (Chen and Goodman, 1996). Having access to very large corpora ought to reduce the need for smoothing, and so ought to allow us to design simpler systems.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> The dif culty of obtaining reliable probability estimates is central to many NLP tasks. Can we improve the performance of these systems by simply using a lot more data? As might be expected, for many words, estimating probabilities on a very large corpus can be valuable, improving system performance signi cantly. This is due to the improved estimates of sparse statistics, made possible by the relatively uniform distribution of these words.</Paragraph>
    <Paragraph position="1"> However, there is a large class of commonplace words which fail to display convergent behaviour even on very large corpora. What is striking about these words is that pro cient language users would not recognise them as particularly unusual or specialised in their usage, which means that broad-coverage NLP systems should also be expected to handle them competently.</Paragraph>
    <Paragraph position="2"> The non-convergence of these words is an indication of their non-stationary distributions, which a simple Poisson model is unable to capture. Since it is no longer a problem of sparseness, even exceptionally large corpora cannot be expected to produce reliable probability estimates. Instead we must relax the independence assumptions underlying the existing language models and incorporate conditional information into the language models.</Paragraph>
    <Paragraph position="3"> To fully harness the extra information in a very large corpus we must spend more, and not less, time and effort developing sophisticated language models and machine learning systems.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Further Work
</SectionTitle>
    <Paragraph position="0"> We are particularly interested in trying to characterise the burstiness tendencies of individual words and word classes, and the resulting convergence behaviour of their probability estimates. An example of this is calculating the area between unity and the ratio curves. Some example words with different convergence behaviour selected using this area measure are given in Table 2 in the Appendix. We are also interested in applying the exponential models of lexical attraction and repulsion described by Beeferman et al. (1997) to the very large corpus.</Paragraph>
    <Paragraph position="1"> We would like to investigate the overall error in the probability mass distribution by comparing the whole distributions at each sample with the nal distribution. To estimate the error properly will require smoothing methods to be taken into consideration.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Acknowledgements
</SectionTitle>
    <Paragraph position="0"> We would like to thank Marc Moens, Steve Finch, Tara Murphy, Yuval Krymolowski and the many anonymous reviewers for their insightful comments that have contributed signi cantly to this paper.</Paragraph>
    <Paragraph position="1"> This research is partly supported by a Commonwealth scholarship and a Sydney University Travelling scholarship.</Paragraph>
  </Section>
</Paper>