<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2013">
<Title>Enhanced Good-Turing and Cat-Cal: Two New Methods for Estimating Probabilities of English Bigrams (abbreviated version)</Title>
<Section position="6" start_page="88" end_page="89" type="concl">
<SectionTitle> 6. Conclusions </SectionTitle>
<Paragraph position="0"> This paper has proposed two specific methods for backing off bigram probability estimates to unigram probabilities: the enhanced Good-Turing method and the Cat-Cal method. Three important points extend the strength of these methods over previous methods: * the use of a second predictor (e.g., jii) to exploit the structure of n-grams, the feature that distinguishes the enhanced Good-Turing method from the basic Good-Turing method;</Paragraph>
<Paragraph position="1"> * the estimation of variances for the bigram probabilities, which allows significance tests to be built for various practical applications and, in particular, allows * the use of refined testing methods that can show important qualitative differences even when quantitative differences are small.</Paragraph>
<Paragraph position="2"> The use of a second predictor is the basis on which we distinguish the enhanced Good-Turing method (GT) proposed here from the basic Good-Turing method, and the enhanced Cat-Cal (CC) from a basic Cat-Cal. If we had not introduced a second predictor, all bigrams observed once would be considered equally likely, all bigrams observed twice would be considered equally likely, and so on. This is extremely undesirable. Note that there are a large number of bigrams that have been seen just once (2,053,146 in a training corpus of 22 million words); we do not want to model all of them as equally probable. Much worse, there are a very large number of bigrams that have not been seen at all (160 billion in the same training corpus of 22 million words); we really do not want to model all of them as equally probable. By introducing the second predictor jii, we were able to make much finer distinctions within groups of bigrams with the same number of observations r. In particular, for bigrams not seen in the training corpus, we have about 1200 significantly different estimates.</Paragraph>
<Paragraph position="3"> It would be interesting to consider other variables besides jii. One might consider, for example, the number of letters in the bigram. Katz (1987) proposes an alternative variable: the first word of the n-gram. Any variable that is not completely correlated with r would be of some use. jii has some advantages: it makes it possible to summarize the data so concisely that the relevant structure can be observed in a simple plot. Moreover, jii has a natural order and is continuous, so the number of bins can be adjusted for accuracy. In contrast, selecting the first word of the n-gram prescribes the number of bins.</Paragraph>
<Paragraph position="4"> The second point, the calculation of variances, is often not discussed in the literature on using the Good-Turing model for language modeling. Variances are necessary to make statements about the statistical significance of differences between observed and predicted frequencies. In other work (Church, Gale, Hanks, and Hindle, 1989), we have used variances to distinguish unusual n-grams from chance.</Paragraph>
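To make the first two points concrete, here is a minimal, hypothetical sketch (not the authors' implementation): it computes adjusted Good-Turing counts r* = (r+1) * N_{r+1} / N_r separately within bins of a second predictor, and forms a t-score once a variance estimate is available. The predictor second_pred (the bigram's expected frequency under independence), the log-scale binning, the sparse-bin fallback, and the number of bins are all illustrative assumptions; the paper's jii variable and its variance formulas for the bigram probabilities are not reproduced here.

```python
from collections import Counter, defaultdict
from math import log, sqrt


def bigram_counts(tokens):
    """Raw bigram counts from a list of word tokens."""
    return Counter(zip(tokens, tokens[1:]))


def second_pred(bigram, uni, n_tokens):
    """Hypothetical second predictor: the bigram's expected frequency if its
    two words were independent (a stand-in for the paper's jii)."""
    x, y = bigram
    return uni[x] * uni[y] / n_tokens


def enhanced_good_turing(tokens, n_bins=10):
    """Adjusted counts r* computed within (predictor-bin, r) groups.

    Basic Good-Turing sets r* = (r + 1) * N_{r+1} / N_r, where N_r is the
    number of distinct bigrams seen exactly r times.  Here N_r and N_{r+1}
    are tallied separately per predictor bin, so bigrams with the same r
    can receive different estimates."""
    uni = Counter(tokens)
    big = bigram_counts(tokens)
    n_tokens = len(tokens)

    def bin_of(bg):
        # Coarse log-scale bin of the predictor (illustrative choice).
        return min(n_bins - 1, int(log(1.0 + second_pred(bg, uni, n_tokens))))

    # freq_of_freq[b][r] = number of distinct bigrams in bin b seen r times.
    freq_of_freq = defaultdict(Counter)
    for bg, r in big.items():
        freq_of_freq[bin_of(bg)][r] += 1

    adjusted = {}
    for bg, r in big.items():
        nr = freq_of_freq[bin_of(bg)]
        if nr[r + 1] > 0:
            adjusted[bg] = (r + 1) * nr[r + 1] / nr[r]
        else:
            # Sparse bin: no bigrams seen r+1 times, so fall back to the raw
            # count (a real implementation would smooth the N_r values).
            adjusted[bg] = float(r)
    return adjusted


def t_score(observed, predicted, variance):
    """t = (observed - predicted) / sqrt(variance); |t| well above ~2 suggests
    a difference unlikely to be due to chance alone."""
    return (observed - predicted) / sqrt(variance)
```

The per-bin tallies are the crux: two bigrams with the same observed count r but very different predictor values fall into different bins and therefore receive different adjusted counts, which is exactly the finer distinction described above.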
<Paragraph position="5"> The third point we want to emphasize, the use of refined tests for differences between methods, is discussed in section 4. Four methods (MLE, UE, CC, and GT) were compared against the standard: t-scores were calculated for the differences between the standard and each proposed method, and the results were aggregated across jii. We find that the GT method rapidly approaches ideal performance, although it is outperformed by CC when r is very small, presumably because the binomial assumption is not quite satisfied at such low frequencies.</Paragraph>
<Paragraph position="6"> There are many ways in which the language model presented here could be improved. We have said very little about the unigram model; in fact, the unigram model was estimated with the MLE method. One could apply the methodology developed here to improve greatly on this. One could also obtain much better estimates by starting with a better sample; the 1988 AP corpus is not a balanced sample of general English. This paper is primarily concerned with developing methods and evaluation procedures; in future work, we hope to use these results to construct better language models.</Paragraph>
</Section>
</Paper>