<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2013">
  <Title>Enhanced Good-Turing and Cat-Cal: Two New Methods for Estimating Probabilities of English Bigrams (abbreviated version)</Title>
  <Section position="3" start_page="82" end_page="84" type="metho">
    <SectionTitle>
2. Estimation Methods
</SectionTitle>
    <Paragraph position="0"> Let r* be the adjusted frequency for a type observed r times. Then p, the probability of the type, is estimated by r*/N. In order to satisfy the constraint Σ p = 1, the adjusted frequencies must satisfy Σ_r N_r r* = N. Two such methods will be considered at length: the Good-Turing Method (GT) and the Categorize-Calibrate Method (CC).</Paragraph>
    <Paragraph position="1"> These methods are considerably better than the Maximum Likelihood Estimator (MLE): r* = r. The main problem with MLE is that bigrams will be assigned zero probability if they didn't happen to occur in the training sample. Moreover, there are large errors when the counts are small (e.g., less than 20). In addition, the MLE fails to distinguish among bigrams with the same count. In our application there are billions of bigrams with a count of zero, some of which are much more likely than others. Their probability is neither zero nor identical.</Paragraph>
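    <Paragraph> As a concrete illustration of the relation p = r*/N and the normalization constraint, here is a minimal Python sketch (the function and variable names are ours, not the paper's): it converts a table of adjusted frequencies into probabilities after checking that Σ_r N_r r* = N.
def to_probabilities(adjusted_freq, freq_of_freq, n_tokens):
    """Convert adjusted frequencies r* into probabilities p = r*/N.

    adjusted_freq maps r to r*, freq_of_freq maps r to N_r, and
    n_tokens is N, the size of the training sample.  The constraint
    sum_r N_r r* = N guarantees that the probabilities of all types
    sum to one.
    """
    total = sum(freq_of_freq[r] * r_star for r, r_star in adjusted_freq.items())
    if abs(total - n_tokens) > 1e-6 * n_tokens:
        raise ValueError("adjusted frequencies violate the constraint sum_r N_r r* = N")
    return {r: r_star / n_tokens for r, r_star in adjusted_freq.items()}
    </Paragraph>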
    <Section position="1" start_page="82" end_page="84" type="sub_section">
      <SectionTitle>
2.1 The Basic Good-Turing and Cat-Cal Methods
</SectionTitle>
      <Paragraph position="0"> We use the adjective basic to distinguish these methods from the enhanced methods that will be discussed in the next section. The main difference is that basic methods treat bigrams as atomic objects with no internal structure; enhanced methods will "back-off" and use the unigram model when appropriate.</Paragraph>
      <Paragraph position="1"> The Good-Turing method has been used very successfully by the IBM speech recognition group (Nádas, 1984; Nádas, 1985; Katz, 1987). The key insight, suggested by Turing and developed by Good (1953), is the use of N_r, the number of bigrams which occur r times. We may refer to N_r as the frequency of frequency r. The GT estimate is r* = (r+1) N_{r+1} / N_r, and it has a variance of r*(1 + (r+1)* - r*).</Paragraph>
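      <Paragraph> A minimal Python sketch of the basic Good-Turing computation follows; it assumes only a table of frequencies of frequencies, performs no smoothing of the N_r, and the function name and toy counts are purely illustrative (they are not the counts from the paper's table).
def basic_good_turing(freq_of_freq):
    """Basic Good-Turing adjusted frequencies r* = (r + 1) N_{r+1} / N_r.

    freq_of_freq maps r to N_r, the number of bigram types observed
    exactly r times (including r = 0).  No smoothing is applied, so the
    result is only trustworthy where N_r and N_{r+1} are both large.
    """
    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_r_plus_1 = freq_of_freq.get(r + 1, 0)
        if n_r > 0:
            adjusted[r] = (r + 1) * n_r_plus_1 / n_r
    return adjusted

# Invented counts, for illustration only.
example = {0: 160_000_000_000, 1: 500_000, 2: 100_000, 3: 40_000}
print(basic_good_turing(example))
      </Paragraph>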
      <Paragraph position="2"> In practice it is necessary to use smoothed estimates of N_r instead of raw observations, especially when N_r is small. (Smoothing will not be discussed in this paper in order to save space.) The following table illustrates a use of the basic GT estimate (BGT). (This example was selected so that the N_r's are large enough that smoothing is not too important.) The adjusted frequencies, r*, can be compared to the raw frequencies, r; they have the same order, and do not differ greatly. The GT method assigns some probability to bigrams which have not been seen, suggesting that we should act as if we had seen each of them 0.0000128 times instead of zero times. In order to compensate for moving 160 billion bigrams from 0 to 0.0000128, some other bigrams must be adjusted downwards. In this case, all bigrams with r &gt; 0 will be adjusted downwards.</Paragraph>
      <Paragraph position="3"> Notice that the calculation of r* for r = 0 depends on N_0, the number of bigrams that we have not seen. We can calculate N_0 because V is provided by the unigram model. (This marks a great difference in our application of the Good-Turing formula from many applications in population biology, where inferences about the population size are the desideratum.) The total universe of bigrams that we wish to know about has size V^2 = 1.6 x 10^11. N_0 is the difference between V^2 and the number of distinct bigrams seen, Σ_{r&gt;0} N_r. Note that N_0 ≈ V^2, since V^2 &gt; N_0 &gt; V^2 - N and N &lt;&lt; V^2. In other words, most bigrams have not been seen. In our experience, the problem only gets worse as we look at larger corpora because V^2 tends to grow faster than N.</Paragraph>
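      <Paragraph> A small sketch of the N_0 calculation under the same assumptions (again with names of our own choosing; the observed bigram counts are assumed to be available as a dictionary):
def count_unseen_bigrams(vocab_size, bigram_counts):
    """N_0 = V^2 minus the number of distinct bigrams actually seen.

    vocab_size is V, supplied by the unigram model; bigram_counts maps
    each observed bigram (w1, w2) to its count in the training text.
    """
    seen_types = sum(1 for count in bigram_counts.values() if count > 0)
    return vocab_size ** 2 - seen_types
      </Paragraph>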
      <Paragraph position="4"> GT improves on MLE by making use of more information, namely {N_r}. CC gathers even more information. The training text is divided into two halves. Categorize each bigram, b, by its observed frequency r1(b) in the first part of the text. Denote the number of distinct bigrams in the category by N_r = Σ_{b: r1(b)=r} 1. Calibrate the category by counting all occurrences of all the bigrams in the category in the second part of the text, C_r = Σ_{b: r1(b)=r} r2(b), where r2(b) is the observed frequency of the bigram, b, in the second half. The adjusted frequency is then r* = C_r / N_r. The only assumption behind this method is that both samples are generated by the same process. This assumption is weaker than the binomial assumption of GT. We refer to this method as the basic Cat-Cal method (BCC); the next section will consider an enhanced version that makes use of the bigrams' internal structure. The adjusted frequencies for the BCC can be compared to the adjusted frequencies for the BGT as well as to the MLE. The differences between the BCC and the BGT are limited to the third significant figure, while the differences of either from the MLE are in the first significant figure. The fifth column of the table, labeled repeat, contains the results of repeating the basic Cat-Cal method after exchanging the texts used for categorization and calibration. The differences are again limited to the third significant figure, showing that BCC agrees well with our standard. We originally established the Cat-Cal method as a standard against which to compare other methods. However, we came to realize that it could itself be used as a practical method. Thus Cat-Cal plays two roles: standard and potential method.</Paragraph>
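      <Paragraph> The following is a minimal Python sketch of the basic Cat-Cal computation for the seen categories (r greater than zero); the function and variable names are ours, and the two arguments are assumed to be sequences of bigram tokens drawn from the two halves of the training text.
from collections import Counter, defaultdict

def basic_cat_cal(first_half_bigrams, second_half_bigrams):
    """Basic Cat-Cal adjusted frequencies r* = C_r / N_r.

    Categorize each bigram b by its count r1(b) in the first half; N_r
    is the number of distinct bigrams in category r, and C_r is the
    total count of those same bigrams in the second half.  The r = 0
    category is omitted here because enumerating all V^2 unseen types
    with a direct loop is impractical.
    """
    r1 = Counter(first_half_bigrams)
    r2 = Counter(second_half_bigrams)
    n_r = defaultdict(int)   # N_r: number of distinct bigrams with r1(b) = r
    c_r = defaultdict(int)   # C_r: total second-half count of those bigrams
    for b, r in r1.items():
        n_r[r] += 1
        c_r[r] += r2[b]
    return {r: c_r[r] / n_r[r] for r in n_r}
      </Paragraph>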
      <Paragraph position="5"> The CC method can be extended to compute variances as illustrated below. Note that the variances computed by the CC method agree closely with those computed by GT.</Paragraph>
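      <Paragraph> The paper does not spell out the Cat-Cal variance formula in this abbreviated version, so the sketch below rests on an explicit assumption of ours: it treats r* as the mean of the second-half counts within a category and reports the variance of that mean, i.e. the within-category sample variance divided by N_r.
from collections import defaultdict

def cat_cal_variance(r1_counts, r2_counts):
    """Assumed Cat-Cal variance: within-category sample variance of the
    second-half counts r2(b), divided by N_r (the variance of the mean).

    r1_counts and r2_counts map bigrams to their counts in the first
    and second halves of the training text.
    """
    groups = defaultdict(list)
    for b, r in r1_counts.items():
        groups[r].append(r2_counts.get(b, 0))
    variances = {}
    for r, values in groups.items():
        n = len(values)
        if n > 1:
            mean = sum(values) / n
            sample_var = sum((v - mean) ** 2 for v in values) / (n - 1)
            variances[r] = sample_var / n
    return variances
      </Paragraph>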
      <Paragraph position="6"> A key suggestion of this work is the introduction of a second predictor of frequency, in addition to the observed frequency; accounting for the second predictor constitutes what we call an enhanced method. We study an enhanced Good-Turing method and an enhanced Cat-Cal method. Both enhanced methods allow us to differentiate among the many bigrams which have not been seen. We will show that about 1200 significantly different probabilities can be estimated for bigrams not seen in the training text.</Paragraph>
      <Paragraph position="7"> A possible second predictor for bigrams is the following: jii = N e(p(x)) e(p(y)), where e(p(x)) and e(p(y)) are the unigram model's estimates of the probability of the first and second word in the bigram. jii is an acronym for "joint if independent". We refer to values of jii as "Unigram Estimates (UE)" when we compare them to other estimates such as MLE or GT. In many of the following plots, we group bigrams into approximately 35 bins using the binning rule: j = floor(3 log10 jii).</Paragraph>
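      <Paragraph> A small sketch of the second predictor and the binning rule (the function names are illustrative, and the unigram probabilities are assumed to come from the unigram model):
import math

def jii(n_tokens, p_first, p_second):
    """Joint-if-independent predictor: N * e(p(x)) * e(p(y)), where
    p_first and p_second are the unigram model's probability estimates
    for the first and second word of the bigram."""
    return n_tokens * p_first * p_second

def jii_bin(jii_value):
    """Bin index j = floor(3 * log10(jii)), the rule used to group
    bigrams into roughly 35 bins."""
    return math.floor(3.0 * math.log10(jii_value))
      </Paragraph>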
      <Paragraph position="8"> Other second predictors are possible. We do not know what makes one variable better than another for grouping. A necessary property of the grouping variable is that it be possible to count the number of types included in each group, because we need to know N_0. We hypothesize that if one variable predicts r better than another, then it will make a better grouping variable. It is useful for smoothing that jii is a continuous variable.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="84" end_page="86" type="metho">
    <SectionTitle>
3. Qualitative Evaluation
</SectionTitle>
    <Paragraph position="0"> We find that both the GT and CC estimates agree very well with the standard estimates over the entire range of data that we can test. The smallest frequency observations are the most critical. The following figure shows the results for r = 1. Five predicted frequencies are shown in this and the following figures: (1) the standard, S, shown by points, (2) the maximum likelihood estimate, MLE, shown by long dashes, (3) the unigram estimate, UE, shown by long dashes, (4) the enhanced Cat-Cal estimate, CC, shown by a solid line, and (5) the enhanced Good-Turing estimate, GT, shown by short dashes. These estimates are plotted against the logarithm of the unigram estimator, jii. Note that CC and GT agree closely with the standard. They are quite distinct from either the MLE or UE but lie approximately between these two primary estimators.</Paragraph>
    <Paragraph position="2"> For frequency zero, the range of CC and GT is about five orders of magnitude, four orders of magnitude larger than for any other frequency. Over this range, both GT and CC agree well with the standard estimates. At the resolution shown, there is no visible difference between the three estimates for most of the range.</Paragraph>
    <Paragraph position="3">  Enhanced Good-Turing and Cat-Cal Agree with the Standard for Small r</Paragraph>
    <Paragraph position="5"> Note that r* depends more on jii when r is small; the slope of r* is very steep for r = 0, and fairly flat for r = 17. This means that UE is more important when r is small. We will return to this when we consider the number of significantly different probabilities.</Paragraph>
  </Section>
  <Section position="5" start_page="86" end_page="88" type="metho">
    <SectionTitle>
4. Quantitative Evaluation
</SectionTitle>
    <Paragraph position="0"> It is natural to evaluate methods with a t-score t_jr = (r'_jr - r_jr) / σ_jr, where r'_jr is an estimate produced by one of the proposed methods for bin j and frequency r, r_jr is the standard for the same jr cell, and σ_jr is the standard deviation for the same jr cell. We use the GT method to estimate the standard deviation because it appears to match the CC variance while being less noisy and defined in more cells.</Paragraph>
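    <Paragraph> A sketch of this evaluation in Python, assuming the method's estimates, the standard's estimates, and the GT standard deviations are available per (j, r) cell; the data layout and names are ours.
import math
from collections import defaultdict

def rms_t_by_bin(estimates, standards, sigmas):
    """RMS t-score within each jii bin.

    estimates, standards and sigmas are dicts keyed by (j, r) cells,
    giving the method's adjusted frequency, the standard's adjusted
    frequency, and the standard deviation (estimated by GT) for that
    cell.  Returns a dict mapping bin j to the RMS of t_jr over r.
    """
    squared = defaultdict(list)
    for (j, r), estimate in estimates.items():
        t = (estimate - standards[(j, r)]) / sigmas[(j, r)]
        squared[j].append(t * t)
    return {j: math.sqrt(sum(values) / len(values)) for j, values in squared.items()}
    </Paragraph>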
    <Paragraph position="1"> We have some expectations about these t-scores. A perfect predictor would give an RMS t-score of about one, because the variance of one standard observation is used as the denominator. We find that GT is nearly perfect, with RMS t-scores very close to one except for small r. In contrast, CC is not perfect anywhere, because both the categorization and the calibration samples have the assumed variance. However, when r is very small, it appears that the binomial assumption is inappropriate, and consequently the more empirical, though imperfect, CC method is preferable.</Paragraph>
    <Paragraph position="2"> The two plots below show the RMS t-value averaged within each jii bin. The solid lines compare CC with the standard; the short dashed lines compare GT with the standard. The best performance theoretically possible is an RMS error of one, shown by a long dashed line in each panel. GT approaches this ideal quickly, though CC is preferable at very small frequencies. The CC values in the upper panel are adjusted for sample size to be comparable to GT values.</Paragraph>
    <Paragraph position="3"> Comparison of the Enhanced Good-Turing and Cat-Cal Methods: Cat-Cal is better for small r and worse for large r. The following plot shows that MLE does not reach ideal performance within the range shown. Moreover, for frequencies less than about 40, MLE is substantially worse than GT. Over the smallest ten frequencies the MLE has RMS t-values ranging from five to thirty times those of the enhanced Good-Turing estimates.</Paragraph>
    <Paragraph position="4"> Comparison of the Enhanced Good-Turing and MLE Methods: Good-Turing is better, especially when r is small.</Paragraph>
  </Section>
  <Section position="6" type="metho">
    <SectionTitle>
5. How Many Significantly Different Probabilities?
</SectionTitle>
    <Paragraph position="0"> In this section we show that estimates in adjacent jii bins differ quite significantly. This implies that interpolation is justified, and leads to an estimate of the equivalent number of significantly different estimates.</Paragraph>
    <Paragraph position="7"> For each jii bin, let f_jr denote a frequency estimated for bigrams in the jth bin and frequency r. Let σ²_jr be the variance of f_jr. The following figure investigates the t-score t = (f_{j+1,r} - f_{j,r}) / sqrt(σ²_{j+1,r} + σ²_{j,r}) for the particularly important case of r = 0. The solid line shows the t-statistics for CC; the short dashed line shows the GT differences. Long dashed lines are drawn at conventional significance levels of ±1.65. These differences are highly significant, indicating that interpolation between the observed values is justified. We estimate the equivalent number of significantly different values by taking the sum of all the t-statistics and dividing by 1.65. For r = 0, the equivalent number of significantly different values is 1245.</Paragraph>
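    <Paragraph> A sketch of this calculation follows; the exact form of the adjacent-bin t-score is reconstructed from the surrounding discussion (the difference between estimates in adjacent bins, divided by the combined standard deviation), so it should be read as an assumption, and the names are illustrative.
import math

def equivalent_significant_values(f, var, r=0, threshold=1.65):
    """t-scores between adjacent jii bins for a fixed r, and the
    equivalent number of significantly different values.

    f maps (j, r) to the estimated frequency for bin j; var maps (j, r)
    to the variance of that estimate.  The adjacent-bin t-score is
    assumed to be t_j = (f[j+1] - f[j]) / sqrt(var[j+1] + var[j]); the
    equivalent number of values is the sum of |t_j| divided by 1.65.
    """
    bins = sorted(j for (j, rr) in f if rr == r)
    t_scores = []
    for j0, j1 in zip(bins, bins[1:]):
        denom = math.sqrt(var[(j1, r)] + var[(j0, r)])
        t_scores.append((f[(j1, r)] - f[(j0, r)]) / denom)
    equivalent = sum(abs(t) for t in t_scores) / threshold
    return equivalent, t_scores
    </Paragraph>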
    <Paragraph position="8"> About 1200 Significantly Different Probabilities for r = 0</Paragraph>
    <Paragraph position="10"> The following figure shows the equivalent number of significant differences as a function of frequency.</Paragraph>
    <Paragraph position="11"> The dashed lines are drawn at log10 1 and log10 2. While the number of significantly different values falls rapidly with increasing r, it remains above two through r = 40, and continues to be greater than one even through frequency 100. This range encompasses the majority of bigram tokens, indicating that enhancement is of considerable value for practical applications.</Paragraph>
    <Paragraph position="12"> Equivalent Number of Significantly Different Probabilities: we can distinguish bigrams with the same frequency very well for small frequencies.</Paragraph>
  </Section>
</Paper>