<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1027">
  <Title>Empirical Estimates of Adaptation: The chance of Two Noriegas is closer to p/2 than p^2</Title>
  <Section position="2" start_page="0" end_page="183" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Adaptive language models were introduced in the Speech Recognition literature to model repetition.</Paragraph>
    <Paragraph position="1"> Jelinek (1997, p. 254) describes cache-based models which combine two estimates of word (ngram) probabilities, Prl., a local estimate based on a relatively small cache of recently seen words, and PrG, a global estimate based on a large training corpus.</Paragraph>
    <Paragraph position="2">  1. Additive: Pr A (w)= XPrL (w) +(1 - X) PrG (w) 2. Case-based: JX I Pr/.(w) if w~ cache &amp;quot; Pr G (w) otherwise PF c (W) = 1~ 2  Intuitively, if a word has been mentioned recently, then (a) the probability of that word (and related words) should go way up, and (b) many other words should go down a little. We will refer to (a) as positive adaptation and (b) as negative adaptation. Our empirical experiments confirm the intuition that positive adaptation, Pr(+adapt), is typically much larger than neg- null ative adaptation, Pr( - adapt). That is, Pr( +adapt) &gt;&gt; Pr(prior) &gt; Pr(-adapt). Two methods, Pr( + adapt t ) and Pr( + adapt2), will be introduced for estimating positive adaptation. 1. Pr( +adapt 1)=PrOve test\[w~ history) 2. Pr(+adapt2)=Pr(k'&gt;_2lk&gt;_l )=d.f2/dfl  The two methods produce similar results, usually well within a factor of two of one another. The first lnethod splits each document into two equal pieces, a hislory portion and a test portion. The adapted probabilities are modeled as the chance that a word will appeal&amp;quot; in the test portion, given that it appeared in the history. The second method, suggested by Church and Gale (1995), models adaptation as the chance of a second lnention (probability that a word will appear two or inore times, given that it appeared one or more times). Pr(+adapt2) is approximated by dJ2/dfl, where c./\['k is the number of documents that contain the word/ngram k or more times. (dfa. is a generalization of document .frequeno,, d.f~ a standard term in information Retrieval.) Both inethods are non-parametric (unlike cache lnodels). Parametric assumptions, when appropriate, can be very powerful (better estimates from less training data), but errors resulting from inappropriate assumptions can outweigh tile benefits. In this elnpirical investigation of the magnitude and shape o1' adaptation we decided to use conservative non-parametric methods to hedge against the risk of inappropriate parametric assumptions.</Paragraph>
    <Paragraph position="3"> The two plots (below) illustrate some of the reasons for being concerned about standard parametric assumptions. The first plot shows the number of times that tile word &amp;quot;said&amp;quot; appears ill each of the 500 documents ill the Brown Corpus (Francis &amp; Kucera, 1982). Note that there are quite a few documents with more than 15 instances of &amp;quot;said,&amp;quot; especially in Press and Fiction. There are also quite a few documents with hardly any instances of &amp;quot;said,&amp;quot; especially in the Learned genre. We have found a similar pattern in other collections; &amp;quot;said&amp;quot; is more common in newswire (Associated Press and Wall Street Journal) than technical writing (Department of Energy abstracts).</Paragraph>
    <Paragraph position="4">  The second plot (below) conlpares these f:hown Corpus o\[-)sorvations to a Poisson. Tile circles indicate the nulnber of docun-ierits that have .r instances of &amp;quot;said.&amp;quot; As mentioned above, Press and Fiction docunlents can lilentioil &amp;quot;said&amp;quot; 15 times or lllore, while doculllOlltS in the Learned genre might not mention the word at all. The line shows what woukt be expected under a Poisson.</Paragraph>
    <Paragraph position="5"> Clearly the line does not fit the circles very well. The probability of &amp;quot;said&amp;quot; depends on many factors (e.g, genre, topic, style, author) that make the distributions broader than chance (Poisson). We lind especially broad distributions for words that adapt a lot.</Paragraph>
    <Paragraph position="6"> &amp;quot;said&amp;quot; in Brown Corpus</Paragraph>
    <Section position="1" start_page="180" end_page="182" type="sub_section">
      <Paragraph position="2"> We will show that adaptation is huge.</Paragraph>
      <Paragraph position="3"> Pr(+ adapt) is ot'ten several orders of magnitude larger than Pr(prior). In addition, we lind that Pr(+adapt) has a very different shape fiom Pr(prior). By construction, Pr(prior) wu'ies over many orders o1' magnitude depending on the frequency of the word. Interestingly, though, we find that Pr(+adapt) has ahnost no dependence on word frequency, although there is a strong lexical dependence. Some words adapt more than others. The result is quite robust. Words that adapt more in one cortms also tend to adapt more in another corpus of similar material. Both the magnitude and especially the shape (lack of dependence on fiequency as well as dependence on content) are hard to capture ill an additive-based cache model.</Paragraph>
      <Paragraph position="4"> Later in the paper, we will study neighbmw, words that do not appear in the history but do appear in documents near the history using an information retrieval notion of near. We find that neighbors adapt more than non-neighbors, but not as much as the history. The shape is in between as well. Neighbors have a modest dependency on fiequency, more than the history, but not as much as the prior.</Paragraph>
      <Paragraph position="5"> Neighbors are an extension of Florian &amp; Yarowsky (1999), who used topic clustering to build a language model for contexts such as: &amp;quot;It is at least on the Serb side a real setback to lhe x.&amp;quot; Their work was motivated by speech recognition apl)lications where it would be desirable for the hmguage model to l'avor x = &amp;quot;peace&amp;quot; over x = &amp;quot;piece.&amp;quot; Obviously, acoustic evidence is not very hell~l'tfl in this case. Trigrams are also not very helpful because the strongest clues (e.g., &amp;quot;Serb,&amp;quot; &amp;quot;side&amp;quot; and &amp;quot;setback&amp;quot;) are beyond the window of three words. Florian &amp; Yarowsky cluster documents into about 102 topics, and compute a separate trigram language model for each topic. Neighbors are similar in spirit, but support more topics.</Paragraph>
      <Paragraph position="6"> 2. Estimates of Adaptation: Method 1 Method 1 splits each document into two equal pieces. The first hall' of each document is referred to as the histoo, portion of the document and the second half of each document is referred to as the test portion of the documenl. The task is to predict the test portion of the document given the histm3,. We star! by computing a contingency table for each word, as ilhlstrated below: l)ocuments containing &amp;quot;hostages&amp;quot; in 1990 AP test test history a =638 b =505 histoo, c =557 d =76787 This table indicates that there are (a) 638 documents with hostages in both the first half (history) and the second half (test), (b) 505 documents with &amp;quot;hostages&amp;quot; in just the first half, (c) 557 documents with &amp;quot;hostages&amp;quot; in just the second halt', and (d) 76,787 documents with &amp;quot;hostages&amp;quot; in neither half. Positive and negative adaptation are detined in terms a, b, c and d.</Paragraph>
      <Paragraph position="8"> Adapted probabilities will be compared to:</Paragraph>
      <Paragraph position="10"> Positive adaptation tends to be much large, than the prior, which is just a little larger than negative adaptation, as illustrated in the table below for the word &amp;quot;hostages&amp;quot; in four years of the Associated Press (AP) newswire. We find remarkably consistent results when we compare one yea,&amp;quot; of the AP news to another (though topics do come and go over time). Generally, the differences of interest are huge (orders of magnitude) compared to the differences among various control conditions (at most factors of two or three). Note that values are more similar within colmnns than across columns.</Paragraph>
      <Paragraph position="11">  We find that some words adapt more than others, and that words that adapt more in one year of the AP also tend to adapt more in another year of the AP. In general, words that adapt a lot tend to have more content (e.g., good keywords for information retrieval (IR)) and words that adapt less have less content (e.g., function words).</Paragraph>
      <Paragraph position="12"> It is often assumed that word fi'equency is a good (inverse) con'elate of content. In the psycholinguistic literature, the term &amp;quot;high frequency&amp;quot; is often used syrlouymously with &amp;quot;function words,&amp;quot; and &amp;quot;low frequency&amp;quot; with &amp;quot;content words.&amp;quot; In IR, inverse document fiequency (IDF) is commonly used for weighting keywords.</Paragraph>
      <Paragraph position="13"> The table below is interesting because it questions this very basic assumption. We compare two words, &amp;quot;Kennedy&amp;quot; and &amp;quot;except,&amp;quot; that are about equally frequent (similar priors).</Paragraph>
      <Paragraph position="14"> Intuitively, &amp;quot;Kennedy&amp;quot; is a content word and &amp;quot;except&amp;quot; is not. This intuition is supported by the adaptation statistics: the adaptation ratio, Pr(+adapt)/Pr(prior), is nmch larger for &amp;quot;Kennedy&amp;quot; than for &amp;quot;except.&amp;quot; A similar pattern holds for negative adaptation, but in the reverse direction. That is, Pr(-adapt)/Pr(prior) is lnuch slnaller for &amp;quot;Kennedy&amp;quot; than for &amp;quot;except.&amp;quot; Kenneclv adapts more than except prior +adapt -adapt source w  In general, we expect more adaptation for better keywords (e.g., &amp;quot;Kennedy&amp;quot;) and less adaptatiou for less good keywords (e.g., fnnction words such as &amp;quot;except&amp;quot;). This observation runs counter to the standard practice of weighting keywords solely on the basis of frequency, without considering adaptation. In a related paper, Umemura and Church (submitted), we describe a term weighting method that makes use of adaptation (sometimes referred to as burstiness).</Paragraph>
      <Paragraph position="15">  The table above compares surnames with first names. These surnames are excellent keywords unlike the first names, which are nearly as useless for IR as function words. The adaptation ratio, Pr(+adapt)/Pr(prior), is much larger for the surnames than for the first names.</Paragraph>
      <Paragraph position="16"> What is the probability of seeing two Noriegas in a document? The chance of the first one is p=0.006. According to the table above, the chance of two is about 0.75p, closer to p/2 than  lightning to strike twice, but it hapt)ens all the time, especially for good keywords.</Paragraph>
      <Paragraph position="17"> 4. Smoothing (for low frequency words) Thus fitr, we have seen that adaptation can be large, but to delnonstrate tile shape property (lack of dependence on frequency), tile counts in the contingency table need to be smoothed. The problem is that the estimates of a, b, c, d, and especially estimates of the ratios of these quantities, become unstable when the counts are small. The standard methods of smoothing in tile speech recognition literature are Good-Turing (GT) and tteld-Out (He), described in sections 15.3 &amp; 15.4 of Jelinek (1997). In both cases, we let r be an observed count of an object (e.g., the fi'equency of a word and/or ngram), and r* be our best estimate of r in another COl'pUS of the same size (all other things being equal).</Paragraph>
    </Section>
    <Section position="2" start_page="182" end_page="182" type="sub_section">
      <SectionTitle>
4.1 Standard Held-Out (HO)
</SectionTitle>
      <Paragraph position="0"> He splits the training corpus into two halves.</Paragraph>
      <Paragraph position="1"> The first half is used to count r for all objects of intercst (e.g., the frequency of all words in vocal&gt; ulary). These counts are then used to group objects into bins. The r m bin contains all (and only) tile words with count r. For each bin, we colnpute N r, tile number of words in the r m bin.</Paragraph>
      <Paragraph position="2"> The second half of the training corpus is then used to compute Cr, tile a,,,,re,,'~m~ ~,.~ frequency of all the words in the r ~h bin. The final result is simply: r*=Cr./N,, ll' the two halves o1' tile trail)ing corpora or the lest corpora have dilTercnt sizes, then r* should be scaled appropriately.</Paragraph>
      <Paragraph position="3"> We chose He in this work because it makes few assumptions. There is no parametric model. All that is assumed is that tile two halves of tile training corpus are similar, and that both are similar to the testing corpus. Even this assulnption is a matter of some concern, since major stories come and go over time.</Paragraph>
    </Section>
    <Section position="3" start_page="182" end_page="183" type="sub_section">
      <SectionTitle>
4.2 Application of HO to Contingency Tables
</SectionTitle>
      <Paragraph position="0"> As above, the training corpus is split into two halves. We used two different years of AP news.</Paragraph>
      <Paragraph position="1"> The first hall' is used to count document frequency rl/: (Document frequency will be used instead of standard (term) frequency.) Words are binned by df and by their cell in the coutingency table. The first half of tile corpus is used to compute the number of words in each bin: Nd, ., N4fj,, N41.(: and Ndl.,,t; the second half of the corpus is used to compute the aggregate document flequency for the words in each bin: C,!f, a, C41.,l), Cdl:,,c and C4f,d. The final result is strop!y: null c~}.=C,/.~/N~l.r and d~i=C4f,,//N4f,~/. We * ./ .I,: ,/,' ,I compute tile probabilities as before, but replace a, b, c, d with a *, b *, c *, d*, respectively.</Paragraph>
      <Paragraph position="2">  With these smoothed estimates, we arc able to show that Pr(+adcq~t), labeled h in tile plot above, is larger and less dependent on fi'equency than l)r(prior), labeled p. The plot shows a third group, labeled n for neighbors, which will be described later. Note that Ihe ns fall between tile ps and tile hs.</Paragraph>
      <Paragraph position="3"> Thus far, we have seen that adaptation can be huge: Pr(+a&amp;q)l)&gt;&gt; Pr(prior), often by two or three orders of magnitude. Perhaps even more surprisingly, although Ihe first mention depends strongly on frequency (d./), the second does not. Some words adapt more (e.g., Noriega, Aristide, Escobar) and some words adapt less (e.g., John, George, Paul). Tile results are robust. Words that adapt more in one year of AP news tend to adapt more in another year, and vice versa.</Paragraph>
      <Paragraph position="4"> 5. Method 2: l'r( + adapt2 ) So far, we have limited our attention to the relatively simple case where the history and the test arc tile same size. In practice, this won't be the case. We were concerned that tile observations above might be artil'acts somehow caused by this limitation.</Paragraph>
      <Paragraph position="5"> We exl~erimented with two approaches for understanding the effect of this limitation and found that the size of the history doesn't change Pr(+adal)t ) very much. The first approach split the history and the test at wlrious points ranging from 5% to 95%. Generally, Pr(+adaptl ) increases as the size of the test portion grows relative to the size of the history, but the effect is  relatively small (more like a factor of two than an order of magnitude).</Paragraph>
      <Paragraph position="6"> We were even more convinced by the second approach, which uses Pr(+adapt2 ), a completely different argument for estimating adaptation and doesn't depend on the relative size of the history and the test. The two methods produce remarkably silnilar results, usually well within a factor of two of one another (even when adapted probabilities are orders of magnitude larger than the prior).</Paragraph>
      <Paragraph position="7"> Pr(+adapt2) makes use of d./)(w), a generalization of document frequency, d,/)(w) is the number of documents with .j or more instances of w; (dfl is the standard notion of dJ).</Paragraph>
      <Paragraph position="8"> Pr( + adapt 2 ) = Pr(k&gt;_2 \[k&gt;_ 1 ) = df2/(!/&amp;quot; 1 Method 2 has some advantages and some disadvantages in comparison with method 1. On the positive side, method 2 can be generalized to compute the chance of a third instance: Pr(k&gt;_31k&gt;_2 ). But unfortunately, we do not know how to use method 2 to estimate negative adaptation; we leave that as an open question.</Paragraph>
      <Paragraph position="9">  The plot (above) is similar to the plot in section 4.2 which showed that adapted probabilities (labeled h) are larger and less dependent on frequency than the prior (labeled p). So too, the plot (above) shows that the second and third mentions of a word (labeled 2 and 3, respectively) are larger and less dependent on frequency than the first mention (labeled 1). The plot in section 4.2 used method 1 whereas the plot (above) uses method 2. Both plots use the He smoothing, so there is only one point per bin (df value), rather than one per word.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>