<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1030">
  <Title>New Techniques for Context Modeling</Title>
  <Section position="4" start_page="220" end_page="223" type="metho">
    <SectionTitle>
2 Extension Model Class
</SectionTitle>
    <Paragraph position="0"> This section consists of four parts. In 2.1, we formally define the class of extension models and prove that they satisfy the axioms of probability. In 2.2, we show to estimate the parameters of an extension model using Moffat's (1990) &amp;quot;method C.&amp;quot; In 2.3, we provide codelength formulas for our model class, based on efficient enumerative codes. These codelength formulas will be used to match the complexity of the model to the complexity of the data. In 2.4, we present a heuristic model selection algorithm that adds parameters to an extension model only when they reduce the codelength of the data more than they increase the codelength of the model.</Paragraph>
    <Section position="1" start_page="220" end_page="221" type="sub_section">
      <SectionTitle>
2.1 Model Class Definition
</SectionTitle>
      <Paragraph position="0"> Formally, an extension model C/ : (E, D, E, A) consists of a finite alphabet E, \[E\[ = m, a dictionary D of contexts, D C E*, a set of available context extensions E, E C D x E, and a probability function I : E ---* \[0, 1\]. For every context w in D, E(w) is the set of symbols available in the context w and A(~rlw ) is the conditional probability of the symbol c~ in the context w. Note that )--\]o~ A(c~\[w) &lt; 1 for all contexts w in the dictionary D.</Paragraph>
      <Paragraph position="1"> The probability /5(h1C/ ) of a string h given the model C/, h * E', is calculated as a chain of conditional probabilities (1)</Paragraph>
      <Paragraph position="3"> while the conditional probability ih(elh, C/) of a single symbol ~r after the history h is defined as (2).</Paragraph>
      <Paragraph position="5"> (2) The expansion factor 6(h) ensures that/5(.\]h, C/) is a probability function if/5(-Ih2.., h,~, C/) is a probability function.</Paragraph>
      <Paragraph position="7"> Note that E(h) represents a set of symbols, and so by a slight abuse of notation )~(E(h)Ih ) denotes ~\]~eE(h) A(a\[h), ie., the sum of A(alh ) over all ~ in E(h).</Paragraph>
      <Paragraph position="8"> Examplel. Let E:{0,1},D: {e,&amp;quot;0&amp;quot; },E(e)</Paragraph>
      <Paragraph position="10"> The fundamental difference between a context model and an extension model lies in the inputs to the context selection rule, not its outputs. The traditional context model includes a selection rule s : E* --~ D whose only input is the history. In contrast, an extension model includes a selection rule s : E* x E --+ D whose inputs include the past history and the symbol to be predicted. This distinction is preserved even if we generalize the selection rule to select a set of candidate contexts. Under such a generalization, the context model would map every history to a set of candidate contexts, ie., s : E* ---* 2 D , while an extension model would map every history and symbol to a set of candidate contexts, ie., s : E* x E --* 2 D.</Paragraph>
      <Paragraph position="11"> Our extension selection rule s : E* x E --+ D is defined implicitly by the set E of extensions currently in the model. The recursion in (2) says that each symbol should be predicted in its longest candidate context, while the expansion factor 6(h) says that longer contexts in the model should be trusted more than shorter contexts when combining the predictions from different contexts.</Paragraph>
      <Paragraph position="12"> An extension model C/ is valid iff it satisfies the following constraints:</Paragraph>
      <Paragraph position="14"> These constraints suffice to ensure that the model C/ defines a probability function. Constraint (4a) states that every symbol has the empty string as a context.</Paragraph>
      <Paragraph position="15"> This guarantees that every symbol will always have at least one context in every history and that the recursion in (2) will terminate. Constraint (45) states that the sum of the probabilities of the extensions E(w) available in in a given context w cannot sum  to more than unity. The third constraint (4c) states that the sum of the probabilities of the extensions E(w) must sum exactly to unity when every symbol  is available in that context (ie., when E(w) : E). Lemma 2.1 VyEE* Vcr62E \[ fi(~\]lY) : 1 :~/\](EIqy) = 1 \] Proof. By the definition of 6(~ry).</Paragraph>
      <Paragraph position="16"> Theorem 1 If an exlension model C/ is valid, then vn \]S,es,, = 1.</Paragraph>
      <Paragraph position="17">  Proof. By induction on n. For the base case, n : 1 and the statement is true by the definition of validity (constraints 4a and 4c). The induction step is true by lemma 2.1 and definition (1). \[\]</Paragraph>
    </Section>
    <Section position="2" start_page="221" end_page="221" type="sub_section">
      <SectionTitle>
2.2 Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> Let us now estimate the conditional probabilities A(.\[-) required for an extension model. Traditionally, these conditional probabilities are estimated using string frequencies obtained from a training corpus.</Paragraph>
      <Paragraph position="1"> Let c(c~\[w) be the number of times that the symbol followed the string w in the training corpus, and let c(w) be the sum ~es c(crlw) of all its conditional frequencies.</Paragraph>
      <Paragraph position="2"> Following Moffat (1990), we first partition the conditional event space E in a given context w into two subevents: the symbols q(w) that have previously occurred in context w and those that q(w) that have not. Formally, q(w) - {(r : c(,r\[w) &gt; 0} and ~(w) - E - q(w). We estimate )~c(q(w)lw ) as e(w)/(c(w) + #(w)) and )~c(4(w)\[w) as #(w)/(c(w)+ #(w)) where #(w) is the total weight assigned to the novel events q(w) in the context w. Currently, we calculate #(w) as min(\[q(w)l, Iq(w)\[) so that highly variable contexts receive more flattening, but no novel symbol in ~(w) receives more than unity weight. Next, )~c(alq(w ), w) is estimated as c(alw)/c(w ) for the previously seen symbols c~ e q(w) and Ac((r\]4(w), w) is estimated uniformly as 1/\[4(w)\[ for the novel symbols ~r * 4(w). Combining these estimates, we obtain our overall estimate (5).</Paragraph>
      <Paragraph position="4"> O) Unlike Moffat, our estimate (5) does not use escape probabilities or any other form of context blending.</Paragraph>
      <Paragraph position="5"> All novel events 4(w) in the context w are assigned uniform probability. This is suboptimal but simpler.</Paragraph>
      <Paragraph position="6"> We note that our frequencies are incorrect when used in an extension model that contains contexts that are proper suffixes of each other. In such a situation, the shorter context is only used when the longer context was not used. Let y and xy be two distinct contexts in a model C/. Then the context y will never be used when the history is E*xy. Therefore, our estimate of A(.ly ) should be conditioned on the fact that the longer context xy did not occur.</Paragraph>
      <Paragraph position="7"> The interaction between candidate contexts can become quite complex, and we consider this problem in other work (Ristad and Thomas, 1995).</Paragraph>
      <Paragraph position="8"> Parameter estimation is only a small part of the overall model estimation problem. Not only do we have to estimate the parameters for a model, we have to find the right parameters to use! To do this, we proceed in two steps. First, in section 2.3, we use the minimum description length (MDL) principle to quantify the total merit of a model with respect to a training corpus. Next, in section 2.4, we use our MDL codelengths to derive a practical model selection algorithm with which to find a good model in the vast class of all extension models.</Paragraph>
    </Section>
    <Section position="3" start_page="221" end_page="222" type="sub_section">
      <SectionTitle>
2.3 Codelength Formulas
</SectionTitle>
      <Paragraph position="0"> The goal of this section is to establish the proper tension between model complexity and data complexity, in the fundamental units of information. Although the MDL framework obliges us to propose particular encodings for the model and the data, our goal is not to actually encode the data or the model.</Paragraph>
      <Paragraph position="1"> Given an extension model C/ and a text corpus T, ITI = t, we define the total codelength L(T,C/I(I)) relative to the model class ~ using a 2-part code.</Paragraph>
      <Paragraph position="2"> L(T, C/\[(I)) : L(C/I~ ) + L(TIC/ , ~) Since conditioning on the model class (I) is always understood, we will henceforth suppress it in our notation.</Paragraph>
      <Paragraph position="3"> Firstly, we will encode the text T using the probability model C/ and an arithmetic code, obtaining the following codelength.</Paragraph>
      <Paragraph position="5"> Next, we encode the model C/ in three parts: the context dictionary as L(D), the extensions as L(EID), and the conditional frequencies c(.\[-) as L(e\[D, E).</Paragraph>
      <Paragraph position="6"> The dictionary D of contexts forms a suffix tree containing ni vertices with branching factor i. The m tree contains n = )--~i=l ni internal vertices and no leaf vertices. There are (no + nl + ... + nm 1)!/no!nl!...nm! such trees (Knuth, 1986:587). Accordingly, this tree may be encoded with an enumerative code using L(D) bits.</Paragraph>
      <Paragraph position="7"> LID): Lz(n)+log( n+m-lm_l )</Paragraph>
      <Paragraph position="9"> where \[DJ is the set of all contexts in D that are proper suffixes of another context in D. The first term encodes the number n of internal vertices using the Elias code. The second term encodes the counts {nl, n2,..., am}. Given the frequencies of these internal vertices, we may calculate the number no of leaf vertices as no = 1 + n2 + 2n3 + 3n4 +... + (m 1)am. The third term encodes the actual tree (without labels) using an enumerative code. The fourth term assigns labels (ie., symbols from E) to the edges in the tree. At this point the decoder knows all contexts which are not proper suffixes of other contexts, ie., D - LD\]. The fourth term encodes the magnitude of \[D\] as an integer bounded by the number n of internal vertices in the suffix tree. The fifth term identifies the contexts \[DJ as interior vertices in the tree that are proper suffices of another context in D.</Paragraph>
      <Paragraph position="10"> Now we encode the symbols available in each context. Let mi be the number of contexts that have exactly i extensions, ie., mi - J{w: JE(w)l = i}l.</Paragraph>
      <Paragraph position="11"> 7&amp;quot;n Observe that ~i=1 mi = IDI.</Paragraph>
      <Paragraph position="12"> () E m -F rni log i i--1 The first term represents the encoding of {mi } while the second term represents the encoding IE(w)l for each w in D. The third term represents the encoding of E(w) as a subset of E for each w in D.</Paragraph>
      <Paragraph position="13"> Finally, we encode the frequencies c(~rlw) used to estimate the model parameters wED + g ,o, ( C(deg) + ) IE(w)l where \[y\] consists of all contexts that have y as their maximal proper suffix, ie., all contexts that y immediately dominates, and \[y\] is the maximal proper suffix of y in D, ie., the unique context that immediately dominates y. The first term encodes ITI with an Elias code and the second term recursively partitions c(w) into c(\[w\]) for every context w. The third term partitions the context frequency c(w) into the available extensions c(E(w)lw ) and the &amp;quot;unallocated frequency&amp;quot; c(E- E(w)lw) = c(w) - c(E(w)\[w) in the context w.</Paragraph>
    </Section>
    <Section position="4" start_page="222" end_page="223" type="sub_section">
      <SectionTitle>
2.4 Model Selection
</SectionTitle>
      <Paragraph position="0"> The final component of our contribution is a model selection algorithm for the extension model class ~.</Paragraph>
      <Paragraph position="1"> Our algorithm repeatedly refines the accuracy of our model in increasingly long contexts. Adding a new parameter to the model will decrease the codelength of the data and increase the codelength of the model.</Paragraph>
      <Paragraph position="2"> Accordingly, we add a new parameter to the model only if doing so will decrease the total codelength of the data and the model.</Paragraph>
      <Paragraph position="3"> The incremental cost and benefit of adding a single parameter to a given context cannot be accurately approximated in isolation from any other parameters that might be added to that context. Accordingly, the incremental cost of adding the set E' of extensions to the context w is defined as (6) while the incremental benefit is defined as (7).</Paragraph>
      <Paragraph position="4"> ALe(w, E') - L(C/ U ({w} x E')) - L(C/) (6) ALT(W, E') - L(TIC/ ) - L(T\[C/ U ({w} x E')) (7) Keeping only significant terms that are monotonically nondecreasing, we approximate the incremental cost ALe(w, E') as loglDl+log IS'l + log c(Lwj) + log ( c(w)ls, i + C 'I ) The first term represents the incremental increase in the size of the context dictionary D. The second term represents the cost of encoding the candidate extensions E(w) = E ~. The third term represents (an upper bound on) the cost of encoding c(w). The fourth term represents the cost of encoding c(.Iw ) for E(w). Only the second and fourth terms are signficant.</Paragraph>
      <Paragraph position="5"> Let us now consider the incremental benefit of adding the extensions E' to a given context w. The addition of a single parameter (w, ~r) to the model C/ will immediately change A(alw), by definition of the model class. Any change to A(.Iw ) will also change the expansion factor 5(w) in that context, which may in turn change the conditional probabilities ~(E-E(w)lw, C/) of symbols not available in that context. Thus the incremental benefit of adding the extensions E' to the context w may be calculated as</Paragraph>
      <Paragraph position="7"> The first term represents the incremental benefit (in bits) for evaluating E - E' in the context w using the more accurate expansion factor 5(w). The second term represents the incremental benefit (in bits) of using the direct estimate A(a'lw ) instead of the model probability/5(cr'lw, C/) in the context w. Note that A(a'lw) may be more or less than/~(cr'lw , C/).</Paragraph>
      <Paragraph position="8"> Now the incremental cost and benefit of adding a single extension (w, cr) to a model that already contains the extensions (w, El/ may be defined as follows.</Paragraph>
      <Paragraph position="9"> ALe(w, E', a) -- ALe(w, E' U {a}) - ALe(w, E')  ALT(w, ~', a) - ALT(w, ~' U {a}) - ALT(W, ~') Let us now use these incremental cost/benefit formulas to design a simple heuristic estimation algorithm for the extension model. The algorithm consists of two subroutines. Refine(D,E,n) augments the model with all individually profitable extensions of contexts of length n. It rests on the assumption that adding a new context does not change the model's performance in the shorter contexts.</Paragraph>
      <Paragraph position="10"> Extend(w) determines all profitable extensions of the candidate context w, if any exist. Since it is not feasible to evaluate the incremental profit of every subset of E, Extend(w) uses a greedy heuristic that repeatedly augments the set of profitable extensions of w by the single most profitable extension until it is not longer profitable to do so.</Paragraph>
      <Paragraph position="12"> 1. D, := {};E, := {}; 2. Cn := {w: w e Cn-1 ~'\]~ A c(w) &gt; Cmi.} ; 3. if (( n &gt; nm~=) V (ICnl = 0)) then return; 4. for w E Cn 5. ~' := Extend(w); 6. if ISI &gt; o then D. :-- Dn U {w}; En(w) := S; 7. D:=DUDn;E:=EUEn; 8. Refine( D,E,n -F 1); Cn is the set of candidate contexts of length n, obtained from the training corpus. Dn is the set of profitable contexts of length n, while En is the set of profitable extensions of those contexts. Extend(w) 1. S :: {}; 2. o&amp;quot; := argmaxoe~. {AL(w, {at})} 3. while (AL(w,S,~) &gt; O) 4. S := S U {a}; S. o&amp;quot; := argrnax.e\]g_ s {AL(w, ,S', C/r)} 6. return(S);  The loop in lines 3-5 repeatedly finds the single most profitable symbol a with which to augment the set S of profitable extensions. The incremental profit AL(...) is the incremental benefit ALT(...) minus the incremental cost ALe(...). Our breadth-first search considers shorter contexts before longer ones, and consequently the decision to add a profitable context y may significantly decrease the benefit of a more profitable context xy, particularly when c(xy) ~ c(y). For example, consider a source with two hidden states. In the first state, the source generates the alphabet E = {0, 1,2} uniformly. In the second state, the source generates the string &amp;quot;012&amp;quot; with certainty. With appropriate state transition probabilities, the source generates strings where c(0) ~ c(1) ~ e(2), c(211)/c(1 ) &gt;&gt; c(21e)/c(c), and c(2101)/c(01 ) &gt; c(211)/c(1 ). In such a situation, the best context model includes the contexts &amp;quot;0&amp;quot; and &amp;quot;01&amp;quot; along with the empty context c. However, the divergence heuristic will first determine that the context &amp;quot;1&amp;quot; is profitable relative to the empty context, and add it to the model. Now the profitability of the better context deg'01&amp;quot; is reduced, and the divergence heuristic may therefore not include it in the model. This problem is best solved with a best first search. Our current implementation uses a breadth first search to limit the computational complexity of model selection.</Paragraph>
      <Paragraph position="13"> Finally, we note that our parameter estimation techniques and model selection criteria are comparable in computational complexity to Rissanen's context models (1983, 1986). For that reason, extension models should be amendable to efficient online implementation. null</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="223" end_page="226" type="metho">
    <SectionTitle>
3 Empirical Results
</SectionTitle>
    <Paragraph position="0"> By means of the following experiments, we hope to demonstrate the utility of our context modeling techniques. All results are based on the Brown corpus, an eclectic collection of English prose drawn from 500 sources across 15 genres (Francis and Kucera, 1982). The irregular and nonstationary nature of this corpus poses an exacting test for statistical language models. We use the first 90% of each file in the corpus to estimate our models, and then use the remaining 10% of each file in the corpus to evaluate the models. Each file contains approximately 2000 words. Due to limited computational resources, we set nmax = 10, Cmin -~- 8, and restrict our our alphabet size to 70 (ie., all printing ascii characters, ignoring case distinction).</Paragraph>
    <Paragraph position="1"> Our results are summarized in the following table. Message entropy (in bits/symbol) is for the testing corpus only, as per traditional model validation methodology. The nonmonotonic extension model (NEM) outperforms all other models for all orders using vastly fewer parameters. Its performance all the more impressive when we consider that no context blending or escaping is performed, even for novel events.</Paragraph>
    <Paragraph position="2"> We note that the test message entropy of the n-gram model class is minimized by the 5-gram at 2.38 bits/char. This result for the 5-gram is not honest because knowledge of the test set was used to select the optimal model order. Jelinek and Mercer (1980) have shown to interpolate n-grams of different order using mixing parameters that are conditioned on the history. Such interpolated Markov sources are considerably more powerful than traditional n-grams but contain even more parameters.</Paragraph>
    <Paragraph position="3"> The best reported results on the Brown Corpus are 1.75 bits/char using a large interpolated trigram word model whose parameters are estimated using over 600,000,000 words of proprietary training data (Brown et.al., 1992). The use of proprietary training data means that these results are not independently repeatable. In contrast, our results were obtained using only 900,000 words of generally available training data and may be independently verified by any- null model (NEM), the nonmonotonic context model (NCM), Rissanen's (1983,1986) monotonic context models (MCM1, MCM2) and the n-gram model. All models are order 7. The rightmost column contains test message entropy in bits/symbol. NEM outperforms all other model classes for all orders using significantly fewer parameters. It is possible to reduce the test message entropy of the NEM and NCM to 1.91 and 1.99, respectively, by quadrupling the number of model parameters.</Paragraph>
    <Paragraph position="4"> one with the inclination to do so. The amount of training data is known to be a significant factor in model performance. Given a sufficiently rich dictionary of words and a sufficiently large training corpus, a model of word sequences is likely to outperform an otherwise equivalent model of character sequences.</Paragraph>
    <Paragraph position="5"> For these three reasons - repeatability, training corpus size, and the advantage of word models over character models - the results reported by Brown et.al (1992) are not directly comparable to those reported here.</Paragraph>
    <Paragraph position="6"> Section 3.1 compares the statistical efficiency of the various context model classes. Next, section 3.2 anecodatally examines the complex interactions among the parameters of an extension model.</Paragraph>
    <Section position="1" start_page="224" end_page="225" type="sub_section">
      <SectionTitle>
3.1 Model Class Comparison
</SectionTitle>
      <Paragraph position="0"> Given the tremendous risk of overfitting, the most important property of a model class is arguably its statistical efficiency. Informally, statistical efficiency measures the effectiveness of individual parameters in a given model class. A high efficiency indicates that our model class provides a good description of the data. Conversely, a low efficiency indicates that the model class does not adequately describe the observed data.</Paragraph>
      <Paragraph position="1"> In this section, we compare the statistical efficiency of three model classes: context models, extension models, and fixed-length Markov processes (ie., n-grams). Our model class comparison is based on three criteria of statistical efficiency: total codelength, bits/parameter on the test message, and bits/order on the test message. The context and extension models are all of order 9, and were estimated using the true incremental benefit and a range of fixed incremental costs (between 5 and 25 bits/extension for the extension model and between 25 and 150 bits/context for the context model).</Paragraph>
      <Paragraph position="2"> According to the first criteria of statistical efficiency, the best model is the one that achieves the smallest total codelength L(T, C/) of the training corpus T and model C/ using the fewest parameters.</Paragraph>
      <Paragraph position="3"> This criteria measures the statistical efficiency of a model class according to the MDL framework, where we would like each parameter to be as cheap as possible and do as much work as possible. Figure 1 graphs the number of model parameters required to achieve a given total codelength for the training corpus and model. The extension model class is the overwhelming winner.</Paragraph>
      <Paragraph position="4"> ......... N. um ,be. r, of Param,et?rs.. vs: Codele, ng~ ........ 3,5 &amp;quot;l t ..... M-... 2,3,4 ngrarn  of the training corpus T and the model C/. By this criteria of statistical efficiency, the extension models completely dominate context models and n-grams.</Paragraph>
      <Paragraph position="5"> According to the second criteria of statistical efficiency, the best model is the one that achieves the lowest test message entropy using the fewest parameters. This criteria measures the statistical efficiency of a model class according to traditional model validation methodology, tempered by a healthy concern for overfitting. Figure 2 graphs the number of model parameters required to achieve a given test message entropy for each of the three model classes. Again, the extension model class is the clear winner. (This is particularly striking when the number of parameters is plotted on a linear scale.) For example, one of our extension models saves 0.98 bits/char over the trigram while using less than 1/3 as many parameters. Given the logarithmic nature of codelength and the scarcity of training data, this is a significant improvement.</Paragraph>
      <Paragraph position="6"> According to the third criteria of statistical efficiency, the best model is one that achieves the lowest test message entropy for a given model order.</Paragraph>
      <Paragraph position="7"> This criteria is widely used in the language modeling community, in part because model order is typi- null the number of model parameters and the amount of computation required to estimate the model. Figure 3 compares model order to test message entropy for each of the three model classes. As the order of the models increases from 0 (ie., unigram) to 10, we naturally expect the test message entropy to approach a lower bound, which is itself bounded below by the true source entropy. By this criteria, the extension model class is better than the context model class, and both are significantly better than the ngram. null</Paragraph>
    </Section>
    <Section position="2" start_page="225" end_page="226" type="sub_section">
      <SectionTitle>
3.2 Anecdotes
</SectionTitle>
      <Paragraph position="0"> It is also worthwhile to interpret the parameters of the extension model estimated from the Brown Corpus, to better understand the interaction between our model class and our heuristic model selection algorithm. According to the divergence heuristic, the decision to add an extension (w, ~) is made relative to that context's maximal proper suffix LwJ in D as well as any other extensions in the context w. An extension (w, ~) will be added only if the direct estimate of its conditional probability is significantly different from its conditional probability in its maximal proper suffix after scaling by the expansion factor in the context w, ie., if A(alw ) is significantly different than 6(w)~(c~ I LwJ).</Paragraph>
      <Paragraph position="1"> This is illusrated by the three contexts and six extensions shown immediately below, where +E(w) includes all symbols in E(w) that are more likely in w than they were in \[wJ and -E(w) includes all symbols in E(w) that are less likely in w than they were in L J.</Paragraph>
      <Paragraph position="2">  The substring blish is most often followed by the characters 'e', 5', and 'm', corresponding to the relatively frequent word forms publish{ ed, er, ing} and establish{ ed, ing, ment}. Accordingly, the context &amp;quot;blish&amp;quot; has three positive extensions {e,i,m}, of which e has by far the greatest probability. The context &amp;quot;blish&amp;quot; is the maximal proper suffix of two other contexts in the model, &amp;quot;ouestablish&amp;quot; and &amp;quot;euestablish&amp;quot;.</Paragraph>
      <Paragraph position="3"> The substring o establish occurs most frequently in the gerund to establish, which is nearly always followed by a space. Accordingly, the context &amp;quot;ouestablish&amp;quot; has a single positive extension &amp;quot;u&amp;quot;. The substring o establish is also found before the characters 'm', 'e', and 'i' in sequences such as to establishments, {who, ratio, also} established, and { to, into, also} establishing. Accordingly, the context &amp;quot;ouestablish&amp;quot; does not have any negative extensions. null In contrast, the substring e establish is overwhelmingly followed by the character 'm', rarely followed by 'e', and never followed by either 'i' or space. For this reason, the context &amp;quot;euestablish&amp;quot; has a single positive extension {m} corresponding to the great frequency of the string the establishment. This context also has single negative extension {e}, corresponding to the fact that the character 'e' is still possible in the context &amp;quot;euestablish&amp;quot; but considerably less likely than in that context's maximal proper suffix &amp;quot;blish&amp;quot;.</Paragraph>
      <Paragraph position="4"> Since 'i' is reasonably likely in the context &amp;quot;blish&amp;quot; but completely unlikely in the context &amp;quot;euestablish&amp;quot;, we may well wonder why the model  does not include the negative extension 'i' in addition to 'e' or even instead of 'e'. This puzzle is explained by the expansion factor as follows. Since the substring e establish is only followed by 'm' and 'e', the expansion factor ~(&amp;quot;e,,establish&amp;quot;) is essentially zero after 'm' and 'e' are added to that context, and therefore ~(~- {m, e}l &amp;quot;euestablish&amp;quot;) is also essentially zero. Thus, 'i' and space are both assigned nearly zero probability in the context &amp;quot;e, ,establish&amp;quot;, simply because 'm' and 'e' get nearly all the probability in that context.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>