<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0309">
  <Title>Aggregate and mixed-order Markov models for statistical language processing</Title>
  <Section position="4" start_page="82" end_page="83" type="metho">
    <SectionTitle>
2 Aggregate Markov models
</SectionTitle>
    <Paragraph position="0"> In this section we consider how to construct class-based bigram models (Brown et al., 1992). The problem is naturally formulated as one of hidden variable density estimation. Let P(clwl ) denote the probability that word wl is mapped into class c.</Paragraph>
    <Paragraph position="1"> Likewise, let P(w21c) denote the probability that words in class c are followed by the word w2. The class-based bigram model predicts that word wl is followed by word w2 with probability</Paragraph>
    <Paragraph position="3"> where C is the total number of classes. The hidden variable in this problem is the class label c, which is unknown for each word wl. Note that eq. (1) represents the V 2 elements of the transition matrix P(w21wa) in terms of the 2CV elements of P(w2\]c) and P(clwl ).</Paragraph>
    <Paragraph position="4"> The Expectation-Maximization (EM) algorithm (Dempster, Laird, and Rubin, 1977) is an iterative procedure for estimating the parameters of hidden variable models. Each iteration consists of two steps: an E-step which computes statistics over the hidden variables, and an M-step which updates the parameters to reflect these statistics.</Paragraph>
    <Paragraph position="5"> The EM algorithm for aggregate Markov models is particularly simple. The E-step is to compute, for each bigram WlW 2 in the training set, the posterior</Paragraph>
    <Paragraph position="7"> Eq. (2) gives the probability that word wl was assigned to class c, based on the observation that it was followed by word w2. The M-step uses these posterior probabilities to re-estimate the model parameters. The updates for aggregate Markov models are:</Paragraph>
    <Paragraph position="9"> where N(Wl, w2) denotes the number of counts of wlw2 in the training set. These updates are guaranteed to increase the overall log-likelihood,</Paragraph>
    <Paragraph position="11"> (though not global) maximum of the log-likelihood.</Paragraph>
    <Paragraph position="12"> The perplexity V* is related to the log-likelihood by V* : e -~/N, where N is the total number of words processed.</Paragraph>
    <Paragraph position="13"> Though several algorithms (Brown et al., 1992; Pereira, Tishby, and Lee, 1993) have been proposed  the training and test sets; C is the number of classes. The case C = 1 corresponds to a ML unigram model; C = V, to a ML bigram model.</Paragraph>
    <Paragraph position="14"> 0.2 0.4 0.6 0.8 winning assignment probability  probabilities, maxc P(clw), for the three hundred most commonly occurring words.</Paragraph>
    <Paragraph position="15"> for performing the decomposition in eq. (1), it is worth noting that only the EM algorithm directly optimizes the log-likelihood in eq. (5). This has obvious advantages if the goal of finding word classes is to improve the perplexity of a language model. The EM algorithm also handles probabilistic constraints in a natural way, allowing words to belong to more than one class if this increases the overall likelihood. Our approach differs in important ways from the use of hidden Markov models (HMMs) for class-based language modeling (Jelinek et al., 1992). While HMMs also use hidden variables to represent word classes, the dynamics are fundamentally different. In HMMs, the hidden state at time t / 1 is predicted (via the state transition matrix) from the hidden state at time t. On the other hand, in aggregate Markov models, the hidden state at time t + 1 is predicted (via the matrix P(ct+llwt)) from the word at time t. The state-to-state versus word-tostate dynamics lead to different learning algorithms. For example, the Baum-Welch algorithm for HMMs requires forward and backward passes through each training sentence, while the EM algorithm we use does not.</Paragraph>
    <Paragraph position="16"> We trained aggregate Markov models with 2, 4, 8, 16, and 32 classes. Figure 1 shows typical plots of the training and test set perplexities versus the number of iterations of the EM algorithm. Clearly, the two curves are very close, and the monotonic decrease in test set perplexity strongly suggests little if any overfitting, at least when the number of classes is small compared to the number of words in the vocabulary. Table 1 shows the final perplexities (after thirty-two iterations of EM) for various aggregate Markov models. These results confirm that aggregate Markov models are intermediate in accuracy between unigram (C = 1) and bigram (C = V) models.</Paragraph>
    <Paragraph position="17"> The aggregate Markov models were also observed to discover meaningful word classes. Table 2 shows, for the aggregate model with C = 32 classes, the  las cents made make take ago day earfier Friday Monday month quarter reported said Thursday trading Tuesday  bank board chairman end group members number office out part percent price prices rate sales shares use a an another any dollar each first good her his its my old our their this  24 long Mr. year 7 twenty (0 (') 25 8 can could may should to will would 9 about at just only or than (&amp;) (;) i 10 economic high interest much no such tax united i 27 well 11 president 12 because do how if most say so then think very what when where 29 13 according back expected going him plan used way 15 don't I people they we you \[ Bush company court department more officials \] 30 16 pofice retort spokesman \[ 17 former the American big city federal general house mifitary 18 national party political state union York i business California case companies corporation dollars incorporated industry law money thousand time today war week 0) (unknown) 26 also government he it market she that there which who A. B. C. D. E. F. G. I. L. M. N. P. R. S. T. U. 28 both foreign international major many new oil other some Soviet stock these west world after all among and before between by during for from in including into like of off on over since through told under until while with eight fifteen five four half last next nine oh one second seven several six ten third three twelve two zero (-) 31 are be been being had has have is it's not still was were 32 chief exchange news public service trade  C = 32 classes. Class 14 is absent because it is not the most probable class for any of the selected words.) most probable class assignments of the three hundred most commonly occurring words. To be precise, for each class c*, we have listed the words for which c* = arg maxe P(c\]w). Figure 2 shows a histogram of the winning assignment probabilities, maxe P(c\[w), for these words. Note that the winning assignment probabilities are distributed broadly over the interval \[-~, 1\]. This demonstrates the utility of allowing &amp;quot;soft&amp;quot; membership classes: for most words, the maximum likelihood estimates of P(clw ) do not correspond to a winner-take-all assignment, and therefore any method that assigns each word to a single class (&amp;quot;hard&amp;quot; clustering), such as those used by Brown et al. (1992) or Ney, Essen, and Kneser (1994), would lose information.</Paragraph>
    <Paragraph position="18"> We conclude this section with some final comments on overfitting. Our models were trained by thirty-two iterations of EM, allowing for nearly complete convergence in the log-likelihood. Moreover, we did not implement any flooring constraints 1 on the probabilities P(clwl ) or P(w21c). Nevertheless, in all our experiments, the ML aggregate Markov lit is worth noting, in this regard, that individual zeros in the matrices P(w2\[c) and P(c\[wl) do not necessarily give rise to zeros in the matrix P(w21wt), as computed from eq. (1).</Paragraph>
    <Paragraph position="19"> models assigned non-zero probability to all the bi-grams in the test set. This suggests that for large vocabularies there is a useful regime 1 &lt;&lt; C &lt;&lt; V in which aggregate models do not suffer much from overfitting. In this regime, aggregate models can be relied upon to compute the probabilities of unseen word combinations. We will return to this point in Section 4, when we consider how to smooth n-gram language models.</Paragraph>
  </Section>
  <Section position="5" start_page="83" end_page="105" type="metho">
    <SectionTitle>
3 Mixed-order Markov models
</SectionTitle>
    <Paragraph position="0"> One of the drawbacks of n-gram models is that their size grows rapidly with their order. In this section, we consider how to make predictions based on a convex combination of'pairwise correlations. This leads to language models whose size grows linearly in the number of words used for each prediction.</Paragraph>
    <Paragraph position="1"> For each k &gt; 0, the ski_p-k transition matrix M(wt-k, wt) predicts the current word from the kth previous word in the sentence. A mixed-order Markov model combines the information in these matrices for different values of k. Let m denote the number of bigram models being combined. The probability distribution for these models has the form:</Paragraph>
    <Paragraph position="3"> The terms in this equation have a simple interpretation. The V x V matrices Mk (w, w') in eq. (6) define the skip-k stochastic dependency of w' at some position t on w at position t - k; the parameters Ak (w) are mixing coefficients that weight the predictions from these different dependencies. The value of Ak (w) can be interpreted as the probability that the model, upon seeing the word wt-k, looks no further back to make its prediction (Singer, 1996). Thus the model predicts from wt-1 with probability A1 (wt-1), from wt-2 with probability \[1 - Al(wt-1)\]A2(wt-~), and so on. Though included in eq. (6) for cosmetic reasons, the parameters Am (w) are actually fixed to unity so that the model never looks further than m words back.</Paragraph>
    <Paragraph position="4"> We can view eq. (6) as a hidden variable model.</Paragraph>
    <Paragraph position="5"> Imagine that we adopt the following strategy to predict the word at time t. Starting with the previous word, we toss a coin (with bias Ai(Wt_i) ) to see if this word has high predictive value. If the answer is yes, then we predict from the skip-1 transition matrix, Ml(Wt-l,Wt). Otherwise, we shift our attention one word tothe left and repeat the process.</Paragraph>
    <Paragraph position="6"> If after m- 1 tosses we have not settled on a prediction, then as a last resort, we make a prediction using Mm(wt-m, wt). The hidden variables in this process are the outcomes of the coin tosses, which are unknown for each word wt-k.</Paragraph>
    <Paragraph position="7"> Viewing the model in this way, we can derive an EM algorithm to learn the mixing coefficients Ak (w) and the transition matrices 2 Mk(w, w'). The E-step of the algorithm is to compute, for each word in the training set, the posterior probability that it was generated by Mk(wt-k, wt). Denoting these posterior probabilities by Ck(t), we have:</Paragraph>
    <Paragraph position="9"> where the denominator is given by eq. (6). The M-step of the algorithm is to update the parameters Ak(W) and Mk(w, w') to reflect the statistics in eq. (7). The updates for mixed-order Markov models are given by: ,s(w, wt-k)C/k (0 A (w) (8) ~Note that the ML estimates of Mk(w,w') do not depend only on the raw counts of k-separated bigrams; they are also coupled to the values of the mixing coefficients, Aa(w). In particular, the EM algorithm adapts the matrix elements to the weighting of word combinations in eq. (6). The raw counts of k-separated bigrams, however, do give good initial estimates.</Paragraph>
    <Paragraph position="11"> number of iterations of the EM algorithm. The results are for the m = 4 mixed-order Markov model.</Paragraph>
    <Paragraph position="12">  notes the number of bigrams that were mixed into each prediction. The first column shows the perplexities on the training set. The s.ec0nd shows the fraction of words in the test set that were assigned zero probability. The case m = 1 corresponds to a</Paragraph>
    <Paragraph position="14"> where the sums are over all the sentences in the training set, and J(w, w') = 1 iff w = w'.</Paragraph>
    <Paragraph position="15"> We trained mixed-order Markov models with 2 &lt; m _&lt; 4. Figure 3 shows a typical plot of the training set perplexity as a function of the number of iterations of the EM algorithm. Table 3 shows the final perplexities on the training set (after four iterations of EM). Mixed-order models cannot be used directly on the test set because they predict zero probability for unseen word combinations. Unlike standard n-gram models, however, the number of unseen word combinations actually decreases with the order of the model. The reason for this is that mixed-order models assign finite probability to all n-grams wlw~ ... wn for which any of the k-separated bigrams wkwn are observed in the training set. To illustrate this point, Table 3 shows the fraction of words in the test set that were assigned zero probability by the mixed-order model. As expected, this fraction decreases monotonically with the number of bigrams that are mixed into each prediction.</Paragraph>
    <Paragraph position="16"> Clearly, the success of mixed-order models depends on the ability to gauge the predictive value of each word, relative to earlier words in the same sentence. Let us see how this plays out for the  0.1 &lt; Al(w) &lt; 0.7 (-) and of (&amp;quot;) or (;) to (,) (&amp;) by with S. from  nine were for that eight low seven the (() (:) six are not against was four between a their two three its (unknown) S. on as is (--) five 0) into C. M. her him over than A.</Paragraph>
    <Paragraph position="17"> 0.96 &lt; Al(w) &lt; 1 officials prices which go way he last they earlier an Tuesday there foreign quarter she former federal don't days Friday next Wednesday (%) Thursday I Monday Mr. we half based part United it's years going nineteen thousand months  (.) million very cents San ago U. percent billion (?) according (.)  in an m = 2 mixed order model.</Paragraph>
    <Paragraph position="18"> second-order (m = 2) model in Table 3. In this model, a small value for ~l(w) indicates that the word w typically carries less information that the word that precedes it. On the other hand, a large value for Al(w) indicates that the word w is highly predictive. The ability to learn these relationships is confirmed by the results in Table 4. Of the threehundred most common words, Table 4 shows the fifty with the lowest and highest values of Al(w). Note how low values of Al(w) are associated with prepositions, mid-sentence punctuation marks, and conjunctions, while high values are associated with &amp;quot;contentful&amp;quot; words and end-of-sentence markers. (A particularly interesting dichotomy arises for the two forms &amp;quot;a&amp;quot; and &amp;quot;an&amp;quot; of the indefinite article; the latter, because it always precedes a word that begins with a vowel, is inherently more predictive.) These results underscore the importance of allowing the coefficients Al(w) to depend on the context w, as opposed to being context-independent (Ney, Essen, and Kneser, 1994).</Paragraph>
  </Section>
  <Section position="6" start_page="105" end_page="150958" type="metho">
    <SectionTitle>
4 Smoothing
</SectionTitle>
    <Paragraph position="0"> Smoothing plays an essential role in language models where ML predictions are unreliable for rare events.</Paragraph>
    <Paragraph position="1"> In n-gram modeling, it is common to adopt a recursive strategy, smoothing bigrams by unigrams, trigrams by bigrams, and so on. Here we adopt a similar strategy, using the (m - 1)th mixed-order model to smooth the ruth one. At the &amp;quot;root&amp;quot; of our smoothing procedure, however, lies not a uni-gram model, but an aggregate Markov model with C &gt; 1 classes. As shown in Section 2, these models assign finite probability to all word combinations, even those that are not observed in the training set.</Paragraph>
    <Paragraph position="2"> Hence, they can legitimately replace unigrams as the base model in the smoothing procedure.</Paragraph>
    <Paragraph position="3"> Let us first examine the impact of replacing uni-gram models by aggregate models at the root of the  aggregate Markov models with different numbers of classes (C).</Paragraph>
    <Paragraph position="4"> smoothing procedure. To this end, a held-out interpolation algorithm (Jelinek and Mercer, 1980) was used to smooth an ML bigram model with the aggregate Markov models from Section 2. The smoothing parameters, one for each row of the bigram transition matrix, were estimated from a validation set the same size as the test set. Table 5 gives the final perplexities on the validation set, the test set, and the unseen bigrams in the test set. Note that smoothing with the C = 32 aggregate Markov model has nearly halved the perplexity of unseen bigrams, as compared to smoothing with the unigram model.</Paragraph>
    <Paragraph position="5"> Let us now examine the recursive use of mixed-order models to obtain smoothed probability estimates. Again, a held-out interpolation algorithm was used to smooth the mixed-order Markov models from Section 3. The ruth mixed-order model had mV smoothing parameters 0&amp;quot;k (w), corresponding to the V rows in each skip-k transition matrix. The mth mixed-order model was smoothed by discounting the weight of each skip-k prediction, then filling in the leftover probability mass by a lower-order model. In particular, the discounted weight of the skip-k prediction was given by</Paragraph>
    <Paragraph position="7"> leaving a total mass of</Paragraph>
    <Paragraph position="9"> for the (m- 1)th mixed-order model. (Note that the m = 1 mixed-order model corresponds to a ML bigram model.) Table 6 shows the perplexities of the smoothed mixed-order models on the validation and test sets. An aggregate Markov model with C = 32 classes was used as the base model in the smoothing procedure. The first row corresponds to a bigram model smoothed by a aggregate Markov model; the second row corresponds to an m = 2 mixed-order model, smoothed by a ML bigram model, smoothed by an aggregate Markov model; the third row corresponds  els on the validation and test sets.</Paragraph>
    <Paragraph position="10"> to an m = 3 mixed-order model, smoothed by a m = 2 mixed-order model, smoothed by a ML bi-gram model, etc. A significant decrease in perplexity occurs in moving to the smoothed m = 2 mixed-order model. On the other hand, the difference in perplexity for higher values of m is not very dramatic. null Our last experiment looked at the smoothing of a trigram model. Our baseline was a ML trigram model that backed off 3 to bigrams (and when necessary, unigrams) using the Katz backoff procedure (Katz, 1987). In this procedure, the predictions of the ML trigram model are discounted by an amount determined by the Good-Turing coefficients; the left-over probability mass is then filled in by the backoff model. We compared this to a trigram model that backed off to the m = 2 model in Table 6. This was handled by a slight variant of the Katz procedure (Dagan, Pereira, and Lee, 1994) in which the mixed-order model substituted for the backoff model. One advantage of this smoothing procedure is that it is straightforward to assess the performance of different backoff models. Because the backoff models are only consulted for unseen word combinations, the perplexity on these word combinations serves as a reasonable figure-of-merit.</Paragraph>
    <Paragraph position="11"> Table 7 shows those perplexities for the two smoothed trigram models (baseline and backoff).</Paragraph>
    <Paragraph position="12"> The mixed-order smoothing was found to reduce the perplexity of unseen word combinations by 51%. Also shown in the table are the perplexities on the entire test set. The overall perplexity decreased by 16%--a significant amount considering that only 24% of the predictions involved unseen word combinations and required backing off from the trigram model.</Paragraph>
    <Paragraph position="13"> The models in Table 7 were constructed from all n-grams (1 &lt; n &lt; 3) observed in the training data. Because many n-grams occur very infrequently, a natural question is whether truncated models, which omit low-frequency n-grams from the training set, can perform as well as untruncated ones. The advantage of truncated models is that they do not need to store nearly as many non-zero parameters as untruncated models. The results in Table 8 were ob~We used a backoff procedure (instead of interpolation) to avoid the estimation of trigram smoothing parameters.</Paragraph>
    <Paragraph position="14"> backoff test unseen  els on the test set and the subset of unseen word combinations. The baseline model backed off to bi-grams and unigrams; the other backed off to the  less than t times. The table shows the baseline and mixed-order perplexities on the test set, the number of distinct trigrams with t or more counts, and the fraction of trigrams in the test set that required backing off.</Paragraph>
    <Paragraph position="15"> tained by dropping trigrams that occurred less than t times in the training corpus. The t = 1 row corresponds to the models in Table 7. The most interesting observation from the table is that omitting very low-frequency trigrams does not decrease the quality of the mixed-order model, and may in fact slightly improve it. This contrasts with the standard backoff model, in which truncation causes significant increases in perplexity.</Paragraph>
  </Section>
class="xml-element"></Paper>