<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1021"> <Title>ADAPTIVE LANGUAGE MODELING USING THE MAXIMUM ENTROPY PRINCIPLE</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. STATE OF THE ART </SectionTitle> <Paragraph position="0"> Until recently, the most successful language model (given enough training data) was the trigram \[1\], where the probability of a word is estimated based solely on the two words preceding it. The trigram model is simple yet powerful \[2\].</Paragraph> <Paragraph position="1"> However, since it does not use anything but the very immediate history, it is incapable of adapting to the style or topic of the document, and is therefore considered a static model.</Paragraph> <Paragraph position="2"> In contrast, a dynamic or adaptive model is one that changes its estimates as a result of &quot;seeing&quot; some of the text. An adaptive model may, for example, rely on the history of the current document in estimating the probability of a word.</Paragraph> <Paragraph position="3"> Adaptive models are superior to static ones in that they are able to improve their performance after seeing some of the data. This is particularly useful in two situations. First, when a large heterogeneous language source is composed of smaller, more homogeneous segments, such as newspaper articles. An adaptive model trained on the heterogeneous source will be able to hone in on the particular &quot;sublanguage&quot; used in each of the articles. Secondly, when a model trained on data from one domain is used in another domain. Again, an adaptive model will be able to adjust to the new language, thus improving its performance.</Paragraph> <Paragraph position="4"> The most successful adaptive LM to date is described in \[3\]. A cache of the last few hundred words is maintained, and is used *This work is now continued by Ron Rosenfeld at Carnegie Mellon University.</Paragraph> <Paragraph position="5"> to derive a &quot;cache trigrarn&quot;. The latter is then interpolated with the static trigram. This results in a 23% reduction in perplexity, and a 5%-24% reduction in the error rate of a speech recognizer.</Paragraph> <Paragraph position="6"> In what follows, we describe our efforts at improving our adaptive statistical language models by capitalizing on the information present in the document history.</Paragraph> </Section> <Section position="4" start_page="0" end_page="108" type="metho"> <SectionTitle> 2. TRIGGER-BASED MODELING </SectionTitle> <Paragraph position="0"> To extract information from the document history, we propose the idea of a trigger pair as the basic information bearing element. If a word sequence A is significantly correlated with another word sequence B, then (A---, B) is considered a &quot;trigger pair&quot;, with A being the trigger and B the triggered sequence.</Paragraph> <Paragraph position="1"> When A occurs in the document, it triggers B, causing its probability estimate to change.</Paragraph> <Paragraph position="2"> Before attempting to design a trigger-based model, one should study what long distance factors have significant effects on word probabilities. Obviously, some information about P(B) can be gained simply by knowing that A had occurred. But exactly how much? And can we gain significantly more by considering how recently A occurred, or how many times? We have studied these issues using the a Wail Street Journal corpus of 38 million words. Some illustrations are given in figs. 1 and 2. 
<Paragraph position="2"> As can be expected, different trigger pairs give different answers, and hence should be modeled differently. More detailed modeling should be used when the expected return is higher.</Paragraph> <Paragraph position="4"> Once we have determined the phenomena to be modeled, one main issue still needs to be addressed. Given the part of the document processed so far (h), and a word w considered for the next position, there are many different estimates of P(w|h). These estimates are derived from the various triggers of w, from the static trigram model, and possibly from other sources. How do we combine them all to form one optimal estimate? We propose a solution to this problem in the next section.</Paragraph> <Paragraph position="6"> Figure 1: Probability of 'SHARES' as a function of the distance from the last occurrence of 'STOCK' in the same document. The middle horizontal line is the unconditional probability. The top (bottom) line is the probability of 'SHARES' given that 'STOCK' occurred (did not occur) before in the document.</Paragraph> </Section> <Section position="5" start_page="108" end_page="108" type="metho"> <SectionTitle> 3. MAXIMUM ENTROPY SOLUTIONS </SectionTitle> <Paragraph position="0"> Using several different probability estimates to arrive at one combined estimate is a general problem that arises in many tasks. We use the maximum entropy (ME) principle ([4, 5]), which can be summarized as follows: 1. Reformulate the different estimates as constraints on the expectation of various functions, to be satisfied by the target (combined) estimate. 2. Among all probability distributions that satisfy these constraints, choose the one that has the highest entropy.</Paragraph> <Paragraph position="3"> Figure 2: Probability of the triggered word as a function of the number of times 'SUMMER' occurred before it in the same document. Horizontal lines are as in fig. 1.</Paragraph> <Paragraph position="4"> In the next three sections, we describe a succession of models we developed, all based on the ME principle. We then expand on the last model, describe possible future extensions to it, and report current results. More details can be found in [6, 7].</Paragraph> </Section> <Section position="6" start_page="108" end_page="109" type="metho"> <SectionTitle> 4. MODEL I: EARLY ATTEMPTS </SectionTitle> <Paragraph position="0"> Assume that for each word w in a vocabulary V we have identified a set of n_w trigger words t_w^1, t_w^2, ..., t_w^{n_w}. Assume further that, from some training text, we have the relative frequency with which a trigger word t occurs somewhere in the history h (we have used a history length K of 25, 50, 200, or 1000 words) and the word w occurs immediately after the history. Denote this observed relative frequency of a trigger t and a word w by \[ d(t, w) = \frac{c(t \in h \mbox{ and } w \mbox{ immediately follows } h)}{N}, \] where c(.) is the count in the training data and N is the number of word positions. We use {t, w} to denote the event that trigger t occurred in the history and word w occurs next; the term long-distance bigram has been used for this event.</Paragraph> <Paragraph position="1"> Assume we have a joint distribution p(h, w) of the history of K words and the next word w. We require this joint model to assign to the events {t, w} a probability that matches the observed relative frequencies.</Paragraph>
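<Paragraph> As a concrete illustration of these constraint targets, the following sketch computes the relative frequencies d(t, w) from a tokenized training text with a history window of K words. The function name and data layout are assumptions for illustration, not the code used in our experiments.</Paragraph>
<Paragraph>
from collections import Counter

def empirical_trigger_frequencies(corpus, triggers, K=200):
    """Compute d(t, w) = c(t in history and w immediately follows) / N.
    corpus:   list of training tokens
    triggers: dict mapping each word w to the set of its trigger words
    K:        history length in words
    Returns a dict {(t, w): relative frequency}.  Illustrative sketch only."""
    counts = Counter()
    N = len(corpus)
    for i, w in enumerate(corpus):
        history = set(corpus[max(0, i - K):i])   # the K words preceding position i
        for t in triggers.get(w, ()):
            if t in history:
                counts[(t, w)] += 1
    return {tw: c / N for tw, c in counts.items()}
</Paragraph>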
<Paragraph position="2"> Assuming we have R such constraints, we find the model with maximum entropy: \[ p^*(h, w) = \arg\max_p \; - \sum_{h,w} p(h, w) \log p(h, w) \] subject to the R trigger constraints \[ p(t, w) = \sum_{h:\, t \in h} p(h, w) = d(t, w). \] We also include the case that none of the triggers of word w occurs in the history (we denote this event by {t_0, w}). Using Lagrange multipliers, one can easily show that the Maximum Entropy model is given by \[ p(h, w) = \prod_{t \in l_h(w)} \mu_{t,w}, \] where l_h(w) is the set of triggers of w that occur in h; i.e., the joint probability is a product of |l_h(w)| factors, one for each trigger of word w that occurs in the history h (or one factor if none of the triggers occur). The Maximum Entropy joint distribution over a space of size |V|^{K+1} is given by R parameters, one for each constraint. In our case, we used a maximum of 20 triggers per word for a 20k vocabulary, with an average of 10, resulting in 200,000 constraints. (We also imposed unigram constraints to match the unigram distribution of the vocabulary.)</Paragraph> <Paragraph position="4"> 4.1. How to determine the factors? One can use the &quot;Brown&quot; algorithm to determine the set of factors. At each iteration, one updates the factor of one constraint; as long as one cycles through all constraints repeatedly, the factors converge to their optimal values. At the i-th iteration, assume we are updating the factor that corresponds to the {t, w}-constraint. Then the update is given by \[ \mu_{t,w}^{new} = \mu_{t,w}^{old} \cdot \frac{d(t, w)}{m(t, w)}, \] where the model-predicted value m(t, w) is given by \[ m(t, w) = \sum_{h:\, t \in h} p^{old}(h, w) \quad (1) \] and p^{old} uses the old factor values. Using the ME joint model, we define a conditional unigram model by \[ p(w|h) = \frac{p(h, w)}{\sum_{w'} p(h, w')}. \] This is a &quot;time-varying&quot; unigram model where the previous K words determine the relative probability that w would occur next. The perplexity of the resulting model was about 2000, much higher than the perplexity of a static unigram model.</Paragraph> <Paragraph position="9"> In particular, the model underestimated the probability of the frequent words. To ease that problem we disallowed any triggers for the most frequent L words. We experimented with L ranging from 100 to 500 words. The resulting model was better, though its perplexity was still about 1100, which is 43% higher than the static unigram perplexity of 772. One reason, we conjecture, is that the ME model gives rather high probability to histories that are quite unlikely in reality, and the trigger constraints are matched using those unrealistic histories. We tried an ad hoc computation in which the summation over the histories in Equation 1 was weighted by a crude estimate, w(h), of the probability of the history, i.e. \[ m(t, w) = \sum_{h:\, t \in h} w(h) \, p^{old}(h, w). \] The resulting model had a much lower perplexity of 559, about 27% lower than the static unigram model, on a test set of 1927 words. This ad hoc computation indicates that we need to model the histories more realistically. The model we propose in the next section is derived from the viewpoint that ME indicates that R factors define a conditional model that captures the &quot;long-distance&quot; bigram constraints, and that using this parametric form with Maximum Likelihood estimation may allow us to concentrate on the typical histories that occur in the data.</Paragraph>
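<Paragraph> The multiplicative factor update of Section 4.1 can be written as a short loop. The sketch below assumes the update is the standard iterative-scaling form (ratio of desired to model-predicted value) and leaves the expensive expectation m(t, w) to a caller-supplied routine; names and signatures are illustrative assumptions.</Paragraph>
<Paragraph>
def iterative_scaling(d, factors, model_expectation, n_iterations=10):
    """Cyclic factor updates for the joint ME model of Section 4.
    d:                 dict {(t, w): desired value d(t, w)}
    factors:           dict {(t, w): current factor}, updated in place
    model_expectation: callable (t, w, factors) -> m(t, w), the value the
                       current model assigns to the {t, w} constraint
    Sketch under the assumption that each factor is multiplied by the
    ratio of the desired value to the model-predicted value."""
    for _ in range(n_iterations):
        for (t, w), target in d.items():       # cycle through all constraints
            m = model_expectation(t, w, factors)
            if m > 0:
                factors[(t, w)] *= target / m
    return factors
</Paragraph>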
<Paragraph position="12"> 5. MODEL II: ML OF CONDITIONAL ME The ME viewpoint results in a conditional model that belongs to the exponential family, with R parameters when R constraints are contemplated. We can use Maximum Likelihood estimation to estimate the R factors of the model. The log-likelihood (per word) of a training set is given by \[ L = \frac{1}{N} \sum_{(h,w)} \log \frac{\prod_{t \in l_h(w)} \mu_{t,w}}{\sum_{w'} \prod_{t' \in l_h(w')} \mu_{t',w'}}, \] where l_h(w) is the set of triggers for word w that occur in h.</Paragraph> <Paragraph position="16"> The convexity of the log-likelihood guarantees that any hill-climbing method will converge to the global optimum. The gradient can be shown to be \[ \frac{\partial L}{\partial \mu_{t,w}} = \frac{d(t, w) - m'(t, w)}{\mu_{t,w}}, \] and one can use the gradient to iteratively re-estimate the factors by \[ \mu_{t,w}^{new} = \mu_{t,w}^{old} + \frac{1}{T} \cdot \frac{d(t, w) - m'(t, w)}{\mu_{t,w}^{old}}, \] where 1/T is the step size and the model-predicted value m'(t, w) for a constraint is \[ m'(t, w) = \frac{1}{N} \sum_{h:\, t \in h} p(w | h), \] the sum running over the histories observed in the training data. The training data is used to estimate the gradient given the current estimate of the factors. The size of the gradient step can be optimized by a line search on a small amount of training data.</Paragraph> <Paragraph position="21"> Given the &quot;time-varying&quot; unigram estimate, we use the methods of [8] to obtain a bigram LM whose unigram matches the time-varying unigram using a window of the most recent L words.</Paragraph> </Section> <Section position="7" start_page="109" end_page="110" type="metho"> <SectionTitle> 6. CURRENT MODEL: ML/ME </SectionTitle> <Paragraph position="0"> For estimating a probability function P(x), each constraint i is associated with a constraint function f_i(x) and a desired expectation c_i. The constraint is then written as: \[ E_P[f_i] \stackrel{\mathrm{def}}{=} \sum_x P(x) f_i(x) = c_i. \quad (2) \] Given consistent constraints, a unique ME solution is guaranteed to exist, and to be of the form \[ P(x) = \prod_i \mu_i^{f_i(x)}, \quad (3) \] where the \mu_i's are some unknown constants, to be found.</Paragraph> <Paragraph position="3"> Probability functions of the form (3) are called log-linear, and the family of functions defined by holding the f_i's fixed and varying the \mu_i's is called an exponential family.</Paragraph> <Paragraph position="4"> To search the exponential family defined by (3) for the \mu_i's that will make P(x) satisfy all the constraints, an iterative algorithm, &quot;Generalized Iterative Scaling&quot;, exists, which is guaranteed to converge to the solution ([9]).</Paragraph> <Section position="1" start_page="110" end_page="110" type="sub_section"> <SectionTitle> 6.1. Formulating Triggers as Constraints </SectionTitle> <Paragraph position="0"> To reformulate a trigger pair A -> B as a constraint, define the constraint function f_{A -> B} as \[ f_{A \rightarrow B}(h, w) = \left\{ \begin{array}{ll} 1 & \mbox{if } A \in h \mbox{ and } w = B \\ 0 & \mbox{otherwise} \end{array} \right. \quad (4) \] Set c_{A -> B} to \tilde{E}[f_{A \rightarrow B}], the empirical expectation of f_{A -> B} (i.e., its expectation in the training data). Now impose on the desired probability estimate P(h, w) the constraint \[ E_P[f_{A \rightarrow B}] = \tilde{E}[f_{A \rightarrow B}]. \quad (5) \]</Paragraph>
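<Paragraph> In code, the constraint function of equation (4) and its empirical expectation are straightforward. The sketch below uses illustrative names and assumes histories are the previous K words of a token list.</Paragraph>
<Paragraph>
def make_trigger_feature(a, b):
    """f_{A->B}(h, w): 1 if A occurred in the history h and the next word is B."""
    def f(history, w):
        return 1.0 if (a in history and w == b) else 0.0
    return f

def empirical_expectation(feature, corpus, K=200):
    """Average value of the feature over the training positions (the target c)."""
    total = 0.0
    for i, w in enumerate(corpus):
        total += feature(set(corpus[max(0, i - K):i]), w)
    return total / len(corpus)
</Paragraph>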
</Section> <Section position="2" start_page="110" end_page="110" type="sub_section"> <SectionTitle> 6.2. Estimating Conditionals: The ML/ME Solution </SectionTitle> <Paragraph position="0"> Generalized Iterative Scaling can be used to find the ME estimate of a simple (non-conditional) probability distribution over some event space. But in our case, we need to estimate conditional probabilities of the form P(w|h). How should this be done more efficiently than in the previous models? An elegant solution was proposed by [10]. Let P(h, w) be the desired probability estimate, and let \tilde{P}(h, w) be the empirical distribution of the training data. Let f_i(h, w) be any constraint function, and let c_i be its desired expectation. Equation 5 can be rewritten as \[ \sum_h \sum_w P(h, w) \, f_i(h, w) = c_i. \quad (6) \] We now modify the constraint to be \[ \sum_h \tilde{P}(h) \sum_w P(w|h) \, f_i(h, w) = c_i. \quad (7) \] One possible interpretation of this modification is as follows. Instead of constraining the expectation of f_i(h, w) with regard to P(h, w), we constrain its expectation with regard to a different probability distribution, say Q(h, w), whose conditional Q(w|h) is the same as that of P, but whose marginal Q(h) is the same as that of \tilde{P}. To better understand the effect of this change, define H as the set of all possible histories h, and define H_{f_i} as the partition of H induced by f_i. Then the modification is equivalent to assuming that, for every constraint f_i, P(H_{f_i}) = \tilde{P}(H_{f_i}). Since typically H_{f_i} is a very small set, the assumption is reasonable.</Paragraph> <Paragraph position="3"> The unique ME solution that satisfies equations like (7) or (6) can be shown to also be the Maximum Likelihood (ML) solution, namely that function which, among the exponential family defined by the constraints, has the maximum likelihood of generating the data. The identity of the ML and ME solutions, apart from being aesthetically pleasing, is extremely useful when estimating the conditional P(w|h). It means that hill-climbing methods can be used in conjunction with Generalized Iterative Scaling to speed up the search. Since the likelihood objective function is convex, hill climbing will not get stuck in local optima.</Paragraph> <Paragraph position="4"> 6.3. Incorporating the trigram model. We combine the trigger-based model with the currently best static model, the N-gram, by reformulating the latter to fit into the ML/ME paradigm. The usual unigram, bigram and trigram ML estimates are replaced by unigram, bigram and trigram constraints conveying the same information. Specifically, the constraint function for the unigram w_1 is \[ f_{w_1}(h, w) = \left\{ \begin{array}{ll} 1 & \mbox{if } w = w_1 \\ 0 & \mbox{otherwise} \end{array} \right. \] and its associated constraint is \[ \sum_h \tilde{P}(h) \sum_w P(w|h) \, f_{w_1}(h, w) = \tilde{E}[f_{w_1}]. \] Similarly, the constraint function for the bigram w_1, w_2 is \[ f_{w_1, w_2}(h, w) = \left\{ \begin{array}{ll} 1 & \mbox{if } h \mbox{ ends in } w_1 \mbox{ and } w = w_2 \\ 0 & \mbox{otherwise} \end{array} \right. \] and its associated constraint is \[ \sum_h \tilde{P}(h) \sum_w P(w|h) \, f_{w_1, w_2}(h, w) = \tilde{E}[f_{w_1, w_2}], \] and similarly for higher-order n-grams.</Paragraph> <Paragraph position="13"> The computational bottleneck of the Generalized Iterative Scaling algorithm is in constraints which, for typical histories h, are non-zero for a large number of w's. This means that bigram constraints are more expensive than trigram constraints. Implicit computation can be used for unigram constraints. Therefore, the time cost of bigram and trigger constraints dominates the total time cost of the algorithm.</Paragraph>
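<Paragraph> For a small vocabulary and a handful of constraint functions, a toy version of Generalized Iterative Scaling for the conditional model of equation (7) can be written directly. The sketch below is an illustrative approximation (it uses the maximum total feature count as the GIS constant rather than adding an explicit slack feature) and is far from the engineering needed for the full system described above.</Paragraph>
<Paragraph>
import math

def gis_conditional(features, targets, histories, vocab, n_iterations=50):
    """Toy Generalized Iterative Scaling for P(w|h) proportional to
    prod_i mu_i ** f_i(h, w), with constraints of the form (7).
    features:  list of functions f_i(history, w) returning 0 or 1
    targets:   list of desired expectations c_i (empirical history distribution)
    histories: list of observed histories (e.g. sets of words), one per position
    vocab:     list of candidate next words
    Illustrative sketch only; C approximates the GIS slack constant."""
    C = max(sum(f(h, w) for f in features) for h in histories for w in vocab) or 1.0
    log_mu = [0.0] * len(features)

    def conditional(h):
        scores = [sum(log_mu[i] * f(h, w) for i, f in enumerate(features))
                  for w in vocab]
        z = math.log(sum(math.exp(s) for s in scores))
        return [math.exp(s - z) for s in scores]

    for _ in range(n_iterations):
        model_exp = [0.0] * len(features)
        for h in histories:                     # empirical P~(h) = 1 / len(histories)
            p = conditional(h)
            for j, w in enumerate(vocab):
                for i, f in enumerate(features):
                    if f(h, w):
                        model_exp[i] += p[j] / len(histories)
        for i, (c_i, m_i) in enumerate(zip(targets, model_exp)):
            if c_i > 0 and m_i > 0:
                log_mu[i] += math.log(c_i / m_i) / C
    return [math.exp(v) for v in log_mu]
</Paragraph>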
</Section> </Section> <Section position="8" start_page="110" end_page="111" type="metho"> <SectionTitle> 7. ME: PROS AND CONS </SectionTitle> <Paragraph position="0"> The ME principle and the Generalized Iterative Scaling algorithm have several important advantages: 1. The ME principle is simple and intuitively appealing. It imposes all of the constituent constraints, but assumes nothing else. For the special case of constraints derived from marginal probabilities, it is equivalent to assuming a lack of higher-order interactions [11].</Paragraph> <Paragraph position="2"> 2. ME is extremely general. Any probability estimate of any subset of the event space can be used, including estimates that were not derived from the data or that are inconsistent with it. The distance dependence and count dependence illustrated in figs. 1 and 2 can be readily accommodated. Many other knowledge sources, including higher-order effects, can be incorporated. Note that constraints need not be independent of, nor uncorrelated with, each other.</Paragraph> <Paragraph position="4"> 3. The information captured by existing language models can be absorbed into the ML/ME model. We have shown how this is done for the conventional N-gram model. Later on we will show how it can be done for the cache model of [3].</Paragraph> <Paragraph position="5"> 4. Generalized Iterative Scaling lends itself to incremental adaptation. New constraints can be added at any time. Old constraints can be maintained or else allowed to relax.</Paragraph> <Paragraph position="6"> 5. A unique ME solution is guaranteed to exist for consistent constraints. The Generalized Iterative Scaling algorithm is guaranteed to converge to it.</Paragraph> <Paragraph position="7"> This approach also has the following weaknesses: 1. Generalized Iterative Scaling is computationally very expensive. When the complete system is trained on the entire 50 million words of Wall Street Journal data, it is expected to require many thousands of MIPS-hours to run to completion.</Paragraph> <Paragraph position="8"> 2. While the algorithm is guaranteed to converge, we do not have a theoretical bound on its convergence rate. 3. It is sometimes useful to impose constraints that are not satisfied by the training data. For example, we may choose to use Good-Turing discounting [12], or else the constraints may be derived from other data, or be externally imposed. Under these circumstances, the constraints may no longer be consistent, and the theoretical results guaranteeing existence, uniqueness and convergence may not hold.</Paragraph> </Section> <Section position="9" start_page="111" end_page="111" type="metho"> <SectionTitle> 8. INCORPORATING THE CACHE MODEL </SectionTitle> <Paragraph position="0"> It seems that the power of the cache model, described in section 1, comes from the &quot;bursty&quot; nature of language. Namely, infrequent words tend to occur in &quot;bursts&quot;, and once a word has occurred in a document, its probability of recurrence is significantly elevated.</Paragraph> <Paragraph position="1"> Of course, this phenomenon can be captured by a trigger pair of the form A -> A, which we call a &quot;self trigger&quot;. We have done exactly that in [13]. We found that self triggers are responsible for a disproportionately large part of the reduction in perplexity. Furthermore, self triggers proved particularly robust: when tested in new domains, they maintained the correlations found in the training data better than the &quot;regular&quot; triggers did.</Paragraph> <Paragraph position="2"> Thus self triggers are particularly important, and should be modeled separately and in more detail. The trigger model we currently use does not distinguish between one or more occurrences of a given word in the history, whereas the cache model does. For self triggers, the additional information can be significant (see fig. 3).</Paragraph> <Paragraph position="3"> Figure 3: Probability of 'DEFAULT' as a function of the number of times it already occurred in the document. The horizontal line is the unconditional probability.</Paragraph> <Paragraph position="4"> We plan to model self triggers in more detail. We will consider explicit modeling of frequency of occurrence, distance from last occurrence, and other factors. All of these aspects can easily be formulated as constraints and incorporated into the ME formalism.</Paragraph>
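<Paragraph> For instance, a count-dependent self trigger can be expressed as a small family of binary constraint functions, one per count bucket. The bucket thresholds below are illustrative assumptions only.</Paragraph>
<Paragraph>
def make_self_trigger_features(word, count_buckets=(1, 2, 3)):
    """One binary constraint function per bucket: the k-th function fires when
    `word` has already occurred at least count_buckets[k] times in the history
    and is also the next word.  Illustrative sketch; thresholds are assumptions."""
    def make(threshold):
        def f(history_counts, w):
            # history_counts: dict mapping word -> number of occurrences so far
            return 1.0 if (w == word and history_counts.get(word, 0) >= threshold) else 0.0
        return f
    return [make(t) for t in count_buckets]
</Paragraph>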
</Section> </Paper>