<?xml version="1.0" standalone="yes"?> <Paper uid="W95-0108"> <Title>Beyond Word N-Grams</Title> <Section position="4" start_page="96" end_page="96" type="metho"> <SectionTitle> 2 Prediction Suffix Trees over Unbounded Sets </SectionTitle> <Paragraph position="0"> Let $U \subseteq \Sigma^*$ be a set of words over the finite alphabet $\Sigma$, which represents here the set of actual and future words of a natural language. A prediction suffix tree (PST) $T$ over $U$ is a finite tree with nodes labeled by distinct elements of $U^*$ such that the root is labeled by the empty sequence $\epsilon$, and if $s$ is a son of $s'$ and $s'$ is labeled by $a \in U^*$, then $s$ is labeled by $wa$ for some $w \in U$. Therefore, in practice it is enough to associate each non-root node with the first word in its label, and the full label of any node can be reconstructed by following the path from the node to the root. In what follows, we will often identify a PST node with its label.</Paragraph> <Paragraph position="1"> Each PST node $s$ has a corresponding prediction function $\gamma_s : U' \to [0,1]$, where $U' \subseteq U \cup \{\phi\}$ and $\phi$ represents a novel event, that is, the occurrence of a word not seen before in the context represented by $s$. The value of $\gamma_s$ is the next-word probability function for the given context $s$. A PST $T$ can be used to generate a stream of words, or to compute prefix probabilities over a given stream. Given a prefix $w_1 \cdots w_k$ generated so far, the context (node) used for prediction is found by starting from the root of the tree and taking branches corresponding to $w_k, w_{k-1}, \ldots$ until a leaf is reached or the next son does not exist in the tree. Consider for example the PST shown in Figure 1, where some of the values of $\gamma_s$ are:</Paragraph> <Paragraph position="3"> When observing the text '... long ago and the first', the matching path from the root ends at the node 'and the first'. Then we predict that the next word is 'time' with probability 0.6 and some other word not seen in this context with probability 0.1. The prediction probability distribution $\gamma_s$ is estimated from empirical counts. Therefore, at each node we keep a data structure to keep track of the number of times each word appeared in that context.</Paragraph> <Paragraph position="4"> A wildcard symbol, '*', is available in node labels to allow a particular word position to be ignored in prediction. For example, the text '... but this was' is matched by the node label 'this *', which ignores the most recently read word 'was'. Wildcards allow us to model conditional dependencies of the general form $P(x_t \mid x_{t-i_1}, x_{t-i_2}, \ldots, x_{t-i_L})$ in which the indices $i_1 < i_2 < \cdots < i_L$ are not necessarily consecutive.</Paragraph> <Paragraph position="5"> We denote by $C_T(w_1 \cdots w_n) = w_{n-k} \cdots w_n = s$ the context (and hence a corresponding node in the tree) used for predicting the word $w_{n+1}$ with a given PST $T$. Wildcards provide a useful capability in language modeling since syntactic structure may make a word strongly dependent on another a few words back but not on the words in between.</Paragraph>
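To make the context-lookup procedure described above concrete, here is a minimal sketch in Python (the class and function names are hypothetical, not from the paper); it walks back from the end of the observed prefix, following existing sons until a branch is missing, and reads a next-word probability off the node it reaches:

# Minimal sketch of PST context lookup. A node stores its sons keyed by the
# single word that extends the node's label one position further back.
class PSTNode:
    def __init__(self):
        self.sons = {}      # word -> PSTNode (one-word-longer context)
        self.counts = {}    # word -> number of times it followed this context
        self.total = 0      # total number of observations at this node

    def predict(self, word):
        # Placeholder estimator; Section 3 discusses the smoothing actually used.
        r = len(self.counts)
        return self.counts.get(word, 0) / (self.total + r) if self.total else 0.0

def find_context(root, prefix):
    """Walk from the root taking branches w_k, w_{k-1}, ... until a son is missing."""
    node = root
    for word in reversed(prefix):
        if word not in node.sons:
            break
        node = node.sons[word]
    return node

# Example: the context for predicting the word after '... long ago and the first'
# node = find_context(root, ['long', 'ago', 'and', 'the', 'first'])
# p_time = node.predict('time')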
<Paragraph position="6"> One can easily verify that every standard n-gram model can be represented by a PST, but the opposite is not true. A trigram model, for instance, is a PST of depth two, where the leaves are all the observed bigrams of words. The prediction function at each node is the trigram conditional probability of observing a word given the two preceding words.</Paragraph> </Section> <Section position="5" start_page="96" end_page="100" type="metho"> <SectionTitle> 3 The Learning Algorithm </SectionTitle> <Paragraph position="0"> Within the framework of online learning, it is provably (see e.g. (DeSantis et al., 1988; Cesa-Bianchi et al., 1993)) and experimentally known that the performance of a weighted ensemble of models, each model weighted according to its performance (the posterior probability of the model), is not worse and generally much better than that of any single model in the ensemble. Although there might be exponentially many different PSTs in the ensemble, it has recently been shown (Willems et al., 1994) that a mixture of PSTs can be efficiently computed for small alphabets.</Paragraph> <Paragraph position="1"> [Figure 1 (caption): a small PST of words for language modeling. The numbers on the edges are the weights of the sub-trees starting at the pointed node. These weights are used for tracking a mixture of PSTs. The special string '*' represents a 'wildcard' that can be matched with any observed word.]</Paragraph> <Paragraph position="3"> Here, we will use the Bayesian formalism to derive an online learning procedure for mixtures of PSTs of words. The mixture elements are drawn from some pre-specified set $\mathcal{T}$, which in our case is typically the set of all PSTs with maximal depth $\le D$ for some suitably chosen $D$. The likelihood of the observation sequence under a single PST $T$ is $P(w_1 \cdots w_n \mid T) = \prod_{i=1}^{n} \gamma_{C_T(w_1 \cdots w_{i-1})}(w_i)$,</Paragraph> <Paragraph position="5"> where $C_T(w_0) = \epsilon$ is the null (empty) context. The probability of the next word, given the past $n$ observations, is provided by Bayes formula, $P(w_{n+1} \mid w_1 \cdots w_n) = \sum_{T \in \mathcal{T}} P_0(T)\, P(w_1 \cdots w_{n+1} \mid T) \big/ \sum_{T \in \mathcal{T}} P_0(T)\, P(w_1 \cdots w_n \mid T)$ (3),</Paragraph> <Paragraph position="7"> where $P_0(T)$ is the prior probability of the PST $T$.</Paragraph> <Paragraph position="8"> A naive computation of (3) would be infeasible because of the size of $\mathcal{T}$. Instead, we use a recursive method in which the relevant quantities for a PST mixture are computed efficiently from related quantities for sub-PSTs. In particular, the PST prior $P_0(T)$ is defined as follows. A node $s$ has a probability $\alpha_s$ of being a leaf and a probability $1 - \alpha_s$ of being an internal node. In the latter case, its sons are either a single wildcard, with probability $\beta_s$, or actual words, with probability $1 - \beta_s$. To keep the derivation simple, we assume here that the probabilities $\alpha_s$ are independent of $s$ and that there are no wildcards, that is, $\beta_s = 0$ and $\alpha_s = \alpha$ for all $s$. Context-dependent priors and trees with wildcards can be obtained by a simple extension of the present derivation. Let us also assume that all the trees have maximal depth $D$. Then $P_0(T) = \alpha^{n_1} (1 - \alpha)^{n_2}$, where $n_1$ is the number of leaves of $T$ of depth less than the maximal depth and $n_2$ is the number of internal nodes of $T$.</Paragraph> <Paragraph position="9"> To evaluate the likelihood of the whole mixture we build a tree of maximal depth $D$ containing all observation sequence suffixes of length up to $D$. Thus the tree contains a node $s$ iff $s = (w_{i-k+1}, \ldots, w_i)$ with $1 \le k \le D$, $1 \le i \le n$.</Paragraph>
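As a small illustration of the tree just described (a sketch, not the paper's implementation; it ignores wildcards and represents nodes as plain dictionaries), the following builds, for every position i in the text, the context nodes labeled by the suffixes $w_{i-k+1} \cdots w_i$ for $k = 1, \ldots, D$:

# Build the context tree of maximal depth D as a trie of dictionaries:
# root[w_i][w_{i-1}]...[w_{i-k+1}] is the node for the context w_{i-k+1} ... w_i.
def build_context_tree(text, D):
    root = {}
    for i in range(len(text)):
        node = root
        for k in range(D):
            if i - k < 0:
                break
            node = node.setdefault(text[i - k], {})  # one more word back in the text
    return root

# Example: all contexts of length <= 2 in a toy text
# tree = build_context_tree('long ago and the first time'.split(), D=2)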
<Paragraph position="10"> At each node $s$ we keep two variables (in practice, we keep only a ratio related to the two variables, as explained in detail in the next section). The first, $L_n(s)$, accumulates the likelihood of the node seen as a leaf. That is, $L_n(s)$ is the product of the predictions of the node on all the observation-sequence suffixes that ended at that node: $L_n(s) = \prod_{\{i \,:\, 1 \le i \le n,\ C_T(w_1 \cdots w_{i-1}) = s\}} \gamma_s(w_i)$.</Paragraph> <Paragraph position="11"> For each new observed word $w_n$, the likelihood values $L_n(s)$ are derived from their previous values $L_{n-1}(s)$. Clearly, only the nodes labeled by $w_{n-1},\, w_{n-2} w_{n-1},\, \ldots,\, w_{n-D} \cdots w_{n-1}$ will need likelihood updates. For those nodes, the update is simply multiplication by the node's prediction for $w_n$; for the rest of the nodes the likelihood values do not change: $L_n(s) = \gamma_s(w_n)\, L_{n-1}(s)$ if $s \in \{w_{n-1},\, w_{n-2} w_{n-1},\, \ldots,\, w_{n-D} \cdots w_{n-1}\}$, and $L_n(s) = L_{n-1}(s)$ otherwise.</Paragraph> <Paragraph position="13"> The second variable, denoted by $Lmix_n(s)$, is the likelihood of the mixture of all possible trees that have a subtree rooted at $s$ on the observed suffixes (all observations that reached $s$). $Lmix_n(s)$ is calculated recursively as follows:</Paragraph> <Paragraph position="15"> In summary, the mixture likelihood values are updated as follows:</Paragraph> <Paragraph position="17"> At first sight it would appear that the update of $Lmix_n$ would require contributions from an arbitrarily large subtree, since $U$ may be arbitrarily large. However, only the subtree rooted at $(w_{n-|s|-1}\, s)$ is actually affected by the update. Thus the following simplification holds:</Paragraph> <Paragraph position="19"> Note that $Lmix_n(s)$ is the likelihood of the weighted mixture of trees rooted at $s$ on all past observations, where each tree in the mixture is weighted with its proper prior. Therefore, $Lmix_n(\epsilon) = \sum_{T \in \mathcal{T}} P_0(T)\, P(w_1 \cdots w_n \mid T)$ (10),</Paragraph> <Paragraph position="21"> where $\mathcal{T}$ is the set of trees of maximal depth $D$ and $\epsilon$ is the null context (the root node). Combining Equations (3) and (10), we see that the prediction of the whole mixture for the next word is the ratio of the likelihood values $Lmix_n(\epsilon)$ and $Lmix_{n-1}(\epsilon)$ at the root node: $P(w_n \mid w_1, \ldots, w_{n-1}) = Lmix_n(\epsilon)/Lmix_{n-1}(\epsilon)$ (11). A given observation sequence matches a unique path from the root to a leaf. Therefore the time for the above computation is linear in the tree depth (maximal context length). After predicting the next word, the counts are updated simply by increasing by one the count of the word, if the word already exists, or by inserting a new entry for the new word with initial count set to one. Based on this scheme several n-gram estimation methods, such as Katz's backoff scheme (Katz, 1987), can be derived. Our learning algorithm has, however, the advantages of not being limited to a constant context length (by setting $D$ to be arbitrarily large) and of being able to perform online adaptation. Moreover, the interpolation weights between the different prediction contexts are automatically determined by the performance of each model on past observations.</Paragraph> <Paragraph position="22"> In summary, for each observed word we follow a path from the root of the tree (back in the text) until the longest context (maximal depth) is reached. We may need to add new nodes, with new entries in the data structure, for the first appearance of a word. The likelihood values of the mixture of subtrees (Equation 8) are returned from each level of that recursion up to the root node. The probability of the next word is then the ratio of two consecutive likelihood values returned at the root. For prediction without adaptation, the same method is applied except that nodes are not added and counts are not updated.</Paragraph>
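The sketch below summarizes one online step as just described: follow the path of contexts back in the text, update the per-node likelihoods, recombine the mixture likelihoods bottom-up, and return the ratio of consecutive root values as the prediction (Equation 11). It is not the paper's code: the names are hypothetical, the per-node estimator is the simple count-based scheme discussed later in this section, and the bottom-up combination is a standard context-tree-weighting style rule ($\alpha$ times the leaf likelihood plus $1-\alpha$ times the product of the sons' mixture likelihoods) standing in for the paper's Equations 5-9.

import math

ALPHA = 0.5   # prior probability that a node is a leaf (a tunable prior in the paper)

def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

class Node:
    def __init__(self):
        self.sons = {}            # word -> Node (one-word-longer context)
        self.counts = {}          # word counts observed in this context
        self.total = 0
        self.log_L = 0.0          # log L_n(s): likelihood of the node seen as a leaf
        self.log_Lmix = 0.0       # log Lmix_n(s): likelihood of the mixture rooted at s
        self.log_prod_sons = 0.0  # log of the product of the sons' Lmix values

    def gamma(self, word):
        # Witten-Bell style estimate; an unseen word is charged the whole
        # novel-event mass r/(n+r) here (the paper refines this case).
        r, n = len(self.counts), self.total
        if word in self.counts:
            return self.counts[word] / (n + r)
        return max(r, 1) / (n + max(r, 1))

def process_word(root, history, word, D):
    """Predict P(word | history) with the PST mixture, then update it online."""
    # Path of context nodes: eps, (w_{n-1}), (w_{n-2} w_{n-1}), ..., up to depth D.
    path, node = [root], root
    for back in range(1, min(D, len(history)) + 1):
        node = node.sons.setdefault(history[-back], Node())  # grow the tree online
        path.append(node)

    old_root_log_mix = root.log_Lmix

    # Bottom-up update of L and Lmix along the path; only these nodes change.
    child_old, child_new = None, None
    for depth in range(len(path) - 1, -1, -1):
        s = path[depth]
        s.log_L += math.log(s.gamma(word))            # multiply by gamma_s(w_n)
        if child_old is not None:                     # only one son's Lmix changed,
            s.log_prod_sons += child_new - child_old  # so refresh the product cheaply
        if depth == D:                                # forced leaf at maximal depth
            new_mix = s.log_L
        else:
            new_mix = log_add(math.log(ALPHA) + s.log_L,
                              math.log(1 - ALPHA) + s.log_prod_sons)
        child_old, child_new = s.log_Lmix, new_mix
        s.log_Lmix = new_mix

    # The mixture prediction is the ratio of consecutive root likelihoods (Eq. 11).
    prob = math.exp(root.log_Lmix - old_root_log_mix)

    # Counts are updated only after the prediction has been made.
    for s in path:
        s.counts[word] = s.counts.get(word, 0) + 1
        s.total += 1
    return prob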
<Paragraph position="23"> If the prior probability of the wildcard, $\beta$, is positive, then at each level the recursion splits, with one path continuing through the node labeled with the wildcard and the other through the node corresponding to the proper suffix of the observation. Thus, the update or prediction time is in that case $O(2^D)$. Since $D$ is usually very small (most currently used word n-gram models are trigrams), the update and prediction times are essentially linear in the text length.</Paragraph> <Paragraph position="24"> It remains to describe how the probabilities $P(w \mid s) = \gamma_s(w)$ are estimated from empirical counts. This problem has been studied for more than thirty years, and so far the most common techniques are based on variants of the Good-Turing (GT) method (Good, 1953; Church and Gale, 1991). Here we give a description of the estimation method that we implemented and evaluated. We are currently developing an alternative approach for cases when there is a known (arbitrarily large) bound on the maximal size of the vocabulary $U$.</Paragraph> <Paragraph position="25"> Let $n_1^s, n_2^s, \ldots, n_{r^s}^s$ be the counts of occurrences of the words $w_1, w_2, \ldots, w_{r^s}$ at a given context (node) $s$, where $r^s$ is the total number of different words that have been observed at node $s$. The total text size in that context is thus $n^s = \sum_i n_i^s$. We need estimates of $\gamma_s(w_i)$ and of $\gamma_s(w_0)$, the probability of observing a new word $w_0$ at node $s$. The GT method sets $\gamma_s(w_0) = t_1/n^s$, where $t_1$ is the total number of words that were observed only once in that context. This method has several justifications, such as a Poisson assumption on the appearance of new words (Fisher et al., 1943). It is, however, difficult to analyze and requires keeping track of the rank of each word. Our learning scheme and data structures favor instead any method that is based only on word counts. In source coding it is common to assign to novel events the probability $r^s/(n^s + r^s)$, in which case the probability $\gamma_s(w_i)$ of a word that has been observed $n_i^s$ times is set to $n_i^s/(n^s + r^s)$. As reported in (Witten and Bell, 1991), the performance of this method is similar to the GT estimation scheme, yet it is simpler since only the number of different words and their counts are kept.</Paragraph> <Paragraph position="26"> Finally, a careful analysis should be made when predicting novel events (new words). There are two cases of novel events: (a) an occurrence of an entirely new word, one that has never been seen before in any context; (b) an occurrence of a word that has been observed in some context, but is new in the current context.</Paragraph> <Paragraph position="27"> The following coding interpretation may help to understand the issue. Suppose some text is communicated over a channel and is encoded using a PST. Whenever an entirely new word is observed (first case), it is necessary to first send an indication of a novel event and then transfer the identity of that word (using a lower-level coder, for instance a PST over the alphabet $\Sigma$ in which the words in $U$ are written). In the second case it is only necessary to transfer the identity of the word, by referring to the shorter context in which the word has already appeared. Thus, in the second case we incur an additional description cost for a new word in the current context. A possible solution is to use a shorter context (one of the ancestors in the PST) where the word has already appeared, and multiply the probability of the word in that shorter context by the probability that the word is new. This product is the probability of the word.</Paragraph>
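The following sketch illustrates the estimator and the back-off just described (hypothetical names, not the paper's code). It assumes each node keeps its word counts and a link to the next shorter context (its parent in the PST), and that some lower-level model p0 supplies the prior probability of an entirely new word:

class Ctx:
    """Minimal stand-in for a PST node: word counts plus a link to the next
    shorter context; the real nodes also hold sons, likelihoods, etc."""
    def __init__(self, parent=None):
        self.counts, self.total, self.parent = {}, 0, parent

def node_estimate(node, word):
    """Witten-Bell style estimate at one node: n_w/(n+r) for a seen word,
    with a total escape mass of r/(n+r) reserved for novel events."""
    r, n = len(node.counts), node.total
    if r == 0:
        return None, 1.0                    # nothing seen here: all mass is escape mass
    if word in node.counts:
        return node.counts[word] / (n + r), r / (n + r)
    return None, r / (n + r)

def predict(node, word, p0=lambda w: 1e-7):
    """Probability of `word` at context `node`, backing off to shorter contexts
    when the word is new here; p0 stands in for a lower-level model over new words."""
    prob_here, escape = node_estimate(node, word)
    if prob_here is not None:
        return prob_here
    if node.parent is not None:
        return escape * predict(node.parent, word, p0)
    return escape * p0(word)                # word never seen in any context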
<Paragraph position="28"> In the case of a completely new word, we need to multiply the probability of a novel event by an additional factor $P_0(w_n)$, interpreted as the prior probability of the word according to a lower-level model. This additional factor is multiplied in at all the nodes along the path from the root to the maximal context of this word (a leaf of the PST). In that case, however, the probability of the next word $w_{n+1}$ remains independent of this additional prior, since it cancels out nicely:</Paragraph> <Paragraph position="30"> Thus, an entirely new word can be treated simply as a word that has been observed at all the nodes of the PST. Moreover, in many language modeling applications we need to predict only that the next event is a new word, without specifying the word itself. In such cases the update derivation remains the same as in the first case above.</Paragraph> </Section> <Section position="6" start_page="100" end_page="101" type="metho"> <SectionTitle> 4 Efficient Implementation of PSTs of Words </SectionTitle> <Paragraph position="0"> Natural language is often bursty (Church, this volume); that is, rare or new words may appear and be used relatively frequently for some stretch of text, only to drop to a much lower frequency of use for the rest of the corpus. Thus, a PST being built online may need to store information about those words only for a short period. It may then be advantageous to prune PST nodes and remove small counts corresponding to rarely used words. Pruning is performed by removing all nodes from the suffix tree whose counts are below a threshold, after each batch of K observations. We used a pruning frequency K of 1,000,000 and a pruning threshold of 2 in some of our experiments.</Paragraph> <Paragraph position="1"> Pruning during online adaptation has two advantages. First, it improves memory use. Second, and less obviously, predictive power may be improved. Rare words tend to bias the prediction functions at nodes with small counts, especially if their appearance is restricted to a small portion of the text. When rare words are removed from the suffix tree, the estimates of the prediction probabilities at each node are readjusted to better reflect the probability estimates of the more frequent words. Hence, part of the bias in the estimation may be overcome.</Paragraph> <Paragraph position="2"> To support fast insertions, searches and deletions of PST nodes and word counts we used a hybrid data structure. When we know in advance a (large) bound on the vocabulary size, we represent the root node by arrays of word counts and possible sons subscripted by word indices. At other nodes, we used splay trees (Sleator and Tarjan, 1985) to store both the counts and the branches to longer contexts. Splay trees support search, insertion and deletion in amortized $O(\log n)$ time per operation. Furthermore, they reorganize themselves so as to decrease the cost of accessing the most frequently accessed elements, thus speeding up access to counts and subtrees associated with more frequent words. Figure 2 illustrates the hybrid data structure.</Paragraph> <Paragraph position="3"> The likelihood values $Lmix_n(s)$ and $L_n(s)$ decrease exponentially fast with $n$, potentially causing numerical problems even if a log representation is used. Moreover, we are only interested in the predictions of the mixtures; the likelihood values are only used to weigh the predictions of different nodes.</Paragraph>
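The periodic pruning step described at the beginning of this section can be sketched as follows (hypothetical names; nodes are assumed to carry counts, total, and sons fields as in the earlier sketches, and the exact pruning criterion here is a simplification):

def prune(node, threshold=2):
    """Remove rarely used words and sparsely observed sons, bottom-up."""
    # Drop word counts below the threshold and adjust the node total.
    for word in [w for w, c in node.counts.items() if c < threshold]:
        node.total -= node.counts.pop(word)
    # Recurse into the sons, then drop sons that are now essentially empty.
    for word in list(node.sons):
        son = node.sons[word]
        prune(son, threshold)
        if son.total < threshold and not son.sons:
            del node.sons[word]

# Example schedule (K = 1,000,000 and threshold 2 were used in some of the
# paper's experiments):
# if words_seen % 1_000_000 == 0:
#     prune(root, threshold=2)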
<Paragraph position="4"> Let $\tilde{\gamma}_s(w_n)$ be the prediction of the weighted mixture of all subtrees rooted below $s$ (including $s$ itself) for $w_n$. By following the derivation presented in the previous section it can be verified</Paragraph> <Paragraph position="5"> and $q_n(s) = 1/(1 + e^{-R_n(s)})$, where $R_n(s)$ is the log-ratio kept at node $s$ in place of the two likelihood values. Thus, the probability of $w_{n+1}$ is propagated along the path corresponding to suffixes of the observation sequence towards the root as follows,</Paragraph> <Paragraph position="6"> Finally, the prediction of the complete mixture of PSTs for $w_n$ is simply given by $\tilde{\gamma}_\epsilon(w_n)$.</Paragraph> </Section> <Section position="7" start_page="101" end_page="103" type="metho"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> We tested our algorithm in two modes. In online mode, model structure and parameters (counts) are updated after each observation. In batch mode, the structure and parameters are held fixed after the training phase, making it easier to compare the model to standard n-gram models. Our initial experiments used the Brown corpus, the Gutenberg Bible, and Milton's Paradise Lost as sources of training and test material. We have also carried out a preliminary evaluation on the ARPA North-American Business News (NAB) corpus.</Paragraph> <Paragraph position="1"> For batch training, we randomly partitioned the data into training and testing sets. We then trained a model by running the online algorithm on the training set, and the resulting model, kept fixed, was then used to predict the test data.</Paragraph> <Paragraph position="2"> As a simple check of the model, we used it to generate text by performing random walks over the PST. A single step of the random walk was performed by going down the tree following the current context and stopping at a node with the probability assigned by the algorithm to that node. Once a node is chosen, a word is picked randomly according to the node's prediction function. A result of such a random walk is given in Figure 3. The PST was trained on the Brown corpus with a maximal depth of five. The output contains several well-formed (though meaningless) clauses and also cliches such as "conserving our rich natural heritage," suggesting that the model captured some longer-term statistical dependencies.</Paragraph> <Paragraph position="3"> [Figure 3: text generated by a random walk over the PST: "every year public sentiment for conserving our rich natural heritage is growing but that heritage is shrinking even faster no joyride much of its contract if the present session of the cab driver in the early phases conspiracy but lacking money from commercial sponsors the stations have had to reduce its vacationing"]</Paragraph> <Paragraph position="4"> In online mode the advantage of PSTs with large maximal depth is clear. The perplexity of the model decreases significantly as a function of the depth. Our experiments so far suggest that the resulting models are fairly insensitive to the choice of the prior probability $\alpha$, and a prior which favors deep trees performed well. Table 1 summarizes the results on different texts, for trees of growing maximal depth. Note that a maximal depth of 0 corresponds to a 'bag of words' model (zero order), 1 to a bigram model, and 2 to a trigram model.</Paragraph> <Paragraph position="5"> In our first batch tests we trained the model on 15% of the data and tested it on the rest. The results are summarized in Table 2.</Paragraph>
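For reference, the perplexities reported here can be computed as follows (a standard calculation, not a detail given in the paper; the predict and update methods are hypothetical): with adapt=True the model keeps adapting as it reads the test stream (online mode), while with adapt=False it is held fixed after training (batch mode).

import math

def perplexity(model, words, adapt=False):
    """Perplexity 2^(-(1/N) * sum_i log2 P(w_i | w_1 ... w_{i-1})) over a word stream."""
    log2_sum, history = 0.0, []
    for w in words:
        p = model.predict(history, w)      # hypothetical interface
        log2_sum += math.log2(p)
        if adapt:
            model.update(history, w)       # online mode: grow nodes, update counts
        history.append(w)
    return 2.0 ** (-log2_sum / len(words))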
<Paragraph position="6"> The perplexity obtained in batch mode is clearly higher than that of online mode, since only a small portion of the data was used to train the models. Yet, even in this case the PST of maximal depth three is significantly better than a full trigram model. In this mode we also checked the performance of the single most likely (maximum a posteriori) model compared to the mixture of PSTs. This model is found by pruning the tree at the nodes that obtained the highest confidence value, $L_n(s)$, and using only the leaves for prediction. As shown in the table, the performance of the MAP model is consistently worse than the performance of the mixture of PSTs.</Paragraph> <Paragraph position="7"> As a simple test of the applicability of the model to language modeling, we checked it on text which was corrupted in different ways. This situation frequently occurs in speech and handwriting recognition systems or in machine translation. In such systems the last stage is a language model, usually a trigram model, that selects the most likely alternative among the several options passed by the previous stage. Here we used a PST with maximal depth 4, trained on 90% of the text of Paradise Lost. Several sentences that appeared in the test data were corrupted in different ways. We then used the model in batch mode to evaluate the likelihood of each of the alternatives. In Table 3 we demonstrate one such case, where the first alternative is the correct one. The negative log likelihood and the posterior probability, assuming that the listed sentences are all the possible alternatives, are provided. The correct sentence gets the highest probability according to the model.</Paragraph> <Paragraph position="8"> Finally, we trained a depth-two PST on randomly selected sentences from the NAB corpus totaling approximately 32.5 million words and tested it on two corpora: a separate randomly selected set of sentences from the NAB corpus, totaling around 2.8 million words, and a standard ARPA NAB development test set of around 8 thousand words. The PST perplexity on the first test set was 168, and on the second 223. In comparison, a trigram backoff model built from the same training set has a perplexity of 247.7 on the second test set. Further experiments using longer maximal depths and allowing comparisons with existing n-gram models trained on the full (280 million word) NAB corpus will require improved data structures and pruning policies to stay within reasonable memory limits.</Paragraph> </Section> </Paper>