<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3242"> <Title>Random Forests in Language Modeling</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Basic Language Modeling </SectionTitle>
<Paragraph position="0"> The purpose of a language model is to estimate the probability of a word string. Let $W$ denote a string of $N$ words, that is, $W = w_1, w_2, \ldots, w_N$. Then, by the chain rule of probability, we have

$$P(W) = \prod_{i=1}^{N} P(w_i \mid w_1, \ldots, w_{i-1}). \quad (1)$$
</Paragraph>
<Paragraph position="1"> In order to estimate the probabilities $P(w_i \mid w_1, \ldots, w_{i-1})$, we need a training corpus consisting of a large number of words. However, in any practical natural language system with even a moderate vocabulary size, it is clear that as $i$ increases the accuracy of the estimated probabilities collapses. Therefore, histories $w_1, \ldots, w_{i-1}$ for word $w_i$ are usually grouped into equivalence classes. The most widely used language models, $n$-gram language models, use the identities of the last $n-1$ words as equivalence classes. In an $n$-gram model, we then have

$$P(W) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}^{i-1}), \quad (2)$$

where we have used $w_{i-n+1}^{i-1}$ to denote the word sequence $w_{i-n+1}, \ldots, w_{i-1}$. The maximum likelihood (ML) estimate of $P(w_i \mid w_{i-n+1}^{i-1})$ is

$$P(w_i \mid w_{i-n+1}^{i-1}) = \frac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})}, \quad (3)$$

where $C(w_{i-n+1}^{i})$ is the number of times the string $w_{i-n+1}, \ldots, w_i$ is seen in the training data.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Language Model Smoothing </SectionTitle>
<Paragraph position="0"> An $n$-gram model with $n=3$ is called a trigram model. For a vocabulary of size $|V| = 10^4$, there are $|V|^3 = 10^{12}$ trigram probabilities to be estimated.</Paragraph>
<Paragraph position="1"> For any training data of a manageable size, many of the probabilities will be zero if the ML estimate is used.</Paragraph>
<Paragraph position="2"> In order to solve this problem, many smoothing techniques have been studied (see (Chen and Goodman, 1998) and the references therein). Smoothing adjusts the ML estimates to produce more accurate probabilities and to assign nonzero probabilities to any word string. Details about the various smoothing techniques will not be presented in this paper, but we will outline one particular technique, interpolated Kneser-Ney smoothing (Kneser and Ney, 1995), for later reference. Interpolated Kneser-Ney smoothing assumes the following form:

$$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\bigl(C(w_{i-n+1}^{i}) - D,\ 0\bigr)}{C(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1}), \quad (4)$$

where $D$ is a discounting constant and $\lambda(w_{i-n+1}^{i-1})$ is the interpolation weight for the lower order probabilities (the $(n-1)$-gram). The discount constant is often estimated using leave-one-out, leading to $D = \frac{n_1}{n_1 + 2 n_2}$, where $n_1$ is the number of $n$-grams with count one and $n_2$ is the number of $n$-grams with count two. To ensure that the probabilities sum to one, we have

$$\lambda(w_{i-n+1}^{i-1}) = \frac{D \, \bigl|\{\, w : C(w_{i-n+1}^{i-1} w) > 0 \,\}\bigr|}{C(w_{i-n+1}^{i-1})}. \quad (5)$$

Note that the lower order probabilities are usually smoothed using recursions similar to Equation 4.</Paragraph> </Section>
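To make the recursion in Equations 4 and 5 concrete, the Python sketch below computes the interpolated probability from plain count tables. It is only an illustration: the names (kn_prob, counts, unigram) are ours, and for brevity the lower order distribution is built from raw counts rather than the modified continuation counts that full Kneser-Ney smoothing would use.

```python
def kn_prob(word, history, counts, unigram, D=0.75):
    """Interpolated smoothing in the form of Equations 4 and 5.

    counts maps a history tuple to a dict {next_word: count}.
    unigram is a plain dict of unigram probabilities (base of the recursion).
    D would normally be estimated as n1 / (n1 + 2 * n2).
    """
    if not history:
        return unigram.get(word, 0.0)
    hist_counts = counts.get(tuple(history), {})
    total = sum(hist_counts.values())
    if total == 0:
        # unseen history: fall back to the lower-order distribution directly
        return kn_prob(word, history[1:], counts, unigram, D)
    num_types = sum(1 for c in hist_counts.values() if c > 0)
    lam = D * num_types / total          # Equation 5: makes the model sum to one
    discounted = max(hist_counts.get(word, 0) - D, 0.0) / total
    return discounted + lam * kn_prob(word, history[1:], counts, unigram, D)
```

With trigram counts, kn_prob(w, (u, v), counts, unigram) would return the smoothed estimate of P(w | u, v).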
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Language Model Evaluation </SectionTitle>
<Paragraph position="0"> A commonly used task-independent quality measure for a given language model is related to the cross entropy of the underlying model and was introduced under the name of perplexity (PPL) (Jelinek, 1997):

$$\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1}) \right), \quad (6)$$

where $w_1, \ldots, w_N$ is the test text that consists of $N$ words.</Paragraph>
<Paragraph position="1"> For different tasks, there are different task-dependent quality measures of language models. For example, in an automatic speech recognition system, the performance is usually measured by word error rate (WER).</Paragraph>
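As a concrete reading of Equation 6, the short sketch below computes PPL for any model that exposes a conditional probability function. The names are illustrative only; an n-gram model would simply truncate the history passed to prob to its last n-1 words.

```python
import math

def perplexity(test_words, prob):
    """Perplexity of a test text (Equation 6).

    prob(w, history) returns the model probability P(w | history).
    Assumes the model assigns a nonzero probability to every word,
    which is exactly what smoothing guarantees.
    """
    log_sum = 0.0
    for i, w in enumerate(test_words):
        log_sum += math.log(prob(w, test_words[:i]))
    return math.exp(-log_sum / len(test_words))
```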
</Section> </Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Decision Tree and Random Forest Language Modeling </SectionTitle>
<Paragraph position="0"> Although Random Forests (RFs) (Amit and Geman, 1997; Breiman, 2001; Ho, 1998) are quite successful in classification and regression tasks, to the best of our knowledge there has been no research on using RFs for language modeling. By definition, an RF is a collection of randomly constructed Decision Trees (DTs) (Breiman et al., 1984). Therefore, in order to use RFs for language modeling, we first need to construct DT language models.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Decision Tree Language Modeling </SectionTitle>
<Paragraph position="0"> In an $n$-gram language model, a word sequence $w_{i-n+1}, \ldots, w_{i-1}$ is called a history for predicting $w_i$. A DT language model uses a decision tree to classify all histories into equivalence classes, and each history in the same equivalence class shares the same distribution over the predicted words. The idea of DTs has been very appealing in language modeling because it provides a natural way to deal with the data sparseness problem. Based on statistics from some training data, a DT is usually grown until certain criteria are satisfied. Heldout data can be used as part of the stopping criterion to determine the size of a DT.</Paragraph>
<Paragraph position="1"> There have been studies of DT language models in the literature. Most of these studies focused on improving $n$-gram language models by adopting various smoothing techniques in growing and using DTs (Bahl et al., 1989; Potamianos and Jelinek, 1998). However, the results were not satisfactory: DT language models performed similarly to traditional $n$-gram models and only slightly better when combined with $n$-gram models through linear interpolation. Furthermore, no study has taken advantage of the "best" stand-alone smoothing technique, namely, interpolated Kneser-Ney smoothing (Chen and Goodman, 1998).</Paragraph>
<Paragraph position="2"> The main reason why DT language models have not been successful is that the algorithms constructing DTs suffer from certain fundamental flaws by nature: training data fragmentation and the absence of a theoretically founded stopping criterion. The data fragmentation problem is severe in DT language modeling because the number of histories is very large (Jelinek, 1997). Furthermore, DT growing algorithms are greedy and early termination can occur.</Paragraph>
<Paragraph position="3"> In recognition of the success of Kneser-Ney (KN) back-off for $n$-gram language modeling (Kneser and Ney, 1995; Chen and Goodman, 1998), we use a new DT growing procedure to take advantage of KN smoothing. At the same time, we also want to deal with the early termination problem. In our procedure, training data is used to grow a DT to the maximum possible depth, heldout data is then used to prune the DT, similarly as in CART (Breiman et al., 1984), and KN smoothing is used in the pruning.</Paragraph>
<Paragraph position="4"> A DT is grown through a sequence of node splits. A node consists of a set of histories, and a node split divides this set of histories into two subsets based on statistics from the training data. Initially, we put all histories into one node, that is, into the root and the only leaf of the DT. At each stage, one of the leaves of the DT is chosen for splitting. New nodes are marked as leaves of the tree. Since our splitting criterion is to maximize the log-likelihood of the training data, each split uses only statistics (from training data) associated with the node under consideration. Smoothing is not needed in the splitting, and we can use a fast exchange algorithm (Martin et al., 1998) to accomplish the task. This saves computation time relative to the Chou algorithm (Chou, 1991) described in (Jelinek, 1997).</Paragraph>
<Paragraph position="5"> Let us assume that we have a DT node $p$ under consideration for splitting. Denote by $\Phi(p)$ the set of all histories seen in the training data that can reach node $p$. In the context of $n$-gram type modeling, there are $n-1$ items in each history. A position in the history is the distance between a word in the history and the predicted word. We only consider splits that concern a particular position in the history. Given a position $j$ in the history, we define $\beta_j(v)$ to be the set of histories belonging to $p$ such that they all have word $v$ at position $j$. It is clear that $\Phi(p) = \bigcup_v \beta_j(v)$ for every position $j$. For every $j$, our algorithm uses the $\beta_j(\cdot)$ as basic elements to construct two subsets, $A_j$ and $B_j$ (with $A_j \cup B_j = \Phi(p)$ and $A_j \cap B_j = \emptyset$, the empty set), to form the basis of a possible split. Therefore, a node contains two questions about a history: (1) Is the history in $A_j$? and (2) Is the history in $B_j$? If a history has the answer "yes" to (1), it proceeds to the left child of the node. Similarly, if it has the answer "yes" to (2), it proceeds to the right child. If the answers to both questions are "no", the history does not proceed further.</Paragraph>
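To illustrate how the two membership questions are used, the sketch below walks a history down a tree of such nodes. The node layout (fields j, A, B, left, right, dist) is a hypothetical representation of our own in which $A_j$ and $B_j$ are stored as the sets of words allowed at position $j$; it is not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Dict

@dataclass
class Node:
    """Hypothetical DT node: an internal node holds the questioned position j
    and the word sets A and B; a leaf holds its smoothed distribution."""
    j: int = 1
    A: Set[str] = field(default_factory=set)
    B: Set[str] = field(default_factory=set)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    dist: Optional[Dict[str, float]] = None   # P(w | equivalence class) at a leaf

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def map_history(root: Node, history):
    """Route a history through the (A, B) questions; None if it falls off."""
    node = root
    while not node.is_leaf:
        w = history[-node.j]         # word at distance j from the predicted word
        if w in node.A:              # question (1): is the history in A_j?
            node = node.left
        elif w in node.B:            # question (2): is the history in B_j?
            node = node.right
        else:
            return None              # neither: the history does not proceed further
    return node
```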
<Paragraph position="6"> For simplicity, we omit the subscript $j$ in the later discussion, since we always consider one position at a time. Initially, we split $\Phi(p)$ into two non-empty disjoint subsets, $A$ and $B$, using the elements $\beta(\cdot)$. Let us denote the log-likelihood of the training data associated with $p$ under the split as $L(A)$. If we use the ML estimates for the probabilities, we have

$$L(A) = \sum_{w} \left[ C(w, A) \log \frac{C(w, A)}{C(A)} + C(w, B) \log \frac{C(w, B)}{C(B)} \right], \quad (7)$$

where $C(w, A)$ is the count of word $w$ following all histories in $A$ and $C(A)$ is the corresponding total count (and similarly for $B$). Note that only counts are involved in Equation 7, so an efficient data structure can be used to store them for the computation. Then, we try to find the best subsets $A^*$ and $B^*$ by tentatively moving elements in $A$ to $B$ and vice versa. Suppose $\beta(v) \in A$ is the element we want to move. The log-likelihood after we move $\beta(v)$ from $A$ to $B$ can be calculated using Equation 7 with the following changes: for every word $w$,

$$C(w, A) \leftarrow C(w, A) - C(w, \beta(v)), \qquad C(w, B) \leftarrow C(w, B) + C(w, \beta(v)),$$
$$C(A) \leftarrow C(A) - C(\beta(v)), \qquad C(B) \leftarrow C(B) + C(\beta(v)).$$

If a tentative move results in an increase in log-likelihood, we accept the move and modify the counts. Otherwise, the element stays where it was. The subsets $A$ and $B$ are updated after each move.</Paragraph>
<Paragraph position="7"> The algorithm runs until no move can increase the log-likelihood. The final subsets will be $A^*$ and $B^*$, and we save the total log-likelihood increase. After all positions in the history are examined, we choose the one with the largest increase in log-likelihood for splitting the node. The exchange algorithm differs from the Chou algorithm (Chou, 1991) in two aspects: First, unlike the Chou algorithm, we directly use the log-likelihood of the training data as our objective function. Second, the statistics of the two clusters $A$ and $B$ are updated after each move, whereas in the Chou algorithm the statistics remain the same until all the elements $\beta(\cdot)$ have been separated. However, like the Chou algorithm, the exchange algorithm is greedy and is not guaranteed to find the optimal split.</Paragraph>
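The following sketch spells out the exchange procedure described above for a single position. For clarity it recomputes Equation 7 from the updated counts after each tentative move, whereas an efficient implementation would evaluate only the terms that change; the random initialization anticipates the randomized version used in Section 3.2. All names (exchange_split, beta_counts, and so on) are illustrative.

```python
import math
import random
from collections import Counter

def _xlogx(x):
    return x * math.log(x) if x > 0 else 0.0

def log_likelihood(cA, cB):
    # Equation 7, using only counts:
    #   sum_w C(w,S) log( C(w,S) / C(S) ), summed over S in {A, B}
    ll = sum(_xlogx(c) for c in cA.values()) - _xlogx(sum(cA.values()))
    ll += sum(_xlogx(c) for c in cB.values()) - _xlogx(sum(cB.values()))
    return ll

def exchange_split(elements, beta_counts):
    """Greedy exchange over one history position.

    elements     -- the words v indexing the basic sets beta(v)
    beta_counts  -- beta_counts[v] is a dict {predicted word: count} for beta(v)
    Returns (A, B, log-likelihood) for the final split.
    """
    # Random, independent initialization (Bernoulli with success probability 0.5).
    A = {v for v in elements if random.random() < 0.5}
    B = set(elements) - A
    if not A or not B:                       # keep both subsets non-empty
        v0 = next(iter(elements))
        A, B = {v0}, set(elements) - {v0}
    cA, cB = Counter(), Counter()
    for v in A:
        cA.update(beta_counts[v])
    for v in B:
        cB.update(beta_counts[v])

    improved = True
    while improved:                          # run until no move helps
        improved = False
        for v in elements:
            src, dst = (A, B) if v in A else (B, A)
            if len(src) == 1:                # never empty a subset
                continue
            csrc, cdst = (cA, cB) if src is A else (cB, cA)
            before = log_likelihood(cA, cB)
            csrc.subtract(beta_counts[v])    # tentatively move beta(v)
            cdst.update(beta_counts[v])
            if log_likelihood(cA, cB) > before:
                src.remove(v)                # accept the move
                dst.add(v)
                improved = True
            else:                            # undo the tentative move
                csrc.update(beta_counts[v])
                cdst.subtract(beta_counts[v])
    return A, B, log_likelihood(cA, cB)
```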
<Paragraph position="8"> After a DT is fully grown, we use heldout data to prune it. Pruning is done in such a way that we maximize the likelihood of the heldout data, where smoothing is applied similarly to interpolated Kneser-Ney smoothing:

$$P_{DT}(w_i \mid \Phi_{DT}(w_{i-n+1}^{i-1})) = \frac{\max\bigl(C(w_i, \Phi_{DT}(w_{i-n+1}^{i-1})) - D,\ 0\bigr)}{C(\Phi_{DT}(w_{i-n+1}^{i-1}))} + \lambda(\Phi_{DT}(w_{i-n+1}^{i-1}))\, P_{KN}(w_i \mid w_{i-n+2}^{i-1}), \quad (8)$$

where $\Phi_{DT}(w_{i-n+1}^{i-1})$ is one of the DT nodes the history can be mapped to and $P_{KN}(w_i \mid w_{i-n+2}^{i-1})$ is the lower order KN probability, with the interpolation weight chosen as in Equation 5. Note that although some histories share the same equivalence classification in a DT, they may use different lower order probabilities if their lower order histories $w_{i-n+2}^{i-1}$ are different.</Paragraph>
<Paragraph position="9"> During pruning, we first compute the potential of each node in the DT, where the potential of a node is the possible gain in heldout data likelihood from growing that node into a sub-tree. If the potential of a node is negative, we prune the sub-tree rooted in that node and make the node a leaf. This pruning is similar to the pruning strategy used in CART (Breiman et al., 1984).</Paragraph>
<Paragraph position="10"> After a DT is grown, we use only the leaf nodes as equivalence classes of histories. If a new history is encountered, it is very likely that we will not be able to place it at a leaf node in the DT. In this case, we simply use $P_{KN}(w_i \mid w_{i-n+2}^{i-1})$ to get the probabilities. This is equivalent to

$$P_{DT}(w_i \mid \Phi_{DT}(w_{i-n+1}^{i-1})) = P_{KN}(w_i \mid w_{i-n+2}^{i-1}), \quad (9)$$

that is, to backing off entirely to the lower order KN probability.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Constructing a Random Forest </SectionTitle>
<Paragraph position="0"> Our DT growing algorithm in Section 3.1 is still based on a greedy approach. As a result, it is not guaranteed to construct the optimal DT. It is also to be expected that the DT will not be optimal for test data, because DT growing and pruning are based only on training and heldout data. In this section, we introduce our RF approach to deal with these problems.</Paragraph>
<Paragraph position="1"> There are two ways to randomize the DT growing algorithm. First, if we consider all positions in the history at each possible split and choose the best one to split on, the DT growing algorithm is deterministic. Instead, we randomly choose a subset of positions for consideration at each possible split. This allows us to choose a split that is not locally optimal, but may lead to an overall better DT. Second, the initialization in the exchange algorithm for node splitting is also random: we randomly and independently put each element $\beta(\cdot)$ into $A$ or $B$ by the outcome of a Bernoulli trial with a success probability of 0.5. The DTs grown randomly are different equivalence classifications of the histories and may capture different characteristics of the training and heldout data.</Paragraph>
<Paragraph position="2"> For each of the $n-1$ positions of the history in an $n$-gram model, we have a Bernoulli trial with a probability $r$ of success. The $n-1$ trials are assumed to be independent of each other. The positions corresponding to successful trials are then passed to the exchange algorithm, which will choose the best among them for splitting a node. It can be shown that the probability that the actual best position (among all $n-1$ positions) will be chosen is $r / \bigl[ 1 - (1-r)^{n-1} \bigr]$. The probability $r$ is a global value that we use for all nodes. By choosing $r$, we can control the randomness of the node splitting algorithm, which in turn controls the randomness of the DT. In general, the smaller the probability $r$ is, the more random the resulting DTs are.</Paragraph>
<Paragraph position="3"> After a non-empty subset of positions is randomly selected, we try to split the node according to each of the chosen positions. For each of the positions, we randomly initialize the exchange algorithm as mentioned earlier.</Paragraph>
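A small sketch of the position sub-sampling described above. The retry loop is our own way of guaranteeing a non-empty subset, consistent with the probability stated above; the paper does not prescribe this particular mechanism, and the function name is illustrative.

```python
import random

def random_positions(n, r):
    """Return a non-empty random subset of the n-1 history positions.

    Each position passes an independent Bernoulli(r) trial; if no
    position passes, the trials are simply repeated so that the exchange
    algorithm always receives a non-empty set of candidate positions.
    """
    while True:
        chosen = [j for j in range(1, n) if random.random() < r]
        if chosen:
            return chosen

# e.g., random_positions(4, 0.5) picks from the 3 positions of a trigram-plus-one history
```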
<Paragraph position="4"> Another way to construct RFs is to first sample the training data and then grow one DT for each random sample of the data (Amit and Geman, 1997; Breiman, 2001; Ho, 1998). Sampling the training data will leave some of the data out, so each sample could become more sparse. Since we always face the data sparseness problem in language modeling, we did not use this approach in our experiments. However, we keep this approach as a possible direction for future research.</Paragraph>
<Paragraph position="5"> The randomized version of the DT growing algorithm is run many times, and finally we get a collection of randomly grown DTs. We call this collection a Random Forest (RF). Since each DT is a smoothed language model, we simply aggregate all DTs in our RF to get the RF language model. Suppose we have $M$ randomly grown DTs, $DT_1, \ldots, DT_M$. The RF language model probability is the average of the DT probabilities:

$$P_{RF}(w_i \mid w_{i-n+1}^{i-1}) = \frac{1}{M} \sum_{m=1}^{M} P_{DT_m}(w_i \mid \Phi_{DT_m}(w_{i-n+1}^{i-1})), \quad (10)$$

where $\Phi_{DT_m}(\cdot)$ maps a history to its equivalence class in the $m$-th DT. If a history can not be mapped to a leaf node in some DT, we back off to the lower order KN probability $P_{KN}(w_i \mid w_{i-n+2}^{i-1})$, as mentioned at the end of the previous section.</Paragraph>
<Paragraph position="6"> It is worth mentioning that the RF language model in Equation 10 can be represented as a single compact model, as long as all the random DTs use the same lower order probability distribution for smoothing. An $n$-gram language model can be seen as a special DT language model, and a DT language model can in turn be seen as a special RF language model; therefore, our RF language model is a more general representation of language models.</Paragraph> </Section> </Section> </Paper>