<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3242">
  <Title>Random Forests in Language Modeling</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We will first show the performance of our RF language models as measured by PPL. After analyzing these results, we will present the performance when the RF language models are used in a large vocabulary speech recognition system.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Perplexity
</SectionTitle>
      <Paragraph position="0"> We have used the UPenn Treebank portion of the WSJ corpus to carry out our experiments. The UPenn Treebank contains 24 sections of hand-parsed sentences, for a total of about one million words. We used section 00-20 (929,564 words) for training our models, section 21-22 (73,760 words) as heldout data for pruning the DTs, and section 2324 (82,430 words) to test our models. Before carrying out our experiments, we normalized the text in the following ways: numbers in arabic form were replaced by a single token &amp;quot;N&amp;quot;, punctuations were removed, all words were mapped to lower case. The word vocabulary contains 10k words including a special token for unknown words. All of the experimental results in this section are based on this corpus and setup.</Paragraph>
      <Paragraph position="1"> The RF approach was applied to a trigram language model. We built 100 DTs randomly as described in the previous section and aggregated the probabilities to get the final probabilities for words in the test data. The global Bernoulli trial probability was set to 0.5. In fact, we found that this probability was not critical: using different values in our study gave similar results in PPL. Since we can add any data to a DT to estimate the probabilities once it is grown and pruned, we used both training and heldout data during testing, but only training data for heldout data results. We denote this RF language model as &amp;quot;RF-trigram&amp;quot;, as opposed to &amp;quot;KNtrigram&amp;quot; for a baseline trigram with KN smoothing2 The baseline KN-trigram also used both training and heldout data to get the PPL results on test data and only training data for the heldout-data results.</Paragraph>
      <Paragraph position="2"> We also generated one DT without randomizing the node splitting, which we name &amp;quot;DT-trigram&amp;quot;. As we  can see from Table 1, DT-trigram obtained a slightly lower PPL than KN-trigram on heldout data, but was much worse on the test data. However, the RF-trigram performed much better on both heldout and 2We did not use the Modified Kneser-Ney smoothing (Chen and Goodman, 1998). In fact, using the SRILM toolkit (Stolcke, 2002) with the Modified Kneser-Ney smoothing can reduce the PPL on test data to 143.9. Since we are not using the Modified Kneser-Ney in our DT smoothing, we only report KN-trigram results using Interpolated Kneser-Ney smoothing. test data: our RF-trigram reduced the heldout data PPL from 160.1 to 126.8, or by 20.8%, and the test data PPL by 10.6%. Although we would expect improvements from the DT-trigram on the heldout data since it is used to prune the fully grown DT, the actual gain using a single DT is quite small (0.9%).</Paragraph>
      <Paragraph position="3"> We also interpolated the DT-trigram and RF-trigram with the KN-trigram at different levels of interpolation weight on the test data. It is interesting to see from Table 2 that interpolating KN-trigram with DT-trigram results in a small improvement (1.9%) over the KN-trigram, when most of the interpolation weight is on KN-trigram (a110 a2 a90a36a13a1a0 ). However, interpolating KN-trigram with RF-trigram does not yield further improvements over RF-trigram by itself. Therefore, the RF modeling approach directly improves KN estimates by using randomized history clustering.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Analysis
</SectionTitle>
      <Paragraph position="0"> Our final model given by Equation 10 can be thought of as performing randomized history clustering in which each history is clustered into a0 different equivalence classes with equal probability. In order to analyze why this RF approach can improve the PPL on test data, we split the events (an event is a predicted word with its history) in test data into two categories: seen events and unseen events. For KN-trigram, seen events are those that appear in the training or heldout data at least once. For DTtrigram, a seen event is one whose predicted word is seen following the equivalence class of the history.</Paragraph>
      <Paragraph position="1"> For RF-trigram, we define seen events as those that are seen events in at least one DT among the random collection of DTs.</Paragraph>
      <Paragraph position="2"> It can be seen in Table 3 that the DT-trigram reduced the number of unseen events in the test data from 54.4% of the total events to 41.9%, but it increased the overall PPL. This is due to the fact that we used heldout data for pruning. On the other hand, the RF-trigram reduced the number of unseen events greatly: from 54.4% of the total events to only 8.3%. Although the PPL of remaining unseen  events is much higher, the overall PPL is still improved. The randomized history clustering in the RF-trigram makes it possible to compute probabilities of most test data events without relying on backoff. Therefore, the RF-trigram can effectively increase the probability of those events that will otherwise be backoff to lower order statistics.</Paragraph>
      <Paragraph position="3"> In order to reveal more about the cause of improvements, we also compared the KN-trigram and RF-trigram on events that are seen in different number of DTs. In Table 4, we splitted events into smaller groups according the the number of times they are seen among the 100 DTs. For the events seen times %total KN-trigram RF-trigram  seen in 100 DTs that are seen in all 100 DTs, the RF-trigram performs similarly as the KN-trigram since those are mostly seen for the KN-trigram as well. Interestingly, for those events that are unseen for the KNtrigram, the more times they are seen in the DTs, the more improvement in PPL there are. Unseen events in the KN-trigram depend on the lower order probabilities penalized by the interpolation weight, therefore, a seen event has a much higher probability. This is also true for each DT. According to Equation 10, the more times an event is seen in the DTs, the more high probabilities it gets from the DTs, therefore, the higher the final aggregated probability is. In fact, we can see from Table 4 that the PPL starts to improve when the events are seen in 3 DTs. The RF-trigram effectively makes most of the events seen more than 3 times in the DTs, thus assigns them higher probabilities than the KNtrigram. null There is no theoretical basis for choosing the number of DTs needed for the RF model to work well. We chose to grow 100 DTs arbitrarily. In Figure 1, we plot the PPL of the RF-trigram on held-out and test data as a function of number of DTs. It is clear that the PPL drops sharply at the beginning and tapers off quite quickly. It is also worth noting that for test data, the PPL of the RF-trigram with less than 10 DTs is already better than the KNtrigram. null</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 N-best Re-scoring Results
</SectionTitle>
      <Paragraph position="0"> To test our RF modeling approach in the context of speech recognition, we evaluated the models in the WSJ DARPA'93 HUB1 test setup. The size of the test set is 213 utterances, 3,446 words. The 20k words open vocabulary and baseline 3-gram model are the standard ones provided by NIST and LDC.</Paragraph>
      <Paragraph position="1"> The lattices and a1 -best lists were generated using the standard 3-gram model trained on 40M words of WSJ text. The a1 -best size was at most 50 for each utterance, and the average size was about 23.</Paragraph>
      <Paragraph position="2"> We trained KN-trigram and RF-trigram using 20M words and 40M words to see the effect of training data size. In both cases, RF-trigram was made of 100 randomly grown DTs and the global Bernoulli trial probability was set to 0.5. The results are reported in Table 5.</Paragraph>
      <Paragraph position="3">  For the purpose of comparison, we interpolated all models with the KN-trigram built from 40M words at different levels of interpolation weight. However, it is the a110 =0.0 column (a110 is the weight on the KN-trigram trained from 40M words) that is the most interesting. We can see that under both conditions the RF approach improved upon the regular KN approach, for as much as 1.1% absolute when 20M words were used to build trigram models. Standard a0 -test3 shows that the improvements are significant at a0a1a0 0.001 and a0a1a0 0.05 level respectively. null However, we notice that the improvement in WER using the trigram with 40M words is not as much as the trigram with 20M words. A possible reason is that with 40M words, the data sparseness problem is not as severe and the performance of the RF approach is limited. It could also be because our test set is too small. We need a much larger test set to investigate the effectiveness of our RF approach.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>