<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3242">
<Title>Random Forests in Language Modeling</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> In many systems dealing with natural speech or language, such as Automatic Speech Recognition and Statistical Machine Translation, a language model is a crucial component for searching the often prohibitively large hypothesis space. Most state-of-the-art systems use n-gram language models, which are simple and effective most of the time. Many smoothing techniques that improve language model probability estimation have been proposed and studied in the n-gram literature (Chen and Goodman, 1998). There has also been work exploring Decision Tree (DT) language models (Bahl et al., 1989; Potamianos and Jelinek, 1998), which attempt to cluster similar histories together to achieve better probability estimation. However, the results were not promising (Potamianos and Jelinek, 1998): in a fair comparison, DT language models failed to improve upon baseline n-gram models of the same order n.</Paragraph>
<Paragraph position="1"> The aim of DT language models is to alleviate the data sparseness problem encountered in n-gram language models. However, the cause of the negative results is exactly the same: data sparseness, coupled with the fact that the DT construction algorithms decide on tree splits solely on the basis of seen data (Potamianos and Jelinek, 1998). Although various smoothing techniques were studied in the context of DT language models, none of them resulted in significant improvements over n-gram models.</Paragraph>
<Paragraph position="2"> Recently, a neural network based language modeling approach has been applied to trigram language models to deal with the curse of dimensionality (Bengio et al., 2001; Schwenk and Gauvain, 2002). Significant improvements in both perplexity (PPL) and word error rate (WER) over backoff smoothing were reported after interpolating the neural network models with the baseline backoff models. However, the neural network models rely on interpolation with n-gram models, and use n-gram models exclusively for low-frequency words. We therefore believe that improvements in n-gram models should also improve the performance of neural network models.</Paragraph>
<Paragraph position="3"> We propose a new Random Forest (RF) approach to language modeling. The idea of using RFs for language modeling comes from the recent success of RFs in classification and regression (Amit and Geman, 1997; Breiman, 2001; Ho, 1998). By definition, RFs are collections of Decision Trees (DTs) that have been constructed randomly. Therefore, we also propose a new DT language model that can be randomized to construct RFs efficiently. Once constructed, the RFs function as a randomized history clustering, which can help in dealing with the data sparseness problem. Although the individual DTs do not perform well on unseen test data, their collective contribution makes the RFs generalize well to unseen data. We show that our RF approach to n-gram language modeling can result in significant improvements in both PPL and WER in a large vocabulary speech recognition system.</Paragraph>
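The following Python sketch is only meant to make the aggregation idea above concrete; it is not the DT construction algorithm described in Section 3. Each "tree" is replaced by a crude randomized history clustering that conditions on a random subset of history positions and applies add-one smoothing within each cluster, and the forest simply averages the per-tree estimates, which is the property the paragraph above relies on. All class and variable names (RandomHistoryClusterLM, RandomForestLM, etc.) are illustrative assumptions, not from the paper.

```python
import random
from collections import defaultdict


class RandomHistoryClusterLM:
    """Crude stand-in for one randomized DT: cluster n-gram histories by a
    random subset of their positions, with add-one smoothing per cluster."""

    def __init__(self, ngrams, vocab, seed):
        rng = random.Random(seed)
        order = len(ngrams[0])                     # n of the supplied n-grams
        positions = list(range(order - 1))         # indices into the history
        k = rng.randint(1, len(positions))
        self.keep = sorted(rng.sample(positions, k))
        self.vocab = vocab
        self.counts = defaultdict(lambda: defaultdict(int))
        for *history, word in ngrams:
            self.counts[self._cluster(history)][word] += 1

    def _cluster(self, history):
        # The equivalence class of a history: only the positions this tree looks at.
        return tuple(history[i] for i in self.keep)

    def prob(self, word, history):
        cluster = self.counts[self._cluster(history)]
        total = sum(cluster.values())
        return (cluster[word] + 1) / (total + len(self.vocab))  # add-one smoothing


class RandomForestLM:
    """Average the estimates of many randomized history clusterings."""

    def __init__(self, ngrams, vocab, num_trees=50):
        self.trees = [RandomHistoryClusterLM(ngrams, vocab, seed=i)
                      for i in range(num_trees)]

    def prob(self, word, history):
        # A single clustering may generalize poorly to unseen histories;
        # the aggregate estimate is far more robust.
        return sum(t.prob(word, history) for t in self.trees) / len(self.trees)


if __name__ == "__main__":
    text = "the cat sat on the mat the dog sat on the rug".split()
    vocab = set(text)
    trigrams = [tuple(text[i:i + 3]) for i in range(len(text) - 2)]
    lm = RandomForestLM(trigrams, vocab, num_trees=20)
    print(lm.prob("mat", ("on", "the")))           # aggregated estimate of P(mat | on the)
```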
<Paragraph position="4"> The paper is organized as follows: In Section 2, we review the basics of language modeling and smoothing. In Section 3, we briefly review DT-based language models and describe our new DT and RF approach to language modeling. In Section 4, we report the performance of our RF-based language models as measured by both PPL and WER. After some discussion and analysis, we summarize our work and propose some future directions in Section 5.</Paragraph>
</Section>
</Paper>