<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1124">
<Title>A Hierarchical Bayesian Language Model based on Pitman-Yor Processes</Title>
<Section position="3" start_page="0" end_page="985" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Probabilistic language models are used extensively in a variety of linguistic applications, including speech recognition, handwriting recognition, optical character recognition, and machine translation. Most language models fall into the class of n-gram models, which approximate the distribution over sentences using the conditional distribution of each word given a context consisting of only the previous n-1 words,</Paragraph>
<Paragraph position="1"> $P(\mathrm{word}_1 \cdots \mathrm{word}_T) \approx \prod_{i=1}^{T} P(\mathrm{word}_i \mid \mathrm{word}_{i-n+1} \cdots \mathrm{word}_{i-1})$ </Paragraph>
<Paragraph position="2"> with n = 3 (trigram models) being typical. Even for such a modest value of n the number of parameters is still tremendous due to the large vocabulary size. As a result, direct maximum-likelihood parameter fitting severely overfits to the training data, and smoothing methods are indispensable for proper training of n-gram models.</Paragraph>
<Paragraph position="3"> A large number of smoothing methods have been proposed in the literature (see (Chen and Goodman, 1998; Goodman, 2001; Rosenfeld, 2000) for good overviews). Most methods take a rather ad hoc approach, in which n-gram probabilities for various values of n are combined using either interpolation or back-off schemes.</Paragraph>
<Paragraph position="4"> Though some of these methods are intuitively appealing, the main justification has always been empirical: better perplexities or error rates on test data. Though arguably this should be the only real justification, it only answers the question of whether a method performs better, not how or why it performs better. This is unavoidable given that most of these methods are not based on internally coherent Bayesian probabilistic models, which have explicitly declared prior assumptions and whose merits can be argued in terms of how closely these fit the known properties of natural languages. Bayesian probabilistic models also have additional advantages: it is relatively straightforward to improve them by incorporating additional knowledge sources, and to include them in larger models in a principled manner. Unfortunately, the performance of previously proposed Bayesian language models has been dismal compared to other smoothing methods (Nadas, 1984; MacKay and Peto, 1994).</Paragraph>
<Paragraph position="5"> In this paper, we propose a novel language model based on a hierarchical Bayesian model (Gelman et al., 1995) in which each hidden variable is distributed according to a Pitman-Yor process, a nonparametric generalization of the Dirichlet distribution that is widely studied in the statistics and probability theory communities (Pitman and Yor, 1997; Ishwaran and James, 2001; Pitman, 2002).</Paragraph>
<Paragraph position="6"> Our model is a direct generalization of the hierarchical Dirichlet language model of (MacKay and Peto, 1994). Inference in our model is, however, not as straightforward, and we propose an efficient Markov chain Monte Carlo sampling scheme.</Paragraph>
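To make the Pitman-Yor process concrete, the following is a minimal Python sketch of drawing from a single (non-hierarchical) Pitman-Yor process PY(d, theta, G0) via its standard Chinese-restaurant-style predictive rule. The function name, the uniform-integer base distribution, and the parameter values are illustrative assumptions, not part of the paper; the paper's hierarchical model and sampler are developed in the later sections.

    import random

    def pitman_yor_draws(n, d, theta, base_draw, rng=random.Random(0)):
        """Draw n values from PY(d, theta, G0) using the Chinese-restaurant-style
        predictive rule (a sketch; assumes 0 <= d < 1 and theta > 0)."""
        dishes, counts = [], []              # value served at each table, customers per table
        out = []
        for i in range(n):                   # i customers have been seated so far
            # open a new table with probability (theta + d * #tables) / (theta + i)
            if rng.random() * (theta + i) < theta + d * len(dishes):
                dishes.append(base_draw())   # a new table serves a fresh draw from G0
                counts.append(1)
                out.append(dishes[-1])
            else:
                # otherwise join an existing table k with probability
                # proportional to counts[k] - d
                r = rng.random() * (i - d * len(dishes))
                k = 0
                while k < len(counts) - 1 and r > counts[k] - d:
                    r -= counts[k] - d
                    k += 1
                counts[k] += 1
                out.append(dishes[k])
        return out

    # With d > 0 the number of distinct values among n draws grows roughly as n**d,
    # so type frequencies follow a power law, mirroring word frequencies in text.
    draws = pitman_yor_draws(10000, d=0.8, theta=1.0,
                             base_draw=lambda: random.randint(0, 10**6))

With discount d > 0, the number of distinct values among n draws grows roughly as n to the power d, which is the power-law behaviour referred to in the following paragraph.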
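For comparison, the interpolated Kneser-Ney estimator discussed in the next paragraph can be written as a simple recursion over progressively shorter contexts. The Python sketch below is a simplified illustration, not the paper's exact formulation: it assumes the caller has precomputed counts, mapping each context tuple to a Counter of next-word counts (raw counts at the full order, continuation counts at the shorter back-off orders, as in Chen and Goodman (1998)), and it uses a single discount d at every order.

    from collections import Counter

    def ikn_prob(w, ctx, counts, d, vocab_size):
        """Interpolated Kneser-Ney P(w | ctx): a simplified sketch.
        counts maps context tuples to Counters of next-word counts (raw counts
        at the full order, continuation counts at shorter orders); d is a single
        discount in (0, 1); the recursion bottoms out in a uniform distribution
        over the vocabulary."""
        table = counts.get(ctx, Counter())
        total = sum(table.values())
        # back-off estimate: drop the earliest context word, or fall back to uniform
        lower = ikn_prob(w, ctx[1:], counts, d, vocab_size) if ctx else 1.0 / vocab_size
        if total == 0:
            return lower
        discounted = max(table[w] - d, 0.0) / total
        backoff_mass = d * len(table) / total    # len(table) = distinct words seen after ctx
        return discounted + backoff_mass * lower

Each context thus discounts its observed counts and passes the held-out mass, d times the number of distinct continuations, down to a shorter context; the paper argues that this discount-and-redistribute structure is exactly what a particular approximate inference scheme in the hierarchical Pitman-Yor model yields.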
<Paragraph position="7"> Pitman-Yor processes produce power-law distributions that more closely resemble those seen in natural languages, and it has been argued that as a result they are better suited to applications in natural language processing (Goldwater et al., 2006). We show experimentally that our hierarchical Pitman-Yor language model does indeed produce results superior to interpolated Kneser-Ney and comparable to modified Kneser-Ney, two of the currently best-performing smoothing methods (Chen and Goodman, 1998). In fact we show a stronger result: interpolated Kneser-Ney can be interpreted as a particular approximate inference scheme in the hierarchical Pitman-Yor language model. Our interpretation is more useful than past interpretations involving marginal constraints (Kneser and Ney, 1995; Chen and Goodman, 1998) or maximum-entropy models (Goodman, 2004), as it recovers the exact formulation of interpolated Kneser-Ney and actually produces superior results. Goldwater et al. (2006) independently noted the correspondence between the hierarchical Pitman-Yor language model and interpolated Kneser-Ney, and conjectured improved performance of the hierarchical Pitman-Yor language model, which we verify here.</Paragraph>
<Paragraph position="8"> Thus the contributions of this paper are threefold: we propose a language model with excellent performance and the accompanying advantages of Bayesian probabilistic models, we propose a novel and efficient inference scheme for the model, and we establish a direct correspondence between interpolated Kneser-Ney and the Bayesian approach.</Paragraph>
<Paragraph position="9"> We describe the Pitman-Yor process in Section 2 and propose the hierarchical Pitman-Yor language model in Section 3. In Sections 4 and 5 we give a high-level description of our sampling-based inference scheme, leaving the details to a technical report (Teh, 2006). We also show how interpolated Kneser-Ney can be interpreted as approximate inference in the model. We present experimental comparisons to interpolated and modified Kneser-Ney and to the hierarchical Dirichlet language model in Section 6, and conclude in Section 7.</Paragraph>
</Section>
</Paper>