<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0309">
  <Title>Aggregate and mixed-order Markov models for statistical language processing</Title>
  <Section position="3" start_page="0" end_page="82" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The purpose of a statistical language model is to assign high probabilities to likely word sequences and low probabilities to unlikely ones. The challenge here arises from the combinatorially large number of possibilities, only a fraction of which can ever be observed. In general, language models must learn to recognize word sequences that are functionally similar but lexically distinct. The learning problem, one of generalizing from sparse data, is particularly acute for large-sized vocabularies (Jelinek, Mercer, and Roukos, 1992).</Paragraph>
    <Paragraph position="1"> The simplest models of natural language are n-gram Markov models. In these models, the probability of each word depends on the n- 1 words that precede it. The problems in estimating robust models of this form are well-documented. The number of parameters--or transition probabilities-scales as V n, where V is the vocabulary size. For typical models (e.g., n = 3, V = 104), this number exceeds by many orders of magnitude the total number of words in any feasible training corpus.</Paragraph>
    <Paragraph position="2"> The transition probabilities in n-gram models are estimated from the counts of word combinations in the training corpus. Maximum likelihood (ML) estimation leads to zero-valued probabilities for unseen n-grams. In practice, one adjusts or smoothes (Chen and Goodman, 1996) the ML estimates so that the language model can generalize to new phrases.</Paragraph>
    <Paragraph position="3"> Smoothing can be done in many ways--for example, by introducing artificial counts, backing off to lower-order models (Katz, 1987), or combining models by interpolation (Jelinek and Mercer, 1980).</Paragraph>
    <Paragraph position="4"> Often a great deal of information:is lost in the smoothing procedure. This is due to the great discrepancy between n-gram models of different order.</Paragraph>
    <Paragraph position="5"> The goal of this paper is to investigate models that are intermediate, in both size and accuracy, between different order n-gram models. We show that such models can &amp;quot;intervene&amp;quot; between different order n-grams in the smoothing procedure. Experimentally, we find that this significantly reduces the perplexity of unseen word combinations.</Paragraph>
    <Paragraph position="6"> The language models in this paper were evaluated on the ARPA North American Business News (NAB) corpus. All our experiments used a vocabulary of sixty-thousand words, including tokens for punctuation, sentence boundaries, and an unknown word token standing for all out-of-vocabulary words. The training data consisted of approximately 78 million words (three million sentences); the test data, 13 million words (one-half million sentences). All sentences were drawn randomly without replacement from the NAB corpus. All perplexity figures given in the paper are computed by combining sentence probabilities; the probability of sentence wow1 ...w~wn+l is given by yIn+lP(wilwo ..wi-1), where w0 and wn+l are i=1 the start- and end-of-sentence markers, respectively.</Paragraph>
    <Paragraph position="7"> Though not reported below, we also confirmed that the results did not vary significantly for different randomly drawn test sets of the same size.</Paragraph>
    <Paragraph position="8"> The organization of this paper is as follows.</Paragraph>
    <Paragraph position="9"> In Section 2, we examine aggregate Markov models, or class-based bigram models (Brown et al., 1992) in which the mapping from words to classes  is probabilistic. We describe an iterative algorithm for discovering &amp;quot;soft&amp;quot; word classes, based on the Expectation-Maximization (EM) procedure for maximum likelihood estimation (Dempster, Laird, and Rubin, 1977). Several features make this algorithm attractive for large-vocabulary language modeling: it has no tuning parameters, converges monotonically in the log-likelihood, and handles probabilistic constraints in a natural way. The number of classes, C, can be small or large depending on the constraints of the modeler. Varying the number of classes leads to models that are intermediate between unigram (C = 1) and bigram (C = V) models.</Paragraph>
    <Paragraph position="10"> In Section 3, we examine another sort of &amp;quot;intermediate&amp;quot; model, one that arises from combinations of non-adjacent words. Language models using such combinations have been proposed by Huang et al.</Paragraph>
    <Paragraph position="11"> (1993), Ney, Essen, and Kneser (1994), and Rosenfeld (1996), among others. We consider specifically the skip-k transition matrices, M(wt_k, wt), whose predictions are conditioned on the kth previous word in the sentence. (The value of k determines how many words one &amp;quot;skips&amp;quot; back to make the prediction.) These predictions, conditioned on only a single previous word in the sentence, are inherently weaker than those conditioned on all k previous words. Nevertheless, by combining several predictions of this form (for different values of k), we can create a model that is intermediate in size and accuracy between bigram and trigram models.</Paragraph>
    <Paragraph position="12"> Mixed-order Markov models express the predictions P(wt\[wt-1, wt-2,..., Wt-m) as a convex combination of skip-k transition matrices, M(wt-k, wt). We derive an EM algorithm to learn the mixing coefficients, as well as the elements of the transition matrices. The number of transition probabilities in these models scales as mV 2, as opposed to V m+l.</Paragraph>
    <Paragraph position="13"> Mixed-order models are not as powerful as trigram models, but they can make much stronger predictions than bigram models. The reason is that quite often the immediately preceding word has less predictive value than earlier words in the same sentence. In Section 4, we use aggregate and mixed-order models to improve the probability estimates from n-grams. This is done by interposing these models between different order n-grams in the smoothing procedure. We compare our results to a baseline tri-gram model that backs off to bigram and unigram models. The use of &amp;quot;intermediate&amp;quot; models is found to reduce the perplexity of unseen word combinations by over 50%.</Paragraph>
    <Paragraph position="14"> In Section 5, we discuss some extensions to these models and some open problems for future research.</Paragraph>
    <Paragraph position="15"> We conclude that aggregate and mixed-order models provide a compelling alternative to language models based exclusively on n-grams.</Paragraph>
  </Section>
class="xml-element"></Paper>