<?xml version="1.0" standalone="yes"?>
<Paper uid="W95-0108">
  <Title>Beyond Word N-Grams</Title>
  <Section position="3" start_page="0" end_page="96" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Finite-state methods for the statistical prediction of word sequences in natural language have had an important role in language processing research since Markov's and Shannon's pioneering investigations (C.E. Shannon, 1951). While it has always been clear that natural texts are not Markov processes of any finite order (Good, 1969), because of very long range correlations between words in a text such as those arising from subject matter, low-order alphabetic n-gram models have been used very effectively for such tasks as statistical language identification and spelling correction, and low-order word n-gram models have been the tool of choice for language modeling in speech recognition. However, low-order n-gram models fail to capture even relatively local dependencies that exceed model order, for instance those created by long but frequent compound names or technical terms. Unfortunately, extending model order to accommodate those longer dependencies is not practical, since the size of n-gram models is in principle exponential on the order of the model.</Paragraph>
    <Paragraph position="1"> Recently, several methods have been proposed (Ron et al., 1994; Willems et al., 1994) that are able to model longer-range regularities over small alphabets while avoiding the size explosion caused by model order. In those models, the length of contexts used to predict particular symbols is adaptively extended as long as the extension improves prediction above a given threshold. The key ingredient of the model construction is the prediction suffix tree (PST), whose nodes represent suffixes of past input and specify a predictive distribution over possible successors of the suffix. It was shown in (Ron et al., 1994) that under realistic conditions a PST is equivalent to a Markov process of variable order and can be represented efficiently by a probabilistic finite-state automaton. For the purposes of this paper, however, we will use PSTs as our starting point.</Paragraph>
    <Paragraph position="2"> The problem of sequence prediction appears more difficult when the sequence elements are words rather than characters from a small fixed alphabet. The set of words is in principle unbounded, since  in natural language there is always a nonzero probability of encountering a word never seen before. One of the goals of this work is to describe algorithmic and data-structure changes that support the construction of PSTs over unbounded vocabularies. We also extend PSTs with a wildcard symbol that can match against any input word, thus allowing the model to capture statistical dependencies between words separated by a fixed number of irrelevant words.</Paragraph>
    <Paragraph position="3"> An even more fundamental new feature of the present derivation is the ability to work with a mixture of PSTs. Here we adopted two important ideas from machine learning and information theory. The first is the fact that a mixture over an ensemble of experts (models), when the mixture weights are properly selected, performs better than almost any individual member of that ensemble (DeSantis et al., 1988; Cesa-Bianchi et al., 1993). The second idea is that within a Bayesian framework the sum over exponentially many trees can be computed efficiently using a recursive structure of the tree, as was recently shown by Willems et al. (1994). Here we apply these ideas and demonstrate that the mixture, which can be computed as almost as easily as a single PST, performs better than the most likely (maximum aposteriori -- MAP) PST.</Paragraph>
    <Paragraph position="4"> One of the most important features of the present algorithm that it can work in a fully online (adaptive) mode. Specifically, updates to the model structure and statistical quantities can be performed adaptively in a single pass over the data. For each new word, frequency counts, mixture weights and likelihood values associated with each relevant node are appropriately updated. There is not much difference in learning performance between the online and batch modes, as we will see. The online mode seems much more suitable for adaptive language modeling over longer test corpora, for instance in dictation or translation, while the batch algorithm can be used in the traditional manner of n-gram models in sentence recognition and analysis.</Paragraph>
    <Paragraph position="5"> From an information-theoretic perspective, prediction is dual to compression and statistical modeling. In the coding-theoretic interpretation of the Bayesian framework, the assignment of priors to novel events is rather delicate. This question is especially important when dealing with a statistically open source such as natural language. In this work we had to deal with two sets of priors. The first set defines a prior probability distribution over all possible PSTs in a recursive manner, and is intuitively plausible in relation to the statistical self-similarity of the tree. The second set of priors deals with novel events (words observed for the first time) by assuming a scalable probability of observing a new word at each node. For the novel event priors we used a simple variant of the Good-Turing method, which could be easily implemented online with our data structure. It turns out that the final performance is not terribly sensitive to particular assumptions on priors.</Paragraph>
    <Paragraph position="6"> Our successful application of mixture PSTs for word-sequence prediction and modeling make them a valuable approach to language modeling in speech recognition, machine translation and similar applications. Nevertheless, these models still fail to represent explicitly grammatical structure and semantic relationships, even though progress has been made in other work on their statistical modeling. We plan to investigate how the present work may be usefully combined with models of those phenomena, especially local finite-state syntactic models and distributional models of semantic relations.</Paragraph>
    <Paragraph position="7"> In the next sections we present PSTs and the data structure for the word prediction problem.</Paragraph>
    <Paragraph position="8"> We then describe and briefly analyze the learning algorithm. We also discuss several implementation issues. We conclude with a preliminary evaluation of various aspects of the model On several English corpora.</Paragraph>
    <Paragraph position="10"/>
  </Section>
class="xml-element"></Paper>