<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1030">
<Title>New Techniques for Context Modeling</Title>
<Section position="3" start_page="0" end_page="220" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Current approaches to automatic speech and handwriting transcription demand a strong language model with a small number of states and an even smaller number of parameters. If the model entropy is high, then transcription results are abysmal. If there are too many states, then transcription becomes computationally infeasible. And if there are too many parameters, then &quot;overfitting&quot; occurs and predictive performance degrades.</Paragraph>
<Paragraph position="1"> In this paper we introduce three new techniques for statistical language models: extension modeling, nonmonotonic contexts, and the divergence heuristic. Together these techniques result in language models that have few states, even fewer parameters, and low message entropies. For example, our techniques achieve a message entropy of 1.97 bits/char on the Brown corpus using only 89,325 parameters.[1] In contrast, the character 4-gram model requires 250 times as many parameters in order to achieve a message entropy of only 2.47 bits/char. Given the logarithmic nature of codelengths, a savings of 0.5 bits/char is quite significant. The fact that our model performs significantly better using vastly fewer parameters argues that it is a much better probability model of natural language text.</Paragraph>
<Paragraph position="2"> [1] By modestly increasing the number of model parameters in a principled manner, our techniques are able to further reduce the message entropy of the Brown Corpus to 1.91 bits/char, by replacing the incremental cost formula ΔL_e(w, ·, a) with a constant cost of 2 bits/extension. This small change reduces the test message entropy from 1.97 to 1.91 bits/char, but it also quadruples the number of model parameters and triples the total codelength.</Paragraph>
<Paragraph position="3"> Our first two techniques - nonmonotonic contexts and extension modeling - are generalizations of the traditional context model (Cleary and Witten 1984; Rissanen 1983, 1986). Our third technique - the divergence heuristic - is an incremental model selection criterion based directly on Rissanen's (1978) minimum description length (MDL) principle. The MDL principle states that the best model is the simplest model that provides a compact description of the observed data.</Paragraph>
<Paragraph position="4"> In the traditional context model, every prefix and every suffix of a context is also a context. Three consequences follow from this property. The first consequence is that the context dictionary is unnecessarily large, because most of these contexts are redundant. The second consequence is that the benefits of context blending are attenuated, because most contexts are equivalent to their maximal proper suffixes. The third consequence is that the length of the longest candidate context can increase by at most one symbol at each time step, which impairs the model's ability to capture complex sources. In a nonmonotonic model, this constraint is relaxed to allow compact dictionaries, discontinuous backoff, and arbitrary context switching.</Paragraph>
<Paragraph position="5"> The traditional context model maps every history to a unique context. All symbols are predicted using that context, and those predictions are estimated using the same set of histories. In contrast, an extension model maps every history to a set of contexts, one for each symbol in the alphabet. Each symbol is predicted in its own context, and the model's current predictions need not be estimated using the same set of histories. This is a form of parameter tying that increases the accuracy of the model's predictions while reducing the number of free parameters in the model.</Paragraph>
<Paragraph position="6"> As a result of these two generalizations, nonmonotonic extension models can outperform their equivalent context models using significantly fewer parameters. For example, an order 3 n-gram (i.e., the 4-gram) requires more than 51 times as many contexts and 787 times as many parameters as the order 3 nonmonotonic extension model, yet already performs worse on the Brown corpus by 0.08 bits/char.</Paragraph>
<Paragraph position="7"> Our third contribution is the divergence heuristic, which adds a more specific context to the model only when it reduces the codelength of the past data more than it increases the codelength of the model.</Paragraph>
<Paragraph position="8"> In contrast, the traditional selection heuristic adds a more specific context to the model only if its entropy is less than the entropy of the more general context (Rissanen 1983, 1986). The traditional minimum entropy heuristic is a special case of the more effective and more powerful divergence heuristic. The divergence heuristic allows our models to generalize from the training corpus to the testing corpus, even for nonstationary sources such as the Brown corpus.</Paragraph>
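To make the divergence test concrete, the following Python sketch illustrates its shape under assumptions that are ours rather than the paper's: each context is summarized only by its symbol counts, past data is coded with a simple Laplace (add-one) sequential code, and the names codelength, divergence_gain, and should_add_context are illustrative helpers, not the paper's notation. The paper's actual incremental cost is defined in section 2; this is only a schematic rendering of "accept a more specific context when the savings on the past data exceed the added model cost."

import math
from collections import Counter


def codelength(counts, alphabet_size):
    """Bits needed to code the symbols summarized by `counts` with a
    Laplace (add-one) sequential code; the total depends only on the counts."""
    bits = 0.0
    seen = Counter()
    n = 0
    for symbol, count in counts.items():
        for _ in range(count):
            p = (seen[symbol] + 1) / (n + alphabet_size)
            bits += -math.log2(p)
            seen[symbol] += 1
            n += 1
    return bits


def divergence_gain(parent_counts, child_counts, alphabet_size):
    """Codelength savings (bits) on the past data from predicting the
    candidate context's symbols separately instead of inside the parent."""
    remainder = parent_counts - child_counts  # symbols that stay with the parent
    old_bits = codelength(parent_counts, alphabet_size)
    new_bits = codelength(child_counts, alphabet_size) + codelength(remainder, alphabet_size)
    return old_bits - new_bits


def should_add_context(parent_counts, child_counts, alphabet_size, model_cost_bits):
    """Divergence-style test: grow the model only if the data savings
    exceed the cost of describing the larger model."""
    return divergence_gain(parent_counts, child_counts, alphabet_size) > model_cost_bits


if __name__ == "__main__":
    alphabet_size = 27  # illustrative: 26 letters plus space
    # Hypothetical counts: symbols observed after history "th" (parent)
    # and after the more specific candidate history " th" (child).
    parent = Counter({"e": 40, "a": 25, "o": 15, " ": 20})
    child = Counter({"e": 30, "a": 2, "o": 1})
    print(f"codelength savings on past data: {divergence_gain(parent, child, alphabet_size):.2f} bits")
    # A flat 2 bits/extension is the constant-cost variant mentioned in footnote 1.
    print("add the more specific context:", should_add_context(parent, child, alphabet_size, 2.0))

In this sketch the model cost is passed in as a constant; in the paper it is an incremental codelength for the new context, which is what makes the heuristic a direct application of the MDL principle rather than a fixed threshold.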
<Paragraph position="9"> The remainder of our article is organized into three sections. In section 2, we formally define the class of extension models and present a heuristic model selection algorithm for that model class based on the divergence criterion. Next, in section 3, we demonstrate the efficacy of our techniques on the Brown Corpus, an eclectic collection of English prose containing approximately one million words of text.</Paragraph>
<Paragraph position="10"> Section 4 discusses possible improvements to the model class.</Paragraph>
</Section>
</Paper>