<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0309">
  <Title>Aggregate and mixed-order Markov models for statistical language processing</Title>
  <Section position="7" start_page="150958" end_page="150958" type="concl">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> Our results demonstrate the utility of language models that are intermediate in size and accuracy between different order n-gram models. The two models considered in this paper were hidden variable Markov models trained by EM algorithms for maximum likelihood estimation. Combinations of intermediate-order models were also investigated by Rosenfeld (1996). His experiments used the 20,000word vocabulary Wall Street Journal corpus, a predecessor of the NAB corpus. He trained a maximum-entropy model consisting of unigrams, bigrams, trigrams, skip-2 bigrams and trigrams; after selecting long-distance bigrams (word triggers) on 38 million words, the model was tested on a held-out 325 thousand word sample. Rosenfeld reported a test-set perplexity of 86, a 19% reduction from the 105 perplexity of a baseline trigram backoff model. In our experiments, the perplexity gain of the mixed-order model ranged from 16% to 22%, depending on the amount of truncation in the trigram model.</Paragraph>
    <Paragraph position="1"> While Rosenfeld's results and ours are not di- null rectly comparable, both demonstrate the utility of mixed-order models. It is worth discussing, however, the different approaches to combining information from non-adjacent words. Unlike the maximum entropy approach, which allows one to combine many non-independent features, ours calls for a careful Markovian decomposition. Rosenfeld argues at length against naive linear combinations in favor of maximum entropy methods. His arguments do not apply to our work for several reasons. First, we use a large number of context-dependent mixing parameters to optimize the overall likelihood of the combined model. Thus, the weighting in eq. (6) ensures that the skip-k predictions are only invoked when the context is appropriate. Second, we adjust the predictions of the skip-k transition matrices (by EM) so that they match the contexts in which they are invoked. Hence, the count-based models are interpolated in a way that is &amp;quot;consistent&amp;quot; with their eventual use.</Paragraph>
    <Paragraph position="2"> Training efficiency is another issue in evaluating language models. The maximum entropy method requires very long training times: e.g., 200 CPUdays in Rosenfeld's experiments. Our methods require significantly less; for example, we trained the smoothed m = 2 mixed-order model, from start to finish, in less than 12 CPU-hours (while using a larger training corpus). Even accounting for differences in processor speed, this amounts to a significant mismatch in overall training time.</Paragraph>
    <Paragraph position="3"> In conclusion, let us mention some open problems for further research. Aggregate Markov models can be viewed as approximating the full bigram transition matrix by a matrix of lower rank. (From eq. (1), it should be clear that the rank of the class-based transition matrix is bounded by the number of classes, C.) As such, there are interesting parallels between Expectation-Maximization (EM), which minimizes the approximation error as measured by the KL divergence, and singular value decomposition (SVD), which minimizes the approximation error as measured by the L2 norm (Press et al., 1988; Schiitze, 1992). Whereas SVD finds a global minimum in its error measure, however, EM only finds a local one. It would clearly be desirable to improve our understanding of this fundamental problem.</Paragraph>
    <Paragraph position="4"> In this paper we have focused on bigram models, but the ideas and algorithms generalize in a straight-forward way to higher-order n-grams. Aggregate models based on higher-order n-grams (Brown et al., 1992) might be able to capture multi-word structures such as noun phrases. Likewise, trigram-based mixed-order models would be useful complements to 4-gram and 5-gram models, which are not uncommon in large-vocabulary language modeling.</Paragraph>
    <Paragraph position="5"> A final issue that needs to be addressed is scaling--that is, how the performance of these models depends on the vocabulary size and amount of training data. Generally, one expects that the sparser the data, the more helpful are models that can intervene between different order n-grams. Nevertheless, it would be interesting to see exactly how this relationship plays out for aggregate and mixed-order Markov models.</Paragraph>
  </Section>
class="xml-element"></Paper>