<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1010">
  <Title>Improving Language Models by Clustering Training Sentences</Title>
  <Section position="2" start_page="0" end_page="59" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In speech recognition and understanding systems, many kinds of language model may be used to choose between the word and sentence hypotheses for which there is evidence in the acoustic data. Some words, word sequences, syntactic constructions and semantic structures are more likely to occur than others, and the presence of more likely objects in a sentence hypothesis is evidence for the correctness of that hypothesis. Evidence from different knowledge sources can be combined in an attempt to optimize the selection of correct hypotheses; see e.g. Alshawi and Carter (1994); Rayner et al (1994); Rosenfeld (1994).</Paragraph>
    <Paragraph position="1"> Many of the knowledge sources used for this purpose score a sentence hypothesis by calculating a simple, typically linear, combination of scores associated with objects, such as N-grams and grammar rules, that characterize the hypothesis or its preferred linguistic analysis. When these scores are viewed as log probabilities, taking a linear sum corresponds to making an independence assumption that is known to be at best only approximately true, and that may give rise to inaccuracies that reduce the effectiveness of the knowledge source.</Paragraph>
    <Paragraph position="2"> The most obvious way to make a knowledge source more accurate is to increase the amount of structure or context that it takes account of. For example, a bigram model may be replaced by a trigram one, and the fact that dependencies exist among the likelihoods of occurrence of grammar rules at different locations in a parse tree can be modeled by associating probabilities with states in a parsing table rather than simply with the rules themselves (Briscoe and Carroll, 1993).</Paragraph>
    <Paragraph position="3"> However, such remedies have their drawbacks.</Paragraph>
    <Paragraph position="4"> Firstly, even when the context is extended, some important influences may still not be modeled. For example, dependencies between words exist at separations greater than those allowed for by trigrams (for which long-distance N-grams \[Jelinek et al, 1991\] are a partial remedy), and associating scores with parsing table states may not model all the important correlations between grammar rules. Secondly, extending the model may greatly increase the amount of training data required if sparseness problems are to be kept under control, and additional data may be unavailable or expensive to collect. Thirdly, one cannot always know in advance of doing the work whether extending a model in a particular direction will, in practice, improve results. If it turns out not to, considerable ingenuity and effort may have been wasted.</Paragraph>
    <Paragraph position="5"> In this paper, I argue for a general method for  extending the context-sensitivity of any knowledge source that calculates sentence hypothesis scores as linear combinations of scores for objects. The method, which is related to that of Iyer, Ostendorf and Rohlicek (1994), involves clustering the sentences in the training corpus into a number of subcorpora, each predicting a different probability distribution for linguistic objects. An utterance hypothesis encountered at run time is then treated as if it had been selected from the subpopulation of sentences represented by one of these subcorpora.</Paragraph>
    <Paragraph position="6"> This technique addresses as follows the three drawbacks just alluded to. Firstly, it is able to capture the most important sentence-internal contextual effects regardless of the complexity of the probabilistic dependencies between the objects involved. Secondly, it makes only modest additional demands on training data. Thirdly, it can be applied in a standard way across knowledge sources for very different kinds of object, and if it does improve on the unclustered model this constitutes proof that additional, as yet unexploited relationships exist between linguistic objects of the type the model is based on, and that therefore it is worth looking for a more specific, more powerful way to model them.</Paragraph>
    <Paragraph position="7"> The use of corpus clustering often does not boost the power of the knowledge source as much as a specific hand-coded extension. For example, a clustered bigram model will probably not be as powerful as a trigram model. However, clustering can have two important uses. One is that it can provide some improvement to a model even in the absence of the additional (human or computational) resources required by a hand-coded extension. The other use is that the existence or otherwise of an improvement brought about by clustering can be a good indicator of whether additional performance can in fact be gained by extending the model by hand without further data collection, with the possibly considerable additional effort that extension would entail. And, of course, there is no reason why clustering should not, where it gives an advantage, also be used in conjunction with extension by hand to produce yet further improvements.</Paragraph>
    <Paragraph position="8"> As evidence for these claims, I present experimental results showing how, for a particular task and training corpus, clustering produces a sizeable improvement in unigram- and bigram-based models, but not in trigram-based ones; this is consistent with experience in the speech understanding community that while moving from bigrams to trigrams usually produces a definite payoff, a move from trigrams to 4-grams yields less clear benefits for the domain in question. I also show that, for the same task and corpus, clustering produces improvements when sentences are assessed not according to the words they contain but according to the syntax rules used in their best parse. This work thus goes beyond that of Iyer et al by focusing on the methodological importance of corpus clustering, rather than just its usefulness in improving overall systemperformance, and by exploring in detail the way its effectiveness varies along the dimensions of language model type, language model complexity, and number of clusters used. It also differs from Iyer et al's work by clustering at the utterance rather than the paragraph level, and by using a training corpus of thousands, rather than millions, of sentences; in many speech applications, available training data is likely to be quite limited, and may not always be chunked into paragraphs.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML