<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1010">
  <Title>Improving Language Models by Clustering Training Sentences</Title>
  <Section position="3" start_page="59" end_page="60" type="metho">
    <SectionTitle>
2 Cluster-based Language Modeling
</SectionTitle>
    <Paragraph position="0"> Most other work on clustering for language modeling (e.g. Pereira, Tishby and Lee, 1993; Ney, Essen and Kneser, 1994) has addressed the problem of data sparseness by clustering words into classes which are then used to predict smoothed probabilities of occurrence for events which may seldom or never have been observed during training. Thus conceptually at least, their processes are agglomerative: a large initial set of words is clumped into a smaller number of clusters. The approach described here is quite different. Firstly, it involves clustering whole sentences, not words. Secondly, its aim is not to tackle data sparseness by grouping a large number of objects into a smaller number of classes, but to increase the precision of the model by dividing a single object (the training corpus) into some larger number of sub-objects (the clusters of sentences). There is no reason why clustering sentences for prediction should not be combined with clustering words to reduce sparseness; the two operations are orthogonal.</Paragraph>
    <Paragraph position="1"> Our type of clustering, then, is based on the assumption that the utterances to be modeled, as sampled in a training corpus, fall more or less naturally into some number of clusters so that words or other objects associated with utterances have probability distributions that differ between clusters. Thus rather than estimating the relative likelihood of an utterance interpretation simply by combining fixed probabilities associated with its various characteristics, we view these probabilities as conditioned by the initial choice of a cluster or subpopulation from which the utterance is to be drawn. In both cases, many independence assumptions that are known to be at best reasonable approximations will have to be made. However, if the clustering reflects significant dependencies, some of the worst inaccuracies of these assumptions may be reduced, and system performance may improve as a result.</Paragraph>
    <Paragraph position="2"> Some domains and tasks lend themselves more obviously to a clustering approach than others. An obvious and trivial case where clustering is likely to be useful is a speech understander for use by travelers in an international airport; here, an utterance will typically consist of words from one, and only one, natural language, and clusters for different lan- null guages will be totally dissimilar. However, clustering may also give us significant leverage in monolingual cases. If the dialogue handling capabilities of a system are relatively rigid, the system may only ask the user a small number of different questions (modulo the filling of slots with different values). For example, the CLARE interface to the Autoroute PC package (Lewin et al, 1993) has a fairly simple dialogue model which allows it to ask only a dozen or so different types of question of the user. A Wizard of Oz exercise, carried out to collect data for this task, was conducted in a similarly rigid way; thus it is straightforward to divide the training corpus into clusters, one cluster for utterances immediately following each kind of system query. Other corpora, such as Wall Street Journal articles, might also be expected to fall naturally into clusters for different subject areas, and indeed Iyer el al (1994) report positive results from corpus clustering here.</Paragraph>
    <Paragraph position="3"> For some applications, though, there is no obvious extrinsic basis for dividing the training corpus into clusters. The ARPA air travel information (ATIS) domain is an example. Questions can mention concepts such as places, times, dates, fares, meals, airlines, plane types and ground transportation, but most utterances mention several of these, and there are few obvious restrictions on which of them can occur in the same utterance. Dialogues between a human and an ATIS database access system are therefore likely to be less clearly structured than in the Autoroute case.</Paragraph>
    <Paragraph position="4"> However, there is no reason why automatic clustering should not be attempted even when there are no grounds to expect clearly distinct underlying subpopulations to exist. Even a clustering that only partly reflects the underlying variability of the data may give us more accurate predictions of utterance likelihoods. Obviously, the more clusters are assumed, the more likely it is that the increase in the number of parameters to be estimated will lead to worsened rather than improved performance.</Paragraph>
    <Paragraph position="5"> But this trade-off, and the effectiveness of different clustering algorithms, can be monitored and optimized by applying the resulting cluster-based language models to unseen test data. In Section 4 below, 1 report results of such experiments with ATIS data, which, for the reasons given above, would at first sight seem relatively unlikely to yield useful resuits from a clustering approach. Since, as we will see, clustering does yield benefits in this domain, it seems very plausible that it will also do so for other, more naturally clustered domains.</Paragraph>
  </Section>
  <Section position="4" start_page="60" end_page="61" type="metho">
    <SectionTitle>
3 Clustering Algorithms
</SectionTitle>
    <Paragraph position="0"> There are many different criteria for quantifying the (dis)similarity between (analyses of) two sentences or between two clusters of sentences; Everitt (1993) provides a good overview. Unfortunately, whatever the criterion selected, it is in general impractical to find the optimal clustering of the data; instead, one of a variety of algorithms must be used to find a locally optimal solution.</Paragraph>
    <Paragraph position="1"> Let us for the moment consider the case where the language model consists only of a unigram probability distribution for the words in the vocabulary, with no N-gram (for N &gt; 1) or fuller linguistic constraints considered. Perhaps the most obvious measure of the similarity between two sentences or clusters is then Jaccard's coefficient (Everitt, 1993, p41), the ratio of the number of words occurring in both sentences to the number occurring in either or both. Another possibility would be Euclidean distance, with each word in the vocabulary defining a dimension in a vector space. However, it makes sense to choose as a similarity measure the quantity we would like the final clustering arrangement to minimize: the expected entropy (or, equivalently, perplexity) of sentences from the domain. This goal is analogous to that used in the work described earlier on finding word classes by clustering.</Paragraph>
    <Paragraph position="2"> For our simple unigram language model without clustering, the training corpus perplexity is minimized (and its likelihood is maximized) by assigning each word wi a probability Pi = fi/N, where f/ is the frequency of wi and N is the total size of the corpus. The corpus likelihood is then P1 = l-\[i P{', and the per-word entropy, -Y'\],L,, pilog(pi), is thus minimized. (See e.g. Cover and Thomas, 1991, chapter 2 for the reasoning behind this).</Paragraph>
    <Paragraph position="3"> If now we model the language as consisting of sentences drawn at random from K different subpopulations, each with its own unigram probability distribution for words, then the estimated corpus prob-</Paragraph>
    <Paragraph position="5"> where the iterations are over each utterance uj in the corpus, each cluster cl...cg from which uj might arise, and each word wi in utterance uj.</Paragraph>
    <Paragraph position="7"> arising from cluster (or subpopulation) ck, and pk,i is the likelihood assigned to word wi by cluster k, i.e. its relative frequency in that cluster.</Paragraph>
    <Paragraph position="8"> Our ideal, then, is the set of clusters that maximizes the cluster-dependent corpus likelihood PK.</Paragraph>
    <Paragraph position="9"> As with nearly all clustering problems, finding a global maximum is impractical. To derive a good approximation to it, therefore, we adopt the following algorithm.</Paragraph>
    <Paragraph position="10"> * Select a random ordering of the training corpus, and initialize each cluster ck,k = 1...K, to contain just the kth sentence in the ordering.</Paragraph>
    <Paragraph position="11"> * Present each remaining training corpus sentence in turn, initially creating an additional singleton cluster cK+l for it. Merge that pair of clusters cl. *. CK+I that entails the least additional cost, i.e. the smallest reduction in the value of PK for the subcorpus seen so far.</Paragraph>
    <Paragraph position="12">  such that u E ci but the probability of u is maximized by cj. Move all such u's (in parallel) between clusters. Repeat until no further movements are required.</Paragraph>
    <Paragraph position="13"> In practice, we keep track not of PK but of the overall corpus entropy HK = -log( Pl,: ). We record the contribution each cluster c~ makes to HK as</Paragraph>
    <Paragraph position="15"> where fik is the frequency of wi in ck and Fk = ~wjeck fjk, and find the value of this quantity for all possible merged clusters. The merge in the second step of the algorithm is chosen to be the one minimizing the increase in entropy between the unmerged and the merged clusters.</Paragraph>
    <Paragraph position="16"> The adjustment process in the third step of the algorithm does not attempt directly to decrease entropy but to achieve a clustering with the obviously desirable property that each training sentence is best predicted by the cluster it belongs to rather than by another cluster. This heightens the similarities within clusters and the differences between them.</Paragraph>
    <Paragraph position="17"> It also reduces the arbitrariness introduced into the clustering process by the order in which the training sentences are presented. The approach is applicable with only a minor modification to N-grams for N &gt; 1: the probability of a word within a cluster is conditioned on the occurrence of the N-1 words preceding it, and the entropy calculations take this into account. Other cases of context dependence modeled by a knowledge source can be handled similarly.</Paragraph>
    <Paragraph position="18"> And there is no reason why the items characterizing the sentence have to be (sequences of) words; occurrences of grammar rules, either without any context or in the context of, say, the rules occurring just above them in the parse tree, can be treated in just the same way.</Paragraph>
  </Section>
  <Section position="5" start_page="61" end_page="62" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> Experiments were carried out to assess the effectiveness of clustering, and therefore the existence of unexploited contextual dependencies, for instances of two general types of language model. In the first experiment, sentence hypotheses were evaluated on the N-grams of words and word classes they contained. In the second experiment, evaluation was on the basis of grammar rules used rather than word occurrences.</Paragraph>
    <Section position="1" start_page="61" end_page="62" type="sub_section">
      <SectionTitle>
4.1 N-gram Experiment
</SectionTitle>
      <Paragraph position="0"> In the first experiment, reference versions of a set of 5,873 domain-relevant (classes A and D) ATIS2 sentences were allocated to K clusters for K = 2, 3, 5, 6, 10 and 20 for the unigram, bigram and tri-gram conditions and, for unigrams and bigrams only, K = 40 and 100 as well. Each run was repeated for ten different random orders for presentation of the training data. The unclustered (K = 1) version of each language model was also evaluated. Some words, and some sequences of words such as &amp;quot;San Francisco&amp;quot;, were replaced by class names to improve performance.</Paragraph>
      <Paragraph position="1"> The improvement (if any) due to clustering was measured by using the various language models to make selections from N-best sentence hypothesis lists; this choice of test was made for convenience rather than out of any commitment to the N-best paradigm, and the techniques described here could equally well be used with other forms of speechlanguage interface.</Paragraph>
      <Paragraph position="2"> Specifically, each clustering was tested against 1,354 hypothesis lists output by a version of the DECIPHER (TM) speech recognizer (Murveit et al, 1993) that itself used a (rather simpler) bigram model. Where more then ten hypothesis were output for a sentence, only the top ten were considered. These 1,354 lists were the subset of two 1,000 sentence sets (the February and November 1992 ATIS evaluation sets) for which the reference sentence itself occurred in the top ten hypotheses. The clustered language model was used to select the most likely hypothesis from the list without paying any attention either to the score that DECIPHER assigned to each hypothesis on the basis of acoustic information or its own bigram model, or to the ordering of the list. In a real system, the DECIPHER scores would of course be taken into account, but they were ignored here in order to maximize the discriminatory power of the test in the presence of only a few thousand test utterances.</Paragraph>
      <Paragraph position="3"> To avoid penalizing longer hypotheses, the probabilities assigned to hypotheses were normalized by sentence length. The probability assigned by a cluster to an N-gram was taken to be the simple maximum likelihood (relative frequency) value where this was non-zero. When an N-gram in the test data had not been observed at all in the training sentences assigned to a given cluster, a &amp;quot;failure&amp;quot;, representing a vanishingly small probability, was assigned. A number of backoff schemes of various degrees of sophistication, including that of Katz (1987), were tried, but none produced any improvement in performance, and several actuMly worsened it.</Paragraph>
      <Paragraph position="4"> The average percentages of sentences correctly identified by clusterings for each condition were as given in Table 1. The maximum possible score was 100%; the baseline score, that expected from a random choice of a sentence from each list, was 11.4%.</Paragraph>
      <Paragraph position="5"> The unigram and bigram scores show a steady and, in fact, statistically significant 1 increase with the number of clusters. Using twenty clusters for bigrams (score 43.9%) in fact gi~ces more than half the advantage over unclustered bigrams that is given 1 Details of significance tests are omitted for space reasons. They are included in a longer version of this paper available on request from the author.</Paragraph>
      <Paragraph position="0"> by moving from unclustered bigrams to unclustered trigrams. However, clustering trigrams produces no improvement in score; in fact, it gives a small but statistically significant deterioration, presumably due to the increase in the number of parameters that need to be calculated.</Paragraph>
      <Paragraph position="1"> The random choice of a presentation order for the data meant that different clusterings were arrived at on each run for a given condition ((N, K) for N-grams and K clusters). There was some limited evidence that some clusterings for the same condition were significantly better than others, rather than just happening to perform better on the particular test data used. More trials would be needed to establish whether presentation order does in general make a genuine difference to the quality of a clustering. If there is one, however, it would appear to be fairly small compared to the improvements available (in the unigram and bigram cases) from increasing the numbers of clusters.</Paragraph>
    </Section>
    <Section position="3" start_page="62" end_page="62" type="sub_section">
      <SectionTitle>
4.2 Grammar Rule Experiment
</SectionTitle>
      <Paragraph position="0"> In the second experiment, each training sentence and each test sentence hypothesis was analysed by the Core Language Engine (Alshawi, 1992) trained on the ATIS domain (Agn~ et al, 1994). Unanalysable sentences were discarded, as were sentences of over 15 words in length (the ATIS adaptation had concentrated on sentences of 15 words or under, and analysis of longer sentences was less reliable and slower). When a sentence was analysed successfully, several semantic analyses were, in general, created, and a selection was made from among these on the basis of trained preference functions (Alshawi and Carter, 1994). For the purpose of the experiment, clustering and hypothesis selection were performed on the basis not of the words in a sentence but of the grammar rules used to construct its most preferred analysis.</Paragraph>
      <Paragraph position="1"> The simplest condition, hereafter referred to as &amp;quot;lrule&amp;quot;, was analogous to the unigram case for word-based evaluation. A sentence was modeled simply as a bag of rules, and no attempt (other than the clustering itself) was made to account for dependencies between rules.</Paragraph>
      <Paragraph position="2"> Another condition, henceforth &amp;quot;2-rule&amp;quot; because of its analogy to bigrams, was also tried. Here, each rule occurrence was represented not in isolation but in the context of the rule immediately above it in the parse tree. Other choices of context might have worked as well or better; our purpose here is simply to illustrate and assess ways in which explicit context modeling can be combined with clustering.</Paragraph>
      <Paragraph position="3"> The training corpus consisted of the 4,279 sentences in the 5,873-sentence set that were analysable and consisted of fifteen words or less. The test corpus consisted of 1,106 hypothesis lists, selected in the same way (on the basis of length and analysability of their reference sentences) from the 1,354 used in the first experiment. The &amp;quot;baseline&amp;quot; score for this test corpus, expected from a random choice of (analysable) hypothesis, was 23.2%. This was rather higher than the 11.4% for word-based selection because the hypothesis lists used were in general shorter, unanalysable hypotheses having been excluded.</Paragraph>
      <Paragraph position="4"> The average percentages of correct hypotheses (actual word strings, not just the rules used to represent them) selected by the 1-rule and 2-rule conditions were as given in Table 2.</Paragraph>
      <Paragraph position="5"> These results show that clustering gives a significant advantage for both the 1-rule and the 2-rule types of model, and that the more clusters are created, the larger the advantage is, at least up to K = 20 clusters. As with the N-gram experiment, there is weak evidence that some clusterings are genuinely better than others for the same condition.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>