1. OVERVIEW OF ME FRAMEWORK

Using several different probability estimates to arrive at one combined estimate is a general problem that arises in many tasks. The Maximum Entropy (ME) principle has recently been demonstrated as a powerful tool for combining statistical estimates from diverse sources [1, 2, 3]. The ME principle ([4, 5]) proposes the following:

1. Reformulate the different estimates as constraints on the expectations of various functions, to be satisfied by the target (combined) estimate.
2. Among all probability distributions that satisfy these constraints, choose the one that has the highest entropy.

More specifically, for estimating a probability function P(x), each constraint i is associated with a constraint function f_i(x) and a desired expectation c_i. The constraint is then written as:

    \sum_x P(x) f_i(x) = c_i                                              (1)

Given consistent constraints, a unique ME solution is guaranteed to exist, and to be of the form:

    P(x) = \prod_i \mu_i^{f_i(x)}                                         (2)

where the μ_i's are some unknown constants, to be found. Probability functions of the form (2) are called log-linear, and the family of functions defined by holding the f_i's fixed and varying the μ_i's is called an exponential family.

To search the family defined by (2) for the μ_i's that will make P(x) satisfy all the constraints, an iterative algorithm, "Generalized Iterative Scaling" (GIS), exists, which is guaranteed to converge to the solution ([6]), as long as the constraints are mutually consistent. GIS starts with arbitrary μ_i values. At each iteration, it computes the expectations E_P[f_i] over the training data, compares them to the desired values c_i, and then adjusts the μ_i's by an amount proportional to the ratio of the two.

Generalized Iterative Scaling can be used to find the ME estimate of a simple (non-conditional) probability distribution over some event space. An adaptation of GIS to conditional probabilities was proposed by [7], as follows. Let P(w|h) be the desired probability estimate, and let P̃(h,w) be the empirical distribution of the training data. Let f_i(h,w) be any constraint function, and let c_i be its desired expectation. Equation 1 is now modified to:

    \sum_{h,w} \tilde{P}(h) P(w|h) f_i(h,w) = c_i                         (3)

See also [1, 2].
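To make the GIS update concrete, the following is a minimal Python sketch of the iteration described above, for an unconditional distribution over a finite event space. The function and variable names are ours, not the paper's; it assumes binary constraint functions whose number of active features per event is a constant C (the standard GIS requirement), and it illustrates the scaling step rather than the implementation actually used in this work.

```python
import numpy as np

def gis(events, feature_fns, targets, iterations=100):
    """Generalized Iterative Scaling sketch for an unconditional model
    P(x) proportional to prod_i mu_i ** f_i(x)  (equation 2).

    events      : list of all events x in the (finite) event space
    feature_fns : list of binary constraint functions f_i(x) -> {0, 1}
    targets     : desired expectations c_i (e.g. empirical expectations)

    Assumes sum_i f_i(x) is the same constant C for every x, which is
    the usual GIS requirement (add a filler feature otherwise).
    """
    F = np.array([[f(x) for f in feature_fns] for x in events], dtype=float)
    C = F.sum(axis=1)[0]            # constant number of active features per event
    mu = np.ones(len(feature_fns))  # arbitrary starting values

    for _ in range(iterations):
        # current model: P(x) proportional to prod_i mu_i ** f_i(x)
        scores = np.prod(mu ** F, axis=1)
        P = scores / scores.sum()
        # model expectations E_P[f_i]
        expectations = P @ F
        # multiplicative update: ratio of desired to current expectation
        mu *= (np.array(targets) / expectations) ** (1.0 / C)
    return mu
```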
2. CAPTURING LONG-DISTANCE LINGUISTIC PHENOMENA

The ME framework is very general, freeing the modeler to concentrate on searching for significant information sources and choosing the phenomena to be modeled. In statistical language modeling, we are interested in information about the identity of the next word, w_i, given the history h, namely the part of the document that was already processed by the system. We have so far considered the following information sources, all contained within the history:

* Conventional N-grams: the immediately preceding few words, say (w_{i-2}, w_{i-1}).
* Long-distance N-grams [8]: N-grams preceding w_i by j positions.
* Triggers [9]: the appearance in the history of words related to w_i.
* Class triggers: trigger relations among word clusters.
* Count-based cache: the number of times w_i already occurred in the history.
* Distance-based cache: the last time w_i occurred in the history.
* Linguistically defined constraints: number agreement, tense agreement, etc.

Any potential source can be considered separately, and the amount of information in it estimated. For example, in estimating the potential of count-based caches, we might measure dependencies of the form depicted in figure 1, and calculate the amount of information they may provide. See also [3].

[Figure 1: P('DEFAULT') as a function of the number of times it already occurred in the document (0, 1, 2, 3, 4, 5+); the horizontal line is the unconditional probability.]

Perhaps the most important feature of the Maximum Entropy framework is its extreme generality. For any conceivable linguistic or statistical phenomenon, appropriate constraint functions can readily be written. We will demonstrate this process for several of the knowledge sources listed above.

2.1. Formulating N-grams as Constraints

The usual unigram, bigram and trigram Maximum Likelihood estimates can be replaced by unigram, bigram and trigram constraints conveying the same information. Specifically, the constraint function for the unigram w_1 is:

    f_{w_1}(h, w) = 1 if w = w_1, and 0 otherwise,                        (4)

and its associated constraint is:

    \sum_{h,w} \tilde{P}(h) P(w|h) f_{w_1}(h, w) = c_{w_1}.               (5)

Similarly, the constraint function for the bigram w_1, w_2 is

    f_{w_1,w_2}(h, w) = 1 if h ends in w_1 and w = w_2, and 0 otherwise,  (6)

and its associated constraint is

    \sum_{h,w} \tilde{P}(h) P(w|h) f_{w_1,w_2}(h, w) = c_{w_1,w_2},       (7)

and similarly for higher-order N-grams.

2.2. Formulating long-distance N-grams as Constraints

The constraint functions for long-distance N-grams are very similar to those for conventional (distance-1) N-grams. For example, the constraint function for the distance-2 trigram {w_1, w_2, w_3} is:

    f_{w_1,w_2,w_3}(h, w) = 1 if h ends in {w_1, w_2, w*} for some w* and w = w_3, and 0 otherwise,   (8)

and its associated constraint is

    \sum_{h,w} \tilde{P}(h) P(w|h) f_{w_1,w_2,w_3}(h, w) = c_{w_1,w_2,w_3},   (9)

and similarly for other long-distance N-grams.

2.3. Formulating Triggers as Constraints

For class triggers, let A, B be two related word clusters. Define the constraint function f_{A→B} as:

    f_{A→B}(h, w) = 1 if there exists w_j ∈ A such that w_j ∈ h and w ∈ B, and 0 otherwise.   (10)

Set c_{A→B} to Ẽ[f_{A→B}], the empirical expectation of f_{A→B} (i.e., its expectation in the training data). Now the constraint on P(w|h) is:

    \sum_{h,w} \tilde{P}(h) P(w|h) f_{A→B}(h, w) = c_{A→B}.               (11)
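As an illustration of how such constraint functions translate into code, here is a small Python sketch of the unigram, bigram, distance-2 trigram and class-trigger functions defined above, together with the empirical expectation used as the desired value c_i. The helper names are ours; histories are assumed to be sequences of word tokens. Functions built this way could be fed directly to a GIS routine like the sketch in section 1.

```python
def f_unigram(w1):
    """Constraint function for the unigram w1 (equation 4)."""
    return lambda h, w: 1 if w == w1 else 0

def f_bigram(w1, w2):
    """Constraint function for the bigram (w1, w2) (equation 6)."""
    return lambda h, w: 1 if len(h) >= 1 and h[-1] == w1 and w == w2 else 0

def f_distance2_trigram(w1, w2, w3):
    """Constraint function for the distance-2 trigram {w1, w2, w3} (equation 8):
    fires when the history ends in (w1, w2, w*) for some w* and the next word is w3."""
    return lambda h, w: 1 if len(h) >= 3 and h[-3] == w1 and h[-2] == w2 and w == w3 else 0

def f_class_trigger(A, B):
    """Constraint function f_{A->B} for class triggers (equation 10):
    fires when some word of cluster A appears in the history and the next word is in B."""
    A, B = set(A), set(B)
    return lambda h, w: 1 if w in B and any(wj in A for wj in h) else 0

def empirical_expectation(f, corpus_pairs):
    """Desired value c_i: the expectation of f under the empirical
    distribution of (history, word) pairs in the training data."""
    return sum(f(h, w) for h, w in corpus_pairs) / len(corpus_pairs)
```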
3. SELECTIVE UNIGRAM CACHE

In a document-based unigram cache, all words that occurred in the history of the document are stored, and are used to dynamically generate a unigram, which is in turn combined with other language model components. N-gram caches were first reported by [10].

The motivation behind a unigram cache is that, once a word occurs in a document, its probability of re-occurring is typically greatly elevated. But the extent of this phenomenon depends on the prior frequency of the word, and is most pronounced for rare words. The occurrence of a common word like "THE" provides little new information. Put another way, the occurrence of a rare word is more surprising, and hence provides more information, whereas the occurrence of a more common word deviates less from the expectations of the static model, and therefore requires a smaller modification to it. Bayesian analysis may be used to optimally combine the prior of a word with the new evidence provided by its occurrence.

As a rough first approximation, we implemented a selective unigram cache, in which only rare words are stored in the cache. A word is defined as rare relative to a threshold of static unigram frequency. The exact value of the threshold was determined by optimizing perplexity on unseen data. This scheme proved more useful for perplexity reduction than the conventional cache.

4. CONDITIONAL BIGRAM AND TRIGRAM CACHES

In a document-based bigram cache, all consecutive word pairs that occurred in the history of the document are stored, and are used to dynamically generate a bigram, which is in turn combined with other language model components. A trigram cache is similar but is based on all consecutive word triples.

An alternative way of viewing a bigram cache is as a set of unigram caches, one for each word in the history. At most one such unigram is consulted at any one time, depending on the identity of the last word of the history. Viewed this way, it is clear that the bigram cache should contribute to the combined model only if the last word of the history is a (non-selective) unigram "cache hit". In all other cases, the uniform distribution of the bigram cache would only serve to flatten, and hence degrade, the combined estimate. We therefore chose to use a conditional bigram cache, which has a non-zero weight only during such a "hit".

A similar argument can be applied to the trigram cache. Such a cache should only be consulted if the last two words of the history occurred before, i.e. the trigram cache should contribute only immediately following a bigram cache hit. We experimented with such a trigram cache, constructed similarly to the conditional bigram cache. However, we found that it contributed little to perplexity reduction. This is to be expected: every bigram cache hit is also a unigram cache hit. Therefore, the trigram cache can only refine the distinctions already provided by the bigram cache. A document's history is typically small (225 words on average in the WSJ corpus). For such a modest cache, the refinement provided by the trigram is small and statistically unreliable.

Another way of viewing the selective bigram and trigram caches is as regular (i.e. non-selective) caches, which are later interpolated using weights that depend on the count of their context. Zero context counts then force the respective weights to zero.
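The two cache components described in sections 3 and 4 can be sketched as follows. The class names, interfaces and per-word update discipline are our own illustration; only the rarity threshold idea (0.001 in the system described later) and the hit-gating behaviour of the bigram cache come from the text.

```python
from collections import Counter, defaultdict

class SelectiveUnigramCache:
    """Caches only 'rare' words, i.e. words whose static unigram
    frequency falls below a threshold."""
    def __init__(self, static_unigram, threshold=0.001):
        self.static_unigram = static_unigram   # dict: word -> static unigram frequency
        self.threshold = threshold
        self.counts = Counter()

    def update(self, word):
        # store the word only if it is rare under the static model
        if self.static_unigram.get(word, 0.0) < self.threshold:
            self.counts[word] += 1

    def prob(self, word):
        total = sum(self.counts.values())
        return self.counts[word] / total if total else 0.0

class ConditionalBigramCache:
    """Stores all consecutive word pairs seen in the document, but is
    consulted (given a non-zero weight) only when the last word of the
    history is itself a cache hit."""
    def __init__(self):
        self.pairs = defaultdict(Counter)
        self.prev = None

    def update(self, word):
        if self.prev is not None:
            self.pairs[self.prev][word] += 1
        self.prev = word

    def is_hit(self, last_word):
        return last_word in self.pairs

    def prob(self, last_word, word):
        followers = self.pairs[last_word]
        total = sum(followers.values())
        return followers[word] / total if total else 0.0
```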
5. THE WSJ SYSTEM

As a testbed for the above ideas, we used ARPA's CSR task. The training data was 38 million words of Wall Street Journal (WSJ) text from 1987-1989. The vocabulary used was ARPA's official "20o.nvp" vocabulary (the 20,000 most common WSJ words, non-verbalized punctuation).

To measure the impact of the amount of training data on language model adaptation, we experimented with systems based on varying amounts of training data. The largest model we built was based on the entire 38M words of WSJ training data, and is described below.

5.1. The Component Models

The adaptive language model was based on four component language models:

1. A conventional "compact" backoff trigram model. "Compact" here means that singleton trigrams (word triplets that occurred only once in the training data) were excluded from the model. It consisted of 3.2 million trigrams and 3.5 million bigrams. This model also served as the baseline for comparisons, and was dubbed "the static model".

2. A Maximum Entropy model trained on the same data as the trigram, and consisting of the following knowledge sources (a sketch of the cutoff-based selection follows this list):

   * High-cutoff, distance-1 (conventional) N-grams:
     - All trigrams that occurred 9 or more times in the training data (428,000 in all).
     - All bigrams that occurred 9 or more times in the training data (327,000).
     - All unigrams.
     The high cutoffs were necessary in order to reduce the heavy computational requirements of the training procedure.

   * High-cutoff, distance-2 bigrams and trigrams:
     - All distance-2 trigrams that occurred 5 or more times in the training data (795,000 in all).
     - All distance-2 bigrams that occurred 5 or more times in the training data (651,000).
     The cutoffs used for the conventional N-grams were higher than those applied to the distance-2 N-grams. This was done because we expected that the information lost from the former knowledge source would be re-introduced, at least partially, by interpolation with the static model.

   * Word trigger pairs: For every word in the vocabulary, the top 3 triggers were selected based on their mutual information with that word as computed from the training data [1, 2]. This resulted in some 43,000 word trigger pairs.

3. A selective unigram cache, as described earlier, using a unigram threshold of 0.001.

4. A conditional bigram cache, as described earlier.
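The cutoff-based selection of N-gram constraints mentioned in the list above can be sketched as follows. The function name and return structure are illustrative; only the cutoff values (9 for conventional bigrams and trigrams, 5 for the distance-2 variants) and the definition of distance-2 N-grams (one position is skipped before the predicted word) come from the text.

```python
from collections import Counter

def select_ngram_constraints(corpus, trigram_cutoff=9, bigram_cutoff=9,
                             d2_trigram_cutoff=5, d2_bigram_cutoff=5):
    """Count conventional (distance-1) and distance-2 N-grams in a corpus
    (a list of word tokens) and keep only those above the given cutoffs.
    Returns the surviving N-gram sets, to be turned into ME constraints."""
    bigrams, trigrams = Counter(), Counter()
    d2_bigrams, d2_trigrams = Counter(), Counter()

    for i, w in enumerate(corpus):
        if i >= 1:
            bigrams[(corpus[i - 1], w)] += 1
        if i >= 2:
            trigrams[(corpus[i - 2], corpus[i - 1], w)] += 1
            d2_bigrams[(corpus[i - 2], w)] += 1               # one position skipped
        if i >= 3:
            d2_trigrams[(corpus[i - 3], corpus[i - 2], w)] += 1

    keep = lambda counts, cutoff: {g for g, c in counts.items() if c >= cutoff}
    return {
        "unigrams": set(corpus),                              # all unigrams are kept
        "bigrams": keep(bigrams, bigram_cutoff),
        "trigrams": keep(trigrams, trigram_cutoff),
        "distance2_bigrams": keep(d2_bigrams, d2_bigram_cutoff),
        "distance2_trigrams": keep(d2_trigrams, d2_trigram_cutoff),
    }
```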
5.2. Combining the LM Components

The combined model was achieved by consulting an appropriate subset of the above four models. At any one time, the four component LMs were combined linearly. But the weights used were not fixed, nor did they follow a linear pattern over time.

Since the Maximum Entropy model incorporated information from trigger pairs, its relative weight should be increased with the length of the history. But since it also incorporated new information from distance-2 N-grams, it is useful even at the very beginning of a document, and its weight should not start at zero. We therefore started the Maximum Entropy model with a weight of ~0.3, which was gradually increased over the first 60 words of the document to ~0.7. The conventional trigram started with a weight of ~0.7, and was decreased concurrently to ~0.3. The conditional bigram cache had a non-zero weight only during a cache hit, which allowed for a relatively high weight of ~0.09. The selective unigram cache had a weight proportional to the size of the cache, saturating at ~0.05. The weights were always normalized to sum to 1.

While the general weighting scheme was chosen based on the considerations discussed above, the specific values of the weights were chosen by minimizing perplexity of unseen data. It became clear later that this did not always correspond with minimizing error rate. Subsequently, further weight modifications were determined by direct trial-and-error measurements of word error rate on development data.
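A minimal sketch of this weighting scheme follows. The ramp endpoints, the 60-word window, the ~0.09 bigram-cache weight and the ~0.05 saturation come from the text; the linear ramp shape, the cache-growth rate, the final renormalization of all four weights, and every interface and name are our own assumptions.

```python
def component_weights(history_len, bigram_cache_hit, unigram_cache_size,
                      ramp_words=60):
    """Return interpolation weights for (trigram, max-entropy,
    unigram cache, bigram cache) as a function of document position."""
    t = min(history_len, ramp_words) / ramp_words
    w_me = 0.3 + 0.4 * t                             # ~0.3 rising to ~0.7
    w_tri = 0.7 - 0.4 * t                            # ~0.7 falling to ~0.3
    w_uni = min(0.05, 0.001 * unigram_cache_size)    # grows with the cache, saturates (rate assumed)
    w_bi = 0.09 if bigram_cache_hit else 0.0         # non-zero only on a cache hit
    total = w_tri + w_me + w_uni + w_bi
    return [w / total for w in (w_tri, w_me, w_uni, w_bi)]  # normalized to sum to 1

def combined_prob(word, history, models, **kwargs):
    """Linear interpolation of the four component estimates; 'models' is
    assumed to be ordered (trigram, ME, unigram cache, bigram cache) and to
    expose a prob(word, history) method."""
    weights = component_weights(len(history), **kwargs)
    return sum(lam * m.prob(word, history) for lam, m in zip(weights, models))
```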
5.3. Varying the Training Data

As mentioned before, we also experimented with systems based on less training data. We built two such systems, one based on 5 million words and the other on 1 million words. Both systems were identical to the larger system described above, except that the Maximum Entropy model did not employ high cutoffs, but was instead based on the same N-gram information as the conventional trigram model.

5.4. Computational Costs

The computational bottleneck of the Generalized Iterative Scaling algorithm is in constraints which, for typical histories h, are non-zero for a large number of words w. This means that bigram constraints are more expensive than trigram constraints. Implicit computation can be used for unigram constraints. Therefore, the time cost of bigram and trigger constraints dominated the total time cost of the algorithm.

The computational burden of training the Maximum Entropy model for the large system (38MW) was quite severe. Fortunately, the training procedure is highly parallelizable (see [1]). Training was run in parallel on 10-25 high-performance workstations, with an average of perhaps 15 machines. Even so, it took 3 weeks to complete. In comparison, training the 5MW system took only a few machine-days, and training the 1MW system was trivial.

5.5. Perplexity Reduction

We used 325,000 words of unseen WSJ data to measure perplexities of the baseline trigram model, the Maximum Entropy component, and the interpolated adaptive model (the latter consisting of the first two together with the unigram and bigram caches). This was done for each of the three systems (38MW, 5MW and 1MW). Results are summarized in Table 1.

[Table 1: perplexity improvement of the Maximum Entropy and interpolated adaptive models over a conventional trigram model, for varying amounts of training data (1M, 5M and 38M words). The 38MW ME model used far fewer parameters than the baseline, since it employed high N-gram cutoffs. See text.]

As can be observed, the Maximum Entropy model, even when used alone, was significantly better than the static model. Its relative advantage seems greater with more training data. With the large (38MW) system, practical considerations required imposing high cutoffs on the ME model, and yet its perplexity is still significantly better than that of the baseline. This is particularly notable because the ME model uses only one third the number of parameters used by the trigram model (2.26M vs. 6.72M).

When the Maximum Entropy model is supplemented with the other three components, perplexity is again reduced significantly. Here the relationship with the amount of training data is reversed: the less training data, the greater the improvement. This effect is due to the caches, and can be explained as follows. The amount of information provided by the caches is independent of the amount of training data, and is therefore fixed across the three systems. However, the 1MW system has higher perplexity, and therefore the relative improvement provided by the caches is greater. Put another way, models based on more data are stronger, and therefore harder to improve on.

5.6. Error Rate Reduction

To evaluate error rate reduction, we used the Nov93 ARPA S1 evaluation set [11, 12, 13]. It consisted of 424 utterances produced in the context of complete long documents by two male and two female speakers. We used the SPHINX-II recognizer ([14, 15, 16]) with sex-dependent non-PD 10K senone acoustic models. In addition to the 20K words in the lexicon, 178 OOV words and their correct phonetic transcriptions were added in order to create closed-vocabulary conditions. We first ran the forward and backward passes of SPHINX-II to create word lattices, which were then used by three independent A* passes. The first such pass used the 38MW static trigram language model. The other two passes used the 38MW interpolated adaptive LM. The first of these two adaptive runs was for unsupervised word-by-word adaptation, in which the decoder output was used to update the language model. The other run used supervised adaptation, in which the decoder output was used for within-sentence adaptation, while the correct sentence transcription was used for across-sentence adaptation. Results are summarized in Table 2.

Table 2: Word error rate reduction of the adaptive models over a conventional trigram model.

    language model               word error rate    % reduction
    static trigram (baseline)         19.9%              --
    unsupervised adaptation           17.8%              10%
    supervised adaptation             17.0%              14%
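The two adaptive runs in section 5.6 differ only in which word stream is allowed to update the language model across sentence boundaries. A rough sketch of that control flow is given below; the decoder and language-model interfaces are entirely hypothetical and stand in for whatever the recognizer actually exposes.

```python
def adapt_over_document(sentences, decoder, adaptive_lm, supervised=False):
    """Run the adaptive LM over one document, sentence by sentence.

    Unsupervised mode: the decoder's own output updates the language model.
    Supervised mode: the decoder output still drives within-sentence
    adaptation, but the correct transcription updates the model across
    sentence boundaries.
    """
    hypotheses = []
    for reference in sentences:
        # within-sentence adaptation always uses the decoder's running output
        hypothesis = decoder.decode(adaptive_lm)
        hypotheses.append(hypothesis)

        # across-sentence adaptation: choose which word stream updates the caches
        update_source = reference if supervised else hypothesis
        for word in update_source:
            adaptive_lm.update_caches(word)
    return hypotheses
```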
6. THREE PARADIGMS OF ADAPTATION

The adaptation we have concentrated on so far is the kind we call within-domain adaptation. In this paradigm, a heterogeneous language source (such as WSJ) is treated as a complex product of multiple domains-of-discourse ("sublanguages"). The goal is then to produce a continuously modified model that tracks sublanguage mixtures, sublanguage shifts, style shifts, etc.

In contrast, a cross-domain adaptation paradigm is one in which the test data comes from a source to which the language model has never been exposed. The most salient aspect of this case is the large number of out-of-vocabulary words, as well as the high proportion of new bigrams and trigrams.

Cross-domain adaptation is most important in cases where no data from the test domain is available for training the system. But in practice this rarely happens. More likely, a limited amount of LM training data can be obtained. Thus a hybrid paradigm, the limited-data domain, might be the most important one for real-world applications.

The main disadvantage of the Maximum Entropy framework is the computational requirements of training the ME model. But these are not severe for modest amounts of training data (up to, say, 5M words, with current CPUs). The approach is thus particularly attractive in limited-data domains.

7. THE AP WIRE EXPERIMENT

We have already seen the effect of the amount of training data on perplexity reduction in the WSJ system. To test our adaptation mechanisms under both the cross-domain and limited-data paradigms, we constructed another experiment, this time using AP wire data for testing.

For measuring cross-domain adaptation, we used the 38MW WSJ models described above. For measuring limited-data adaptation, we used 5M words of AP wire to train a conventional compact backoff trigram and a Maximum Entropy model, similar to the ones used by the WSJ system, except that the trigger-pair list was copied from the WSJ system.

All models were tested on 420,000 words of unseen AP data. We chose the same "20o.nvp" vocabulary used in the WSJ experiments, to facilitate cross comparisons. As before, we measured perplexities of the baseline trigram model, the Maximum Entropy component, and the interpolated adaptive model. Results are summarized in Table 3.

To test error rate reduction under the cross-domain adaptation paradigm, we used 206 sentences, recorded by 3 male and 3 female speakers, under the same system configuration described in section 5.6. Results are reported in Table 4.