<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1080"> <Title>Applying SPHINX-II to the DARPA Wall Street Journal CSR Task</Title> <Section position="2" start_page="0" end_page="395" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Extending a continuous speech recognition system to a larger vocabulary and more general task domain requires more than a new dictionary and language model. The primary problem in the application of the SPHINX-II \[1\] system to the Wall Street Journal (WSJ) CSR task was to extend the Viterbi beam-search used in the SPHINX \[2\] system to be able to run experiments given the constraints of available processing and memory resources.</Paragraph> <Paragraph position="1"> First, we developed a practical form of between-word co-articulation modelng that was both time and memory efficient The use of left context dependent between-word triphones is a departure from the left and fight between-word context modeling but it allows the system to retain partial between-word co-articulation modelng despite the size and complexity of the task. Second, we significantly reduced the size of the memory required. To reduce the memory requirements of our search component it was necessary to change the Viterbi evaluation to use an in- null contained in this document are those of the authors and should not be interpreted as representing official polieie~, either expxessed or implied, of the Defense Advanced Research Pr'ojeets Agency or of the United States Government.</Paragraph> <Paragraph position="2"> place algorithm instead of a non-in-place one. Additionally we replaced the stack data structure used to recover the word sequence from the search, with a dictionary data structure. We decoupled the proto-type HMM state transition probabilities from the word specific HMM instances to avoid duplicating memory. We also found that our pointerless implementation of the HMM topology saved us both memory and time. Finally, we improved decoding efficiency substantially. One way to improve decoder efficiency is to reduce the search space. SPHINX-II reduces the search space with three pruning thresholds that are applied at the state, model, and word levels. In addition, evaluating a state requires an acoustic score computation and a graph update operation. Both of these operations run in constant time over one state. For discrete models, the cost of computing the acoustic score was on a par with the graph update operation since the acoustic score was computed by table lookup. With the introduction of semi-continuous models the cost of computing the acoustic score in the straight forward implementation is as much as an order of magnitude greater than the discrete model. This increase directly effects the overall time required by the search. To address this problem we decomposed the search into four phases. Shared distribution probability computation, HMM arc evaluation, active HMM instance evaluation and language model application. The shared distribution probability computation and HMM arc evaluation allow us to share computations that potentially would be repeated many times. Lastly, the introduction of full backoff language models made the previous approach of precomputing the entire table of non-zero arc probabilities impractical. For the SPHINX-II CSR decoder we use a cache table of active states in the language model to reduce the cost of accessing the language model.</Paragraph> <Paragraph position="3"> 2. 
<Paragraph position="3"> 2. Review of the SPHINX-II System In comparison with the SPHINX system \[2\], the SPHINX-II system \[1\] has reduced the word error rate by more than 50% on most tasks by incorporating between-word coarticulation modeling \[3\], high-order dynamics \[4\], sex-dependent semi-continuous hidden Markov models \[4\], and shared-distribution models \[5\]. This section reviews SPHINX-II, which will be used as the baseline acoustic modeling system for this study.</Paragraph> <Section position="1" start_page="393" end_page="393" type="sub_section"> <SectionTitle> 2.1 Signal Processing </SectionTitle> <Paragraph position="0"> The input speech signal is sampled at 16 kHz and preemphasized with the filter 1 - 0.9z^-1. A Hamming window with a width of 20 msec is applied to the speech signal every 10 msec. A 32nd-order LPC analysis is used to compute the 12th-order cepstral coefficients. A bilinear transformation of the cepstral coefficients is employed to approximate the mel-scale representation. In addition, relative power is also computed together with the cepstral coefficients. The speech features used in SPHINX-II include LPC cepstral coefficients; 40-msec and 80-msec differenced LPC cepstral coefficients; second-order differenced cepstral coefficients; and power, 40-msec differenced power, and second-order differenced power. These features are vector quantized into four independent codebooks by the Linde-Buzo-Gray algorithm \[6\], each of which has 256 entries.</Paragraph> </Section> <Section position="2" start_page="393" end_page="393" type="sub_section"> <SectionTitle> 2.2 Training </SectionTitle> <Paragraph position="0"> Training procedures are based on the forward-backward algorithm. Word models are formed by concatenating phonetic models; sentence models are formed by concatenating word models. There are two stages of training. The first stage is to generate the shared-distribution mapping table. Forty-eight context-independent discrete phonetic models are initially estimated from the uniform distribution. Deleted interpolation \[7\] is used to smooth the estimated parameters with the uniform distribution. Then context-dependent models are estimated based on the context-independent ones. There are 16,713 triphones in the DARPA WSJ-CSR training corpus when both within-word and left-context-dependent between-word triphones are considered. To simplify training, one-codebook discrete models were used, where the acoustic features consist of the cepstral coefficients, 40-msec differenced cepstrum, power, and 40-msec differenced power. After the 16,713 discrete models are obtained, the shared-distribution clustering procedure \[5\] is applied to create the senones, 6,255 in the case of the WSJ-CSR task. The second stage is to train 4-codebook models. We first estimate 51 context-independent, four-codebook discrete models with the uniform distribution. With these context-independent models and the senone table, we then estimate the shared-distribution SCHMMs. Because of the substantial differences between male and female speakers, two sets of sex-dependent SCHMMs are separately trained to enhance performance.</Paragraph> <Paragraph position="1"> To summarize, the configuration of SPHINX-II for the WSJ-CSR task is as described above.</Paragraph> </Section>
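To make the front-end parameters above concrete, here is a minimal sketch of the framing and differencing steps in Python with numpy (the paper does not specify an implementation language). The function names are our own, and the LPC analysis and bilinear mel warping are reduced to a placeholder cepstrum computation; only the sample rate, window length, frame shift, and difference offsets are taken from the text.

# Minimal front-end sketch (not the SPHINX-II implementation): pre-emphasis,
# 20-msec Hamming windows every 10 msec, and 40/80-msec differenced features.
import numpy as np

SAMPLE_RATE = 16000
FRAME_LEN = int(0.020 * SAMPLE_RATE)    # 20-msec analysis window
FRAME_SHIFT = int(0.010 * SAMPLE_RATE)  # 10-msec frame shift

def preemphasize(signal, alpha=0.9):
    """Apply the pre-emphasis filter 1 - alpha * z^-1."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frames(signal):
    """Slice the signal into overlapping Hamming-windowed frames."""
    window = np.hamming(FRAME_LEN)
    n = 1 + (len(signal) - FRAME_LEN) // FRAME_SHIFT
    return np.stack([signal[i * FRAME_SHIFT:i * FRAME_SHIFT + FRAME_LEN] * window
                     for i in range(n)])

def cepstra(frame, order=12):
    """Placeholder for the 32nd-order LPC analysis, cepstral recursion, and
    bilinear transform; a real front end would return 12 LPC-derived,
    mel-warped cepstra here."""
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-10
    return np.fft.irfft(np.log(spectrum))[1:order + 1]

def differenced(feats, offset_msec=40):
    """Differenced features c[t+k] - c[t-k], with k frames ~ offset_msec.
    `feats` is a 2-D array of per-frame feature vectors."""
    k = offset_msec // 10
    padded = np.pad(feats, ((k, k), (0, 0)), mode="edge")
    return padded[2 * k:] - padded[:-2 * k]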
<Section position="3" start_page="393" end_page="393" type="sub_section"> <SectionTitle> 2.3 Recognition </SectionTitle> <Paragraph position="0"> For each input utterance, the sex of the speaker is first determined automatically \[8, 9\]. After the sex is determined, only the models of the determined sex are activated during recognition. This saves both time and memory. For each input utterance, a Viterbi beam search is used to determine the optimal state sequence in the language network.</Paragraph> <Paragraph position="1"> 3. New Techniques for CSR Decoding</Paragraph> </Section> <Section position="4" start_page="393" end_page="394" type="sub_section"> <SectionTitle> 3.1 Left Context Dependent Cross-Word Models </SectionTitle> <Paragraph position="0"> Using context-dependent acoustic models across word boundaries presents two problems: the first is training the models, and the second is using them in a decoder. The training problem is a relatively simple one. Since we are using a supervised training procedure, it is simply a matter of transcribing the acoustic sequence to account for the cross-word phonetic context. An additional complication is introduced when optional silences can appear between words, but this is also relatively easy to deal with by adding the appropriate optional phonetic sequences. One question that does arise is whether context-dependent models for word beginnings, word endings, and word middles should be considered separately. In SPHINX-II they are kept separate \[10\].</Paragraph> <Paragraph position="1"> The decoding problem is difficult because, instead of a single word sequence, there are many alternative word sequences to consider. Consider the extension of a single word sequence W1..n. Each possible one-word extension of W gives rise to a particular phonetic right context at the end of wn. There may be as many as N of these, where N is the number of basic phonetic units in the system. A similar problem appears when considering the best word sequence prior to a word wn+1: each possible prior word wn gives rise to a particular phonetic left context for the start of wn+1. The final case to consider is a word that is exactly one phonetic unit in length; here the number of possibilities to consider is of order N^2. Nonetheless, for small tasks (< 1000 words) with artificial grammars, it is possible to precompile only the relevant phonetic transitions, since not all possible transitions will be allowed by the artificial grammar. When a larger and more natural task such as WSJ CSR is considered, these techniques are not applicable because of memory and run time constraints.</Paragraph> <Paragraph position="3"> [Figure 1 caption, partially recovered: ...put distributions, lc_i, depend only on the name of the model. In the multiplexed Bakis model each lc_i is a function of the model name and the word sequence history, hist_i.]</Paragraph> <Paragraph position="4"> We made two important modifications in the application of cross-word context-dependent phonetic models. The first was to model only the left context at word beginnings and to ignore the right context at word endings. The second was to use the word-sequence-history information in each state to select the appropriate left context model for that state (see Figure 1). An advantage afforded by left-context-only modeling is that on each inter-word transition only one context is considered, since the left context is uniquely determined by the word history W1..n. If the right context is modeled, all possible right contexts must be considered at word endings, since the future is not yet known. The advantages afforded by using the best word sequence to select the appropriate left context model come in both space and time savings. Space is saved since only one model is needed at word beginnings rather than N. Time is saved since only one model is evaluated at word beginnings.</Paragraph> </Section>
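As an illustration of the left-context-only, history-multiplexed scheme described above, the sketch below (a simplification in Python, not the SPHINX-II code; the data structures and names are invented) selects the word-initial model from the final phone of the best predecessor word, falling back to a context-independent model when no such triphone was trained.

# Sketch: selecting a left-context-dependent model for a word-initial phone.
# `triphone_models` maps (base_phone, left_context) -> model; names are invented.
def select_word_initial_model(base_phone, word_history, lexicon,
                              triphone_models, ci_models):
    """Pick the model for a word-initial phone given the decoded history.

    Because only the left context is modeled, the single best predecessor
    word in `word_history` determines the context; no fan-out over all N
    possible contexts is needed at this word boundary.
    """
    if word_history:                              # last decoded word, if any
        prev_word = word_history[-1]
        left_context = lexicon[prev_word][-1]     # its final phone
        model = triphone_models.get((base_phone, left_context))
        if model is not None:
            return model
    return ci_models[base_phone]                  # context-independent fallback

# Example usage with toy data structures:
lexicon = {"NEW": ["N", "UW"], "YORK": ["Y", "AO", "R", "K"]}
ci_models = {"Y": "CI(Y)"}
triphone_models = {("Y", "UW"): "Y/left=UW"}
print(select_word_initial_model("Y", ["NEW"], lexicon, triphone_models, ci_models))
# -> "Y/left=UW"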
<Section position="5" start_page="394" end_page="394" type="sub_section"> <SectionTitle> 3.2 Memory Organization </SectionTitle> <Paragraph position="0"> The WSJ-CSR task is significantly different from previous CSR tasks in the size of the lexicon and in the style of the language model. The lexicon is nearly an order of magnitude larger than previous lexicons, and the language model contains more than two orders of magnitude more transitions than the Resource Management task. Several changes were required in the decoder design so that it could run without paging to secondary storage because of limited memory. Our redesign entailed changing the Viterbi evaluation to use an in-place algorithm, changing the management of history pointers to use a hash table rather than a stack, decoupling the prototype HMM state transition probabilities from the word-specific HMM instances, and changing from a statically compiled language model to a dynamically interpreted language model. Finally, the pointerless implementation of the HMM topology continued to save both memory and time.</Paragraph> <Paragraph position="1"> In-Place Viterbi Evaluation. In our previous decoder the Viterbi evaluation used a separate set of source and destination states. The advantage of this approach is that states may be updated without regard to order. The disadvantage is that two sets of fields must be kept for each state. By changing to an in-place evaluation only one set of fields is needed. Another feature of the previous decoder was that a word HMM was instantiated by making a copy of the appropriate HMMs and concatenating them together. As a result, duplicate copies of the arc transition probabilities were made for each occurrence of an HMM in a word. To save this space, a pointer to the prototype HMM is kept in the instance HMM and the arc transition probabilities are omitted.</Paragraph> <Paragraph position="2"> The pointerless topology is a feature of the previous decoder \[11\] that implicitly encodes the topology of the model in the evaluation procedure. Not only does this save the memory and time associated with pointer following, but it also allows, at no additional cost, the order-dependent evaluation required by the in-place Viterbi evaluation.</Paragraph> <Paragraph position="3"> Taken together, these changes reduced the per-state memory cost from 28 bytes/state to 8 bytes/state.</Paragraph> <Paragraph position="4"> History Pointers and Language Model. By using a dictionary data structure instead of a stack data structure we reduced the amount of memory devoted to the word history sequences by an order of magnitude. The reduction comes because the dictionary does not differentiate identical word histories with differing segmentations. Besides the memory savings, an advantage of this approach is that word histories can be rapidly compared for equality. A disadvantage is that the true segmentation cannot be recovered using this data structure. Finally, a consequence of using a fully backed-off language model is that it was no longer practical to precompile a graph that encoded all the language model transitions. Instead the language model is dynamically interpreted at run time.</Paragraph> </Section>
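The in-place evaluation and the shared prototype transition probabilities can be illustrated with a short sketch. The Python below is our own illustration under stated assumptions, not the SPHINX-II code: it assumes a strictly left-to-right topology with self-loop, forward, and skip arcs, and the ProtoHMM field names are invented.

# Sketch of an in-place Viterbi update for one HMM instance with a strictly
# left-to-right (Bakis) topology.  `scores` holds log path scores and is
# overwritten frame by frame, so only one set of state fields is kept; arc
# log probabilities live in the shared prototype, not in the instance.
import math
from collections import namedtuple

ProtoHMM = namedtuple("ProtoHMM", ["self_loop", "forward", "skip"])  # invented names

def inplace_viterbi_step(scores, proto, emission_logprobs):
    """Update `scores` in place for one input frame.

    States are visited from last to first, so scores[j-1] and scores[j-2]
    still hold the previous frame's values when state j is updated; this
    visiting order is what makes the in-place evaluation correct for a
    topology whose arcs never go from a higher-index state to a lower one.
    """
    n = len(scores)
    for j in range(n - 1, -1, -1):
        best = scores[j] + proto.self_loop[j]
        if j >= 1:
            best = max(best, scores[j - 1] + proto.forward[j])
        if j >= 2:
            best = max(best, scores[j - 2] + proto.skip[j])
        scores[j] = best + emission_logprobs[j] if best > -math.inf else -math.inf

# Example: a 3-state instance sharing one prototype's arc probabilities.
proto = ProtoHMM(self_loop=[-0.7, -0.7, -0.7],
                 forward=[None, -0.7, -0.7],
                 skip=[None, None, -2.3])
scores = [0.0, -math.inf, -math.inf]
inplace_viterbi_step(scores, proto, emission_logprobs=[-1.0, -1.2, -3.0])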
<Section position="6" start_page="394" end_page="395" type="sub_section"> <SectionTitle> 3.3 Search Reduction </SectionTitle> <Paragraph position="0"> Viterbi beam search depends on the underlying dynamic programming algorithm, which restricts the number of states to be evaluated to |S|, where S is the set of Markov states. For the bigram language model, |S| is a linear function of |W|, the size of the lexicon. Therefore the time to decode an utterance is O(|S| * I), where I is the length of the input. The problem, at least when bigram language models are used, is not to develop a more efficient algorithm but to develop strategies for reducing the size of S. Beam search does this by considering only those states that fall within some beam. The beam is defined to be all those states s where score(s) is within α of best_score(S). In the WSJ-CSR task the size of S has increased by almost an order of magnitude. With this motivation a refinement of the beam search strategy was developed that reduces the number of states kept in the beam by a factor of two.</Paragraph> <Paragraph position="1"> In the previous implementation of the decoder the beam was defined as beam = {s | score(s) > α + best_score(S)}.</Paragraph> <Paragraph position="2"> To further reduce the size of the beam, two additional pruning thresholds have been added. The first threshold, π, is nominally for phone-level pruning, and the second, ω, is nominally for word-level pruning. The set of states P that π is applied to corresponds to the final (dummy) states of each instance of a phonetic model. The set of states W that ω is applied to corresponds to the final (dummy) states of the final phonetic models of each word. The inequality relationship among the three beam thresholds is given by eqn. 1. The set containment relationship among the three sets is given by eqn. 2.</Paragraph> <Paragraph position="3"> 1. α ≤ π ≤ ω 2. S ⊇ P ⊇ W.</Paragraph> <Paragraph position="4"> The motivation for partitioning the state space into subsets of states that are subject to different pruning thresholds comes from the observation that leads to the use of a pruning threshold in the first place. A state s is most likely to participate in the final decoding of the input when score(s) is closest to best_score(S). Similarly, a phonetic sub-word unit is most likely to participate in the final decoding when score(p) is closest to best_score(S), and likewise for the word units. The difference between the state sets P and W and the state set S is that more than a single state of contextual information is available. Put another way, when there is more information a tighter pruning threshold can be applied without an increase in search errors. Currently all the pruning thresholds are determined empirically. Informally, we have found that the best threshold settings for π and ω are two and four orders of magnitude tighter than α.</Paragraph> </Section>
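The three-level pruning can be sketched as follows. This Python is our own illustration, not the decoder's implementation; the threshold names follow the section above, the "kind" labels on states are invented, and the sign convention (thresholds as negative log offsets added to the best score, matching the beam definition quoted above) is an assumption.

# Sketch of state-, phone-, and word-level beam pruning.
# Scores are log probabilities; a state survives if it lies within the
# relevant threshold of the best score over the whole active set S.
def prune(active_states, alpha, pi, omega):
    """Return the surviving states given three beam thresholds.

    `active_states` is a dict: state_id -> (score, kind), where kind is
    "state", "phone_final", or "word_final" (hypothetical encoding).
    alpha <= pi <= omega, so word-final states face the tightest beam.
    """
    best = max(score for score, _ in active_states.values())
    beam = {}
    for state_id, (score, kind) in active_states.items():
        if kind == "word_final":
            threshold = omega
        elif kind == "phone_final":
            threshold = pi
        else:
            threshold = alpha
        if score > best + threshold:      # thresholds are negative offsets
            beam[state_id] = (score, kind)
    return beam

# Example call with toy values (not the paper's settings):
# prune(active_states, alpha=-12.0, pi=-8.0, omega=-4.0)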
<Section position="7" start_page="395" end_page="395" type="sub_section"> <SectionTitle> 3.4 Search Decomposition </SectionTitle> <Paragraph position="0"> The search is divided into four phases.</Paragraph> <Paragraph position="1"> 1. shared distribution probability computation 2. HMM arc probability evaluation 3. active HMM instance evaluation 4. language model application For each time frame the shared distribution probability computation first computes the probabilities of the top N=4 codewords in the codebook. Then the top N codewords and their probabilities are combined with each of the D=6255 discrete output probability distribution functions. Although not all distributions will be used at every frame of the search, a sufficiently large number are used that computation on demand is less efficient. The D output probabilities are then combined with the M=16,713 models in the HMM arc probability evaluation. Here we only compute the arc probabilities of those HMMs that have active instances as part of a word. Two advantages accrue from separating the arc probability computation from the state probability computation. First, the arc transition probability and acoustic probability need only be combined once. Second, this naturally leads to storing HMM arc transition probabilities separately from the HMM instances, which results in a space savings.</Paragraph> <Paragraph position="2"> [Table caption, partially recovered: ... in terms of error rate, memory size, and run time. The baseline result refers to the results obtained with the original decoder, which implemented no cross-word modeling.]</Paragraph> <Paragraph position="3"> The active HMMs, i.e. those HMM instances corresponding to phones in an active word, are updated with arc probabilities from the corresponding prototype HMM. In this case updating an HMM means combining all the HMM instance state probabilities with the appropriate arc probabilities of the prototype HMM and performing the Viterbi update procedure.</Paragraph> <Paragraph position="4"> For each word history h ending at time t, the language model is consulted for the vector of probabilities corresponding to the probability of each one-word extension of h. Between the language model and the word transition module sits a cache. For the WSJ-CSR 5000-word system, a 200-entry LRU cache provides a hit rate of 92%. The cache reduces the cost of using the language model by an order of magnitude. For a 5000-word lexicon, a 200-entry cache requires four megabytes.</Paragraph> </Section> </Section> </Paper>