<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1066"> <Title>SIGNAL PROCESSING FOR ROBUST SPEECH RECOGNITION</Title> <Section position="4" start_page="0" end_page="332" type="metho"> <SectionTitle> 2. ENVIRONMENTAL COMPENSATION ALGORITHMS </SectionTitle> <Paragraph position="0"> We begin this section by reviewing the previously-described MFCDCN algorithm, which is the basis for most of the new procedures discussed. We then discuss blind environment selection and environmental interpolation as they apply to MFCDCN. The complementary procedures of phone-dependent cepstral normalization and codebook adaptation are described. We close this section with brief description of reduced-bandwidth analysis and silence-codebook adaptation, which are very beneficial in processing telephone-bandwidth speech and speech recorded in the presence of strong background noise, respectively.</Paragraph> <Paragraph position="1"> (MFCDCN) provides additive cepstral compensation vectors that depend on signal-to-noise ratio (SNR) and that also vary from codeword to codeword of the vector-quantized (VQ) representation of the incoming speech at each SNR \[6\]. At low SNRs these vectors primarily compensate for effects of additive noise. At higher SNRs, the algorithm compensates for linear filtering, while at intermediate SNRs, they compensate for both of these effects.</Paragraph> <Paragraph position="2"> Environmental independence is provided by computing compensation vectors for a number of different environments and selecting the compensation environment that results in minimal residual VQ distortion.</Paragraph> <Paragraph position="3"> Compensation vectors for the chosen testing environment are applied to normalize the utterance according to the expression</Paragraph> <Paragraph position="5"> word index, instantaneous frame SNR, time frame index and the index of the chosen environment, respectively, and ~t, zt, and r are the compensated (transformed) data, original data and compensation vectors, respectively.</Paragraph> <Section position="1" start_page="330" end_page="331" type="sub_section"> <SectionTitle> 2.2. Blind Environment Selection </SectionTitle> <Paragraph position="0"> In several of the compensation procedures used, including MFCDCN, one of a set of environments must be selected as part of the compensation process. We considered three procedures for environment selection in our experiments.</Paragraph> <Paragraph position="1"> The first procedure, referred to as selection by compensation, applies compensation vectors from each possible environment successively to the incoming test utterance. The environment e is chosen that minimizes the average residual VQ distortion over the entire utterance. In the second approach, referred to as environmerit-specific VQ, environment-specific cookbooks are generated from the original uncompensated speech. By vector quantizing the test data using each environment-specific codebook in turn, the environment with the minimum VQ distortion is chosen. The third procedure, referred to as Gaassian environment classifier, models each environment with mixtures of Gaussian densities. Environment selection is accomplished so that the test data has the highest probability from the corresponding classifier. ~ latter approach is similar to one proposed previously by BBN \[7\].</Paragraph> <Paragraph position="2"> All three methods produce similar speech recognition accuracy.</Paragraph> <Paragraph position="3"> 2.3. 
<Paragraph position="3"> 2.3. Interpolated FCDCN (IFCDCN). In cases where the testing environment does not closely resemble any of the particular environments used to develop compensation parameters for MFCDCN, interpolating the compensation vectors of several environments can be more helpful than using compensation vectors from a single (incorrect) environment. As in MFCDCN, compensation vectors used in the interpolated fixed codeword-dependent cepstral normalization algorithm (IFCDCN) are precomputed for environments in the training database in the estimation phase. Compensation vectors for new environments are obtained by linear interpolation of several of the MFCDCN compensation vectors: $$\hat{r}[k, l] = \sum_{e=1}^{E} f_e \cdot r[k, l, e]$$ where $\hat{r}[k, l]$, $r[k, l, e]$, and $f_e$ are the estimated compensation vectors, the environment-specific compensation vectors for the e-th environment, and the weighting factor for the e-th environment, respectively.</Paragraph> <Paragraph position="4"> The weighting factors for each environment are also based on residual VQ distortion: $$f_e = \frac{\exp\left(-D_e(\bar{z}) / 2\sigma^2\right)}{\sum_{j=1}^{E} \exp\left(-D_j(\bar{z}) / 2\sigma^2\right)}$$ where $\sigma$ is the codebook standard deviation using speech from the CLSTLK microphone, $\bar{z}$ represents the testing utterance, and $D_j$ and $D_e$ are the residual VQ distortions of the j-th and e-th environments. We have generally used a value of 3 for E.</Paragraph> </Section>
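The interpolation above can be sketched as follows. The exponential form of the weights follows the reconstructed formula; the array layout, names, and toy values are assumptions for illustration.

```python
import numpy as np

def ifcdcn_weights(distortions, sigma, n_best=3):
    """Weight the n_best lowest-distortion environments by
    f_e = exp(-D_e / (2 sigma^2)), normalized to sum to one."""
    distortions = np.asarray(distortions, dtype=float)
    best = np.argsort(distortions)[:n_best]        # E closest environments
    logits = -distortions[best] / (2.0 * sigma ** 2)
    w = np.exp(logits - logits.max())              # subtract max for stability
    return best, w / w.sum()

def interpolate_vectors(r, best, weights):
    """r has shape (K, L, E, D); return the (K, L, D) interpolated vectors
    r_hat[k, l] = sum_e f_e * r[k, l, e]."""
    return np.tensordot(r[:, :, best, :], weights, axes=([2], [0]))

# Toy usage with 4 candidate environments and E = 3, as in the paper.
rng = np.random.default_rng(1)
r = rng.normal(size=(256, 5, 4, 13))
best, w = ifcdcn_weights([4.2, 3.1, 3.3, 5.0], sigma=1.0)
r_hat = interpolate_vectors(r, best, w)            # shape (256, 5, 13)
```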
<Section position="2" start_page="331" end_page="332" type="sub_section"> <SectionTitle> 2.4. Phone-Dependent Cepstral Normalization (PDCN) </SectionTitle> <Paragraph position="0"> In this section we discuss an approach to environmental compensation in which additive cepstral compensation vectors are selected according to the current phoneme hypothesis in the search process, rather than according to physical parameters such as SNR or VQ codeword identity. Since this phoneme-based approach relies on information from the acoustic-phonetic and language models to determine the compensation vectors, it can be referred to as a "back-end" compensation procedure, while other approaches such as MFCDCN, which work independently of the decoder, can be regarded as "front-end" compensation schemes.</Paragraph> <Paragraph position="1"> Estimation of PDCN Compensation Vectors. In the current implementation of phone-dependent cepstral normalization (PDCN), we develop compensation vectors that are specific to individual phonetic events, using a base phone set of 51 phonemes, including silence but excluding other types of non-lexical events. This is accomplished by running the decoder in supervised mode using CLSTLK data and the correct transcriptions. All CLSTLK utterances are divided into phonetic segments. For every phonetic label, a difference vector is computed by accumulating the cepstral difference between the CLSTLK training data, $x_t$, and its noisy counterpart, $z_t$. Compensation vectors are computed by averaging the corresponding difference vectors as follows: $$c[p] = \frac{\sum_{u=1}^{A} \sum_{t=1}^{T_u} (x_t - z_t)\,\delta(f_t, p)}{\sum_{u=1}^{A} \sum_{t=1}^{T_u} \delta(f_t, p)}$$ where $f_t$ is the phoneme for frame t, p the phoneme index, and $T_u$ the length of the u-th utterance out of A sentences.</Paragraph> <Paragraph position="2"> Compensation of PDCN in Recognition. The SPHINX-II system uses the senone [4,8], a generalized state-based probability density function, as the basic unit to compute the likelihood from acoustical models. The probability density function for senone s in frame t for the cepstral vector $z_t$ of incoming speech can be expressed as $$p(z_t | s) = \sum_{m_t=1}^{B} w_{m_t} N(z_t; \mu_{m_t}, \sigma_{m_t})$$ where $m_t$ is the index over the best B Gaussian mixtures of senone s for cepstral vector $z_t$, and $\mu_{m_t}$, $\sigma_{m_t}$, and $w_{m_t}$ are the corresponding mean, standard deviation, and weight for the $m_t$-th mixture of senone s. Multiple compensated cepstral vectors are formed in PDCN by adding the compensation vectors to the incoming cepstra on a frame-by-frame basis, $\hat{x}_{t,p} = z_t + c[p]$, for the presumed phoneme index p.</Paragraph> <Paragraph position="3"> The amount of computation needed for this procedure is reduced because in SPHINX-II, each senone corresponds to only one distinctive base phoneme. A cepstral vector can therefore be normalized with the proper PDCN compensation vector corresponding to the particular base phonetic identity, and senone probabilities can be calculated using the presumed phonetic identity that corresponds to a given senone. Using this approach, the senone probability becomes $$p(\hat{x}_{t,p} | s) = \sum_{n_t=1}^{B} w_{n_t} N(\hat{x}_{t,p}; \mu_{n_t}, \sigma_{n_t})$$ where $n_t$ is the index over the best B Gaussian mixtures for senone s at frame t with respect to the PDCN-normalized cepstral vector $\hat{x}_{t,p}$, for the corresponding phonetic label p for senone s.</Paragraph> <Paragraph position="4"> Interpolated PDCN (IPDCN). PDCN, like SDCN and FCDCN [3,6], assumes the existence of a database of utterances recorded in stereo in the training and testing environments. In situations where no data from a particular testing environment are available for estimation, IPDCN is desirable. Based on an ensemble of pre-computed PDCN compensation vectors, IPDCN applies to the incoming utterance an interpolation of the compensation vectors from several of the closest environments. The interpolation is performed in the same way as for IFCDCN. In the current implementation, we use the 3 closest environments and the best 4 Gaussian mixtures in the interpolation.</Paragraph>
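A minimal sketch of the PDCN scoring step: because each senone is tied to a single base phone, a frame is shifted by that phone's compensation vector before the senone's Gaussian-mixture score is evaluated. The diagonal-covariance mixture form, names, and toy parameters are assumptions, not the SPHINX-II implementation.

```python
import numpy as np

def log_diag_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian, broadcast over mixtures."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def pdcn_senone_logprob(z_t, senone, c):
    """Score frame z_t against one senone after adding the compensation
    vector c[p] for the senone's base phone p: x_hat = z_t + c[p]."""
    x_hat = z_t + c[senone["phone"]]
    comp = (np.log(senone["weights"])
            + log_diag_gauss(x_hat, senone["means"], senone["vars"]))
    return np.logaddexp.reduce(comp)   # log sum_m w_m N(x_hat; mu_m, sigma_m)

# Toy usage: 51 phone-dependent vectors, one 4-mixture senone tied to phone 7.
rng = np.random.default_rng(2)
D = 13
c = rng.normal(scale=0.1, size=(51, D))  # c[p] from averaged stereo differences
senone = {"phone": 7,
          "weights": np.full(4, 0.25),
          "means": rng.normal(size=(4, D)),
          "vars": np.full((4, D), 1.0)}
z_t = rng.normal(size=D)
print(pdcn_senone_logprob(z_t, senone, c))
```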
<Paragraph position="5"> 2.5. Codebook Adaptation (DCCA and BWCA). A vector quantization (VQ) codebook, which is a set of mean vectors and/or covariance matrices of cepstral representations, also exhibits some fundamental differences when mismatches are encountered between training and testing environments [7]. This suggests that when such mismatches exist, the codebook can be "tuned" to better characterize the cepstral space of the testing data. In this section, we propose two different implementations of such codebook adaptation.</Paragraph> <Paragraph position="6"> Dual-Channel Codebook Adaptation (DCCA). DCCA exploits the existence of speech that is simultaneously recorded using the CLSTLK microphone and a number of secondary microphones. From the viewpoint of front-end compensation, the senone probability density function can be expressed as the Gaussian mixture $$p(z_t | s) = \sum_{k=1}^{B} w_k N(z_t + r; \mu_k, \sigma_k)$$ where $B$, $z_t$, $r$, $z_t + r$, $\mu_k$, and $\sigma_k$ are the number of mixtures, noisy observation vector, compensation vector, compensated vector, mean vector for the k-th mixture, and variance, respectively. The senone probability density function is rewritten as $$p(z_t | s) = \sum_{k=1}^{B} w_k N(z_t; \mu_k + \delta\mu_k, \sigma_k + \delta\sigma_k)$$ where $\delta\mu_k$ and $\delta\sigma_k$ are the deviations from the cepstral space of the target noisy environment to that of the reference training environment for the means and variances of the corresponding Gaussian mixtures.</Paragraph> <Paragraph position="7"> In implementing DCCA, VQ encoding is performed on speech from the CLSTLK microphone processed with CMN. The output VQ labels are shared by the CLSTLK data and the corresponding data in the secondary (or target) environment. For each subspace in the CLSTLK training environment, we generate the corresponding means and variances for the target environment. Thus, a one-to-one mapping between the means and variances of the cepstral space of the CLSTLK training condition and that of the target condition is established. Recognition is accomplished by shifting the means of the Gaussian mixtures according to the relationships $$\hat{\mu}_k = \mu_k + \delta\mu_k, \qquad \hat{\sigma}_k = \sigma_k + \delta\sigma_k$$</Paragraph> <Paragraph position="8"> Baum-Welch Codebook Adaptation (BWCA). There are many applications in which stereo data simultaneously recorded in the CLSTLK and target environments are unavailable. In these circumstances, transformations can be developed between environments from the contents of the adaptation utterances using the Baum-Welch algorithm.</Paragraph> <Paragraph position="9"> In Baum-Welch codebook adaptation, mean vectors and covariance matrices, along with senones, are re-estimated and updated using the Baum-Welch algorithm [5] during each iteration of the training process. To compensate for the effect of changes in acoustical environments, the Baum-Welch approach is used to transform the means and covariances toward the cepstral space of the target testing environments. This is exactly like normal training, except that only a few adaptation utterances are available, and the number of free parameters to be estimated (i.e. the means and variances of the VQ codewords) is very small.</Paragraph> <Paragraph position="10"> 2.6. Reduced-Bandwidth Analysis for Telephone Speech. The conventional SPHINX-II system uses signal processing that extracts Mel-frequency cepstral coefficients (MFCC) over an analysis range of 130 to 6800 Hz. This choice of analysis bandwidth is appropriate when the system processes speech recorded through good-quality microphones such as the CLSTLK microphone. Nevertheless, when speech is recorded from telephone lines, previous research at CMU [9] indicates that error rates are sharply decreased when the analysis bandwidth is reduced. This is accomplished by performing the normal DFT analysis with the normal 16,000-Hz sampling rate, but retaining only the DFT coefficients after triangular frequency smoothing from center frequencies of 200 to 3700 Hz. Reduced-bandwidth MFCC coefficients are obtained by performing the discrete cosine transform only on these frequency-weighted DFT coefficients.</Paragraph> <Paragraph position="11"> To determine whether or not speech from an unknown environment is telephone speech, we use the Gaussian environment classifier approach described in Sec. 2.2. Two VQ codebooks are used, one for telephone speech using a wideband front-end analysis, and another for non-telephone speech. The speech was classified to maximize environmental likelihood.</Paragraph>
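A sketch of the reduced-bandwidth front end described in Sec. 2.6: a standard DFT at the 16-kHz sampling rate, a textbook triangular Mel filterbank restricted to center frequencies between 200 and 3700 Hz, and a DCT over only those filter outputs. Filter counts, frame sizes, and all names are illustrative assumptions rather than the exact SPHINX-II parameters.

```python
import numpy as np

def mel(f):        # Hz -> Mel
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):    # Mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filt, n_fft, fs, f_lo, f_hi):
    """Triangular filters, shape (n_filt, n_fft//2 + 1), centers in [f_lo, f_hi]."""
    pts = mel_inv(np.linspace(mel(f_lo), mel(f_hi), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def reduced_bw_mfcc(frame, fs=16000, n_fft=512, n_filt=24, n_cep=13):
    """MFCCs computed only from filters centered between 200 and 3700 Hz."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    fb = mel_filterbank(n_filt, n_fft, fs, 200.0, 3700.0)
    log_e = np.log(fb @ spec + 1e-10)
    # DCT-II over the retained filter outputs only.
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_cep), n + 0.5) / n_filt)
    return dct @ log_e

frame = np.random.default_rng(3).normal(size=400)  # one 25-ms frame at 16 kHz
cep = reduced_bw_mfcc(frame)
```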
<Paragraph position="12"> 2.7. Silence Codebook Adaptation. When dealing with speech-like noises, such as background speech or music, the compensation techniques described above provide only partial recovery. Most of these techniques assume certain statistical features for the noise (such as stationarity at the sentence level) that are not valid. The SPHINX-II recognition system still produces a large number of insertion errors in difficult recognition environments, such as those used in the 1993 CSR Spoke 8 evaluation, even when cepstral compensation is used. We have found that the use of silence codebook adaptation (SCA) helps reduce insertion rates in these circumstances by providing better discrimination between speech and speech-like noises.</Paragraph> <Paragraph position="13"> In SCA processing, the HMM parameters (codebook means, variances, and probabilities) are updated for the silence and noise segments by exposure to training data from a corpus that more closely approximates the testing environment than speech from the CLSTLK microphone. If not enough data are available, an update of the cepstral means only is performed. Further details on how this procedure was implemented for the 1993 CSR Spoke 8 evaluation are provided in Sec. 4.2.</Paragraph> </Section> </Section>
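A minimal sketch of the silence-codebook update in Sec. 2.7, simplified to a hard VQ assignment of adaptation frames in place of the Baum-Welch soft counts used in the paper; updating means only mirrors the low-data case described above. All names and toy data are illustrative assumptions.

```python
import numpy as np

def adapt_silence_codebook(means, variances, frames, update_vars=False):
    """Move each silence/noise codeword toward the adaptation frames
    assigned to it; with scarce data, update the means only."""
    means, variances = means.copy(), variances.copy()
    labels = np.argmin(
        ((frames[:, None, :] - means[None, :, :]) ** 2).sum(-1), axis=1)
    for k in range(len(means)):
        assigned = frames[labels == k]
        if len(assigned) == 0:
            continue                        # keep the old parameters
        means[k] = assigned.mean(axis=0)
        if update_vars and len(assigned) > 1:
            variances[k] = assigned.var(axis=0)
    return means, variances

# Toy usage: 8 silence/noise codewords adapted on 200 noisy background frames.
rng = np.random.default_rng(4)
sil_means = rng.normal(size=(8, 13))
sil_vars = np.ones((8, 13))
noise_frames = rng.normal(loc=0.5, size=(200, 13))
new_means, new_vars = adapt_silence_codebook(sil_means, sil_vars, noise_frames)
```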
<Section position="5" start_page="332" end_page="332" type="metho"> <SectionTitle> 3. PERFORMANCE OF ALGORITHMS IN DEVELOPMENTAL TESTING </SectionTitle> <Paragraph position="0"> In this and the following section we describe the results of a series of experiments that compare the recognition accuracy of the various algorithms described in Sec. 2 using the ARPA CSR Wall Street Journal task. The 7000 WSJ0 utterances recorded using the CLSTLK microphone were used for the training corpus, and in most cases the system was tested using the 330 utterances from secondary microphones in the 1992 evaluation test set. This test set has a closed vocabulary of 5000 words.</Paragraph> <Paragraph position="1"> Two implementations of the SPHINX recognition system were used in these evaluations. For most of the development work and for the official 1993 CSR evaluations for Spoke 5 and Spoke 8, a smaller and faster version of SPHINX-II was used than the implementation used for the official ATIS and CSR Hub evaluations. We refer to the faster system as SPHINX-IIa in this paper. SPHINX-IIa differs from SPHINX-II in two ways: it uses a bigram grammar (rather than a trigram grammar) and it uses only one codebook (rather than 27 phone-dependent codebooks). Spoke 5 and selected other test sets were subsequently re-evaluated using a version of SPHINX-II that was very similar to the one used in the ATIS and CSR Hub evaluations.</Paragraph> <Section position="1" start_page="332" end_page="332" type="sub_section"> <SectionTitle> 3.1. Comparison of MFCDCN, IFCDCN, PDCN, and IPDCN </SectionTitle> <Paragraph position="0"> We first consider the relative performance of the MFCDCN, IFCDCN, PDCN, and IPDCN algorithms, which were evaluated using the training and test sets described above. Recognition accuracy using these algorithms is compared to the baseline performance, which was obtained using conventional Mel-cepstrum based signal processing in conjunction with cepstral mean normalization (CMN).</Paragraph> <Paragraph position="1"> Table 1 compares word error rates obtained using the various processing schemes, along with the corresponding reduction of word error rates with respect to the CMN baseline. Compensation vectors used for these comparisons were developed from training data that include the testing environments. Table 2 summarizes similar results that were obtained when the actual testing environment was excluded from the set of data used to develop the compensation vectors.</Paragraph> </Section> </Section> <Section position="6" start_page="332" end_page="332" type="metho"> <SectionTitle> Table 2 </SectionTitle> <Paragraph position="0"> [Table 2 caption fragment: "... PDCN as in Table 1, but with the testing environments excluded from the corpus used to develop compensation vectors." Columns: COMPENSATION | CLSTLK | OTHER; the CLSTLK error rate is 7.6 for every row, and the corresponding OTHER rates are 21.4, 16.1, 14.8, 15.6, and 13.5.]</Paragraph> <Paragraph position="1"> The results of Table 1 indicate that PDCN, when applied in isolation, provides a recognition error rate that is not as good as that obtained using MFCDCN. Nevertheless, the effects of PDCN and MFCDCN are complementary, in that the use of the two algorithms in combination provides a lower error rate than was observed with either algorithm applied by itself. The results in Table 2 demonstrate that the use of environment interpolation is helpful when the testing environment is not included in the set used to develop compensation vectors. Environmental interpolation degrades performance slightly, however, when the actual testing environment was observed in developing the compensation vectors.</Paragraph> <Section position="2" start_page="332" end_page="332" type="sub_section"> <SectionTitle> 3.2. Performance of Codebook Adaptation </SectionTitle> <Paragraph position="0"> Table 3 compares word error rates obtained with DCCA and BWCA as described in Sec. 2.5 with error rates obtained with CMN and MFCDCN. The Baum-Welch codebook adaptation was implemented with four iterations of re-estimation, re-estimating the codebook means only. (Means and variances were re-estimated in a pilot experiment, but with no improvement in performance.)</Paragraph> </Section> </Section> <Section position="7" start_page="332" end_page="332" type="metho"> <SectionTitle> Table 3 </SectionTitle> <Paragraph position="0"> [Table 3: word error rates, columns COMPENSATION | CLSTLK | OTHER.] Table 4 provides similar comparisons, but with the testing environments excluded from the corpus used to develop compensation vectors (as in Table 2).</Paragraph> </Section> <Section position="8" start_page="333" end_page="333" type="metho"> <SectionTitle> Table 4 </SectionTitle> <Paragraph position="0"> [Table 4 caption fragment: "... environments excluded from the corpus used to develop the compensation vectors." Columns: COMPENSATION | CLSTLK | OTHER.] The results of Tables 3 and 4 indicate that the effectiveness of codebook adaptation used in isolation to reduce error rate is about equal to that of MFCDCN. Once again, the use of environmental interpolation is helpful in cases in which the testing environment was not used to develop compensation vectors.</Paragraph> <Section position="1" start_page="333" end_page="333" type="sub_section"> <SectionTitle> 3.3. Reduced-Bandwidth Processing for Telephone Speech </SectionTitle> <Paragraph position="0"> Table 5 compares error rates obtained using conventional signal processing versus reduced-bandwidth analysis for the telephone-microphone subset of the development set, and for the remaining microphones.</Paragraph> <Paragraph position="1"> [Table 5 caption fragment: "... reduced-bandwidth analysis for the 1992 WSJ test set."]</Paragraph>
<Paragraph position="2"> It can be seen in Table 5 that the use of a reduced analysis bandwidth dramatically reduces the recognition error rate when the system is trained with high-quality speech and tested using telephone-bandwidth speech.</Paragraph> <Paragraph position="3"> In an unofficial evaluation we applied reduced-bandwidth analysis to the telephone speech data of Spoke 6 (known-microphone adaptation) from the 1993 ARPA CSR evaluation. Using a version of SPHINX-II trained with only the 7000 sentences of the WSJ0 corpus, we observed an error rate for the test set of 13.4%. This result compares favorably with results reported by other sites using systems that were trained with both the WSJ0 and WSJ1 corpora.</Paragraph> </Section> </Section> <Section position="9" start_page="333" end_page="334" type="metho"> <SectionTitle> 4. PERFORMANCE USING THE 1993 CSR EVALUATION DATA </SectionTitle> <Paragraph position="0"> We summarize in this section the results of experiments using the 1993 ARPA CSR WSJ test set, including official results obtained using SPHINX-IIa and subsequent evaluations with SPHINX-II.</Paragraph> <Paragraph position="1"> 4.1. Spoke 5: Microphone Independence. The testing data for Spoke 5 were samples of speech from 10 microphones that had never been used previously in ARPA evaluations. One of the microphones is a telephone handset and another is a speakerphone.</Paragraph> <Paragraph position="2"> The evaluation system for Spoke 5 first performs a crude environmental classification using the Gaussian environment classifier, blindly separating incoming speech into utterances that are assumed to be recorded either from wideband microphones or from telephone-bandwidth microphones and channels. Speech that is assumed to be recorded from a full-bandwidth microphone is processed using a combination of IFCDCN and IPDCN, interpolating over the closest three environments for IFCDCN and over the best four environments for IPDCN. Speech that is believed to be recorded through a telephone channel is processed using a combination of the narrow-band processing described in Sec. 2.6 and MFCDCN. The systems were trained using the CLSTLK microphone.</Paragraph> <Paragraph position="3"> [Table 6 caption fragment: "... different recognition systems, SPHINX-IIa and SPHINX-II."] Recognition error rates on the 5000-word S5 task are summarized in Table 6. The results for SPHINX-IIa are the official evaluation results. The test data were re-run in 12/93 using a version of SPHINX-II that was very similar to that used for the Hub evaluation. Although this evaluation was "unofficial", it was performed without any further algorithm development or exposure to the test set after the official evaluation. We note that the baseline system (without "compensation") already includes CMN.</Paragraph> <Paragraph position="4"> We believe that one of the most meaningful figures of merit for environmental compensation is the ratio of errors for the P0 and C2 conditions (i.e. the ratio of errors obtained with CLSTLK speech and speech in the target environments with compensation enabled). For this test set, switching from speech from the CLSTLK microphone to speech from the secondary microphones causes the error rates to increase by a factor of 1.3 for the 8 non-telephone environments, by a factor of 2.4 for the 2 telephone environments, and by a factor of 1.5 for the complete set of testing data. In fact, in 3 of the 10 secondary environments, the compensated error rate obtained using the secondary mics was within 25 percent of the CLSTLK error rate.</Paragraph>
<Paragraph position="5"> Interestingly enough, the ratio of errors for the P0 and C2 conditions is unaffected by whether SPHINX-II or SPHINX-IIa was used for recognition, confirming that in these conditions the amount of error reduction provided by environmental compensation does not depend on how powerful a recognition system is used.</Paragraph> <Paragraph position="6"> 4.2. Spoke 8: Calibrated Noise Sources. Spoke 8 considers the performance of speech recognition systems in the presence of background interference consisting of speech from AM-radio talk shows or various types of music at 3 different SNRs: 0, 10, and 20 dB. The speech is simultaneously collected using two microphones, the CLSTLK microphone and a desktop Audio-Technica microphone with known acoustical characteristics. The 0-dB and 10-dB conditions are more difficult than the acoustical environment of Spoke 5, because the background signal for both the AM-radio and music conditions is frequently speech-like, and because it is highly non-stationary. SNRs are measured at the input to the Audio-Technica microphone.</Paragraph> <Paragraph position="7"> The evaluation system used a combination of two algorithms for environmental robustness, MFCDCN and silence codebook adaptation (SCA). New silence codebooks were created using an implementation of Baum-Welch codebook adaptation, as described in Sec. 2.5. Two cepstral codebooks were developed for SPHINX-IIa, with one codebook representing the noise and silence HMMs, and the other codebook representing the other phones. The normal Baum-Welch re-estimation formulas were used, updating only the means for the noise and silence HMMs.</Paragraph> <Paragraph position="8"> [Table 7 caption fragment: "... SPHINX-IIa. The evaluation system included MFCDCN and SCA. The Audio-Technica is used as the secondary microphone."]</Paragraph> <Paragraph position="9"> The reduced-performance SPHINX-IIa system was used as in Spoke 5, except that two cepstral codebooks were needed to implement silence codebook adaptation, one to model the noise and silence segments, and the other to model the remaining phonemes. Four codebooks were developed to model silence segments for different combinations of SNR and background noise, and the codebook used to provide silence compensation was chosen blindly on the basis of minimizing residual VQ distortion. System parameters were chosen to optimize performance for the 10-dB SNR. Results for Spoke 8 are summarized in Table 7. By comparing the C2 and C3 results using the CLSTLK microphone with the P0 and S1 results using the Audio-Technica mic, we note that very little degradation is observed in the 20-dB condition, but that recognition accuracy is quite low for the 0-dB condition, even when the signal is compensated. The use of MFCDCN and SCA improves recognition accuracy by 35.0 percent overall, and the ratio of overall error rates for the C2 and P0 conditions is 1.49, as in Spoke 5. As expected, the AM-radio interference was more difficult to cope with than the musical interference at all SNRs, presumably because it is more speech-like.</Paragraph> </Section> </Paper>