<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1067"> <Title>MICROPHONE-INDEPENDENT ROBUST SIGNAL PROCESSING USING PROBABILISTIC OPTIMUM FILTERING</Title> <Section position="5" start_page="337" end_page="339" type="evalu"> <SectionTitle> 3. EXPERIMENTS </SectionTitle> <Paragraph position="0"> A series of experiments shows how the mapping algorithm can be used in a continuous speech recognizer across acoustic environments. In all of the experiments the recognizer models are trained on data recorded with high-quality microphones and digitally sampled at 16,000 Hz. The analysis frame rate is 100 Hz.</Paragraph> <Paragraph position="1"> The tables below show three types of performance indicators: * Relative distortion measure. For a given component of a feature vector we define the relative distortion between the clean feature x and the noisy feature y as the normalized mean-squared difference D = E[(x - y)^2] / E[x^2].</Paragraph> <Paragraph position="3"> * Error ratio. The error ratio is given by E_n/E_c, where E_n is the word recognition error for the test-noisy/train-clean condition and E_c is the word recognition error for the test-clean/train-clean condition.</Paragraph> <Section position="1" start_page="337" end_page="337" type="sub_section"> <SectionTitle> 3.1. Single Microphone </SectionTitle> <Paragraph position="0"> To test the POF algorithm on a single target acoustic environment we used the DARPA Wall Street Journal database \[15\] with SRI's DECIPHER TM phonetically tied-mixture speech recognition system \[2\]. The signal processing consisted of a filterbank-based front end that generated six feature streams: cepstrum (c1-c12), cepstral energy (c0), and their first- and second-order derivatives.</Paragraph> <Paragraph position="1"> Cepstral-mean normalization \[16\] was used to equalize the channel. 
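Cepstral-mean normalization is simple enough to sketch directly. The following is a minimal illustration, not the front end used in the paper: subtracting the per-utterance mean from each cepstral coefficient removes a fixed convolutional channel, which appears as an additive offset in the cepstral domain. The array names are hypothetical.

```python
import numpy as np

def cepstral_mean_normalize(cepstra):
    """Channel equalization by cepstral-mean normalization:
    subtract the per-utterance mean from every frame, so a
    fixed channel (additive in the cepstral domain) cancels."""
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# A constant channel offset added to every frame is removed exactly.
frames = np.array([[1.0, 2.0], [3.0, 4.0]])   # (frames, cepstral dims)
offset = np.array([0.5, -0.5])                # hypothetical channel offset
out = cepstral_mean_normalize(frames + offset)
# out matches cepstral_mean_normalize(frames): the offset is gone
```

The normalized features also have zero mean per coefficient, which is what makes the mapping's training and test channels comparable.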
We used simultaneous recordings of high-quality speech (Sennheiser 414 head-mounted microphone with a noise-canceling element) along with speech recorded by a standard speakerphone (AT&T 720) and transmitted over local telephone lines. We will refer to this stereo data as clean and noisy speech, respectively. The models of the recognizer were trained using 42 male WSJ0 training talkers (3,500 sentences) recorded with a Sennheiser microphone. The models of the mapping algorithm were trained using 240 development training sentences recorded by three speakers. The test set consisted of 100 sentences (not included in the training set) recorded by the same three speakers.</Paragraph> <Paragraph position="2"> In this experiment we mapped two of the six features separately: the cepstrum (c1-c12) and the cepstral energy (c0). The derivatives were computed from the mapped cepstral feature vectors. For the conditioning feature we used a 13-dimensional cepstral vector (c0-c12) modeled with 512 Gaussians with diagonal covariance matrices. The results are shown in Table 2.</Paragraph> <Paragraph position="3"> Table 2 shows results as a function of the number of filter coefficients. The number of Gaussian distributions is 512 per feature and the conditioning feature is a 13-dimensional cepstral vector.</Paragraph> <Paragraph position="4"> The baseline experiment produced a word error rate of 27.6% on the noisy test set, that is, 2.46 times the error obtained when using the clean data channel. A 34% improvement in recognition performance was obtained when using only the additive filter coefficient b_i (recognition error drops to 18.1%). The best result (15.9% recognition error) was obtained for the condition p=3, in which six neighboring noisy frames are used to estimate the feature vector for the current frame. 
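The frame-mapping step can be sketched as follows. This is a simplified stand-in for the POF mapping, not the paper's implementation: the per-region filter matrices `A`, offsets `b`, and the `region` selector are hypothetical placeholders for parameters that would be trained on the stereo data, and region selection is reduced to a hard choice rather than the Gaussian-posterior weighting POF uses.

```python
import numpy as np

def pof_map_frame(noisy, t, A, b, region):
    """Estimate the clean feature at frame t from 2p+1 noisy frames.
    A: (R, D, (2p+1)*D) per-region filter matrices, b: (R, D) offsets;
    `region` maps the current frame to an acoustic-region index
    (a hard simplification of POF's Gaussian-based conditioning)."""
    p = (A.shape[2] // A.shape[1] - 1) // 2
    T, D = noisy.shape
    # stack the neighboring frames, replicating at utterance edges
    idx = np.clip(np.arange(t - p, t + p + 1), 0, T - 1)
    context = noisy[idx].reshape(-1)          # ((2p+1)*D,)
    k = region(noisy[t])                      # acoustic-region index
    return A[k] @ context + b[k]

# Toy check: one region, p=1, filter that just copies the center frame.
D, p = 2, 1
A = np.zeros((1, D, (2 * p + 1) * D))
A[0, :, p * D:(p + 1) * D] = np.eye(D)        # select the center frame
b = np.zeros((1, D))
noisy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
est = pof_map_frame(noisy, 1, A, b, region=lambda x: 0)
# est equals the center frame [3.0, 4.0] for this identity filter
```

With p=3 as in the best condition above, the context window spans the current frame plus its six neighbors, and a trained A would exploit the temporal correlation between them.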
The correlation between the average relative distortion of the six clean/noisy feature pairs and the recognition error is 0.9.</Paragraph> </Section> <Section position="2" start_page="337" end_page="339" type="sub_section"> <SectionTitle> 3.2. ATIS Simultaneous Corpus </SectionTitle> <Paragraph position="0"> To test the performance of the POF algorithm on multiple microphones we used SRI's stereo-ATIS database. (See \[1\] for details.) A corpus of both training and testing speech was collected using simultaneous recordings made from subjects wearing a Sennheiser HMD 414 microphone and holding a telephone handset. The speech from the telephone handset was transmitted over local telephone lines during data collection. Ten different telephone handsets were used. Ten male speakers were designated as training speakers, and three male speakers were designated as the test set.</Paragraph> <Paragraph position="1"> The training set consisted of 3,000 simultaneous recordings of Sennheiser microphone and telephone speech. The test set consisted of 400 simultaneous recordings of Sennheiser and telephone speech. The results obtained with this pilot corpus are shown in Table 3, which reports word error rates on the 400-sentence simultaneous test set.</Paragraph> <Paragraph position="2"> We can see from Table 3 that there is a 15.4% relative decrease in performance when using a telephone front end and testing on Sennheiser data (word error increases from 7.8% to 9.0%). This is due to the loss of information in reducing the bandwidth from 100-6400 Hz to 300-3300 Hz. However, when using a telephone front end, there is only a 7.8% relative increase in word error when testing on telephone speech compared to testing on Sennheiser speech (9.7% versus 9.0%). This is a very surprising result; we had expected a much bigger performance difference when Sennheiser models are tested on telephone speech acoustics.</Paragraph> <Paragraph position="3"> 3.3. 
Multiple Microphones: Single or Multiple Mapping The POF mapping algorithm can be used in a number of ways when the microphone is unknown. Some of these variations are shown in Table 4.</Paragraph> <Paragraph position="4"> Table 4 shows the results obtained when the mapping algorithm is used in different ways.</Paragraph> <Paragraph position="5"> The differences between the experimental conditions are small, but the trends differ and depend on the mapping and the corpus. These differences depend on the similarities among the microphones used in the training conditions, and on the relationship between the training and the testing conditions.</Paragraph> <Paragraph position="6"> When the microphones are all similar (10 telephone mappings), averaging the features of each mapping helps improve performance. When the microphones are very different (e.g., those in the WSJ corpus), the error has a minimum when only the two feature streams with the highest likelihoods are averaged.</Paragraph> <Paragraph position="7"> 3.4. Multiple Microphones: Conditioning Feature The next experiment varied the conditioning feature. The conditioning feature is the feature vector used to divide the space into different acoustic regions. In each region of the acoustic space a different linear transformation is trained.</Paragraph> <Paragraph position="8"> The mapping approach was fixed: we used a single POF mapping for multiple telephone handsets. For this experiment we mapped the cepstrum vector (c1-c12) and the cepstral energy (c0). The maximum delay of the filters was kept fixed at p=2, and the number of Gaussians was 512. The experimental variable was the feature on which the estimates were conditioned. We tried the following conditioning features: * Cepstrum. The same conditioning feature used in the single-microphone experiment (c0-c12).</Paragraph> <Paragraph position="9"> * Spectral SNR. This is an estimate of the instantaneous signal-to-noise ratio computed in the log-filterbank energy domain. 
The vector size is 25.</Paragraph> <Paragraph position="10"> * Cepstral SNR. This feature is generated by applying the discrete cosine transform (DCT) to the spectral SNR. The transformation reduces the dimensionality of the vector from 25 to 12 elements.</Paragraph> <Paragraph position="11"> The results are shown in Table 5. The baseline result is a 19.4% word error rate. This result is achieved when the same wide-band front end is used for training the models with clean data and for recognition using telephone data. When a telephone front end \[1\] is used for training and testing, the error decreases to 9.7%. The disadvantage of this approach is that the acoustic models of the recognizer have to be reestimated. However, the POF-based front end operates on the clean models and results in better performance. The cepstral SNR produces the best result (8.7%). With this conditioning feature we combine the effects of noise and spectral shape in a compact representation.</Paragraph> </Section> </Section> </Paper>