<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1056">
  <Title>REDUCED CHANNEL DEPENDENCE FOR SPEECH RECOGNITION</Title>
  <Section position="5" start_page="0" end_page="280" type="metho">
    <SectionTitle>
3. THE RASTA FILTER
</SectionTitle>
    <Paragraph position="0"> RASTA filtering applies a high-pass filter to a log-spectral representation of speech, removing slowly varying components from the log spectrum. The filtering is done on the log-spectral representation so that distortions that are multiplicative in the spectral domain (such as a linear filter) become additive and can be removed by the RASTA filter. A simple RASTA filter may be implemented as follows: $y(t) = x(t) - x(t-1) + C \cdot y(t-1)$, where x(t), as implemented in DECIPHER TM, is a log bandpass energy that is normally used in DECIPHER TM to compute the Mel-cepstral feature vector. With RASTA filtering, x(t) is replaced by y(t), the high-pass version of x(t), when performing the cepstral transform.</Paragraph>
    <Paragraph position="1"> The constant, C, in the above equation defines the time constant of the RASTA filter. It is desirable that C be such that short-term variations in the log spectra (presumably important parts of the speech signal) are passed by the filter, but slower variations are blocked. We set C = 0.97 so that signals that vary faster than about 1 Hz are passed and those that vary less than once per second tend to be blocked. Figure 1 below plots the characteristic of this filter.</Paragraph>
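    <Paragraph> As a rough illustration of the recursion y(t) = x(t) - x(t-1) + C y(t-1), the following Python sketch filters one log-energy channel and evaluates the filter gain at a few frequencies. The initial conditions (x(-1) = x(0), y(-1) = 0), the 100 Hz frame rate, and the function names rasta_highpass and rasta_gain are assumptions made for this example; the paper does not specify them.

```python
import math


def rasta_highpass(x, c=0.97):
    """Apply y(t) = x(t) - x(t-1) + c * y(t-1) to one log-energy channel.

    Assumption: x(-1) is taken equal to x(0) and y(-1) = 0, so the first
    output sample is 0. The source does not state the initial conditions.
    """
    y = []
    prev_x = x[0] if x else 0.0
    prev_y = 0.0
    for sample in x:
        prev_y = sample - prev_x + c * prev_y
        prev_x = sample
        y.append(prev_y)
    return y


def rasta_gain(freq_hz, c=0.97, frame_rate_hz=100.0):
    """Magnitude response of H(z) = (1 - z^-1) / (1 - c z^-1).

    frame_rate_hz = 100 (10 ms frames) is an assumption, not from the source.
    """
    omega = 2.0 * math.pi * freq_hz / frame_rate_hz
    z_inv = complex(math.cos(omega), -math.sin(omega))  # e^(-j * omega)
    return abs(1.0 - z_inv) / abs(1.0 - c * z_inv)


if __name__ == "__main__":
    channel = [2.0, 2.5, 1.5, 3.0, 2.0]
    # A fixed channel gain appears as a constant offset in the log domain.
    shifted = [v + 4.0 for v in channel]
    print(rasta_highpass(channel))
    print(rasta_highpass(shifted))  # matches the line above up to rounding
    for f in (0.1, 1.0, 10.0):
        print(f, round(rasta_gain(f), 3))
```

Under these assumptions a constant log-domain offset is removed entirely, and the gain rises from zero at DC toward unity above roughly 1 Hz, which is consistent with the behavior described for C = 0.97.</Paragraph>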
    <Paragraph position="3"> When used in conjunction with SRI's spectral estimation algorithms [4, 5], the high-pass filter is applied to the filter-bank log energies after the spectral estimation operation.</Paragraph>
    <Paragraph position="4"> The estimates of clean filter-bank energies are high-pass filtered and then transformed to obtain the cepstral vector.</Paragraph>
    <Paragraph position="5"> The cepstral vector is then differenced twice to obtain the delta-cepstral vector and the delta-delta-cepstral vector.</Paragraph>
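    <Paragraph> The processing order described above (high-pass filtering of the estimated filter-bank log energies, cepstral transform, then two stages of differencing) can be sketched as follows. This is only an illustration: the DCT-II cepstral transform, the simple first difference used for the delta features, the filter initial conditions, and all function names are assumptions, since the paper does not give these details for DECIPHER TM.

```python
import math


def highpass(channel, c=0.97):
    """y(t) = x(t) - x(t-1) + c * y(t-1); x(-1) = x(0) and y(-1) = 0 are assumed."""
    out, prev_x, prev_y = [], channel[0], 0.0
    for sample in channel:
        prev_y = sample - prev_x + c * prev_y
        prev_x = sample
        out.append(prev_y)
    return out


def cepstrum(log_energies, num_coeffs):
    """DCT-II of one frame of filter-bank log energies (assumed transform)."""
    n = len(log_energies)
    return [
        sum(e * math.cos(math.pi * k * (j + 0.5) / n) for j, e in enumerate(log_energies))
        for k in range(num_coeffs)
    ]


def difference(frames):
    """First difference over time; the first frame is differenced against itself."""
    previous = [frames[0]] + frames[:-1]
    return [[a - b for a, b in zip(cur, prev)] for cur, prev in zip(frames, previous)]


def features(estimated_log_energies, num_coeffs=3):
    """Input: non-empty list of frames, each a list of per-band log energies."""
    # Filter each band along the time axis, then regroup the bands into frames.
    bands = [highpass(list(band)) for band in zip(*estimated_log_energies)]
    filtered_frames = [list(frame) for frame in zip(*bands)]
    ceps = [cepstrum(frame, num_coeffs) for frame in filtered_frames]
    delta = difference(ceps)
    delta_delta = difference(delta)
    return ceps, delta, delta_delta
```

Because the filter runs on each band before the cepstral transform, a fixed per-band offset in the log energies leaves all three feature streams unchanged in this sketch.</Paragraph>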
    <Paragraph position="6"> 3.1. Removal of an Ideal Linear Filter
We first evaluated RASTA filtering by applying a bandpass filter (Figure 2 below) to the test data of a speech recognition task: continuous digit recognition over telephone lines. The filter was applied to the test set only; no filtering was applied to the training data. We compared the resulting performance with the performance on the unfiltered test set for both standard and RASTA processing. As Table 1 shows, RASTA filtering removed the effects of the bandpass filter, whereas the standard system suffered a significant performance degradation. Compared with our standard signal processing, RASTA filtering gave a slight improvement in the female digit error rate, with no significant change in the male digit error rate. With RASTA filtering, the results on the bandpass-filtered telephone speech are comparable to those on the original speech signal.</Paragraph>
    <Paragraph position="7"> [Table 1 caption, truncated] ...ing techniques and RASTA filtering techniques using clean and bandpass-filtered telephone speech.</Paragraph>
  </Section>
  <Section position="6" start_page="280" end_page="281" type="metho">
    <SectionTitle>
4. REDUCED MICROPHONE
DEPENDENCE
</SectionTitle>
    <Paragraph position="0"> After the encouraging initial study, we tested RASTA filtering in a more realistic manner: measuring the performance improvement due to RASTA filtering when dissimilar microphones are used for the test and training data.</Paragraph>
    <Paragraph position="1"> To do this, we recorded 50 sentences (352 words) from one talker simultaneously using two different microphones: a Sennheiser flat-response close-talking microphone of the kind used to train the system, and an Electrovoice 625 handset with a very different frequency characteristic. The user spoke queries for DARPA's ATIS air-travel planning task.</Paragraph>
    <Paragraph position="2"> Table 2 shows that for this speaker, the error rate was less sensitive to the difference in microphone when RASTA filtering was applied than when it was not. Further, there is no evidence from this and the previous study to indicate that RASTA filtering degrades performance when the microphone remains constant.</Paragraph>
  </Section>
  <Section position="7" start_page="281" end_page="282" type="metho">
    <SectionTitle>
5. DESKTOP MICROPHONES
</SectionTitle>
    <Paragraph position="0"> RASTA filtering is most effective when differences between training and testing conditions can be modeled as linear filters. However, many distortions do not fit this model. One example is testing with a desktop microphone using models trained with a close-talking microphone. In this scenario, although the microphones' characteristics may be approximately related by a linear filter, additive noise picked up by the desktop microphone violates the linear-filter assumption.</Paragraph>
    <Paragraph position="1"> To see how important these effects are, we performed recognition experiments on systems trained with Sennheiser microphones and tested with a Crown desktop microphone.</Paragraph>
    <Paragraph position="2"> These test recordings were made at Carnegie Mellon University (CMU) and at the Massachusetts Institute of Technology (MIT). They simultaneously recorded a speaker using both Sennheiser and Crown microphones interacting with an ATIS (air travel planning) system.</Paragraph>
    <Paragraph position="3"> The performance of DECIPHER TM on the ATIS recordings is shown in Tables 3 and 4. Table 3 shows the system performance results on MIT's recordings, while Table 4 contains the system performance results on CMU's recordings. For the MIT recordings, note that the performance of the best system on the Crown microphone data was very close to the performance on the Sennheiser recordings (12.9% vs. 15.1%).</Paragraph>
    <Paragraph position="4"> The addition of RASTA processing did not help the standard processing on the Crown data (the error rate went up slightly from 22.5% to 23.7%), but it did help the noise-robust estimation processing (18.2% to 15.1%).</Paragraph>
    <Paragraph position="5"> The performance on CMU's Crown recordings was much lower. CMU's audio recordings were noticeably noisier; the speaker sounded as if he were much farther from the microphone, and there were other nonstationary sounds in the background. Note that the error rate with the standard signal processing is extremely high (90.7% word error). For the CMU Crown microphone recordings, the addition of RASTA processing helped reduce the error rate for both the standard and noise-robust estimation processing conditions. The NRFE + RASTA processing reduced the error rate by 60% relative to the no-processing condition on the CMU Crown microphone recordings (90.7% to 36.1%).</Paragraph>
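    <Paragraph> The 60% figure is a relative reduction, not a difference in percentage points. A small sketch of the arithmetic, using only error rates quoted in the text (the helper name relative_reduction is introduced here for illustration):

```python
def relative_reduction(baseline_error, new_error):
    """Fractional error-rate reduction relative to the baseline error rate."""
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # CMU Crown recordings: standard processing vs. NRFE + RASTA.
    print(round(100 * relative_reduction(90.7, 36.1), 1))  # 60.2
    # MIT Crown recordings: noise-robust estimation without vs. with RASTA.
    print(round(100 * relative_reduction(18.2, 15.1), 1))  # 17.0
```

A negative return value would indicate that the error rate increased, as in the MIT standard-processing case (22.5% to 23.7%).</Paragraph>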
    <Paragraph position="6"> SRI's noise-robust spectral estimation algorithms are designed to estimate the filter-bank log energies of the clean speech signal when there is additive colored noise. The estimation algorithms were designed to work independently of any spectral shape introduced by the microphone and channel variations. Therefore, some type of additional spectral normalization is required to compensate for these effects: the combined &quot;NRFE + RASTA&quot; system serves that purpose. The RASTA system (without estimation) can help compensate for the linear microphone effects, but it can help only to a limited degree with the nonlinearities introduced by other sounds.</Paragraph>
  </Section>
  <Section position="8" start_page="282" end_page="282" type="metho">
    <SectionTitle>
6. ROBUSTNESS OF REPRESENTATION
TO MICROPHONE VARIATION
</SectionTitle>
    <Paragraph position="0"> To understand the benefit that we have obtained using the different processing techniques, we developed a metric for the robustness of the representation that is separate from speech-recognition performance. The DARPA CSR corpus (Doddington [15]) was used for this evaluation since it contains stereo recordings. By using stereo recordings, we can measure how much the representation changes when the microphone is changed. In this CSR corpus, the first channel of the stereo recordings is always a Sennheiser close-talking microphone. The second recording channel uses one of 15 different secondary microphones.</Paragraph>
    <Paragraph position="1"> Using this stereo database, we can compute the cepstral feature vector on each microphone channel, and compare the two representations to determine the level of invariance provided by the signal processing and representation. The metric that we used for determining the robustness of the representation is called relative distortion and is computed by the following equation.</Paragraph>
    <Paragraph position="3"> The relative distortion for cepstral coefficient C_i is computed by comparing the cepstral value of the first microphone with the same cepstral value computed on the secondary microphone. This average squared difference is then normalized by the variance of this cepstral feature on the two microphones. This metric gives an indication of how much variance there is due to the microphone differences relative to the overall variance of the feature due to phonetic variation. This metric is plotted as a function of the cepstral coefficient for different signal processing algorithms in Figure 3.</Paragraph>
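    <Paragraph> The verbal description of the metric can be turned into a small sketch for a single cepstral coefficient. The exact equation is not reproduced in this text, so the form used here is a guess from the prose: the mean squared difference between the two channels divided by the variance of the coefficient pooled over both channels. The function name relative_distortion and the pooled-variance normalizer are assumptions.

```python
def relative_distortion(primary, secondary):
    """Relative distortion of one cepstral coefficient across two microphones.

    primary, secondary: time-aligned values of the same coefficient from the
    close-talking channel and the secondary-microphone channel.
    Assumption: the normalizer is the variance of the pooled values from both
    channels; the source equation is missing, so the exact form is a guess.
    """
    if len(primary) != len(secondary) or not primary:
        raise ValueError("need two non-empty, equally long sequences")
    mean_sq_diff = sum((a - b) ** 2 for a, b in zip(primary, secondary)) / len(primary)
    pooled = list(primary) + list(secondary)
    mean = sum(pooled) / len(pooled)
    variance = sum((v - mean) ** 2 for v in pooled) / len(pooled)
    return mean_sq_diff / variance  # raises ZeroDivisionError if all values are equal


if __name__ == "__main__":
    print(relative_distortion([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]))  # 0.0
    print(relative_distortion([0.0, 2.0], [2.0, 0.0]))  # 4.0
```

With this definition a value of 0 means the two microphones yield identical values for the coefficient, and larger values mean that microphone-induced differences are large relative to the overall spread of the feature.</Paragraph>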
    <Paragraph position="4"> Figure 3 shows that the RASTA processing helps reduce the distortion in the lower-order cepstral coefficients. When combined with SRI's noise-robust spectral estimation algorithms, the distortion decreases even further for the lower-order cepstral coefficients. Neither of the algorithms helps reduce the distortion for the higher-order cepstral coefficients.</Paragraph>
    <Paragraph position="5"> This metric indicates that even though the robust signal processing has reduced the recognition error rate due to microphone differences, there is still considerable variation in the cepstral representation when the microphone is changed.</Paragraph>
  </Section>
</Paper>