<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1067"> <Title>MICROPHONE-INDEPENDENT ROBUST SIGNAL PROCESSING USING PROBABILISTIC OPTIMUM FILTERING</Title> <Section position="3" start_page="0" end_page="336" type="metho"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> In many practical situations an automatic speech recognizer has to operate in several different but well-defined acoustic environments. For example, the same recognition task may be implemented using different microphones or transmission channels. In this situation it may not be practical to recollect a speech corpus to train the acoustic models of the recognizer. To alleviate this problem, we propose an algorithm that maps speech features between two acoustic spaces. The models of the mapping algorithm are trained using a small database recorded simultaneously in both environments.</Paragraph> <Paragraph position="1"> In the case of steady-state additive homogenous noise, we can derive a MMSE estimate of the clean speech filterbank-log energy features using a model for how the features change in the presence of this noise \[6-7\]. In these algorithms, the estimated speech spectrum is a function of the global spectral signal-to-noise ratio (SNR), the instantaneous spectral SNR, and the overall spectral shape of the speech signal. However, after studying simultaneous recordings made with two microphones, we befieve that the relationship between the two simultaneous features is nonlinear. We therefore propose to use a piecewise-nonlinear model to relate the two feature spaces.</Paragraph> <Paragraph position="2"> 1.1. Related Work on Feature Mapping Several algorithms in the literature have focused on experimentally training a mapping between the noisy features and the clean features \[8-13\]. The proposed algorithm differs from previous algorithms in several ways: with two different microphones. Once trained, the mapping parameters are fixed.</Paragraph> <Paragraph position="3"> * The algorithm can either map noisy speech features to clean features during training, or clean features to noisy features during recognition.</Paragraph> <Paragraph position="4"> 1.2. Related Work on Adaptation The algorithm used to map the incoming features into a more robust representation has some similarities to work on model adaptation. Some of the high-level differences between hidden Markov model (HMM) adaptation and the mapping algorithms proposed in this paper are: * The mapping algorithm works by primarily correcting shifts in the mean of the feature set that are correlated with observable information. Adapting HMM model parameters has certain degrees of freedom that the mapping algorithm does not have- for example the ability to change state variances, and mixture weights.</Paragraph> <Paragraph position="5"> difionaUy difficult to incorporate into HMM models and into adaptation algorithms. These include observations that span across several frames and the correlation of the state features with global characteristics of the speech waveform. These two techniques are not mutually exclusive and can be used together to achieve robust speech recognition performance. The boundary between these two techniques can be blurred when the mapping algorithm is dependent on the speech recognizer's hypothesis.</Paragraph> </Section> <Section position="4" start_page="336" end_page="337" type="metho"> <SectionTitle> 2. 
</Section> <Section position="4" start_page="336" end_page="337" type="metho"> <SectionTitle> 2. THE POF ALGORITHM </SectionTitle> <Paragraph position="0"> The mapping algorithm is based on a probabilistic piecewise-nonlinear transformation of the acoustic space that we call Probabilistic Optimum Filtering (POF). Let us assume that the recognizer is trained with data recorded with a high-quality close-talking microphone (clean speech), and the test data is acquired in a different acoustic environment (noisy speech). Our goal is to estimate the clean feature vector $\hat{x}_n$ given its corresponding noisy feature vector $y_n$, where $n$ is the frame index. (A list of symbols is shown in Table 1.) To estimate the clean vector we vector-quantize the clean feature space into $I$ regions using the generalized Lloyd algorithm [14]. Each VQ region is assigned a multidimensional transversal filter (see Figure 1). The error between the clean vector and the estimated vector produced by the i-th filter is given by

$$e_{n,i} = x_n - \hat{x}_{n,i} = x_n - W_i Y_n \quad (1)$$

where $e_{n,i}$ is the error associated with region $i$, $W_i$ is the filter coefficient matrix, and $Y_n$ is the tapped-delay line of the noisy vectors. Expanding these matrices we get

$$W_i = [A_{-p,i} \; \cdots \; A_{0,i} \; \cdots \; A_{p,i} \;\; b_i] \quad (2)$$

$$Y_n = [y_{n-p}^T \; \cdots \; y_{n-1}^T \; y_n^T \; y_{n+1}^T \; \cdots \; y_{n+p}^T \;\; 1]^T \quad (3)$$

where the $A_{k,i}$ are the multiplicative tap matrices and $b_i$ is the additive tap.</Paragraph> <Paragraph position="1"> The conditional error in each region is defined as

$$E_i = \sum_{n=0}^{N-1} p(g_i \mid z_n) \, e_{n,i}^T e_{n,i} \quad (4)$$

where $p(g_i \mid z_n)$ is the probability that the clean vector $x_n$ belongs to region $g_i$ given an arbitrary conditioning noisy feature vector $z_n$. Note that the conditioning noisy feature can be any acoustic vector generated from the noisy speech frame. For example, it may include an estimate of the SNR, energy, cepstral energy, cepstrum, and so forth.</Paragraph> <Paragraph position="2"> The conditional probability density function $p(z_n \mid g_i)$ is modeled as a mixture of $I$ Gaussian distributions. Each Gaussian distribution models a VQ region. The parameters of the distributions (mean vectors and covariance matrices) are estimated using the corresponding $z_n$ vectors associated with that region. The posterior probabilities $p(g_i \mid z_n)$ are computed using Bayes' theorem, and the mixture weights $P(g_i)$ are estimated using the relative number of training clean vectors that are assigned to a given VQ region.</Paragraph>

  $\dim(x_n)$ : feature vector size
  $\dim(z_n)$ : conditioning feature vector size
  $N$ : number of training frames
  $I$ : number of VQ regions
  $p$ : maximum filter delay
  $e_{n,i}$ : estimation error vector
  $x_n$ : clean feature vector
  $\hat{x}_n$ : estimate of the clean feature vector
  $y_n$ : noisy feature vector
  $z_n$ : conditioning noisy feature vector
  $\mu_i$ : mean vector of Gaussian $i$
  $\Sigma_i$ : covariance matrix of Gaussian $i$
  $W_i$ : transversal filter coefficient matrix
  $Y_n$ : tap input vector
  $A_{k,i}$ : multiplicative tap matrix
  $b_i$ : additive tap matrix
  $R_i$ : auto-correlation matrix
  $r_i$ : cross-correlation matrix

Table 1: List of symbols
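<Paragraph> The posterior computation just described can be sketched as follows. This is a minimal illustration assuming a single full-covariance Gaussian per VQ region and the relative-count mixture weights described above; the function and variable names are ours, not the paper's. </Paragraph>

import numpy as np
from scipy.stats import multivariate_normal

def fit_region_gaussians(z, labels, n_regions):
    """z: (N, d) conditioning vectors; labels: VQ region index per frame."""
    params = []
    for i in range(n_regions):
        zi = z[labels == i]
        mu = zi.mean(axis=0)                 # mean vector of Gaussian i
        sigma = np.cov(zi, rowvar=False)     # covariance matrix of Gaussian i
        weight = len(zi) / len(z)            # P(g_i): relative region count
        params.append((mu, sigma, weight))
    return params

def region_posteriors(z_n, params):
    """p(g_i | z_n) for one conditioning vector, via Bayes' theorem."""
    joint = np.array([
        w * multivariate_normal.pdf(z_n, mean=mu, cov=sigma, allow_singular=True)
        for mu, sigma, w in params
    ])
    return joint / joint.sum()   # normalize P(g_i) p(z_n | g_i) over regions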
<Paragraph position="3"> To compute the optimum filters in the mean-squared-error sense, we minimize the conditional error in each VQ region. The minimum mean-squared error solution is obtained by taking the gradient of $E_i$ defined in Eq. (4) with respect to the filter coefficient matrix and equating all the elements of the gradient matrix to zero. As a result, the optimum filter coefficient matrix has the form

$$W_i = R_i^{-1} r_i \quad (5)$$

where

$$R_i = \sum_{n=0}^{N-1} p(g_i \mid z_n) \, Y_n Y_n^T \quad (6)$$

is a probabilistic nonsingular auto-correlation matrix, and

$$r_i = \sum_{n=0}^{N-1} p(g_i \mid z_n) \, Y_n x_n^T \quad (7)$$

is a probabilistic cross-correlation matrix.</Paragraph> <Paragraph position="4"> The algorithm can be completely trained without supervision and requires no additional information other than the simultaneous waveforms.</Paragraph> <Paragraph position="5"> The run-time estimate of the clean feature vector is computed by integrating the outputs of all the filters as follows:

$$\hat{x}_n = \sum_{i=1}^{I} p(g_i \mid z_n) \, W_i Y_n \quad (8)$$

</Paragraph> </Section> </Paper>
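Read together with the posterior sketch above, the following minimal Python sketch shows Eqs. (5)-(8) in action: posterior-weighted correlation accumulation, the solve for each filter, and the run-time posterior-weighted combination of filter outputs. All names are illustrative; each W[i] is stored as Eq. (5) produces it, shape (taps, clean_dim), so it is applied as Yn @ W[i], and the edge handling of the tapped-delay line is our assumption, not the paper's.

import numpy as np

def delay_line(y, n, p):
    """Tap input Y_n = [y_{n-p}; ...; y_{n+p}; 1]; edges are clamped
    (repeating the first/last frame is our assumption, not the paper's)."""
    taps = [y[min(max(n + k, 0), len(y) - 1)] for k in range(-p, p + 1)]
    return np.concatenate(taps + [np.ones(1)])

def train_pof_filters(x, y, posts, p=1):
    """Solve W_i = R_i^{-1} r_i, Eqs. (5)-(7), from posterior-weighted sums.

    x, y: (N, dim) simultaneous clean/noisy features; posts[n, i] = p(g_i | z_n).
    """
    N, n_regions = posts.shape
    d = (2 * p + 1) * y.shape[1] + 1
    R = np.zeros((n_regions, d, d))             # probabilistic auto-correlations
    r = np.zeros((n_regions, d, x.shape[1]))    # probabilistic cross-correlations
    for n in range(N):
        Yn = delay_line(y, n, p)
        for i in range(n_regions):
            R[i] += posts[n, i] * np.outer(Yn, Yn)
            r[i] += posts[n, i] * np.outer(Yn, x[n])
    # R_i is nonsingular in the paper's formulation, so solve directly.
    return np.array([np.linalg.solve(R[i], r[i]) for i in range(n_regions)])

def estimate_clean(y, posts, W, p=1):
    """Run-time estimate x_hat_n = sum_i p(g_i | z_n) W_i Y_n, Eq. (8)."""
    out = np.zeros((len(y), W.shape[2]))
    for n in range(len(y)):
        Yn = delay_line(y, n, p)
        out[n] = sum(posts[n, i] * (Yn @ W[i]) for i in range(W.shape[0]))
    return out

Here posts would be filled row by row with region_posteriors from the previous sketch, computed on whatever conditioning vectors z_n are chosen (SNR estimates, energy, cepstrum, and so on).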