<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1065">
  <Title>ADAPTATION TO NEW MICROPHONES USING TIED-MIXTURE NORMALIZATION</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Interactive speech recognition systems are usually trained on substantial amounts of speech data collected with a high quality close-talking microphone. During recognition, these systems require the same type of microphone to be used in order to achieve their standard accuracy. This is a highly restricting condition for practical applications of speech recognition systems. One can easily imagine situations where it would be desirable to use a different microphone for recognition than the one with which the training speech was collected.</Paragraph>
    <Paragraph position="1"> For example, some users may not want to wear a head-mounted microphone. Others may not want to pay for a high quality microphone. Additionally, many applications involve recognition of speech over telephone lines and telephone sets with high variability in quality and characteristics. However, we know that even highly accurate speech recognition systems perform very poorly when they are tested with microphones whose characteristics differ from those they were trained on \[1\].</Paragraph>
    <Paragraph position="2"> There is a wide range of approaches to compensating for this degradation in performance, including: Retrain the HMMs with data collected with the new microphone encountered during the recognition stage (a rather expensive approach for real applications), or train on a large number of microphones in the hope that the system will acquire the necessary robustness.</Paragraph>
    <Paragraph position="3"> Use robust signal processing algorithms.</Paragraph>
    <Paragraph position="4"> Develop a feature transformation that maps the alternate microphone data to training microphone data.</Paragraph>
    <Paragraph position="5"> Use statistical methods in order to adapt the parameters of the acoustic models.</Paragraph>
    <Paragraph position="6"> In previous work we discussed the use of Cepstrum Mean Subtraction and the RASTA algorithm as two simple signal processing algorithms to compensate for the degradation caused by an alternate channel \[7\]. In this paper, we present an approach to feature mapping by modeling the difference between the test and the training microphone prior to recognition.</Paragraph>
    <Paragraph position="7"> We have developed the Tied-Mixture Normalization algorithm, a technique for adaptation to a new microphone based on modifying the continuous densities in a tied-mixture HMM system, using a relatively small amount of stereo training speech. This method is presented in detail in Section 2. In Section 3 we describe several experiments on a known-microphone task and the effect of the adaptation method on the performance of the recognition system.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="326" type="metho">
    <SectionTitle>
2. TIED MIXTURE NORMALIZATION
</SectionTitle>
    <Paragraph position="0"> In a Tied-Mixture Hidden Markov Model (TM-HMM) system \[2, 6\], speech is represented using an ensemble of Gaussian mixture densities. Every frame of speech is modeled as a Gaussian mixture. Specifically, the probability density function for an observation conditioned on the HMM state is expressed as:</Paragraph>
    <Paragraph position="2"> where x_t, s_t, C, c_k, μ_k, and Σ_k are the observed speech frame at time t, the HMM state at time t, the number of clusters of the codebook, and, for the k-th mixture density, the mixture weight, the mean, and the covariance matrix, respectively.</Paragraph>
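The displayed density itself was lost in extraction; from the symbols defined above, the standard tied-mixture form (writing the weights as state-dependent, which is the usual tied-mixture convention) would be:

```latex
p(x_t \mid s_t) \;=\; \sum_{k=1}^{C} c_k(s_t)\,\mathcal{N}\!\left(x_t;\, \mu_k, \Sigma_k\right)
```

The Gaussian codebook {μ_k, Σ_k} is shared (tied) across all states; only the weights c_k(s_t) are state-specific.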
    <Paragraph position="3"> The vector quantization (VQ) codebook, which consists of these mean vectors and covariance matrices, was derived from a subset of the training data; it is therefore mostly characteristic of the location and distribution of the training data, and of the training microphone, in the acoustic space.</Paragraph>
    <Paragraph position="4"> However, if the codebook had been created with data collected with some other microphone, then, owing to the additive and convolutional effects on speech specific to that microphone, the data would be distributed differently in the acoustic space, and the ensemble of means and covariances of the codebook would reflect the characteristics of the new microphone. This is the case when the training and testing microphones are mismatched. Without any compensation, we quantize the test data, recorded with the new microphone, using the mixture codebook generated from recordings with the training microphone. This inevitably results in a degradation in performance, since the codebook does not model the test data.</Paragraph>
    <Paragraph position="5"> We introduce a new algorithm, called Tied-Mixture Normalization (TMN), to compute the codebook transformation from the training microphone to the new test microphone. The TMN algorithm requires a relatively small amount of stereo speech adaptation data, recorded with both the microphone used for training (the primary microphone) and the new microphone (the alternate microphone). Using this stereo data, we can adapt the existing HMM model to work well on the new test condition despite the mismatch with the training.</Paragraph>
    <Paragraph position="6"> Figure 1 provides a schematic description of the TMN algorithm. We assume that we have a tied-mixture densities codebook (a set of Gaussian distributions), derived from a subset of the training data that was recorded with the primary microphone. We quantize the adaptation data from the primary channel and label each frame of speech with the index of the most likely Gaussian distribution in the tied-mixture codebook. Since there is a one-to-one correspondence between data of the primary and alternate channels, we use the VQ indices of the frames of the primary channel to label the corresponding frames of the alternate channel. Then, for each of the VQ clusters, from all the frames of the alternate microphone data with the same VQ label, we compute the sample mean and the sample covariance of the cepstrum vectors, which represent a possible shift and scaling of this cluster in the acoustic space (Fig. 2). These are the new means and covariances of the Gaussian distributions of the new normalized codebook.</Paragraph>
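The estimation step described above can be sketched as follows. This is an illustrative reconstruction, not BBN's implementation: the function names are ours, and diagonal covariances are assumed for brevity where the paper's codebook uses general Gaussian densities.

```python
# Illustrative sketch of the basic (unsmoothed) TMN estimation step.
# Assumption: diagonal covariances; names vq_label / tmn_normalize are ours.
import numpy as np

def vq_label(frames, means, variances):
    """Label each frame with the index of its most likely Gaussian."""
    diff = frames[:, None, :] - means[None, :, :]              # (T, C, D)
    log_lik = -0.5 * np.sum(diff**2 / variances[None]
                            + np.log(variances[None]), axis=2)  # (T, C)
    return np.argmax(log_lik, axis=1)                           # (T,)

def tmn_normalize(primary, alternate, means, variances):
    """Re-estimate the codebook from stereo adaptation data.

    Primary-channel frames are quantized with the original codebook; the
    same labels are transferred to the time-aligned alternate-channel
    frames, whose per-cluster sample statistics form the new codebook."""
    labels = vq_label(primary, means, variances)
    new_means, new_vars = means.copy(), variances.copy()
    for k in range(len(means)):
        sel = alternate[labels == k]
        if len(sel) > 1:                     # clusters with too few frames
            new_means[k] = sel.mean(axis=0)  # keep their old parameters
            new_vars[k] = sel.var(axis=0) + 1e-6
    return new_means, new_vars
```

With a purely convolutional (stationary) channel difference, the re-estimated means are the original cluster means shifted by the channel offset.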
    <Paragraph position="8"> [Figure 2 caption fragment: "...scaled version of the original codebook".] The new Gaussian densities are used in conjunction with the mixture weights c_k (sometimes called the discrete probabilities) of the original model to compute the observation probability density function as expressed previously.</Paragraph>
    <Paragraph position="9"> One possible weakness of the TMN algorithm is that each cluster of the original codebook is transformed independently of all the others. This goes against our intuition that a codebook transformation, due to different microphone characteristics, should maintain continuity between adjacent codebook clusters and shift all the clusters in the same general direction. Additionally, a potential problem arises when a particular cluster does not have enough samples to compute its statistics. Hence, by modeling each codebook cluster independently, we may fail to estimate the correct transformation because of insufficient or distorted data. To alleviate this problem we use the following approach, originally suggested for speaker adaptation \[4\]: when the centroid of the i-th codebook cluster is denoted by m_i and that of the corresponding cluster of the transformed (alternate-microphone) codebook by μ_i, the deviation vector between these two centroids is d_i = μ_i - m_i, i = 1, 2, ..., C (1), where C is the size of the codebook. For each cluster i, the deviation vectors of all clusters {d_k} are summed with weighting factors {w_ik} to produce the shift vector Δ_i:</Paragraph>
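The displayed equation that followed was lost in extraction; a plausible reconstruction from the surrounding definitions (the normalization of the weights is our assumption, not stated in the text) is:

```latex
\Delta_i \;=\; \frac{\sum_{k=1}^{C} w_{ik}\, d_k}{\sum_{k=1}^{C} w_{ik}},
\qquad i = 1, 2, \ldots, C
```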
    <Paragraph position="11"> The weighting factor w_ik is the probability P(m_k | m_i) of centroid m_k of the original codebook belonging to the i-th cluster, raised to the power α. This weight is a measure of vicinity among clusters, and the exponentiation controls the amount of smoothing between the clusters. Finally, the centroid c_i of the i-th cluster of the transformed codebook is:</Paragraph>
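The final displayed equation was likewise lost; given the text, the transformed centroid is presumably the original centroid offset by the smoothed shift:

```latex
c_i \;=\; m_i \;+\; \Delta_i, \qquad i = 1, 2, \ldots, C
```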
    <Paragraph position="13"> Similarly, the covariances of the clusters of the new codebook are computed as weighted averages over all the sample covariances computed in the first implementation of TMN.</Paragraph>
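Under the same caveats as before (hypothetical names, diagonal covariances, and an assumed normalization of the weights), the smoothed version of the mean update might look like:

```python
# Hedged sketch of the smoothed TMN shift (after the speaker-adaptation
# idea of [4]); the weight normalization is an assumption.
import numpy as np

def smoothed_codebook(means, variances, tmn_means, alpha=2.0):
    """Smooth per-cluster shifts across neighboring clusters.

    means, variances : original codebook (C, D), diagonal covariances
    tmn_means        : per-cluster sample means from the first TMN pass
    w_ik is the normalized likelihood of centroid m_k under cluster i's
    Gaussian, raised to the power alpha, so nearby clusters share their
    shifts and the transformation stays continuous."""
    d = tmn_means - means                           # deviation vectors d_k
    diff = means[None, :, :] - means[:, None, :]    # diff[i, k] = m_k - m_i
    log_p = -0.5 * np.sum(diff**2 / variances[:, None, :]
                          + np.log(variances[:, None, :]), axis=2)
    w = np.exp(alpha * (log_p - log_p.max(axis=1, keepdims=True)))
    w /= w.sum(axis=1, keepdims=True)               # assumed normalization
    delta = w @ d                                   # Delta_i = sum_k w_ik d_k
    return means + delta                            # c_i = m_i + Delta_i
```

A large α concentrates each weight row on its own cluster (little smoothing); a small α spreads the shift over neighbors, which is what rescues sparsely populated clusters.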
    <Paragraph position="14"> The test data comes from the development and evaluation sets of Spoke 6 of the WSJ1 corpus and consists of stereo recordings with the Sennheiser microphone and either the Audio-Technica microphone or a telephone handset over external telephone lines. Adaptation data was supplied separately, consisting of a total of 800 stereo-recorded utterances from 10 speakers: 400 sentences recorded simultaneously with the Sennheiser and the Audio-Technica, and 400 sentences recorded with the Sennheiser and the telephone handset.</Paragraph>
    <Paragraph position="15"> We evaluated the TMN algorithm for each of the two new microphones, and we present the results on the development and the 1993 ARPA WSJ official evaluation test sets.</Paragraph>
    <Paragraph position="16"> 3.1. Audio-Technica (AT) Microphone We applied the TMN algorithm, as described in Section 2, to the 400 adaptation sentences simultaneously recorded with the Sennheiser and the Audio-Technica (AT) microphones, to compute the codebook transformation for the alternate microphone. For the evaluation of the system, the comparative experiments include:</Paragraph>
  </Section>
  <Section position="5" start_page="326" end_page="327" type="metho">
    <SectionTitle>
3. DESCRIPTION OF EXPERIMENTS
</SectionTitle>
    <Paragraph position="0"> In this section we describe the results we obtained applying the TMN algorithm to Spoke 6 of the Wall Street Journal (WSJ) speech corpus. This is the known-alternate-microphone, 5000-word closed-vocabulary, speaker-independent speech recognition task. It addresses two different alternate microphones: the Audio-Technica 853a, a high quality directional, stand-mounted microphone, and a standard telephone handset (the AT&amp;T 720 speaker phone). The adaptation and test database includes simultaneous recordings of high quality speech using the primary microphone (Sennheiser HMD-414 head-mounted microphone with a noise-canceling element) and speech recorded with each of the two alternate microphones.</Paragraph>
    <Paragraph position="1"> Recognition on the Sennheiser-recorded portion of the test data, to assess the lower bound on the error rate that the baseline system can achieve with matched training and testing microphones.</Paragraph>
    <Paragraph position="2"> Recognition on the Audio-Technica-recorded portion of the test data, to assess the degradation in the performance of the baseline system under the mismatched condition, when no adaptation is used other than the standard cepstrum mean subtraction.</Paragraph>
    <Paragraph position="3"> Recognition on the Audio-Technica-recorded portion of the test data using the proposed adaptation scheme, to determine the improvement in system performance due to the adaptation algorithm.</Paragraph>
    <Paragraph position="4"> All of the experiments that will be described were performed using the BBN BYBLOS speech recognition system \[3\]. The front end of the system computes steady-state, first- and second-order derivative Mel-frequency cepstral coefficients (MFCC) and energy features over an analysis range of 80 to 6000 Hz. Cepstrum mean subtraction is a standard feature of the system, used to compensate for the unknown channel transfer function: we compute the sample mean of the cepstrum vector over the utterance, and then subtract this mean from the cepstrum vector at each frame. No distinction is made between speech and non-speech frames. The acoustic models are trained on 62 hours of speech (37,000 sentences) from the WSJ0 and WSJ1 corpora, collected from 37 speakers with the Sennheiser high quality close-talking microphone. The recognition is done using trigram language models. In Table 1, we list the word error rates for these experiments. The mismatch between the Audio-Technica and the Sennheiser microphone does not cause a serious degradation, even when no adaptation is used to account for the channel mismatch. The TMN adaptation reduces the additional degradation due to the channel mismatch by about a factor of 2 in both test sets.</Paragraph>
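The cepstrum mean subtraction step described above is simple enough to sketch directly; the function name is ours, not BYBLOS's.

```python
# Minimal sketch of utterance-level cepstrum mean subtraction (CMS).
# Every frame, speech and non-speech alike, contributes to the mean,
# mirroring the description in the text.
import numpy as np

def cepstrum_mean_subtraction(cepstra):
    """Subtract the per-utterance sample mean from each cepstral frame.

    A fixed channel filter adds a constant offset to every frame's
    cepstrum, so removing the utterance mean removes any stationary
    channel effect. cepstra: (n_frames, n_coeffs)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```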
    <Section position="1" start_page="327" end_page="327" type="sub_section">
      <SectionTitle>
3.2. Telephone Speech
</SectionTitle>
      <Paragraph position="0"> The telephone handset (TH) differs radically from the other two microphones, its main characteristic being that it passes a much narrower band of frequencies. Therefore, prior to applying any adaptation scheme, we chose to bandlimit the Sennheiser training data to 300-3300 Hz, to create new bandlimited phonetic word models. This was accomplished by retaining only the DFT coefficients of the feature analysis in the range 300-3300 Hz when computing the MFCC coefficients. We bandlimited the stereo adaptation and test data in the same way. We applied the TMN algorithm on the bandlimited adaptation data to compute the codebook transformation for the telephone speech. During testing, the data is bandlimited as described and quantized using the normalized telephone codebook. In evaluating the adaptation algorithm for the telephone speech we performed the same series of experiments as with the Audio-Technica microphone. We consider the system using full-bandwidth phonetic models as the baseline, and the generation of bandlimited phonetic models as part of the scheme for adaptation to the telephone speech. In Table 2 we list the word error rates for these experiments. The training data were collected with the primary microphone and comprise the WSJ0 and WSJ1 corpora, with 12 and 50 hours of recorded speech respectively. We trained two sets of phonetic models, using the WSJ0 corpus and the combined WSJ0+WSJ1 training data, to determine the impact of additional training data collected with the primary microphone.</Paragraph>
      <Paragraph position="1"> Bandlimited phonetic models: Determine the effect of bandlimiting separately from, and in combination with, the TMN algorithm.</Paragraph>
      <Paragraph position="2"> TMN Adaptation: Determine the effect of the TMN algorithm separately from, and in combination with, bandlimiting.</Paragraph>
      <Paragraph position="3"> The results are shown in Tables 3 and 4. We have no clear explanation for the surprising result that additional training speech recorded with a high quality microphone improves the performance of the system on telephone speech; nevertheless, the error rate is reduced by a factor of 2 for some conditions by adding 50 hours of high quality recorded training speech. Furthermore, bandlimiting is essential for good performance of the system on telephone speech, as it reduces the error rate by a factor of 2 under all conditions. As a contrast, we also computed the error rate of the WSJ0+WSJ1 bandlimited system on the bandlimited Sennheiser-recorded portion of the test data and found it to be 11.0%. Comparing this result with 8.9% (Table 2), the error rate of the full-bandwidth system on the same speech, implies that most of the loss in performance between recognizing high-quality Sennheiser recordings and telephone speech is due to the loss of information outside the telephone bandwidth. Using the telephone bandwidth, switching from the high-quality Sennheiser microphone to the telephone handset increases the error rate only by a small factor, from 11.0% to 13.9%. Finally, the effect of the TMN algorithm is much more significant when the telephone bandwidth is not used.</Paragraph>
      <Paragraph position="4"> The degradation due to the mismatch between the Sennheiser-recorded speech and the telephone speech is severe (the error rate goes from 8.9% to 29.5%). The combined effect of bandlimiting the data and the TMN adaptation reduces the error rate by a factor of 2.3, bringing the error rate for recognition of telephone speech close to that for high quality microphone recordings.</Paragraph>
      <Paragraph position="5"> Since the telephone speech is radically different from speech collected with the primary microphone, we conducted additional experiments to assess separately the contributions of the bandlimiting process, the adaptation algorithm, and the amount of training data to the performance of the system. Specifically, we tested the following conditions: models trained on speech recorded with the primary microphone, tested on the WSJ Spoke 6 development test set telephone recordings.</Paragraph>
    </Section>
  </Section>
</Paper>