<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1068"> <Title>vIicrophone Arrays and Neural Networks for Robust Speech Recognition</Title> <Section position="5" start_page="342" end_page="343" type="metho"> <SectionTitle> 3. SYSTEM OF MICROPHONE ARRAYS AND NEURAL NETWORKS </SectionTitle> <Paragraph position="0"> Figure 2 schematically shows the overall system design for robust speech recognition in variable acoustic environments, in-</Paragraph> <Section position="1" start_page="342" end_page="342" type="sub_section"> <SectionTitle> 3.1. Beamforming Microphone Arrays </SectionTitle> <Paragraph position="0"> As the distance between microphones and talker increases, the effects of room reverberation and ambient noise become more prominent. Previous studies have shown that beamforming/matched-fllter array microphones are effective in counteracting environmental interference. Microphone arrays can improve sound quality of the captured signal, and avoid hand-held, body-worn, or tethered equipment that might encumber the talker and restrict movement.</Paragraph> <Paragraph position="1"> The microphone array we use here is a one-dimensional beamforming line array. It uses direct-path arrivals to produce a slngle-beam delay-and-sum beamformer \[1, 2\]. (The talker typically faces the center of the llne array.) The array consists of 33 omni-direetlonal sensors, which are nonuniformly positioned (nested over three octaves). From Figure 1 it is seen that the wavefozm of the array resembles that of the close-talking microphone more than the desk-mounted microphone. null</Paragraph> </Section> <Section position="2" start_page="342" end_page="343" type="sub_section"> <SectionTitle> 3.2. Neural Network Processors </SectionTitle> <Paragraph position="0"> One of the neural network processors we have explored, is based on multi-layer perceptrons (MLP). The MLP has 3 lay-</Paragraph> </Section> </Section> <Section position="6" start_page="343" end_page="343" type="metho"> <SectionTitle> TRAINING USING THE STANDARD BACKPROPAGATION ONE HIDDEN LAYER WITH 4 SIGMOID NEURONS </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> stral coefficients of array speech to those of close-talking speech.</Paragraph> <Paragraph position="3"> ers. The input layer has 9 nodes, covering the current speech frame and four preceding and four following frames, as indicated in Figure 3. There are 4 sigmoid nodes in the hidden layer and 1 linear node in the output layer. 13 such MLP's are included, with one for each of the 13 cepstrum coefficients used in the SPHINX speech recognizer \[14\]. (Refer also to backpropagation method when microphone-array speech and close-talking speech are both available (see Figure 3). It is found that 10-seconds of continuous speech material are sufficient to train the neural networks and allow them to &quot;learn&quot; the acoustic environment. In the present study, the neural nets are trained in a speaker-dependent mode; That is, 13 different neural networks (one for each cepstrum coefficient) are dedicated to each subject 1. The trained networks are then utilized to transform cepstrum coefficients of array speech to those of close-talking speech, which are then used as inputs to the SPHINX speech recognizer.</Paragraph> </Section> <Section position="7" start_page="343" end_page="343" type="metho"> <SectionTitle> 4. 
<Section position="7" start_page="343" end_page="343" type="metho"> <SectionTitle> 4. EVALUATION RESULTS WITH SPHINX RECOGNIZER </SectionTitle> <Paragraph position="0"> As a baseline evaluation, recognition performance is measured on the command-word subset of the CAIP database.</Paragraph> <Paragraph position="1"> Performance is assessed for matched and unmatched testing/training conditions and includes both the pretrained and the retrained SPHINX systems.</Paragraph> <Paragraph position="2"> The results for the pretrained SPHINX are given in Table 1. The table covers four processing conditions: (i) close-talking; (ii) line array; (iii) line array with mean subtraction (MNSUB) [15]; and (iv) line array with the neural network processor (NN).</Paragraph> <Paragraph position="3"> [Table 1: recognition results (percent correct) using the pretrained SPHINX speech recognizer.]</Paragraph> <Paragraph position="4"> Table 2 gives the results for the retrained SPHINX under five processing conditions: (i) close-talking; (ii) line array; (iii) desk-mounted microphone; (iv) line array with mean subtraction (MNSUB); and (v) line array with the neural network processor (NN). The SPHINX speech recognizer is retrained using the CAIP speech corpus to eliminate mismatches between the collection conditions of the Resource Management task (on which the original SPHINX system was trained) and those of the CAIP speech database.</Paragraph> <Paragraph position="5"> As shown in Tables 1 and 2, the array-neural net system is capable of elevating the word accuracy of the speech recognizer. For the retrained SPHINX, the microphone array and neural network system improves word accuracy from 21% to 85% for distant talking under reverberant conditions. The mean subtraction method, on the other hand, improves performance only marginally under these conditions.</Paragraph> <Paragraph position="6"> It is also seen from Table 2 that if the SPHINX system is retrained with array speech recorded at a distance of 3 meters, the performance is as high as 82%. This figure, obtained under a matched training/testing condition, is nevertheless lower than that obtained under an unmatched training/testing condition with the microphone array and neural network. Similar results have been achieved for speaker identification [9, 10].</Paragraph> </Section> <Section position="8" start_page="343" end_page="343" type="metho"> <SectionTitle> 5. EVALUATION RESULTS WITH DTW RECOGNIZER </SectionTitle> <Paragraph position="0"> To assess the capability of microphone arrays and neural network equalizers more effectively and efficiently, a DTW-based speech recognizer is implemented [12]. The back end of DTW classification is simple, and hence the results are not influenced by the complex back end of an HMM-based recognizer, with its language models and word-pair grammars.</Paragraph> <Paragraph position="1"> [Table 2: recognition results (percent correct) using the retrained SPHINX recognizer, based on the CAIP speech database.]</Paragraph> <Paragraph position="2"> [Figure 4: word recognition accuracy as a function of the number of iterations when training the neural network processor.]</Paragraph> <Paragraph position="3"> The DTW recognizer is applied to recognition of the command words. End-points of close-talking speech are automatically determined by the two-level approach [11]. Attempts have also been made to detect end-points of array speech automatically [13], but in the present paper the starting and ending points are inferred from the simultaneously recorded close-talking speech, with an additional delay accounting for wave propagation. The DTW recognizer is speaker dependent and is trained using close-talking speech. The measured features are 12th-order LPC-derived cepstral coefficients computed over 16-msec frames. Each frame is Hamming-windowed, and consecutive windows overlap by 8 msec. The DTW recognizer is tested on array speech (with both the originally computed and the neural-network-corrected cepstral coefficients) and on a second set of close-talking recordings. The Euclidean distance is used as the distortion measure.</Paragraph>
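To make the DTW back end concrete, the following is a minimal sketch of the kind of template matching described above; it is not the recognizer of [12]. The frame features are assumed to be the 12-dimensional LPC-derived cepstra just described, the local distance is Euclidean, and the symmetric step pattern and length normalization are illustrative choices.

```python
# Minimal DTW classification sketch: align a test token against each stored
# close-talking template and pick the command word with the smallest
# accumulated distortion. Names and defaults are illustrative.
import numpy as np

def dtw_distance(test, template):
    """test, template: (n_frames, 12) cepstral sequences; returns DTW distortion."""
    n, m = len(test), len(template)
    # Pairwise Euclidean distances between frames.
    local = np.linalg.norm(test[:, None, :] - template[None, :, :], axis=-1)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Symmetric step pattern: diagonal match, insertion, deletion.
            D[i, j] = local[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m] / (n + m)          # length-normalized distortion

def recognize(test, templates):
    """templates: dict mapping command word -> close-talking cepstral sequence."""
    return min(templates, key=lambda word: dtw_distance(test, templates[word]))
```

A speaker-dependent recognizer in this style simply stores one or more close-talking templates per command word and returns the word whose template yields the lowest distortion for the test token.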
<Paragraph position="4"> The recognition results, pooled over 10 male speakers, are presented in Table 3. The configuration of the MLP used in this DTW-based evaluation differs from that in Section 4. A single MLP with no window-sliding is now used to transform all 12 cepstral coefficients of array speech to those of close-talking speech collectively. The MLP has 40 hidden nodes and 12 output nodes.</Paragraph> <Paragraph position="5"> The network is again trained in a speaker-dependent mode with the standard backpropagation algorithm. The learning rate is set to 0.1 and the momentum term to 0.5. The backpropagation procedure terminates after 5000 iterations (epochs).</Paragraph> <Paragraph position="6"> It can be seen that the results in Table 3 are similar to those in Tables 1 and 2. The use of microphone arrays and neural networks elevates the DTW word accuracy from 34% to 94% under reverberant conditions. The elevated accuracy is close to that obtained for close-talking speech (98%).</Paragraph> <Paragraph position="7"> [Table 3: recognition results obtained with the DTW classification algorithms.]</Paragraph> <Paragraph position="8"> Figure 4 illustrates the relationship between the number of training iterations of the neural networks and the word recognition accuracy. As the iteration count increases from 100 to 1000, the recognition accuracy rises quickly from 32% to 87%. It can also be seen that after 5000 iterations the network is not overtrained, since recognition accuracy on the testing set is still improving.</Paragraph> </Section> <Section position="9" start_page="344" end_page="345" type="metho"> <SectionTitle> 6. PERFORMANCE COMPARISON OF DIFFERENT NETWORK ARCHITECTURES </SectionTitle> <Paragraph position="0"> We also perform comparative experiments with respect to different network architectures. It has been suggested in the communications literature that recurrent non-linear neural networks may outperform feedforward networks as equalizers. Since our problem can be interpreted as a room-acoustics equalization task, we decided to evaluate the performance of recurrent nets. For the experiments reported here, we train only on data from the 3rd cepstral coefficient (out of 13). The input to the neural net is the cepstral data from the microphone array; the target cepstral coefficient is taken from the close-talking microphone. The squared error between the target data and the neural net output is used as the cost function, and the neural nets are trained by gradient descent. Three different architectures have been evaluated: (i) a linear feedforward net (adaline) [16], (ii) a non-linear feedforward net, and (iii) a non-linear recurrent network. The input layer of all nets consists of a tapped delay line. The network configurations are depicted in Figures 5 and 6.</Paragraph>
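As an illustration of the simplest configuration in this comparison, the sketch below implements a linear equalizer (adaline) whose input is a tapped delay line over one cepstral coefficient of the array speech, trained by gradient descent on the squared error against the close-talking target. The 5-tap delay line follows the adaline entry in Table 4; the learning rate, initialization, and helper names are assumptions. The non-linear feedforward and recurrent variants would replace the linear output with sigmoid hidden units and add feedback connections, respectively.

```python
# Illustrative adaline equalizer with a tapped-delay-line input, trained by
# gradient descent (LMS rule) on the squared error between its output and the
# close-talking cepstral target. Not the authors' implementation.
import numpy as np

class AdalineEqualizer:
    def __init__(self, taps=5, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(0.0, 0.1, taps)   # tapped-delay-line weights
        self.b = 0.0
        self.taps = taps
        self.lr = lr                          # assumed learning rate

    def _delay_line(self, x):
        # Column k of the result holds the array-speech track delayed by k frames.
        pad = np.concatenate([np.zeros(self.taps - 1), x])
        return np.stack([pad[self.taps - 1 - k: len(pad) - k] for k in range(self.taps)], axis=1)

    def train(self, array_track, close_track, epochs=100):
        X = self._delay_line(array_track)
        for _ in range(epochs):
            for x, t in zip(X, close_track):
                y = self.w @ x + self.b       # linear output
                e = y - t                     # squared-error gradient
                self.w -= self.lr * e * x
                self.b -= self.lr * e
        return self

    def equalize(self, array_track):
        return self._delay_line(array_track) @ self.w + self.b
```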
<Paragraph position="1"> Experimental results are summarized in Table 4, where the entry &quot;nflops/epoch&quot; stands for the number of floating-point operations required per epoch during training, and the entry &quot;#parameters&quot; gives the number of adaptive weights in the network.</Paragraph> <Paragraph position="2"> It is clear that, for this dataset, the non-linear networks perform better than the linear nets, but at the expense of considerably more computation during adaptation. This is not a problem if we assume that the transfer function from speaker to microphone is constant; in a changing environment (moving speaker, doors opening, changing background noise), however, it does become a problem, since the neural net needs to track the change in real time. It should also be noted that the cost function used, the squared error, is in all likelihood not a monotonic function of recognizer performance. Experiments are currently underway to evaluate the performance of the various network architectures in terms of word recognition accuracy.</Paragraph> <Paragraph position="3"> [Figures 5 and 6: configurations of the networks, including the 2-layer feedforward.]</Paragraph> <Paragraph position="4"> [Table 4: performance of the different network configurations, with columns: architecture, final sqe, nflops/epoch, #parameters, and adaptation rule. The runs are ordered by increasing performance. Final sqe (squared error) is the mean sqe per time step on the test database. nflops/epoch denotes the number of floating-point operations per epoch during training; the number in brackets is the flops per epoch divided by the flops per epoch of the adaline (5 taps). #parameters denotes the number of adaptive parameters in the network.]</Paragraph> </Section> </Paper>