<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1068"> <Title>Microphone Arrays and Neural Networks for Robust Speech Recognition</Title> <Section position="4" start_page="0" end_page="342" type="intro"> <SectionTitle> 2. SPEECH CORPUS </SectionTitle> <Paragraph position="0"> A speech database has recently been created at the CAIP Center for evaluation of the integrated system of microphone arrays, neural networks, and ARPA speech recognizers. The database consists of 50 male and 30 female speakers. Each speaker speaks 20 isolated command-words, 10 digits, and 10 continuous sentences from the Resource Management task. Of the continuous sentences, two are the same for all speakers and the remaining 8 are chosen at random. Two recording sessions are made for each speaker. One session is for simultaneous recording from a head-mounted close-talking microphone (HMD 224) and from a 1-D beamforming line-array microphone (see Section 3.1). The other is for simultaneous recording from the head-mounted close-talking microphone and a desk-mounted microphone (PCC 160). The recording is done with an Ariel ProPort at a sampling frequency of 16 kHz with 16-bit linear quantization.</Paragraph> <Paragraph position="1"> The recording environment is a hard-walled laboratory room of 6 x 6 x 2.7 meters, having a reverberation time of approximately 0.5 second. Both the desk-mounted microphone and the line-array microphone are placed 3 meters from the subjects. Ambient noise in the laboratory room comes from several workstations, fans, and large video display equipment for teleconferencing. The measured 'A'-scale sound pressure level is 50 dB.</Paragraph> <Paragraph position="2"> Indicative of the quality differences in outputs from the various sound pickup systems, signal waveforms are given in Figure 1. [Figure 1 caption, partially recovered: "... and from the desk-mounted microphone (D). (A) and (B) are simultaneously recorded in a session and (C) and (D) in a following session. The utterance is: 'Microphone array,' spoken by a male speaker (ABF)."] Because of wave propagation from the speaker to the distant microphones, a delay of approximately 9 msec is noticed in the outputs of the line array and the desk-mounted microphone. Wave propagation from the subject's lips to the head-mounted close-talking microphone is negligible.</Paragraph> <Paragraph position="3"> [Figure 2 caption, partially recovered: "... incorporating microphone arrays, neural networks, and ARPA speech recognizers."] The neural network processor is trained using simultaneously recorded speech. The trained neural network processor is then used to transform spectral features of the array input to those appropriate to close-talking speech. The transformed spectral features are inputs to the speech recognition system. No retraining or modification of the speech recognizer is necessary. The training of the neural net typically requires about 10 seconds of signal.</Paragraph> <Paragraph position="4"> The reader is referred to \[8\] for more details.</Paragraph> </Section></Paper>
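The ~9 msec delay reported for the distant microphones is consistent with simple acoustics: at the speed of sound (roughly 343 m/s at room temperature, a value assumed here rather than stated in the source), a 3-meter path takes about 8.7 msec. A minimal sketch of that check, including the equivalent offset in samples at the paper's 16 kHz rate:

```python
# Propagation delay from the talker to a microphone 3 m away.
# The 343 m/s speed of sound is an assumed room-temperature value;
# the paper itself reports only the observed ~9 msec delay.
SPEED_OF_SOUND_M_PER_S = 343.0
DISTANCE_M = 3.0
SAMPLE_RATE_HZ = 16000  # sampling rate used in the corpus

delay_s = DISTANCE_M / SPEED_OF_SOUND_M_PER_S
delay_ms = delay_s * 1000.0
print(round(delay_ms, 1))  # -> 8.7 (msec), matching the observed ~9 msec

# At 16 kHz this delay corresponds to a whole-sample offset:
delay_samples = round(delay_s * SAMPLE_RATE_HZ)
print(delay_samples)  # -> 140 samples
```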