<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1057">
  <Title>Experimental Results for Baseline Speech Recognition Performance using Input Acquired from a Linear Microphone Array</Title>
  <Section position="3" start_page="285" end_page="285" type="metho">
    <SectionTitle>
2. The LEMS Speech Recognizer
</SectionTitle>
    <Paragraph position="0"> An HMM-based, connected-speech, 38-word vocabulary (alphabet, digits, 'space', 'period'), talker-independent speech recognition system has been running for two years in the LEMS facility \[20, 21\]. This small, but very difficult vocabulary has many of the problems associated with a phoneme recognizer.</Paragraph>
    <Paragraph position="1"> Speech, sampled at 16kHz from a close-talking microphone, is tnmcated through a 40ms Hamming window every 10ms.</Paragraph>
    <Paragraph position="2"> Twelve cepstral coefficients, twelve delta cepstral coefficients, overall energy and delta overall energy comprise the 26 element feature vector. Three 256-entry codebooks are used to vector quantize the data from cepstral, delta cepstral, and energy/delta energy features respectively 1. The recognizer differs from standard HMM models in that durational probabilities are handled explicitly \[22\]. For each state, self transitions are disallowed. During training, nonparametric statistics are gathered for each of 30 potential durations in the state, i.e., 10ms to 300ms. In the base system used for this experiment, a gamma distribution was fitted to the non-parametric statistics. The models used are word-level models having from five to twelve states. Only forward transitions and skips of a single state were allowed.</Paragraph>
    <Paragraph position="3"> The best available recognizer at the time was used for the experiment, except that the amount of speech normally used to develop the vector quantizafion codebooks was reduced from one and one-half hours to 15 minutes. This made it feasible to do several full k-means re-trainings of the system; VQ training took but two days (elapsed time) on a SUN SPARCstation 2 while VQ training for the one and one-half hour case would have taken an unacceptable twelve days2! The change to the VQ training degraded performance for the close-talking microphone data by 1.5%, i.e., the 79% performance of the system for 1) new talkers and 2) no grammar was reduced to 77.5%.</Paragraph>
    <Paragraph position="4"> About four hours of speech (2400 connected strings, or nearly 40,000 vocabulary items) from 80 talkers, half male, half female, were used to train the hidden Markov models.</Paragraph>
    <Paragraph position="5"> Currently, the training procedure requires 60 hours of CPU time from each of eight SPARC 1+/2 workstations linked in a loosely-coupled fashion through sockets. Well-known mechanisms for speeding up the process, such as doing the eat the time this experiment was initiated, semi-continuous modeling of output probabilities and better word models were not yet a part of the system. Current improvements have increased overall performance for the head-mounted microphone input by about 3%.</Paragraph>
    <Paragraph position="6">  computation in the logarithm domain using integers and a lookup table \[23\], as well as some detailed new programming speedups \[24\] are being used to reduce the training time.</Paragraph>
  </Section>
  <Section position="4" start_page="285" end_page="286" type="metho">
    <SectionTitle>
3. Data Development
</SectionTitle>
    <Paragraph position="0"> The original speech data were recorded in a large, generally not-too-noisy room through an Audio Technica ATM73a head-mounted, close-talking microphone. The speech was sampled through a Sony DAT- 16 bits at 48kHz sampling rate. It was then digitally decimated to 16kHz and fed directly to a SUN workstation to build a high-fidelity database \[25\].</Paragraph>
    <Paragraph position="1"> The signal-to-noise ratio is about 50dB.</Paragraph>
    <Paragraph position="2"> It would not have been possible, let alone feasible, to record another large dataset from the same talkers using the microphone array system for acquisition. Thus, a mechanism had to be developed to use the high-fidelity database as input to the array recording system. A high-quality transducer was used to play out the speech; the geometry is shown in Figure 1. The resulting real-time system for the data conversion is schematically shown in Figure 2. Three SPARC 1+/2 workstations are used. The first converts the digital speech data in speech recognition format into digital data acceptable for playback through the microphone array hardware. This involves changing the sampling rate from 16kHz to 20kHz and then applying an FIR inverse filter to undo the coloring that will come from the output transducer. This filter was obtained by running digital, band-limited white noise with DFr spectrum W(r) through the transducer and recording the output through an ultra-flat frequency response Briiel &amp; Kjaer (B&amp;K) condenser microphone system placed a few</Paragraph>
    <Section position="1" start_page="286" end_page="286" type="sub_section">
      <SectionTitle>
Inverse Filter
</SectionTitle>
      <Paragraph position="0"> centimeters in front of the middle of the output transducer.</Paragraph>
      <Paragraph position="1"> After accumulating an average magnitude spectrum of the B&amp;K's output via multiple 128-point DFT's, the spectrum S(r) was inverted, i.e., Y(r) = W(r)/S(r), and inverse transformed to produce a zero-phase FIR filter 3. Any spectral energy attenuated by the anti-aliasing filter i.e., frequencies above 7kHz, were forced to unity gain. S(r) and Y(r) are shown in Figure 3. The subjective audible effects as well as the flattened white-noise response indicate that this procedure was successful in removing the 'boominess' potentially introduced by the transducer system* Initially, small, omnidirectional electret microphones were mounted at the edge of a 5cmx 10cm board containing amplifier/filter electronics and the board was plugged vertically into a (2.5m) horizontal cage* Recent work disclosed that this system formed resonant cavities that impacted the performance of the linear microphone array. When the same microphones with the same spacing (18*4 cm) were inserted into a (180cmx 30cmx 15cm) block of six pound ester foam, the degradations due to the cavities disappeared as may be seen in Figure 4. Note that the data shown are for the transducer output after the noise has been inverse filtered.</Paragraph>
      <Paragraph position="2"> The remainder of the data conversion system is straightforward. Twenty kilohertz sampling interrupts are used both to produce the speech output(s) and to digitize the analog signals from the eight microphones. Sufficient memory is available for about 10 second utterances. Upon completion of an utterance, the microphone data are sent to a third 3Non-zero-pha~zinverse filters are also being investigated.</Paragraph>
      <Paragraph position="3"> SPARCstation for sample-rate conversion, signal processing for recognition, and archiving on hard disk as feattwe vectors for the recognition system*</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="286" end_page="287" type="metho">
    <SectionTitle>
4. Experiment and Results
</SectionTitle>
    <Paragraph position="0"> The system was trained, both for VQ and for the hidden Markov model parameters, three different times: 1) for the high-fidelity data, 2) for the output of a single microphone of the array (a. central one), and 3) for the simple delay-and-sum beamfonned output of the 8 microphone array. The recognizer was tested using 20 new talkers, again half male and half female, for a total of an hour of speech, or about 4800 vocabulary items. The data conversion system was run under 'quiet' conditions. Not including noise due to reverberations, the signal-to-noise ratios were significantly degraded by the acoustical noise to 24dB for the single remote microphone and 26dB for the beamformed signal. The results as a function of talker number are plotted in Figure 5. From the Figure, one may deduce that: For all cases, variation with respect to talker is far greater than variations due to other effects.</Paragraph>
    <Paragraph position="1"> Recognition performance is approximately the same for the single microphone as it is for the beamformed case, given no other point 'noise' sources..</Paragraph>
    <Paragraph position="2"> Performance for the high-fidelity signal is consistently about 12% better than for the acoustically degraded signal.</Paragraph>
    <Section position="1" start_page="287" end_page="287" type="sub_section">
      <SectionTitle>
Systems
</SectionTitle>
      <Paragraph position="0"> For completeness, each of the test datasets was run against each of the three systems. The results are given in Table 1.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="287" end_page="288" type="metho">
    <SectionTitle>
5. Discussion
</SectionTitle>
    <Paragraph position="0"> Given the degraded acoustical environment, it was not surprising that performance for the converted data was reduced using remote-microphone input. However, it was somewhat surprising that this very carefully done experiment indicates no performance advantage when simple beamforming is used to generate the input. This could be due to the following: * Low-frequency background noise is not effectively eliminated by an acoustic array of this type and size. Some filtering, perhaps combined with sub-band types of enhancements, should help.</Paragraph>
    <Paragraph position="1"> * The major reverberations in the room come from the ceiling and floor. They have been measured as being as much as 25% of the original wavefront in intensity.</Paragraph>
    <Paragraph position="2"> Even if the reflections average 10%, implying a 14dB signal-to-noise ratio, 'quiet' room conditions no longer hold. A focused two or three dimensional array could attenuate these reflections and thus address the problem.</Paragraph>
    <Paragraph position="3"> Altematively, pressure gradient microphones could be used in a one-dimensional array as done in \[13\].</Paragraph>
    <Paragraph position="4"> There is always some variability in an acoustical experiment regarding equipment positioning, overall amplitudes, microphone calibration etc. While great care was taken, certainly the beamformer output would be more susceptible to these variabilities than would be the single remote microphone.</Paragraph>
    <Paragraph position="5"> In order to determine the impact of beamforming, the testing data were run through the data conversion system (source at (1 m, 2m)) several additional times, each with a second transducer located at (2m, 2m). This second transducer repeated a few seconds of speech at various, controlled levels as the testing data were being recorded. This procedure permits the assessment of the effects of beamforming with respect to spatial filtering of off-axis noise. The test datasets for both the single remote microphone and the beam:formed data were run through their respective quiet-room recognizers. As the purpose of this test was to check the simple beamformer, more elaborate beamformers were not used to generate the data of Figure 6. Also, note that no background noise processing (such as high-pass filtering the signals) was used to remove the low-frequency 'rumble' of the room.</Paragraph>
    <Paragraph position="6">  As the graph indicates, there is an appreciable performance gain using the array for acoustic data collection in a noisy environment. The simple beamformer consistently scored 10-15% higher than a single microphone for SNR's less than 16dB. Note that in one case the recognition result is negative. This is a consequence of the method employed for calculating the performance score.</Paragraph>
  </Section>
class="xml-element"></Paper>