<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1016">
<Title>An Overview of the SPHINX-II Speech Recognition System</Title>
<Section position="3" start_page="0" end_page="81" type="intro">
<SectionTitle> 2. FEATURE EXTRACTION </SectionTitle>
<Paragraph position="0"> The extraction of reliable features is one of the most important issues in speech recognition, and as a result the training data plays a key role in this research. However, the curse of dimensionality reminds us that the amount of training data will always be limited. Therefore, incorporating additional features may not lead to any measurable error reduction. This does not necessarily mean that the additional features are poor ones, but rather that we may have insufficient data to model those features reliably. Many systems that incorporate environmentally-robust [1] and speaker-robust [11] models face similar constraints.</Paragraph>
<Paragraph position="1"> 2.1. MFCC Dynamic Features Temporal changes in the spectra are believed to play an important role in human perception. One way to capture this information is to use delta coefficients that measure the change in coefficients over time. Temporal information is particularly useful for HMMs: because HMMs assume that each frame is independent of the past, these dynamic features broaden the scope of a frame. In the past, the SPHINX system has utilized three codebooks containing [23]: (1) 12 LPC cepstrum coefficients $x_t(k)$, $1 \le k \le 12$; (2) 12 differenced LPC cepstrum coefficients (40 msec. difference) $\Delta x_t(k)$, $1 \le k \le 12$; (3) power and differenced power (40 msec.), $x_t(0)$ and $\Delta x_t(0)$. Since we are using a multiple-codebook hidden Markov model, it is easy to incorporate new features by using an additional codebook. We experimented with a number of new measures of spectral dynamics, including: (1) second order differential cepstrum and power ($\Delta\Delta x_t(k)$,</Paragraph>
<Paragraph position="3"> cepstrum and power.
The first set of coefficients is incorporated into a new codebook, whose parameters are second order differences of the cepstrum. The second order difference for frame t, $\Delta\Delta x_t(k)$, where t is in units of 10 ms, is the difference between the t + 1 and t - 1 first order differential coefficients, or $\Delta\Delta x_t(k) = \Delta x_{t-1}(k) - \Delta x_{t+1}(k)$.</Paragraph>
<Paragraph position="4"> Next, we incorporated both 40 msec. and 80 msec. differences, which represent short-term and long-term spectral dynamics, respectively. The 80 msec. differenced cepstrum $\Delta x'_t(k)$ is computed as: $\Delta x'_t(k) = x_{t-4}(k) - x_{t+4}(k)$.</Paragraph>
<Paragraph position="5"> We believe that these two sources of information are more complementary than redundant. We incorporated both $\Delta x_t$ and $\Delta x'_t$ into one codebook (combining the two into one feature vector), weighted by their variances. We attempted to compute an optimal linear combination of cepstral segments, with weights computed from linear discriminants, but we found that performance deteriorated slightly. This may be due to limited training data, or there may be little information beyond second-order differences. Finally, we compared mel-frequency cepstral coefficients (MFCC) with our bilinear transformed LPC cepstral coefficients. Here we observed a significant improvement for the SCHMM model, but none for the discrete model. This supported our early findings about problems with modeling assumptions [15]. Thus, the final configuration involves 51 features distributed among four codebooks, each with 256 entries. The codebooks are: (1) 12 mel-scale cepstrum coefficients; (2) 12 40-msec differenced MFCC and 12 80-msec differenced MFCC; (3) 12 second-order differenced MFCC; and (4) power, 40-msec differenced power, and second-order differenced power. The new feature set reduced errors by more than 25% over the baseline SPHINX results on the WSJ task.</Paragraph>
</Section>
</Paper>
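<!--
The differenced features described in this section can be illustrated with a minimal sketch. It is not the SPHINX-II implementation; it only mirrors the two formulas stated in the text (the 80 msec. difference over frames t - 4 and t + 4, and the second order difference over the first order coefficients at t - 1 and t + 1), with t in units of 10 ms. Several points are assumptions made for illustration: the function names are hypothetical, NumPy is used for convenience, the 40 msec. difference is taken over frames t - 2 and t + 2 with the same sign convention, the second order difference is built from the 40 msec. coefficients, and frames near the utterance boundaries are handled by clamping indices. The variance weighting mentioned in the text is not modeled.

```python
import numpy as np


def shifted(frames, offset):
    """Return frames[t + offset] for every t, clamping indices at the edges.

    Edge clamping is an assumption; the text does not say how boundaries are handled.
    """
    num_frames = frames.shape[0]
    idx = np.clip(np.arange(num_frames) + offset, 0, num_frames - 1)
    return frames[idx]


def span_difference(frames, half_span):
    """Compute x[t - half_span] minus x[t + half_span], the order used in the text."""
    return shifted(frames, -half_span) - shifted(frames, half_span)


def dynamic_features(cepstra):
    """Derive differenced features from cepstra of shape (num_frames, 12).

    Each row is assumed to be one 10 ms frame.
    """
    delta_40 = span_difference(cepstra, 2)      # 40 msec. span (assumed: 2 frames each side)
    delta_80 = span_difference(cepstra, 4)      # 80 msec. span: frames t - 4 and t + 4
    delta_delta = span_difference(delta_40, 1)  # first order coefficients at t - 1 and t + 1
    return delta_40, delta_80, delta_delta


if __name__ == "__main__":
    # Synthetic cepstra stand in for real MFCC frames (hypothetical data).
    rng = np.random.default_rng(0)
    cepstra = rng.standard_normal((100, 12))
    d40, d80, dd = dynamic_features(cepstra)
    print(d40.shape, d80.shape, dd.shape)
```

Each returned array has one 12-dimensional row per input frame, so the 40 msec. and 80 msec. differences can be placed side by side to form the 24-dimensional input of the second codebook listed above.
-->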