<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1031">
  <Title>Signal Representation, Attribute Extraction and the Use of Distinctive Features for Phonetic Classification 1</Title>
  <Section position="3" start_page="0" end_page="176" type="metho">
    <SectionTitle>
TASK AND CORPUS
</SectionTitle>
    <Paragraph position="0"> The task chosen for our experiments is the classification 2 of vowels in American English. The corpus consists of 13 monothongs /i, I, c, e, a~, a, o, A, o, u, u, ii and 3&amp;quot;/ and 3 diphthongs /aY, o y, aw/. The vowels are excised from the acoustic-phonetically compact portion of the TIMIT corpus \[6\], with no restrictions imposed on the phonetic contexts of the vowels. For the signal representation study, experiments are based on the task of classifying all 16 vowels. However, the dynamic nature of the diphthongs may render distinctive feature specification ambiguous. As a result, we excluded the diphthongs in our investigation involving distinctive features, and the size of the training and test sets were reduced correspondingly. The size and contents of the two corpora are summarized in Table 1.</Paragraph>
    <Paragraph position="1"> sit is a classification task in that the left and right boundaries of the vowel token are known through a hand-labelling procedure, and the classifier is only asked to determine the most likely label.</Paragraph>
    <Paragraph position="2">  els. It is used for investigation of signal representation. Corpus II is a subset of Corpus I. It consists of the monothongs only, and is used for investigation of distinctive features.</Paragraph>
    <Paragraph position="3"> For the experiments dealing with distinctive features, we characterized the 13 vowels in terms of 6 distinctive features, following the conventions set forth by others \[13\]. The feature values for these vowels are summarized in Table 2.</Paragraph>
    <Paragraph position="4"> The classifier for our experiments was selected with the following considerations. First, to facilitate comparisons of different results, we restrict ourselves to use the same classifier for all experiments. Second, the classifier must be flexible in that it does not make assumptions about specific statistical distributions or distance metrics, since different signal representations may have different characteristics. Based on these two constraints, we have chosen to use the multi-layer perceptron (MLP) \[7\]. In our signal representation experiments, the network contains 16 output units representing each of the 16 vowels. The input layer contains 120 units, 40 units each representing the initial, middle, and final third of the vowel segment. For the experiments involving acoustic attributes and distinctive features, the input layer may be the spectral vectors, a set of acoustic attributes, or the distinctive features, and the output layer may be the vowel labels or the distinctive features, as will be described later.</Paragraph>
    <Paragraph position="5"> All networks have a single hidden layer with 32 hidden units. This and other parameters had previously been adapted for better learning capabilities. In addition, input normalization and center initialization have been used \[8\].</Paragraph>
  </Section>
  <Section position="4" start_page="176" end_page="177" type="metho">
    <SectionTitle>
SIGNAL REPRESENTATION
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="176" end_page="176" type="sub_section">
      <SectionTitle>
Review of Past Work
</SectionTitle>
      <Paragraph position="0"> Several experiments on comparing signal representations have been reported in the past. Mermelstein and Davis \[10\] compared the reel-frequency cepstral coefficients (MFCC) with four other more conventional representations. They found that a set of 10 MFCC resulted in the best performance, suggesting that the reel-frequency cepstra possess significant advantages over the other representations. Hunt and Lefebvre \[4\] compared the performance of their psychoacoustically-motivated auditory model with that of a 20-channel melcepstrum. They found that the auditory model gave the highest performance under all conditions, and is least affected by changes in loudness, interfering noise and spectral shaping distortions. Later, they \[5\] conducted another comparison with the auditory model output, the reel-scale cepstrum with various weighing schemes, cepstrum coefficients augmented by the 5-cepstrum coefficients, and the IMELDA representation which combined between-class covariance information with within-class covariance information of the reel-scale filter bank outputs to generate a set of linear discriminant functions. The IMELDA outperformed all other representations.</Paragraph>
      <Paragraph position="2"> vowels These studies generally show that the choice of parametric representations is very important to recognition performance, and auditory-based representations generally yield better performance than more conventional representations. In the comparison of the psychoacoustically-motivated auditory model with MFCC, however, different methods of analysis led to different results. Therefore, it will be interesting to compare outputs of an auditory model with the computationally simpler reel-based representation when the experimental conditions are more carefully controlled.</Paragraph>
    </Section>
    <Section position="2" start_page="176" end_page="177" type="sub_section">
      <SectionTitle>
Experimental Procedure
</SectionTitle>
      <Paragraph position="0"> Our study compares six acoustic representations \[9\], using the MLP classifier. Three of the representations are obtained from the auditory model proposed by Seneff \[12\]. Two representations are based on reel-frequency, which has gained popularity in the speech recognition community. The remaining one is based on conventional Fourier transform. Attention is focused upon the relative classification performance of the representations, the effects of varying the amount of training data, and the tolerance of the different representations to additive white noise.</Paragraph>
      <Paragraph position="1"> For each representation, the speech signal is sampled at 16 kHz and a 40-dimensional spectral vector is computed once every 5 ms, covering a frequency range of slightly over 6 kHz.</Paragraph>
      <Paragraph position="2"> To capture the dynamic characteristics of vowel articulation, three feature vectors, representing the average spectra for the initial, middle, and final third of every vowel token, are determined for each representation. A 120-dimensional feature vector for the MLP is then obtained by appending the three average vectors.</Paragraph>
      <Paragraph position="3"> Seneff's auditory model (SAM) produces two outputs: the mean-rate response (MR) which corresponds to the mean probability of firing on the auditory nerve, and the synchrony response (SR) which measures the extent of dominance at  the critical band filters' characteristic frequencies. Each of these responses is a 40-dimensional spectral vector. Since the mean-rate and synchrony responses were intended to encode complementary acoustic information in the signal, a representation combining the two is also included by appending the first 20 principal components of the MR and SiR to form another 40-dimensional vector (SAM-PC).</Paragraph>
      <Paragraph position="4"> To obtain the mel-frequency spectral and cepstral coefficients (MFSC and MFCC, respectively), the signal is preemphasized via first differencing and windowed by a 25.6 ms Hamming window. A 256-point discrete Fourier transform (DFT) is then computed from the windowed waveform.</Paragraph>
      <Paragraph position="5"> Following Mermelstein et al \[10\], these Fourier transform coefficients are later squared, and the resultant magnitude squared spectrum is passed through the reel-frequency triangular filter-banks described below. The log energy output (in decibels) of each filter, Xk, k = 1,2,..,40, collectively form the 40-dimensional MFSC vector. Carrying out a cosine transform \[10\] on the MFSC according to the following equation yields the MFCC's, Yi, i = 1,2, .., 40.</Paragraph>
      <Paragraph position="7"> The lowest cepstrum coefficient, Y0, is excluded to reduce sensitivity to overall loudness.</Paragraph>
      <Paragraph position="8"> The mel-frequency triangular filter banks are designed to resemble the critical band filter bank of SAM. The filter bank consists of 40 overlapping triangular filters spanning the frequency region from 130 to 6400 Hz. Thirteen triangles are evenly spread on a linear frequency scale from 130 Hz to 1 kHz, and the remaining 27 triangles are evenly distributed on a logarithmic frequency scale from 1 kHz to 6.4 kHz, where each subsequent filter is centered at 1.07 times the previous filter's center frequency. The area of each triangle is normalized to unit magnitude.</Paragraph>
      <Paragraph position="9"> The Fourier transform representation is obtained by computing a 256-point DFT from a smoothed cepstrum, and then downsampling to 40 points.</Paragraph>
      <Paragraph position="10"> One of the experiments investigates the relative immunity of each representation to additive white noise. The noisy test tokens are constructed by adding white noise to the signal to achieve a peak signal-to-noise ratio (SNR) of 20dB, which corresponds to a SNR (computed with average energies) of slightly below 10dB.</Paragraph>
    </Section>
    <Section position="3" start_page="177" end_page="177" type="sub_section">
      <SectionTitle>
Results
</SectionTitle>
      <Paragraph position="0"> For each acoustic representation, four separate experiments were conducted using 2,000, 4,000, 8,000, and finally 20,000 training tokens. In general, performance improves as more training tokens are utilized. This is illustrated in Figure 1, in which accuracies on training and testing data as a function of the amount of training tokens for SAM-PC and MFCC. As the size of the training set increases, so does the classification accuracy on testing data. This is accompanied by a corresponding decrease in performance on training data.</Paragraph>
      <Paragraph position="1"> At 20,000 training tokens, the difference between training and testing set performance is about 5% for both representations.</Paragraph>
      <Paragraph position="2">  To investigate the relative immunity of the various acoustic representations to noise degradation, we determine the classification accuracy of the noise-corrupted test set on the networks after they have been fully trained on clean tokens.</Paragraph>
      <Paragraph position="3"> The results with noisy test speech are shown in Figure 2, together with the corresponding results on the clean test set.</Paragraph>
      <Paragraph position="4"> The decrease in accuracy ranges from about 12% (for the combined auditory model) to almost 25% (for the DFT).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="177" end_page="179" type="metho">
    <SectionTitle>
ACOUSTIC ATTRIBUTES AND
DISTINCTIVE FEATURES
</SectionTitle>
    <Paragraph position="0"> Our experiments were again conducted using an MLP classifier for speaker independent vow(.\] classification. Three experimental parameters were systematically varied, resulting in six different conditions, as depicted in Figure 3. These  three parameters specify whether the acoustic attributes are extracted, whether an intermediate distinctive feature representation is used, and how the feature values are combined for vowel classification. In some conditions (cf. conditions A, E, and F), the spectral vectors from the mean-rate response were used directly, whereas in others (cf. conditions B, C, and D), each vowel token was represented by a set of automatically-extracted acoustic attributes. In still other conditions (cf. conditions c, D, E, and F), an intermediate representation based on distinctive features was introduced.</Paragraph>
    <Paragraph position="1"> The feature values were either used directly for vowel identification through one bit quantization (i.e. transforming them into a binary representation) and table look-up (cf. conditions c and E), or were fed to another MLP for further classification (cf. conditions D and F). Taken as a whole, these experiments will enable us to answer the questions that we posed earlier. Thus, for example, we can assess the usefulness of extracting acoustic attributes by comparing the classification performance of conditions A versus S and D versus F.</Paragraph>
    <Section position="1" start_page="178" end_page="178" type="sub_section">
      <SectionTitle>
Acoustic Representation
</SectionTitle>
      <Paragraph position="0"> Each vowel token is characterized either directly by a set of spectral coefficients, or indirectly by a set of automatically derived acoustic attributes. In either case, three average vectors are used to characterize the left, middle, and right thirds of the token, in order to implicitly capture the context dependency of vowel articulation.</Paragraph>
      <Paragraph position="1"> Spectral Representation Comparative experiments described in the previous section indicate that representations from Seneff's auditory model result in performance superior to others. While the combined mean rate and synchrony representation (SAM-PC) gave the best performance, it may not be an appropriate choice for our present work, since the heterogeneous nature of the representation poses difficulties in acoustic attribute extraction. As a result, we have selected the next best representation - the mean rate response (MR).</Paragraph>
      <Paragraph position="2"> Acoustic Attributes The attributes that we extract are intended to correspond to the acoustic correlates of distinctive features. However, we do not as yet possess a full understanding of how these correlates can be extracted robustly.</Paragraph>
      <Paragraph position="3"> Besides, we must somehow capture the variabilities of these features across speakers and phonetic environments. For these reasons, we have adopted a more statistical and data-driven approach. In this approach, a general property detector is proposed, and the specific numerical values of the free parameters are determined from training data using an optimization criterion \[14\]. In our case, the general property detectors chosen are the spectral center of gravity and its amplitude. This class of detectors may carry formant information, and can be easily computed from a given spectral representation. Specifically, we used the mean rate response, under the assumption that the optimal signal representation for phonetic classification should also be the most suitable for defining and quantifying acoustic attributes, from which distinctive features can eventually be extracted.</Paragraph>
      <Paragraph position="4"> The process of attribute extraction is as follows. First, the spectrum is shifted down linearly on the bark scale by the median pitch for speaker normalization. For each distinctive feature, the training tokens are divided into two classes - \[+feature\] and \[-feature\]. The lower and upper frequency edges (or &amp;quot;free parameters&amp;quot;)of the spectral center of gravity are chosen so that the resultant measurement can maximize the Fisher's Discriminant Criterion (FDC) between the classes \[+feature\] and \[-feature\] \[2\].</Paragraph>
      <Paragraph position="5"> For the features \[BACK\], \[TENSE\], \[ROUND\], and \[RETRO-FLEX\] only one attribute per feature is used. For \[HIGH\] and \[LOW\], we found it necessary to include two attributes per feature, using the two sets of optimized free parameters giving the highest and the second highest FDC. These 8 frequency values, together with their corresponding amplitudes, make up 16 attributes for each third of a vowel token. Therefore, the overall effect of performing acoustic attribute extraction is to reduce the input dimensions from 120 to 48.</Paragraph>
    </Section>
    <Section position="2" start_page="178" end_page="179" type="sub_section">
      <SectionTitle>
Results
</SectionTitle>
      <Paragraph position="0"> The results of our experiments are summarized in Figure 4, plotted as classification accuracy for each of the conditions shown in Figure 3. The values in this figure represent the average of six iterations; performance variation among iterations of the same experiment amounts to about 1%.</Paragraph>
      <Paragraph position="1"> By comparing the results for conditions A and B, we see that there is no statistically significant difference in performance as one replaces the spectral representation by the  experimental paradigm acoustic attributes. This result is further corroborated by the comparison between conditions c and E, and D and F.</Paragraph>
      <Paragraph position="2"> Figure 4 shows a significant deterioration in performance when one simply maps the feature values to a binary representation for table look-up (i.e., comparing conditions A to E and B to C). We can also examine the accuracies of binary feature assignment for each feature, and the results are shown in Figure 5. The accuracy for individual features ranges from 87% to 98%, and there is again little difference between the results using the mean rate response and using acoustic attributes. It is perhaps not surprising that table look-up using binary feature values result in lower performance, since it would require that all of the features be identified correctly.  rate response and acoustic attributes However, when we use a second MLP to classify the features into vowels, a considerable improvement (&gt; 4%) is obtained to the extent that the resulting accuracy is again comparable to other conditions (cf. conditions A and F, and conditions B and D).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>