<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1058"> <Title>PHONETIC CLASSIFICATION ON WIDE-BAND AND TELEPHONE QUALITY SPEECH</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. INTRODUCTION </SectionTitle> <Paragraph position="0"> Researchers typically make use of standardized databases in order to benchmark the performance of speech recognition/understanding systems between and within laboratories. Comparisons between laboratories are important in order to benchmark the progress of the field in general.</Paragraph> <Paragraph position="1"> Comparisons within a laboratory are important to benchmark progress as a function of the research-and-development cycle. Benchmarking phonetic classification algorithms for telephone-network-based speech recognition/understanding systems poses two problems. First, there is no commonly accepted standard database for evaluating phonetic classification on telephone quality speech. As a result, few if any inter-laboratory comparisons have been made. Second, the telephone network presents speech recognition/understanding systems with a band-limited, noisy, and in some cases distorted speech signal. While we would like to benchmark the performance of recognition systems intended for network speech against that of systems intended for wide-band speech, we do not have adequate quantification of the impact of the telephone network's signal degradation on the performance of phonetic classification algorithms. Therefore, we do not know whether the performance of a telephone-speech classification algorithm is limited by characteristics of the algorithm(s) or by characteristics of the test utterances themselves.</Paragraph> <Paragraph position="2"> Both problems noted above could be addressed given a standardized database in which the speech data is presented in two forms: speech with wide-band characteristics and the same speech data with telephone network characteristics. As reported in Jankowski et al. [1], the N-TIMIT database was created for this purpose. The N-TIMIT database is identical to TIMIT [2] except that the former has been transmitted over the telephone network. Figure 1 shows sample spectrograms of a TIMIT and an N-TIMIT utterance.</Paragraph> <Paragraph position="3"> The N-TIMIT versions were recorded over many different transmission paths in order to obtain a representative sample of the range of telephone network conditions. These data provide a platform to &quot;calibrate&quot; the impact of the telephone network on the performance of phonetic classification algorithms.</Paragraph> <Paragraph position="4"> The telephone network affects the speech signal it carries in many ways. Chigier and Spitz [3] discussed the possible effects of source characteristics (how the speech is produced) and transmission characteristics (the environment in which the speech is produced, including ambient noise levels and the characteristics of the channel through which the speech is recorded). Some of the more obvious changes are due to band limitation (the signal is band-passed between approximately 300 Hz and 3400 Hz), addition of noise (both switching and line noise), and crosstalk. The goal of this experiment is to quantify the combined effects of signal changes due to telephone transmission characteristics on phonetic classification and to present the performance of a classifier under development. In doing this we hope to provide a model for inter-laboratory and intra-laboratory benchmarking of telephone-based vs. wide-band algorithms.</Paragraph> </Section>
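The band limitation and additive noise described above are the dominant, easily modeled degradations. As a minimal illustrative sketch only (N-TIMIT itself was produced by transmission over real telephone networks, not by simulation), the following Python fragment band-passes a 16 kHz wide-band signal to the approximately 300-3400 Hz telephone passband and adds noise; the passband edges come from the text, while the filter order and the signal-to-noise ratio are arbitrary assumptions.

import numpy as np
from scipy.signal import butter, lfilter

def simulate_telephone_channel(x, fs=16000, snr_db=30.0):
    # Crude stand-in for telephone-channel degradation: band-pass the
    # signal to ~300-3400 Hz and add white noise. The filter order (4)
    # and SNR (30 dB) are illustrative choices, not values from the paper.
    b, a = butter(4, [300.0, 3400.0], btype="bandpass", fs=fs)
    y = lfilter(b, a, x)
    p_signal = np.mean(y ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return y + np.sqrt(p_noise) * np.random.randn(len(y))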
<Section position="4" start_page="0" end_page="291" type="metho"> <SectionTitle> 3. EXPERIMENTAL DESIGN </SectionTitle> <Paragraph position="0"> The wide-band speech data used in this experiment consists of a subset of utterances from the TIMIT database [2]. In order to investigate the effects of telephone network transmission characteristics, the same subset of the TIMIT utterances used by Lee [5], and their N-TIMIT counterparts, were selected. Specifically, the test set consisted of three sentences selected at random from the Brown Corpus [4] (&quot;si&quot; utterances) and five sentences that provide a wide coverage of phoneme pairs (&quot;sx&quot; utterances), for each of 20 speakers. This resulted in 6016 phonetic segments in 160 unique sentences to be classified into one of 39 categories, also defined by Lee. The &quot;si&quot; and &quot;sx&quot; utterances for the remaining 610 speakers were used to train the classification system.</Paragraph> </Section> <Section position="5" start_page="291" end_page="291" type="metho"> <SectionTitle> 4. SIGNAL PROCESSING </SectionTitle> <Paragraph position="0"> Identical signal processing was performed on TIMIT and N-TIMIT. The speech signals were sampled at 16 kHz and pre-emphasized. We have developed a new signal representation: bark auditory spectral coefficients (BASC). The BASC was obtained by filtering the FFT representation with the filters of Seneff's auditory model [6]. Specifically, a 128-point FFT was performed with a 28-ms Hanning window every 5 ms. The window size of 28 ms was empirically determined to be the best for this task given this classification system. Each spectral slice, produced by the FFT, was down-sampled to 40 coefficients by computing the dot product between the spectral slice and each of the 40 frequency responses of Seneff's auditory model [6]. This is similar to passing the signal through a bank of critical-band filters.</Paragraph> </Section>
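A minimal sketch of the BASC front end described above, assuming a matrix seneff_filters of shape (40, n_bins) holding the 40 frequency responses of Seneff's auditory model sampled on the FFT bin grid (the responses themselves are not given here). The 28-ms Hanning window and 5-ms step come from the text; the pre-emphasis coefficient and the use of the magnitude spectrum are assumptions, and the FFT length is tied to the filter grid rather than fixed at the paper's 128 points.

import numpy as np

def basc_frames(x, seneff_filters, fs=16000, win_ms=28.0, step_ms=5.0):
    # Bark auditory spectral coefficients (sketch).
    # seneff_filters: (40, n_bins) filter-bank responses, assumed to be
    # sampled on the same frequency grid as the FFT below.
    win = int(fs * win_ms / 1000.0)     # 448 samples at 16 kHz
    step = int(fs * step_ms / 1000.0)   # 80 samples at 16 kHz
    window = np.hanning(win)
    n_bins = seneff_filters.shape[1]
    n_fft = 2 * (n_bins - 1)            # rfft length matching the filter grid
    # Pre-emphasis; the coefficient 0.97 is a conventional choice,
    # not a value taken from the paper.
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])
    frames = []
    for start in range(0, len(x) - win, step):
        spectrum = np.abs(np.fft.rfft(window * x[start:start + win], n=n_fft))
        # Down-sample to 40 coefficients: dot product of the spectral slice
        # with each filter response, akin to a critical-band filter bank.
        frames.append(seneff_filters @ spectrum)
    return np.array(frames)             # (n_frames, 40)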
<Section position="6" start_page="291" end_page="292" type="metho"> <SectionTitle> 5. CLASSIFICATION </SectionTitle> <Paragraph position="0"> A full-covariance Gaussian classifier was then used to classify each of the incoming segments into one of the 39 phonemes. The Gaussian classifier used 56 context-independent models based on a uni-gram model for the phonemes.</Paragraph> <Section position="1" start_page="291" end_page="292" type="sub_section"> <SectionTitle> 5.1. Feature Extraction </SectionTitle> <Paragraph position="0"> Each segment was divided in time into three equal parts. The 40 coefficients were averaged across each third, resulting in 120 features for each phoneme. The average spectral difference was computed with its center at the begin boundary and then calculated again with its center at the end boundary. This spectral difference measure was computed for each of the 40 spectral coefficients around each boundary in a segment, giving a total of 80 spectral difference features. In calculating the spectral average, the frames further away from the center of the boundary were weighted more heavily than the frames close to the boundary. This weighting scheme is similar to that proposed by Rabiner et al. [7]. Let S[f,c] be the value of the spectral representation at frame f and spectral coefficient c. Thus, the spectral difference coefficient at a segment boundary sb (begin or end boundary), \Delta S[sb,c], is defined as: \Delta S[sb,c] = \frac{\sum_{f=1}^{N} w^{f} \left( S[sb+f,\,c] - S[sb-f,\,c] \right)}{\sum_{f=1}^{N} w^{f}} \qquad (1) where 2N is the number of frames in the overall window, and w is the weighting factor.</Paragraph> <Paragraph position="1"> A pilot study was conducted to determine whether weighted averages provide better classification performance than traditional unweighted averages (the special case of w = 1 in Eq. 1) using the current classification system. The weighted versions slightly outperformed the unweighted averages when testing on the cross-validation set described above.</Paragraph> <Paragraph position="2"> Another pilot study was designed to determine the optimal number of frames to use when computing the weighted averages. The number of frames included was systematically varied from 0 to 10 (0 ≤ N ≤ 10 in Eq. 1), both preceding and following the boundary, which resulted in a weighted average difference for each coefficient. (Note that for N = 0 frames, no difference information is derived.) The optimal number of frames to include in the weighted average was found to be 7, which provided the highest classification score on the cross-validation set.</Paragraph> <Paragraph position="3"> The average spectral distance calculations result in 40 features at the begin boundary and 40 features at the end boundary. These were combined with the 120 features derived for each segment described above. Duration and maximum zero crossing count were added to the pool of features, resulting in 202 features that were passed on to the classification system.</Paragraph> </Section>
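To make the segment-level feature extraction concrete, here is a minimal sketch under stated assumptions: basc is the (n_frames, 40) BASC matrix for an utterance, begin and end are a segment's boundary frame indices (assumed to lie at least N frames from the utterance edges), and N = 7 follows the pilot study above. The code implements Eq. 1 as given, with the weight base w left as a free parameter (w = 1 recovers the unweighted average), since the selected value of w is not reported in this section.

import numpy as np

def segment_features(basc, begin, end, N=7, w=1.2):
    # Sketch of the 200 spectral features for one segment; w=1.2 is an
    # arbitrary illustrative weight base, not a value from the paper.
    # 120 features: average the 40 coefficients over each third of the segment.
    thirds = np.array_split(basc[begin:end], 3)
    avg_feats = np.concatenate([t.mean(axis=0) for t in thirds])   # (120,)

    # 80 features: weighted spectral difference (Eq. 1) at each boundary,
    # weighting frames further from the boundary more heavily when w > 1.
    weights = w ** np.arange(1, N + 1)
    def delta(sb):
        after = basc[sb + 1: sb + N + 1]     # N frames following the boundary
        before = basc[sb - N: sb][::-1]      # N frames preceding, nearest first
        return (weights[:, None] * (after - before)).sum(axis=0) / weights.sum()
    diff_feats = np.concatenate([delta(begin), delta(end)])        # (80,)

    # Duration and maximum zero-crossing count (computed elsewhere) would
    # bring the total to the 202 features passed to the classifier.
    return np.concatenate([avg_feats, diff_feats])                 # (200,)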
<Section position="2" start_page="292" end_page="292" type="sub_section"> <SectionTitle> 5.2. Feature Selection </SectionTitle> <Paragraph position="0"> Principal component analysis was used to reduce the number of input dimensions to the classifiers. The principal components were ranked in decreasing order according to the amount of variance accounted for in the original data (i.e., based on the eigenvalues). The final set of principal components used was determined empirically by adding one principal component at a time to the classifier, training the classifier, and then evaluating performance on the cross-validation set. Finally, the set of principal components that produced the best performance on the cross-validation set was used to train the classifier on the entire training set. This procedure was carried out separately for the N-TIMIT and the TIMIT databases. The resulting two classifiers were evaluated on their respective test sets.</Paragraph> <Paragraph position="1"> Ranking the principal components according to the amount of variance they account for may not reflect how well they discriminate between classes. Therefore, another procedure was also evaluated to determine which of the principal components have the most discriminating power: a stepwise add-on procedure that repeatedly adds the principal component that most improves the performance of the classifier on the cross-validation set. A classifier was first trained on all 202 principal components. Each principal component was then tested one at a time on the cross-validation set; the component that performed best was next paired with each of the remaining components, and the best-performing pair was in turn extended with each of the remaining components. This procedure was carried out by incrementally adding principal components to the classifier in this way, based on their ability to improve performance. This procedure is not optimal, but it is computationally feasible (the optimal procedure would require testing 2^202, or approximately 6.4 x 10^60, classifiers).</Paragraph> </Section> </Section> </Paper>