<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2034">
  <Title>Speaker Adaptation Using Multiple Reference Speakers</Title>
  <Section position="4" start_page="256" end_page="260" type="evalu">
    <SectionTitle>
3 EXPERIMENTAL RESULTS
</SectionTitle>
    <Paragraph position="0"> Speaker-adaptation encompasses a wide variety of practical scenarios. Our current speaker-adaptation algo- null fithms are suited to a batch-style, limited-training scenario appropriate for bringing a new speaker up to an acceptable initial recognition performance level. We have concentrated on the new speaker start-up seenario, using supervised techniques on a small set of known training utterances, in the belief that supervised techniques are most likely to succeed in the short term.</Paragraph>
    <Paragraph position="1"> In most of our development work, and in the experiments described below, we have used the speaker-dependent data from the 1000-word DARPA Resource Management continuous speech database \[4\]. All results reported here used 2 minutes (40 utterances) of adaptation material (limited-training) from the target speaker. The standard word-pair grammar, defined as part of the database for evaluation purposes, was used in all cases except where specified otherwise. The number of test speakers and the identity of the test set vary across the experiments described below, and are noted where important. Unless otherwise noted, the performance numbers given for all experiments are: % Word Error = 100 x \[(substitutions + deletions + insertions) / total number of word tokens\]</Paragraph>
    <Section position="1" start_page="257" end_page="259" type="sub_section">
      <SectionTitle>
3.1 Single Reference Speakers
</SectionTitle>
      <Paragraph position="0"> We performed a series of experiments on the baseline (single reference speaker) system to examine several issues related to our proposal for using multiple reference speakers. The experiments described below investigate the importance of feature conditioning for DTW, determine the effect of reference speaker identity on the estimation of the between-speaker transformations, and establish a baseline performance level for the single reference speaker system.</Paragraph>
      <Paragraph position="1">  The raw speech parameters that we use (Mel-warped cePstra, cepstral differences, normalized and difference energy) have widely varying dynamic ranges. This necessitates some form of feature pre-conditioning to avoid degenerate alignments from the DTW.</Paragraph>
      <Paragraph position="2"> In the past, we have found that normalizing each feature independently to unit-variance (computed over the adaptation utterances) provided a satisfactory and convenient solution to the dynamic range problem. After such a normalization, each feature contributes equally on average to the alignment score computed by DTW.</Paragraph>
      <Paragraph position="3"> This simple approach performed marginally better than weighting the original feature vectors to equalize the contribution of each feature set to the DTW score.</Paragraph>
      <Paragraph position="4">  feature conditioning.</Paragraph>
      <Paragraph position="5"> The results shown in Table 1 compare three cases of feature conditioning, tested on six speakers. The results given in the column labeled, Norm Only, were achieved by computing the feature transformation from the target adaptation speech and applying it to the target's test speech. The transformed target speech was then quantized by the reference codebook and recognized using the reference (cross-speaker) HMM. The results given in the column labeled, + PDF Mapping, were achieved after applying the PDF spectrum tranformation to the reference HMM. For this condition, the target speech is quantized by a speaker-dependent codebook made from the target's adaptation speech. The PDF mapping is therefore computed between two independent codebooks as in our standard baseline system.</Paragraph>
      <Paragraph position="6"> The unit variance condition (1) establishes a baseline performance for the system. This condition is similar&amp;quot; to the system configuration used for the results from Feb. '89 reported in \[3\]. For condition (2) in the table, the sample mean is removed from the speech features of both reference and target speakers after normalizing the features to unit variance. This yields a small improvement for the Norm Only case but doesn't improve when the PDF transformation is used. Condition (3) applies a fixed, non-unit weighting to the features of both speakers after unit variance scaling and mean removal.</Paragraph>
      <Paragraph position="7"> This yields an additional 25% reduction in error for the normalization alone and marginally improves the performance of the PDF mapping. For this condition the cepstral features (unit-variance normalized) of both speakers were scaled by the square root of the cepstral index of the feature. The normalized energy feature was scaled by x/~, while the difference energy was left unchanged at unit variance.</Paragraph>
      <Paragraph position="8"> These results indicate that the DTW is sensitive to feature conditioning when computing alignments for the purpose of estimating a between-speaker normalization.</Paragraph>
      <Paragraph position="9">  This indicates that further work is needed in feature conditioning and suggests that improvements to the iterative normalization procedure itself may also be important. It is also evident that the performance of the PDF mapping is largely independent of the quality of the normalization. This result is important since we must rely more heavily upon the feature normalization procedure when using multiple reference speakers as we propose.</Paragraph>
      <Paragraph position="10">  We tested our baseline, single reference, speaker-adaptation system on new test data for the Oct. '89 DARPA speech recognition evaluation. We used the feature conditioning enhancements described above. In addition the models for our standard reference speaker were retrained using cross-word-boundary context-dependent triphones.</Paragraph>
      <Paragraph position="11"> The reference model was trained from 30 minutes of speech (600 utterances). Two minutes of speech (40 utterances) from the target speaker was used to compute the transformations. All development was done on the designated, May88 test set, consisting of 25 utterances per speaker.</Paragraph>
      <Paragraph position="12"> The twelve speaker average word error rate for the Oct. '89 test set was 7.4% for the word-pair grammar and 28.7% for the no-grammar condition. These average results are competitive with the best speaker-independent results being reported today (elsewhere in these proceedings) on different, but comparable, test data. While the speaker-independent scenario requires no adaptation speech from the target speaker, it does require a large training data collection effort to provide adequate training for the (pooled) reference model. Specifically, speaker-independent training for the DARPA evaluations utilizes about 3.5 hours (4000 utterances) of reference speech from over 100 speakers. In contrast, our baseline speaker-adaptation system uses only 30 minutes (600 utterances) of reference speech from a single speaker to achieve the same performance. This suggests that speaker-adaptation may offer a more economical approach for those applications which require rapid configuration on new task domains.</Paragraph>
      <Paragraph position="13"> Detailed Oct. '89 evaluation results for the word-pair grammar are shown in Table 2 in order of increasing word error rate. The results in the last column of the table are:  for Oct. '89 evaluation test with word-pair grammar.</Paragraph>
      <Paragraph position="14"> Curiously, the female target speakers tend to achieve higher recognition results despite the fact that the reference speaker is male. Also, these results show a wide variance across speakers that is not consistent with speaker-dependent results (elsewhere in these proceedings) obtained from these same speakers on the same test material.</Paragraph>
      <Paragraph position="15"> In order to prove useful, speaker-adaptation must perform reliably for most speakers, and must be considerably more powerful than can be demonstrated today. Below, we discuss several possible strategies for improving our speaker-adaptation performance.</Paragraph>
      <Paragraph position="16">  In all of our previous work in speaker adaptation, one particular speaker has been used as the reference. Here we investigated the effect of the reference speaker's identity on recognition performance. Our standard speaker (male) was recorded at BBN in a normal office environment and spoke in a clear deliberate style. The development training and test data, on the other hand, was collected at another site in a sound isolating booth, and the subjects (both male and female) often spoke in casual undirected styles.</Paragraph>
      <Paragraph position="17"> We tested the effect of reference speaker identity by selecting four additional speakers from the database to be used as reference speakers. The speakers were chosen with the sole criterion that their speaker-dependent mod- null els performed better than the average of the 12 speakers in the database.</Paragraph>
      <Paragraph position="18">  used for speaker-adaptation.</Paragraph>
      <Paragraph position="19"> In Table 3, we show results, averaged over five test speakers, for each of five reference speakers. Speaker, RS, is our standard reference speaker. The results show that selection of an adequate reference speaker is not a difficult task since three of the four new speakers chosen do as well as our standard speaker. Furthermore, the recording and speaking style differences between our standard reference speaker and the test speakers are apparently not important ones, since reference speakers selected from the homogeneous database material did no better than our standard speaker. The 20- confidence interval for this experiment is ,.~ :t: 2.2%.</Paragraph>
      <Paragraph position="20"> The alternate reference speaker results were also used to determine whether the individual pairings of reference and target speaker were important. Since each target speaker had been adapted to each of the 5 reference speakers, we could pick the best matching reference for each target based on overall recognition performance.</Paragraph>
      <Paragraph position="21">  for a given target speaker.</Paragraph>
      <Paragraph position="22"> The resulting average word error rate for (unfair) posthoe reference selection was 9.9% as show in Table 4. This is 20% less than the average across all targetreference combinations shown in Table 3. This result represents an upper bound on the improvement that could be expected from automatic reference speaker selection at the test set level, making such a strategy relatively unattractive.</Paragraph>
      <Paragraph position="23"> Since we need a larger improvement than seems likely from any single reference speaker, we are attempting to find effective methods of combining multiple reference speakers.</Paragraph>
    </Section>
    <Section position="2" start_page="259" end_page="260" type="sub_section">
      <SectionTitle>
3.2 Multiple Reference Speakers
</SectionTitle>
      <Paragraph position="0"> We have performed two preliminary experiments to explore the feasibility of combining multiple reference speakers for speaker-adaptation.</Paragraph>
      <Paragraph position="1">  One approach for combining multiple reference speakers into a single reference model is to adapt each reference speaker independently to the target speaker, and use the adapted models jointly in the recognition stage. A straight-forward method of combining the adapted models is to average the HMM (discrete) densities.</Paragraph>
      <Paragraph position="2"> We created such a combined reference model from the last 4 of the reference speakers shown in Table 3. The resulting recognition word error rate for the averaged model was 9.3%, compared to 12.4% for the average of the same 4 speakers used as single reference speakers.</Paragraph>
      <Paragraph position="3"> While this result is encouraging, the gain must be measured against the added expense of the scenario. Also this approach produces a more smoothed adapted model than the single reference baseline system, so that it may not extend to combinations of large numbers of reference speakers.</Paragraph>
      <Paragraph position="4"> In order to reduce the smoothing inherent in averaging HMM parameters, we have tried combining the reference speakers before the final adapted model is trained.</Paragraph>
      <Paragraph position="5">  The feature normalization component of our system is&amp;quot; designed to superimpose the speech features of one speaker onto another's for the purpose of improving the DTW alignment used for estimating the PDF mapping.</Paragraph>
      <Paragraph position="6"> This same component can be used to transform the features of many reference speakers to a single, common speaker (a prototypical reference speaker). The transformed speech can then be pooled and trained as if it  came from a single reference speaker. The resulting model parameters should be less smoothed (more discriminating) than a model made from similarly pooled, but unnormalized speech.</Paragraph>
      <Paragraph position="7"> A target speaker can be similarly normalized to the prototypicai reference before adapting with the PDF mapping component of the system, exactly as is done in our standard single-reference speaker-adaptation system. null  systems, with speech normalization and PDF mapping. Preliminary results from an experiment designed to test this proposal are shown in Table 5. The table compares performance for a single reference speaker against a 12 speaker reference model across four conditions:  1) cross-speaker recognition (train on reference speaker(s), test on target speaker) 2) speech normalization before cross-speaker recognition 3) PDF transformation of cross-speaker model to adapted  target model 4) speech normalization before PDF transformation of cross-speaker model.</Paragraph>
      <Paragraph position="8"> All conditions in Table 5 are based on the results from 6 target speakers on the designated May88 test set. Two minutes of speech (40 utterances) from the target speaker were used to estimate the speaker transformations. The single reference condition used 600 utterances from our standard reference speaker, RS, to train the reference model. For the 12 speaker reference condition, 11 speakers were normalized (the intended target speaker was held-out) to the prototypical reference speaker, RS. This resulted in a pool of 7200 normalized training utterances for each target speaker. A single codebook was made for the entire experiment from 100 utterances from each of the 13 speakers. The normalization used in this experiment did not include the feature conditioning improvements described earlier. The baseline unit-variance feature scaling was used here. Note that condition (1) shown in Table 1 is identical to the single reference condition, with normalization only, shown here in Table 5. The single reference results show that normalization alone halves the error rate relative to cross-speaker recognition, while PDF mapping alone yields a ten-fold reduction in error rate. When combined, however, the additional gain is small. In the past, this effect has led us to regard the normalization as a way to make small improvements to the DTW-based alignment used for com-. puting the PDF transformation.</Paragraph>
      <Paragraph position="9"> The 12 speaker results, however, show that the normalization alone can be made as powerful as the PDF mapping by utilizing speech from multiple reference speakers. A five-fold reduction in error rate is realized for normalizing 12 reference speakers instead of one. Since the 12 speaker unnormalized control condition (pooled cross-speaker) has not been completed at this writing, we cannot say what proportion of the improvement is due to the normalization procedure, the additional training speech, and the additional reference speakers. As was the case for the single reference condifion, combining the two transformations yields only a small additional improvement.</Paragraph>
      <Paragraph position="10"> While these absolute performance numbers are unimpressive, pooling the normalized speech of only 12 speakers has realized a dramatic reduction in error rate over the single reference normalization. At this point, it makes sense to ask: How much better would this condition be if done on 100 reference speakers? The speaker-independent portion of the DARPA Resource Management database will permit us to answer this question.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>