<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-1011">
  <Title>Speaker Adaptation from Limited Training in the BBN BYBLOS Speech Recognition System</Title>
  <Section position="4" start_page="0" end_page="102" type="metho">
    <SectionTitle>
3. Methods For Computing the Transformation
</SectionTitle>
    <Paragraph position="0"> In 1987 \[5\] we reported a new algorithm for estimating a probabilistic spectral mapping between two speakers. The transformation in this method is equivalent to expanding the HMM of the prototype, replacing each state by N states and connecting them in parallel by N transitions. The transition probabilities on each of these paths axe then p(kils, ), which are the original prototype probabilities for each spectrum, i, given the state, ~. The pdf at each new state on these paths is p(k~lk~, C/(~)) which corresponds to one row of the transformation matrix, TC/(s).</Paragraph>
    <Paragraph position="1"> Since the conditional probability, i~(k~18 ) in equation (3) is computed by the expanded HMM, the familiar forward-backward procedure can be used to estimate T~b(s ). The target speech is first quantized by the prototype codebook and is automatically aligned against the prototype model.</Paragraph>
    <Paragraph position="2"> This method worked very well with low perplexity grammars but performance degraded unacceptably as the perplexity of the grammar increased.</Paragraph>
    <Paragraph position="3"> We found that cross-speaker quantization was a significant factor in the performance degradation. Also, the transformed pdfs were excessively smooth. We think that the original models, which have been smoothed appropriately for the prototype by interpolating context-dependent phoneme models, may not be specific enough to preserve important detail under the transformation.</Paragraph>
    <Paragraph position="4"> To overcome these problems, we investigated a text-dependent procedure which is described in \[2\]. In this method we constrain the prototype and target speaker to say a common set of training sentences. A class labeling, C/(s), is derived for each frame of prototype speech by using the prototype HMM to perform recognition while constrained to the correct word sequence. Matching pairs of utterances are time-aligned using a DTW procedure on the parameters of the training speech.</Paragraph>
    <Paragraph position="5"> This alignment of the speech frames defines a set of spectral co-occurrence triplets, {(k~, ki, C/(s))}, for all i, j, which can be counted to estimate the elements of each matrix TC/(a) directly.</Paragraph>
    <Paragraph position="6"> In this method the target speech is quantized by a codebook derived from the target's own training data thereby eliminating the cross-speaker quantization problem. The smoothing problem is overcome by using the prototype speech itself as the prototype model while estimating the transformation.</Paragraph>
    <Paragraph position="7"> We found that the second method outperformed the first using 30 seconds of training speech and an artificial grammar of perplexity .60. This remained true even after controlling for the quantization problem of the first method by adapting the prototype codebook after the manner of \[6\].</Paragraph>
    <Paragraph position="8"> Several enhancements have been made to the DTW-based method. As described in \[3\], we introduced an iterative normalization procedure which modifies the speech parameters of one speaker by shifting them toward the other speaker. A VQ codebook partitions the speech of one speaker into groups of spectra which quantize to a common VQ codeword. The DTW alignment maps the partition onto corresponding groups of spectra for the other speaker. The shift is then determined by the difference vector between the means of these corresponding groups of spectra. Each iteration of aligning and shifting reduces the mean-squared error of the aligned speech parameters until convergence.</Paragraph>
    <Paragraph position="9"> More recently, we have used additional features in the DTW to improve the alignment between utterances, and additional codebooks in the HMM to improve the prototype model.</Paragraph>
    <Paragraph position="10">  2. Basic Approach to Speaker Adaption We view the problem of speaker adaptation as one of modeling the difference between two speakers. One of the speakers, who we call the prototype, is represented by a speaker-dependent HMM trained from large amounts of speech data. The other speaker, called the target, is represented by only a small sample of speech. If the difference between the speakers can be successfully modeled, then one strategy for speaker adaptation is to make the prototype speaker look like the target speaker. This can be accomphshed by finding a transformation which can be apphed to the prototype HMM that makes it into a good model of the target speech.</Paragraph>
    <Paragraph position="11"> The difference between speakers is a complex one, involving the interaction of spectral, articulatory, phonological, and dialectal influences. A non-parametric probabihstic mapping between the VQ spectra of the two speakers has appropriate properties for such a problem. A probabilistic transformation can capture the many-to-many mapping typical of the differences between speakers and it can be made robust even when estimated from sparse data. Non-parametricity makes few constraining assumptions about the data under transformation. Mapping VQ spectra between speakers constrains the transformation to dimensionswhich can be estimated reasonably from the limited training data.</Paragraph>
    <Paragraph position="12"> We begin with high performance speaker-dependent phonetic models which have been trained from a large sample of speech from the prototype speaker. The speaker-dependent training procedure in the BYBLOS system has been described in \[1\]. For each state of the prototype HMM, we have a discrete probability density function (pdf) represented here as a row vector:</Paragraph>
    <Paragraph position="14"> where p(kils) is the probability of the VQ label ki at state s of the prototype HMM model, and N is the size of the VQ codebook.</Paragraph>
    <Paragraph position="15"> The elements of the desired transformed pdf, p'(s), can be computed from:</Paragraph>
    <Paragraph position="17"> Since we have insufficient data to estimate a separate transformation for each state we approximate</Paragraph>
    <Paragraph position="19"> where C/(s) specifies an equivalence class defined on the states s.</Paragraph>
    <Paragraph position="20"> For each of the classes, C/(s), the set of conditional probabihties, {p(k~,ki, C/(s))}, for an i and j form an N x N matrix, TC/(,), which cart be interpreted as a probabilistic transformation matrix from one speaker's spectral space to another's. We can then rewrite the computation of the transformed pdf, p'(s), as the product of the prototype row vector, p(s), and the matrix, TC/(deg):</Paragraph>
    <Paragraph position="22"> There are many ways to estimate TC/(s). We describe next two procedures that we have investigated. null</Paragraph>
  </Section>
  <Section position="5" start_page="102" end_page="104" type="metho">
    <SectionTitle>
4. Experimental Results
</SectionTitle>
    <Paragraph position="0"> The DARPA Resource Management database \[4\] defines a protocol for evaluating speaker adaptive recognition systems which is constrained to use 12 sentences common to all speakers in the database. To avoid problems due to unobserved spectra, we have chosen to develop our speaker adaptation methods on a larger training set, which restricts us to the speaker-dependent portion of the database for performance evaluation.</Paragraph>
    <Paragraph position="1"> This segment of the database includes training and test data for 12 speakers sampled from representative dialects of the United States. We have used the first 40 utterances (2 minutes of speech) of the designated training material for our limited training sample. Two development test sets have been defined by the National Institute of Standards and Technology (NIST). These test sets consist of 25 utterances for each speaker. Each test set is drawn from different sentence texts and includes about 200 word tokens.</Paragraph>
    <Paragraph position="2"> For all of our experiments, we have used one male prototype speaker originally from the New York area. 30 minutes of speech (600 sentences) were recorded at BBN in a normal office environment and used to train the prototype HMM. The speech is sampled at 20 kHz and analysed into 14 mel-frequency cepstral coefficients at a frame rate of 10 ms. 14 cepstral derivatives, computed as a linear regression over 5 adjacent frames, are derived from the original coefficients. The transformation matrices are made to be phoneme-dependent by defining the equivalence classes, ~b(s), over the 61 phonemes in the lexicon.</Paragraph>
    <Paragraph position="3">  grammar and the Oct. '87 test set.</Paragraph>
    <Paragraph position="4"> We have performed our development work on 8 speakers using the test set designated by NIST as Oct. '87 test. The results of this work, using the standard word-pair grammar, are summarized in Table 1, where: % Word Error = 100 x \[(substitutions + deletions + insertions) / number of word tokens\] For each experiment we show the number of features used in the DTW alignment, whether the iterative normalization procedure was used, and the number of codebooks used in recognition. Using experiment (1) as a baseline, the table shows a 45% decrease overall in word error rate for using all three improvements together. Comparing experiments using 14 features with their counterparts using 28 features shows that the contribution due to the differential features is roughly a 10% - 14% reduction in error rate. A similar comparison for using/not-using the normalization  reveals a 9% - 17% reduction. Finally, using the second codebook reduces the error rate by 26% 29%. null It should be mentioned that the 40 sentences used for training in these experiments are drawn equaUy from 6 recording sessions separated by several days. Furthermore, the test data is from another session altogether. For the adaptation methods described here, it is reasonable to assume that the training data would be recorded in a single session and only a few minutes before the transformed models were ready for use. This means that the adaptation training and test data should realistically come from the same recording session. From earlier published experiments using single-session training and test, we believe the multi-session material accounts for about I/5 of the total word error for the experiments reported here.</Paragraph>
    <Paragraph position="5">  We evaluated the three improvements to the system by testing on new data designated as the May 88 test set which is defined for 12 speakers. For this experiment, we added 2 features, normalized energy and differential energy, and an additional codebook for the energy features. All parameters for this experiment were fixed prior to testing. The results shown in Table 2 were obtained from the first run of the system on the May '88 test data. All entries in the table are percentages, where: % Word Correct = 100 x \[1 - (substitutions + deletions) / number of word tokens\] % Sentence Error = 100 x \[number of sentences with any error / number of sentences\] and % Word Error is defined as in Table 1.</Paragraph>
    <Paragraph position="6"> The speakers in Table 2 are ordered from the top by increasing word error rate. It is evident from the table that the speakers cluster into two distinct performance groups. It is remarkable that all 5 female speakers are included in the higher performance group despite the fact that the prototype is male. The ordering of speakers shown here is not predicted by their speaker-dependent  performance or by subjective listening.</Paragraph>
    <Paragraph position="7"> The average word error rate of 8.4% for this test set is comparable to previously reported results from speaker-independent systems on this identical test set. Using training from 105 speakers (4200 sentences), the word error rates for the Sphinx system of CMU was 8.9% and for the Lincoln Labs system; 10.1%. New results from these systems, on different test data but from the same 12 speakers, are reported elsewhere in these proceedings.</Paragraph>
  </Section>
class="xml-element"></Paper>