<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2034">
  <Title>Speaker Adaptation Using Multiple Reference Speakers</Title>
  <Section position="3" start_page="0" end_page="256" type="metho">
    <SectionTitle>
2 BASELINE SYSTEM DESCRIPTION
</SectionTitle>
    <Paragraph position="0"> Our current baseline speaker-adaptation system consists of two distinct components, both of which estimate transformations between the reference and target speaker, with the goal of making one of them 'look' like the other. The first component estimates a deterministic transformation which is applied to the speech features of the reference (target) speaker. After transformation within this speech normalization component, the speech features of the reference (target) speaker are superimposed upon the feature space of the target (reference). The second component estimates a probabilistic transformation which is applied to the HMM parameters of the reference speaker. After transformation by this PDF mapping component, the modified reference model can be used as an approximation to a well-trained HMM for the target speaker. These two primary components of the system are described in more detail below.</Paragraph>
    <Section position="1" start_page="256" end_page="256" type="sub_section">
      <SectionTitle>
2.1 Speech Normalization
</SectionTitle>
      <Paragraph position="0"> Speech normalization is accomplished by aligning the speech features of the reference and target speakers from a small training set of utterances of known (supervised) and pair-wise identical (script-dependent) transcription.</Paragraph>
      <Paragraph position="1"> Dynamic time warping (DTW) is used to derive the alignment of a given pair of utterances. The alignments can then be used to estimate a deterministic non-parametric transformation to describe differences in the feature spaces of the two speakers. Any unsupervised feature conditioning which can be applied prior to the DTW is also performed by the speech normalization component.</Paragraph>
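      <Paragraph> As an illustration of the alignment step, a bare-bones DTW over Euclidean frame distances might look like the sketch below. This is not the paper's implementation; the original system's local path constraints and distance measure are not specified here, and the unconstrained three-move recursion is an assumption for brevity.

```python
import numpy as np

def dtw_align(ref, tgt):
    """Align two feature sequences (n x d and m x d arrays) by DTW.
    Returns the optimal warping path as (ref_frame, tgt_frame) pairs."""
    n, m = len(ref), len(tgt)
    # Pairwise Euclidean frame distances.
    dist = np.linalg.norm(ref[:, None] - tgt[None, :], axis=-1)
    # Accumulated-cost matrix with a padded row/column of infinities.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal warping path from the end of both sequences.
    path, (i, j) = [], (n, m)
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)], key=lambda p: D[p])
    path.append((0, 0))
    return path[::-1]
```

The path produced here is what the later steps consume: a set of frame-index pairs declaring which reference and target frames correspond.</Paragraph>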
      <Paragraph position="2"> The normalization procedure has been described in \[2\] and is briefly summarized here:
1. Make a VQ codebook for one of the speakers.</Paragraph>
      <Paragraph position="3"> 2. Partition the feature space of one speaker by quantizing that speaker's training speech.</Paragraph>
      <Paragraph position="4"> 3. Map the partitioning to the other speaker through the DTW alignment.</Paragraph>
      <Paragraph position="5"> 4. Compute the means of each sub-population defined by the VQ and the mapped-VQ.</Paragraph>
      <Paragraph position="6"> 5. Shift the features of one speaker by the difference in the means of the corresponding sub-populations.</Paragraph>
      <Paragraph position="7"> 6. Go to (3) if the alignment MSE has not converged.
The speech normalization procedure is typically applied iteratively since each application of steps (3) and (5) above reduces (or leaves unchanged) the MSE of the alignment. Note that, in this procedure, the codebook is used only to partition the space of one speaker into compact regions to define the degrees of freedom in the non-parametric mapping between the speakers. The alignment of the paired utterances is computed on the original (unquantized) speech features.</Paragraph>
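      <Paragraph> The six-step loop above can be sketched in Python as follows. This is an illustrative reconstruction, not the original implementation: it assumes the DTW alignment is supplied as (reference frame, target frame) index pairs, holds that alignment fixed across iterations for brevity (the full procedure recomputes it), uses a toy k-means codebook in place of a real VQ design, and arbitrarily chooses to shift the target speaker's features toward the reference.

```python
import numpy as np

def vq_codebook(feats, k, iters=10, seed=0):
    """Step 1: toy k-means codebook; a real system would use LBG-style VQ."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((feats[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = feats[labels == c].mean(axis=0)
    return centers

def normalize(ref, tgt, align, k=4, max_iter=20, tol=1e-8):
    """Steps 2-6: partition the reference space, map the partition to the
    target via the DTW alignment, and shift each target sub-population by
    the difference of the corresponding means until the MSE converges."""
    centers = vq_codebook(ref, k)
    # Step 2: partition one speaker's space by quantizing that speaker's frames.
    ref_lab = np.argmin(((ref[:, None] - centers) ** 2).sum(-1), axis=1)
    tgt = tgt.astype(float).copy()
    ri = np.array([i for i, _ in align])
    ti = np.array([j for _, j in align])
    prev_mse = np.inf
    for _ in range(max_iter):
        # Step 3: map the reference partition onto the target frames.
        tgt_lab = np.full(len(tgt), -1)
        tgt_lab[ti] = ref_lab[ri]
        # Steps 4-5: shift each target sub-population by the mean difference.
        for c in range(k):
            r, t = ref_lab == c, tgt_lab == c
            if r.any() and t.any():
                tgt[t] += ref[r].mean(axis=0) - tgt[t].mean(axis=0)
        # Step 6: stop once the alignment MSE no longer decreases.
        mse = ((ref[ri] - tgt[ti]) ** 2).sum(axis=-1).mean()
        if mse > prev_mse - tol:
            break
        prev_mse = mse
    return tgt
```

Note that, as in the text, the codebook only defines the degrees of freedom of the mapping; the MSE is always measured on the unquantized features.</Paragraph>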
    </Section>
    <Section position="2" start_page="256" end_page="256" type="sub_section">
      <SectionTitle>
2.2 PDF Mapping
</SectionTitle>
      <Paragraph position="0"> PDF mapping is accomplished by aligning the (normalized) speech features of the reference and target speaker, again using DTW. This final alignment serves to define a pair-wise correspondence between the VQ spectra of the reference and target speakers which can be used to estimate a probabilistic mapping between them. The VQ spectra are determined by independent codebooks made for each of the speakers. The codebook for the target speaker is made from the limited training material available for adaptation. The computed mapping is then used to modify the discrete HMM observation density parameters of the reference model.</Paragraph>
      <Paragraph position="1"> The mapping procedure has been described in \[1\] and is summarized here:
1. Make VQ codebooks for both speakers.</Paragraph>
      <Paragraph position="2"> 2. Quantize the target and reference training speech.
3. Use DTW to define a set of co-occurring VQ pairs.
4. Accumulate frequency counts of the VQ co-occurrences into a count matrix.</Paragraph>
      <Paragraph position="3"> 5. Normalize the count matrix yielding a transformation matrix.</Paragraph>
      <Paragraph position="4"> 6. Apply the transformation matrix to the reference HMM (discrete) observation densities.</Paragraph>
      <Paragraph position="5"> The resulting transformed model is then used directly in recognition as if it were a model derived from the target speaker.</Paragraph>
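      <Paragraph> Steps 3-6 of the mapping procedure can be sketched as follows. This is an illustrative reconstruction: the zero-row handling and the absence of any smoothing of the count matrix are assumptions, and the codebook construction itself is not shown. Row-normalizing the counts yields an estimate of p(target code | reference code), which re-expresses each discrete observation density over the target codebook.

```python
import numpy as np

def pdf_mapping(ref_codes, tgt_codes, align, n_ref, n_tgt):
    """Steps 3-5: accumulate VQ co-occurrence counts over the DTW pairs,
    then row-normalize into a probabilistic transformation matrix."""
    counts = np.zeros((n_ref, n_tgt))
    for i, j in align:
        counts[ref_codes[i], tgt_codes[j]] += 1.0
    rows = counts.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1.0  # assumption: leave rows for unseen codes at zero
    return counts / rows

def map_density(b_ref, T):
    """Step 6: transform a reference discrete observation density (over
    reference VQ codes) into a density over the target codebook."""
    return b_ref @ T
```

Because each row of the transformation matrix sums to one, a properly normalized reference density maps to a properly normalized target-codebook density.</Paragraph>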
      <Paragraph position="6"> The transformation described above can be made more detailed by defining a set of class-dependent matrices and labeling the states of the reference HMM with their class membership. One easily implemented set of equivalence classes for a phoneme-based system such as BYBLOS is the set of phoneme-dependent transformations defined by the phonemes in the lexicon. Since the reference speaker has provided enough speech to train a high-performance speaker-dependent HMM, the model can be used to automatically label the reference speech prior to computing the spectral mapping.</Paragraph>
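      <Paragraph> The class-dependent variant might be sketched like this, assuming each reference frame has already been labelled with its phoneme class by the reference HMM (the automatic labelling itself is not shown, and the per-class row-normalization mirrors the single-matrix case above).

```python
import numpy as np

def class_dependent_mapping(ref_codes, tgt_codes, align, frame_class,
                            n_ref, n_tgt, classes):
    """Accumulate one VQ co-occurrence count matrix per class (e.g. per
    phoneme), routing each DTW pair by the class label of its reference
    frame, then row-normalize each matrix independently."""
    counts = {c: np.zeros((n_ref, n_tgt)) for c in classes}
    for i, j in align:
        counts[frame_class[i]][ref_codes[i], tgt_codes[j]] += 1.0
    T = {}
    for c, m in counts.items():
        rows = m.sum(axis=1, keepdims=True)
        rows[rows == 0] = 1.0  # assumption: unseen rows stay at zero
        T[c] = m / rows
    return T
```

At recognition time, each reference HMM state's observation density would be transformed with the matrix of that state's class.</Paragraph>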
    </Section>
  </Section>
</Paper>