<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1037">
  <Title>Minimizing Speaker Variation Effects for Speaker-Independent Speech Recognition</Title>
  <Section position="4" start_page="193" end_page="193" type="evalu">
    <SectionTitle>
4. EXPERIMENTAL EVALUATION
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="193" end_page="193" type="sub_section">
      <SectionTitle>
4.1. Experiment conditions
</SectionTitle>
      <Paragraph position="0"> Through this study, only the cepstral vectors are considered for normalization. Once we have the normalized cepstral vector, the first-order and second-order time derivatives can be computed. We first clustered all the speakers in the training set into male and female clusters, and then generated 10 speaker-clusters for male and 7 speaker-clusters for female.</Paragraph>
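The section does not specify the clustering algorithm used to produce the 10 male and 7 female speaker clusters. As an illustration only, a deterministic k-means over per-speaker mean cepstral vectors is one plausible way to form such clusters; the function name and farthest-point initialization are assumptions, not the paper's method.

```python
import numpy as np

def cluster_speakers(speaker_means, n_clusters, n_iter=50):
    """Toy k-means over per-speaker mean cepstral vectors.

    speaker_means: (n_speakers, dim) array. The paper clusters male and
    female speakers separately (10 male / 7 female clusters); the actual
    clustering procedure is not given here, so this is a sketch.
    """
    X = np.asarray(speaker_means, dtype=float)
    # Farthest-point initialization: deterministic and well spread.
    centers = [X[0]]
    for _ in range(n_clusters - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.stack(centers)
    for _ in range(n_iter):
        # Assign each speaker to its nearest center, then re-estimate.
        labels = np.linalg.norm(X[:, None] - centers[None], axis=-1).argmin(axis=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers
```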
      <Paragraph position="1"> We selected two golden speaker-clusters for both male and female. There were 13 and 6 speakers in the male and female golden cluster respectively. To provide learning examples for network learning, we first segmented all the training utterances into triphones using Viterbi alignment and then used the DTW algorithm to warp the data to the corresponding triphone pairs in the golden speaker-cluster. Thus, for a given frame of each training speaker, the desired output frame for network learning is the golden speaker frame paired in the DTW optimal path.</Paragraph>
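The DTW pairing described above can be sketched as follows: align a training speaker's triphone segment against the corresponding golden-cluster segment, and read the frame pairs off the optimal warping path. This is a minimal sketch of standard DTW with symmetric steps, not the exact distance measure or path constraints used in the paper.

```python
import numpy as np

def dtw_align(src, ref):
    """Dynamic time warping between two cepstral frame sequences.

    Returns the optimal warping path as (src_frame, ref_frame) index
    pairs, so each training frame gets a golden-speaker target frame
    for network learning, as described in the text.
    """
    n, m = len(src), len(ref)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(src[i - 1] - ref[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrace from (n, m) to recover the optimal path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```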
    </Section>
    <Section position="2" start_page="193" end_page="193" type="sub_section">
      <SectionTitle>
4.2. Benchmark Experiments
</SectionTitle>
      <Paragraph position="0"> As benchmark experiments, speaker-independent speech recognition using SPHINX-II was first evaluated. The word error rate used here reflects all three types of errors and is computed as (substitutions + deletions + insertions) divided by the total number of words in the reference transcription.</Paragraph>
      <Paragraph position="2"> The average error rate was 3.8% for speaker-independent speech recognition.</Paragraph>
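The word error rate defined above is conventionally computed by a Levenshtein alignment of the hypothesis against the reference word sequence; a minimal sketch:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / len(ref),
    with the error counts obtained from a minimum-edit-distance
    alignment of the two word sequences."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # all deletions
    for j in range(m + 1):
        d[0][j] = j          # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / n
```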
    </Section>
    <Section position="3" start_page="193" end_page="193" type="sub_section">
      <SectionTitle>
4.3. Normalization Results
</SectionTitle>
      <Paragraph position="0"> The input of the network consists of three frames from the new speaker. Here, 12 cepstral coefficients and the energy are used together, giving 13 coefficients per frame and thus 39 input units in the network.</Paragraph>
      <Paragraph position="1"> The output of the network has 13 units corresponding to the normalized frame, which is made to approximate the frame of the desired reference speaker. The energy output is discarded as it is relatively unstable. The objective function for network learning is to minimize the distortion (mean squared error) between the network output and the desired reference speaker frame.</Paragraph>
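The forward pass of such a normalization network can be sketched as below. The exact functional form of the generalized sigmoid is not given in this section, so the parameterization f(x) = α / (1 + exp(−γx)) − β with the stated constants α = 4.0, β = 1.8, γ = 2.0 is an assumption; the function and weight names are illustrative.

```python
import numpy as np

# Assumed parameterization of the generalized sigmoid; the paper states
# only the constants (4.0, 1.8, 2.0), not the functional form.
ALPHA, BETA, GAMMA = 4.0, 1.8, 2.0

def gen_sigmoid(x):
    return ALPHA / (1.0 + np.exp(-GAMMA * x)) - BETA

def cdnn_forward(x, W1, b1, W2, b2):
    """Forward pass of the normalization network: 39 inputs (three
    frames of 12 cepstra + energy), 20 hidden units with the
    generalized sigmoid, 13 linear outputs approximating the
    reference-speaker frame."""
    h = gen_sigmoid(W1 @ x + b1)
    return W2 @ h + b2

def mse(y_pred, y_ref):
    # Training objective: distortion between the network output and
    # the golden-speaker frame paired by DTW.
    return float(np.mean((y_pred - y_ref) ** 2))
```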
      <Paragraph position="2"> The network has one hidden layer with 20 hidden units. Each hidden unit is associated with the generalized SIGMOID function, where alpha, beta and gamma are predefined to be 4.0, 1.8, and 2.0 respectively. They are fixed for all the experiments conducted here. The weights and offsets in the network were initialized with small random values. The learning step and momentum are controlled dynamically. Experimental experience indicates that 300 to 600 epochs are required to achieve acceptable distortion. We created two golden speaker clusters for male and female respectively. There were seven female clusters and ten male clusters, which were designed according to the available amount of male/female training data. For each speaker cluster, we built a cluster-dependent codebook (size 16). For the input speech signal, joint VQ pdfs are used to select the top 2-5 clusters for normalization. Thus, let lambda_i denote the probability that the acoustic vector belongs to cluster i, and let x_i denote the normalized vector produced by the ith cluster-dependent CDNN. The normalized vector x can then be computed as the lambda_i-weighted sum of the x_i over the selected clusters.</Paragraph>
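The cluster-weighted combination just described can be sketched as follows, treating each cluster-dependent CDNN as a callable. Renormalizing the weights over the selected top clusters is an assumption; the paper does not state how the probabilities of the discarded clusters are handled.

```python
import numpy as np

def normalize_frame(frame, cluster_probs, cluster_networks, top_n=3):
    """Combine cluster-dependent normalizations: the final vector is
    the probability-weighted sum of the per-cluster CDNN outputs over
    the top-ranked clusters (the paper selects the top 2-5)."""
    # Rank clusters by the joint VQ probability and keep the top few.
    top = sorted(range(len(cluster_probs)), key=lambda i: -cluster_probs[i])[:top_n]
    weights = np.array([cluster_probs[i] for i in top])
    weights = weights / weights.sum()   # renormalize over selected clusters
    outputs = np.stack([cluster_networks[i](frame) for i in top])
    return (weights[:, None] * outputs).sum(axis=0)
```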
      <Paragraph position="4"> With the same training conditions as used in SPHINX-II, the speaker-normalized front-end reduced the error rate from 3.8% to 3.3%, a 15% relative error reduction. This modest reduction indicates that the mapping quality still needs to be improved substantially.</Paragraph>
    </Section>
  </Section>
</Paper>