<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1054">
  <Title>A Study on Speaker-Adaptive Speech Recognition</Title>
  <Section position="2" start_page="0" end_page="278" type="intro">
    <SectionTitle>
1 INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Speaker-independent speech recognition systems can provide users with a ready-to-use system [1, 2, 3, 4]. There is no need to collect speaker-specific data to train the system; instead, data are collected from a variety of speakers so that many different speakers can be modeled reliably. Speaker-independent systems are clearly desirable in applications where speaker-specific data do not exist. On the other hand, if speaker-dependent data are available, the system can be adapted to a specific speaker to further reduce the error rate. The problem with speaker-dependent systems is that for large-vocabulary continuous speech recognition, about half an hour of speech from the specific speaker is generally needed to reliably estimate system parameters. The problem with speaker-independent systems is that their error rate is generally two to three times higher than that of speaker-dependent systems [2, 3]. A logical compromise for a practical system is to start with a speaker-independent system and then adapt it to each individual user.</Paragraph>
    <Paragraph position="1"> Since adaptation starts from the speaker-independent system with only limited adaptation data, a good adaptation algorithm should be consistent with the speaker-independent parameter estimation criterion and should adapt those parameters that are least sensitive to the limited training data. Two parameter sets, the codebook mean vectors and the output distributions, are modified within the maximum likelihood estimation framework according to the characteristics of each speaker. In addition to modifying those parameters, speaker normalization using neural networks is also studied, in the hope that acoustic data normalization will not only rapidly adapt the system but also enhance the robustness of speaker-independent speech recognition.</Paragraph>
    <Paragraph position="2"> The codebook mean vectors can represent the essential characteristics of different speakers and can be estimated rapidly from only limited training data [5, 6, 7]. Because of this, they are considered the most important parameter set. The semi-continuous hidden Markov model (SCHMM) [8] is a good tool for modifying the codebook for each speaker. Starting from robust speaker-independent models, the codebook is modified according to the SCHMM structure such that the SCHMM likelihood is maximized for the given speaker. This estimation procedure considers both phonetic and acoustic information. Another important parameter set is the output distributions (weighting coefficients) of the SCHMM. Since there are too many parameters in the output distributions, direct use of the SCHMM would not lead to any improvement. The speaker-dependent output distributions are therefore shared (by clustering) with each other if they exhibit a certain acoustic similarity. Analogous to Bayesian learning [9], speaker-independent estimates can then be interpolated with the clustered speaker-dependent output distributions.</Paragraph>
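The interpolation of speaker-independent estimates with clustered speaker-dependent output distributions can be sketched as below. This is a minimal illustration, not the paper's exact procedure: the count-based interpolation weight and the prior-strength parameter `tau` are hypothetical choices standing in for the Bayesian-style smoothing the paper describes.

```python
import numpy as np

def interpolate_output_distribution(si_dist, sd_dist, sd_count, tau=50.0):
    """Interpolate a speaker-independent (SI) output distribution with a
    clustered speaker-dependent (SD) estimate.  The SD weight grows with the
    amount of adaptation data, analogous to Bayesian (MAP) smoothing.

    si_dist, sd_dist : arrays of codeword weights, each summing to 1
    sd_count         : number of adaptation frames behind sd_dist
    tau              : prior strength (illustrative value, not from the paper)
    """
    lam = sd_count / (sd_count + tau)              # data-driven weight in [0, 1)
    mixed = lam * np.asarray(sd_dist) + (1.0 - lam) * np.asarray(si_dist)
    return mixed / mixed.sum()                     # renormalize to a distribution
```

With no adaptation data the result falls back to the speaker-independent estimate; with abundant data it approaches the speaker-dependent one.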
    <Paragraph position="3"> In addition to modifying the codebook and output distribution parameters, speaker normalization techniques are also studied in the hope that speaker normalization can not only adapt the system rapidly but also enhance the robustness of speaker-independent speech recognition [10]. Normalization of the cepstrum has also achieved many successful results in environment adaptation [11]. The normalization techniques proposed here involve a cepstrum transformation from any target speaker to the reference speaker. For each cepstrum vector X, the normalization function F(X) is defined such that the SCHMM probability Pr(F(X) | M) is maximized, where M can be either a speaker-independent or a speaker-dependent model, and F(X) can be either a simple function like AX + B or a complicated nonlinear function. Thus, a speaker-dependent function F(X) can be used to normalize the voice of any target speaker to a chosen reference speaker, or a speaker-independent function F(X) can be built to reduce speaker differences before speaker-independent training, so that the speaker-independent models are more accurate.</Paragraph>
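The simple affine case F(X) = AX + B can be sketched as follows. Note the simplifying assumptions: the paper estimates A and B by maximizing SCHMM likelihood, whereas this sketch assumes time-aligned target/reference frame pairs are available and fits the transformation by least squares; the function names are hypothetical.

```python
import numpy as np

def fit_affine_normalization(target, reference):
    """Fit F(x) = A x + b mapping target-speaker cepstra toward a reference
    speaker.  Assumes paired (time-aligned) frames and uses least squares,
    a simplification of the paper's SCHMM likelihood maximization.

    target, reference : (T, D) arrays of paired cepstrum frames
    """
    T, D = target.shape
    X = np.hstack([target, np.ones((T, 1))])   # augment with a bias column
    W, *_ = np.linalg.lstsq(X, reference, rcond=None)
    A, b = W[:D].T, W[D]                       # split linear part and bias
    return A, b

def normalize(x, A, b):
    """Apply the fitted affine normalization to cepstrum frames x."""
    return x @ A.T + b
```

When the frames really are related by an affine map, least squares recovers it exactly; the paper's point is that for real speakers such a linear map is not expressive enough, motivating the nonlinear F(X) below.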
    <Paragraph position="4"> In this paper, the DARPA Resource Management task is used as the domain to investigate the performance of speaker-adaptive speech recognition. An improved speaker-independent speech recognition system, SPHINX [12], is used as the baseline system here. The error rate for the RM2 test set, consisting of two male (JLS and LPN) and two female (BJW and JRM) speakers with 120 sentences each, is 4.3%. This result is based on the June 1990 system [13]. More recent results using the shared SCHMM, which led to an additional 15% error reduction [12], are not included.</Paragraph>
    <Paragraph position="5"> The proposed techniques have been evaluated on the RM2 test set. With 40 adaptation sentences for each speaker (randomly extracted from the training set, with triphone coverage around 20%), the parameter adaptation algorithms reduced the error rate to 3.1%. In comparison with the best speaker-independent result on the same test set, the error rate is reduced by more than 25%. Since the proposed algorithm can be used to incrementally adapt the speaker-independent system, the number of adaptation sentences was incrementally increased to 300-600.</Paragraph>
    <Paragraph position="6"> With only 300 adaptation sentences, the error rate is lower than that of the best speaker-dependent system on the same test set (trained with 600 sentences). For speaker normalization, two experiments were carried out. In the first experiment, a transformation matrix A and a bias vector B are defined such that the speaker-independent SCHMM probability Pr(AX + B | M) is maximized. The error rate for the same test set with speaker-independent models is 3.9%. This indicates that a linear transformation is insufficient to bridge the differences among speakers. Because of this, a multi-layer perceptron (MLP) trained with the back-propagation algorithm [14, 15] is employed for cepstrum transformation. When the speaker-dependent model is used, the recognition error rate for other speakers is 41.9%, which indicates vast differences among speakers. However, when 40 speaker-dependent training sentences are used to build the MLP, the error rate is reduced to 6.8%, which demonstrates the ability of MLP-based speaker normalization. The paper is organized as follows. In Section 2, the baseline system for this study is described. Section 3 describes the techniques used for speaker-adaptive speech recognition, which consist of codebook adaptation, output distribution adaptation, and cepstrum normalization.</Paragraph>
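The MLP-based cepstrum transformation can be illustrated with a minimal one-hidden-layer network trained by back-propagation. This is a sketch under simplifying assumptions: like the affine case above, it assumes paired target/reference frames and a squared-error objective rather than the paper's setup; the hidden size, learning rate, and epoch count are illustrative guesses.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_mlp_normalizer(target, reference, hidden=32, lr=0.05, epochs=300):
    """Train a one-hidden-layer perceptron mapping target-speaker cepstra to
    reference-speaker cepstra via plain gradient descent on squared error.

    target, reference : (T, D) arrays of paired cepstrum frames
    Returns a function applying the learned normalization.
    """
    T, D = target.shape
    W1 = rng.normal(0, 0.1, (D, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, D)); b2 = np.zeros(D)
    for _ in range(epochs):
        h = np.tanh(target @ W1 + b1)          # hidden activations
        y = h @ W2 + b2                        # predicted reference cepstra
        err = y - reference                    # gradient of 0.5 * squared error
        gW2 = h.T @ err / T; gb2 = err.mean(0)
        dh = (err @ W2.T) * (1.0 - h**2)       # back-propagate through tanh
        gW1 = target.T @ dh / T; gb1 = dh.mean(0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return lambda x: np.tanh(x @ W1 + b1) @ W2 + b2
```

The tanh hidden layer is what lets the network represent the nonlinear speaker differences that the affine transformation above cannot capture.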
  </Section>
</Paper>