<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1037">
<Title>Minimizing Speaker Variation Effects for Speaker-Independent Speech Recognition</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle>
1. INTRODUCTION
</SectionTitle>
<Paragraph position="0"> For speaker-independent speech recognition, speaker variation is one of the major error sources. As a typical example, the error rate of a well-trained speaker-dependent speech recognition system is one third that of a speaker-independent speech recognition system [11]. To minimize speaker variation effects, we can use either speaker-clustered models [28, 11] or speaker normalization techniques [2, 24, 3, 25, 7]. Speaker normalization is attractive because its application is not restricted to a specific type of speech recognition system. In comparison with speaker normalization techniques, speaker-clustered models not only fragment the training data but also increase the computational complexity substantially, since multiple models must be maintained and compared during recognition.</Paragraph>
<Paragraph position="1"> Recently, nonlinear mapping based on neural networks has attracted considerable attention because of the ability of these networks to adjust their parameters optimally from the training data to approximate the nonlinear relationship between two observed spaces (see [22, 23] for a review), although much remains to be clarified regarding practical applications. Nonlinear mapping between two different observation spaces is of great interest for both theoretical and practical purposes. In the area of speech processing, nonlinear mapping has been applied to noise enhancement [1, 32], articulatory motion estimation [29, 18], and speech recognition [16]. Neural networks have also been used successfully to transform data of a new speaker to a reference speaker for speaker-adaptive speech recognition [11]. In this paper, we study how neural networks can be employed to minimize speaker variation effects for speaker-independent speech recognition. The network is used as a nonlinear mapping function to transform speech data between two speaker clusters. The mapping function we used is characterized by three important properties. First, the assembly of mapping functions enhances overall mapping quality. Second, multiple input vectors are used simultaneously in the transformation. Finally, the mapping function is derived from training data, and its quality depends on the amount of available training data.</Paragraph>
<Paragraph position="2"> We used the DARPA Resource Management (RM) task [27] as our domain to investigate the performance of speaker normalization. The 997-word RM task is a database query task designed from 900 sentence templates [27]. We used a word-pair grammar that has a test-set perplexity of about 60.</Paragraph>
<Paragraph position="3"> The speaker-independent training speech database consists of 3990 training sentences from 109 speakers [26]. The test set comprises a total of 600 sentences from 20 speakers. We used all training sentences to create multiple speaker clusters.</Paragraph>
<Paragraph position="4"> A codeword-dependent neural network is associated with each speaker cluster. The cluster that contains the largest number of speakers is designated as the golden cluster. The objective function is to minimize distortions between acoustic data in each cluster and the golden speaker cluster. Performance evaluation showed that the speaker-normalized front end reduced the error rate by 15% on the DARPA Resource Management speaker-independent speech recognition task.</Paragraph>
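As a rough illustration of this cluster-to-cluster normalization idea, the sketch below trains a small one-hidden-layer network to map a window of consecutive frames from one speaker cluster toward time-aligned frames of the golden cluster, by gradient descent on the squared distortion. It is a minimal sketch, not the authors' implementation: the frame dimensionality, the three-frame context window, the tanh hidden layer, the learning rate, and the synthetic "aligned" data are all illustrative assumptions, and the paper's full system keeps one such network per codeword rather than the single network shown here.

# Minimal sketch (not the paper's implementation) of speaker normalization
# by a nonlinear mapping network between two speaker clusters.
# Assumptions: 13-dim cepstral frames, a 3-frame input window, one tanh
# hidden layer; the full system would keep one such network per codeword.
import numpy as np

rng = np.random.default_rng(0)

DIM = 13       # cepstral coefficients per frame (assumed)
CONTEXT = 3    # multiple input frames used simultaneously
HIDDEN = 32    # hidden units (illustrative)

def init_mlp(n_in, n_hidden, n_out):
    """One-hidden-layer network acting as the nonlinear mapping function."""
    return {"W1": rng.normal(0, 0.1, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
            "W2": rng.normal(0, 0.1, (n_hidden, n_out)), "b2": np.zeros(n_out)}

def forward(net, x):
    h = np.tanh(x @ net["W1"] + net["b1"])
    return h, h @ net["W2"] + net["b2"]

def train_step(net, x, target, lr=1e-3):
    """One gradient step on the squared distortion to the golden cluster."""
    h, y = forward(net, x)
    err = y - target                       # distortion to be minimized
    dW2 = np.outer(h, err)                 # backprop through output layer
    db2 = err
    dh = (err @ net["W2"].T) * (1 - h**2)  # backprop through tanh layer
    dW1 = np.outer(x, dh)
    db1 = dh
    for k, g in zip(("W1", "b1", "W2", "b2"), (dW1, db1, dW2, db2)):
        net[k] -= lr * g
    return 0.5 * float(err @ err)

# Toy stand-ins for two speaker clusters: frames_cluster holds frames from
# one cluster, frames_golden time-aligned frames from the golden (largest)
# cluster; real use requires actually aligned speech from both clusters.
T = 200
frames_cluster = rng.normal(size=(T, DIM))
frames_golden = frames_cluster @ rng.normal(0, 0.3, (DIM, DIM))  # fake speaker warp

net = init_mlp(DIM * CONTEXT, HIDDEN, DIM)
for epoch in range(20):
    total = 0.0
    for t in range(CONTEXT - 1, T):
        window = frames_cluster[t - CONTEXT + 1 : t + 1].ravel()  # context window
        total += train_step(net, window, frames_golden[t])

At recognition time, under this sketch's assumptions, each incoming frame window would be routed to the network of its speaker cluster (and, in the codeword-dependent version, of its codeword) and replaced by the network's output before decoding.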
<Paragraph position="5"> This paper is organized as follows. In Section 2, the SPHINX-II speech recognition system is reviewed. Section 3 presents the neural network architecture. Section 4 discusses its application to speaker-independent speech recognition. Our findings are summarized in Section 5.</Paragraph>
</Section>
</Paper>