<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4021">
  <Title>Feature-based Pronunciation Modeling for Speech Recognition</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Pronunciation variation in spontaneous speech has been cited as a serious obstacle for automatic speech recognition (McAllester et al., 1998). Typical pronunciation models approach this problem by augmenting a phonemic dictionary with additional pronunciations, often resulting from the application of phone substitution, insertion, and deletion rules. By carefully constructing a rule set (Hazen et al., 2002), or by deriving rules or variants from data (Riley and Ljolje, 1996), many phenomena can be accounted for. However, the recognition improvement over a phonemic dictionary is typically modest, and some types of variation remain awkward to represent.</Paragraph>
    <Paragraph position="1"> These observations have motivated approaches to speech recognition based on multiple streams of linguistic features rather than a single stream of phones (e.g., King et al. (1998); Metze and Waibel (2002); Livescu et al. (2003)). Most of this work, however, has focused on acoustic modeling, i.e. the mapping between the features and acoustic observations. The pronunciation model is typically still phone-based, limiting the feature values to the target configurations of phones and forcing them to behave as a synchronous &amp;quot;bundle&amp;quot;. Some approaches have begun to relax these constraints. For example, Deng et al. (1997) and Richardson et al. (2000) model asynchronous feature trajectories using hidden Markov models (HMMs), with each state corresponding to a vector of feature values. This approach is powerful, but it cannot represent independencies between features. Kirchhoff (1996), in contrast, models the feature streams as independent, except for a requirement that they synchronize at syllable boundaries. As pointed out by Ostendorf (2000), such independence assumptions may allow for too much variability.</Paragraph>
    <Paragraph position="2"> In this paper, we propose a more general feature-based pronunciation model implemented using dynamic Bayesian networks (Dean and Kanazawa, 1989), which allow us to take advantage of inter-feature independencies while avoiding overly strong independence assumptions. In the following sections, we describe the model and present proof-of-concept experiments using phonetic transcriptions of utterances from the Switchboard conversational speech corpus (Greenberg et al., 1996).</Paragraph>
  </Section>
class="xml-element"></Paper>