<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1222">
  <Title>Extracting Phoneme Pronunciation Information from Corpora</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> When trying to match spoken words to dictionary entries during speech recognition, it is useful to be able to generate alternative versions of the spoken sequences of phones to account for the manner in which different speakers pronounce a phone. If we know the probabilities that the component sounds in a sequence of phones are pronounced like other sounds, then likely alternative pronunciations of that sequence of phones can be generated to match against a lexicon of known words. Furthermore, if we also have some idea of how the context within which a phone was uttered affects its pronunciation, we have extra information which can be used to produce more realistic alternative pronunciations.</Paragraph>
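As a minimal sketch of this idea, suppose we already have a table giving the probability that each uttered phone stands for a given intended phoneme. Alternative pronunciations can then be enumerated, scored, and looked up in a lexicon. The confusion table, its probabilities, the tiny lexicon, and the names `alternatives` and `lookup` are all hypothetical. Scoring each phone independently is also a simplifying assumption, since the text notes that context matters too.

```python
from itertools import product

# Hypothetical confusion table: uttered phone -> {intended phoneme: P(phoneme | phone)}.
# The probabilities are invented for illustration only.
CONFUSIONS = {
    "s": {"s": 0.9, "z": 0.1},
    "ih": {"ih": 0.7, "ix": 0.3},
    "n": {"n": 1.0},
}

# Hypothetical lexicon keyed by phoneme sequence.
LEXICON = {("s", "ih", "n"): "sin", ("t", "ih", "n"): "tin"}


def alternatives(phones, top_k=3):
    """Return the top_k (probability, phoneme sequence) candidates for a phone
    sequence, scoring each candidate as a product of per-phone probabilities."""
    # Phones missing from the table are assumed to be pronounced as intended.
    options = [CONFUSIONS.get(p, {p: 1.0}).items() for p in phones]
    scored = []
    for combo in product(*options):
        prob = 1.0
        for _, pr in combo:
            prob *= pr
        scored.append((prob, tuple(phoneme for phoneme, _ in combo)))
    scored.sort(reverse=True)  # most probable candidates first
    return scored[:top_k]


def lookup(phones, top_k=3):
    """Keep only the candidate phoneme sequences that are words in the lexicon."""
    return [(prob, LEXICON[seq]) for prob, seq in alternatives(phones, top_k)
            if seq in LEXICON]


print(alternatives(["s", "ih", "n"]))
print(lookup(["s", "ih", "n"]))
```

With these made-up numbers, "s ih n" ranks first (about 0.63), followed by "s ix n" and "z ih n". Only "s ih n" survives the lexicon filter.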
    <Paragraph position="1"> This paper considers the task of automatically extracting statistical information about how various sound sections of words (phonemes) are pronounced by speakers (as phones) by matching intended phonemes and uttered phones from a transcribed speech corpus. The same approach could be used to gather statistics about how phones recognized (or mis-recognized) by a speech recognizer match the phonemes intended by a speaker.</Paragraph>
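The extraction step can be sketched as relative-frequency counting over aligned (intended phoneme, uttered phone) pairs. This sketch assumes the alignment has already been done. The function name, the use of "-" for a phoneme with no uttered phone, and the sample pairs are illustrative assumptions, not the paper's data or code.

```python
from collections import Counter, defaultdict


def phoneme_given_phone(pairs):
    """Estimate P(intended phoneme | uttered phone) by relative frequency.

    pairs: iterable of (intended_phoneme, uttered_phone) tuples from an
    already-aligned corpus; "-" marks a phoneme that was not uttered.
    """
    counts = defaultdict(Counter)
    for phoneme, phone in pairs:
        counts[phone][phoneme] += 1
    table = {}
    for phone, c in counts.items():
        total = sum(c.values())
        table[phone] = {ph: n / total for ph, n in c.items()}
    return table


# Invented sample pairs, for illustration only.
pairs = [("axr", "er"), ("er", "er"), ("axr", "er"), ("ix", "-"), ("t", "t")]
probs = phoneme_given_phone(pairs)
print(probs["er"])
print(probs["t"])  # {'t': 1.0}
```

Swapping the order of each pair gives P(phone | phoneme) instead. That direction predicts how an intended phoneme will be uttered, rather than inferring the intended phoneme from what was heard.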
    <Paragraph position="2"> This information extraction process is part of the training phase for the lexical access component of a speech recognition system, where the pronunciation probabilities are generated from a training corpus.</Paragraph>
    <Paragraph position="3"> The study was done on the TIMIT corpus (Fisher et al., 1986), a collection of American-English read sentences with correct time-aligned acoustic-phonetic and orthographic (word-aligned) transcriptions.1 The corpus contains 3696 sentences spoken by 462 speakers from 8 different dialect divisions across the United States.</Paragraph>
    <Paragraph position="4"> Previous work by Riley (1989) and Withgott and Chen (1993) used Classification and Regression Trees (CART) on a large number of different features of the corpus (such as gender, dialect and speaking rate) to obtain pronunciation information about intended phonemes. Our system obtains similar results using positional information and context, and using exact matches from uttered phones to intended phonemes to guide other matches.</Paragraph>
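One generic way to let exact matches guide the remaining matches is a Levenshtein-style dynamic-programming alignment in which identical symbols cost nothing, so they act as anchors. This is a standard technique used here for illustration, not necessarily the authors' procedure. The unit costs, the "-" gap symbol, and the example strings (loosely modelled on the word "surveying") are assumptions.

```python
def align(phonemes, phones, sub_cost=1, gap_cost=1):
    """Align intended phonemes to uttered phones by minimum edit cost.

    Exact matches cost 0, so they anchor how the remaining (non-identical)
    symbols are paired. Returns (phoneme, phone) pairs, with "-" marking
    a phoneme with no uttered phone or an uttered phone with no phoneme.
    """
    n, m = len(phonemes), len(phones)
    # dp[i][j] = cheapest cost of aligning phonemes[:i] with phones[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap_cost
    for j in range(1, m + 1):
        dp[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = 0 if phonemes[i - 1] == phones[j - 1] else sub_cost
            dp[i][j] = min(dp[i - 1][j - 1] + diag,
                           dp[i - 1][j] + gap_cost,
                           dp[i][j - 1] + gap_cost)
    # Trace back from the bottom-right cell to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = 0 if phonemes[i - 1] == phones[j - 1] else sub_cost
            if dp[i][j] == dp[i - 1][j - 1] + diag:
                pairs.append((phonemes[i - 1], phones[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + gap_cost:
            pairs.append((phonemes[i - 1], "-"))  # phoneme not uttered
            i -= 1
        else:
            pairs.append(("-", phones[j - 1]))  # extra uttered phone
            j -= 1
    return pairs[::-1]


print(align("s axr v ey ix ng".split(), "s er v ey ng".split()))
# [('s', 's'), ('axr', 'er'), ('v', 'v'), ('ey', 'ey'), ('ix', '-'), ('ng', 'ng')]
```

When several alignments have the same cost, this traceback prefers pairing two symbols over introducing a gap.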
    <Paragraph position="5"> Work by Cohen et al. (1987) on pronunciation used a few fixed sentences spoken by multiple speakers, but did not cover a wide range of words (and thus of phone contexts). Our study considers the pronunciation patterns of a wide range of different speakers using a large collection of words.</Paragraph>
    <Paragraph position="6"> A tree-based system by Lucassen and Mercer (1984) uses an information-theoretic approach to decide on alternative pronunciations based on the classification of a large context feature vector. However, when building their decision tree, they do not evaluate the quality of the resulting tree, i.e., they keep testing attributes until a boundary situation is reached. In contrast, our system initially uses the relative positioning of uttered phones and intended phonemes to determine the phonemes possibly intended by a speaker when uttering a particular phone. The context of a phone is considered only as an extra source of information to qualify these predictions. Further, the method we apply for building decision trees evaluates whether context is meaningful in terms of its predictive power. Riley (1991) implements a similar system using a different method for tree induction, but estimates the probability of an uttered phone given a phoneme context and a partial phone context, whereas we infer an intended phoneme from an uttered phone context.</Paragraph>
    <Paragraph position="7"> Ian Thomas, Ingrid Zukerman and Bhavani Raskutti (1998) Extracting Phoneme Pronunciation Information from Corpora. In D.M.W. Powers (ed.) NeMLaP3/CoNLL98: New Methods in Language Processing and Computational Natural Language Learning, ACL, pp 175-183.</Paragraph>
    <Paragraph position="8"> [Figure: word-by-word alignment of intended phonemes and uttered phones for the example sentence "The Mayan neoclassic scholar disappeared while surveying ancient ruins."]</Paragraph>
    <Paragraph position="10"> 1 Transcriptions were made by a combination of hand transcription, using multiple parametric representations of sentences as a guide, and automatic alignment (Zue and Seneff, 1988). The use of different representations is claimed to be a good way of overcoming dialect biases during transcription (Withgott and Chen, 1993).</Paragraph>
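To illustrate what it means to evaluate context by its predictive power, the sketch below looks at all occurrences of one uttered phone. It measures the information gain that a single context attribute (for example, the broad class of the following phone) provides about the intended phoneme. Information gain with a fixed threshold is an ID3-style criterion chosen here as an assumption, not the paper's stated method. The attribute names, the `min_gain` threshold, and the samples are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a list of intended-phoneme labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(samples, attribute):
    """Reduction in entropy of the intended phoneme when the samples are
    split on one context attribute.

    samples: list of (context_dict, intended_phoneme) for one uttered phone.
    attribute: key into context_dict, e.g. the hypothetical "next_class".
    """
    labels = [target for _, target in samples]
    groups = defaultdict(list)
    for context, target in samples:
        groups[context[attribute]].append(target)
    remainder = sum(len(g) / len(samples) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def worth_splitting(samples, attribute, min_gain=0.1):
    # Hypothetical stopping rule: use a context attribute only if it predicts enough.
    return information_gain(samples, attribute) >= min_gain


# Invented occurrences of one uttered phone, for illustration only.
samples = [
    ({"prev_class": "vowel", "next_class": "nasal"}, "ih"),
    ({"prev_class": "vowel", "next_class": "nasal"}, "ih"),
    ({"prev_class": "vowel", "next_class": "stop"}, "ix"),
    ({"prev_class": "vowel", "next_class": "stop"}, "ix"),
]
print(information_gain(samples, "next_class"))  # 1.0
print(information_gain(samples, "prev_class"))  # 0.0
```

Here the following phone fully determines the intended phoneme, so splitting on it is worthwhile. The preceding phone never varies, so it carries no information and would be rejected.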
    <Paragraph position="11"> Our specific aim is to test two hypotheses: (1) that phonemes are pronounced as phones in the same broad sound category (for example, vowels for vowels and fricatives for fricatives), and (2) that the context of a phone, that is, the attributes of the phones immediately preceding and following it, influences the pronunciation of this phone.</Paragraph>
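Hypothesis (1) can be checked with a simple statistic over aligned pairs: the proportion whose intended phoneme and uttered phone fall in the same broad category. The category map below is a hypothetical, deliberately partial grouping of TIMIT-style symbols. Skipping gaps and unmapped symbols is also a design assumption.

```python
# Hypothetical, deliberately incomplete mapping from TIMIT-style symbols to
# broad sound categories; a real study would cover the full symbol inventory.
BROAD_CLASS = {
    "iy": "vowel", "ih": "vowel", "ix": "vowel", "axr": "vowel", "er": "vowel",
    "s": "fricative", "z": "fricative", "sh": "fricative", "v": "fricative",
    "t": "stop", "d": "stop", "q": "stop",
    "n": "nasal", "ng": "nasal", "en": "nasal",
}


def same_class_rate(pairs):
    """Fraction of (intended_phoneme, uttered_phone) pairs whose symbols share a
    broad category. Pairs containing a gap ("-") or an unmapped symbol are skipped."""
    same = total = 0
    for phoneme, phone in pairs:
        a, b = BROAD_CLASS.get(phoneme), BROAD_CLASS.get(phone)
        if a is None or b is None:
            continue
        total += 1
        same += a == b  # True counts as 1
    return same / total if total else 0.0


# Invented aligned pairs, for illustration only.
pairs = [("s", "s"), ("axr", "er"), ("ix", "-"), ("t", "q"), ("sh", "s"), ("n", "t")]
print(same_class_rate(pairs))  # 0.8
```

Over a whole corpus, a rate close to 1 would be consistent with hypothesis (1). Hypothesis (2) is a different question: whether context attributes change the distribution of intended phonemes.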
  </Section>
</Paper>