<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4021"> <Title>Feature-based Pronunciation Modeling for Speech Recognition</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Serval [sic] examples </SectionTitle> <Paragraph position="0"> To help ground the discussion, we first present several examples of pronunciation variation. One common phenomenon is the nasalization of vowels preceding nasal consonants. This is a result of asynchrony: The velum is lowered before the oral closure is made. In more extreme cases, the nasal consonant is entirely absent, leaving only a nasalized vowel, as in can't ! [ k ae n t ] 1. All of the feature values are still correct, although phonetically, this would be described as a deletion.</Paragraph> <Paragraph position="1"> Another example, taken from the Switchboard corpus, is several ! [s eh r v ax l]. In this case, the tongue and lips have desynchronized to the point that the tongue 1Here and throughout, we use the ARPAbet phonetic symbol set with additional diacritics, such as &quot; n&quot; for nasalization. retroflexion for [r] starts and ends before the lip narrowing gesture for [v]. Again, all of the feature streams are produced correctly, but there is an apparent exchange of two phones, which cannot be represented via singlephone confusions conditioned on phonemic context.</Paragraph> <Paragraph position="2"> A final example from Switchboard is everybody ! [eh r uw ay]. It is difficult to imagine a set of phonetic transformations that would predict this pronunciation without allowing a host of other impossible pronunciations. However, when viewed in terms of features, the transformation from [eh v r iy bcl b ah dx iy] to [eh r uw ay] is fairly simple. The tongue and lips desynchronize, causing the lips to start to close for the [bcl] during the previous vowel. In addition, the lip constrictions for [bcl] and [v], and the tongue tip gesture for [dx], are reduced. We will return to this example in the sections below.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Approach </SectionTitle> <Paragraph position="0"> A feature-based pronunciation model is one that explicitly models the evolution of multiple underlying linguistic feature streams to predict the allowed realizations of a word and their probabilities. Our approach begins with the usual assumption that each word has one or more target phonemic pronunciations, or baseforms. Each base-form is converted to a table of underlying feature values.</Paragraph> <Paragraph position="1"> Table 1 shows what part of this table might look like for the word everybody. The table may include &quot;unspecified&quot; values ('*' in the table). More generally, each table entry can be a distribution over the range of feature values.</Paragraph> <Paragraph position="2"> For now, we assume that all of the features go through the same sequence of indices (and therefore the same number of targets) in a given word; e.g., in Table 1, LIP-OPEN goes through the same indices as TT-LOC, although it has the same target value for indices 2 and 3. 
<Paragraph position="3"> The surface feature values, i.e., the ones that are actually produced by the speaker, can stray from the underlying pronunciation in two ways, typically because of articulatory inertia: substitution, in which a feature fails to reach its underlying target value; and asynchrony, in which different features proceed through their sequences of indices at different rates. We define the degree of asynchrony between two sets of features as the difference between the average index of one set and the average index of the other. The degree of asynchrony is constrained: more "synchronous" configurations are more probable (soft constraints), and we make the further simplifying assumption that there is an upper bound on the degree of asynchrony (hard constraints). </Paragraph>
<Paragraph position="4">
Table 1: part of the target feature table for the word everybody.
index      0      1          2      3      ...
phoneme    eh     v          r      iy     ...
LIP-OPEN   wide   critical   wide   wide   ...
TT-LOC     alv.   *          ret.   alv.   ...
...        ...    ...        ...    ...    ...
In this feature set, LIP-OPEN is the lip opening degree; TT-LOC is the location along the palate to which the tongue tip is closest (alv. = alveolar; ret. = retroflex).
</Paragraph>
<Paragraph position="11"> Figure 1: one frame of the DBN used in a feature-based pronunciation model. Nodes represent variables; shaded nodes are observed. Edges represent dependencies between variables. Edges without parents/children point from/to variables in adjacent frames (see text). </Paragraph>
<Paragraph position="12"> A natural framework for such a model is provided by dynamic Bayesian networks (DBNs), because of their ability to efficiently implement factored state representations. Figure 1 shows one frame of the type of DBN used in our model (simplified somewhat for clarity of presentation). This example DBN assumes a feature set with three features. The variables at time frame t are as follows:
lexEntry_t - entry in the lexicon corresponding to the current word and baseform. Words with multiple baseforms have one entry per baseform. lexEntry_t's parents are lexEntry_{t-1} and wdTr_{t-1}.
ind^j_t - index of feature j into the underlying pronunciation, as in Table 1. ind^j_0 = 0; in subsequent frames ind^j_t is conditioned on lexEntry_{t-1}, ind^j_{t-1}, and wdTr_{t-1} (defined below). </Paragraph>
<Paragraph position="13"> U^j_t - underlying value of feature j. Its distribution p(U^j_t | lexEntry_t, ind^j_t) is determined by the target feature table of lexEntry_t. </Paragraph>
<Paragraph position="14"> S^j_t - observed surface value of feature j. p(S^j_t | U^j_t) encodes the allowed feature substitutions. </Paragraph>
<Paragraph position="15"> wdTr_t - binary variable indicating whether this is the last frame of the current word. </Paragraph>
<Paragraph position="16"> sync^{A;B}_t - binary variable that enforces a synchrony constraint between subsets A and B of the feature set. </Paragraph>
<Paragraph position="17"> It is observed with value 1; its distribution is constructed in such a way as to force its parent ind variables to obey the desired constraint. For example, to enforce a constraint between the average index of features 1 and 2 and the index of feature 3, we would have P(sync^{1,2;3}_t = 1 | ind^1_t, ind^2_t, ind^3_t) = 0 whenever ind^1_t, ind^2_t, ind^3_t violate the constraint. </Paragraph>
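As an illustration of how such a synchrony variable can be constructed, here is a brief sketch. It is not the paper's implementation; the hard bound MAX_ASYNC and the soft-constraint table SYNC_WEIGHTS are illustrative assumptions. It computes the degree of asynchrony between two feature sets as the difference of their average indices, returns probability zero beyond the hard bound, and assigns lower probability to less synchronous configurations.

# Sketch of the conditional distribution P(sync = 1 | ind) for a synchrony
# variable over feature sets A and B.  MAX_ASYNC and SYNC_WEIGHTS are
# hypothetical values chosen for illustration.

MAX_ASYNC = 2                                  # hard upper bound on asynchrony
SYNC_WEIGHTS = {0: 1.0, 1: 0.2, 2: 0.05}       # soft constraint: less synchronous, less probable

def degree_of_async(ind, set_a, set_b):
    """Difference between the average index of set A and that of set B."""
    avg_a = sum(ind[f] for f in set_a) / len(set_a)
    avg_b = sum(ind[f] for f in set_b) / len(set_b)
    return abs(avg_a - avg_b)

def p_sync_equals_1(ind, set_a, set_b):
    """P(sync^{A;B} = 1 | ind): zero past the hard bound, decaying otherwise."""
    d = degree_of_async(ind, set_a, set_b)
    if d > MAX_ASYNC:
        return 0.0                             # hard constraint: configuration disallowed
    lo = int(d)                                # interpolate between tabulated
    hi = min(lo + 1, MAX_ASYNC)                # weights for fractional degrees
    frac = d - lo
    return (1 - frac) * SYNC_WEIGHTS[lo] + frac * SYNC_WEIGHTS[hi]

# Example: constrain the average index of features 1 and 2 against feature 3,
# as in the text.  Here the configuration is too asynchronous, so P = 0.
ind = {"feat1": 2, "feat2": 3, "feat3": 0}
print(p_sync_equals_1(ind, ("feat1", "feat2"), ("feat3",)))   # prints 0.0

Since sync^{A;B}_t is clamped to 1 in the DBN, any index configuration that receives probability zero here is simply ruled out during inference, which is how the observed child enforces the constraint on its parent ind variables.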
<Paragraph position="18"> In an end-to-end recognizer, the acoustic observations would depend on the S^j_t, which would be unobserved. </Paragraph>
<Paragraph position="19"> However, to facilitate quick experimentation and isolate the pronunciation model, we begin by testing how well we can do when given observed surface feature values. </Paragraph>
</Section>
</Paper>