<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4021">
  <Title>Feature-based Pronunciation Modeling for Speech Recognition</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We have performed a pilot experiment using the following feature set, based on the vocal tract variables of articulatory phonology (Browman and Goldstein, 1992): degree of lip opening; tongue tip location and opening degree; tongue body location and opening degree; velum state; and glottal (voicing) state. We imposed the following synchrony constraints: (1) All four tongue features are completely synchronized; (2) the lips can desynchronize from the tongue by up to one index; and (3) the glottis and velum are synchronized, and their index must be within 2 of the mean index of the tongue and lips.</Paragraph>
    <Paragraph position="1"> We used the Graphical Models Toolkit (Bilmes and Zweig, 2002) to implement the model. The distributions p(Sjt jUjt ) were constructed by hand based on linguistic considerations, e.g. that features tend to go from more &amp;quot;constricted&amp;quot; values to less constricted ones, but not vice versa. p(Ujt jlexEntryt; indjt) was derived from manually-constructed phoneme-to-featureprobability mappings. For these experiments, no parameter learning has been done.</Paragraph>
    <Paragraph position="2"> The task was to recognize an isolated word, given a set of observed surface feature sequences Sjt . To create the observations, we used the detailed phonetic transcriptions created at ICSI for the Switchboard corpus (Greenberg et al., 1996). For each word, we converted its transcription to a sequence of feature vectors, one vector per 10 ms frame. For this purpose, we divided diphthongs and stops into pairs of feature configurations. Given the input feature sequences, we computed a Viterbi score for each lexical entry in a 3000+-word (5500+-lexEntry) vocabulary, by &amp;quot;observing&amp;quot; the lexEntry variable and finding the most likely settings of all remaining variables. The most likely variable settings can be thought of as a multistream alignment between the surface and underlying feature streams. Finally, we output the word corresponding to the highest-scoring lexical entry.</Paragraph>
    <Paragraph position="3"> We performed this procedure on a development set of 165 word transcriptions, which was used to tune settings such as synchronization constraints, and a test set of 236 transcriptions 2. We compared the performance of several models, measured in terms of word error rate (WER) and failure rate (FR), the percentage of inputs that had no Viterbi alignment with the correct word. To get a sense of the effect of feature asynchrony, we compared our asynchronous model with a version in which all features are forced to be synchronized, so that only feature substitution is allowed. This uses the same DBN, but with degenerate distributions for the synchronization variables.</Paragraph>
    <Paragraph position="4"> Also, since the Sj values are derived from phonetic transcriptions, and are therefore constant over several frames at a time, we also built a variant of the DBN in which Sj is allowed to change value with non-zero probability only when indj changes (by adding parents indjt , indjt 1, Sjt 1 to Sjt ); we refer to this DBN as &amp;quot;segment-based&amp;quot;, and to the original as &amp;quot;frame-based&amp;quot;. We compared four variants, differing along the &amp;quot;synchronous vs. asynchronous&amp;quot; and &amp;quot;frame-based vs. segment-based&amp;quot; dimensions. The variant which is both synchronous and segment-based is similar to a phone-based pronunciation model with only context-independent phone substitutions.</Paragraph>
    <Paragraph position="5"> dev set test set model WER FR WER FR baseforms only 63.6 61.2 69.5 66.9 phonological rules 50.3 47.9 59.7 55.5 sync. seg.-based 38.2 24.8 43.2 35.2 sync. fr.-based 35.2 23.0 46.2 31.4 async. seg.-based 32.7 19.4 41.1 31.4 async. fr.-based 29.7 16.4 42.7 26.3  as well as of two &amp;quot;baseline&amp;quot; models: one allowing only the baseform pronunciations (on average 1.7 per word), and another including all pronunciations produced by an extensive set of context-dependent phonological rules (about 4 per word), with no feature substitutions or asynchrony in either case. The phonological rules are the &amp;quot;full rule set&amp;quot; described in Hazen et al. (2002). We note that they were not designed with Switchboard in mind.</Paragraph>
    <Paragraph position="6"> The models that allow asynchrony outperform the ones that do not, in terms of both WER and FR. Looking more closely at the performance on the development set, the inputs on which the synchronous models failed but the asynchronous models succeeded were in fact the kinds of pronunciations that we expect to arise from feature asynchrony, including: nasals replaced by nasalization on a preceding vowel; a /t r/ sequence realized as /ch/; and everybody ! [eh r uw ay]. The relative merits of the frame-based and segment-based models is less clear, as 2We required that words in the development and test sets have phonemic pronunciations with at least 4 phonemes, so as to limit context effects from adjacent words.</Paragraph>
    <Paragraph position="7"> they have opposite relative performance on the development and test sets. For 27 (16.4%) development utterances, none of the models was able to find an alignment with the correct word. Most of these were due to apparent gesture deletions and context-dependent feature changes, which are not yet included in the model.</Paragraph>
    <Paragraph position="8"> Figure 2 shows a part of the Viterbi alignment of everybody with [eh r uw ay], produced by the segmentbased, asynchronous model. Using this model, everybody was the top-ranked word. As expected, the asynchrony is manifested in the [uw] region, and the lips do not close but reach only a narrow (glide-like) configuration.</Paragraph>
    <Paragraph position="9">  alignment, including the lip opening and tongue tip loca-tion variables. Indices are relative to the underlying pronunciation /eh v r iy bcl b ah dx iy/. Adjacent frames with equal values have been merged for easier viewing. WI = wide; NA = narrow; CR = critical; CL = closed; ALV</Paragraph>
    <Paragraph position="11"/>
  </Section>
class="xml-element"></Paper>