<?xml version="1.0" standalone="yes"?> <Paper uid="H89-2050"> <Title>Contextually-Based Data-Derived Pronunciation Networks for Automatic Speech Recognition*</Title> <Section position="2" start_page="0" end_page="379" type="ackno"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> The context in which a phoneme occurs leads to consistent differences in how it is pronounced. Phonologists employ a variety of contextual descriptors, based on factors such as stress and syllable boundaries, to explain phonological variation. However, in developing pronunciation networks for speech recognition systems, little explicit use is made of context beyond whole word models and triphone models.</Paragraph> <Paragraph position="1"> This paper describes the creation of pronunciation networks using a wide variety of contextual factors which allow better prediction of pronunciation variation. We use a phoneme-level representation, which permits easy addition of new words to the vocabulary, with a flexible context representation which allows modeling of long-range effects extending over syllables and across word boundaries. In order to incorporate a wide variety of factors in the creation of pronunciation networks, we used data-derived context trees, which possess properties useful for pronunciation network creation.</Paragraph> <Paragraph position="2"> Introduction
The context in which a phoneme occurs leads to consistent differences in how it is pronounced. Phonologists employ a variety of contextual descriptors to explain phonological variation. These descriptors are theoretically motivated by studies of different languages and comprise many factors, such as stress and syllable part. However, in current speech recognition systems, only a few contextual descriptors are employed when developing pronunciation networks. These systems generally model only the effects of the preceding and following sounds, as in triphone models, or implicit within-word contextual effects, as in whole word models.</Paragraph> <Paragraph position="3"> Context-dependent triphone models have been found to be better models than context-free phone models, or monophones, and are now used by many recognition systems (e.g. Chow, 1986). More recently, Lee (1988, 1989) introduced the idea of clustered triphones, in which triphones exhibiting similar coarticulatory behavior are grouped together. Clustered triphones require less training data because there are fewer models, resulting in better training of the models which have been defined.</Paragraph> <Paragraph position="4"> Paul (1988) conducted studies comparing the recognition rates of whole word models and triphone models. In these studies, he found that whole word models provided somewhat better recognition rates than triphone models, and that triphone models provided much better recognition rates than monophone models.</Paragraph> <Paragraph position="5"> However, in these studies, context across word boundaries was not modeled. In a subsequent study, Paul (1989) found that triphone models which modeled context across word boundaries provided significantly better recognition rates than word-internal triphone models. Extrapolating, one would expect even better performance with whole word models which also model contextual effects across word boundaries.</Paragraph> <Paragraph position="6"> However, the amount of data necessary to train such a model for a moderate-size vocabulary could be prohibitive.
Furthermore, adding a new word to the vocabulary would require many new tokens of the word, since at least one token would be needed for each context in which the word could appear.</Paragraph> <Paragraph position="7"> Instead of words, the use of a smaller representational unit, such as phones, with an enriched set of contextual descriptors can provide models which capture many of the contextual effects that whole word models capture. In addition, phone models have a smaller training data requirement and provide a more general framework for adding new words to the vocabulary. Because more contextual effects can be captured by using a wide variety of factors, use of these factors when creating pronunciation networks from dictionary baseforms allows better prediction of pronunciation variants. Better modeling of observed phonological variation can result in better performance of speech recognition systems. A case in point is the work by Weintraub et al. (1989), who found that phonological modeling improved their recognition results; in their work, the phonological rules were derived by hand. This paper describes the creation of pronunciation networks using a wide variety of contextual factors. We propose a systematic methodology for creating pronunciation networks which can capture the predominant contextual effects using a representation at the phoneme level. With a phoneme representation, new words can be added without the additional training data that whole word models would require. Our representation also allows us to capture long-range effects; in this study, some factors extend over syllables and across word boundaries.</Paragraph>
*This work was sponsored in part by the Defense Advanced Research Projects Agency (DOD), under the Information Science and Technology Office, contract #N00140-86-C-8996.
<Paragraph position="8">
Table 1: Contextual factors and their values.
  contextual factor         values
  preceding phoneme
  following phoneme
  preceding phone
  following phone
  syllable part             onset, nucleus, coda
  stress                    primary, secondary, unstressed
  syllable boundary type    initial, final, internal, initial-and-final
  foot boundary type        initial, final, internal, initial-and-final
  word boundary type        initial, final, internal, initial-and-final
  cluster type              onset, coda, nil
  open syllable?            true, false
  true vowel?               true, false
  function word?            true, false
</Paragraph> <Paragraph position="11"> Contextual Factors Represented in Context Trees
Linguists describe the context of a phoneme using many types of theoretically motivated factors, such as stress and word boundary. Each contextual factor describing the context of a phoneme has a value. For example, the factor stress may take on any one of the values primary, secondary, or unstressed. In this work, we describe the context of a phoneme using a set of linguistically motivated contextual factors. These factors and their corresponding values are listed in Table 1. Some of the factors are structures normally associated with several phonemes. For example, the factor syllable part may take on the value onset, which may be composed of up to three phonemes, as in the sequence /str/. In such cases, we assign the factor value to each phoneme within the structure. In our example, the exemplars /s/, /t/, and /r/ in an /str/ sequence would each be assigned a syllable part value of onset. This representation allows modeling of long-range contextual effects simultaneously with a local phoneme representation, which simplifies the addition of new words to a recognition vocabulary.</Paragraph>
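To make the factor-value representation concrete, here is a minimal sketch of how such a per-phoneme context description could be encoded; the class name, the field subset, and the example values are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

# A hypothetical encoding of a phoneme's context as factor values (Table 1);
# only a subset of the factors is shown.
@dataclass(frozen=True)
class PhonemeContext:
    preceding_phoneme: str   # "#" marks a boundary (our convention)
    following_phoneme: str
    syllable_part: str       # onset, nucleus, or coda
    stress: str              # primary, secondary, or unstressed
    word_boundary_type: str  # initial, final, internal, or initial-and-final
    open_syllable: bool
    function_word: bool

# Every phoneme inside an /str/ onset receives syllable_part="onset":
str_onset = [
    PhonemeContext("#", "t", "onset", "primary", "initial", False, False),   # /s/
    PhonemeContext("s", "r", "onset", "primary", "internal", False, False),  # /t/
    PhonemeContext("t", "iy", "onset", "primary", "internal", False, False), # /r/
]
```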
<Paragraph position="12"> If the context of a phoneme were described using all the contextual factors listed in Table 1 simultaneously, a prohibitive amount of data would be required to form an adequate description of each phoneme in each context. Each contextual factor represents a separate dimension, and with such a large number of dimensions, the distribution of phonemes in a context will be sparse. One way to handle this difficulty is to build a &quot;context tree,&quot; in which a subset of contextual factors is selected using a greedy algorithm, with clustering used to group factor values. The number of leaves in the tree and the branching of the tree are determined by the data. In the next paragraph we give a brief description of context trees. A more detailed description of context trees and how they are created is given in Chen and Shrager (1989). An alternate method for grouping contextual factors, based on binary splitting of subspaces until a preset number of clusters is formed, is given by Sagayama (1989).</Paragraph> <Paragraph position="13"> A context tree is an n-ary decision tree which models the relationship between contextual factors and the allophones which occur in different contexts. We create context trees from a set of training exemplars using data derived from the hand-transcribed &quot;sx&quot; sentences of the TIMIT database (Lamel et al., 1986; Fisher et al., 1987). An illustrative tree describing the distribution of allophones in context for some phoneme X is shown in Figure 1. The nodes of a context tree represent the values of a particular contextual factor. In the example, node 1 corresponds to the contextual factor word-boundary-type with the value initial or internal. Each leaf of a context tree encodes the distribution of allophones in each context. In general, more than one allophone occurs in a context because phoneme realizations are not deterministic. For example, leaf 2 of the example corresponds to the realizations of X in a word-final, primary- or secondary-stress context, where X is realized as the allophone y 70% of the time and as z 30% of the time.</Paragraph> <Paragraph position="14"> The representation used in context trees permits flexible modeling of contextual effects. Since the context trees are derived from data in which phonemes are described by a set of contextual factors, long-range contextual effects are modeled. Also, since each factor represents a separate dimension, a factor ignores structures which are irrelevant to it. Thus, a contextual factor such as preceding phoneme extends across word boundaries.</Paragraph>
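As an illustration of how such a context tree might be represented and queried, the sketch below shows an n-ary tree in which every node carries an allophone distribution, with lookup tracing down only as far as the context matches (a property exploited later for unobserved contexts). The data structure, the factor encoding, and the interior-node probabilities are our assumptions; only the 70%/30% leaf follows the Figure 1 example.

```python
class ContextTreeNode:
    """A node in an n-ary context tree. Interior nodes test one contextual
    factor and branch on groups of its values; every node stores the
    allophone distribution for the partial context it represents."""
    def __init__(self, allophones, factor=None, children=()):
        self.allophones = allophones  # allophone -> probability
        self.factor = factor          # factor tested here; None at a leaf
        self.children = children      # list of (value_group, child) pairs

def lookup(node, context):
    """Trace down the tree as far as the context's factor values match and
    return the deepest distribution found; an unseen factor value simply
    stops the descent, yielding a partial-context estimate."""
    while node.factor is not None:
        value = context.get(node.factor)
        child = next((c for group, c in node.children if value in group), None)
        if child is None:
            break
        node = child
    return node.allophones

# Figure 1-style fragment for some phoneme X: in a word-final context with
# primary or secondary stress, X surfaces as [y] 70% and [z] 30% of the
# time. (The distributions above the leaf are invented for the sketch.)
leaf2 = ContextTreeNode({"y": 0.7, "z": 0.3})
node1 = ContextTreeNode(
    {"y": 0.6, "z": 0.4}, factor="word-boundary-type",
    children=[({"final"},
               ContextTreeNode({"y": 0.65, "z": 0.35}, factor="stress",
                               children=[({"primary", "secondary"}, leaf2)]))])

print(lookup(node1, {"word-boundary-type": "final", "stress": "primary"}))
# -> {'y': 0.7, 'z': 0.3}
```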
<Paragraph position="15"> This section describes a systematic method for creating pronunciation networks in which a wide variety of contextual factors are used. In addition to using more contextual factors, our method of network creation has a number of inherent advantages, such as the ability to estimate allophone distributions from partial contextual descriptions. This method is data-intensive, using the data to determine possible pronunciations as well as to estimate the probabilities associated with each pronunciation.</Paragraph> <Section position="1" start_page="375" end_page="379" type="sub_section"> <SectionTitle> Mapping Dictionary Pronunciations </SectionTitle> <Paragraph position="0"> Dictionary pronunciations of words are relatively coarse-grained in that they do not indicate allophonic variation. Phone-based speech recognition systems generally represent words using allophones rather than dictionary baseforms because the allophones of a dictionary &quot;phoneme&quot; may be very different, as measured by the acoustic similarity metrics commonly employed in recognition systems. The allophonically-based pronunciations are usually represented compactly in a pronunciation network. In this section, we describe the creation of pronunciation networks from dictionary baseforms. The networks are produced by mapping the dictionary pronunciation of a word into an allophonic representation specified by the context trees. The dictionary that we use was developed at Xerox PARC and is called the X-Dictionary.</Paragraph> <Paragraph position="1"> The mapping from a dictionary baseform to a set of possible pronunciations is characterized by the substitution, deletion, and insertion of sounds. Each dictionary sound may be realized as an allophone or it may be deleted. Therefore, substitution and deletion of a dictionary phoneme may be treated identically in mapping a dictionary baseform into a network of allophones. A context tree, which we call a &quot;phoneme&quot; tree, is used to describe the observed substitutions and deletions of a dictionary phoneme in transcriptions of speech. One phoneme tree is created for each of the 45 dictionary phonemes, and the data in each tree define the set of allophones observed in each context for a dictionary phoneme.</Paragraph> <Paragraph position="2"> In addition to modeling substitutions and deletions, as the phoneme trees do, pronunciation network creation also requires modeling of insertions. Insertions do not fit the substitution/deletion model in which a &quot;phoneme&quot; is realized as an allophone. Instead, insertions may occur between any pair of &quot;phonemes.&quot; In addition, one must also model when insertions do not occur so that the probability of an insertion in any context can be predicted. These requirements are met by representing all insertions and non-insertions in one tree in which the contextual factors are redefined to be a set applicable to insertions. The contextual factors describing insertions do not describe the transformation of an underlying phoneme. Instead, the factors describe the context of the phonemes adjacent to where an insertion can occur. Contextual factors describing insertions are derived from factors describing the context of a phoneme by replacing each factor with ones describing the context of adjacent phonemes. For example, the factor stress is replaced with stress-of-preceding-phoneme and stress-of-following-phoneme. Since the new factors describe the context of adjacent phonemes, the value sentence-boundary is added to the allowable values of each factor to indicate the beginning or end of a sentence. In organizing the data to build an &quot;insertion&quot; tree, all pairs of phonemes in the training data are checked for whether or not an insertion occurred between them. If so, the context and the insertion are noted; if not, the context and the fact that no insertion occurred are noted. The &quot;insertion&quot; tree thus predicts when insertions can occur as well as what type of insertion can occur in a particular context.</Paragraph>
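The factor rewriting described above is mechanical, as the following sketch suggests; the function names, the factor subset, and the data layout are hypothetical.

```python
SENTENCE_BOUNDARY = "sentence-boundary"

def insertion_site_context(prev_ctx, next_ctx,
                           factors=("stress", "syllable-part")):
    """Describe the slot between two phonemes by its neighbors' factor
    values; a missing neighbor marks the beginning or end of the sentence."""
    site = {}
    for f in factors:
        site[f + "-of-preceding-phoneme"] = prev_ctx[f] if prev_ctx else SENTENCE_BOUNDARY
        site[f + "-of-following-phoneme"] = next_ctx[f] if next_ctx else SENTENCE_BOUNDARY
    return site

def insertion_training_records(phoneme_contexts, inserted):
    """phoneme_contexts: factor dicts for one sentence's phonemes.
    inserted[i]: the phone inserted at site i (before phoneme i), or None.
    Every site is recorded, so non-insertions are modeled as well."""
    padded = [None] + list(phoneme_contexts) + [None]
    return [(insertion_site_context(padded[i], padded[i + 1]), inserted[i])
            for i in range(len(padded) - 1)]
```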
<Paragraph position="3"> Network Creation
Networks are created word by word and can be joined to produce a pronunciation network for a recognition system. Networks created using our method explicitly model cross-word-boundary contextual effects. If the context at the word boundaries is unknown, the possible allophones and corresponding probabilities are enumerated for each context value. Alternatively, if the context of the word is specified, only the allophones for the word boundary context are used; since the context of the word is known, word-boundary phonemes can be treated the same as word-internal phonemes.</Paragraph> <Paragraph position="4"> To create a word network, a two-pass process is used. First, each dictionary &quot;phoneme&quot; in a word is mapped to the allophone distribution represented by the leaf in a phoneme tree corresponding to the context in which the phoneme occurs, producing a sequence of allophone distributions representing the sequence of phonemes (see Figure 2a). The context of a leaf in a phoneme tree is described by the contextual factor values encountered in traversing the phoneme tree from the root node to the leaf. Contextual constraints associated with the allophones from a leaf are matched to the contextual constraints of adjacent allophones. If the phoneme is word-initial or word-final and the context at the word boundaries is not specified, then the allophones for each context must be incorporated into the network. Insertions are then added between the leaf values when the context for an insertion is compatible. Insertions are added after substitutions and deletions because the context in which an insertion occurs depends upon the adjacent phones, which are determined by the phoneme realizations.</Paragraph> <Paragraph position="5"> Using our method based on context trees, the pronunciation network produced for the word &quot;fence&quot; is shown in Figure 2. In creating this network, we made the simplifying assumption that we would not use the contextual factors describing adjacent phones for modeling substitutions and deletions. This produces the simple network in Figure 2b, in which each set of arcs from a leaf node is connected at the beginning and end. Adding insertions, for which we do include the contextual factors describing adjacent phones, produces the network shown in Figure 2c.</Paragraph> <Paragraph position="6"> In creating this network, we also assumed that the word was spoken in isolation and therefore preceded and followed by silence. Had we not specified a context, the boundaries of the word would be much bushier, with additional arcs representing the different possible allophones and probabilities in various contexts. For example, an /s/ in word-final position is more likely to be palatalized and pronounced as [š] when followed by a /y/ or a /š/, as in &quot;fence your&quot; or &quot;fence should,&quot; than when followed by a vowel, as in &quot;fence any.&quot; When the context of a word is not specified, possible palatalization is modeled through the addition of two arcs which require that the following phoneme can cause palatalization, such as /y/ or /š/. One arc represents the [s] allophone and the other arc represents the [š] allophone; the probabilities of the two arcs sum to 1.0. The original [s] arc remains untouched with a probability of 1.0 and now has a constraint on its following context prohibiting any following phoneme which can cause palatalization of /s/.</Paragraph>
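Schematically, the two-pass procedure described above might be rendered as follows, reusing the lookup sketch given earlier; the network representation is deliberately simplified (parallel arcs between consecutive states, no matching of contextual constraints between adjacent arcs), and all interfaces are assumed rather than taken from the paper.

```python
def build_word_network(phonemes, phoneme_trees, insertion_tree,
                       describe, describe_site):
    """Two-pass sketch of word-network creation.
    phonemes: dictionary baseform, e.g. ["f", "eh", "n", "s"].
    describe(phonemes, i): factor dict for phoneme i in this word;
    describe_site(phonemes, i): factor dict for the insertion site before
    phoneme i. Both are assumed helper functions."""
    # Pass 1: substitutions and deletions. Each phoneme becomes a bundle of
    # parallel arcs labeled with the allophones of its phoneme-tree leaf
    # (a deletion is just an empty-label entry in the distribution).
    phone_arcs = [lookup(phoneme_trees[ph], describe(phonemes, i))
                  for i, ph in enumerate(phonemes)]
    # Pass 2: insertions, added afterwards because their context depends on
    # the phones chosen in pass 1. Site i sits before phoneme i.
    insert_arcs = [lookup(insertion_tree, describe_site(phonemes, i))
                   for i in range(len(phonemes) + 1)]
    return phone_arcs, insert_arcs
```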
<Paragraph position="7"> When all word boundary contexts are listed, unobserved cross-word-boundary contexts may be handled by including a &quot;default&quot; in which context is ignored. That is, the &quot;default&quot; context is composed of all observed allophones across all contexts. A more detailed estimate, based on a partial context specification, requires actual lookup in a context tree.</Paragraph> <Paragraph position="8"> Because of limited training data, some of the words to be represented may contain a context value which has not been observed in the training data. However, each node of the tree contains the distribution of allophones for the partial context represented by the node. Thus, the allophones for unobserved contexts can be estimated from a partial context specification by tracing down the tree as far as is consistent with the observed contextual factor values describing a phonemic baseform.</Paragraph> <Paragraph position="9"> The context trees possess properties which are useful for producing pronunciation networks. As noted in the previous section, allophone distributions in unobserved contexts may be easily estimated from a partial context specification by tracing down the tree until no context matches. Other properties of the context trees permit easy combination of contexts, specification of natural groups for tying (Jelinek and Mercer, 1980) in hidden Markov models (HMMs), representation of allophone probabilities, and systematic reduction of network size.</Paragraph> <Paragraph position="10"> In the tree-induction method, the leaves of the tree, by definition, represent mutually exclusive contexts. This simplifies the comparison and combination of contexts during network creation. For example, finding a set of arcs with a compatible context is simplified if one can assume that once a matching constraint is found, one need not look any further.</Paragraph> <Paragraph position="11"> In HMMs, tying is used to constrain one probability in a network to be equal to another, the underlying idea being that equivalent items should be assigned the same probability. Each leaf of a phoneme tree represents all the allophones of that phoneme which have been observed in a particular context. Thus each leaf is a natural group for tying. In a network representation in which the labels are on the arcs, the probability assigned to an arc should be tied to all other arcs with the same label from the same leaf.</Paragraph> <Paragraph position="12"> Many rule sets have been developed to describe phonological variation. By using a data-intensive approach, however, allophone probabilities in a context may be directly estimated. Furthermore, counts of allophones in context can be used to reduce the size of pronunciation networks by removing unlikely allophones.</Paragraph> <Paragraph position="13"> The probabilities in a context tree can be further refined if the network is used in an HMM which is trained. The probabilities provide a good initial estimate, and, more importantly, the absence of unlikely allophones in the network allows more robust training to be performed.</Paragraph> <Paragraph position="14"> In the creation of pronunciation networks, it is hard to define an &quot;optimum&quot; number of pronunciations to represent. With only a few pronunciations, recognition performance may not be optimal because the modeling of pronunciation variation in words is left to the front-end. With many pronunciations, recognizer performance may be poor because the training data are sparse and unlikely pronunciations may confuse the recognizer. This is the problem described by SRI as the problem of &quot;many additional parameters to be estimated with the same amount of training data&quot; (Weintraub et al., 1989). SRI uses measures of coverage and overcoverage and accordingly modifies its phonological rules by hand.</Paragraph> <Paragraph position="15"> With context trees, this problem can be handled at the phoneme level. Given a large data set, context trees tend to overgenerate pronunciations because each new allophonic realization of a phoneme in a context translates into another possible arc in a network. But because context trees contain count information on allophones in context, pruning can be used in a systematic way to remove the more unlikely pronunciations, thus reducing the number of pronunciations in a network. To remove unlikely allophones in a given context, pruning is performed on the allophone distributions within a leaf. Pruning can be based upon the counts of an allophone in a leaf or upon the percentage of exemplars of an allophone in a leaf. In the first case, allophones supported by only a few exemplars are removed. In the second case, unlikely allophones are removed. In each case, the arcs representing the removed allophones are not created. In addition to reducing the number of pronunciations, pruning may also result in more robust predictions. For example, in a given context, one may observe just a couple of exemplars of an allophone in several hundred realizations. These allophones may be due to transcription error, and so it is judicious to remove them.</Paragraph>
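Both pruning criteria reduce to simple thresholds over the allophone counts stored in a leaf; the sketch below combines them (the threshold values and names are ours) and renormalizes the surviving distribution.

```python
def prune_leaf(counts, min_count=3, min_fraction=0.02):
    """counts: allophone -> number of exemplars observed at this leaf.
    Drop allophones seen in too few exemplars (likely transcription errors)
    or in too small a share of the leaf, then renormalize; arcs for the
    removed allophones are simply never created."""
    total = sum(counts.values())
    kept = {a: n for a, n in counts.items()
            if n >= min_count and n / total >= min_fraction}
    norm = sum(kept.values())
    return {a: n / norm for a, n in kept.items()}

# A leaf with a two-exemplar allophone among several hundred realizations:
print(prune_leaf({"s": 412, "sh": 37, "dx": 2}))  # the [dx] arc is pruned
```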
<Paragraph position="16"> Summary
This paper describes a systematic, data-intensive method for creating pronunciation networks. A phoneme representation with an enriched set of contextual descriptors is advocated as providing a general framework in which new words may be easily added to the vocabulary. A wide variety of factors is used to model contextual effects, including long-range and cross-word-boundary phenomena. The large number of dimensions entailed by a greater number of contextual descriptors is handled through the use of context trees for predicting allophonic variation. Context trees were shown to possess attributes, such as the ability to estimate distributions from partial contexts and the capacity to systematically reduce the size of a network based on the tree data, that make them a good representation from which to create pronunciation networks.</Paragraph> </Section> </Section> </Paper>