<?xml version="1.0" standalone="yes"?> <Paper uid="H89-1053"> <Title>Automatic Discovery of Contextual Factors Describing Phonological Variation</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> XEROX PALO ALTO RESEARCH CENTER 3333 Coyote Hill Road </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="288" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> In this paper we describe a method for automatically discovering subsets of contextual factors which, taken together, are useful for predicting the realizations, or pronunciations, of English words for continuous speech recognition. A decision tree is used for organizing contextual descriptions of phonological variation. This representation enables us to categorize different realizations according to the context in which they appear in the corpus. In addition, this organization permits us to consider simplifications such as pruning and branch clustering, leading to parsimonious descriptions that better predict allophones in these contexts. We created trees to examine the working assumption that preceding phoneme and following phoneme provide important contexts, as exemplified by the use of triphones in hidden Markov models; our results were in general accordance with the assumption. However, we found that other contexts also play a significant role in phoneme realizations.</Paragraph> <Paragraph position="1"> Introduction: Context Sensitivity in Realizations Phonologists claim that the context in which a phoneme occurs leads to consistent differences in how it is pronounced. For example, one phonological rule may state that the phoneme /t/ is often flapped when it is preceded and followed by a vocalic (as in &quot;butter&quot;). The construction of these rules is typically an intricate process of theory formation, rule construction and then validation or disconfirmation of these rules.</Paragraph> <Paragraph position="2"> In this paper we describe an approach to partially automate rule construction, allowing for a larger number of examples to be examined and checked for consistencies. Our examples come from comparing transcriptions of spoken speech with a dictionary representation of the words spoken. We shall call the dictionary pronunciation symbols phonemes and define the realizations, or allophones, of a phoneme to be the set of transcription symbols corresponding to that phoneme. For example, pronunciations of the phoneme /t/ include the released, flapped, and unreleased realizations as characteristically occur in &quot;tap&quot;, &quot;butter&quot;, and &quot;pat new&quot;, respectively. In addition, we shall refer to a context as having values. For example, the context stress has values primary, secondary, and unstressed.</Paragraph>
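As an informal illustration of this terminology, the following minimal sketch (in Python) shows one way an exemplar might be represented: a dictionary phoneme, its transcribed realization, and its context values. The class and field names are illustrative assumptions, not taken from the original system.

    # Hypothetical representation of one phoneme-to-realization mapping.
    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class Exemplar:
        phoneme: str                      # dictionary (baseform) symbol, e.g. "t"
        realization: str                  # transcribed allophone, e.g. "flap"
        contexts: Dict[str, str] = field(default_factory=dict)

    ex = Exemplar("t", "flap", {"stress": "unstressed", "syll-bdry": "initial",
                                "pre-phoneme": "ah", "post-phoneme": "er"})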
<Paragraph position="3"> The approach is based on automatically forming and simplifying decision trees. Decision trees have been used for both understanding and classification of data (Henrichon and Fu, 1969). In our application, they provide a way of using context to organize the various realizations of a phoneme. The probability of a realization varies with the context in which the phoneme occurs. Contexts which have similar realization distributions are grouped together. The decision tree thus provides a method for representing the partitions of allophones with dissimilar probabilities, based on context.</Paragraph> <Paragraph position="4"> Decision trees can be formed automatically (Breiman et al., 1984; Quinlan, 1986) and converted to rules (Quinlan, 1987). The problem addressed here is to construct decision trees that are appropriate for use in the construction of pronunciation rules and for predicting realizations in context. In addition, an important part of the tree induction method is the discovery of appropriate descriptive categories for the formation of such trees. These categories often resemble theoretical categories, such as vocalic or plosive, and define the organization of the tree.</Paragraph> <Section position="1" start_page="0" end_page="284" type="sub_section"> <SectionTitle> Method </SectionTitle> <Paragraph position="0"> To organize the realizations of a phoneme according to context, we adapted a decision tree induction method based upon ID3 (Quinlan, 1986). The nodes of the tree represent the attribute upon which the branching is based. In a pronunciation tree, as in Figure 1, the various contexts correspond to attributes. Each branch of a tree represents a different value of an attribute. For example, in Figure 1 the context syllable boundary (SYLL-BDRY) can take on the values final (F) and initial (I) or not-initial-and-not-final (NI-NF). Associated with each leaf in the tree are the exemplars that are characterized by the context values encountered in traversing the tree from the root node to reach the leaf. The exemplars are grouped into classes, which correspond to realizations of a phoneme in our application, and the percentage of each realization in a leaf is paired with the class label.</Paragraph> <Paragraph position="2"> Figure 1: Pruned tree showing contexts when /t/ is glottalized (+) and not glottalized (-). Nodes are numbered in the upper right corner of each box.</Paragraph> <Paragraph position="4"> Two characteristics of speech are captured by our tree representation: 1) A phoneme can be pronounced in multiple ways, as in the previous example with /t/; and 2) The number of values for some contexts may be large. For example, the context preceding phoneme has 46 values in our dictionary's alphabet. Splitting on all 46 values rapidly decreases the number of exemplars per node. These two characteristics are accommodated by using a general metric in tree induction and by using clustering of context values, both of which are described in the next sections.</Paragraph> </Section> <Section position="2" start_page="284" end_page="286" type="sub_section"> <SectionTitle> Tree Induction </SectionTitle> <Paragraph position="0"> A decision tree is induced by recursively splitting a node into new nodes. At each node the attribute is selected which has values that best separate the realizations of the data (i.e., make each node &quot;purer&quot;; a pure node contains exemplars of only one realization type).</Paragraph> <Paragraph position="1"> The exemplars in the current node are subdivided according to the value of the selected attribute for each exemplar, creating a new set of nodes.</Paragraph> <Paragraph position="2"> To handle multiple realizations, we used a reduction of entropy criterion with a variable number of possible classes. Intuitively, as entropy is reduced, the nodes of the tree become purer and the different realizations are better separated. 
We briefly review the criterion calculations here (adapted from Breiman et al., 1984; Chou, 1988; Gallager, 1968; and Ganapathy and Rajaraman, 1973).</Paragraph> <Paragraph position="3"> Before splitting, the entropy at a node based on classes X is H(X). The average entropy of the classes in the new nodes created by splitting on the values V of a given attribute is: E(H(X|v)) = Σv P(v) H(X|v). The average entropy can also be expressed as the conditional entropy of class X given attribute values V, or H(X|V). Thus the gain for attribute a, G(a), which is the difference between the entropy of the classes at a node, H(X), and the conditional entropy of X given V at a node, is the mutual information between X and V: G(a) = H(X) - H(X|V) = I(X; V). To normalize for the variable number of attribute values, the gain for attribute a is normalized by the entropy of the values associated with a. Quinlan calls this the gain ratio: R(a) = G(a)/H(V). The attribute which maximizes the gain ratio, R(a), is selected for splitting.</Paragraph> <Paragraph position="4"> This decision tree induction procedure is used in our application to separate the different realizations of a phoneme based on context values. In the next section we discuss the clustering procedure which groups context values with similar realization distributions to potentially reduce the number of splits at a node and form categories.</Paragraph>
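Before turning to the clustering of attribute values, the following minimal sketch (in Python) illustrates the splitting criterion just reviewed, assuming each exemplar is given as a (context-values, realization) pair, for example (ex.contexts, ex.realization) from the earlier sketch. The function names are illustrative, not the original implementation.

    import math
    from collections import Counter

    def entropy(labels):
        """H of the empirical distribution over a list of symbols."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain_ratio(exemplars, attribute):
        """R(a) = G(a) / H(V) = (H(X) - H(X|V)) / H(V) for one attribute."""
        classes = [x for _, x in exemplars]
        values = [ctx[attribute] for ctx, _ in exemplars]
        h_x = entropy(classes)
        h_x_given_v = 0.0                      # E(H(X|v)) = sum over v of P(v) H(X|v)
        for v, count in Counter(values).items():
            subset = [x for ctx, x in exemplars if ctx[attribute] == v]
            h_x_given_v += (count / len(exemplars)) * entropy(subset)
        h_v = entropy(values)                  # entropy of the attribute's own values
        return (h_x - h_x_given_v) / h_v if h_v > 0 else 0.0

    def best_attribute(exemplars, attributes):
        """Pick the attribute with the largest gain ratio at this node."""
        return max(attributes, key=lambda a: gain_ratio(exemplars, a))

    data = [({"syll-bdry": "final", "stress": "primary"}, "+glottalized"),
            ({"syll-bdry": "initial", "stress": "primary"}, "-glottalized")]
    print(best_attribute(data, ["syll-bdry", "stress"]))   # toy data: picks "syll-bdry"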
<Paragraph position="5"> Clustering of Attribute Values Traditionally, in tree induction, nodes are split either along all values of an attribute (e.g., Quinlan, 1986) or else binary splits are used (e.g., Breiman et al., 1984). When there are only a few values per attribute, splitting along all values of an attribute will not quickly reduce the number of exemplars per node. But sometimes attributes may have many different values. In speech, some of the values are thought to be similar in their influence on sound realizations; hence, in theory one would not want to split separately on all values, but instead would like to keep sounds with similar effects together. The values could be pre-clustered by a person according to theoretical ideas of what is similar, but the groupings may change depending on context. Alternatively, the values could be clustered into a predefined number of groups at each node (Chou, 1988). However, the appropriate number of groups is not the same for all sounds and again may depend on the current context. Thus, we want to cluster the values of each context at a node and want the number of clusters to be determined by the exemplars in the node. The context values within each resulting group will then be similar in their prediction of the distribution of realizations.</Paragraph> <Paragraph position="7"> Hierarchical clustering is used to group the values of an attribute. This type of clustering was chosen because it allows the number of clusters for each set of attribute values to be determined from the data, rather than predefined. Mutual information is used as the distance metric and is computed as in Jelinek (1985). That is, let the average mutual information between context value or attribute value vi and realizations or classes X be: I(vi; X) = Σx P(vi, x) log [P(x|vi) / P(x)]. The increase in average mutual information resulting from pairing two attribute values vm and vn is the difference between the average mutual information resulting from pairing vm and vn and the contribution to the average mutual information before pairing vm and vn: ΔI(V; X) = I(vm ∪ vn; X) - I(vm; X) - I(vn; X). At each iteration, the pairing that results in the largest increase in mutual information is selected and forms a new cluster. This is continued until one of the following conditions for stopping is reached: 1) There are only two clusters left; 2) The increase in the mutual information is negative and more than doubles from one iteration to the next; 3) The increase in the mutual information measure decreases more than a threshold, which we set at -30. At each iteration, the increase in mutual information is often negative because some information is usually lost each time a new cluster is formed. The conditions for stopping define when the loss in mutual information is too great to continue clustering.</Paragraph> <Paragraph position="8"> Since we cluster the values of each attribute prior to splitting, it may be useful to split on this attribute again under a more specific context (i.e., farther down the tree). Thus, in contrast to ID3, after an attribute is selected for splitting, it is not removed from the set of attributes considered. This is also the case in the binary split method of Breiman et al. (1984); however, our method has the potential of providing meaningful splits which accommodate graded categorization in the realizations.</Paragraph> <Paragraph position="9"> A tree that has been constructed by this method may be too specialized to the training exemplars. In the extreme case, each leaf is pure, which is not desirable because such a tree would not be robust with respect to new data. In understanding the relationship between contexts and realizations, we want to uncover generalizations. This can be achieved by pruning, which combines subtrees, resulting in more general distinctions. Many methods of pruning have been suggested (e.g., Breiman et al., 1984 and Chou, 1988), but in general their primary goal is to optimize the probability of error versus some characteristic of the tree, such as average length or number of leaves.</Paragraph> <Paragraph position="10"> Since our concern is to use only the parts of the tree which will be robust to new data, a different type of pruning was used. First, nodes are extended only when the number of exemplars is greater than a specified threshold (Breiman et al., 1984); we used 20. In addition, only nodes relevant to the classification of exemplars are kept. A chi-square test (Quinlan, 1986) at the .01 level of significance is used. Each tree is also pruned by running a separate set of exemplars, or cross-validation set, through the constructed tree. If the cross-validation exemplars in a node indicate that an attribute is not relevant to the classification of the exemplars in a node, the subtree beginning at the node is collapsed into a leaf.</Paragraph>
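The following minimal sketch (in Python) of the bottom-up value clustering assumes joint counts of each attribute value with each realization class at the current node. The scale of the -30 threshold relative to these normalized mutual-information terms, and the exact form of the stopping tests, are assumptions rather than details given in the text.

    import itertools
    import math
    from collections import Counter

    def cluster_values(value_class_counts, threshold=-30.0):
        """Greedily merge attribute values with similar realization distributions,
        stopping roughly as the text describes (a sketch, not the original code)."""
        clusters = {frozenset([v]): Counter(c) for v, c in value_class_counts.items()}
        n = sum(sum(c.values()) for c in clusters.values())
        class_totals = Counter()
        for c in clusters.values():
            class_totals.update(c)

        def contrib(counts):
            # This cluster's contribution to the average mutual information I(V; X).
            n_v = sum(counts.values())
            return sum((n_vx / n) * math.log2((n_vx / n) / ((n_v / n) * (class_totals[x] / n)))
                       for x, n_vx in counts.items() if n_vx)

        prev_delta = None
        while len(clusters) > 2:                              # condition 1: two clusters left
            best_pair, best_delta = None, None
            for a, b in itertools.combinations(list(clusters), 2):
                merged = clusters[a] + clusters[b]
                delta = contrib(merged) - contrib(clusters[a]) - contrib(clusters[b])
                if best_delta is None or delta > best_delta:
                    best_pair, best_delta = (a, b), delta
            if best_delta < threshold:                        # condition 3: loss exceeds threshold
                break
            if prev_delta is not None and prev_delta < 0 and best_delta < 2 * prev_delta:
                break                                         # condition 2: negative loss more than doubles
            a, b = best_pair
            clusters[a | b] = clusters.pop(a) + clusters.pop(b)
            prev_delta = best_delta
        return [set(k) for k in clusters]

The resulting value groups then define the branches taken when the attribute is selected for splitting.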
<Paragraph position="11"> Data The data comprised almost 30,000 hand-transcribed segments from approximately 900 of the &quot;sx&quot; sentences from the TIMIT acoustic-phonetic speech database (Lamel et al., 1986; Fisher et al., 1987), spoken by more than 180 different speakers. Trees were induced using 60% of the data; the results that will be described are based on trees pruned on the remaining 40% of the data. The transcribed TIMIT data were automatically aligned to the dictionary baseforms from the Merriam-Webster 20,000 Word Pocket Dictionary to produce mappings between dictionary phonemes and transcribed segments. Each mapping was then described by a set of theoretically motivated contexts based on a set used by Withgott and Bagley (1987) in a pronunciation generation system. These contexts and corresponding values, which are the union of possible values over all phonemes, are listed in Table 1. [Table 1: Contexts used in pronunciation experiments. Recoverable value sets: onset, nucleus, coda; primary, secondary, unstressed; initial, final, not-initial-and-not-final, initial-and-final (for the boundary contexts); onset, coda, nil; and true, false for the predicate contexts.] Note that some contexts, such as stress and foot-boundary, refer to how the phoneme functions within a larger unit. Because lexical context is used, contexts such as foot-boundary are easily determined. Also note that the context values based on adjacent phonemes are defined across word boundaries. For example, in the phrase &quot;two words&quot; the post (following) phoneme to /u/ would be /w/. Some of the common predicate rule contexts were combined into a context type and the clustering algorithm was used to group these values to form predicates when appropriate. For example, the predicates syllable-initial?, syllable-final?, and syllable-internal? were combined into the type syllable-boundary. The predicate syllable-initial? is formed when the values of syllable-boundary-type are clustered into the groups {initial} and {final, not-initial-and-not-final, initial-and-final}.</Paragraph> </Section> <Section position="3" start_page="286" end_page="288" type="sub_section"> <SectionTitle> Results </SectionTitle> <Paragraph position="0"> A sample tree constructed using the previously described method is shown in Figure 1. This tree is for the special case describing whether /t/ is glottalized. A context other than preceding phoneme and following phoneme is incorporated; the first split (nodes 0 and 10) in this tree is on syllable boundary (SYLL-BDRY), indicating that when /t/ is glottalized it is generally in syllable-final position. Note that grouping the syllable boundary context values not-initial-and-not-final (NI-NF) and initial (I) separately from final (F) can be interpreted as the predicate syllable-final?. Without clustering, selection of this attribute would result in a three-way split. Further examination of this tree suggests additional conditions on this generalization.</Paragraph> <Paragraph position="1"> For example, node 3 may be described as containing a subset of voiced sounds; in particular all semivowels and nasals and most of the voiced fricatives and tense vowels. Node 1 contains primarily plosives, unvoiced fricatives, and lax vowels. Additionally, the nodes labeled 5 and 9 are preceded by PRE-PHONEME nodes containing vocalics. Thus, we might produce a more precise rule, predicting that a /t/ in syllable-final position and preceded by a vocalic will be glottalized with greater likelihood when preceded by a phoneme in node 8 than in node 4.</Paragraph>
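To make the use of such a tree concrete, the following minimal sketch (in Python) reads a realization distribution off a pruned context tree like the one in Figure 1. The node structure and the toy probabilities are illustrative assumptions, not figures from the paper.

    def predict_realizations(node, context):
        """Walk from the root to a leaf using the phoneme's context values and
        return the realization distribution stored at that leaf."""
        while node.get("children"):                 # still at an internal node
            value = context[node["attribute"]]
            next_node = None
            for value_group, child in node["children"]:
                if value in value_group:
                    next_node = child
                    break
            if next_node is None:                   # value outside every group: stop here
                break
            node = next_node
        return node.get("distribution", {})

    # Toy two-way tree for whether /t/ is glottalized (probabilities invented).
    toy_tree = {
        "attribute": "syll-bdry",
        "children": [
            ({"final"}, {"distribution": {"+glottalized": 0.6, "-glottalized": 0.4}}),
            ({"initial", "not-initial-and-not-final"},
             {"distribution": {"+glottalized": 0.1, "-glottalized": 0.9}}),
        ],
    }
    print(predict_realizations(toy_tree, {"syll-bdry": "final"}))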
<Paragraph position="2"> Along with trees such as these, we have constructed 45 general trees, one for each phoneme in the dictionary. These trees were more general in that all the realizations of a phoneme composed the classes.</Paragraph> <Paragraph position="3"> Twenty of the trees (p t k b d g m n w h u U i I I E x c s T) had preceding phoneme and following phoneme as their initial contexts for splitting; in nine trees (C J r l y e A Z D) preceding phoneme but not following phoneme appeared in the first two levels; and in eight trees (G o Y @ a z f v) the following phoneme but not preceding phoneme appeared in the first two levels. This agrees with the common working assumption that preceding phoneme and following phoneme are the most important contexts for describing phonological variation. However, we also observed that other contexts are useful for differentiating among the realization distributions. The additional contextual factors which appeared in the first two levels of the trees, and the number of times each appeared, are: stress 13, function-word? 8, foot boundary type 5, syllable boundary type 5, syllable part 3, open syllable? 1, and word boundary type 1.</Paragraph> <Paragraph position="4"> By using a clustering technique in which the number of groups was determined by the data, the preceding and following context categories often corresponded to linguistic categories, such as place or manner. For example, the POST-PHONEME values in the /s/ realizations were clustered into the set {~, ~, V} and the set of all other phonemes. In addition, the predominance of preceding phoneme and following phoneme as useful contextual factors is due in part to the flexibility in the number of groups.</Paragraph> <Paragraph position="5"> Discussion: Relevance to Speech Recognition Contexts used in the creation of pronunciation networks for some hidden Markov model speech recognition systems have been limited. This is partially due to the amount of data needed to train the network units in which context is represented. These units include whole word models, where phones are represented in the context of the word in which they occur (e.g., Paul, 1988), generalized triphones (Lee, 1988), and a hierarchy from words to subsets of triphones (Chow et al., 1986). Whole word models provide the most complete context of the internal phones, but usually do not model word boundary effects well. Although a subword unit, such as the triphone, can be concatenated into word models, thus providing easy additions to the lexicon, triphones account for only a subset of contextual factors. Work using log-linear modeling (Chen, 1987) has shown that use of only the preceding and following contexts does not adequately describe realization distributions.</Paragraph> <Paragraph position="6"> To test the common working assumption that preceding phoneme and following phoneme are the most useful contextual factors, as in triphone models, we examined the contexts in the trees constructed for each phoneme. As stated in the previous section, when the realizations of a phoneme are considered jointly, the preceding phoneme and following phoneme are the most useful contexts overall. However, additional contextual factors other than preceding phoneme and following phoneme can provide better estimates of the likelihood of different realizations. We thus suggest a mixed context unit based on the partitioning represented by the computed trees for use in continuous speech recognition. 
The organization of the phonological realizations into trees provides a way to specify contexts for creating models intermediate in the continuum of context models from adjacent phone to whole word. A subset of a predetermined set of possible contexts which are useful for differentiating among the realization distributions is identified. This subset contains more contexts than the data would permit if the selected contexts were always considered together.</Paragraph> <Paragraph position="7"> Consequently, a larger overall number of contexts can be used for describing the realizations. For example, in Figure 2, the contexts of PRE-PHONEME, FNC-WORD-P, and FOOT-BDRY are used for describing the realizations of /y/, but only two contexts, either PRE-PHONEME and FNC-WORD-P or PRE-PHONEME and FOOT-BDRY, are used to describe each leaf.</Paragraph> <Paragraph position="8"> In tree induction, different realizations are considered simultaneously and the partitioning based on context values is mutually exclusive. By considering all realizations of a phoneme simultaneously, the overall usefulness of the different contextual factors is analyzed. The set of selected contexts generally is not the same as those chosen in trees in which the occurrence of each realization of a phoneme is separately computed. Since each exemplar belongs to one set of context values because of the partitioning, the proportion of each realization for each set of context values can be estimated.</Paragraph> <Paragraph position="9"> In our work describing phonological variation for speech recognition, we use a systematic, data-intensive approach. Contexts are identified that correlate with the phonological variation exhibited in a large hand-transcribed database of utterances. From these correlations, we identify useful context descriptions. The combination of decision tree induction and hierarchical clustering organizes the realization data into a representation conditioned on context. The tree induction attempts to separate different realizations, while the hierarchical clustering provides for theoretically meaningful grouping of the context values, which, in turn, allows for better estimates of the realization distributions. The use of a tree structure allows multiple mutually exclusive context sets to be used to describe allophone distributions. The trees can be traversed to produce pronunciation distributions for each phoneme in a dictionary baseform. Because the chosen contextual factors in a tree are dependent on contextual factors chosen closer to the root node (e.g., for the phoneme /y/ in Figure 2, FOOT-BDRY is useful when PRE-PHONEME has the values {h, v, f, k, m, b, g, n, l, p, z}), the context trees, rather than the set of contexts in the trees, should be used for building pronunciation networks for speech recognition systems. The context trees can be reinterpreted straightforwardly, and we are examining the clustered context values in each branch for such general descriptions.</Paragraph> </Section> </Section> <Section position="3" start_page="288" end_page="288" type="metho"> <SectionTitle> Acknowledgement </SectionTitle> <Paragraph position="0"> This work was sponsored in part by the Defense Advanced Research Projects Agency (DOD), under the Information Science and Technology Office, contract #N00140-86-C-8996.</Paragraph> </Section> </Paper>