<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1056">
  <Title>LEXICAL ACCESS WITH A STATISTICALLY-DERIVED PHONETIC NETWORK</Title>
  <Section position="5" start_page="0" end_page="289" type="metho">
    <SectionTitle>
2. PROBABILISTIC MODEL
</SectionTitle>
    <Paragraph position="0"> We form the probabilistic model as follows. Let w be a sequence of words, let y be a sequence of phones, let d be a sequence of durations, and let s be a (fixed) speech signal. Then</Paragraph>
    <Paragraph position="2"> The left-hand side of this relation is the probability that a given speech signal corresponds to a particular word sequence.</Paragraph>
    <Paragraph position="3"> The word sequence that maximizes this term gives the minimum sentence error rate. The first factor on the right-hand side gives the acoustic likelihoods provided by the phone recognizer. The second factor gives the lexical likelihoods to be provided by the lexical access stage described here. The third factor represents whatever language model we use.</Paragraph>
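The role of the three factors can be sketched numerically. In the sketch below, every hypothesis and every score is hypothetical (invented for illustration, not taken from the paper); the point is only that, in the log domain, the best word sequence maximizes the sum of the acoustic, lexical, and language-model terms.

```python
# Toy illustration of the factorization P(s|y,d,w) * P(y,d|w) * P(w):
# in log space, the best hypothesis maximizes the sum of the three terms.
# All hypotheses and scores below are invented for illustration.
hypotheses = {
    "what is the ship's heading": {"acoustic": -120.3, "lexical": -14.2, "lm": -8.1},
    "what is the ships heading":  {"acoustic": -120.3, "lexical": -16.9, "lm": -9.5},
}

def total_log_score(h):
    # Sum of log acoustic likelihood, log lexical likelihood, and log LM prob.
    return h["acoustic"] + h["lexical"] + h["lm"]

best = max(hypotheses, key=lambda w: total_log_score(hypotheses[w]))
```

Here the acoustic term is shared (same phone-recognizer output), so the lexical and language-model terms decide between the hypotheses.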
    <Paragraph position="4"> In this paper, we have used the output of the current Bell Labs phone recognizer as input to the lexical access component [1]. At present, this recognizer outputs a single sequence of phones and durations per utterance, which represents its best estimate of the true sequence. As such, y and d are fixed in Eq. 2.1 for a given speech signal. A more general approach, which would consider alternative sequences - phone lattices - is currently under investigation, but not reported here.</Paragraph>
    <Paragraph position="5"> Also in this paper, in which we present results on the DARPA resource management task, we consider only the simple word-pair language model. Thus, for a given utterance, the best scoring word sequence, w, will be the one that maximizes the lexical likelihood, P(y, d | w), for a given phone recognizer output y and d, and which is a legal sequence in the word-pair grammar. In this model, finding the word sequence that maximizes this likelihood is the goal of lexical access, and estimating this likelihood is the goal of this paper.</Paragraph>
    <Paragraph position="6"> A crucial factor for this estimation is that y and d are not the true sequence of phones and durations, but the output of a phone recognizer. As such, we must train our estimator on the output of the phone recognizer. This is theoretically correct for minimum error rate recognition in this model and is intuitively pleasing since it means that the model will learn the "confusion matrix" of the phone recognizer and thus exploit its regularities. This, combined with our probabilistic model, differentiates us from other approaches to lexical access [2,3].</Paragraph>
    <Paragraph position="7"> We can further decompose this problem by breaking the lexical likelihoods into two factors:</Paragraph>
    <Paragraph position="9"> The first factor is the pronunciation model, which gives the probability of a phone sequence given a word sequence and the second factor is the segmental duration model which gives the probability of a duration sequence given the phone and word sequences.</Paragraph>
    <Paragraph position="10"> Given a word sequence we can use a dictionary to look up the corresponding phoneme sequence [4]. We can then replace the word sequence w in Eq. 2.2 with the phoneme sequence augmented with word boundaries and lexical stress with little loss of information.</Paragraph>
    <Paragraph position="11"> It is important not to confuse phonemes and phones at this point. A phoneme is a coarse description of the pronunciation of a word as usually found in a dictionary. A phone gives a finer description indicating how the speaker uttered a word in context. For example, the /t/ in 'butter' may be pronounced as a flap, [dx], or as a released t, [tcl t]. In this paper, we use the TIMITBET symbols, a superset of the ARPABET symbols, for specifying phones [5]. Which phone will be the realization of the phoneme /t/ in this word depends, in part, on the speaker's dialect and speaking rate. Nor is the phonetic realization of a phoneme always deterministic; only about 75% of the /t/'s in a context similar to that in 'butter' are flapped, as estimated from the TIMIT database. It is precisely this phoneme-to-phone mapping that comprises the pronunciation model that we are trying to generate.</Paragraph>
    <Paragraph position="12"> Let us make this idea precise. Let x = x1 x2 ... xm be the string of phonemes of some sentence. So that we can mark both word boundaries and stress, we augment the phoneme set to include /$/ as a word boundary marker and split each syllabic phoneme into an unstressed, a primary stressed, and a secondary stressed version. Further, let y = y1 y2 ... yn be the string of corresponding phones. We include the phone symbol [-] to indicate that a phoneme may delete.</Paragraph>
    <Paragraph position="13"> The most general form of our predictor is P(y | x), where P estimates the probability that the phone sequence y is the realization of the phoneme sequence x.</Paragraph>
    <Paragraph position="14"> This specifies the probability of an entire phone sequence y. For convenience, we want to decompose this into one phone prediction at a time. Since P(y | x) = P_1(y_1 | x) P_2(y_2 | x, y_1) ... P_n(y_n | x, y_1 ... y_{n-1}), (2.3)</Paragraph>
    <Paragraph position="16"> we can restate the problem as finding a suitable predictor, P_k(y_k | x, y_1 ... y_{k-1}), that estimates the probability that y_k is the kth phone in the realization, given the phoneme sequence x and the previous k-1 phones y_1 ... y_{k-1}.</Paragraph>
    <Paragraph position="17"> Eq. 2.3 is more general than necessary since, realistically, the kth phone will depend only on a few neighboring phonemes and phones. Suppose that we can place the phoneme and phone strings into alignment. In fact, forming a good alignment between phonemes and phones is easy if deletions and insertions are permitted, using a phonetic feature distance measure and standard string alignment techniques [6]. Since we have augmented the phone set to include a deletion symbol, the only stumbling block to such an alignment would be if phones insert. For the moment, assume that they don't; we will come back to insertions later. Thus, under this assumption we can talk about the kth phoneme and its corresponding phone. We assume P_k(y_k | x, y_1 ... y_{k-1}) = p(y_k | x_{k-r} ... x_{k-1} x_k x_{k+1} ... x_{k+r}, y_1 ... y_{k-1}). (2.4) In other words, P_k is stationary and depends only on the ±r neighboring phonemes.</Paragraph>
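The alignment step described above can be sketched with ordinary edit-distance dynamic programming. This is a minimal sketch, not the paper's method: it uses unit costs in place of the phonetic feature distance measure, and it marks deleted phonemes with "-" and inserted phones with None.

```python
# Minimal sketch of phoneme-to-phone alignment via edit-distance DP,
# permitting deletions (phoneme -> "-") and insertions. Unit costs are
# an assumption; the paper uses a phonetic feature distance measure.
def align(phonemes, phones, sub_cost=lambda a, b: 0 if a == b else 1):
    n, m = len(phonemes), len(phones)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i-1][j-1] + sub_cost(phonemes[i-1], phones[j-1]),
                          D[i-1][j] + 1,   # phoneme deleted
                          D[i][j-1] + 1)   # phone inserted
    # Trace back to recover (phoneme, phone) pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + sub_cost(phonemes[i-1], phones[j-1]):
            pairs.append((phonemes[i-1], phones[j-1])); i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i-1][j] + 1:
            pairs.append((phonemes[i-1], "-")); i -= 1          # deletion
        else:
            pairs.append((None, phones[j-1])); j -= 1           # insertion
    return list(reversed(pairs))
```

With such an alignment in hand, each phoneme has a corresponding phone (or the deletion symbol), as Eq. 2.4 assumes.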
    <Paragraph position="18"> If we assume the kth phone does not depend on any of the previous phones, we have</Paragraph>
    <Paragraph position="20"> This is the assumption that phones are conditionally independent given the phonemic context.</Paragraph>
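Under this conditional-independence assumption, the sequence likelihood is simply a product of per-position terms, each conditioned on a ±r window of phonemes. The sketch below uses a hand-made toy lookup table in place of a trained model; all probabilities are illustrative, not from the paper.

```python
import math

# Sketch of the conditional-independence assumption (Eq. 2.5):
# log P(y|x) = sum over k of log p(y_k | x_{k-r} ... x_{k+r}).
# The table of realization probabilities below is a toy stand-in
# for a trained model; every number is invented for illustration.
p_table = {
    ("#", "b", "ah"):  {"b": 0.98, "-": 0.02},
    ("b", "ah", "t"):  {"ah": 0.9, "ax": 0.1},
    ("ah", "t", "er"): {"dx": 0.75, "tcl t": 0.25},
    ("t", "er", "#"):  {"axr": 0.8, "er": 0.2},
}

def seq_log_prob(phonemes, phones, table, r=1):
    # Pad with "#" so every position has a full +-r context window.
    padded = ["#"] * r + list(phonemes) + ["#"] * r
    logp = 0.0
    for k, phone in enumerate(phones):
        ctx = tuple(padded[k:k + 2 * r + 1])
        logp += math.log(table[ctx][phone])
    return logp
```

For example, scoring the aligned realization of 'butter' multiplies the probability of each phone given its phonemic window.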
    <Paragraph position="21"> To handle phone insertions, we add a second model that predicts the phone insertions. Consider a phone sequence z0 y1 z1 y2 z2 ... yn zn that is the realization of phoneme sequence x1 x2 ... xn. We view phone y_i as the realization of phoneme x_i and view phone z_i as an insertion between phones y_i and y_{i+1}.</Paragraph>
  </Section>
  <Section position="6" start_page="289" end_page="290" type="metho">
    <SectionTitle>
3. DECISION TREES
</SectionTitle>
    <Paragraph position="0"> We now discuss the question of how, in general, we can estimate the likelihoods in Eq. 2.2. We stated in the introduction that we intend to estimate them directly from training data by statistical means. In the DARPA resource management task, we use the output of the phone recognizer run on the training set. Since the phone recognizer is also trained on this same data set, the phone recognition rate would be much better than on independent test sets if we did this directly. Instead, we train the phone recognizer on 9/10 of the training set and then run it on the remaining 1/10. By doing this ten times on the different portions of the training set, we are able to obtain a more realistic phone training set for lexical access.</Paragraph>
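The round-robin scheme above can be sketched as a fold generator: each sentence is decoded by a recognizer trained on the nine folds that exclude it. This is a schematic of the data partitioning only; the recognizer training itself is of course not shown.

```python
# Sketch of the round-robin (jackknife) partitioning: split the training
# sentences into 10 folds; for each fold, a recognizer trained on the
# other 9 folds decodes it, so every sentence is decoded by a recognizer
# that never saw it during training.
def round_robin_folds(sentences, n_folds=10):
    folds = [sentences[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        train = [s for s in sentences if s not in held_out]
        yield train, held_out
```

Iterating over the generator yields ten (train, held-out) pairs whose held-out parts together cover the whole training set exactly once.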
    <Paragraph position="1"> Given this data, how can we obtain estimates of the pronunciation and duration likelihoods in Eq. 2.2? The simplest procedure would be to collect n-gram statistics on the training data. A bi-phonemic or possibly tri-phonemic context would be the largest possible with available training data if we want statistically reliable estimates.</Paragraph>
    <Paragraph position="2"> We believe that straightforward n-gram statistics on the phonemes are probably not ideal for this problem, since the contextual effects that we are trying to model often depend on a whole class of phonemes in a given position, e.g., whether the preceding phoneme is a vowel or not. A procedure that clustered all vowels in that position into one class for that case would produce a more compact description, would be more easily estimated, and would allow a wider effective context to be examined.</Paragraph>
    <Paragraph position="3"> Thus, intuitively, we would like a procedure that pools together contexts that behave similarly but splits apart ones that differ. An attractive choice from this point of view is a statistically-generated decision tree with each branch labelled with some subset of phonemes for a particular position. The tree is generated by splitting nodes where statistical tests, based on the available data, indicate that splitting improves prediction, and terminating nodes otherwise.</Paragraph>
    <Paragraph position="4"> An excellent description of the theory and implementation of tree-based statistical models can be found in Classification and Regression Trees [7]. The interesting questions for generating a decision tree from data - how to decide which splits to take, and when to label a node terminal and not expand it further - are discussed in these references along with the widely-adopted solutions.</Paragraph>
    <Paragraph position="5"> Suffice it to say here that the result is a binary decision tree whose branches are labelled with binary cuts on the continuous features and with binary partitions on the categorical features, and whose terminal nodes are labelled with continuous predictions (regression tree) or categorical predictions (classification tree). By a continuous feature or prediction we mean a real-valued, linearly-ordered variable (e.g., the duration of a phone, or the number of phonemes in a word); by a categorical feature or prediction we mean an element of an unordered, finite set (e.g., the phoneme set).</Paragraph>
    <Paragraph position="6"> When categorical predictions are made, the relative probability of each outcome at a node can be directly estimated, and when continuous predictions are made, the distribution at a node can be parametrically estimated. In this way, the trees can serve as estimators of distributions like those in Eq. 2.2 and not just as classifiers and predictors.</Paragraph>
    <Paragraph position="7"> We have chosen to use decision trees to form our estimators since they (1) use the available data relatively efficiently, (2) are able to handle both categorical and continuous inputs and outputs, (3) are trainable on new corpora quickly (which is necessary since we train on the output of a changing phone recognizer), and (4) generalize well to new test data due to the cross-validation procedure for selecting tree size [7]. The use of decision trees for these kinds of purposes has already met with some success [8-11].</Paragraph>
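The node-splitting idea can be sketched concretely. The snippet below is a toy version of the split-selection step, not the CART implementation: it scores candidate binary partitions of a categorical feature by entropy reduction, which is one standard impurity criterion; the feature values, labels, and candidate subsets are all illustrative.

```python
import math
from collections import Counter

# Toy sketch of decision-tree node splitting on a categorical feature:
# among candidate binary partitions, pick the one that most reduces the
# class entropy of the labels. Leaves would then store relative class
# frequencies, letting the tree act as a distribution estimator.
def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(feature_values, labels, candidate_subsets):
    base = entropy(labels)
    best = None
    for subset in candidate_subsets:
        left = [l for f, l in zip(feature_values, labels) if f in subset]
        right = [l for f, l in zip(feature_values, labels) if f not in subset]
        if not left or not right:
            continue  # a split must send data down both branches
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(labels)
        if best is None or gain > best[0]:
            best = (gain, subset)
    return best
```

A perfectly separating partition (e.g., vowel contexts vs. stop contexts with disjoint phone outcomes) yields the maximum possible gain of one bit on two balanced classes.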
  </Section>
  <Section position="7" start_page="290" end_page="290" type="metho">
    <SectionTitle>
4. PRONUNCIATION MODEL
</SectionTitle>
    <Paragraph position="0"> In the exposition in Section 2, we combined word boundary and stress information into the phoneme set itself. When we actually input the features into the tree classification procedure we have found it more convenient to keep them separate.</Paragraph>
    <Paragraph position="1"> We include the ±r phonemes around the phoneme that is to be realized (r = 2). This is irrespective of word boundaries. We pad with blank symbols at sentence start and end.</Paragraph>
    <Paragraph position="2"> Since there are about 40 different phonemes, if we directly input each phoneme into the tree classification routine, 2^40 possible splits would have to be considered per phoneme position at each node, since, by default, all possible binary partitions are considered. This is clearly intractable, so instead we encode each phoneme as a feature vector. A manageable choice is to encode each phoneme as a four-element vector: (consonant-manner, consonant-place, vowel-manner, vowel-place). Each component can take one of about a dozen values, including 'n/a' for 'not applicable'. For example, /sh/ is encoded as (voiceless-fricative, palatal, n/a, n/a) and /iy/ is encoded as (n/a, n/a, y-diphthong, high-front). If the phoneme to be realized is syllabic, then we also input whether it has primary or secondary stress or is unstressed. We use stress as predicted by the Bell Labs text-to-speech system; this is essentially lexical stress with function words de-accented. If the phoneme is not syllabic, we input the stress of both the first syllabic segment to the left and the first to the right, if present within the same word (and use 'n/a' if not).</Paragraph>
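The encoding can be sketched as a lookup plus a context window. The feature table below contains only a hypothetical fragment of the inventory (the entries for /sh/ and /iy/ follow the examples in the text; the /t/ entry and the padding convention are assumptions for illustration).

```python
# Hypothetical fragment of the four-element phoneme encoding
# (consonant-manner, consonant-place, vowel-manner, vowel-place).
# The /sh/ and /iy/ rows follow the examples in the text; the /t/ row
# and the "#" padding symbol are illustrative assumptions.
PHONEME_FEATURES = {
    "sh": ("voiceless-fricative", "palatal", "n/a", "n/a"),
    "iy": ("n/a", "n/a", "y-diphthong", "high-front"),
    "t":  ("voiceless-stop", "alveolar", "n/a", "n/a"),
}

def encode_context(phonemes, k, r=2, pad="#"):
    """Feature vectors for the phoneme at position k and its +-r neighbors."""
    padded = [pad] * r + list(phonemes) + [pad] * r
    window = padded[k:k + 2 * r + 1]
    blank = ("n/a",) * 4
    return [PHONEME_FEATURES.get(p, blank) for p in window]
```

Each tree input is then 5 x 4 = 20 categorical components (plus the stress and word-boundary features), rather than five raw 40-valued phoneme symbols.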
    <Paragraph position="3"> To encode word boundaries, we input the number of phonemes from the beginning and end of the current word to the phoneme that is being realized.</Paragraph>
    <Paragraph position="4"> Our output set is simply a direct encoding of the 47-element phone set used in Ljolje [1], plus the symbol [-] if the phoneme deletes. Computation time grows only linearly with the number of output classes, so this direct encoding presents no problem like the exponential growth found with the size of the input feature classes.</Paragraph>
    <Paragraph position="5"> We now describe the results of this model applied to the DARPA resource management database. The phonetic transcriptions for 3838 sentences of the training set produced by our phone recognizer as described above were aligned with their phonemic transcriptions as predicted by the Bell Labs text-to-speech system from their orthographic transcriptions. For each of the resulting 140168 phonemes, the phonemic context was encoded as described. A classification tree was grown on this data and the tree size was chosen to minimize prediction error in a 5-fold cross-validation. The resulting tree had approximately 300 terminal nodes. The resulting model predicts the phone output by the recognizer 79.5% of the time (cross-validated), contains the "correct" phone in the top 5 guesses 97% of the time, and has a conditional entropy of 1.1 bits.</Paragraph>
    <Paragraph position="6"> The corresponding insertion tree predicts whether or not the phone recognizer inserts a phone between phonemes 94.5% of the time. This seemingly good prediction is, in fact, quite poor, since the constant decision "doesn't insert" is correct almost the same percentage of the time. The best cross-validated insertion tree has only six terminal nodes, which essentially represents a fixed insertion distribution depending little on context. This reflects the fact that our choice of phone set does not produce many regular insertions (as it would if stop closure and release were separate phones), and the fact that the phone recognizer apparently does not insert spurious phones in a predictable manner.</Paragraph>
  </Section>
  <Section position="8" start_page="290" end_page="291" type="metho">
    <SectionTitle>
5. DURATION MODEL
</SectionTitle>
    <Paragraph position="0"> Our duration model, corresponding to the second factor in Eq. 2.2, has a very similar form to the pronunciation model. Our prediction, of course, is a continuous quantity, segmental duration, so we use a regression tree. We include all the input features described above for the pronunciation tree, but we now add the corresponding phone too. We encode the phone with a scheme similar to that for phonemes, but add a few extra categories to fully specify all the phones. Perhaps a useful additional input feature would be one that tries to capture speech rate; we have not tried this yet.</Paragraph>
    <Paragraph position="1"> The standard deviation of the residual in the prediction of the durations of phones output by the phone recognizer is 29 msec. This compares with an overall 45 msec. standard deviation in the phone durations themselves. The best cross-validated tree size is about 300 terminal nodes.</Paragraph>
    <Paragraph position="2"> For the lexical access, we need to represent the probability distribution of the durations. To do so, we can fit a gamma distribution to the data at each terminal node in the tree.</Paragraph>
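One simple way to fit the per-leaf gamma distribution is the method of moments, sketched below; the paper does not specify its fitting procedure, so this is an illustrative choice (maximum likelihood would be another option).

```python
# Sketch: fit a gamma distribution to the durations pooled at a terminal
# node by the method of moments (shape k = mean^2/var, scale = var/mean),
# giving a parametric duration density per leaf. The fitting method is
# an assumption; the paper only says a gamma distribution is fit.
def fit_gamma_moments(durations):
    n = len(durations)
    mean = sum(durations) / n
    var = sum((d - mean) ** 2 for d in durations) / n
    return mean * mean / var, var / mean  # (shape k, scale theta)
```

The fitted (shape, scale) pair at each leaf then yields the duration likelihood needed by Eq. 2.2 for any observed segment duration.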
  </Section>
  <Section position="9" start_page="291" end_page="291" type="metho">
    <SectionTitle>
6. IMPLEMENTATION OF LEXICAL ACCESS
</SectionTitle>
    <Paragraph position="0"> With these trees it is straightforward to take a word sequence and phone sequence and estimate the likelihood that the word sequence gives rise to the phone sequence. We use the pronunciation trees to predict the first factor in Eq. 2.2 and the duration trees to predict the second factor. This simple-minded generate-and-test algorithm, of course, is not acceptable during recognition since the number of legal sentences is enormous. Instead, we have to find a more efficient way to compute the exact same thing or a close approximation.</Paragraph>
    <Paragraph position="1"> The simplest approach to an efficient implementation is to use the decision trees to form pronunciation and duration networks for each word in the vocabulary ahead of time. Then, for every possible starting phone and every possible stopping phone in the recognized phone sequence, we match to the pronunciation network for each word in the vocabulary. To allow for insertions and deletions, this essentially becomes a string match with costs in terms of log likelihoods in the probabilistic model [cf. 3]. Dynamic programming permits an efficient match here [6].</Paragraph>
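The matching step can be sketched as a best-path DP over a word's pronunciation network. In this minimal sketch the network is simplified to a per-position phone distribution, and the insertion/deletion log-penalties and the unseen-phone floor are invented constants, not the paper's trained values.

```python
import math

# Sketch of matching a recognized phone span against a word's
# pronunciation model with edit-distance-style DP, using log likelihoods
# as match scores and fixed log-penalties for insertions and deletions.
# The penalty values and the 1e-6 unseen-phone floor are illustrative.
LOG_INS = LOG_DEL = math.log(0.01)

def match_score(word_model, phones):
    # word_model: list of {phone: prob} dicts, one per canonical position.
    n, m = len(word_model), len(phones)
    NEG = float("-inf")
    D = [[NEG] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if D[i][j] == NEG:
                continue
            if i < n and j < m:  # match canonical position i to phone j
                p = word_model[i].get(phones[j], 1e-6)
                D[i+1][j+1] = max(D[i+1][j+1], D[i][j] + math.log(p))
            if i < n:            # canonical position deleted
                D[i+1][j] = max(D[i+1][j], D[i][j] + LOG_DEL)
            if j < m:            # spurious phone inserted
                D[i][j+1] = max(D[i][j+1], D[i][j] + LOG_INS)
    return D[n][m]
```

Running this match for every word over every candidate span yields the word-level log likelihoods that the search over legal word-pair sequences then combines.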
    <Paragraph position="2"> This approach presents one disadvantage: word co-articulation information is mostly lost, since the individual word pronunciation model would need to be created without knowing the lexical context. To get around this, we can create multiple word models per word, keyed to different lexical contexts.</Paragraph>
  </Section>
</Paper>