<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1067"> <Title>An A* algorithm for very large vocabulary continuous speech recognition</Title> <Section position="6" start_page="335" end_page="336" type="concl"> <SectionTitle> 6. Future Work </SectionTitle>
<Paragraph position="0"> There is obviously a substantial amount of work to be done to improve both the accuracy and the speed of our recognition algorithm. These problems are not independent of each other: we expect that the search algorithm will run faster with a better language model and better acoustic models; conversely, improvements in the search algorithm will lead to fewer search errors and hence higher recognition rates.</Paragraph>
<Paragraph position="1"> At present, the only types of pruning that have been implemented are the merging of theories having identical recent pasts and the limitations on the size of the stack used in searching a block and on the length of the list of theories passed from one block to the next. Several other possibilities remain to be explored. We may be able to get away with a beam search in the calculation of the $\beta^*$'s. It may be possible to prune hypotheses based on poor local acoustic matches (evaluated using the point scores, the $\beta^*$'s, or a combination of the two). Since the branching factor at the root node of the lexical tree is 41, we would expect a big payoff if this type of pruning can be made to work successfully whenever a word boundary is hypothesized. Also, the limitations on the stack size and on the length of the hypothesis lists passed from one block to the next should probably be made threshold-dependent rather than preset.</Paragraph>
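<Paragraph> As a rough illustration of the threshold-dependent pruning suggested above, the following minimal Python sketch prunes a hypothesis list by a score threshold (a beam) relative to the best theory, with an optional hard cap standing in for the preset limit. The names used here (Theory, beam_prune) are hypothetical; this is a sketch of the idea, not the system described in this paper.

# Minimal sketch of score-threshold ("beam") pruning for the hypothesis
# lists passed between blocks, as an alternative to a preset list length.
# All names here are hypothetical illustrations, not the paper's code.

from dataclasses import dataclass

@dataclass
class Theory:
    transcription: tuple   # partial phonemic transcription f1 ... fk
    score: float           # combined acoustic + language-model log score

def beam_prune(theories, beam_width, max_size=None):
    """Keep only theories whose score is within beam_width of the best one.

    An optional hard cap max_size mimics the preset stack/list limit that
    the text suggests replacing (or supplementing) with a threshold.
    """
    if not theories:
        return []
    best = max(t.score for t in theories)
    kept = [t for t in theories if t.score >= best - beam_width]
    kept.sort(key=lambda t: t.score, reverse=True)
    if max_size is not None:
        kept = kept[:max_size]
    return kept

# Example: prune a small hypothesis list with a beam of 10 log-probability units.
if __name__ == "__main__":
    hyps = [Theory(("f1", "f2"), -100.0),
            Theory(("f1", "f3"), -104.5),
            Theory(("f2", "f1"), -125.0)]   # falls outside the beam
    for t in beam_prune(hyps, beam_width=10.0):
        print(t.transcription, t.score)
</Paragraph>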
<Paragraph position="2"> In our current implementation, we have not taken full advantage of the sparseness of the language model statistics (the number of bigrams $w_1 w_2$ for which we have trained trigram distributions $P(\cdot \mid w_1 w_2)$ is relatively small, and these distributions are typically concentrated on very small subsets of the dictionary). Our present implementation gets some mileage out of this fact by using the notion of a language model state ($\pi$) to determine when theories can be merged, but more work remains to be done. Adding a language model component to the heuristic would probably help as well.</Paragraph>
<Paragraph position="3"> As for acoustic modelling, we can expect a major improvement by using allophone models. From the way we have presented the algorithm, it may appear that we have locked ourselves into the choice of the phoneme as the modelling unit, so it may come as a surprise to learn that our algorithm can accommodate allophone models in a natural way (without unduly increasing the amount of computation needed). The only restriction is that the allophones of a given phoneme should be defined by looking at contexts which extend no more than two phonemes to the right (there is no restriction on left contexts). Since this is an important issue, we take the time to explain what is involved here. Certainly, we would encounter problems if we were to proceed in a straightforward manner and recompile the lexicon in terms of allophonic transcriptions rather than phonemic transcriptions. Firstly, the structure of the lexical tree would have to be radically altered to accommodate allophones defined by contexts which extend across word boundaries. Secondly, with a reasonably large allophone inventory (say a few thousand), the size of the graph G* would become so large as to make the computation of the $\beta^*$'s practically infeasible. So the approach is to retain the structure of the lexical tree and the graph G* determined by the phonemic transcriptions and to perform the translation to allophonic transcriptions on-line. (The same method could be used to incorporate phonological rules whose domain spans word boundaries.) Suppose we have a theory $\theta$ whose partial phonemic transcription is $f_1 \ldots f_k$. We have to explain how $\alpha_t(f_1 \ldots f_{k-2})$ and $\beta^*_t(n)$ are computed when allophone models are used.</Paragraph>
<Paragraph position="4"> In calculating $\alpha_t(f_1 \ldots f_{k-2})$, we simply use the appropriate allophone models for each of the phonemes $f_1, \ldots, f_{k-2}$.</Paragraph>
<Paragraph position="5"> (Note that sufficient information concerning the right contexts is available to determine which allophones to use.)</Paragraph>
<Paragraph position="6"> It is natural to organize the calculation of the $\beta^*$'s in terms of the point scores. To see how this goes, consider first the case of phoneme models. The $\beta^*$'s can be calculated using the block Viterbi algorithm [16]. Recall that, for a given node n, the first two phoneme labels on any path in G* which starts at node n are uniquely determined. Denote the first phoneme by f and the second by g. The recursion formula is</Paragraph>
<Paragraph position="7"> $$\beta^*_t(n) = \max_{t' > t} V([t+1, t'] \mid f) \, \max_{n'} \beta^*_{t'}(n')$$ </Paragraph>
<Paragraph position="8"> where n' ranges over all nodes such that (n, f, n') is a branch in G* and $V([t+1, t'] \mid f)$ denotes the Viterbi score of the data in the interval $[t+1, t']$ calculated using the f model.</Paragraph>
<Paragraph position="9"> In the case of allophone models, we can calculate the backward probabilities using the recursion formula $$\beta^*_t(n) = \max_{t' > t} \, \max_{\phi} V([t+1, t'] \mid \phi) \, \max_{n'} \beta^*_{t'}(n')$$ where, as before, n' ranges over all nodes such that (n, f, n') is a branch in G* and $\phi$ ranges over all the allophones of f determined by the condition that the phoneme immediately following f is g. It is obvious that the backward probabilities calculated in this way provide an overestimate of the acoustic score of the data which has not yet been accounted for on the optimal extension of the theory $\theta$, so the admissibility condition is satisfied. Of course, it is not possible to endpoint the phoneme $f_{k-2}$ exactly in this case, since the allophone models needed to score $f_{k-1}$ and $f_k$ cannot be determined until the theory has been extended. This does not present a problem since we already have a mechanism in place for handling multiple endpoint hypotheses.</Paragraph>
<Paragraph position="10"> Finally, we have recently embarked on a project to parallelize the search algorithm with a view to obtaining a real-time response on a platform supplied by ALEX Informatique containing 48 i860's and 48 T800 transputers.</Paragraph>
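<Paragraph> As a rough illustration of the backward recursion above, the following minimal Python sketch fills in a table of $\beta^*$ scores for the allophone case. It assumes that the graph G*, the allophone inventory and the segment Viterbi scores V are available as precomputed tables, and that scores are kept in the log domain, so the products in the recursion become sums; all names are hypothetical, and the sketch shows only the structure of the computation, not the implementation described in this paper.

# Sketch of the backward recursion for allophone models (log domain):
#   beta*_t(n) = max over t' > t, over allophones phi of f with right context g,
#                and over successors n', of:  V([t+1, t'] | phi) + beta*_{t'}(n')
# The graph, the allophone inventory and the segment scores are assumed given.

NEG_INF = float("-inf")

def backward_scores(nodes, succ, phoneme_pair, allophones, V, T, final_nodes):
    """
    nodes              : node identifiers of the graph G*
    succ[n]            : successor nodes n' such that (n, f, n') is a branch
    phoneme_pair[n]    : (f, g), the first two phoneme labels on any path from n
    allophones[(f, g)] : allophones phi of f whose right-context phoneme is g
    V[(t1, t2, phi)]   : log Viterbi score of the data in [t1, t2] under model phi
    T                  : index of the last frame
    final_nodes        : nodes at which the search may terminate (an assumption
                         of this sketch; they get log score 0, i.e. probability 1)
    Returns the table beta[t][n] defined by the recursion.
    """
    beta = [{n: NEG_INF for n in nodes} for _ in range(T + 1)]
    for n in final_nodes:
        beta[T][n] = 0.0                        # no data left to account for
    for t in range(T - 1, -1, -1):              # fill the table backwards in time
        for n in nodes:
            if not succ.get(n):
                continue                        # terminal node: keep initial value
            f, g = phoneme_pair[n]
            best = beta[t][n]
            for t_prime in range(t + 1, T + 1):
                tail = max((beta[t_prime][np] for np in succ[n]), default=NEG_INF)
                if tail == NEG_INF:
                    continue
                for phi in allophones.get((f, g), ()):
                    seg = V.get((t + 1, t_prime, phi), NEG_INF)
                    if seg != NEG_INF:
                        best = max(best, seg + tail)
            beta[t][n] = best
    return beta
</Paragraph> </Section> </Paper>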