<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1067">
  <Title>An A* algorithm for very large vocabulary continuous speech recognition</Title>
  <Section position="2" start_page="0" end_page="333" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> In this paper we will give a preliminary report on our efforts to extend our earlier work on very large vocabulary isolated word recognition \[16, 17, 19\] to continuous speech tasks. We are aiming to perform speaker-dependent continuous speech recognition using a trigram language model and a vocabulary of 50,000-100,000 words.</Paragraph>
    <Paragraph position="1"> Although the problem of very large vocabulary isolated word recognition has largely been solved \[I, 19\], no experiments have yet been conducted in continuous speech recognition with comparably large vocabularies because the search problem is so formidable. The best known approach to the search problem uses the word as the fundamental search unit and a stack decoding algorithm \[II, 20\] (also known as an A* search \[14\]). The effectiveness of this approach depends on having a good fast match strategy to identify candidate words whenever a word boundary is hypothesized. Many different fast match algorithms have been proposed \[3, 4, 5, 15\] but they have yet to be shown to perform satisfactorily on continuous speech tasks having a vocabulary larger than 5,000 words \[2\].</Paragraph>
    <Paragraph position="2"> An alternative approach developped by Phillips \[6\] (on a 103300 word vocabulary in German) uses the phoneme as the fundamental search unit and consists ofa Viterbi search of the hidden Markov model obtained by combining phoneme HMMs with a Markovian language model (such as a trigram model). Aggressive pruning is necessary since the search space is very large. (For instance, if a trigram language model is used then one copy of the lexical tree is needed for every possible bigram.) The phoneme inventory is surficently small that exact matches can be calculated whenever  they are needed. However, it remains to be seen whether the phoneme unit is capable of attaining respectable accuracies on very large-scale recognition tasks.</Paragraph>
    <Paragraph position="3"> A new type of bi-directional search strategy has emerged recently. The basic idea is to guide the search by means of a heuristic obtained by first carrying out an inexpensive search in the reverse time direction (subject to relatively weak linguistic and/or lexical constraints). This type of approach appears to have been discovered independently by several groups and has been shown to work effectively on a variety of applications \[8, 9, 16\]. In \[16\] we presented a very efficient algorithm for very large vocabulary isolated word recognition using this paradigm and the phoneme as the fundamental search unit. Our current efforts are devoted to extending this algorithm to continuous speech.</Paragraph>
    <Paragraph position="4"> This isolated word recognition algorithm is an A* algorithm which uses a heuristic obtained by searching a phonetic graph \[16\] which imposes triphone phonotactic constraints on phoneme strings. This search is conducted using the standard Viterbi algorithm in the reverse time direction (starting from the end of the utterance). In addition to providing a very efficient heuristic, a major advantage in using triphone phonotactic constraints is that it enables us to identify the endpoint of the third-to-last phoneme in each partial recognition hypothesis with a high degree of accuracy, thereby substantially reducing the size of the search space. Another innovative feature of this algorithm is that it computes the acoustic matches of every segment of data with each of the phoneme models ('the point scores') before carrying out the A* search. (The principal reason for doing so is that this approach enables segment-level features such as phoneme durations to be modelled in an optimal way \[16, 18, 7\].) The effectiveness of the triphone heuristic depends more on the quality of the phoneme models than on the size of the search space. Thus we have found that, even without any pruning, the isolated word recognition algorithm runs more quickly on a 60,000-word recognition task with clean speech and speaker-dependent models than on a 1,600-word task with telephone speech and speaker-independent models \[16\]. In the speaker-dependent case most of the computaion is taken up by the pre-processing (the calculation of the point scores  and the Viterbi search) and the A* search itself accounts for only about 1% of the total. In extending the algorithm to continuous speech (also with a 60,000 word vocabulary and speaker dependent models), we have found that the amount of pre-processing per unit time remains essentially the same, but the amount of computation needed for the A* search increases by three orders of magnitude. Hence, the total computational demands of the algorithm only increase by a factor of about 10.</Paragraph>
    <Paragraph position="5"> The experiments reported here have been conducted using phoneme models, but the search algorithm can be extended to accommodate allophone models (including cross-word allophones) fairly easily.</Paragraph>
  </Section>
class="xml-element"></Paper>