<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1067"> <Title>An A* algorithm for very large vocabulary continuous speech recognition I</Title>
<Section position="3" start_page="333" end_page="333" type="metho"> <SectionTitle> 2. Block Processing </SectionTitle>
<Paragraph position="0"> In developing our algorithms, we have decided to work with commercially distributed books on tape (analog recordings of well-known novels). Half of each recording is used as training data and the other half for testing, and we use an optical character recognizer to read the accompanying texts. Since this data is not segmented into sentences, we have designed our training and recognition algorithms to work with chunks of data of arbitrary size. This means that the data has to be processed in blocks which can fit comfortably into memory.</Paragraph>
<Paragraph position="1"> Our approach is to use an A* search in each block, similar to the isolated word recognition algorithm except that word boundaries are not known in advance and a trigram language model is used in the scoring procedure. As in the isolated word case, an admissible heuristic is obtained by means of an initial Viterbi search through a graph which imposes triphone phonotactic constraints on phone strings.</Paragraph>
<Paragraph position="2"> The A* search generates a list of theories (partial phonemic transcriptions together with word histories) for the speech data up to the end of the block. (More precisely, each of the theories generated has the property that all of the hypothesized end points for the third-to-last phoneme in the partial phonemic transcription are beyond the end of the block; the partial phonemic transcription need not end at a word boundary.) As soon as the list of theories for the current block has been obtained, the block is swapped out of memory and the search of the next block begins using this list to initialize the stack.</Paragraph>
<Paragraph position="3"> This list of theories plays the same role as the beam used in a time-synchronous Viterbi search. The Markov property of the trigram language model allows us to merge theories that have identical recent pasts but different remote pasts, so the number of theories that have to be generated at the end of each block (the 'beam width') can be held fixed without running the risk of losing the optimal theory. In order to pursue the search in subsequent blocks, the only information needed concerns the recent pasts of these theories. By logging the information concerning the remote pasts to disk, we are able to ensure that the memory required to recognize a file is independent of its length (instead of increasing exponentially with the length of the file, as would be necessary without merging and block processing).</Paragraph>
<Paragraph position="5"> For the last block in a file it is only necessary to generate a single recognition hypothesis and, once the last block has been processed, the transcription of the entire utterance can be obtained by back-tracking. The recognition algorithm can therefore be viewed globally as a beam search and locally as an A* search.</Paragraph>
</Section>
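To make the block-processing loop above concrete, the following is a minimal sketch in Python. The helper names `search_block` (standing in for the per-block A* search described in the next section) and `log_remote_past` (writing a theory's discarded history to disk), together with the fixed beam width, are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of the block-processing outer loop (illustrative, not the
# authors' implementation).  `search_block` and `log_remote_past` are
# hypothetical helpers standing in for the per-block A* search and for the
# disk logging of remote pasts.

BEAM_WIDTH = 100  # number of theories kept at the end of each block (assumed)

def recognize_file(blocks, search_block, log_remote_past):
    """Recognize a file block by block with memory independent of its length."""
    # The search of the first block starts from a single empty theory.
    theories = [{"recent_past": (), "remote_past_id": None, "score": 0.0}]
    for block in blocks:
        # A* search of the current block, seeded with the theories carried over
        # from the previous block; after merging theories with identical recent
        # pasts, at most BEAM_WIDTH theories are returned.
        theories = search_block(block, theories, beam_width=BEAM_WIDTH)
        # Only the recent past is needed to continue the search, so each
        # theory's remote past is logged to disk for later back-tracking.
        for theory in theories:
            theory["remote_past_id"] = log_remote_past(theory)
    # For the last block a single hypothesis suffices; the full transcription
    # is recovered by back-tracking through the logged remote pasts.
    return max(theories, key=lambda t: t["score"])
```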
<Section position="4" start_page="333" end_page="335" type="metho"> <SectionTitle> 3. The Heuristic </SectionTitle>
<Paragraph position="0"> Broadly speaking, an A* search of the data in a block proceeds as follows. At each iteration of the algorithm, there is a sorted list (or 'stack') of theories, each with a heuristic score. This heuristic score is calculated by combining the exact likelihood score of the speech data accounted for by the theory (using phoneme HMMs and the language model) with an overestimate of the score of the remaining data on the optimal extension of the theory permitted by the lexicon and the language model. The theory with the highest heuristic score is expanded, meaning that for each of the one-phoneme extensions permitted by the lexicon the heuristic score of the extended theory is calculated and the extended theory is inserted into the stack at the appropriate position. This process is iterated until sufficiently many theories satisfying a suitable termination criterion have been generated.</Paragraph>
<Paragraph position="1"> For the time being, we have decided to ignore the issue of overestimating language model scores altogether in constructing the heuristic (that is, we use an estimate of 1 for the language model probability of any extension of a given theory). Our strategy for overestimating acoustic scores is essentially the same as in the isolated word case: we conduct an exhaustive search backwards in time through a phonetic graph which imposes triphone phonotactic constraints on phoneme strings rather than full lexical constraints and which enables the third-to-last phoneme in a given partial phonemic transcription to be accurately endpointed. Naturally, the triphone phonotactic constraints must take account of triphones which occur at word boundaries. The simplest graph with these properties is specified as follows:</Paragraph>
<Paragraph position="3"> Nodes: there is one node for every possible diphone fg.
Branches: for every legitimate triphone fgh (that is, a triphone that can be obtained by concatenating the phonemic transcriptions of words in the dictionary) there is a branch from the node corresponding to the diphone fg to the node corresponding to the diphone gh.
Branch Labels: if fgh is a legitimate triphone then the branch from the node fg to the node gh carries the label f.</Paragraph>
<Paragraph position="4"> Denote this graph by G*. It is easy to see that this graph imposes triphone constraints on phoneme strings, that is, if g1, g2, g3, ... is the sequence of phoneme labels encountered on a given path through G* then every triple gk gk+1 gk+2 (k = 1, 2, ...) is a legitimate triphone. The labelling scheme (the choice of branch labels above) is chosen so that the endpointing condition is satisfied (see the next section).</Paragraph>
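As an illustration of this construction, the sketch below builds the node and branch sets of G* from a dictionary of phonemic transcriptions. The `dictionary` structure and the way word-boundary triphones are collected (by concatenating ordered pairs of word transcriptions) are simplifying assumptions made for the example.

```python
from itertools import product

def build_gstar(dictionary):
    """Build the triphone graph G*: one node per diphone, one branch per triphone.

    `dictionary` maps each word to its phonemic transcription (a tuple of
    phonemes).  Cross-word triphones are gathered by concatenating every
    ordered pair of transcriptions, a simplification of the word-boundary
    handling described in the text.
    """
    triphones = set()
    # Triphones occurring inside a single word.
    for phones in dictionary.values():
        for i in range(len(phones) - 2):
            triphones.add(tuple(phones[i:i + 3]))
    # Triphones straddling a word boundary.
    for p1, p2 in product(dictionary.values(), repeat=2):
        joined = p1 + p2
        for i in range(max(0, len(p1) - 2), min(len(p1), len(joined) - 2)):
            triphones.add(tuple(joined[i:i + 3]))
    # A branch for the legitimate triphone f g h runs from the node for the
    # diphone (f, g) to the node for the diphone (g, h) and carries the label f.
    nodes = {t[:2] for t in triphones} | {t[1:] for t in triphones}
    branches = {((f, g), (g, h), f) for (f, g, h) in triphones}
    return nodes, branches

# Example usage with a toy two-word dictionary (phonemes are plain strings):
# nodes, branches = build_gstar({"cat": ("k", "ae", "t"), "tap": ("t", "ae", "p")})
```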
<Paragraph position="5"> 4. Searching a block
In order to search a block extending from times T1 to T2, we first construct the hidden Markov model corresponding to the graph G* \[16\] and, for a suitably chosen positive integer A, we perform a Viterbi search backwards in time through this HMM from time T2 + A to time T1. (The condition used to determine the parameter A is given below.) The boundary condition used to initialize the search is that the backward probability at every state in the model at time T2 + A is 1. For each node n in G* and each t = T1 - 1, ..., T2 + A - 1 we thus obtain the Viterbi score of the data in the interval \[t + 1, T2 + A\] on the best path in G* which leaves n at time t and is subject to no constraints on the state in the model occupied at time T2 + A; denote this quantity by β_t(n). Suppose we are given a partial phonemic transcription f1 ... fk. Let n be the node corresponding to the diphone fk-1 fk and, for each time t, let α_t(f1 ... fk-2) denote the Viterbi score of all of the data up to time t (starting from the beginning of the utterance) for the truncated transcription f1 ... fk-2. Since β_t(n) is the Viterbi score of the data in the interval \[t + 1, T2 + A\] on the best path in G* which leaves n at time t, and the construction of G* constrains this path to pass first through a branch labelled fk-1 and then through a branch labelled fk, it is reasonable to estimate the endpoint of the phoneme fk-2 as argmax_t α_t(f1 ... fk-2) β_t(n).</Paragraph>
<Paragraph position="7"> In the case of clean speech and speaker-dependent models, this estimate turns out to be exact almost all of the time \[16\] but it is safer to hypothesize several end points (for instance by taking the five values of t for which α_t(f1 ... fk-2) β_t(n) is largest).</Paragraph>
<Paragraph position="9"> A stack entry (theory) θ is a septuple (w, f, m, n, σ, {α_t}, S) where:
1. w = w1 ... wn is a word history.
2. f = f1 ... fk is a partial phonemic transcription which may extend into a word following wn (but there are no complete words after wn in the partial transcription f).
3. m is a node in the lexical tree \[16\] corresponding to the part of f which extends beyond wn, if any; m is the root node of the lexical tree otherwise.
4. n is the node in the graph G* which corresponds to the diphone fk-1 fk.
5. σ is the current state of the trigram language model; there are three possibilities depending on whether the word following wn is predicted using a trigram distribution P(·|wn-1 wn), a bigram distribution P(·|wn) or a unigram distribution P(·).
6. for each endpoint hypothesis t, α_t is the Viterbi score of the data up to time t against the model for the truncated transcription f1 ... fk-2.
7. S is the heuristic score, which is computed from P(w), the forward scores α_t and the backward scores β_t(n), where P(w) is the probability of the word string w calculated using the trigram language model (one plausible form of the score is sketched below).</Paragraph>
<Paragraph position="13"> The reason why both w and f have to be specified is that different words may have the same transcription and different transcriptions may correspond to the same word. Obviously it is redundant to specify m, n and σ in addition to w and f, but it is convenient to do so.</Paragraph>
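A minimal representation of such a stack entry is sketched below. The field names are illustrative, and the heuristic score shown (the maximum over the hypothesized endpoints t of P(w) α_t β_t(n)) is an assumption consistent with the surrounding definitions, not a formula quoted from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class Theory:
    """One stack entry, mirroring the septuple (w, f, m, n, sigma, {alpha_t}, S).

    Field names are illustrative; probabilities are kept in the linear domain
    for readability (a real implementation would work with log scores).
    """
    words: Tuple[str, ...]                 # w = w1 ... wn, the word history
    phones: Tuple[str, ...]                # f = f1 ... fk, the partial transcription
    lexical_node: int                      # m, a node of the lexical tree
    graph_node: Tuple[str, str]            # n, the G* node for the diphone (fk-1, fk)
    lm_state: str                          # sigma: 'trigram', 'bigram' or 'unigram'
    alphas: Dict[int, float] = field(default_factory=dict)  # t -> alpha_t
    score: float = float("-inf")           # S, the heuristic score

def heuristic_score(lm_prob, alphas, betas, graph_node):
    """Assumed form of S: max over hypothesized endpoints t of P(w)*alpha_t*beta_t(n).

    `betas[t][node]` is the backward Viterbi overestimate beta_t(n); this form
    is an assumption consistent with the text, not the authors' stated equation.
    """
    return max(lm_prob * alpha_t * betas[t][graph_node]
               for t, alpha_t in alphas.items())
```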
<Paragraph position="14"> A stack entry is said to be complete if all of its hypothesized endpoints are to the right of T2. The parameter A is determined empirically by the condition that the exact endpoint of a complete stack entry should always be included among the hypothesized endpoints. (Since it is not actually possible, because of memory limitations, to carry around sufficient information with each theory to be able to generate its segmentation, we test this condition by verifying that the acoustic score of the data in each file against the global transcription found by the recognizer is the same as the score found by the training program when it is run with this transcription.) At the start of the search, the stack is initialized using the list of theories generated by searching the previous block (ending at time T1). Each of these has the property that all of its hypothesized endpoints are to the right of T1, so the speech data prior to the beginning of the current block is no longer needed. The search terminates when sufficiently many complete theories have been generated, at which point the next block is swapped into memory and a new search begins.</Paragraph>
<Paragraph position="15"> The Markov property of the trigram language model enables us to merge theories that have identical recent pasts but different remote pasts. Specifically, suppose we have two theories θ = (w, f, m, n, σ, {α_t}, S) and θ' = (w', f', m', n', σ', {α'_t}, S') such that m = m', n = n' and σ = σ'. (In this case we will say that θ and θ' are equivalent.) The future extensions of both theories which best account for the data starting at any given time (subject to lexical and language model constraints) will be identical. Thus if it happens that t is on the list of hypothesized endpoints for both theories and P(w) α_t ≥ P(w') α'_t, then we can remove t from the hypothesis list for the second theory without running the risk of losing the optimal path. In practice, the condition n = n' means that the list of hypothesized endpoints for both theories will be the same (except in very rare cases). Furthermore, if this inequality holds for one such t then it is typically because the first theory gives a better fit to the remote past than the second theory; hence it will usually be the case that if the inequality holds for one t then it will hold for all t and the second theory can be pruned away completely.</Paragraph>
<Paragraph position="18"> We can take advantage of this fact to speed up the A* search by maintaining a list of 'merge buckets' consisting of all the equivalence classes of theories encountered in the course of the search. Associated with each equivalence class we have an array of forward scores {A_t} which is updated throughout the search. For each t, A_t is defined to be the maximum of P(w) α_t over all theories θ = (w, f, m, n, σ, {α_t}, S) in the given equivalence class that have been encountered so far (in the course of searching the current block). When a new theory θ' = (w', f', m', n', σ', {α'_t}, S') in this equivalence class comes to be inserted into the stack, we can test whether the inequality P(w') α'_t < A_t holds for each hypothesized endpoint t. If it does, then we can prune this endpoint hypothesis before entering the theory into the stack; if not, then A_t is updated and the endpoint hypothesis has to be retained.</Paragraph>
<Paragraph position="21"> We have not been able to implement this scheme fully because of memory limitations. In practice, we only invoke merging when a word boundary is hypothesized, so the only merge buckets generated in the course of the search are those which correspond to theories for which m is the root node of the lexical tree. (However, before starting the search we prune the list of hypotheses generated by searching the previous block by merging at arbitrary phoneme boundaries, and we use this pruned list to initialize the stack.)</Paragraph>
</Section> </Paper>