<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1036">
  <Title>A Rapid Match Algorithm for Continuous Speech Recognition</Title>
  <Section position="2" start_page="0" end_page="170" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> For it to be feasible to perform large vocabulary continuous speech recognition in near real time on currently available, moderately priced hardware, computational compromises appear to be essential. One obvious compromise is to avoid detailed consideration of certain hypotheses that &amp;quot;cursory&amp;quot; inspection would reveal to be exceedingly unlikely. But what constitutes cursory inspection? To put it somewhat differently, how can we perform a very quick computation that would allow us to throw out many of the hypotheses that are &amp;quot;obviously&amp;quot; false? In particular, need dynamic programming play a role in such a &amp;quot;rapid match&amp;quot; algorithm? In this paper we shall describe a strategy for performing this kind of computation in the context of continuous speech recognition. The algorithm is an extension of the rapid match procedure that is used in DragonDictate, Dragon's commercially available 30,000 word discrete utterance recognizer. In that system, the interface between the rapid match module and the recognizer is very straightforward. Each time the user says something, the rapid match algorithm provides the recognizer with a relatively short list of words that it thinks might have been said -- typically between 100 and 200 words -- and in fact, this list is usually supplied to the recognizer before the speaker has finished saying the word.</Paragraph>
    <Paragraph position="1"> The nature of the interface between the rapid match module and the recognizer is different when continuous speech is involved, because the recognizer must contemplate hypotheses that represent sequences of words, in the course of recognizing an utterance. In the system we describe \[1\], the fundamental act that the rapid match module has been designed to carry out is to provide a short list of words that might begin at a particular time based on looking at speech data beginning at that time and extending a fixed (and short) duration into the future. Thus, whenever the recognizer is considering a partial sentence hypothesis that involves finishing a word at a certain time and needs to know what word hypotheses to consider as possible extensions, it can call the 2. Description of the Algorithm We begin by describing the kind of data that the algorithm works with and then move on to describe the models to which the data are compared. Finally, we describe the actual way in which the computation is done.</Paragraph>
    <Paragraph position="2"> We suppose that an utterance is converted by a front end to a sequence of k dimensional vectors, one per frame: Xl,X2 ..... Xn. At any time (i.e. frame) t the rapid match module is capable of hypothesizing a short list of words that might begin at that time, based on looking at the vectors xt,Xt+l,...,Xt+w-1 , where w is the window size. In our current implementation, a frame of 8 parameters is computed once every 20 milliseconds, and the window size is 12; thus the analysis is based on 240 milliseconds of speech data.</Paragraph>
    <Paragraph position="3"> The algorithm begins with the computation of a sequence of (k dimensional) smooth frames Yl,Y2,...,Ys , based on the x's in the window. Thus we have</Paragraph>
    <Paragraph position="5"> etc.</Paragraph>
    <Paragraph position="6"> where b is the smooth frame window width, the a's are the smoothing weights (and are assumed to sum to 1), c is the offset of successive smooth frame windows, and s is the number of smooth frames. A little thought reveals that w =  b + (s - 1)c. At the present time the smoothing weights are all equal, there are three smooth frames, each smooth frame is computed from a window encompassing four regular frames, and successive smooth frame windows are offset by 4 frames. Hence the smooth frame windows are nonoverlapping. In the DragonDictate isolated word rapid match we also make use of linear segmentation and smoothing. In that sytem, 5 smooth frames are computed using overlapping windows with nonuniform weights. We have not yet optimized the choice of the smoothing algorithm for continuous speech.</Paragraph>
    <Paragraph position="7"> In this way we therefore convert 240 milliseconds of speech data into 3 smooth frames, or 24 numbers. The smoothing that is done is intended on the one hand to produce a very parsimonious representation of the speech data to make further processing very quick, and on the other, to represent the data in a way that is not too sensitive to variations in the duration of phonemes, thus obviating the need A word start grouping (WSG) consists of a collection of words whose beginnings are acoustically similar. A word may appear in several different WSGs since, depending on the context in which the word finds itself, its acoustic realization may vary considerably. At the present time, we have at most 4 different word start groups for a given word, relating to whether or not the word emerges from silence or speech, and whether or not the word ends in silence or in more speech. It may prove to be desirable to expand the number of different possible representations of a word beyond this, to include prior phonetic context, but to do so might incre~e the necessary computation. The generation of the word start groups is automatic and relies on a specialized clustering algorithm. Here are some typical groups from the mammography vocabulary: A. medial, medially, mediasdnum, mediasfinal, mediolateral, needle, needed B. severe, suggest, suggested, severely, suggests, suggestive C. dense, density, denser, densities Each word start group (WSG) also consists of a sequence of acoustic models for smooth frames that might be generated from that group. More specifically a WSG is represented by a sequence of r probability distributions in k dimensional space, where r is no greater than s (the number of smooth frames computed): fl,f2,.--,fr. For most WSG's, we would have r = s, but for WSG's that are to represent words that are so short that they may not last long enough for all s smooth frames to be computed, we allow r &lt; s. For example, short function words like &amp;quot;of&amp;quot;, &amp;quot;in&amp;quot;, &amp;quot;the&amp;quot;, etc., when embedded in speech spoken continuously, often last less than 240 milliseconds.</Paragraph>
    <Paragraph position="8"> Currently, we assume that each probability density f is the product of k univariate probability densities (i.e. we assume that the k elements of each smooth frame y are independent). Furthermore, we assume that each univariate density is a double exponential distribution (Laplacian). Thus, a single element of a smooth frame vector y is assumed to be a random variable with a probability density of the form</Paragraph>
    <Paragraph position="10"> where g is the mean (or median) and a is the mean absolute deviation (MAD).</Paragraph>
    <Paragraph position="11"> If we wish to assess the evidence for whether the current sequence of smooth frames represents words in a particular WSG, we compute the score</Paragraph>
    <Paragraph position="13"> which is the average negative log likelihood of the smooth frames evaluated using the probability model for the WSG.</Paragraph>
    <Paragraph position="14"> Let us suppose that there are M word start groups; then it would be necessary to compute a score for each of these: SI,S2,...,SM. A considerable computational saving can be achieved by having a particlar probability density f appear as part of the model for multiple different WSG's. Then, the very same value of-log f can be added into multiple different WSG scores. Obtaining a representation of the probability distributions of word start groups in terms of a small number of probability densities f is again a job for a specialized clustering algorithm, one that clusters probability distributions. null Once we have computed the scores for each of the WSG's, we throw out all those groups with scores worse than a threshold T1. Then we look up all of the words contained in the surviving word start groups (throwing out any duplicates) and to each such word w, we attach the sum of its WSG score and a language model score:  where T2 is a second (combined) threshold. At this point we have a list of words for the recognizer to contemplate in more detail. If the recognizer has asked us to return no more than p words, where p happens to be less than the number of survivors, we would prune the list further by throwing out the worst scoring candidates.</Paragraph>
  </Section>
class="xml-element"></Paper>