<?xml version="1.0" standalone="yes"?> <Paper uid="H90-1016"> <Title>Toward a Real-Time Spoken Language System Using Commercial Hardware</Title> <Section position="3" start_page="0" end_page="72" type="intro"> <SectionTitle> 2. Hardware </SectionTitle>
<Paragraph position="0"> It is already quite straightforward to perform signal processing in real time on current boards with signal processor chips. However, the speech recognition search requires a large amount of computation together with several MB of fast, readily accessible memory. In the past there have not been commercially available boards or inexpensive computers that meet these needs. However, this is changing. The Motorola 88000 and Intel 860 chips are being put on boards with substantial amounts of random access memory. Most chips now come with C compilers, which means that the bulk of development programs can be transferred directly. If needed, computationally intensive inner loops can be hand coded.</Paragraph>
<Paragraph position="1"> After considering several choices, we have chosen boards based on the Intel 860 processor. The Intel 860 processor combines a RISC core and a floating-point processor with a peak speed of 80 MFLOPS. Currently, we have looked at VME boards made by Sky Computer (the SkyBolt) and Mercury (the MC860). The SkyBolt is currently available with 4 MB of static RAM. It will very shortly be available with 16 MB of DRAM and 256 KB of static RAM cache.</Paragraph>
<Paragraph position="2"> The Mercury MC860 is currently available with 16 MB of DRAM. Most C programs that we have run on both of these machines run about five times faster than on a SUN 4/280.</Paragraph>
<Paragraph position="3"> Figure 1 illustrates the hardware configuration that we have built. The host will be a SUN 4/330. The microphone is connected to an external preamp and A/D converter which connects directly to the serial port of the Sky Challenger. The Sky Challenger with dual TMS320C30s will be used for signal processing and vector quantization (VQ). The SkyBolt will be used for the speech recognition N-Best search. The boards communicate with the host and each other through the VME bus, making high-speed data transfers easy. However, currently the data transfer rate between the boards is very low. The SUN 4 will control the overall system and will also contain the natural language understanding system and the application back end.</Paragraph>
<Paragraph position="4"> Figure 1: The Sky Challenger dual C30 board and the Intel 860 board plug directly into the VME bus of the SUN 4.</Paragraph>
<Paragraph position="5"> We use all three processors during most of the computation. When speech has started, the C30 board will compute the signal processing and VQ in real time. The SUN 4 will accumulate the speech for possible long-term storage or playback. Meanwhile, the Intel 860 will compute the forward pass of the forward-backward search. When the end of the utterance has been detected, the SUN will give the 1-Best answer to the natural language understanding system for parsing and interpretation. Meanwhile, the Intel 860 will search backwards for the remainder of the N-Best sentence hypotheses. These should be completed in about the same time that the NL system requires to parse the first answer. Then, the NL system can parse down the list of alternative sentences until an acceptable sentence is found.</Paragraph>
<Paragraph position="6"> Currently, the computation required for parsing each sentence hypothesis is about 1/2 second. The delay for the N-Best search is about half the duration of the sentence. This is expected to decrease with further algorithm improvements.</Paragraph>
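The overlapped schedule described above (parse the 1-Best answer while the backward pass produces the remaining hypotheses, then walk down the list until a hypothesis parses) can be sketched in a few lines of code. This is a minimal illustration only: the function names, return values, and the use of Python threads are our own assumptions, standing in for work that the real system splits across the C30 board, the Intel 860 board, and the SUN 4 host.

```python
# Minimal sketch of the overlapped decode/parse schedule described above.
# All function names and the threading arrangement are illustrative
# assumptions, not the system's actual interfaces.
from concurrent.futures import ThreadPoolExecutor

def forward_pass(frames):
    """Stand-in for the real-time forward pass of the forward-backward search."""
    return {"n_frames": len(frames)}            # placeholder forward-pass state

def one_best(fwd_state):
    """Stand-in: best single hypothesis available when the utterance ends."""
    return "show all flights from boston to denver"

def backward_nbest(fwd_state, n=20):
    """Stand-in for the backward search that produces the remaining N-Best list."""
    return ["show all flights from boston to denver",
            "show me flights from boston to denver"]

def nl_parse(sentence):
    """Stand-in for natural-language understanding; None means the hypothesis is rejected."""
    return sentence if "flights" in sentence else None

def decode_utterance(frames):
    fwd = forward_pass(frames)                  # computed while speech is arriving
    first = one_best(fwd)
    with ThreadPoolExecutor(max_workers=1) as pool:
        nbest_future = pool.submit(backward_nbest, fwd)    # backward N-Best search ...
        parse = nl_parse(first)                             # ... overlapped with parsing the 1-Best
        if parse is not None:
            return parse
        for hypothesis in nbest_future.result():            # walk down the alternatives
            parse = nl_parse(hypothesis)
            if parse is not None:
                return parse
    return None

if __name__ == "__main__":
    print(decode_utterance([0.0] * 300))        # 300 dummy frames of "speech"
```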
<Paragraph position="7"> 3. Time-Synchronous Statistical Language Model Search
We know that any language model that severely limits what sentences are legal cannot be used in a real SLS because people will almost always violate the constraints of the language model. Thus, a Word-Pair type language model will have a fixed high error rate. The group at IBM has long been an advocate of statistical language models that can reduce the entropy or perplexity of the language while still allowing all possible word sequences with some probability. For most SLS domains where there is not a large amount of training data available, it is most practical to use a statistical model of word classes rather than individual words. We have circulated a so-called Class Grammar for the Resource Management Domain [3]. The language model was simply constructed, having only first-order statistics and not distinguishing the probability of different words within a class. The measured test set perplexity of this language model is about 100. While more powerful "fair" models could be constructed, we felt that this model would predict the difficulty of a somewhat larger task domain. The word error rate is typically twice that of the Word-Pair (WP) grammar. One problem with this type of grammar is that the computation is quite a bit larger than for the WP grammar, since all 1000 words can follow each word (rather than an average of 60 as in the WP grammar).</Paragraph>
<Paragraph position="8"> During our work on statistical grammars in 1987 [6], we developed a technique that would greatly reduce the computational cost for a time-synchronous search with a statistical grammar. Figure 2 illustrates a fully-connected first-order statistical grammar. If the number of classes is C, then the number of null arcs connecting the nodes is C². However, since the language models are rarely well-estimated, most of the class pairs are never observed in the training data.</Paragraph>
<Paragraph position="9"> Therefore, most of these null-arc transition probabilities are estimated indirectly. Two simple techniques that are commonly used are padding, or interpolating with a lower-order model. In padding we assume that we have seen every pair of words or classes once before we start training. Thus we estimate p(c2|c1) as (N(c1,c2) + 1) / (N(c1) + C), where N(c1,c2) is the number of times class c2 was observed to follow class c1 and N(c1) is the number of occurrences of class c1.</Paragraph>
<Paragraph position="11"> Figure 2: A fully-connected first-order statistical grammar requires C² null arcs.</Paragraph>
<Paragraph position="12"> In interpolation we average the first-order probability with the zeroth-order probability with a weight that depends on the number of occurrences of the first class: p(c2|c1) = λ(c1) N(c1,c2)/N(c1) + (1 - λ(c1)) p0(c2).</Paragraph>
<Paragraph position="14"> In either case, when the pair of classes has never occurred, the probability can be represented much more simply. For the latter case of interpolated models, when N(c1,c2) = 0 the expression simplifies to just (1 - λ(c1)) p0(c2).</Paragraph>
<Paragraph position="16"> The first term, 1 - λ(c1), depends only on the first class, while the second term, p0(c2), depends only on the second class. We can represent all of these probabilities by adding a zero-order state to the language model. Figure 3 illustrates this model. From each class node we have a null transition to the zero-order state with a probability given by the first term.</Paragraph>
<Paragraph position="17"> Then, from the zero-order state to each of the following class nodes we have the zero-order probability of that class.</Paragraph>
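The padding and interpolation estimates above, and the way an interpolated estimate factors into (1 - λ(c1)) p0(c2) for an unseen class pair, are easy to make concrete. In the sketch below the class inventory, the counts, and the particular weight λ(c1) = N(c1)/(N(c1) + C) are all illustrative assumptions; the text only states that the weight depends on the count of the first class.

```python
# Sketch of the padding and interpolation estimates for class-bigram
# probabilities.  Counts and the form of lambda(c1) are illustrative
# assumptions; only the structure of the estimates matters here.
from collections import Counter

classes = ["SHOW", "FLIGHT", "FROM-CITY", "TO-CITY"]       # toy inventory, C = 4
C = len(classes)

pair_counts = Counter({("SHOW", "FLIGHT"): 8, ("FLIGHT", "FROM-CITY"): 5,
                       ("FROM-CITY", "TO-CITY"): 3, ("SHOW", "FROM-CITY"): 1})
first_counts = Counter()                                    # N(c1)
for (c1, _c2), n in pair_counts.items():
    first_counts[c1] += n
total = sum(first_counts.values())

def p0(c2):
    """Zeroth-order class probability (padded unigram estimate)."""
    return (sum(n for (_a, b), n in pair_counts.items() if b == c2) + 1) / (total + C)

def p_padded(c2, c1):
    """Padding: pretend every class pair was seen once before training."""
    return (pair_counts[(c1, c2)] + 1) / (first_counts[c1] + C)

def p_interpolated(c2, c1):
    """First-order estimate interpolated with the zeroth-order estimate."""
    n1 = first_counts[c1]
    lam = n1 / (n1 + C)                                     # assumed count-based weight
    first_order = pair_counts[(c1, c2)] / n1 if n1 else 0.0
    return lam * first_order + (1.0 - lam) * p0(c2)

# For a pair never seen in training, the interpolated estimate reduces to a
# term depending only on c1 times a term depending only on c2:
c1, c2 = "SHOW", "TO-CITY"                                  # N(c1, c2) = 0
lam = first_counts[c1] / (first_counts[c1] + C)
assert abs(p_interpolated(c2, c1) - (1 - lam) * p0(c2)) < 1e-12
print(p_padded(c2, c1), p_interpolated(c2, c1))
```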
<Paragraph position="18"> Now that the probabilities for all of the indirectly estimated transitions have been taken care of, we only need the null transitions whose probabilities are estimated from actual occurrences of the pairs of classes, as shown in Figure 4. Assuming that, on average, there are B different classes that were observed to follow each class, where B << C, the total number of transitions is only C(B + 2). For the 100-class grammar we find that B = 14.8, so we have 1680 transitions instead of 10,000. This savings reduces both the computation and storage associated with using a statistical grammar.</Paragraph>
<Paragraph position="19"> All of the transitions estimated from no data are modeled by transitions to and from the zero-order state.</Paragraph>
<Paragraph position="20"> It should be clear that this technique can easily be extended to a higher-order language model. The unobserved second-order transitions would be removed and replaced with transitions to a general first-order state for each word or class. From these we then have first-order probabilities to each of the following words or classes. As we increase the order of the language model, the percentage of transitions that are estimated only from lower-order occurrences is expected to increase. Thus, the relative savings from using this algorithm will increase.</Paragraph> </Section> </Paper>
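As a rough check of the transition-count arithmetic above (C(B + 2) arcs instead of C²), the sketch below builds the reduced arc set for a 100-class toy grammar in which every class has exactly B = 15 observed followers. The data structures and probability labels are illustrative placeholders, not the system's actual representation.

```python
# Sketch: replace the full C*C null-arc matrix with direct arcs for observed
# class pairs plus a shared zero-order state, giving C*(B + 2) arcs in total.
import random

random.seed(0)
C = 100                                           # number of classes, as in the paper's example
B = 15                                            # observed followers per class (paper: B = 14.8 on average)
classes = [f"c{i}" for i in range(C)]
observed_followers = {c: random.sample(classes, B) for c in classes}

arcs = []
for c1 in classes:
    for c2 in observed_followers[c1]:             # arcs estimated from real pair counts
        arcs.append((c1, c2, "p1(%s|%s)" % (c2, c1)))
    arcs.append((c1, "ZERO", "1 - lambda(%s)" % c1))   # into the zero-order state
for c2 in classes:
    arcs.append(("ZERO", c2, "p0(%s)" % c2))           # out of the zero-order state

assert len(arcs) == C * (B + 2)                   # C(B + 2) arcs in total
print(len(arcs), "arcs instead of", C * C)        # 1700 vs 10000 here; the paper reports 1680 for B = 14.8
```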