<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1049">
  <Title></Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
IMPROVING THE HUMAN INTERFACE WITH
SPEECH RECOGNITION
</SectionTitle>
    <Paragraph position="0"> Apple's approach is distinguished by its emphasis on conversational communication with personal computers as distinct from dictation or command and control only. It is further distinguished by integration of speech recognition into the visual &amp;quot;desk top&amp;quot; metaphor of personal computers. We believe that speech recognition will impact personal computing sooner and more effectively if it is integrated with other I/O modalities such as the mouse, keyboard, visual icons, dialog boxes and perhaps speech output.</Paragraph>
    <Paragraph position="1"> We expect to bring such integrated systems to market in the 1990's.</Paragraph>
    <Paragraph position="2"> Our approach is similar in spirit to notions of Alan Sears in his SLIP (speech, language icons, and pointing) paradigm but with some distinctive differences. We will use task domain constraints provided by particular application packages on personal computers to create constrained natural language understanding. Furthermore we will implement interactive voice and text response mechanisms such as dialog boxes and speech synthesis to respond to the users input. We will provide a conversational natural language understanding within narrow task domains on personal computers in which speech is augmented with pointing, typing, and mousing around.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="241" type="metho">
    <SectionTitle>
SPEECH UNDERSTANDING ON PERSONAL
COMPUTERS
</SectionTitle>
    <Paragraph position="0"> A perennial problem confronting the speech recognition community has been lack of adequate computing power to perform real time recognition and understanding. This shortcoming is being solved, not so much to serve speech interests as it is to serve the computing needs of society at large. It is the natural progression of VLSI, economies of scale of mass produced personal computers, and computing infi'astructures.</Paragraph>
    <Paragraph position="1"> For personal computer users, speech recognition is particularly useful in areas where the user is confronted with too many options to easily manage with function keys or a small number of shift-key combinations. The current solution is to use pull down or pop up menus but these are fast becoming less convenient by shear weight of numbers of options. Sub-directories of sub-directories are becoming common. The arm motion simply to get the initial menu, and then each submenu, is a limitation on ease-of-use. Speech recognition can cut through the branches of the menu tree to speed throughput as  long as the speech recognition is fast and accurate enough.</Paragraph>
    <Paragraph position="2"> Speech recognition offers other advantages to the user interface by allowing many words and phrases to mean the same thing. If a user forgets a command or does not know if one exists, speech recognition systems can partially solve this problem by supporting synonyms and paraphrase. In addition, when user defined scripts and macros become numerous, they are difficult to manage with function keys and shift key commands. Speech recognition allows users to invoke these macros and scripts with a distinctive name or phrase and avoids function keys altogether. We expect to employ speech in interfaces to educational programs, standard computer applications (spreadsheet, word processing, etc.), multimedia systems, and telephone access systems.</Paragraph>
    <Paragraph position="3"> Automated language learning is another area of particular interest to Apple that seems to be yielding to DARPA sponsored research. Speech recognition techniques are becoming good enough to time align known utterances to templates for the words in the speech. Words that are poorly pronounced can be spotted and students can be directed to repeat offending words to mimic correctly pronounced words from the computer.</Paragraph>
  </Section>
  <Section position="5" start_page="241" end_page="242" type="metho">
    <SectionTitle>
COMMERCIAL APPLICATIONS OF DARPA
TECHNOLOGY
</SectionTitle>
    <Paragraph position="0"> Our philosophy at Apple is to leverage the efforts of other companies and researchers by providing them a platform through which they can commercially address the needs of personal computer users. For example, we stay out of certain business areas such as selling application software in order to encourage independent developers to develop products in these areas. In the research area, we stand ready to adopt systems developed by DARPA contractors and offer them along side our internally developed systems to commercial outlets.</Paragraph>
    <Paragraph position="1"> Apple encourages outside vendors to produce ASR systems to be promoted or sold by Apple. We prefer to work with those DARPA contractors that make their research freely available, but we will also consider licensing technology from the outside if it is better than internally developed technology. We actively seek partners to supply components to be used in our own internal ASR systems.</Paragraph>
    <Paragraph position="2"> For example, we currently have SPHINX working on a MAC which we call MACSPHINX. This is not currently scheduled to be shipped as a product, but a product may be based on MACSPHINX at a later time.</Paragraph>
    <Paragraph position="3"> As our contribution to the underlying technology, we intend to extend Sphinx to give it a speaker dependent mode in which it can learn new words &amp;quot;on the fly&amp;quot;. We will initially do this by augmenting Sphinx with ANNs as described below.</Paragraph>
    <Paragraph position="4"> As another example of partnering, we expect to begin building on Victor Zue's work with VOYAGER. We will receive VOYAGER from MIT in a month or two running on a MAC platform. We expect to modify it to run faster and with an order of magnitude less computing power.</Paragraph>
    <Paragraph position="5"> THE PLUS SPEECH ACCELERATOR PROJECT: In order to make it easier for DARPA contractors to use MACINTOSH computers, and to build speech recognition systems that would control applications on MACINTOSHs, we have supported and encouraged Roberto Bisiani to design a &amp;quot;speech accelerator&amp;quot; for more than a year. The goal was to allow a MAC to have intimate control over an accelerator processing unit that would offer between 50 and 200 MIPS economically and with broad base of software support. This was achieved in an external box, named PLUS by its designer Roberto Bisiani, which has a SCSI interface as well as higher speed NU BUS connection to a MAC. The SCSI interface allows the box to be programmed by other computers such as SUN computers as well as using a MAC.</Paragraph>
    <Paragraph position="6"> However, the high speed NU BUS interface to the MAC will allow tighter integration with the MAC than other computers. The box itself contains Motorola 88000s, one to ten in a single box; and the boxes may be daisy chained. We hope many of the DARPA contractors in attendance here will use the accelerator box to make their spoken language communication systems available to MAC  The identity of the ~ Second phrase, word, or ,,j..,, Step phoneme is verified by the outp?t / T</Paragraph>
    <Paragraph position="8"> FIG. 1 Nodes in a canonical HMM topology pointing to time intervals in speech time series and also to the input nodes in an ANN. The pointers to the time intervals are established using well known techniques as part of the HMM processing in step 1. Standard ANN techniques are then applied in step 2 to the speech which has now been time aligned to fixed structure of the HMM.</Paragraph>
    <Paragraph position="9"> applications. Development of this box is currently funded by DARPA and will probably be available later this year.</Paragraph>
  </Section>
  <Section position="6" start_page="242" end_page="244" type="metho">
    <SectionTitle>
ANN POST PROCESS TO HMM
</SectionTitle>
    <Paragraph position="0"> The Hidden Markov Model (HMM) approach is the dominant speech recognition paradigm in the ASR field today. Millions of dollars have been spent in dozens of institutions to explore the contributions of HMM techniques to speech recognition. Artificial Neural Net (ANN) technology is newer, but it has also become heavily funded and widely investigated.</Paragraph>
    <Paragraph position="1"> It has been only within the last year or two that the possibility and need to combine techniques from these two fields has emerged. It is very likely that numerous proposals for merging HMMs and ANNs will be presented in the next few years.</Paragraph>
    <Paragraph position="2"> George White has proposed a new and previously unpublished, technique for combining HMMs and ANNs. We refer to this technique as the ANN &amp;quot;postprocessing technique&amp;quot; by which we mean that</Paragraph>
    <Paragraph position="4"> ANNs should be applied to speech &amp;quot;after&amp;quot; HMMs have been applied. The HMM processing determines where in time words should be located in the acoustic input data. More to the point, input nodes in an ANN structure may be connected to states inside the finite state machines that form HMMs, which are then connected to time intervals in the unknown speech.</Paragraph>
    <Paragraph position="5"> The HMMs accomplish non-linear time warping, otherwise known as &amp;quot;dynamic time warping,&amp;quot; of time domain acoustic information to properly match the rigid structure of a neural net template (see Fig 1).</Paragraph>
    <Paragraph position="6"> We postulate that there is great significance to bringing inputs from HMM states that span several time units to provide input to the neural nets.</Paragraph>
    <Paragraph position="7"> The fundamental postulate of HMM or DTW (Dynamic Time Warping) is that the speech sound similarity scores in adjacent time intervals may be simply summed up provide a global match score.</Paragraph>
    <Paragraph position="8"> This is rationalized by assuming that the probability of global match over all time intervals, P(tl,t2,t3 .... tn), is equal to the product of the probabilities of the matches for each individual time interval.</Paragraph>
    <Paragraph position="9"> In other words, the fundamental assumption behind</Paragraph>
    <Paragraph position="11"> This may be acceptable when there is no practical alternative but it is not accurate and can lead to recognition errors when when subtle differences between words matter.</Paragraph>
    <Paragraph position="12"> ANNs can circumvent this problem if they are trained on the global unit spanning tl,t2,t3 .... tn. The fundamental motivation behind our approach to merging ANNs and HMMs is that ANNs compute P(tl,t2,t3,...tn) directly and thus avoid the error of Eq 1.</Paragraph>
    <Paragraph position="13"> For example, the HMM approach to scoring word sized units sums scores for phonemes which in turn sum scores over elemental time units, typically 10 ms in duration, which assumes statistical independence between the phonemes and also between the 10 ms domain units. Since these units are usually not statistically independent, some are typically overweighted. ANNs spanning word sized units overcome some of these limitations.</Paragraph>
    <Paragraph position="14"> Previous work on the general subject of merging HMMs and ANNs includes, &amp;quot;Speaker-Independent  evidently applies MLP (a form of ANN) to individual states inside HMM models. While this is a merger of ANN and HMM techniques, it falls short of the power of an ANN post process which overcomes the lack of statistical independence between adjacent time intervals.</Paragraph>
    <Paragraph position="15"> Other work includes &amp;quot;Speaker-Independent Recognition of Connected Utterances Using</Paragraph>
    <Section position="1" start_page="244" end_page="244" type="sub_section">
      <SectionTitle>
Recurrent and Non-recurrent Neural Networks&amp;quot;
</SectionTitle>
      <Paragraph position="0"> (IJCNN, June 1989). This work, like the one mentioned above, doesn't propose to achieve time alignment by HMM techniques as a precursor to applications of ANNs which is the basis of our proposal. Instead, it proposes to apply ANN technology first and then apply HMM techniques.</Paragraph>
      <Paragraph position="1"> This necessarily precludes the beneficial effects of HMM guided dynamic time warping from being realized by the inputs to the ANNs.</Paragraph>
      <Paragraph position="2"> Other related work in this area may be considered as special cases of one of the two above mentioned approaches.</Paragraph>
  </Section>
  <Section position="7" start_page="244" end_page="245" type="metho">
    <SectionTitle>
GENERAL COMMENTARY ON ANN
</SectionTitle>
    <Paragraph position="0"> While we advocate the use of ANN in conjunction with HMM or DTW, we do not at all endorse the notion that ANNs should be used alone, without DTW or HMM or other segment spotting approaches.</Paragraph>
    <Paragraph position="1"> Internal time variability in word pronunciation in multiple pronunciations of the same word must be managed and ANNs have no easy way to handle temporal variability without extraordinary requirements for silicon area.</Paragraph>
    <Paragraph position="2"> To handle the time variability problem with  accuracies competitive with HM, neural net structures must store intermediate results inside the neral net structures for each time interval. For problems as complex as speech recognition, this is not practical on silicon because of the number of interconnections is limited by the two dimensional nature of the surfaces of silicon chips. Trying to simulate the needed structures on Von Neumann machines, with DSPs for example, will result in optimal solutions similar to the Viterbi search currently used in HMM systems. In other words, as long as we are restricted to two dimensional chips and Von Neumann architectures, ANN simulations will necessarily need to employ search strategies that are already core technologies in the speech community. It would be misguided for the ANN community to ignore these refined search techniques. It is not likely that the need search strategies can be circumvented as long as the dynamic allocation of cpu operations is needed and it will be needed until we achieve three dimensional interconnections between ANNs. We should expect that hybrid combinations of HMMs (or DTW based approaches) and ANNs will be superior to pure ANN systems until an entirely new process for producing three dimensional integrated circuits is invented, and this will probably be a long time.</Paragraph>
  </Section>
class="xml-element"></Paper>