<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1077">
  <Title>A LARGE-VOCABULARY CONTINUOUS SPEECH RECOGNITION ALGORITHM AND ITS APPLICATION TO A MULTI-MODAL TELEPHONE DIRECTORY ASSISTANCE SYSTEM</Title>
  <Section position="4" start_page="0" end_page="387" type="metho">
    <SectionTitle>
2. SPEECH RECOGNITION
ALGORITHM
</SectionTitle>
    <Paragraph position="0"> 2.1. Two-Stage LR Parser Figure 1 shows the structure of our continuous speech recognition system for telephone directory assistance. We have developed a two-stage LR parser that uses two classes of LR tables: &amp;quot;a main grammar table and several sub-grammar tables. These grammar tables are separately compiled from a context-free grammar. The sub-grammar tables deal with semantically classified items, such as city names, town names, block numbers, and subscriber names. The main grammar table controls the relationships between these semantic items.</Paragraph>
    <Paragraph position="1">  Dividing the grammar into two classes has two advantages: since each grammar can be compiled separately, the time needed for compiling the LR table is reduced, and the system cart easily be adapted to many types of utterances by changing the main grammar rules.</Paragraph>
    <Paragraph position="2">  a backward trellis as well as a forward trellis so as to accurately calculate the score of a phoneme sequence candidate. The backward trellis likelihood is calculated without any grammatical constraints on the phoneme sequences; it is used as a likelihood estimate of potential succeeding phoneme sequences.</Paragraph>
    <Section position="1" start_page="387" end_page="387" type="sub_section">
      <SectionTitle>
2.3. Adjusting Window
</SectionTitle>
      <Paragraph position="0"> We proposed an algorithm for determining an adjusting window that restricts calculation to a probable part of the trellis for each predicted phoneme. The adjusting window (shaded rectangle in Fig. 2) has a length of 50 frames (400 ms). The score within the adjusting window is calculated by taking the convolution of the forward and backward trdlises. In this procedure, the likelihood in the backward trellis is multiplied by (1-~), where ~ is a small positive value.</Paragraph>
    </Section>
    <Section position="2" start_page="387" end_page="387" type="sub_section">
      <SectionTitle>
2.4. Merging Candidates
</SectionTitle>
      <Paragraph position="0"> The LR tables need multiple pronunciation rules to cover allophonic phonemes, such as devoicing a~ad long vowels in Japanese pronunciation. These multiple rules cause an explosion of the search space. To make the search space smaller, we merge phoneme sequence candidates as well as grammatical states when they are phonetically and semantically the same.</Paragraph>
      <Paragraph position="1"> We further merge the candidate word sequences having the same meaning, ignoring the differences in non-keywords.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="387" end_page="388" type="metho">
    <SectionTitle>
3. RECOGNITION EXPERIMENTS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="387" end_page="388" type="sub_section">
      <SectionTitle>
3.1. Experimental System
</SectionTitle>
      <Paragraph position="0"> We developed a telephone directory assistance system that covers two cities and contains more than 70,000 subscriber names. The vocabulary size is roughly 80,000. The grammar used in this system has various rules for interjections, verb phrases, post-positional particles, etc. It was made by analyzing 300 sentences in simulated telephone directory assistance dialogs. Figure 3 gives an example of an inquiry that can be accepted by the system. The word perplexity was about 70,000. In this task, no constraints were placed on tic combination of addresses and subscriber names by the directory database, since users may sometimes input wrong addresses.</Paragraph>
      <Paragraph position="1">  system.</Paragraph>
      <Paragraph position="2"> We prepared two speaker-independent HMM types to evaluate our algorithm: 56 context-independent phoneme HMMs, and 358 context-dependent phoneme HMMs. Each HMM has 3 states, each with 4 Gaussian distributions. We evaluated our proposed algorithm by using 51 sentences that included  184 keywords. These utterances were prepared as text with various interjections and verb phrases. They were &amp;quot;spontaneously&amp;quot; uttered by eight different speakers.</Paragraph>
    </Section>
    <Section position="2" start_page="388" end_page="388" type="sub_section">
      <SectionTitle>
3.2. Experimental Results
</SectionTitle>
      <Paragraph position="0"> The average sentence understanding and key-word recognition rates are shown in Fig. 4. These results confirm the effectiveness of merging at the meaning level and of context-dependent HMMs. These techniques achieved an average sentence understanding rate of 65% and an average keyword recognition rate of 89%.</Paragraph>
      <Paragraph position="1">  Without merging With merging Without merging With m~ging in meaning in meaning in meaning in meaning level level level level  create a NOVO HMM as the product of two source HMMs is shown in Fig. 5. Initial probabilities and transitional probabilities of the NOVO HMM can be deduced directly from the source HMMs as the product of the corresponding parameters of the source HMMs.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="388" end_page="389" type="metho">
    <SectionTitle>
4. HMM COMPOSITION
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="388" end_page="388" type="sub_section">
      <SectionTitle>
4.1. Principle
</SectionTitle>
      <Paragraph position="0"> The HMM composition assumes that the NOVO HMM (NOVO means voice mixed with noise) obtained by combining two or more &amp;quot;source HMMs&amp;quot; will adequately model a complex signal (i.e. noisy speech) resulting from the interaction of these sources. The source HMMs may model clean speech recorded in noise-free conditions or various noise sources, such as stationary or non-stationary noises, background voices, etc. In HMM decomposition \[7\], recognition is carried out by decomposing a noisy observation in a multi-dimensional state-space (at least 3 dimensions), whereas in HMM composition the noisy observation is modeled before the recognition so the computation load is much smaller than for HMM decomposition.</Paragraph>
      <Paragraph position="1"> Let R,S, and N represent the noisy-speech, clean-speech, and noise signals. Xcp, Xig, and Xi, are the variables corresponding to signal X in the cepstrum, the logarithm and the linear spectrum; # and ~ are the mean vector and the covariance matrix of the Gaussian variable, respectively; F is the cosine transform matrix; and c is the vector of LPC cepstrum co-The output probabilities of the NOVO HMM are inferred as shown in Fig. 6. Since source HMMs axe defined in the cepstrum domain, and clean speech and noise are additive in the linear spectrum domain, the normal distributions defined in the cepstrum domain are transformed into lognormal distributions in the linear spectrum domain and summed. In the figure, k(SNR) is a weighting fa~ctor that depends on the estimated SNR of the noisy speech. The distributions obtained in the linear spectrum domain are finally converted back into the cepstrum domain. The process shown in the figure has to be repeated for all states and for all mixture components of the noise and clean-speech HMMs.</Paragraph>
    </Section>
    <Section position="2" start_page="388" end_page="389" type="sub_section">
      <SectionTitle>
4.2. Experimental Results
</SectionTitle>
      <Paragraph position="0"> The effectiveness of the HMM composition technique was evaluated by the telephone directory assistance system, using the 56 context-independent phoneme HMMs. The clean-speech and the noisy-speech HMMs had 3 states, each with 4 Gaussiau distributions. The noise model had one state and one Gaussian distribution. Thus the NOVO HMMs had 3 states, each with 4 Gaussian distributions; there was no increase in the decoding time. Two kinds of noise were used for this experiment: computer-room noise (stationary) and</Paragraph>
      <Paragraph position="2"> for telephone directory assistance.</Paragraph>
      <Paragraph position="3"> crowd noise (nonstationary). The same sentences as in Chapter 3 were used for testing.</Paragraph>
      <Paragraph position="4"> The experimental results showed that the NOVO HMMs could be obtained very rapidly and gave similar recognition rates to those of HMMs trained by using a large noise-added speech database. The efficiency and flexibility of the algorithm and its adaptability to new noises and various SNRs make it suitable as a basis for a real-time speech recognizer resistant to noise.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="389" end_page="391" type="metho">
    <SectionTitle>
5. MULTI-MODAL TELEPHONE
DIRECTORY ASSISTANCE SYSTEM
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="389" end_page="389" type="sub_section">
      <SectionTitle>
5.1. System Structure
</SectionTitle>
      <Paragraph position="0"> We designed a multi-modal speech dialog system for telephone directory assistance with three input devices (microphone, keyboard and mouse) and two output devices (speaker and display), based on the above-mentioned continuous speech recognition and NOVO HMM techniques. Figure 7 shows the basic structure of the dialog system \[10\]. Since the interaction time, that is, the recognition speed is crucially important in testing dialog systems, we reduced the number of subscribers to 2,300 in this experimental system.</Paragraph>
      <Paragraph position="1"> The vocabulary size was roughly 4,000 as shown in Table 1.</Paragraph>
      <Paragraph position="2"> The corresponding beam width was also reduced to 200. This recognition system uses context-independent HMMs and does not use merging at the meaning level. Implemented on an HP-9000-735, the recognition currently takes about 20 seconds per sentence.</Paragraph>
      <Paragraph position="3"> Figure 8 shows an output window example in our system.</Paragraph>
      <Paragraph position="4"> The numbers on the left side of the window show the order of candidates. This window displays five potential subscriber candidates. For each candidate, the system displays five slots: city, town, block number, subscriber name, and telephone number. A simple example of how this dialog system is used is as follows:  1. After clicking the speech button, a user states an address and subscriber name.</Paragraph>
      <Paragraph position="5"> 2. The system recognizes the input speech and displays five candidates.</Paragraph>
      <Paragraph position="6"> 3. If one of the candidates is correct, the user obtains the  telephone number by clicking the telephone number slot of the correct candidate.</Paragraph>
    </Section>
    <Section position="2" start_page="389" end_page="389" type="sub_section">
      <SectionTitle>
5.2. Dialog Controller
</SectionTitle>
      <Paragraph position="0"> The main functions of the dialog controller are as follows.</Paragraph>
      <Paragraph position="1"> Display Candidates After speech recognition, four potential candidates are displayed in order of their likelihood scores. The telephone directory assistance database constraint is not usually used in selecting these candidates. However, the fifth candidate is the candidate that satisfies the constraint in the telephone directory assistance database, because there is a high possibility that the candidate that satisfies the constraint is correct, even if it has a low likelihood score.</Paragraph>
      <Paragraph position="2">  system for telephone directory assistance.</Paragraph>
      <Paragraph position="3"> Error Correction If there is no correct candidate among the five candidates, the user corrects the input error by choosing the candidate closest to the correct subscriber address and name, clicking the wrong keyword slot, and uttering a sentence with the specified semantic item. In the error correction mode, the system switches the main grammar to the grammar in which the clicked item must be uttered. For example, if a user clicks the subscriber name slot, the system switches the main grammar to the grammar for utterances that need to include a subscriber name. The user can include some new information in the sentence, in addition to the specified item. The beam width is also increased to raise the recognition accuracy.</Paragraph>
    </Section>
    <Section position="3" start_page="389" end_page="391" type="sub_section">
      <SectionTitle>
5.3. Evaluation
</SectionTitle>
      <Paragraph position="0"> This system was evaluated from the human-machine-interface point of view. We asked 20 researchers in our laboratory to try to use this system. Dialog experiments were performed to evaluate the following issues:  1. System performance (task completion rate, sentence understanding rate, task completion time, etc), 2. User evaluation of the system, 3. Content and manner of user utterances, and 4. Problems encountered with the system.</Paragraph>
      <Paragraph position="1">  Training The users were first requested to practice operating this system by themselves using a tutorial system, which was an interactive system implemented on a workstation. The tutorial system was designed to control and unify the guidance as well as knowledge given to each user. One sequence of the practice, including examples of correct recognition and incorrect recognition, takes roughly 10 minutes, in which users operate the system following instructions displayed on the screen. A typical way of speaking is also displayed and practiced in this stage. Pauses and speaking rates are not controlled.</Paragraph>
      <Paragraph position="2"> Testing 20 sheets of paper indicating the tasks using sketch maps were given to each user. Each task was indicated by the name and location of the person whose telephone number had to be requested on the map. Figure 9 shows an example of a sheet. The amount of information indicated on the sheet varied; for exumple, the first name or the town name of the person was sometimes not given. The users were requested to make inquiries based on the information given in each sheet. We used maps for indicating the tasks to avoid controlling the structure of the spoken sentences. When the user could obtain the desired telephone number, he/she wrote down the number on the answer sheet, and proceeded to the next task. Even if the user could not get the telephone number alter all efforts, he/she was requested to proceed to the next task.  Questionnaires After testing, each user was requested to answer several questions, and the information obtained was compared with various logs recorded during the test.</Paragraph>
      <Paragraph position="3"> Results The results of the experiments gave the task completion rate as 99%, which means that, in most of the tri-Ms, the users could get the correct telephone numbers. The average number of utterances for each task was 1.4, and the average sentence understanding rate was 57.87o. The average rate for the correct recognition result being indicated in the top five candidates was 77.5%. We found that the higher the top five recognition rate was, the lower the average number of utterances became.</Paragraph>
      <Paragraph position="4"> The average time needed to complete each task was 57.2 seconds, and it decreased as the users became mote experienced. About 75% of the users said that they prefered using the computer-based dialog system to a telephone directory. About 55% of the users said that the system was easy to use. The main reason for negative answers to this question was highly related to the feeling that the response time of the system was too slow.</Paragraph>
      <Paragraph position="5"> We have collected a speech database through these experiments for future analysis and experiments.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>