<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1022">
<Title>IMPROVED KEYWORD-SPOTTING USING SRI'S DECIPHER TM LARGE-VOCABULARY SPEECH-RECOGNITION SYSTEM</Title>
<Section position="6" start_page="115" end_page="116" type="evalu">
<SectionTitle> 4. EXPERIMENTS 4.1. ATIS Task </SectionTitle>
<Paragraph position="0"> The ATIS task [13] was chosen for keyword-spotting experiments because (1) the template-based system that interprets queries to the airline database focuses on certain keywords that convey the meaning of the query and ignores many of the other filler words (e.g., "I would like ...", "Can you please ..."), (2) the task uses spontaneous speech, and (3) we have worked extensively on this recognition task over the last two years.</Paragraph>
<Paragraph position="1"> Sixty-six keywords and their variants were selected based on the importance of each word to the SRI template-matcher, which interprets the queries.</Paragraph>
<Paragraph position="2"> SRI applied two different recognition systems to the ATIS keyword-spotting task. The first system was SRI's large-vocabulary speaker-independent speech-recognition system, which we have used for the ATIS speech-recognition task [3]. The vocabulary used in this system is about 1200 words, and a back-off bigram language model was trained using the ATIS MADCOW training data [13]. Many of the words in the vocabulary use word-specific or triphone acoustic models, with biphone and context-independent models used for those words that occur infrequently.</Paragraph>
<Paragraph position="3"> The second system is a more traditional word-spotting system. There are 66 keywords plus 12 variants of those keywords, for a total of 78 keyword models. There is also a background model (see Figure 1) that tries to account for the rest of the observed acoustics, making a total of 79 words in this second system. This second system also uses a back-off bigram grammar, but all non-keywords are replaced with the background word when computing language-model probabilities.</Paragraph>
<Paragraph position="4"> The acoustic models for the keywords and their variants were identical in the two systems. The only difference between the two systems is that the first system uses ~1100 additional words for the background model, while the second system uses one background model with 60 context-independent phones. The resulting FOM and ROC curves for the two systems are shown in Figure 2.</Paragraph>
<Paragraph position="5"> Figure 2. Probability of detection versus false-alarm rate for the above two CSR systems on the ATIS Task.</Paragraph>
<Paragraph position="6"> There are two possible explanations for the experimental results in Figure 2 and Table 2. The first is that the ATIS recognizer has a much larger vocabulary, and this larger vocabulary is potentially better able to match the non-keyword acoustics than the simple background model. The second is that, for the larger-vocabulary ATIS system, the back-off bigram grammar can provide more interword constraints to eliminate false alarms than can the back-off bigram grammar that maps all non-keywords to the filler model. Additional experiments are planned to determine the extent of these effects.</Paragraph>
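<Paragraph position="7"> To make the second system's grammar concrete, the sketch below (our illustration in Python, not SRI's code; the keyword set and the back-off weight alpha are assumptions for the example) collapses training text onto keyword-plus-background tokens before estimating a back-off bigram:

    import math
    from collections import Counter

    # Toy stand-ins: the real system had 78 keyword models; these
    # names are illustrative, not taken from the paper.
    KEYWORDS = {"flight", "fare", "boston"}
    BACKGROUND = "BACKGROUND"

    def map_token(word):
        # Collapse any non-keyword onto the single background word.
        return word if word in KEYWORDS else BACKGROUND

    def train_bigram(sentences):
        # Count unigrams and bigrams over the keyword/background stream.
        unigrams, bigrams = Counter(), Counter()
        for sentence in sentences:
            tokens = [map_token(w) for w in sentence]
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
        return unigrams, bigrams

    def bigram_logprob(unigrams, bigrams, prev, word, alpha=0.4):
        # Back-off estimate: use the observed bigram when available,
        # otherwise fall back to a scaled, add-one-smoothed unigram.
        prev, word = map_token(prev), map_token(word)
        if bigrams[(prev, word)] > 0:
            return math.log(bigrams[(prev, word)] / unigrams[prev])
        total = sum(unigrams.values())
        return math.log(alpha * (unigrams[word] + 1) / (total + 1))

The first system differs only in that its ~1200-word vocabulary keeps the non-keywords distinct instead of collapsing them onto one filler.</Paragraph>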
<Section position="1" start_page="115" end_page="116" type="sub_section">
<SectionTitle> 4.2. Credit Card Task </SectionTitle>
<Paragraph position="0"> The Credit Card task is to spot 20 keywords and their 58 variants on a subset of the Switchboard database. The keywords were selected to be content words relevant to the credit-card topic that occur frequently enough for training and testing.</Paragraph>
<Paragraph position="1"> Acoustic models were trained on a subset of 11,290 hand-transcribed utterances from the Switchboard database. A back-off bigram language model was trained as described in Section 2.3, using the text transcriptions from 1123 non-credit-card conversations and 35 credit-card conversations. The 5,000 most common words in the non-credit-card conversations were combined with the words in the credit-card conversations, the keywords, and their variants to bring the recognition vocabulary to 6914 words (including the background word model).</Paragraph>
<Paragraph position="2"> The resulting CSR system was tested on 10 credit-card conversations from the Switchboard database. Each conversation consisted of two stereo recordings (each talker was recorded separately) and was approximately 5 minutes long. Each of the two channels is processed independently. The resulting ROC curve is shown in Figure 3. The ROC curve levels out at 66% because the CSR system hypothesized 431 keywords out of a total of 498 true keyword locations. Our current CSR approach, which uses the Viterbi backtrace, does not allow us to increase the keyword false-alarm rate.</Paragraph>
<Paragraph position="3"> Figure 3. Probability of detection versus false-alarm rate for the 6914-word CSR system on the Credit Card Task.</Paragraph>
<Paragraph position="4"> The effect of using different scoring formulas is shown in Table 3. If only the duration-normalized acoustic log-likelihoods are used, an average probability of detection (FOM) of 54% is achieved. When the grammar transition log-probability into the keyword is added to the score (Eqn. 2), the FOM increases to 59.9%. In addition, if a constant is added to the score before normalization, the FOM increases in both cases; this has the effect of reducing the false-alarm rate for shorter-duration keyword hypotheses. We have not yet experimented with the grammar transition leaving the keyword, nor with any weighting of grammar scores relative to acoustic scores.</Paragraph>
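<Paragraph position="5"> The scoring just described can be summarized in a few lines. The sketch below is our reading of the text (Eqn. 2 itself is not reproduced in this excerpt, and the bias value is a hypothetical tuning constant): the acoustic log-likelihood and the grammar entry log-probability are summed, a constant is added, and the result is normalized by the keyword's duration:

    def keyword_score(acoustic_loglik, grammar_entry_logprob, n_frames, bias=0.0):
        # Duration-normalized keyword score: acoustic log-likelihood
        # plus the grammar transition log-probability into the keyword,
        # with an optional constant added before normalizing.  The text
        # reports only that such a constant raised the FOM by reducing
        # short-duration false alarms; its sign and size are tuning choices.
        return (acoustic_loglik + grammar_entry_logprob + bias) / n_frames

    # Hypothetical usage: rank putative hits before computing the ROC/FOM.
    hyps = [("credit", -2301.5, -4.2, 45), ("card", -1210.0, -3.1, 22)]
    ranked = sorted((keyword_score(a, g, n, bias=-20.0) for _, a, g, n in hyps),
                    reverse=True)

Because the constant is divided by the duration, its per-frame effect is largest on short hypotheses, which is consistent with the reduction in short-duration false alarms noted above.</Paragraph>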
<Paragraph position="6"> We then varied the recognition vocabulary size and determined its effect on keyword-spotting performance. These experiments show that varying the vocabulary size from medium- to large-vocabulary recognition systems (700 to 7000 words) does not affect the FOM.</Paragraph>
<Paragraph position="7"> Finally, we experimented with including or excluding the background word model in the CSR lexicon. While including the background word model does increase the overall likelihood of the recognized transcription, the background model is chosen very often (because of the language-model probabilities assigned to OOV words) and tended to replace a number of keywords that had poor acoustic matches. Table 5 shows that a slight improvement can be gained by eliminating this background word model.</Paragraph>
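<Paragraph position="8"> All of the results above are reported as FOM values. The paper does not define the FOM in this excerpt, so the following is a hedged sketch of the conventional keyword-spotting figure of merit (detection rate averaged over operating points up to an assumed 10 false alarms per hour; this toy version pools hypotheses over all keywords rather than averaging per keyword):

    def figure_of_merit(scored_hyps, n_true, hours, max_fa_per_hour=10):
        # scored_hyps: (score, is_hit) pairs for all putative keyword hits.
        # n_true: number of true keyword occurrences in the test set.
        hyps = sorted(scored_hyps, key=lambda p: p[0], reverse=True)
        max_fa = max(1, int(max_fa_per_hour * hours))
        detections, det_at_fa = 0, []
        for score, is_hit in hyps:
            if is_hit:
                detections += 1
            else:
                det_at_fa.append(detections / n_true)
                if len(det_at_fa) == max_fa:
                    break
        # If the system produced fewer than max_fa false alarms (as with
        # the Viterbi-backtrace limit noted above), hold the curve flat.
        while max_fa > len(det_at_fa):
            det_at_fa.append(detections / n_true)
        return sum(det_at_fa) / max_fa
</Paragraph>
</Section>
</Section>
</Paper>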