<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1022">
  <Title>IMPROVED KEYWORD-SPOTTING USING SRI'S DECIPHER TM LARGE-VOCABULARY SPEECH-RECOGNITION SYSTEM</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
IMPROVED KEYWORD-SPOTTING USING SRI'S DECIPHER TM LARGE-VOCABULARY
SPEECH-RECOGNITION SYSTEM
Mitchel Weintraub
SRI International
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> The word-spotting task is analogous to text-based information retrieval and message-understanding tasks in that an exhaustive accounting of the input is not required: only a useful subset of the full information need be extracted. Traditional approaches have focused on the keywords involved. We have shown that accounting for more of the data, by using a large-vocabulary recognizer for the word-spotting task, can lead to dramatic improvements relative to traditional approaches.</Paragraph>
    <Paragraph position="1"> This result may well be generalizable to the analogous text-based tasks.</Paragraph>
    <Paragraph position="2"> The approach described makes several novel contributions, including: (1) a method for dramatic improvement in the FOM (figure of merit) for word-spotting results compared to more traditional approaches; (2) a demonstration of the benefit of language modeling in keyword-spotting systems; and (3) a method that provides rapid porting to new keyword vocabularies.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Although continuous speech recognition (CSR) and keyword-spotting tasks use very similar underlying technology, there are typically significant differences in the way the technology is developed and used for the two applications (e.g., acoustic model training, model topology and language modeling, filler models, search, and scoring).</Paragraph>
    <Paragraph position="1"> A number of HMM-based systems have previously been developed for keyword spotting \[1-5\]. One of the most significant differences between these keyword-spotting systems and a CSR system is the type of non-keyword model that is used. It is generally thought that very simple non-keyword models (such as a single 10-state model \[2\], or the set of monophone models \[1\]) can perform as well as more complicated non-keyword models that include words or triphones.</Paragraph>
    <Paragraph position="2"> We describe how we have applied CSR techniques to the keyword-spotting task by using a speech recognition system to generate a transcription of the incoming spontaneous speech, which is then searched for the keywords. For this task we have used SRI's DECIPHER TM system, a state-of-the-art large-vocabulary speaker-independent continuous-speech recognition system \[6-10\]. The method is evaluated on two domains: (1) the Air Travel Information System (ATIS) domain \[13\], and (2) the &quot;credit card topic&quot; subset of the Switchboard Corpus \[11\], a telephone speech corpus consisting of spontaneous conversations on a number of different topics.</Paragraph>
    <Paragraph position="3"> In the ATIS domain, for 78 keywords in a vocabulary of 1,200, we show that the CSR approach significantly outperforms the traditional word-spotting approach at all false alarm rates per hour per word: the figure of merit (FOM) for the CSR recognizer is 75.9, compared to only 48.8 for the spotting recognizer. In the Credit Card task, the spotting of 20 keywords and their 58 variants on a subset of the Switchboard corpus, the system's detection rate levels off at 66% and cannot be raised further by allowing a higher false alarm rate. Additional experiments show that varying the vocabulary size from medium- to large-vocabulary recognition (700 to 7,000 words) does not affect the FOM performance.</Paragraph>
    <Paragraph position="4"> A set of experiments compares two topologies: (1) a topology with a fixed vocabulary consisting of the keywords and the N most common words in the task (N varies from zero to the vocabulary size), forcing the recognition hypothesis to choose among the allowable words (traditional CSR), and (2) a second topology in which a background word model is added to the word list, thereby allowing the recognition system to transcribe parts of the incoming speech signal as background. While including the background word model does increase the overall likelihood of the recognized transcription, the background model is used very often (because the language model assigns it the probability mass of all out-of-vocabulary words) and tends to replace keywords that have poor acoustic matches.</Paragraph>
    <Paragraph position="5"> Finally, we introduce an algorithm for smoothing language model probabilities. This algorithm combines a small amount of task-specific language model training data with a large amount of task-independent training data, and provides a 14% reduction in test set perplexity.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="114" type="metho">
    <SectionTitle>
2. TRAINING
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="114" type="sub_section">
      <SectionTitle>
2.1. Acoustic Modeling
</SectionTitle>
      <Paragraph position="0"> DECIPHER TM uses a hierarchy of phonetic context-dependent models, including word-specific, triphone, generalized-triphone, biphone, generalized-biphone, and context-independent models. Six spectral features are used to model the speech signal: the cepstral vector (C1-CN) and its first and second derivatives, and cepstral energy (C0) and its first and second derivatives. These features are computed from an FFT filterbank and subsequent high-pass RASTA filtering of the filterbank log energies, and are modeled either with VQ and scalar codebooks or with tied-mixture Gaussian models. The acoustic models used for the Switchboard task use no cross-word acoustic constraints.</Paragraph>
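      <Paragraph> As an illustrative sketch (not SRI's implementation), the derivative features described above can be computed from a cepstral matrix with the standard regression over a few neighboring frames; the window width of 2 frames is an assumption here:

```python
import numpy as np

def add_deltas(cep, width=2):
    """Append first and second time derivatives to a (frames x dims)
    cepstral matrix, using a regression over a window of width frames
    on each side. Returns a (frames x 3*dims) feature matrix."""
    def delta(x):
        pad = np.pad(x, ((width, width), (0, 0)), mode="edge")
        den = 2 * sum(k * k for k in range(1, width + 1))
        num = sum(k * (pad[width + k: width + k + len(x)] -
                       pad[width - k: width - k + len(x)])
                  for k in range(1, width + 1))
        return num / den
    d1 = delta(cep)   # first derivative (delta cepstra)
    d2 = delta(d1)    # second derivative (delta-delta cepstra)
    return np.hstack([cep, d1, d2])
```

In a full front end the same operation would be applied both to the cepstral vector (C1-CN) and to the cepstral energy (C0) stream.
</Paragraph>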
    </Section>
    <Section position="2" start_page="114" end_page="114" type="sub_section">
      <SectionTitle>
2.2. Language Modeling
</SectionTitle>
      <Paragraph position="0"> The DECIPHER TM system uses a probabilistic finite state grammar (PFSG) to constrain allowable word sequences. In the ATIS, WSJ, and Credit Card tasks, we use a word-based bigram grammar, with the language model probabilities estimated using Katz's back-off bigram algorithm \[12\]. All words in the language model training data that are not in the specified vocabulary are mapped to the background word model. The background word model is treated like all the other words in the recognizer, with bigram language model probabilities on the grammar transitions between words.</Paragraph>
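      <Paragraph> A minimal sketch of the out-of-vocabulary mapping described above (the token name @bg is a hypothetical stand-in for the background word):

```python
from collections import Counter

BACKGROUND = "@bg"  # hypothetical name for the background word token

def bigram_counts_with_background(sentences, vocab):
    """Map every out-of-vocabulary word to the background token, then
    collect the bigram counts consumed by the back-off estimator."""
    counts = Counter()
    for sentence in sentences:
        mapped = [w if w in vocab else BACKGROUND for w in sentence]
        for w1, w2 in zip(mapped, mapped[1:]):
            counts[(w1, w2)] += 1
    return counts
```

Because the background token is counted like any other word, it receives ordinary bigram transition probabilities in the grammar.
</Paragraph>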
      <Paragraph position="1"> Two topologies are used for the experiments described in this paper. One topology uses a fixed vocabulary with the keywords and the N most common words in the task (N varies from zero to the vocabulary size), forcing the recognition hypothesis to choose among the allowable words. A second topology adds the background word model to the above word list, thereby allowing the recognition system to transcribe parts of the incoming speech signal as background. A sample background word with 60 context-independent phones is shown below in Figure 1.</Paragraph>
      <Paragraph position="2">  The minimum duration is 2 phones and the self loop allows for an infinite duration.</Paragraph>
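      <Paragraph> As a sketch of the topology in Figure 1 (an assumed arc-list representation, not DECIPHER's internal format), a phone loop with a two-phone minimum duration can be enumerated as:

```python
def background_topology(phones):
    """Arc list for a phone-loop background word: every path passes
    through at least two phones (the minimum duration), and the second
    layer loops back on itself, allowing an unbounded duration."""
    arcs = [("entry", ("first", p)) for p in phones]
    for p in phones:
        for q in phones:
            arcs.append((("first", p), ("loop", q)))  # reach the minimum
            arcs.append((("loop", p), ("loop", q)))   # self loop
        arcs.append((("loop", p), "exit"))
    return arcs
```
</Paragraph>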
      <Paragraph position="3"> 2.3. Task-Specific Language Model Estimation The Switchboard Corpus \[11\] is a telephone database consisting of spontaneous conversations on a number of different topics. The Credit Card task is to spot 20 keywords and their variants, where both the keywords and the test set focus on the subset of Switchboard conversations pertaining to credit cards. To estimate the language model for this task, we could (1) use a small amount of task-specific training data that focuses only on the credit card topic, (2) use a large amount of task-independent training data, or (3) combine the task-specific training data with the task-independent training data.</Paragraph>
      <Paragraph position="4"> To combine a small amount of task-specific (TS) training data with a very large amount of task-independent (TI) training data, we modified the Katz back-off bigram estimation algorithm \[12\]. A weight was added to reduce the effective size of the task-independent training database, as shown in Equation 1: C(w2, w1) = C_TS(w2, w1) + γ * C_TI(w2, w1), where C(w2, w1) is the count of the number of occurrences of word w1 followed by w2, C_TS(w2, w1) is the count from the task-specific database, and C_TI(w2, w1) is the count from the task-independent database. The weight γ reduces the effective size of the task-independent database so that its counts do not overwhelm the counts from the task-specific database. Table 1 shows both the training set and test set perplexity for the Credit Card task as a function of γ. The task-specific training consisted of 18 credit card conversations (59K words), while the task-independent training consisted of 1,123 general conversations (17M words).</Paragraph>
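      <Paragraph> Equation 1 amounts to a weighted merge of two bigram count tables; a minimal sketch:

```python
def combine_counts(c_ts, c_ti, gamma):
    """Equation 1: C(w2, w1) = C_TS(w2, w1) + gamma * C_TI(w2, w1).
    c_ts and c_ti map (w1, w2) bigram pairs to raw counts; gamma
    shrinks the effective size of the task-independent data."""
    pairs = set(c_ts) | set(c_ti)
    return {p: c_ts.get(p, 0) + gamma * c_ti.get(p, 0) for p in pairs}
```

The combined counts are then fed to the otherwise unmodified Katz back-off estimator.
</Paragraph>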
    </Section>
  </Section>
  <Section position="5" start_page="114" end_page="115" type="metho">
    <SectionTitle>
3. SEARCH
</SectionTitle>
    <Paragraph position="0"> The DECIPHER TM system uses a time-synchronous beam search. A partial Viterbi backtrace \[6\] is used to locate the most likely Viterbi path in a continuous running utterance. The Viterbi backtrace contains language model information (grammar transition probabilities into and out of the keyword), acoustic log-likelihood probabilities for the keyword, and the duration of the keyword hypothesis.</Paragraph>
    <Paragraph position="1"> A duration-normalized likelihood score for each keyword is computed using Equation 2: KeyScore = (AP + GP + Constant) / Duration, where AP is the acoustic log-likelihood score for the keyword, GP is the log probability of the grammar transition into the keyword, and Constant is a constant added to the score to penalize keyword hypotheses that have a short duration. None of the earlier HMM keyword systems used a bigram language model in either the decoding or the scoring. Many previous systems did use weights on the keywords to adjust the operating location on the ROC curve.</Paragraph>
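    <Paragraph> Equation 2 is a one-line computation; in this sketch the default value of Constant is illustrative, since the paper does not report the value used:

```python
def key_score(ap, gp, duration, constant=-20.0):
    """Equation 2: duration-normalized keyword likelihood. ap is the
    acoustic log-likelihood score, gp the log probability of the grammar
    transition into the keyword, and constant penalizes short hypotheses."""
    return (ap + gp + constant) / duration
```
</Paragraph>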
    <Paragraph position="2"> A hypothesized keyword is scored as correct if its midpoint falls within the endpoints of the correct keyword. The keyword scores are used to sort the occurrences of each keyword for computing the probability of detection at different false-alarm levels. The overall figure of merit is computed as the average detection rate over all words and over all false alarm rates up to ten false alarms per word per hour.</Paragraph>
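    <Paragraph> A sketch of the figure-of-merit computation for a single keyword, under the assumption that putative hits arrive as (score, is_hit) pairs after midpoint scoring; the full FOM averages this over all keywords:

```python
def figure_of_merit(putative, n_true, hours):
    """Average detection rate over the first 1..10 false alarms per hour
    for one keyword. putative: (score, is_hit) pairs for every putative
    occurrence; n_true: number of reference occurrences."""
    ordered = sorted(putative, key=lambda h: h[0], reverse=True)
    max_fa = max(1, round(10 * hours))  # ten false alarms per hour
    rates, n_hit, n_fa = [], 0, 0
    for _, is_hit in ordered:
        if is_hit:
            n_hit += 1
        else:
            n_fa += 1
            if n_fa == max_fa + 1:  # past the 10 FA/hour limit
                break
            rates.append(n_hit / n_true)
    return sum(rates) / max(1, len(rates))
```
</Paragraph>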
  </Section>
</Paper>