<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2903">
  <Title>Audio Hot Spotting and Retrieval Using Multiple Features</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. Automatic Spoken Keyword
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Indexing
</SectionTitle>
      <Paragraph position="0"> As automatic speech recognition is imperfect, automatic speech transcripts contain errors. Our indexing algorithm focuses on finding words that are information rich (i.e. content words) and machine recognizable. Our approach is based on the principle that short duration and weakly stressed words are much more likely to be misrecognized, and are less likely to be important.</Paragraph>
      <Paragraph position="1"> To eliminate words that are information poor and prone to mis-recognition, our algorithm examines the speech recognizer output and creates an index list of content words. The index-generation algorithm takes the following factors into consideration: a) absolute word length by its utterance duration, b) the number of syllables, c) the recognizer's own confidence score, and d) the part of speech (i.e. verb, noun) using a POS tagger with some heuristic rules. Experiments we have conducted using broadcast news data, with Gaussian white noise added to achieve a desired Signal-to-Noise Ratio (SNR), indicate that the index list produced typically covers about 10% of the total words in the ASR output, while more than 90% of the indexed words are actually spoken and correctly recognized given a Word Error Rate (WER [Fiscus, et al.]) of 30%. The following table illustrates the performance of the automatic indexer as a function of Signal-to- null where Index Coverage is the fraction of the words in the transcript chosen as index words and IWER is the index word error rate.</Paragraph>
      <Paragraph position="2"> As expected, increases in WER result in fewer words meeting the criteria for the index list.</Paragraph>
      <Paragraph position="3"> However, the indexer algorithm manages to find reliable words even in the presence of very noisy data. At 12dB SNR, while the recognizer WER has jumped up to 54.7%, the Index Word Error Rate (IWER) has risen to 12.2%. Note that an index-word error indicates that an index word chosen from the ASR output transcript did not in fact occur in the original reference transcription. Whether this index list is valuable will depend on the application. If a user wants to get a feel for a 1-hour conversation in just a few seconds, automatically generated topic terms such as those described in [Kubala et al., 2000] or an index list such as this could be quite valuable.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. Detecting and Using Multiple Features from the Audio
</SectionTitle>
    <Paragraph position="0"> Features from the Audio Automatic speech recognition has been used extensively in spoken document retrieval [Garofolo et al., 2000; Rendals et al., 2000]. However, high speech WER in the speech transcript, especially in less-trained domains such as spontaneous and non-broadcast quality data, greatly reduces the effectiveness of navigation and retrieval using the speech transcripts alone. Furthermore, the retrieval of a whole document or a story still requires the user to read the whole document or listen to the entire audio file in order to locate the segments where relevant information resides. In our approach, we recognize that there is more information in the audio file than just the words and that other attributes such as speaker identification, prosodic features, and the type of background noise may also be helpful for the retrieval of information. In addition, we aim to retrieve the exact segments of interest rather than the whole audio or document so that the user can zero in on these specific segments rapidly. One of the challenges facing researchers is the need to identify &amp;quot;which&amp;quot; non-lexical features have information value. Since these features have not been available to users in the past, they don't know enough to ask for them. We have chosen to implement a variety of non-lexical cues with the intent of stimulating feedback from our user community.</Paragraph>
    <Paragraph position="1"> As an example of this, by extending a research speaker identification algorithm [Reynolds, 1995], we integrated speaker identification into the Audio Hot Spotting prototype to allow a user to retrieve three kinds of information. First, if the user cannot find what he/she is looking for using keyword search but knows who spoke, the user can retrieve content defined by the beginning and ending timestamps associated with the specified speaker; assuming enough speech exists to build a model for that speaker. Secondly, the system automatically generates speaker participation statistics indicating how many turns each speaker spoke and the total duration of each speaker's audio.</Paragraph>
    <Paragraph position="2"> Finally, the system uses speaker identification to refine the query result by allowing the user to query keywords and speaker together. For example, using the Audio Hot Spotting prototype, the user can find the audio segment in which President Bush spoke the word &amp;quot;anthrax&amp;quot;. In addition to speaker identification, we wanted to illustrate the information value of other non-lexical sounds in the audio track. As a proofof-concept, we created detectors for crowd applause and laughter. The algorithms used both spectral information as well as the estimated probability density function (pdf) of the raw audio samples to determine when one of these situations was present. Laughter has a spectral envelope which is similar to a vowel, but since many people are voicing at the same time, the audio has no coherence. Applause, on the other hand, is spectrally speaking, much like noisy speech phones such as &amp;quot;sh&amp;quot; or &amp;quot;th.&amp;quot; However, we determined that the pdf of applause differed from those individual sounds in the number of high amplitude outlier samples present. Applying this algorithm to the 2003 State of the Union address, we identified all instances of applause with only a 2.6% false alarm rate (results were compared with hand-labeled data). One can imagine a situation where a user would choose this non-lexical cue to identify statements that generated a positive response.</Paragraph>
    <Paragraph position="3"> Last year, we began to look at speech rate as a separate feature. Speech rate estimation is important, both as an indicator of emotion and stress, as well as an aid to the speech recognizer itself (see for example [Mirghafori et al., 1996; Morgan, 1998; Zheng et al., 2000]). Currently, recognizer word error rates are highly correlated to speech rate. For the user, marking that a returned passage is from an abnormal speech rate segment and therefore more likely to contain errors allows him/her to save time by ignoring these passages or reading them with discretion if desired. However, if passages of high stress are of interest, these are just the passages to be reviewed. For the recognizer, awareness of speech rate allows modification of HMM state probabilities, and even permits different sequences of phones.</Paragraph>
    <Paragraph position="4"> One approach to determine the speech rate accurately is to examine the phone-level output of the speech recognizer. Even though the phone-level error rate is quite high, the timing information is still valuable for rate estimation. By comparing the phone lengths of the recognizer output to phone lengths tabulated over many speakers, we have found that a rough estimate of speech rate is possible [Mirgafori et al. 1996].</Paragraph>
    <Paragraph position="5"> Initial experiments using MITRE Corporate event data have shown a rough correspondence between human perception of speed and the algorithm output. One outstanding issue is how to treat audio that includes both fast rate speech and significant silences between utterances. Is this truly fast speech? We are currently conducting research to detect other prosodic features by estimating vocal effort. These features may indicate when a speaker is shouting suggesting elevated emotions or near a whisper. Queries based on such features can lead to the identification of very interesting audio hot spots for the end user. Initial experiments are examining the spectral properties of detected glottal pulses obtained during voiced speech.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. Query Expansion and Retrieval Tradeoffs
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Effect of Passage Length
</SectionTitle>
      <Paragraph position="0"> TREC SDR found both a linear correlation between speech word error rate and retrieval rate [Garofolo et al., 2000] and that retrieval was fairly robust to WER. However, the robustness was attributed to the fact that misrecognized words are likely to also be properly recognized in the same document if the document is long enough. Since we limit our returned passages to roughly 10 seconds, we do not benefit from this full-document phenomenon. The relationship between passage retrieval rate and passage length was studied by searching 500 hours of broadcast news from the TREC SDR corpus. Using 679 keywords, each with an error rate across the corpus of at least 30%, we found that passage retrieval rate was 71.7% when the passage was limited to only the query keyword. It increased to 76.2% when the passage length was increased to 10sec and rose to 83.8% if the returned passage was allowed to be as long as 120sec.</Paragraph>
      <Paragraph position="1"> In our Audio Hot Spotting prototype, we experimented with semantic, morphological, and phonetic query expansion to achieve two purposes, 1) to improve the retrieval rate of related passages when exact word match fails, and 2) to allow cross lingual query and retrieval.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Keyword Query Expansion
</SectionTitle>
      <Paragraph position="0"> The Audio Hot Spotting prototype integrated the Oracle 9i Text engine to expand the query semantically, morphologically and phonetically. For morphological expansion, we activated the stemming function. For semantic expansion, we utilized expansion to include hyponyms, hypernyms, synonyms, and semantically related terms. For example, when the user queried for &amp;quot;oppose&amp;quot;, the exact match yielded no returns, but when semantic and morphological expansion options are selected, the query was expanded to include anti, antigovernment, against, opposed, opposition, and returned several passages containing these expanded terms.</Paragraph>
      <Paragraph position="1"> To address the noisy nature of speech transcripts, we used the phonetic expansion, i.e. &amp;quot;sound alike&amp;quot; feature from the Oracle database system. This is helpful especially for proper names. For example, if the proper name Nesbit is not in the speech recognizer vocabulary, the word will not be correctly transcribed. In fact, it was transcribed as Nesbitt (with two 't's). By phonetic expansion, Nesbit is retrieved. We are aware of the limitations of Oracle's phonetic expansion algorithms, which are simply based on spelling.</Paragraph>
      <Paragraph position="2"> This doesn't work well when text is a mistranscription of the actual speech. Hypothetically, a phoneme-based recognition engine may be a better candidate for phonetic query expansion.</Paragraph>
      <Paragraph position="3"> We are currently evaluating a phoneme-based audio retrieval system and comparing its performance with a word-based speech recognition system. The comparison will help us to determine the strengths and weaknesses of each system so that we can leverage the strength of each system to improve audio retrieval performance.</Paragraph>
      <Paragraph position="4"> Obviously more is not always better.</Paragraph>
      <Paragraph position="5"> Some of the expanded queries are not exactly what the users are looking for, and the number of passages returned increases. In our Audio Hot Spotting implementation we made query expansion an option allowing the user to choose to expand semantically and/or, morphologically, or phonetically.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Cross-lingual Query Expansion
</SectionTitle>
      <Paragraph position="0"> In some applications it is helpful for a user to be able to query in a single language and retrieve passages of interest from documents in several languages. We treated translingual search as another form of query expansion. We created a bilingual thesaurus by augmenting Oracle's default English thesaurus with Spanish dictionary terms. With this type of query expansion enabled, the system retrieves passages that contain the keyword in either English or Spanish.</Paragraph>
      <Paragraph position="1"> A straightforward extension of this approach will allow other languages to be supported.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6. Future Directions
</SectionTitle>
    <Paragraph position="0"> As our research and prototype evolve, we plan to develop algorithms to detect more meaningful prosodic and audio features to allow the users to search for and retrieve them. We are also developing algorithms that can generate speaker identify in the absence of speaker training data. For example, given an audio script, we expect the algorithms to automatically identify the number of different speakers present and the time speaker X changes to Y. For semantic query expansion, we are considering using more comprehensive thesauri and local context analysis to locate relevant segments to compensate for high ASR word error rate. We are also considering combining a word-based speech recognition system with a phoneme-based system to improve the retrieval performance especially for out of vocabulary words and multi-word queries.</Paragraph>
  </Section>
class="xml-element"></Paper>