<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1074">
  <Title>Speech-Based Retrieval Using Semantic Co-Occurrence Filtering</Title>
  <Section position="2" start_page="0" end_page="373" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> In applying speech recognition techniques to retrieve information from large unrestricted text corpora, several issues immediately arise. The recognition vocabulary is very large (being the same size as the corpus vocabulary). Each new corpus may cover a different domain, requiring new specialized vocabulary. Furthermore the constraint afforded by domain-dependent language models may be precluded due to the expense involved in constructing them.</Paragraph>
    <Paragraph position="1"> One approach to these problems obviates the need for any word vocabulary to be defined \[1, 2\]. This is done by defining a phonetic inventory based on phonetically stable sub-word units which have corresponding orthographic counterparts.</Paragraph>
    <Paragraph position="2"> This scheme has the potential advantage that both speech and text can be indexed in terms of the same units and thus speech might be used to access text and vice-versa. Sub-word units are considered to be independent and matching is performed using vector-space similarity measures.</Paragraph>
    <Paragraph position="3"> Our concern in this paper is to provide speech access to text and our approach differs from the former in that whole words are used to constrain matching; we believe this to be more effective than splitting words into smaller independent units.</Paragraph>
    <Paragraph position="4"> We use boolean retrieval with proximity constraints rather than vector-space measures. Our approach also accommodates standard phonetic alphabets (we employ a set of 39 phones in contrast to the former technique which uses about 1000 phonetic units).</Paragraph>
    <Paragraph position="5"> To demonstrate the feasibility of our approach we have implemented a prototype. The user speaks each word of a query separately and is presented with the most relevant titles, each accompanied by the relevant word hypotheses. The combination of speech processing and retrieval currently takes about  20-30 seconds. Figure 1 shows all titles produced for the query &amp;quot;Maltese Falcon&amp;quot;.</Paragraph>
    <Paragraph position="6"> The IR system acts as a novel kind of language model. The text corpus is used directly; it is not necessary to pre-compute statistical estimates, only to index the text as appropriate for the retrieval system.</Paragraph>
    <Paragraph position="7">  1. Maltese Falcon, The (maltese falcon) 2. Astor, Mary (maltese falcon) 3. film noir (maltese falcon) 4. Bogart, Humphrey (maltese falcon) 5. Huston, John (maltese falcon) 6. Hammett, Dashiell (maltese falcon) 7. Louis XV, King of France (marquise faction) 8. rum (indies factor) 9. drama (please fashion) Figure I: Presentation of Search Results 2. System Components  The overall architecture of the system is shown in Figure 2. We will first describe the IR and speech systems and then the ancillary components that integrate them.</Paragraph>
    <Section position="1" start_page="373" end_page="373" type="sub_section">
      <SectionTitle>
2.1. Retrieval System
</SectionTitle>
      <Paragraph position="0"> We use the Text Database \[3\] for indexing and for boolean search with proximity constraints. We have experimented with Grolier's encyclopedia \[4\] which is a corpus of modest size (SM words) spanning diverse topics. There are 27,000 articles in the encyclopedia and an uninflected word dictionary for it contains 100,000 entries. We use a stop list 1 containing approximately 100 words. The fact that short common words are included in the stop list is fortuitous for our speech-based retrieval because they are difficult to recognize.</Paragraph>
    </Section>
    <Section position="2" start_page="373" end_page="373" type="sub_section">
      <SectionTitle>
2.2. Phonetic Recognizer
</SectionTitle>
      <Paragraph position="0"> The phonetic recognition component of the system uses standard hidden Markov model (HMM) based speech recognition methods. The system currently operates in a speakerdependent, isolated-word mode as this was the simplest to integrate and known to be more robust operationally. Input  to the system was from a Sennheiser HMD-414 microphone, sampled at a rate of 16KHz. Feature vectors consisted of 14 Mel-scaled cepstra, their derivatives and a log energy derivative. These were computed from 20 msec frames taken at a rate of I00 per second. Training data for each speaker was taken from 1000 words spoken in isolation. Each phonetic model is a three state HMM with Gaussian output distributions having diagonal covariance matrices. The topology of the phonetic models is shown in Figure 3. Continuous training was used to avoid the need for phonetically labelling training data by hand. The models were initialized from speaker independent models trained on the TIMIT speech database \[5\]. For recognition, the models were placed in a network with probabilities reflecting the phonetic bigram statistics of the lexicon. For each spoken word, a hypothesized phone sequence was determined by the maximum likelihood state sequence through the network, computed using the Viterbi algorithm.</Paragraph>
    </Section>
    <Section position="3" start_page="373" end_page="373" type="sub_section">
      <SectionTitle>
2.3. Phonetic Dictionary
</SectionTitle>
      <Paragraph position="0"> To use the IR system with speech we construct a phonetic dictionary which is a table giving a basic phonetic spelling for each entry in the word dictionary. For example the phonetic spelling for the word &amp;quot;president&amp;quot; is the string of phonetic symbols &amp;quot;P R EH Z IH D EH N T'. In our implementation we assodate a single phonetic spelling with each word. More generally, phonological variants, alternative pronunciations  or even translations into other languages can also be placed in the phonetic dictionary. In our arrangement the user needs to speak the uninflected word form that corresponds to the uninflected spelling that is used for indexing. (Again, we can dispense with this by including the phonetic spellings of word inflections.) The question remains as to how we find phonetic spellings for all the entries in the word dictionary. We have sprit this problem into two parts. The first is to obtain a list of words and their phonetic spellings. We have adapted a list containing phonetic spellings for 175,000 words \[6\]. Of the 100,000 word types in the encyclopedia, 43,000 were covered by this list.</Paragraph>
      <Paragraph position="1"> Although this is less than half of the total vocabulary size, it nevertheless does represent the majority of actual word instances in the encyclopedia. To cover the rest of the words we propose the application of techniques for automatically producing phonetic spellings, e.g. \[7, 8\]. Such techniques are prevalent in text-to-speech synthesis.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>