<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1079"> <Title>Large Vocabulary Recognition of Wall Street Journal Sentences at Dragon Systems</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> This paper presents &quot;dry run&quot; results of work done at Dragon Systems on the Wall Street Journal (WSJ) benchmark task. After we give a brief description of our continuous speech recognition system, we describe the two different kinds of acoustic models that were used and explain how they were trained. *This work was sponsored by the Defense Advanced Research Projects Agency and was monitored by the Space and Naval Warfare Systems Command under contract N00039-86-C-0307.</Paragraph> <Paragraph position="1"> Then we present and discuss the results obtained so far and review our plans for further research.</Paragraph> <Paragraph position="2"> In our system a set of output distributions, known as the set of PELs (phonetic elements), is associated with each phoneme. The HMM for a PIC (phoneme-in-context) is represented as a linear sequence of states, each having an output distribution chosen from the set of PELs for the given phoneme, and a (double exponential) duration distribution. The model for a particular hypothesis is constructed by concatenating the necessary sequence of PICs, based on the specified pronunciation (sequence of phonemes) for each of the component words. Thus our system models both word-internal and cross-word co-articulation. When a model for a needed PIC does not exist, a &quot;backoff&quot; strategy is used, whereby the model for a different but related PIC is used instead.</Paragraph> <Paragraph position="3"> The two methods compared in this paper constitute different strategies for representing and training the output distributions used for the nodes in the PIC models. 
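The PIC-concatenation and backoff scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not Dragon Systems' implementation: the model inventory `pic_models`, the triple key format, and the particular backoff order (drop left context, then right, then both) are all assumptions made for the example.

```python
def lookup_pic(pic_models, left, phoneme, right):
    """Return a model for the PIC (phoneme-in-context), backing off
    to a related, less specific context when the exact PIC has no
    trained model. The backoff order here is an assumption."""
    for key in (
        (left, phoneme, right),   # exact left/right context
        (None, phoneme, right),   # drop left context
        (left, phoneme, None),    # drop right context
        (None, phoneme, None),    # context-independent phoneme
    ):
        if key in pic_models:
            return pic_models[key]
    raise KeyError(f"no model for phoneme {phoneme!r}")

def hypothesis_model(pic_models, phonemes):
    """Build the model for a hypothesis by concatenating the PIC
    models (each a linear sequence of states) along the specified
    pronunciation, i.e. the sequence of phonemes."""
    states = []
    padded = [None] + list(phonemes) + [None]  # word-boundary contexts
    for left, ph, right in zip(padded, padded[1:], padded[2:]):
        states.extend(lookup_pic(pic_models, left, ph, right))
    return states
```

In a full recognizer each state would carry a PEL output distribution and a double exponential duration distribution; here the states are just labels, to keep the concatenation and backoff logic visible.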
The first method involves generating a set of (unimodal) PELs for a given speaker by clustering the hypothetical frames found in the spectral models for that speaker, a step we call &quot;rePELing&quot;, and then constructing speaker-dependent PEL sequences to represent each PIC as an HMM, which we call &quot;respelling&quot;. The spectral model for a PIC can be thought of as the expected value of the sequence of frames that would be generated by the PIC, normalized to an average length. The second method, a univariate version of tied mixtures, represents the probability distribution for each parameter in a PEL as a mixture of a fixed set of unimodal components, the mixing weights being estimated using the EM algorithm [9]. In both the rePELing/respelling and the tied mixture models, we assume that the parameters are statistically independent. A more detailed explanation of these two methods can be found in sections 3 and 4.</Paragraph> </Section> </Paper>