<?xml version="1.0" standalone="yes"?> <Paper uid="H90-1049"> <Title>Word Recognition Using Dynamic Programming Neural Networks&quot;, by Sakoe, Isotani, and Yoshida (Readings in Speech Recognition, edited by Alex Waibel &amp; Kai-Fu Lee). Other work includes &quot;Merging Multilayer Perceptrons and Hidden Markov Models: Some Experiments in Continuous Speech</Title> <Section position="8" start_page="245" end_page="246" type="evalu"> <SectionTitle> RESULTS </SectionTitle> <Paragraph position="0"> These ideas have been adapted by Parfitt at Apple for application to Sphinx. Parfitt modified the Sphinx recognition system to generate input to a three-layer perceptron, a type of ANN, as shown in Figures 1 and 2. The following describes his implementation: the Sphinx system is initially trained in the traditional manner using the forward/backward algorithm.</Paragraph> <Paragraph position="1"> During training and recognition, however, a modified Viterbi/beam search is used. A record with backpointers is maintained for all nodes visited during the search. When the last speech interval is processed, the best final node identifies the optimum path through the utterance. The mapping of the speech data onto the optimum path is used to establish word boundaries and to derive a set of time-aligned parameters to pass to the ANN.</Paragraph> <Paragraph position="2"> A separate ANN is used for each word in the vocabulary. Each word in the vocabulary is represented by one or more triphone models.</Paragraph> <Paragraph position="3"> Although each triphone has seven states, only three are unique. Because the HMM models contain skip arcs, the speech can skip one, two, or all three of these states. The model also contains self-loop arcs for each state, so the speech may match a given state an arbitrary number of times. When several speech samples match a given state, the middle sample is used to supply input to the ANN. 
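As an illustration of the backpointer record described above, a minimal sketch of recovering the optimum path from the best final node (the node labels and record layout are hypothetical, not Parfitt's actual data structures):

```python
def trace_optimum_path(backpointers, best_final_node):
    """Walk backpointer records from the best final node to the start node.

    backpointers maps each node visited during the Viterbi/beam search to
    the node it was reached from (None marks the start).  The recovered
    path is the state alignment used to establish word boundaries.
    """
    path = []
    node = best_final_node
    while node is not None:
        path.append(node)
        node = backpointers[node]
    path.reverse()  # we walked backwards, so restore time order
    return path
```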
When an even number of samples match, the left middle sample is used. When no speech samples match a given state, zero is used as input to the ANN.</Paragraph> <Paragraph position="4"> The ANN uses different input parameters than the HMM triphones in SPHINX. The SPHINX recognizer works on three VQ symbols per window. The windows are twenty milliseconds wide and are advanced in ten-millisecond increments. The VQs are derived from three sets of parameters: twelve Cepstral Coefficients, twelve Delta Cepstral Coefficients, and Power/Delta-Power. The ANNs, however, do not receive VQs. Instead, they receive the same three sets of parameters before they are vector quantized, except that each parameter is linearly scaled to range between minus one and plus one.</Paragraph> <Paragraph position="5"> As shown in Figure 1, the ANN models have one hidden layer and a single output node. Words are constructed from a sequence of triphones. For example, the word &quot;zero&quot; has four input triphones. Each triphone has three unique HMM nodes, and each node has twenty-six input parameters, for a total of 78 inputs per triphone. Hence, the word &quot;zero&quot; has 312 inputs to the ANN. The 78 inputs per triphone are fully interconnected to 25 nodes in the hidden layer. All the hidden layer nodes are fully interconnected to the single output node.</Paragraph> <Paragraph position="6"> ANN TRAINING: Each word ANN is trained from time-aligned data produced by the modified SPHINX.</Paragraph> <Paragraph position="7"> Two sets of data are used to train the ANNs. One set represents the &quot;in class&quot; utterances; the ANN is trained to output plus one when this first set is presented. The second set represents &quot;out of class&quot; utterances and is composed of other words; for this case, the output of the ANN is trained to be minus one. The ratio of &quot;out of class&quot; to &quot;in class&quot; utterances is 3.5 to 1. 
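The per-state sample selection rule (middle sample, left middle on ties, zero-fill for skipped states) and the linear scaling to [-1, +1] can be sketched as follows; the function names and range arguments are illustrative, not from the paper:

```python
def select_frame(frames):
    """Pick the representative speech sample for one HMM state.

    Odd count: the middle frame.  Even count: the left middle frame.
    Empty (the state was skipped via a skip arc): None, which the
    caller replaces with zero inputs to the ANN.
    """
    if not frames:
        return None
    return frames[(len(frames) - 1) // 2]

def scale_parameter(value, lo, hi):
    """Linearly scale a raw parameter into [-1, +1] given its range.

    The ANNs receive the unquantized parameters (cepstra, delta
    cepstra, power/delta-power) rescaled this way instead of VQ symbols.
    """
    return 2.0 * (value - lo) / (hi - lo) - 1.0
```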
Back propagation is used for training, and the ANNs tend to converge to 100% accuracy on the training data after about 300 passes through it.</Paragraph> </Section> </Paper>
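A minimal back-propagation sketch in the spirit of the training procedure above, with tanh units and plus/minus one targets; the toy network size and data are illustrative only (Parfitt's word ANNs used 25 hidden nodes and 78 inputs per triphone):

```python
import math
import random

def train_word_ann(samples, hidden=3, epochs=300, lr=0.1, seed=0):
    """Train a one-hidden-layer perceptron by back-propagation.

    samples: list of (inputs, target) with target +1.0 for "in class"
    utterances and -1.0 for "out of class" ones.  Returns a function
    mapping an input vector to the network's scalar output.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # hidden weights (last entry of each row is the bias) and output weights
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]

    def forward(x):
        h = [math.tanh(sum(w[i] * x[i] for i in range(n_in)) + w[n_in]) for w in w_h]
        y = math.tanh(sum(w_o[j] * h[j] for j in range(hidden)) + w_o[hidden])
        return h, y

    for _ in range(epochs):
        for x, t in samples:
            h, y = forward(x)
            # error deltas; tanh'(z) = 1 - tanh(z)^2
            d_o = (t - y) * (1.0 - y * y)
            d_h = [d_o * w_o[j] * (1.0 - h[j] * h[j]) for j in range(hidden)]
            for j in range(hidden):
                w_o[j] += lr * d_o * h[j]
            w_o[hidden] += lr * d_o
            for j in range(hidden):
                for i in range(n_in):
                    w_h[j][i] += lr * d_h[j] * x[i]
                w_h[j][n_in] += lr * d_h[j]

    return lambda x: forward(x)[1]
```

With separable toy data, 300 passes comfortably drive the training outputs to the correct sign, mirroring the convergence behavior reported above.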