
<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1067">
  <Title>Experiments with Tree-Structured MMI Encoders on the RM Task</Title>
  <Section position="5" start_page="348" end_page="349" type="evalu">
    <SectionTitle>
EXPERIMENTS
</SectionTitle>
    <Paragraph position="0"> We compared various MMI tree encoders with the standard MD encoders (quantizers), as used in Sphinx and other discrete HMM-based systems. Both MMI and MD encoders produce codes, which were used as input to the Sphinx System \[3\]. For this study, a simplified version of the Sphinx system was used. Instead of context-dependent modeling, we used only context-independent models. Instead of 51 features (as used in the latest version), we used only 26 features (12 cepstrum coefficients, 12 differenced cepstrum coefficients, power and differenced power). Therefore, the results should be evaluated relatively rather than absolutely.</Paragraph>
    <Paragraph position="1"> We evaluated both a three-codebook version (256 codes per codebook) and a one-codebook version (1024 codes). For the one-codebook version, we also used co-occurrence smoothing and deleted interpolation \[4\] to smooth rarely observed codes. We used the standard inventory of 48 phonetic models, each with 7 states and 3 output pdt~s.</Paragraph>
    <Paragraph position="2"> We also started a preliminary evaluation of the second-stage segment MMI codes for a version of the Sphinx system using context-dependent HMMs. Results are given at the end of this section.</Paragraph>
    <Paragraph position="3"> The task for our study is the DARPA Resource Management (RM) task, with the perplexity 60 word-pair grammar \[7\]. We used the standard &amp;quot;extended training set&amp;quot; of 3990 sentences from 109 speakers for speaker-independent training. We trained the phonetic HMMs on all 3990 sentences. All results were evaluated on 300 independent test sentences from 12 speakers (the June 88 test set). Following that, selected cases were evaluated on the RM2 June 90 test set as a verification.</Paragraph>
    <Paragraph position="4"> We first generated a first-stage MMI tree encoder (MMI1024). This tree was grown using 144 target phonetic classes (48 phones x 3 distributions). All 26 features were accessible at all nodes to form linear decision boundaries (via linear combination splits). We used half of the training sentences to grow the MMI tree encoder, and all of the training sentences to prune it. This tree was grown to 1430 codes, and then pruned to 1024 codes. The average code-class mutual information and corresponding error rate (substitution + deletion + insertion) on the RM task (after the Forward-Backward training with co-occurrence smoothing and deleted interpolation) are shown in in Table 2.</Paragraph>
    <Paragraph position="5"> To evaluate this result, we also generated an MD encoder (quantizer) that used the same 26 features, utilizing a weighted Euclidean distance (MD-1024) \[3\]. The results of this encoder (again, after the Forward-Backward training with co-occurrence smoothing and deleted interpolation) are shown in Table 2. In this experiment, the MMI-1024 encoder error rate was 3.5% lower than the MD-1024 encoder (a 15% reduction in error raate).</Paragraph>
    <Paragraph position="6">  frame stage encoder: a single codebook.</Paragraph>
    <Paragraph position="7"> Since the standard Sphinx system uses three separate VQ codebooks, we also compared the performance of a 3-codebook MD encoder and a 3-codebook MMI encoder. In each case, the encoder has access only to a subset of the features (VQ1 - 12 cepstrum coefficients, VQ2 - 12 differenced cepstrum coefficients, and VQ3 - power &amp; differenced power). The codebook size was the same for all the encoders (256 codes). Co-occurrence smoothing of the output code pdfs was not performed in these experiments, but deleted interpolation was done. The results (see Table 3) indicate that the MMI encoder gives slightly higher error than the MD encoder (despite higher information extracted), and both were worse than the MMI-1024 encoder. We conclude that effective tree encoders require access to the entire feature vector, so as to exploit the between-feature relationships.</Paragraph>
    <Paragraph position="8">  frame-stage encoders: three codebooks.</Paragraph>
    <Paragraph position="9"> Next, we evaluated the second-stage MMI tree encoder.</Paragraph>
    <Paragraph position="10"> We used a three-segment sliding window to compute features derived from the 26 frame acoustic parameters, and categorical features derived from the segment phonetic identities discovered by the first-stage tree encoder. Segment duration features were also computed.</Paragraph>
    <Paragraph position="11"> The target labels for segments were derived from the labels of the constituent frames. Using those targets, we grew a second-stage MMI tree encoder to 1417 codes (using all of the training sentences) and then pruned it to 1024  codes. The codes output by the encoder were further compressed by combining runs of segments with the same codes into larger segments.</Paragraph>
    <Paragraph position="12"> We evaluated the second-stage codes in two ways: as frame codes (every constituent frame of a segment was assigned the segment code, MMI-SF), and as segment codes (one code per segment, MMI-SS). Respectively, we trained two sets of the phonetic HMMs (standard 48 phonetic models of the SPHINX system) and ran recognition tests using streams of frame and segment codes. The code-class mutual information and corresponding error rates are shown in Table 4 (after the Forward-Backward training with cooccu~ence smoothing and deleted interpolation).</Paragraph>
    <Paragraph position="13"> Although MMI-SF extracts substantially more information, the performance was slightly lower. However, switching to segment codes (MMI-SS) resulted in a performance improvement of 4.0% (21% reduction in error) relative to the first stage alone. Performance was improved  SF is segment codes on frames; MMI-SS is segment codes on segments.) It was also found that MMI segment codes lead to significant frame compression (on the average, 1.6 frames/segment) and therefore to significant speed advantages (which should be roughly proportional to the reduction in segments). Table 5 illustrates this phenomenon. Thus, there was a simultaneous improvement in speed and accuracy using an MMI segment encoder rather than an MD vector quantizer.</Paragraph>
    <Paragraph position="14"> Table 5 displays the average number of temporal units (frames or segments) per target label class in the Per Target column. The number of segments decreases with each stage of successive temporal compression. In the final segmentation, the number of temporal units per target label class is reduced by a factor of 1.6. We can measure whether the temporal compression loses target class segment boundaries, by examining the percentages of the target label classes which were merged into groups of two or more within single segments (Merged Targets column); only 3.2% of such targets were merged by the final segmentation stage.  We conjecture that the observed improvement in the recognition accuracy for the segment codes versus frame codes is mainly due to the following. First, the underlying assumption of independence of the output code distributions given a transition in a phonetic class model (made for use of the hidden Markov models of phonetic classes) is satisfied to a greater extent when the runs of frames with the same code are merged in a single segment code, thus absorbing short-time dependencies. Therefore, the HMMs become more adequate models of the phonetic classes. Second, there remains a sufficient amount of training data for the segment codes after the data is compressed due to segmentation.</Paragraph>
    <Paragraph position="15"> Finally, segmentation does not lead to any significant merging of the target label classes within the resulting segments, thereby retaining temporal resolution of phonetic targets.</Paragraph>
    <Paragraph position="16"> We also made a preliminary evaluation of the 2nd-stage segment MMI codes for a version of the Sphinx system using context-dependent HMMs (1100 generalized triphone models for within- and between-word triphones). The results are shown in Table 6. Results for a comparable Sphinx configuration using 3 MD codebooks (using subsets of the 26 features) is shown for comparison. In both cases, co-occurrence smoothing was performed along with deleted interpolation. Although the word accuracy is close for both cases, the decoding speedup for the segment codes gives the advantage to the MMI encoder. We view these results as rather encouraging, in view of the following limitations: (a) the encoder tree's topology was not utilized for pdf smoothing, and (b) training of the MMI encoders was done on the pdf labels of the 48 phones, and not on the generalized triphones. Further investigation of the use of MMI encoders with context-dependent HMMs will be conducted in the future.</Paragraph>
  </Section>
class="xml-element"></Paper>