
<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1067">
  <Title>Experiments with Tree-Structured MMI Encoders on the RM Task</Title>
  <Section position="3" start_page="346" end_page="346" type="metho">
    <SectionTitle>
SYSTEM OVERVIEW
</SectionTitle>
    <Paragraph position="0"> Here we briefly describe the system used in our experiments. Figure 1 summarizes the encoding process and the experiments performed.</Paragraph>
    <Section position="1" start_page="346" end_page="346" type="sub_section">
      <SectionTitle>
Acoustic Processing
</SectionTitle>
      <Paragraph position="0"> The speech is sampled at 16 kHz and is converted into a sequence of 10-msec frames of 26 acoustic parameters: 12 cepstrum coefficients, 12 differenced cepstrum coefficients, power and differenced power \[3\].</Paragraph>
      <Paragraph position="1"> Labelling Training of the tree-structured MMI encoders is performed using labelled speech data. The set of label classes used for labelling contains 144 classes: there is a unique label class for each of the three pdf's (roughly corresponding to beginning, middle, and end) of each of the 48 Sphinx context-independent phones. Labelled frame data for training is obtained via Viterbi alignment using the Sphinx system. First-Stage (Frame) MMI Encoder At the first (frame-coding) stage, frames are encoded in such a way as to convey maximum information about their underlying label class identities. To perform frame encoding, the frame time-sequence is scanned by a &amp;quot;sliding window&amp;quot; covenng W frames; in our experiments, we kept W = 1 (a constraint imposed for the sake of a fair comparison between the first-stage MMI encoder and the standard MD encoder; normally, we use a three-frame window). A set of the 26 acoustic parameters of a frame was used as a feature vector accessed by the window. The tree frame encoder takes as input this feature vector and outputs a code for the frame at the center of the window. The encoder is trained to maximize the average mutual information between its code alphabet and the alphabet comprised of the 144 target label classes.</Paragraph>
      <Paragraph position="2"> The resulting sequence of coded acoustic frames is further processed to form acoustic segments by merging timecontiguous blocks of frames with the same code. Also, the most likely broad phonetic class is assigned to each formed segment. The stream of the acoustic segments with the assigned segmentation classes constitutes the input to the segment-coding stage.</Paragraph>
      <Paragraph position="3"> Second-Stage (Segment) MMI Encoder The second (segment-coding) stage processing is similar to that of the frame-coding stage. Namely, segments are encoded in such a way as to convey maximum information about their underlying phonetic classes.</Paragraph>
      <Paragraph position="4"> To perform segment encoding, the stream of segments is scanned by a sliding time window covenng three segments (W = 3). A set of pre-defined feature vectors is extracted from the acoustic parameters of all the frames encountered in the segments accessed by the window. Also, the most-likely broad phonetic classes assigned after the first stage to each of the three segments in the window comprise additional categorical variables. These variables provide phonetic features complementing the acoustic features. Segment duration features are also computed. The segment encoder tree takes as input these sets of features and outputs a code for the segment in the center of the window. The encoder is trained to maximize the average mutual information between its code alphabet and the alphabet comprised of the 144 target label classes. The target labels for segments were derived from the labels of the constituent frames.</Paragraph>
      <Paragraph position="5"> To obtain a categorical feature for use in the tree based on the phonetic class of a segment, we combined 144 target phonetic classes into nine broad superclasses, and used the most likely superclass number for each code. The selected set of broad phonetic superclasses is shown in Table 1 (in the standard notation of the Sphinx phonetic system, \[3\]).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="346" end_page="348" type="metho">
    <SectionTitle>
2  W L
3  V F TH
4  T K P H TD PD KD G B D DH DX DD
5  R ER
6  N M NG
7  AH AE AA AY AO OW OY AW
8  EH EY IH IY AX IX Y UH UW
</SectionTitle>
    <Paragraph position="0"> Table 1: Broad phonetic superclasses (Sphinx phone notation) used as categorical features by the second-stage tree.</Paragraph>
    <Paragraph position="1"> The resulting sequence of coded segments is further processed to form larger segments as in the first stage. The stream of the enlarged segments with the assigned codes constitutes the output of the second stage.</Paragraph>
    <Paragraph position="2"> MMI Training The first- and second-stage MMI encoders are trained using labelled data (supervised training). The encoders are trained as binary decision trees using maximization of the average mutual information I(classes, codes) between the set of target label classes and the set of leaf-node numbers (codes), as the training criterion:</Paragraph>
    <Paragraph position="4"> class code where Pr(class,code) is the joint probability of the class and the code assigned to a training sample, Pr(class) and Pr(code) are the marginal class and code probabilities, respectively. Training is performed top-down, starting from the root of the binary decision tree. The decision function associated with each decision node of the tree effects the split of the feature space with a hyperplane (for the continuous-valued feature vectors) or a dichotomy of a discrete set (for a categorical feature). The training samples at each node were those which reached that node after passing through predecessor nodes.</Paragraph>
    <Paragraph position="5"> Training of the decision function at a node uses as an optimization criterion the reduction in the node's average class entropy. To find a decision hyperplane, we use conjugate gradient based search \[9\] where the gradient of the criterion function with respect to the hyperplane coefficients is computed by replacing the &amp;quot;hard limiter&amp;quot; decision function with a piecewise linear one (the threshold-logic type) and gradually annealing that non-linearity to the hard limiter.</Paragraph>
    <Paragraph position="6">  Once the optimal coefficients are estimated, we use the hard limiter decision function to send the patterns to the left or the right child node. We don't split a node if the highest reduction in the class entropy attained by the &amp;quot;node split&amp;quot; is less than a certain fraction of the node's class entropy; a final node is a terminal node.</Paragraph>
    <Paragraph position="7"> After the entire binary tree is created, its performance criterion (i.e. the average mutual information between the set of target classes and the set of the terminal nodes) is evaluated with a combination of the training and independent sets of labelled data. Some nodes are then removed, starting with the current terminal nodes, i.e., the tree is &amp;quot;pruned,&amp;quot; to produce a more robust subtree with more accurate estimates of the node-class probabilities. The resulting terminal node numbers are used as codes.</Paragraph>
    <Paragraph position="8"> The above training and pruning of the trees was performed utilizing SSI tree-growing software.</Paragraph>
  </Section>
class="xml-element"></Paper>