<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-1024">
  <Title>THE LINCOLN CONTINUOUS SPEECH RECOGNITION SYSTEM: RECENT DEVELOPMENTS AND RESULTS 1</Title>
  <Section position="4" start_page="161" end_page="161" type="metho">
    <SectionTitle>
RESULTS OF THE JUNE 88 TEST SYSTEM
</SectionTitle>
    <Paragraph position="0"> The SD June 88 test system used word-context-free triphones (i.e., the triphone contexts included word boundaries, but excluded the phone on the other side of the boundary). Since the pronunciation of function words is often idiosyncratic, the triphones for the function words were also word-dependent [14]. The resulting system has 2,434 triphones, 2,413 of which were observed at least once in training and 21 of which were extrapolated by the recognizer. (The same training scripts were used for all 12 SD speakers.) The SD word error rates were 5.46% for the June 88 test set and 5.19% for the development test set.</Paragraph>
    <Paragraph position="1"> The SI June 88 test system used the same set of triphones as the SD system. Due to the more varied training data, 2,430 triphones were observed, many from only a few speakers. The word error rate for this system is 13.55% for the June 88 test set and 13.25% for the development test set. The large training set (2,431 observed triphones) produced a 10.09% word error rate for the June 88 test set and 10.85% for the development test set.</Paragraph>
  </Section>
  <Section position="5" start_page="161" end_page="162" type="metho">
    <SectionTitle>
WORD BOUNDARY MODELS
</SectionTitle>
    <Paragraph position="0"> The SD system was improved significantly by the addition of word boundary triphone models. (The word-boundary triphones are distinct from word-internal triphones.) In this system, the training data is used twice per Baum-Welch iteration, once to train word-context-free (WCF) models and once to train word-context-dependent models. The same word-internal triphones are used both times. This provides the recognizer with a set of models for the observed word boundaries and a set of WCF models to be used for word boundaries allowed by the grammar but not observed in the training data. This reduces the number of phones extrapolated in the recognizer.</Paragraph>
    <Paragraph position="1"> The number of triphones is more than doubled by the added word boundary models. The SD trainer produces 5,993 triphones: 2,413 WCF and 3,580 word context triphones. In addition, the recognizer extrapolates 443 more triphones.</Paragraph>
    <Paragraph position="2"> Inclusion of word contexts significantly increases the recognition network complexity. Depending on the number of phones in the word, there are three word topologies which must be covered (Figure 1):  1. Three or more phones: each word end has a fan of initial (final) phones.</Paragraph>
    <Paragraph position="3"> 2. Two phones: each word end has a list of initial and final phones with a crossbar of interconnections between them.</Paragraph>
    <Paragraph position="4"> 3. One phone: a crossbar between beginnings and endings with a triphone on each link.</Paragraph>
    <Paragraph position="5"> Links between two adjacent words are formed according to the following priority list: 1. Both boundary triphones exist: link them.</Paragraph>
    <Paragraph position="6">  2. Only one of the boundary triphones exists: link to a WCF triphone on the other word. 3. Neither boundary triphone exists: link WCF boundary triphones from both words.</Paragraph>
    <Paragraph position="7">  Thus, as more word boundaries are observed in the training data, the system gradually builds from the original WCF system toward a system with full word context models.</Paragraph>
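A minimal sketch of the three-way linking priority described above (the triphone names and the `/WCF` naming convention are hypothetical illustrations, not from the paper):

```python
def wcf(tri):
    """Map a word-context triphone to its word-context-free counterpart
    (illustrative naming convention, not from the paper)."""
    return tri + "/WCF"

def boundary_link(left_tri, right_tri, observed):
    """Choose the (left, right) models for a link between two adjacent
    words. `observed` is the set of word-boundary triphones seen in
    training; unseen sides fall back to WCF models."""
    left_ok = left_tri in observed
    right_ok = right_tri in observed
    if left_ok and right_ok:               # 1. both boundary triphones exist
        return left_tri, right_tri
    if left_ok:                            # 2. only one exists: pair with WCF
        return left_tri, wcf(right_tri)
    if right_ok:
        return wcf(left_tri), right_tri
    return wcf(left_tri), wcf(right_tri)   # 3. neither exists: WCF on both sides
```

As more boundaries appear in `observed`, fewer links fall through to the WCF fallback, matching the gradual build-up described in the text.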
    <Paragraph position="8"> The SD development test results for this system showed a significant improvement over the WCF system: 3.39% versus 5.19% word error rate. An earlier system which extrapolated all missing boundary triphones rather the defaulting to WCF triphones did not show an improvement. Thus it is better to use observed WCF triphones rather than extrapolate boundary triphones. The SI results were worse than the WCF system, both with and without the additional training data. The word-context-dependent system appears to be too detailed a model for the available SI training data.</Paragraph>
  </Section>
  <Section position="6" start_page="162" end_page="163" type="metho">
    <SectionTitle>
VARIABLE MIXTURES
</SectionTitle>
    <Paragraph position="0"> Variable order mixtures show a small improvement for the SI task. The number of mixtures for the states in a triphone was chosen by:  min(n, floor(sqrt(number of instances of triphone in data))) This attempts to match the complexity of the distribution to the amount of available training data. It has been tested for n = 4 and n = 8 with both the normal and augmented training sets. The results are in</Paragraph>
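A minimal sketch of this mixture-order rule; the triphone names and occurrence counts are hypothetical, not from the paper:

```python
import math

def mixture_order(n, count):
    """Number of mixture components for a triphone's states: capped at n,
    and limited by the floor of the square root of the number of training
    instances of that triphone."""
    return min(n, math.floor(math.sqrt(count)))

# Hypothetical triphone occurrence counts.
for tri, c in {"k-ae+t": 3, "ae-t+s": 70, "s-ih+t": 150}.items():
    print(tri, mixture_order(8, c))
```

A rarely seen triphone thus gets few components while a well-trained one gets the full n, matching the stated goal of scaling model complexity to the available data.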
  </Section>
  <Section position="7" start_page="163" end_page="163" type="metho">
    <SectionTitle>
TIED MIXTURES
</SectionTitle>
    <Paragraph position="0"> A version of tied mixtures [15,16] has been tested and shown to provide a small improvement for the SI task.</Paragraph>
    <Paragraph position="1"> In this system, each monophone group is given a set of Gaussians. All triphones of each monophone group use mixtures chosen from the same set of Gaussians. The mixture weights for each triphone are independent of all other triphones. This reduces the total number of Gaussians by a significant factor.</Paragraph>
    <Paragraph position="2"> Training is again performed using a bootstrapping procedure. After the monophones are trained, small random perturbations of their mean vectors are used to initialize the mixture Gaussians for the monophone group. The triphone weights, along with the parameters of the Gaussians, are then trained with a number of iterations of the Baum-Welch algorithm.</Paragraph>
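A minimal sketch of this initialization step, assuming a perturbation scale and vector dimension that are illustrative choices, not from the paper:

```python
import random

random.seed(0)

def init_tied_gaussians(monophone_mean, n_components, scale=0.01):
    """Initialize the shared Gaussian means for one monophone group as
    small random perturbations of the trained monophone mean vector.
    (scale=0.01 and the dimension below are illustrative assumptions.)"""
    return [[m + scale * random.gauss(0.0, 1.0) for m in monophone_mean]
            for _ in range(n_components)]

# 40 shared Gaussians for one monophone group, 12-dimensional features.
means = init_tied_gaussians([0.0] * 12, 40)
print(len(means), len(means[0]))
```

The perturbations break the symmetry between components so that subsequent Baum-Welch iterations can pull the shared Gaussians apart while the per-triphone mixture weights are trained.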
    <Paragraph position="3"> The recognizer used here is the simpler WCF system. Three SI systems were tried using 10, 20, and 40 Gaussians per monophone group. Only the 40-Gaussian system showed an improvement over the original SI system: 12.62% versus 13.25% development test word error rate. This system also reduced the number of Gaussians by a factor of five. Tied mixtures have not been tried on the SD task.</Paragraph>
  </Section>
  <Section position="8" start_page="163" end_page="164" type="metho">
    <SectionTitle>
SPEAKER GROUPING
</SectionTitle>
    <Paragraph position="0"> Another approach to improving the SI (WCF) performance was tried. The training speakers were segregated by sex and two separate sets of models were trained. The recognizer kept the sets of models separate by using two separate networks. Thus, the system co-recognizes both the speech and the sex of the speaker.</Paragraph>
    <Paragraph position="1"> Systems which lump both sexes together in training do not discriminate against cross-group spectral matches of individual sounds. Mixtures were not used, to save CPU time. The results in Table 4 show a significant increase in the error rate.</Paragraph>
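The co-recognition scheme can be sketched as follows; `StubNet` and its `decode` method are hypothetical stand-ins for the two sex-dependent recognition networks, not an interface from the paper:

```python
class StubNet:
    """Hypothetical stand-in for one sex-dependent recognition network;
    a real decoder would return (word sequence, log probability)."""
    def __init__(self, hyp, score):
        self.hyp, self.score = hyp, score
    def decode(self, utterance):
        return self.hyp, self.score

def corecognize(utterance, male_net, female_net):
    """Decode the utterance with both separate networks and keep the
    higher-scoring hypothesis; the winning network simultaneously
    labels the sex of the speaker."""
    hyp_m, score_m = male_net.decode(utterance)
    hyp_f, score_f = female_net.decode(utterance)
    return (hyp_m, "male") if score_m >= score_f else (hyp_f, "female")
```

Because the two networks never share states, a female utterance cannot accidentally match a spectrally similar male model of an individual sound, which is the cross-group effect the text describes.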
  </Section>
</Paper>