File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/94/h94-1062_intro.xml

Size: 4,970 bytes

Last Modified: 2025-10-06 14:05:46

<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1062">
  <Title>Tree-Based State Tying for High Accuracy Modelling</Title>
  <Section position="2" start_page="0" end_page="307" type="intro">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Hidden Markov Models (HMMs) have proved to be an effective basis for modelling time-varying sequences of speech spectra. However, in order to accurately capture the variations in real speech spectra (both inter-speaker and intra-speaker), it is necessary to have a large number of models and to use relatively complex output probability distributions. For example, to achieve good performance in a continuous density HMM system, it is necessary to use mixture Gaussian output probability distributions together with context dependent phone models.</Paragraph>
    <Paragraph position="1"> In practice, this creates a data insufficiency problem due to the resulting large number of model parameters. Furthermore, the data is usually unevenly spread so that some method is needed to balance model complexity against data availability.</Paragraph>
    <Paragraph position="2"> This data insufficiency problem becomes acute when a system incorporating cross-word context dependency is used. Because of the large number of possible cross-word triphones, there are many models to estimate and a large number of these triphones will have few, if any, occurrences in the training data. The total number of triphones needed for any particular application depends on the phone set, the dictionary and the grammatical constraints. For example, about 12,600 position-independent triphones are needed for the Resource Management task when using the standard word pair grammar and 20,000 when no grammar is used. For the 20k Wall Street Journal task, around 55,000 triphones are needed. However, only 6,600 triphones occur in the Resource Management training data and only 18,500 in the SI84 section of the Wall Street Journal training data.</Paragraph>
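The scale of the shortfall can be made concrete with a little arithmetic on the figures quoted above. The following is a minimal sketch, assuming that every triphone seen in training is also one of the triphones needed for the task (the text does not state this explicitly) and treating the approximate counts as exact. The function names are illustrative only.

```python
# Triphone counts quoted in the paragraph above: (needed, seen in training).
# The Resource Management figure is the one for the word pair grammar.
TASKS = {
    "Resource Management (word pair grammar)": (12_600, 6_600),
    "Wall Street Journal 20k (SI84 training)": (55_000, 18_500),
}


def coverage(needed: int, seen: int) -> float:
    """Fraction of the needed triphones with at least one training example.

    Assumes the seen triphones are a subset of the needed ones.
    """
    return seen / needed


def unseen(needed: int, seen: int) -> int:
    """Number of needed triphones with no training example (same assumption)."""
    return needed - seen


for task, (needed, seen) in TASKS.items():
    print(f"{task}: {coverage(needed, seen):.1%} seen, {unseen(needed, seen)} unseen")
```

Under these assumptions roughly half of the Resource Management triphones and about two thirds of the Wall Street Journal triphones have no training examples at all.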
    <Paragraph position="3"> Traditional methods of dealing with these problems involve sharing models across differing contexts to form so-called generalised triphones and using a posteriori smoothing techniques [5]. However, model-based sharing is limited in that the left and right contexts cannot be treated independently and hence this inevitably leads to sub-optimal use of the available data. A posteriori smoothing is similarly unsatisfactory in that the models used for smoothing triphones are typically biphones and monophones, and these will be rather too broad when large training sets are used. Furthermore, the need to have cross-validation data unnecessarily complicates the training process.</Paragraph>
    <Paragraph position="4"> In previous work, a method of HMM estimation has been described which involves parameter tying at the state rather than the model level [10,12]. This method assumes that continuous density mixture Gaussian distributions are used and it avoids a posteriori smoothing by first training robust single Gaussian models, then tying states using an agglomerative data clustering procedure and finally converting each tied state to a mixture Gaussian.</Paragraph>
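The middle step of that procedure, bottom-up clustering of single Gaussian states, can be sketched as follows. This is an illustration only: the distance measure (a variance-normalised Euclidean distance between state means), the complete-linkage stopping rule, the threshold and the example state parameters are all assumptions made for the sketch, not details taken from the text. The final conversion of each tied state to a mixture Gaussian is not shown.

```python
import math
from itertools import combinations

# A single-Gaussian state as (mean vector, diagonal variance vector).
State = tuple[list[float], list[float]]


def distance(a: State, b: State) -> float:
    """Variance-normalised Euclidean distance between two state means.

    This metric is an assumption made for the sketch.
    """
    (mu_a, var_a), (mu_b, var_b) = a, b
    return math.sqrt(sum((ma - mb) ** 2 / math.sqrt(va * vb)
                         for ma, mb, va, vb in zip(mu_a, mu_b, var_a, var_b)))


def diameter(members: set[str], states: dict[str, State]) -> float:
    """Largest pairwise distance inside a cluster (complete linkage)."""
    return max((distance(states[p], states[q])
                for p, q in combinations(members, 2)), default=0.0)


def tie_states(states: dict[str, State], threshold: float) -> list[set[str]]:
    """Merge clusters bottom-up until the best merge would exceed the threshold."""
    clusters = [{name} for name in states]
    while len(clusters) > 1:
        # Pick the pair of clusters whose union has the smallest diameter.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: diameter(clusters[ij[0]] | clusters[ij[1]], states))
        merged = clusters[i] | clusters[j]
        if diameter(merged, states) > threshold:
            break
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Hypothetical centre states of four triphones of the phone iy.
example_states = {
    "t-iy+n[2]": ([1.0, 2.0], [1.0, 1.0]),
    "t-iy+ng[2]": ([1.1, 2.1], [1.0, 1.0]),
    "f-iy+l[2]": ([4.0, 0.0], [1.0, 1.0]),
    "s-iy+l[2]": ([4.2, 0.1], [1.0, 1.0]),
}
print(tie_states(example_states, threshold=1.0))
```

With these made-up parameters the two states with nasal right contexts end up tied together, as do the two states with fricative left contexts. Each state to be clustered needs its own estimated Gaussian, so a sketch of this kind can only tie states of triphones that occur in the training data.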
    <Paragraph position="5"> This works well for systems which have only word internal triphone models and for which it is therefore possible to find some data for every triphone. However, as indicated by the figures given above, systems which utilise cross-word triphones require data for a very large number of triphones and, in practice, many of them will be unseen in the training data.</Paragraph>
    <Paragraph position="6"> In this paper, the state tying approach is developed further to accommodate the construction of systems which have unseen triphones. The new system is based on the use of phonetic decision trees [1,2,6] which are used to determine contextually equivalent sets of HMM states.</Paragraph>
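How a phonetic decision tree yields a tied state for any context, seen or unseen, can be illustrated with a small lookup sketch. Everything concrete here is hypothetical: the phone classes, the questions, the tree shape and the tied-state names are invented for illustration, and the sketch covers only the lookup, not how the tree is built.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical phone classes used as yes/no questions about the context.
NASALS = frozenset({"n", "ng", "m"})
FRICATIVES = frozenset({"f", "s", "sh", "z"})


@dataclass
class Leaf:
    tied_state: str  # name of the shared output distribution


@dataclass
class Node:
    side: str  # "left" or "right": which context the question asks about
    phone_set: frozenset
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def parse_triphone(name: str) -> tuple[str, str, str]:
    """Split a name of the form 'l-c+r' into (left, centre, right)."""
    left, rest = name.split("-", 1)
    centre, right = rest.split("+", 1)
    return left, centre, right


def lookup(tree: Union[Node, Leaf], triphone: str) -> str:
    """Answer the questions from the root down and return the tied state."""
    left, _, right = parse_triphone(triphone)
    node = tree
    while isinstance(node, Node):
        context = left if node.side == "left" else right
        node = node.yes if context in node.phone_set else node.no
    return node.tied_state


# Hypothetical tree for the centre state of the phone iy.
iy_tree = Node(
    side="right", phone_set=NASALS,
    yes=Leaf("iy_s2_a"),
    no=Node(side="left", phone_set=FRICATIVES,
            yes=Leaf("iy_s2_b"), no=Leaf("iy_s2_c")),
)

for name in ["t-iy+n", "t-iy+ng", "f-iy+l", "s-iy+l", "z-iy+l"]:
    print(name, lookup(iy_tree, name))
```

Because the lookup depends only on the phonetic identity of the left and right contexts, a triphone such as z-iy+l receives a tied state even if it never appeared in training.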
    <Paragraph position="7"> In order to be able to handle large training sets, the tree building is based only on the statistics encoded within [...] to the original data. [Figure residue: example triphones t-iy+n, t-iy+ng, f-iy+l, s-iy+l]</Paragraph>
    <Paragraph position="8"> This tree-based clustering is shown to lead to similar modelling accuracy to that obtained using the data-driven approach but to have the additional advantage of providing a mapping for unseen triphones [3]. State-tying is also compared with traditional model-based tying and shown to be clearly superior.</Paragraph>
    <Paragraph position="9"> The arrangement of this paper is as follows. In the next section, the method of HMM system building using state tying is reviewed and then in section 3, the phonetic decision tree based method is described. Experimental results are presented in section 4 using the HTK speech recognition system [8,9] for both the Resource Management and Wall Street Journal tasks. Finally, section 5 presents our conclusions from this work.</Paragraph>
  </Section>
</Paper>