<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1049">
  <Title>A DYNAMICAL SYSTEM APPROACH TO CONTINUOUS SPEECH RECOGNITION</Title>
  <Section position="4" start_page="0" end_page="253" type="metho">
    <SectionTitle>
BBN Inc.
10 Moulton St.
Cambridge, MA 02138
</SectionTitle>
    <Paragraph position="0"> used. We attribute the performance decrease to insufficient training data and the noisy nature of the cepstral coefficients. In this work we deal with the problem of noisy observations through a time-inhomogeneous dynamical system formalism, including observation noise in our model.</Paragraph>
    <Paragraph position="1"> Under the assumption that we model speech as a Gaussian process at the frame-rate level, a linear state-space dynamical system can be used to parameterize the density of a segment of speech. This is a natural generalization of our previous Gauss-Markov approach, with the addition of modeling error in the form of observation noise.</Paragraph>
    <Paragraph position="2"> We can make two different assumptions to address the time-variability issue: 1. Trajectory invariance (A1): There are underlying unobserved trajectories in state-space that basic units of speech follow. In the dynamical system formalism, this assumption translates to a fixed sequence of state transition matrices for any occurrence of a speech segment. Then, the problem of variable segment length can be solved by assuming that the observed feature vectors are not only a noisy version of the fixed underlying trajectory, but also an incomplete one with missing observations. Successive observed frames of speech have stronger correlation for longer observations, since the underlying trajectory is sampled at shorter intervals (in feature space).</Paragraph>
    <Paragraph position="3"> 2. Correlation invariance (A2): The underlying trajectory in phase space is not invariant under time-warping transformations. In this case, the sequence of state transition matrices for a particular observation of a phoneme depends on the phoneme length, and we have a complete (albeit noisy) observation of the state sequence. We then assume that it is the correlation between successive frames that is invariant to variations in the segment length.</Paragraph>
    <Paragraph position="4"> Under either assumption, the training problem with a known segmentation is that of maximum likelihood identification of a dynamical system. We use here a nontraditional method based on the EM algorithm that can be used easily under either correlation or trajectory invariance. The model is described in Section, and the identification algorithms are in Section . In Section we shall briefly describe phoneme classification and recognition algorithms for this model, and finally in Section we present phone classification results on the TIMIT database [5].</Paragraph>
  </Section>
  <Section position="5" start_page="253" end_page="253" type="metho">
    <SectionTitle>
A DYNAMICAL MODEL FOR
SPEECH SEGMENTS
</SectionTitle>
    <Paragraph position="0"> A segment of speech is represented by an L-long sequence of q-dimensional feature vectors Z = [z_1 z_2 ... z_L].</Paragraph>
    <Paragraph position="1"> The original stochastic segment model for Z had two components [7]: i) a time transformation T_L to model the variable-length observed segment in terms of a fixed-length unobserved sequence, Z = Y T_L, where Y = [y_1 y_2 ... y_M], and ii) a probabilistic representation of the unobserved feature sequence Y. We assumed in the past [3] that the density of Y was that of an inhomogeneous Gauss-Markov process.</Paragraph>
    <Paragraph position="2"> We then showed how the EM algorithm can be used to estimate the parameters of the models under this assumption.</Paragraph>
    <Paragraph position="3"> In this work, we extend the modeling of the feature sequence to the more general Markovian representation for each different phone model α</Paragraph>
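    <Paragraph> Equation (1) itself does not appear in this text. As a hedged sketch, a linear state-space form consistent with the definitions that follow would be the one below. The symbols x_k (state), w_k and v_k (zero-mean white Gaussian state and observation noise), F_k (state transition matrix), and Q_k and R_k (noise covariances) are assumed notation rather than recovered text.</Paragraph>
    <Paragraph><![CDATA[
```latex
\begin{aligned}
x_{k+1} &= F_k(\alpha)\, x_k + w_k, \\
y_k &= H_k(\alpha)\, x_k + v_k,
\end{aligned}
\qquad
E\{w_k w_l^T\} = Q_k(\alpha)\,\delta_{kl}, \quad
E\{v_k v_l^T\} = R_k(\alpha)\,\delta_{kl}
\tag{1}
```
]]></Paragraph>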
    <Paragraph position="5"> where δ_kl is the Kronecker delta. We further assume that the initial state x_0 is Gaussian with mean μ_0(α) and covariance Σ_0(α). In this work, we arbitrarily choose the dimension of the state to be equal to that of the feature vector, and H_k(α) = I, the identity matrix. The sequence Y is either fully or partially observed under the assumptions of correlation and trajectory invariance, respectively. In order to reduce the number of free parameters in our model, we assume that a phone segment is locally stationary over different regions within the segment, where those regions are defined by a fixed time warping that in this work we simply choose as linear. In essence, we are tying distributions, and the way this is done under the correlation and trajectory invariance assumptions is shown in Figure 1.</Paragraph>
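    <Paragraph> The linear time warping described above can be illustrated with a minimal Python sketch. The function name linear_warp and the floor-based rounding are assumptions made for illustration; the exact tying shown in Figure 1 may differ.</Paragraph>
    <Paragraph><![CDATA[
```python
def linear_warp(num_frames, num_targets):
    """Map each of num_frames frame indices linearly onto num_targets indices."""
    if num_frames <= 0 or num_targets <= 0:
        raise ValueError("both lengths must be positive")
    # Integer arithmetic avoids floating-point rounding at region boundaries.
    return [(i * num_targets) // num_frames for i in range(num_frames)]


# A2 (correlation invariance): every frame is observed; the warp only decides
# which tied region's parameters govern each of the 5 frames (3 regions here).
regions_for_frames = linear_warp(5, 3)

# A1 (trajectory invariance): the trajectory has a fixed length (6 here); the
# warp picks which trajectory points are observed, the rest are black-outs.
observed_times = linear_warp(3, 6)
```
]]></Paragraph>
    <Paragraph> Under this sketch, a longer segment under A1 samples the fixed trajectory more densely, which matches the remark that successive frames are more strongly correlated for longer observations.</Paragraph>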
    <Paragraph position="6"> The likelihood of the observed sequence Z can be obtained by the Kalman predictor, as</Paragraph>
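    <Paragraph> Equation (2) does not appear in this text. The standard innovations form of the Gaussian log-likelihood, which is consistent with the description that follows, is sketched below. The innovation symbol e_k (the observed frame minus its one-step prediction) is assumed notation.</Paragraph>
    <Paragraph><![CDATA[
```latex
\log p(Z \mid \alpha) = -\frac{1}{2} \sum_{k=1}^{L}
\Big\{ \log \big|\Sigma_{e_k}(\alpha)\big|
+ e_k^T(\alpha)\, \Sigma_{e_k}^{-1}(\alpha)\, e_k(\alpha) \Big\}
+ \text{constant}
\tag{2}
```
]]></Paragraph>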
    <Paragraph position="8"> where Σ_{e_k}(α) is the prediction error variance given phone model α. In the trajectory invariance case, innovations are only computed at the points where the output of the system is observed, and the predicted state estimate for these times can be obtained by the l-step-ahead prediction form of the Kalman filter, where l is the length of the last "black-out" interval - the number of missing observations y immediately before the last observed frame z.</Paragraph>
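    <Paragraph> The likelihood computation can be illustrated with a short NumPy sketch. It assumes H = I as in the text and, for brevity, time-invariant F, Q and R, whereas the model in the text lets them vary with time. The function name and the convention of marking missing frames with NaN are hypothetical. Missing frames skip the measurement update, so the next innovation is formed from a multi-step-ahead prediction.</Paragraph>
    <Paragraph><![CDATA[
```python
import numpy as np


def kalman_log_likelihood(Z, F, Q, R, mu0, P0):
    """Log-likelihood of a segment from Kalman predictor innovations (H = I).

    Z is an (L, q) array; a row containing NaN marks a missing frame.
    """
    q = len(mu0)
    x_pred = np.asarray(mu0, dtype=float)   # predicted state mean
    P_pred = np.asarray(P0, dtype=float)    # predicted state covariance
    log_lik = 0.0
    for z in np.asarray(Z, dtype=float):
        if not np.isnan(z).any():
            e = z - x_pred                   # innovation, since H = I
            S = P_pred + R                   # prediction error covariance
            _, log_det = np.linalg.slogdet(S)
            log_lik -= 0.5 * (q * np.log(2 * np.pi) + log_det
                              + e @ np.linalg.solve(S, e))
            # Kalman gain P_pred S^{-1}; the transpose trick relies on symmetry.
            K = np.linalg.solve(S, P_pred).T
            x_filt = x_pred + K @ e
            P_filt = P_pred - K @ P_pred
        else:
            # Black-out: no measurement update, prediction is carried forward.
            x_filt, P_filt = x_pred, P_pred
        x_pred = F @ x_filt
        P_pred = F @ P_filt @ F.T + Q
    return log_lik
```
]]></Paragraph>
    <Paragraph> Here the first frame is predicted directly from the initial state distribution, which is one possible indexing convention among several.</Paragraph>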
  </Section>
  <Section position="6" start_page="253" end_page="254" type="metho">
    <SectionTitle>
TRAINING
</SectionTitle>
    <Paragraph position="0"> The classical method to obtain maximum likelihood estimates involves the construction of a time-varying Kalman predictor and the expression of the likelihood function in terms of the prediction error as in (2) [1]. Minimizing the negative log-likelihood function is a nonlinear programming problem, and the iterative optimization methods that have to be used all require the first and perhaps the second derivatives of the log-likelihood function with respect to the system parameters. The solution requires the integration of adjoint equations, and the method becomes too involved under the trajectory invariance assumption, where we have missing observations.</Paragraph>
    <Paragraph position="1"> We have developed a nontraditional iterative method for maximum likelihood identification of a stochastic dynamical system, based on the observation that the computation of the estimates would be simple if the state of the system were observable: it would only require simple first and second order sufficient statistics of the state and observation vectors. The Estimate-Maximize algorithm provides an approach for estimating parameters for processes having unobserved components, in this case the state vectors, and therefore can be used for maximum likelihood identification of dynamical systems.</Paragraph>
    <Paragraph position="2"> If we denote the parameter vector of phone model α by θ, then at the pth iteration of the EM algorithm the new estimate of the parameter vector is obtained by minimizing $$E\Big\{ \sum \sum_k \big[ (x_{k+1} - F_k x_k)^T Q_k^{-1} (x_{k+1} - F_k x_k) + \log|Q_k| + (y_k - H_k x_k)^T R_k^{-1} (y_k - H_k x_k) + \log|R_k| \big] + \text{constant} \,\Big|\, Z, \theta^{(p)} \Big\} \qquad (3)$$ where we have suppressed the dependence of the system parameters on phone model α, and the first summation is over all occurrences of a specific phone model in the training data.</Paragraph>
    <Paragraph position="3"> Since the noise process is assumed to be Gaussian, the EM algorithm simply involves iteratively computing the expected first and second order sufficient statistics given the current parameter estimates. It is known from Kalman filtering theory [1] that the conditional distribution of the state X given the observations Z on an interval is Gaussian. The sufficient statistics are then</Paragraph>
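    <Paragraph> The displayed expressions do not appear in this text. Because the conditional distribution is Gaussian, the expected statistics presumably take the standard form sketched below, written with the smoothed quantities named in the next paragraph.</Paragraph>
    <Paragraph><![CDATA[
```latex
\begin{aligned}
E\{x_k \mid Z\} &= \hat{x}_{k|L}, \\
E\{x_k x_k^T \mid Z\} &= \Sigma_{k|L} + \hat{x}_{k|L}\,\hat{x}_{k|L}^T, \\
E\{x_k x_{k-1}^T \mid Z\} &= \Sigma_{k,k-1|L} + \hat{x}_{k|L}\,\hat{x}_{k-1|L}^T
\end{aligned}
```
]]></Paragraph>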
    <Paragraph position="5"> where the quantities on the right, x̂_{k|L}, Σ_{k|L} and Σ_{k,k-1|L}, are the fixed-interval smoothed state estimate, its variance and the one-lag cross-covariance, respectively. The computation of these sufficient statistics can be done recursively. Under A2, since Y = Z, it reduces to the fixed-interval smoothing form of the Kalman filter, together with some additional recursions for the computation of the cross-covariance. These recursions consist of a forward pass through the data, followed by a backward pass, and are summarized in Table 1.</Paragraph>
    <Paragraph position="6"> Under A1, the recursions take the form of a fixed-interval smoother with blackouts, and can be derived similarly to the standard Kalman filter recursions.</Paragraph>
    <Paragraph position="7"> To summarize, assuming a known segmentation and therefore a known sequence of system models, the EM algorithm involves at each iteration the computation of the sufficient statistics described previously, using the recursions of Table 1 and the old estimates of the model parameters (Estimate step).</Paragraph>
    <Paragraph position="9"/>
    <Section position="1" start_page="254" end_page="254" type="sub_section">
      <SectionTitle>
Backward Recursions
</SectionTitle>
      <Paragraph position="0"> The new estimates for the system parameters can then be obtained from these statistics as simple multivariate regression coefficients (Maximize step). In addition, the structure of the system matrices can be constrained in order to satisfy identifiability conditions. When the segmentation is unknown, since the estimates obtained from our known segmentation method are maximum likelihood ones, training can be done in an iterative fashion, as described in [6].</Paragraph>
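      <Paragraph> The Maximize step can be sketched in NumPy as a regression on the expected second-order statistics. This is an illustration only: it assumes H = I, a single fully observed segment (A2), one set of time-invariant F, Q and R, and a hypothetical function name and argument layout. The method in the text sums the statistics over all occurrences of a phone and ties parameters by region.</Paragraph>
      <Paragraph><![CDATA[
```python
import numpy as np


def m_step(x_smooth, P_smooth, P_lag, Z):
    """Re-estimate F, Q, R (with H = I) from smoothed second-order statistics.

    x_smooth: (L, n) smoothed means; P_smooth: (L, n, n) smoothed covariances;
    P_lag: (L-1, n, n) with P_lag[k-1] = Cov(x_k, x_{k-1} | Z); Z: (L, n).
    """
    # Expected second moments E{x_k x_k^T | Z} and E{x_k x_{k-1}^T | Z}.
    xx = P_smooth + np.einsum("ki,kj->kij", x_smooth, x_smooth)
    x_cross = P_lag + np.einsum("ki,kj->kij", x_smooth[1:], x_smooth[:-1])

    gamma_prev = xx[:-1].sum(axis=0)    # sum of E{x_{k-1} x_{k-1}^T}
    gamma_curr = xx[1:].sum(axis=0)     # sum of E{x_k x_k^T}
    gamma_cross = x_cross.sum(axis=0)   # sum of E{x_k x_{k-1}^T}

    # Regression of x_k on x_{k-1}: F = gamma_cross gamma_prev^{-1}.
    F = np.linalg.solve(gamma_prev, gamma_cross.T).T
    Q = (gamma_curr - F @ gamma_cross.T) / (len(x_smooth) - 1)

    # Observation noise with H = I: average of E{(z_k - x_k)(z_k - x_k)^T}.
    resid = Z - x_smooth
    R = (np.einsum("ki,kj->ij", resid, resid) + P_smooth.sum(axis=0)) / len(Z)
    return F, Q, R
```
]]></Paragraph>
      <Paragraph> When the smoothed covariances are zero, that is, when the state is effectively observed, these updates reduce to ordinary least-squares regression, which is the simplification that motivates the EM approach.</Paragraph>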
    </Section>
  </Section>
  <Section position="7" start_page="254" end_page="254" type="metho">
    <SectionTitle>
RECOGNITION
</SectionTitle>
    <Paragraph position="0"> When the phonetic segmentation is known, under both assumptions A1 and A2 the model sequence can be determined from the segmentation and therefore the MAP rule can be used for phone classification, where the likelihood of the observations is obtained from the Kalman predictor (2).</Paragraph>
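    <Paragraph> As a minimal sketch of the MAP rule for a single segment, the Python function below picks the phone that maximizes the sum of log-likelihood and log-prior. The function name, the dictionary layout and the example scores are hypothetical; in practice the log-likelihoods would come from the Kalman predictor.</Paragraph>
    <Paragraph><![CDATA[
```python
import math


def map_classify(log_likelihoods, log_priors):
    """Return the phone label maximizing log p(Z | phone) + log P(phone)."""
    return max(log_likelihoods,
               key=lambda phone: log_likelihoods[phone] + log_priors[phone])


# Illustrative numbers only: two phone models scored on one segment.
scores = {"aa": -10.0, "iy": -9.0}
uniform = {"aa": math.log(0.5), "iy": math.log(0.5)}
skewed = {"aa": math.log(0.9), "iy": math.log(0.1)}

label_uniform = map_classify(scores, uniform)   # likelihood alone decides
label_skewed = map_classify(scores, skewed)     # the prior changes the decision
```
]]></Paragraph>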
    <Paragraph position="1"> For connected-phone recognition, with unknown segmentation, the MAP rule for detecting the most likely phonetic sequence involves computing the total probability of a certain sequence by summing over all possible segmentations.</Paragraph>
    <Paragraph position="2"> Because of the computational complexity of this approach, one can jointly search for the most likely phone sequence and segmentation given the observed sequence. This can be done with a Dynamic-Programming recursion. In previous work we have also introduced alternative fast algorithms for both phone classification and recognition [4] which yield performance similar to Dynamic-Programming with significant computation savings.</Paragraph>
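    <Paragraph> The joint search over phone sequence and segmentation can be sketched as a segment-level Dynamic-Programming recursion. This is a generic illustration and not the specific algorithm of [4]: the function name, the max_len duration limit and the segment_score callable (which would return a segment log-likelihood plus any log-prior) are all hypothetical.</Paragraph>
    <Paragraph><![CDATA[
```python
import math


def dp_recognize(num_frames, phones, segment_score, max_len):
    """Jointly search for the best phone sequence and segmentation.

    segment_score(phone, start, end) is a hypothetical callable returning the
    log score of frames start..end-1 under the given phone model.
    """
    best = [-math.inf] * (num_frames + 1)   # best[t]: best score of frames 0..t-1
    back = [None] * (num_frames + 1)        # back[t]: (segment start, phone)
    best[0] = 0.0
    for end in range(1, num_frames + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] == -math.inf:
                continue
            for phone in phones:
                score = best[start] + segment_score(phone, start, end)
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, phone)
    if num_frames > 0 and back[num_frames] is None:
        raise ValueError("no valid segmentation found")
    # Trace back the winning segmentation from the last frame.
    segments = []
    end = num_frames
    while end > 0:
        start, phone = back[end]
        segments.append((phone, start, end))
        end = start
    return best[num_frames], segments[::-1]
```
]]></Paragraph>
    <Paragraph> The recursion keeps only the best-scoring path into each frame boundary, so it finds the single most likely sequence and segmentation rather than summing over all segmentations.</Paragraph>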
  </Section>
</Paper>