<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1038">
  <Title>Recognition Using Classification and Segmentation Scoring*</Title>
  <Section position="3" start_page="0" end_page="198" type="metho">
    <SectionTitle>
2. CLASSIFICATION AND
SEGMENTATION SCORING
2.1. General Model
</SectionTitle>
    <Paragraph position="0"> The goal of speech recognition systems is to find the most likely label sequence, A = al, ..., air given a sequence of acoustic observations, X. For simplicity, we can restrict the problem to finding the label sequence, A, and segmentation, $ = sl,...,SN, that have the highest joint likelihood given the observations. (There is typically no  explicit segmentation component in the formulation for HMMs; in this case, the underlying state sequence is analogous to the segmentation-label sequence.) The required optimization is then to find labels A* such that</Paragraph>
    <Paragraph position="2"> as is commonly used in HMMs and has been used in our previous segment modeling. However, we can consider an alternative decomposition: p(A, S, X) = p(A \[ S, X)p(S, X).</Paragraph>
    <Paragraph position="3"> In this case, the optimization problem has two compo~C1 nents a lassffication probability,&amp;quot; p(A I S,X), and a &amp;quot;probability of segmentation&amp;quot;, p(S, X). We refer to this approach as classification-in-recognition (CIR).</Paragraph>
    <Paragraph position="4"> The CIR approach has a number of potential advantages related to the use of a classification component. First, segmental features can be accommodated in this approach by constraining p(A \] X, S) to have the form p(A I Y(X), S), where y(X) is some function of the original observations. The possibilities for this function include the complete observation sequence itself, as well as fixed dimensional segmental feature vectors computed from it. A second advantage is that a number of different classifiers can be used to compute the posterior probability, including neural networks and classification trees, as well as other approaches.</Paragraph>
    <Paragraph position="5"> To simplify initial experiments, we have made the assumption that phoneme segments are generated independently. In this case (1) is rewritten as</Paragraph>
    <Paragraph position="7"> where ai is one label of the sequence, si is a single segment of the segmentation 1, and X(sl) is the portion of the observation sequence corresponding to si. Segmental features are incorporated by constraining p(a~ IX(s0, s~) to be of the form p(a~ If(X(sl)), s0, as mentioned above.</Paragraph>
    <Paragraph position="8"> There are a number of segment-based systems that take a classification approach to recognition \[1, 2, 3\]. With the exception of \[2\], however, these do not include an explicit computation of the segmentation probability. Our 1 If si is defined as the start and end times of the segment, clearly consecutive si are not independent. To avoid this problem, we think of si as corresponding to the length of the segment.</Paragraph>
    <Paragraph position="9"> approach differs from \[2\] in the types of models used and in the method of obtaining the segmentation score. In \[2\], the classification and segmentation probabilities are estimated with separate multi-layer perceptrons.</Paragraph>
    <Section position="1" start_page="197" end_page="197" type="sub_section">
      <SectionTitle>
2.2. Classification Component
</SectionTitle>
      <Paragraph position="0"> The formulation described above is quite general, allowing the use of a number of different classification and segmentation components. The particular classifier used in the experiments described below is based on the Stochastic Segment Model (SSM) \[4\], an approach that uses segmental measurements in a statistical framework. This model represents the probability of a phoneme based on the joint statistics of an entire segment of speech. Several variants of the SSM have been developed since its introduction \[5, 6\], and recent work has shown this model to be comparable in performance to hidden-Markov model systems for the task of word recognition \[7\]. The use of the SSM for classification in the CIR formalism is described next.</Paragraph>
      <Paragraph position="1"> Using the formalism of \[4\], p(X(8i)\[8i, ai) is characterized as p(f(X(si))\[si,ai), where f(.) is a linear time warping transformation that maps variable length X(sl) to a fixed length sequence of vectors Y = f(X(si)). The specific model for Y is multi-variate Gaussian, generally subject to some assumptions about the covariance structure to reduce the number of free parameters in the model. The posterior probability used in the classification work here is obtained from this distribution according to p(f(X(si)) I hi, si) p(ai, si) p(ai I f(X(si)), si) = Ea, p(f(X(si)) I hi, si) p(ai, si)&amp;quot; There are more efficient methods for direct computation of the posterior distribution p(ai \[ f(X(si)), si), such as with tree-based classifiers or neural networks. However, the above formulation, which uses class-conditional densities of the observations, p(f(X(si)) \[ai,si), has the advantage that we can directly compare the CIR approach to the traditional approach and therefore better understand the issues associated with using fixed-length measurements and the effect of the segmentation score.</Paragraph>
      <Paragraph position="2"> In addition, this approach allows us to take advantage of recent improvements to the SSM, such as the dynamical system model \[6\], at a potentially lower cost due to subsampling of observations.</Paragraph>
    </Section>
    <Section position="2" start_page="197" end_page="198" type="sub_section">
      <SectionTitle>
2.3. Segmentation Component
</SectionTitle>
      <Paragraph position="0"> There are several possibilities for estimating the segmentation probability, and two fundamentally different approaches are explored here. First we note that we can  estimate either p(S I x) or p(S, X) for the segmentation probability, leading to the two equivalent expressions in 0).</Paragraph>
      <Paragraph position="1"> One method is to simply compute a mixture distribution of segment probabilities to find p(sl, X(si)):</Paragraph>
      <Paragraph position="3"> where {cj } is a set of classes, such as linguistic classes or context-independent phones. In order to find the score for the complete sequence of observations, the terms in the summation in (3) are instances of the more traditional formulation of (2). This method uses the complete observation sequence, as in \[4\], to determine the segmentation probabilities, as opposed to the features used for classification, which may be substantially reduced from the original observations and may lack some cues to segment boundaries, such as transitional acoustic events.</Paragraph>
      <Paragraph position="4"> Another method for computing the segmentation probability, similar to that presented in \[2\], is to find the posterior probability p(S \[ X). In this approach, we use distributions that model presence versus absence of a segment boundary at each frame, based on local features.</Paragraph>
      <Paragraph position="5"> The segmentation probability is written as</Paragraph>
      <Paragraph position="7"> where bL is the event that there is a boundary after frame L and bj is the event that there is not a boundary after the jth frame of the segment. We estimate the frame boundary probabilities as</Paragraph>
      <Paragraph position="9"> The component conditional probabilities are computed as</Paragraph>
      <Paragraph position="11"> where fl ranges over the manner-of-articulation phoneme classes: stops, nasals, fricatives, liquids, vowels, and additionally, silence.</Paragraph>
      <Paragraph position="12"> The two segmentation models presented have different advantages. The first method makes use of the complete set of SSM phone models in determining likely boundaries for each segment and hence may have a more complete model of the speech process. On the other hand, the second approach uses models explicitly trained to differentiate between boundary and non-boundary acoustic events. The best choice of segmentation score is an empirical question that we have begun to address in this work.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>