<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3209"> <Title>Comparing and Combining Generative and Posterior Probability Models: Some Advances in Sentence Boundary Detection in Speech</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Sentence Segmentation Task </SectionTitle>
<Paragraph position="0"> The sentence boundary detection problem is depicted in Figure 1 in the source-channel framework. The speaker intends to say something, chooses the word string, and imposes prosodic cues (duration, emphasis, intonation, etc.). This signal goes through the speech production channel to generate an acoustic signal. A speech recognizer determines the most likely word string given this signal. To detect possible sentence boundaries in the recognized word string, prosodic features are extracted from the signal and combined with textual cues obtained from the word string. At issue in this paper is the final box in the diagram: how to model and combine the available knowledge sources to find the most accurate hypotheses.</Paragraph>
<Paragraph position="1"> Note that this problem differs from the sentence boundary detection problem for written text in the natural language processing literature (Schmid, 2000; Palmer and Hearst, 1994; Reynar and Ratnaparkhi, 1997). Here we are dealing with spoken language; therefore, there is no punctuation information, the words are not capitalized, and the transcripts from the recognition output are errorful. This lack of textual cues is partly compensated for by prosodic information (timing, pitch, and energy patterns) conveyed by speech. Also note that in spontaneous conversational speech "sentence" is not always a straightforward notion. For our purposes we use the definition of a "sentence-like unit", or SU, as defined by the LDC for labeling and evaluation purposes (Strassel, 2003).</Paragraph>
<Paragraph position="2"> The training data has SU boundaries marked by annotators, based on both the recorded speech and its transcription. In testing, a system has to recover both the words and the locations of sentence boundaries, denoted by (W, E) = w_1, e_1, w_2, ..., e_{n-1}, w_n, where W represents the string of word tokens and E the inter-word boundary events (sentence boundary or no boundary).</Paragraph>
<Paragraph position="3"> The system output is scored by first finding a minimum edit distance alignment between the hypothesized word string and the reference, and then comparing the aligned event labels. The SU error rate is defined as the total number of deleted or inserted SU boundary events, divided by the number of true SU boundaries. (Footnote 1: This is the same as simple per-event classification accuracy, except that the denominator counts only the "marked" events, thereby yielding error rates that are much higher than if one uses all potential boundary locations.) For diagnostic purposes a secondary evaluation condition allows use of the correct word transcripts. This condition allows us to study the segmentation task without the confounding effect of speech recognition errors, using perfect lexical information.</Paragraph> </Section>
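To make the scoring above concrete, the following is a minimal Python sketch of the SU error rate for the reference-transcript condition, in which hypothesis and reference word strings are identical and the per-boundary event labels are therefore already aligned (no edit-distance alignment step is needed). The function and label names are illustrative and are not the evaluation tools used in the paper.

    # SU error rate = (deleted + inserted SU boundaries) / number of reference SUs.
    # Event labels per word boundary: "SU" for a sentence-like unit boundary, "" for none.

    def su_error_rate(ref_events, hyp_events):
        """Score aligned per-boundary event labels as described in Section 2."""
        assert len(ref_events) == len(hyp_events)
        deletions = sum(r == "SU" and h != "SU" for r, h in zip(ref_events, hyp_events))
        insertions = sum(r != "SU" and h == "SU" for r, h in zip(ref_events, hyp_events))
        num_ref_su = sum(r == "SU" for r in ref_events)
        return (deletions + insertions) / num_ref_su if num_ref_su else 0.0

    # One deletion and one insertion against two reference SUs gives an error rate of 1.0.
    print(su_error_rate(["", "SU", "", "SU"], ["SU", "SU", "", ""]))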
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Features and Knowledge Sources </SectionTitle>
<Paragraph position="0"> Words and sentence boundaries are mutually constrained via syntactic structure. Therefore, the word identities themselves (from automatic recognition or human transcripts) constitute a primary knowledge source for the sentence segmentation task. We also make use of various automatic taggers that map the word sequence to other representations. The TnT tagger (Brants, 2000) is used to obtain part-of-speech (POS) tags. A TBL chunker trained on the Wall Street Journal corpus (Ngai and Florian, 2001) maps each word to an associated chunk tag, encoding chunk type and relative word position (beginning of an NP, inside a VP, etc.). The tagged versions of the word stream are provided to allow generalizations based on syntactic structure and to smooth out possibly undertrained word-based probability estimates. For the same reasons we also generate word class labels that are automatically induced from bigram word distributions (Brown et al., 1992).</Paragraph>
<Paragraph position="1"> To model the prosodic structure of sentence boundaries, we extract several hundred features around each word boundary. These are based on the acoustic alignments produced by a speech recognizer (or forced alignments of the true words when given). The features capture duration, pitch, and energy patterns associated with the word boundaries. Informative features include the pause duration at the boundary, the difference in pitch before and after the boundary, and so on. A crucial aspect of many of these features is that they are highly correlated (e.g., by being derived from the same raw measurements via different normalizations), real-valued (not discrete), and possibly undefined (e.g., unvoiced speech regions have no pitch).</Paragraph>
<Paragraph position="2"> These properties make prosodic features difficult to model directly in either of the approaches we are examining in the paper. Hence, we have resorted to a modular approach: the information from prosodic features is modeled separately by a decision tree classifier that outputs posterior probability estimates P(e_i | f_i), where f_i is the prosodic feature vector associated with the word boundary. Conveniently, this approach also permits us to include some non-prosodic features that are highly relevant for the task, but not otherwise represented, such as whether a speaker (turn) change occurred at the location in question. (Footnote 2: Here we are glossing over some details on prosodic modeling that are orthogonal to the discussion in this paper. For example, instead of simple decision trees we actually use ensemble bagging to reduce the variance of the classifier (Liu et al., 2004).)</Paragraph>
<Paragraph position="3"> A practical issue that greatly influences model design is that not all information sources are available uniformly for all training data. For example, prosodic modeling assumes acoustic data, whereas word-based models can be trained on text-only data, which is usually available in much larger quantities.</Paragraph>
<Paragraph position="4"> This poses a problem for approaches that model all relevant information jointly and is another strong motivation for modular approaches.</Paragraph> </Section>
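As a concrete illustration of the modular prosody model described above, here is a hedged Python sketch using scikit-learn. The paper's actual classifier is a decision tree with ensemble bagging rather than a single scikit-learn tree, and the feature values and names below are invented for the example.

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    # Toy prosodic feature vectors per word boundary:
    # [pause duration in seconds, pitch difference across the boundary].
    # np.nan marks undefined values (e.g., no pitch in unvoiced regions).
    X_train = np.array([[0.80, -30.0], [0.02, 5.0], [1.20, np.nan], [0.05, 2.0]])
    y_train = np.array([1, 0, 1, 0])  # 1 = SU boundary, 0 = no boundary

    prosody_model = make_pipeline(
        SimpleImputer(strategy="mean"),   # one simple way to handle undefined features
        DecisionTreeClassifier(min_samples_leaf=1, random_state=0),
    )
    prosody_model.fit(X_train, y_train)

    # Posterior estimate P(e_i = SU | f_i) for a new boundary, to be passed on
    # to the HMM or maxent combination described in Section 4.
    print(prosody_model.predict_proba([[0.60, -20.0]])[:, 1])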
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 The Models </SectionTitle>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Hidden Markov Model for Segmentation </SectionTitle>
<Paragraph position="0"> Our baseline model, and the one that forms the basis of much of the prior work on acoustic sentence segmentation (Shriberg et al., 2000; Gotoh and Renals, 2000; Christensen, 2001; Kim and Woodland, 2001), is a hidden Markov model. The states of the model correspond to words w_i and following event labels e_i. The observations associated with the states are the words, as well as other (mainly prosodic) features f_i. Figure 2 shows a graphical model representation of the variables involved. Note that the words appear in both the states and the observations, such that the word stream constrains the possible hidden states to matching words; the ambiguity in the task stems entirely from the choice of events.</Paragraph>
<Paragraph position="1"> Figure 2 (caption): HMM for the sentence segmentation problem. Only one word+event is depicted in each state, but in a model based on N-grams the previous N-1 tokens would condition the transition to the next state.</Paragraph>
<Paragraph position="2"> Standard algorithms are available to extract the most probable state (and thus event) sequence given a set of observations. The error metric is based on classification of individual word boundaries. Therefore, rather than finding the highest probability sequence of events, we identify the events with the highest posterior individually at each boundary i: ê_i = argmax_{e_i} P(e_i | W, F) (1) where W and F are the words and features for the entire test sequence, respectively. The individual event posteriors are obtained by applying the forward-backward algorithm for HMMs (Rabiner and Juang, 1986).</Paragraph>
<Paragraph position="3"> Training of the HMM is supervised since event-labeled data is available. There are two sets of parameters to estimate. The state transition probabilities are estimated using a hidden event N-gram LM (Stolcke and Shriberg, 1996). The LM is obtained with standard N-gram estimation methods from data that contains the word+event tags in sequence, w_1, e_1, w_2, ..., e_{n-1}, w_n. The N-gram estimator maximizes the joint word+event sequence likelihood P(W, E) on the training data (modulo smoothing), and does not guarantee that the correct event posteriors needed for classification according to Equation (1) are maximized.</Paragraph>
<Paragraph position="4"> The second set of HMM parameters are the observation likelihoods P(f_i | w_i, e_i). Instead of training a likelihood model we make use of the prosodic classifiers described in Section 3. We have at our disposal decision trees that estimate P(e_i | f_i). If we further assume that prosodic features are independent of words given the event type (a reasonable simplification if features are chosen appropriately), observation likelihoods may be obtained by P(f_i | w_i, e_i) ≈ P(f_i | e_i) = P(e_i | f_i) P(f_i) / P(e_i) (2) Since P(f_i) is constant we can ignore it when carrying out the maximization (1).</Paragraph>
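The following is a hedged Python sketch of how Equations (1) and (2) come together in the HMM: a forward-backward pass over the two possible events at each word boundary, with transition scores standing in for the hidden-event N-gram LM (here conditioned only on the previous event, since the word string is fixed) and emission scores obtained from the prosodic posteriors via Equation (2). All numbers are illustrative, not trained values.

    import numpy as np

    # Event index 0 = no boundary, 1 = SU boundary.

    def boundary_posteriors(lm_trans, prosody_post, event_prior):
        """Per-boundary posteriors P(e_i | W, F) by forward-backward over events.

        lm_trans[i][k][j]  : LM-based P(e_i = j | e_{i-1} = k) for the fixed word context
        prosody_post[i][j] : decision-tree posterior P(e_i = j | f_i)
        event_prior[j]     : P(e = j) estimated from the training data
        """
        n = len(prosody_post)
        emit = prosody_post / event_prior          # P(f_i | e_i) up to a constant, Eq. (2)
        fwd = np.zeros((n, 2))
        bwd = np.ones((n, 2))
        fwd[0] = lm_trans[0][0] * emit[0]          # assume a fixed start context (row 0)
        for i in range(1, n):
            fwd[i] = (fwd[i - 1] @ lm_trans[i]) * emit[i]
        for i in range(n - 2, -1, -1):
            bwd[i] = lm_trans[i + 1] @ (emit[i + 1] * bwd[i + 1])
        post = fwd * bwd
        return post / post.sum(axis=1, keepdims=True)

    # Two word boundaries with made-up LM transitions, prosodic posteriors, and priors.
    lm = np.array([[[0.8, 0.2], [0.8, 0.2]],
                   [[0.7, 0.3], [0.9, 0.1]]])
    pros = np.array([[0.4, 0.6], [0.9, 0.1]])
    prior = np.array([0.85, 0.15])
    print(boundary_posteriors(lm, pros, prior))    # argmax per row implements Eq. (1)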
<Paragraph position="5"> The HMM structure makes strong independence assumptions: (1) that features depend only on the current state (and in practice, as we saw, only on the event label), and (2) that each word+event label depends only on the last N-1 tokens. (Footnote 3: To utilize word+event contexts of length greater than one we have to employ HMMs of order 2 or greater, or equivalently, make the entire word+event N-gram be the state.) In return, we get a computationally efficient structure that allows information from the entire sequence W, F to inform the posterior probabilities needed for classification, via the forward-backward algorithm. More problematic in practice is the integration of multiple word-level features, such as POS tags and chunker output. Theoretically, all tags could simply be included in the hidden state representation to allow joint modeling of words, tags, and events. However, this would drastically increase the size of the state space, making robust model estimation with standard N-gram techniques difficult. A method that works well in practice is linear interpolation, whereby the conditional probability estimates of various models are simply averaged, thus reducing variance. In our case, we obtain good results by interpolating a word-N-gram model with one based on automatically induced word classes (Brown et al., 1992).</Paragraph>
<Paragraph position="6"> Similarly, we can interpolate LMs trained from different corpora. This is usually more effective than pooling the training data because it allows control over the contributions of the different sources. For example, we have a small corpus of training data labeled precisely to the LDC's SU specifications, but a much larger (130M word) corpus of standard broadcast news transcripts with punctuation, from which an approximate version of SUs could be inferred. The larger corpus should get a larger weight on account of its size, but a lower weight given the mismatch of the SU labels. The right compromise was found by tuning the interpolation weight of the two LMs empirically on held-out data.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Maxent Posterior Probability Model </SectionTitle>
<Paragraph position="0"> As observed, HMM training does not maximize the posterior probabilities of the correct labels. This mismatch between training and use of the model as a classifier would not arise if the model directly estimated the posterior boundary label probabilities P(e_i | W, F). A second problem with HMMs is that the underlying N-gram sequence model does not cope well with multiple representations (features) of the word sequence (words, POS, etc.) short of building a joint model of all variables. This type of situation is well-suited to a maximum entropy formulation (Berger et al., 1996), which allows conditioning features to apply simultaneously, and therefore gives greater freedom in choosing representations. Another desirable characteristic of maxent models is that they do not split the data recursively to condition their probability estimates, which makes them more robust than decision trees when training data is limited.</Paragraph>
<Paragraph position="1"> We built a posterior probability model for sentence boundary classification in the maxent framework. Such a model takes the familiar exponential form P(e | W, F) = exp( Σ_k λ_k g_k(e, W, F) ) / Z_λ(W, F) (3) where Z_λ(W, F) is a normalization term and the g_k(e, W, F) are indicator functions corresponding to (complex) features defined over events, words, and prosodic features. (Footnote 4: We omit the index i from e here since the "current" event is meant in all cases.) For example, one such feature function might fire exactly when e is an SU boundary and the word immediately preceding the boundary is a particular trigger word. The parameters λ_k are chosen such that the expected values of the various feature functions under the model, E[g_k], match their empirical averages in the training data. It can be shown that the resulting model has maximal entropy among all the distributions satisfying these expectation constraints. At the same time, the parameters so chosen maximize the conditional likelihood Π_i P(e_i | W, F) over the training data, subject to the constraints of the exponential form given by Equation (3). (Footnote 5: In our experiments we used the L-BFGS parameter estimation method, with Gaussian-prior smoothing (Chen and Rosenfeld, 1999) to avoid overfitting.) The conditional likelihood is closely related to the individual event posteriors used for classification, meaning that this type of model explicitly optimizes discrimination of correct from incorrect labels.</Paragraph>
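To see Equation (3) in action, here is a small self-contained Python sketch: a handful of binary indicator features with made-up weights, and the per-boundary normalization that turns their weighted sum into a posterior over the two events. The feature names and weights are purely illustrative, not the ones learned in the paper.

    import math

    def maxent_posterior(active_features, weights, events=("NONE", "SU")):
        """P(e | W, F) = exp(sum_k lambda_k g_k(e, W, F)) / Z_lambda(W, F), as in Eq. (3)."""
        scores = {}
        for e in events:
            # g_k contributes only if its event part matches e and its word/prosody
            # predicate is true at this boundary (i.e., the feature is "active").
            scores[e] = math.exp(sum(weights[f] for f in active_features.get(e, [])))
        z = sum(scores.values())                   # normalization term Z_lambda(W, F)
        return {e: s / z for e, s in scores.items()}

    # Illustrative weights lambda_k for a few (event, predicate) indicator features.
    weights = {
        ("SU", "prev_word=yeah"): 0.9,
        ("SU", "prosody_post>0.7"): 1.4,
        ("NONE", "prev_pos=DT"): 1.1,
    }

    # Features whose predicates hold at one particular boundary, grouped by event.
    active = {
        "SU": [("SU", "prev_word=yeah"), ("SU", "prosody_post>0.7")],
        "NONE": [],
    }
    print(maxent_posterior(active, weights))       # posterior over {NONE, SU}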
<Paragraph position="2"> Even though the mathematical formulation gives us the freedom to use features that are overlapping or otherwise dependent, we still have to choose a subset that is informative and parsimonious, so as to give good generalization and robust parameter estimates. Various feature selection algorithms for maxent models have been proposed, e.g., (Berger et al., 1996). However, since computational efficiency was not an issue in our experiments, we included all features that corresponded to information available to our baseline approach, as listed below. We did eliminate features that were triggered only once in the training set, to improve robustness and to avoid overconstraining the model.</Paragraph>
<Paragraph position="3"> Word N-grams. We use combinations of preceding and following words to encode the word context of the event, e.g., the word before the boundary (w_{-1}), the word after it (w_{+1}), and combinations such as (w_{-1}, w_{+1}) and (w_{-2}, w_{-1}), where w_{-1} refers to the word before the boundary of interest.</Paragraph>
<Paragraph position="4"> POS N-grams. POS tags are the same as used for the HMM approach. The features capturing POS context are similar to those based on word tokens.</Paragraph>
<Paragraph position="5"> Chunker tags. These are used similarly to POS and word features, except we use tags encoding chunk type (NP, VP, etc.) and word position within the chunk (beginning versus inside). (Footnote 6: Chunker features were only used for broadcast news data, due to the poor chunking performance on conversational speech.)</Paragraph>
<Paragraph position="6"> Word classes. These are similar to N-gram patterns but over automatically induced classes.</Paragraph>
<Paragraph position="7"> Turn flags. Since speaker change often marks an SU boundary, we use this binary feature. Note that in the HMM approach this feature had to be grouped with the prosodic features and handled by the decision tree. In the maxent approach we can use it separately.</Paragraph>
<Paragraph position="8"> Prosody. As we described earlier, decision tree classifiers are used to generate the posterior probabilities p(e_i | f_i). Since the maxent classifier is most conveniently used with binary features, we encode the prosodic posteriors into several binary features via thresholding. Equation (3) shows that the presence of each feature in a maxent model has a monotonic effect on the final probability (raising or lowering it by a constant factor e^{λ_k g_k}). This suggests encoding the decision tree posteriors in a cumulative fashion through a series of binary features, for example, p > 0.1, p > 0.3, p > 0.5, p > 0.7, p > 0.9 (a small sketch of this encoding is given after this feature list). This representation is also more robust to mismatch between the posterior probabilities in the training and test sets, since small changes in the posterior value affect at most one feature. Note that the maxent framework does allow the use of real-valued feature functions, but preliminary experiments have shown no gain compared to the binary features constructed as described above. Still, this is a topic for future research.</Paragraph>
<Paragraph position="9"> Auxiliary LM. As mentioned earlier, additional text-only language model training data is often available. In the HMM model we incorporated auxiliary LMs by interpolation, which is not possible here since there is no LM per se, but rather N-gram features. However, we can use the same trick as we used for the prosodic features: a word-only HMM is used to estimate posterior event probabilities according to the auxiliary LM, and these posteriors are then thresholded to yield binary features.</Paragraph>
<Paragraph position="10"> Combined features. To date we have not fully investigated compound features that combine different knowledge sources and are able to model the interaction between them explicitly. We only included a limited set of such features, for example, a combination of the decision tree hypothesis and POS contexts.</Paragraph> </Section>
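Below is a minimal Python sketch of the cumulative thresholding mentioned under "Prosody" above (and reused for the auxiliary-LM posteriors): a real-valued posterior becomes several overlapping binary indicators, so a small change in the posterior flips at most one feature. Threshold values follow the example in the text; the feature naming is illustrative.

    def cumulative_binary_features(posterior, prefix="prosody_post",
                                   thresholds=(0.1, 0.3, 0.5, 0.7, 0.9)):
        """Map a posterior in [0, 1] to binary indicator features such as 'prosody_post>0.5'."""
        return {f"{prefix}>{t}": int(posterior > t) for t in thresholds}

    # A posterior of 0.62 switches on the >0.1, >0.3, and >0.5 indicators only.
    print(cumulative_binary_features(0.62))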
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Differences Between HMM and Maxent </SectionTitle>
<Paragraph position="0"> We have already discussed the differences between the two approaches regarding the training objective function (joint likelihood versus conditional likelihood) and with respect to the handling of dependent word features (model interpolation versus integrated modeling via maxent). On both counts the maxent classifier should be superior to the HMM.</Paragraph>
<Paragraph position="1"> However, the maxent approach also has some theoretical disadvantages compared to the HMM by design. One obvious shortcoming is that some information gets lost in the thresholding that converts posterior probabilities from the prosodic model and the auxiliary LM into binary features.</Paragraph>
<Paragraph position="2"> A more qualitative limitation of the maxent model is that it only uses local evidence (the surrounding word context and the local prosodic features). In that respect, the maxent model resembles the conditional probability model at the individual HMM states. The HMM as a whole, however, through the forward-backward procedure, propagates evidence from all parts of the observation sequence to any given decision point. Variants such as the conditional Markov model (CMM) combine sequence modeling with posterior probability (e.g., maxent) modeling, but it has been shown that CMMs are still structurally inferior to HMMs because they only propagate evidence forward in time, not backwards (Klein and Manning, 2002).</Paragraph> </Section> </Section> </Paper>