<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1056"> <Title>Using Conditional Random Fields For Sentence Boundary Detection In Speech</Title> <Section position="4" start_page="451" end_page="452" type="metho"> <SectionTitle> 1.1 Sentence Segmentation Using HMM </SectionTitle> <Paragraph position="0"> An N-gram LM is used to calculate the transition probabilities.</Paragraph> <Paragraph position="2"> In the HMM, the forward-backward algorithm is used to determine the event with the highest posterior probability for each interword boundary: E_i = argmax_{E_i} P(E_i | W, F).</Paragraph> <Paragraph position="4"> The HMM is a generative modeling approach, since it describes a stochastic process with hidden variables (the sentence boundaries) that produces the observable data. This HMM approach has two main drawbacks. First, standard training methods maximize the joint probability of the observed and hidden events, as opposed to the posterior probability of the correct hidden variable assignment given the observations, which would be a criterion more closely related to classification performance. Second, the N-gram LM underlying the HMM transition model makes it difficult to use features that are highly correlated (such as words and POS labels) without greatly increasing the number of model parameters, which in turn would make robust estimation difficult. More details about using textual information in the HMM system are provided in Section 3.</Paragraph> <Section position="1" start_page="451" end_page="452" type="sub_section"> <SectionTitle> 1.2 Sentence Segmentation Using Maxent </SectionTitle> <Paragraph position="0"> A maximum entropy (Maxent) posterior classification method has been evaluated in an attempt to overcome some of the shortcomings of the HMM approach (Liu et al., 2004; Huang and Zweig, 2002).</Paragraph> <Paragraph position="1"> For a boundary position i, the Maxent model takes the exponential form P(E_i | T_i, F_i) = (1/Z(T_i, F_i)) exp(sum_k lambda_k g_k(E_i, T_i, F_i)), where T_i represents textual information.
The indicator functions g_k(E_i, T_i, F_i) correspond to features defined over events, words, and prosody. (In the prosody model implementation, we ignore the word identity in the conditions, only using the timing or word alignment information.)</Paragraph> <Paragraph position="7"> Figure 1: The HMM topology for the sentence boundary detection problem. Only one word+event pair is depicted in each state, but in a model based on N-grams, the previous N-1 tokens would condition the transition to the next state. O are observations consisting of words W and prosodic features F, and E are sentence boundary events.</Paragraph> <Paragraph position="8"> The parameters in Maxent are chosen to maximize the conditional likelihood P(E_i | T_i, F_i) over the training data, better matching the classification accuracy metric. The Maxent framework provides a more principled way to combine the largely correlated textual features, as confirmed by the results of (Liu et al., 2004); however, it does not model the state sequence.</Paragraph> <Paragraph position="11"> A simple combination of the results from the Maxent and HMM was found to improve upon the performance of either model alone (Liu et al., 2004) because of the complementary strengths and weaknesses of the two models. An HMM is a generative model, yet it is able to model the sequence via the forward-backward algorithm. Maxent is a discriminative model; however, it makes decisions locally, without using sequential information.</Paragraph> <Paragraph position="12"> A conditional random field (CRF) model (Lafferty et al., 2001) combines the benefits of the HMM and Maxent approaches. Hence, in this paper we evaluate the performance of the CRF model and relate the results to those using the HMM and Maxent approaches on the sentence boundary detection task.
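The Maxent posterior described above can be sketched in a few lines of Python. The weights and feature names below are hypothetical placeholders (the actual models were trained on the corpora described in Section 3), so this is only a minimal illustration of the exponential form:

```python
import math

def maxent_posterior(weights, active_features, events=("SU", "no-SU")):
    # Exponential-form posterior P(E_i | T_i, F_i):
    # exp(sum_k lambda_k g_k(E_i, T_i, F_i)) / Z, where the g_k are binary
    # indicator features and Z normalizes over the candidate events.
    scores = {e: math.exp(sum(weights.get((e, f), 0.0)
                              for f in active_features))
              for e in events}
    z = sum(scores.values())  # normalization term
    return {e: s / z for e, s in scores.items()}

# Hypothetical weights and feature names, for illustration only.
w = {("SU", "word_-1=okay"): 1.2, ("SU", "pause=long"): 0.8,
     ("no-SU", "word_-1=okay"): -0.3}
post = maxent_posterior(w, ["word_-1=okay", "pause=long"])
```

Note that each boundary is classified from its own features alone, which is exactly the local-decision property contrasted with the HMM in the text.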
The rest of the paper is organized as follows.</Paragraph> <Paragraph position="13"> Section 2 describes the CRF model and discusses how it differs from the HMM and Maxent models.</Paragraph> <Paragraph position="14"> Section 3 describes the data and features used in the models to be compared. Section 4 summarizes the experimental results for the sentence boundary detection task. Conclusions and future work appear in Section 5.</Paragraph> </Section> </Section> <Section position="5" start_page="452" end_page="452" type="metho"> <SectionTitle> 2 CRF Model Description </SectionTitle> <Paragraph position="0"> A CRF is a random field that is globally conditioned on an observation sequence O. CRFs have been successfully used for a variety of text processing tasks (Lafferty et al., 2001; Sha and Pereira, 2003; McCallum and Li, 2003), but they have not been widely applied to a speech-related task with both acoustic and textual knowledge sources. The top graph in Figure 2 is a general CRF model. The states of the model correspond to event labels E. The observations O are composed of the textual features, as well as the prosodic features. The most likely event sequence E* for the given input sequence (observations) O is E* = argmax_E exp(sum_k lambda_k G_k(E, O)) / Z(O),</Paragraph> <Paragraph position="2"> where the functions G_k are potential functions over the events and the observations, and Z is the normalization term Z(O) = sum_E exp(sum_k lambda_k G_k(E, O)).</Paragraph> <Paragraph position="4"> Even though a CRF itself places no restriction on the potential functions G_k(E, O), to simplify the model (considering computational cost and the limited training set size), we use a first-order CRF in this investigation, as shown at the bottom of Figure 2. In this model, an observation O_i is associated with each state E_i.</Paragraph> <Paragraph position="6"> The model is trained to maximize the conditional log-likelihood of a given training set.
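Under a first-order factorization of this kind, the normalization term Z(O) can be computed with the forward algorithm instead of enumerating all label sequences. The sketch below uses made-up local and transition scores (not the paper's trained potentials) and checks that the globally normalized sequence probabilities sum to one:

```python
import math
from itertools import product

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def sequence_score(emit, trans, labels):
    # Unnormalized log score of one label sequence:
    # sum_i emit[i][E_i] + sum_i trans[(E_{i-1}, E_i)].
    s = emit[0][labels[0]]
    for i in range(1, len(emit)):
        s += trans[(labels[i - 1], labels[i])] + emit[i][labels[i]]
    return s

def log_partition(emit, trans, states=(0, 1)):
    # Forward algorithm: log Z(O) for a first-order (linear-chain) CRF.
    alpha = {y: emit[0][y] for y in states}
    for i in range(1, len(emit)):
        alpha = {y: logsumexp([alpha[yp] + trans[(yp, y)] for yp in states])
                    + emit[i][y] for y in states}
    return logsumexp(list(alpha.values()))

# Toy scores: 3 boundaries, labels 0 (no SU) and 1 (SU).
emit = [{0: 0.5, 1: -0.2}, {0: 0.1, 1: 0.3}, {0: -0.4, 1: 0.2}]
trans = {(0, 0): 0.2, (0, 1): -0.1, (1, 0): 0.0, (1, 1): 0.1}
log_z = log_partition(emit, trans)
total = sum(math.exp(sequence_score(emit, trans, seq) - log_z)
            for seq in product((0, 1), repeat=3))
```

The brute-force sum over all 2^3 sequences equals one, confirming that the forward recursion computes the same Z(O) as explicit enumeration.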
Similar to the Maxent model, the conditional likelihood is closely related to the individual event posteriors used for classification, enabling this type of model to explicitly optimize discrimination of correct from incorrect labels. The most likely sequence is found using the Viterbi algorithm. A CRF differs from an HMM with respect to its training objective function (joint versus conditional likelihood) and its handling of dependent word features. Traditional HMM training does not maximize the posterior probabilities of the correct labels, whereas the CRF directly estimates the posterior boundary label probabilities P(E|O).</Paragraph> <Paragraph position="8"> Figure 2: A general CRF and the first-order CRF used for the sentence boundary detection problem. E represents the state tags (i.e., sentence boundary or not). O are observations consisting of words W or derived textual features T and prosodic features F.</Paragraph> <Paragraph position="9"> The underlying N-gram sequence model of an HMM does not cope well with multiple representations (features) of the word sequence (e.g., words, POS), especially when the training set is small; however, the CRF model supports simultaneous correlated features, and therefore gives greater freedom for incorporating a variety of knowledge sources. A CRF differs from the Maxent method with respect to its ability to model sequence information. The primary advantage of the CRF over the Maxent approach is that the model is optimized globally over the entire sequence, whereas the Maxent model makes a local decision, as shown in Equation (2), without utilizing any state dependency information.</Paragraph> <Paragraph position="10"> We use the Mallet package (McCallum, 2002) to implement the CRF model.
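As a companion sketch, Viterbi decoding for a first-order model of this kind can be written as below. The scores are illustrative log-domain numbers, not values from the Mallet implementation used in the paper:

```python
def viterbi(emit, trans, states=(0, 1)):
    # Most likely label sequence argmax_E P(E|O) for a first-order model.
    # emit[i][y]: local score of label y at boundary i;
    # trans[(yp, y)]: transition score from label yp to label y.
    delta = {y: emit[0][y] for y in states}
    backpointers = []
    for i in range(1, len(emit)):
        new_delta, bp = {}, {}
        for y in states:
            best = max(states, key=lambda yp: delta[yp] + trans[(yp, y)])
            new_delta[y] = delta[best] + trans[(best, y)] + emit[i][y]
            bp[y] = best
        delta = new_delta
        backpointers.append(bp)
    # Trace the best path backward from the highest-scoring final label.
    last = max(states, key=lambda y: delta[y])
    path = [last]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return path[::-1]

# Boundary 0 strongly prefers label 1, boundary 1 prefers label 0.
path = viterbi([{0: 0.0, 1: 2.0}, {0: 2.0, 1: 0.0}],
               {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0})
```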
To avoid overfitting, we employ a Gaussian prior with a zero mean on the parameters (Chen and Rosenfeld, 1999), similar to what is used for training Maxent models (Liu et al., 2004).</Paragraph> </Section> <Section position="6" start_page="452" end_page="454" type="metho"> <SectionTitle> 3 Experimental Setup </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="452" end_page="453" type="sub_section"> <SectionTitle> 3.1 Data and Task Description </SectionTitle> <Paragraph position="0"> The sentence-like units in speech are different from those in written text. In conversational speech, these units can be well-formed sentences, phrases, or even a single word. These units are called SUs in the DARPA EARS program. SU boundaries, as well as other structural metadata events, were annotated by LDC according to an annotation guideline (Strassel, 2003). Both the transcription and the recorded speech were used by the annotators when labeling the boundaries.</Paragraph> <Paragraph position="1"> The SU detection task is conducted on two corpora: Broadcast News (BN) and Conversational Telephone Speech (CTS). BN and CTS differ in genre and speaking style. The average length of SUs is longer in BN than in CTS, that is, 12.35 words (standard deviation 8.42) in BN compared to 7.37 words (standard deviation 8.72) in CTS. This difference is reflected in the frequency of SU boundaries: about 14% of interword boundaries are SUs in CTS compared to roughly 8% in BN. Training and test data for the SU detection task are those used in the NIST Rich Transcription 2003 Fall evaluation.</Paragraph> <Paragraph position="2"> We use both the development set and the evaluation set as the test set in this paper in order to obtain more meaningful results. For CTS, there are about 40 hours of conversational data (around 480K words) from the Switchboard corpus for training and 6 hours (72 conversations) for testing. 
The BN data has about 20 hours of Broadcast News shows (about 178K words) in the training set and 3 hours (6 shows) in the test set. Note that the SU-annotated training data is only a subset of the data used for the speech recognition task because more effort is required to annotate the boundaries.</Paragraph> <Paragraph position="3"> For testing, the system determines the locations of sentence boundaries given the word sequence W and the speech. The SU detection task is evaluated on both the reference human transcriptions (REF) and speech recognition outputs (STT). Evaluation across transcription types allows us to obtain the performance for the best-case scenario when the transcriptions are correct, thus factoring out the confounding effect of speech recognition errors on the SU detection task. We use the speech recognition output obtained from the SRI recognizer (Stolcke et al., 2003).</Paragraph> <Paragraph position="4"> System performance is evaluated using the official NIST evaluation tools. System output is scored by first finding a minimum edit distance alignment between the hypothesized word string and the reference transcriptions, and then comparing the aligned event labels.</Paragraph> <Paragraph position="5"> The SU error rate is defined as the total number of deleted or inserted SU boundary events, divided by the number of true SU boundaries.
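Given already-aligned per-boundary labels, the metric just defined reduces to a few lines. The word-level alignment step itself, which the NIST tools perform by minimum edit distance, is assumed to have been done already:

```python
def su_error_rate(ref_events, hyp_events):
    # NIST-style SU error: (deleted + inserted SU boundaries) divided by
    # the number of true SU boundaries in the reference. Because the
    # denominator counts only true SUs, the rate can exceed 100%.
    deletions = sum(1 for r, h in zip(ref_events, hyp_events) if r and not h)
    insertions = sum(1 for r, h in zip(ref_events, hyp_events) if h and not r)
    return (deletions + insertions) / sum(ref_events)

# One missed SU and one false SU against two true SUs.
ref = [True, False, False, True, False]
hyp = [True, False, True, False, False]
rate = su_error_rate(ref, hyp)
```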
In addition to this NIST SU error metric, we also report a per-boundary-based metric, which instead uses the total number of interword boundaries as the denominator.</Paragraph> </Section> <Section position="2" start_page="453" end_page="454" type="sub_section"> <SectionTitle> 3.2 Feature Extraction and Modeling </SectionTitle> <Paragraph position="0"> To obtain a good-quality estimate of the conditional probability of the event tag given the observations, P(E_i | O_i), the observations should be based on features that are discriminative of the two events (SU versus not). As in (Liu et al., 2004), we utilize both textual and prosodic information.</Paragraph> <Paragraph position="3"> We extract prosodic features that capture duration, pitch, and energy patterns associated with the word boundaries (Shriberg et al., 2000). For all the modeling methods, we adopt a modular approach to modeling the prosodic features: a decision tree classifier is used to model them. During testing, the decision tree prosody model estimates posterior probabilities of the events given the associated prosodic features for a word boundary. The posterior probability estimates are then used in the various modeling approaches in different ways, as described later.</Paragraph> <Paragraph position="4"> Since words and sentence boundaries are mutually constraining, the word identities themselves (from automatic recognition or human transcriptions) constitute a primary knowledge source for sentence segmentation. We also make use of various automatic taggers that map the word sequence to other representations. Tagged versions of the word stream are provided to support various generalizations of the words and to smooth out possibly undertrained word-based probability estimates. These tags include part-of-speech tags, syntactic chunk tags, and automatically induced word classes.
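A textual feature extractor over such word or tag streams can be sketched as follows. The feature-naming scheme is illustrative only, not the paper's exact scheme; POS, chunk, or class n-grams would be generated the same way from the tagged streams:

```python
def boundary_features(tokens, i, max_n=2):
    # Word N-gram indicator features for the interword boundary that
    # follows tokens[i], looking up to max_n tokens to each side.
    feats = []
    for n in range(1, max_n + 1):
        left = tokens[max(0, i - n + 1): i + 1]   # n-gram ending at i
        right = tokens[i + 1: i + 1 + n]          # n-gram starting at i+1
        if len(left) == n:
            feats.append("L%d=%s" % (n, "_".join(left)))
        if len(right) == n:
            feats.append("R%d=%s" % (n, "_".join(right)))
    return feats

feats = boundary_features(["so", "we", "did", "okay", "and", "then"], 3)
```

Features like these are what the Maxent and CRF models consume directly as indicator functions, while the HMM instead folds the same streams into N-gram LMs.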
In addition, we use extra text corpora that were not annotated according to the guideline used for the training and test data (Strassel, 2003). For BN, we use the training corpus for the speech recognition LM. For CTS, we use the Penn Treebank Switchboard data. Both contain punctuation information, which we use to approximate SUs as defined in the annotation guideline (Strassel, 2003).</Paragraph> <Paragraph position="5"> As explained in Section 1, the prosody model and the N-gram LM can be integrated in an HMM.</Paragraph> </Section> </Section> <Section position="7" start_page="454" end_page="454" type="metho"> <SectionTitle> Table 1: Knowledge sources and their representations in the HMM, Maxent, and CRF </SectionTitle> <Paragraph position="0"> Modeling approach: HMM, generative model; Maxent and CRF, conditional approach. Sequence information: HMM, yes; Maxent, no; CRF, yes. LDC data set (words or tags): HMM, LM; Maxent and CRF, N-grams as indicator functions. Probability from prosody model: HMM, real-valued; Maxent and CRF, cumulatively binned. Additional text corpus: HMM, N-gram LM; Maxent and CRF, binned posteriors. Speaker turn change: HMM, in prosodic features; Maxent and CRF, a separate feature, in addition to being in the prosodic feature set. Compound feature: HMM, no; Maxent and CRF, POS tags and decisions from the prosody model.</Paragraph> <Paragraph position="1"> When various textual information is used, jointly modeling words and tags may be an effective way to model the richer feature set; however, a joint model requires more parameters. Since the training set for the SU detection task in the EARS program is quite limited, we use a loosely coupled approach: we linearly combine three LMs (the word-based LM from the LDC training data, the automatic-class-based LMs, and the word-based LM trained from the additional corpus).</Paragraph> <Paragraph position="2"> These interpolated LMs are then combined with the prosody model via the HMM.
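The linear combination step can be sketched as below. The interpolation weights and component probabilities are placeholders; in practice the weights would be tuned, for example on held-out data:

```python
def interpolate(probs, weights):
    # Linear interpolation P(E_i | h) = sum_j w_j * P_j(E_i | h) over the
    # three component LMs: the word-based LDC LM, the automatic-class LM,
    # and the word-based LM from the additional corpus.
    assert abs(sum(weights) - 1.0) < 1e-9  # weights must form a distribution
    return sum(w * p for w, p in zip(weights, probs))

# Placeholder event probabilities from the three component LMs.
p_su = interpolate([0.30, 0.10, 0.20], [0.5, 0.3, 0.2])
```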
The posterior probabilities of events at each boundary are obtained from this step, denoted as P(E_i | W, C, F).</Paragraph> <Paragraph position="4"> Apply the POS-based LM alone to the POS sequence (obtained by running the POS tagger on the word sequence W) and generate the posterior probabilities P(E_i | POS) for each word boundary, which are then combined with the posteriors from the previous step.</Paragraph> <Paragraph position="9"> The features used for the CRF are the same as those used for the Maxent model devised for the SU detection task (Liu et al., 2004), briefly listed below. N-grams of words or various tags (POS tags, automatically induced classes). Different Ns and different position information are used (N varies from one through four).</Paragraph> <Paragraph position="10"> The cumulative binned posterior probabilities from the decision tree prosody model.</Paragraph> <Paragraph position="11"> The N-gram LM trained from the extra corpus is used to estimate posterior event probabilities for the LDC-annotated training and test sets, and these posteriors are then thresholded to yield binary features.</Paragraph> <Paragraph position="12"> Other features: speaker or turn change, and compound features of POS tags and decisions from the prosody model.</Paragraph> <Paragraph position="13"> Table 1 summarizes the features and their representations used in the three modeling approaches. The same knowledge sources are used in these approaches, but with different representations. The goal of this paper is to evaluate the ability of these three modeling approaches to combine prosodic and textual knowledge sources, not in a rigidly parallel fashion, but by exploiting the inherent capabilities of each approach.
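The cumulative binning of the prosody model's posteriors can be sketched as below. The threshold set and feature names are assumptions for illustration, not the ones used in the paper:

```python
def cumulative_bins(posterior, thresholds=(0.1, 0.3, 0.5, 0.7, 0.9)):
    # One indicator feature per threshold t, firing whenever the decision
    # tree posterior P(SU | F) reaches t; a higher posterior therefore
    # activates a superset of the features a lower one does.
    return ["prosody_ge_%.1f" % t for t in thresholds if posterior >= t]

feats = cumulative_bins(0.65)
```

This cumulative encoding lets a log-linear model learn a roughly monotonic effect of the posterior without committing to a single hard threshold.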
We attempt to compare the models in as parallel a fashion as possible; however, it should be noted that the two discriminative methods better model the textual sources, while the HMM better models prosody, given its representation in this study.</Paragraph> </Section> </Paper>