<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3209">
  <Title>Comparing and Combining Generative and Posterior Probability Models: Some Advances in Sentence Boundary Detection in Speech</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Sentence boundary detection is a problem that has received limited attention in the text-based computational linguistics community (Schmid, 2000; Palmer and Hearst, 1994; Reynar and Ratnaparkhi, 1997), but which has recently acquired renewed importance through an effort by the DARPA EARS program (DARPA Information Processing Technology Office, 2003) to improve automatic speech transcription technology. Since standard speech recognizers output an unstructured stream of words, improving transcription means not only that word accuracy must be improved, but also that commonly used structural features such as sentence boundaries need to be recognized. The task is thus fundamentally based on both acoustic and textual (via automatic word recognition) information. From a computational linguistics point of view, sentence units are crucial and assumed in most of the further processing steps that one would want to apply to such output: tagging and parsing, information extraction, and summarization, among others.</Paragraph>
    <Paragraph position="1"> Sentence segmentation from speech is a difficult problem. The best systems benchmarked in a recent government-administered evaluation yield error rates between 30% and 50%, depending on the genre of speech processed (measured as the number of missed and inserted sentence boundaries as a percentage of true sentence boundaries). Because of the difficulty of the task, which leaves plenty of room for improvement, its relevance to real-world applications, and the range of potential knowledge sources to be modeled (acoustics and text-based, lower- and higher-level), this is an interesting challenge problem for statistical and computational approaches. null All of the systems participating in the recent DARPA RT-03F Metadata Extraction evaluation (National Institute of Standards and Technology, 2003) were based on a hidden Markov model framework, in which word/tag sequences are modeled by N-gram language models (LMs). Additional features (mostly reflecting speech prosody) are modeled as observation likelihoods attached to the N-gram states of the HMM (Shriberg et al., 2000). The HMM is a generative modeling approach, since it describes a stochastic process with hidden variables (the locations of sentence boundaries) that produces the observable data. The segmentation is inferred by comparing the likelihoods of different boundary hypotheses.</Paragraph>
    <Paragraph position="2"> While the HMM approach is computationally efficient and (as described later) provides a convenient way for modularizing the knowledge sources, it has two main drawbacks: First, the standard training methods for HMMs maximize the joint probability of observed and hidden events, as opposed to the posterior probability of the correct hidden variable assignment given the observations. The latter is a criterion more closely related to classification error. Second, the N-gram LM underlying the HMM transition model makes it difficult to use features that are highly correlated (such as word and POS labels) without greatly increasing the number of model parameters; this in turn would make robust estimation channel word string prosody idea syntax, semantics,  In this paper, we describe our effort to overcome these shortcomings by 1) replacing the generative model with one that estimates the posterior probabilities directly, and 2) using the maximum entropy (maxent) framework to estimate conditional distributions, giving us a more principled way to combine a large number of overlapping features. Both techniques have been used previously for traditional NLP tasks, but they are not straightforward to apply in our case because of the diverse nature of the knowledge sources used in sentence segmentation.</Paragraph>
    <Paragraph position="3"> We describe the techniques we developed to work around these difficulties, and compare classification accuracy of the old and new approach on different genres of speech. We also investigate how word recognition error affects that comparison. Finally, we show that a simple combination of the two approaches turns out to be highly effective in improving the best previous results obtained on a benchmark task.</Paragraph>
  </Section>
class="xml-element"></Paper>