<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3209">
  <Title>Comparing and Combining Generative and Posterior Probability Models: Some Advances in Sentence Boundary Detection in Speech</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Results and Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Experimental Setup
</SectionTitle>
      <Paragraph position="0"> Experiments comparing the two modeling approaches were conducted on two corpora: broadcast news (BN) and conversational telephone speech (CTS). BN and CTS differ in genre and speaking style. These differences are reflected in the frequency of SU boundaries: about 14% of inter-word boundaries are SUs in CTS, compared to roughly 8% in BN.</Paragraph>
      <Paragraph position="1"> The corpora are annotated by LDC according to the guidelines of (Strassel, 2003). Training and test data are those used in the DARPA Rich Transcription Fall 2003 evaluation.7 For CTS, there is about 40 hours of conversational data from the Switchboard corpus for training and 6 hours (72 conversations) for testing. The BN data has about 20 hours 7We used both the development set and the evaluation set as the test set in this paper, in order to have a larger test set to make the results more meaningful.</Paragraph>
      <Paragraph position="2">  of broadcast news shows in the training set and 3 hours (6 shows) in the test set. The SU detection task is evaluated on both the reference transcriptions (REF) and speech recognition outputs (STT). The speech recognition output is obtained from the SRI recognizer (Stolcke et al., 2003).</Paragraph>
      <Paragraph position="3"> System performance is evaluated using the official NIST evaluation tools,8 which implement the metric described earlier. In our experiments, we compare how the two approaches perform individually and in combination. The combined classifier is obtained by simply averaging the posterior estimates from the two models, and then picking the event type with the highest probability at each position. null We also investigate other experimental factors, such as the impact of the speech recognition errors, the impact of genre, and the contribution of text versus prosodic information in each model.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Experimental Results
</SectionTitle>
      <Paragraph position="0"> Table 1 shows SU detection results for BN and CTS, using both reference transcriptions and speech recognition output, using the HMM and the max-ent approach individually and in combination. The maxent approach slightly outperforms the HMM approach when evaluating on the reference transcripts, and the combination of the two approaches achieves the best performance for all tasks (significant at p &lt; 0:05 using the sign test on the reference transcription condition, mixed results on using recognition output).</Paragraph>
      <Paragraph position="1"> 5.2.1 BN vs. CTS The detection error rate on CTS is lower than on BN. This may be due to the metric used for performance. Detection error rate is measured as the percentage of errors per reference SU. The number of SUs in CTS is much larger than for BN, making the relative error rate lower for the conversational speech task. Notice also from Table 1 that maxent yields more gain on CTS than on BN (for the reference transcription condition on both corpora). One possible reason for this is that we have more train- null Table 2 shows error rates for the HMM and the max-ent approaches in the reference condition. Due to the reduced dependence on the prosody model, the errors made in the maxent approach are different from the HMM approach. There are more deletion errors and fewer insertion errors, since the prosody model tends to overgenerate SU hypotheses. The different error patterns suggest that we can effectively combine the system output from the two approaches. As shown in the Table 1, the combination of maxent and HMM consistently yields the best performance.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2.3 Contribution of Knowledge Sources
</SectionTitle>
      <Paragraph position="0"> Table 3 shows SU detection results for the two approaches, using textual information only, as well as in combination with the prosody model (which are the same results as shown in Table 1). We only report the results on the reference transcription condition, in order to not confound the comparison by word recognition errors.</Paragraph>
      <Paragraph position="1"> The superior results for text-only classification are consistent with the maxent model's ability to combine overlapping word-level features in a principled way. However, the HMM largely catches up once prosodic information is added. This can be attributed to the loss-less integration of prosodic posteriors in the HMM, as well as the fact that in the HMM, each boundary decision is affected by prosodic information throughout the data; whereas, the maxent model only uses the prosodic features at the boundary to be classified.</Paragraph>
      <Paragraph position="2">  We observe in Table 1 that there is a large increase in error rate when evaluating on the speech recognition output. This happens in part because word information is inaccurate in the recognition output, thus impacting the LMs and lexical features. The prosody model is also affected, since the alignment of incorrect words to the speech is imperfect, thereby affecting the prosodic feature extraction. However, the prosody model is more robust to recognition errors than the LMs, due to its lesser dependence on word identity. The degradation on CTS is larger than on BN. This can easily be explained by the difference in word error rates, 22.9% on CTS and 12.1% on BN.</Paragraph>
      <Paragraph position="3"> The maxent system degrades more than then HMM system when errorful recognition output is used. In light of the previous section, this makes sense: most of the improvement of the maxent model comes from better lexical feature modeling.</Paragraph>
      <Paragraph position="4"> But these are exactly the features that are most deteriorated by faulty recognition output.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>