File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/05/p05-1056_evalu.xml
Size: 7,282 bytes
Last Modified: 2025-10-06 13:59:26
<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1056"> <Title>Using Conditional Random Fields For Sentence Boundary Detection In Speech</Title> <Section position="8" start_page="454" end_page="456" type="evalu"> <SectionTitle> 4 Experimental Results and Discussion </SectionTitle> <Paragraph position="0"> SU detection results using the CRF, HMM, and Maxent approaches individually, on the reference transcriptions or speech recognition output, are shown in Tables 2 and 3 for CTS and BN data, respectively. We present results when different knowledge sources are used: word N-gram only, word N-gram and prosodic information, and using all the features described in Section 3.2.</Paragraph>
[Table 2 caption (CTS): SU detection error rates, with the boundary-based error rate (% in parentheses), using the HMM, Maxent, and CRF individually and in combination. The 'all features' condition uses all the knowledge sources described in Section 3.2. 'Vote' is the result of the majority vote over the three modeling approaches, each of which uses all the features. The baseline error rate when assuming there is no SU boundary at each word boundary is 100% for the NIST SU error rate and 15.7% for the boundary-based metric.]
<Paragraph position="1"> The word N-grams are from the LDC training data and the extra text corpora. 'All the features' means adding the textual information based on tags, as well as the 'other features' in the Maxent and CRF models. The detection error rate is reported using the NIST SU error rate, as well as the per-boundary classification error rate (in parentheses in the tables) in order to factor out the effect of the different SU priors. Also shown in the tables are the majority vote results over the three modeling approaches when all the features are used.</Paragraph>
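To make the two metrics concrete, the sketch below (our illustration, not the paper's scoring tool; the function su_error_rates and the toy labels are hypothetical, and errors are simplified to inserted plus deleted SU boundaries) computes both rates from per-word-boundary reference and system labels. It also shows why a trivial system that never posits an SU boundary scores 100% on the NIST metric but only the SU prior on the boundary-based metric.

# Illustrative sketch (not the NIST scoring tool): score per-word-boundary
# SU decisions against the reference annotation.
def su_error_rates(ref, hyp):
    """ref, hyp: sequences of 0/1 labels, one per inter-word boundary
    (1 = an SU boundary). Returns (nist_rate, boundary_rate) in percent."""
    assert len(ref) == len(hyp)
    insertions = sum(1 for r, h in zip(ref, hyp) if h == 1 and r == 0)
    deletions = sum(1 for r, h in zip(ref, hyp) if h == 0 and r == 1)
    errors = insertions + deletions
    num_ref_su = sum(ref)                      # denominator of the NIST SU error rate
    nist_rate = 100.0 * errors / num_ref_su
    boundary_rate = 100.0 * errors / len(ref)  # denominator = all word boundaries
    return nist_rate, boundary_rate

# A system that never hypothesizes an SU boundary deletes every reference SU,
# giving a 100% NIST SU error rate, while its boundary-based error rate equals
# the SU prior (roughly 15.7% on CTS and 7.2% on BN per the table captions).
ref = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
print(su_error_rates(ref, [0] * len(ref)))     # -> (100.0, 30.0) for this toy example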
<Section position="1" start_page="455" end_page="456" type="sub_section"> <SectionTitle> 4.1 CTS Results </SectionTitle> <Paragraph position="0"> For CTS, we find from Table 2 that the CRF is superior to both the HMM and the Maxent model across all conditions (the differences are significant at p < 0.05). When using only the word N-gram information, the gain of the CRF is the greatest, with the differences among the models diminishing as more features are added. This may be due to the impact of the sparse data problem on the CRF, or simply because the differences among modeling approaches shrink as the features become stronger; that is, good features compensate for weaknesses in the models.</Paragraph> <Paragraph position="1"> Notice that with fewer knowledge sources (e.g., using only word N-gram and prosodic information), the CRF is able to achieve performance similar to, or even better than, the other methods using all the knowledge sources. This may be useful when feature extraction is computationally expensive.</Paragraph> <Paragraph position="2"> We observe from Table 2 that there is a large increase in error rate when evaluating on speech recognition output. This happens in part because word information is inaccurate in the recognition output, thus impacting the effectiveness of the LMs and lexical features. The prosody model is also affected, since the alignment of incorrect words to the speech is imperfect, thereby degrading prosodic feature extraction. However, the prosody model is more robust to recognition errors than textual knowledge, because of its lesser dependence on word identity.</Paragraph> <Paragraph position="3"> The results show that the CRF suffers most from the recognition errors. By focusing on the results when only word N-gram information is used, we can see the effect of word errors on the models. The SU detection error rate increases more in the STT condition for the CRF model than for the other models, suggesting that the discriminative CRF model suffers more from the mismatch between the training condition (features from the reference transcription) and the test condition (features obtained from the errorful recognized words).</Paragraph> <Paragraph position="4"> We also notice from the CTS results that when only word N-gram information is used (with or without prosodic information), the HMM is superior to the Maxent; only when various additional textual features are included in the feature set does the Maxent show its strength compared to the HMM, highlighting the benefit of the Maxent's handling of the textual features.</Paragraph>
[Table 3 caption (BN): SU detection error rates, with the boundary-based error rate (% in parentheses), using the HMM, Maxent, and CRF individually and in combination. The baseline error rate is 100% for the NIST SU error rate and 7.2% for the boundary-based metric.]
<Paragraph position="5"> The combined result (using majority vote) of the three approaches in Table 2 is superior to any individual model, although the improvement is not significant.</Paragraph> <Paragraph position="6"> Previously, it was found that the Maxent and HMM posteriors combine well because the two approaches have different error patterns (Liu et al., 2004). For example, the Maxent yields fewer insertion errors than the HMM because of its reliance on different knowledge sources. The toolkit we use for the implementation of the CRF does not generate a posterior probability for a sequence; therefore, we do not combine the system outputs via posterior probability interpolation, which would be expected to yield better performance.</Paragraph> </Section>
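As a minimal sketch of the two combination schemes just mentioned (our illustration, not the authors' implementation; majority_vote and interpolate are hypothetical helpers), voting takes, at each word boundary, the label agreed on by at least two of the three systems, while posterior interpolation would instead average per-boundary posteriors before thresholding.

# Minimal sketch (our illustration): combine three per-boundary SU decisions
# by majority vote. Each input is a list of 0/1 labels, one per word boundary.
def majority_vote(hmm, maxent, crf):
    assert len(hmm) == len(maxent) == len(crf)
    return [1 if (h + m + c) >= 2 else 0 for h, m, c in zip(hmm, maxent, crf)]

# Had per-boundary posteriors been available from all three models, an
# alternative would be interpolation: output 1 where the (weighted) average
# of the posteriors exceeds 0.5.
def interpolate(posteriors, weights=(1/3, 1/3, 1/3)):
    avg = [sum(w * p for w, p in zip(weights, ps)) for ps in zip(*posteriors)]
    return [1 if a > 0.5 else 0 for a in avg]

print(majority_vote([1, 0, 1], [1, 1, 0], [0, 0, 1]))  # -> [1, 0, 1]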
<Section position="2" start_page="456" end_page="456" type="sub_section"> <SectionTitle> 4.2 BN Results </SectionTitle> <Paragraph position="0"> Table 3 shows the SU detection results for BN. Similar to the patterns found for the CTS data, the CRF consistently outperforms the HMM and the Maxent, except in the STT condition when all the features are used. The CRF yields relatively less gain over the other approaches on BN than on CTS. One possible reason for this difference is that there is more training data for the CTS task, and both the CRF and Maxent approaches require a relatively larger training set than the HMM. Overall, the degradation in the STT condition is smaller for BN than for CTS.</Paragraph> <Paragraph position="1"> This is easily explained by the difference in word error rates: 22.9% on CTS versus 12.1% on BN.</Paragraph> <Paragraph position="2"> Finally, the vote among the three approaches outperforms any individual model on both the REF and STT conditions, and the gain from voting is larger for BN than for CTS.</Paragraph> <Paragraph position="3"> Comparing Table 2 and Table 3, we find that the NIST SU error rate on BN is generally higher than on CTS. This is partly because the NIST error rate is measured as the percentage of errors per reference SU, and the number of SUs in CTS is much larger than in BN, giving a larger denominator and thus a relatively lower error rate for the same number of boundary detection errors. Another reason is that the training set is smaller for BN than for CTS. Finally, the two genres differ significantly: CTS has the advantage of frequent backchannels and first person pronouns that provide good cues for SU detection. When the boundary-based classification metric is used (results in parentheses), the SU error rate is lower on BN than on CTS; however, it should also be noted that the baseline error rate (i.e., the prior probability of an SU boundary) is also lower on BN than on CTS.</Paragraph> </Section> </Section> </Paper>