<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-1042">
  <Title>SRI's DECIPHER System</Title>
  <Section position="9" start_page="240" end_page="241" type="concl">
    <SectionTitle>
8 Discussion
</SectionTitle>
    <Paragraph position="0"> In this section we first discuss the results relating to the choice of lexicon, and then the results relating to cross-word models.</Paragraph>
    <Paragraph position="1"> Though we have shown an important performance gain through improved phonological modeling, we believe that more substantial gains will be shown in the future for the following reasons: 1. The system tests were based on read speech rather than spontaneous speech. The significant increase in phonological reduction and deletion in spontaneous compared to read speech [2] should result in a bigger difference between systems that include techniques for modeling multiple probabilistic pronunciations and those that do not.</Paragraph>
    <Paragraph position="2"> 2. The rule sets used in the studies described here were developed using a corpus of hand phonetic transcriptions, rather than some form of system output. Different types of variation may be more important to model explicitly in different systems, and are likely to be different from those captured by hand transcriptions.</Paragraph>
    <Paragraph position="3"> 3. Larger amounts of training data will allow the design of more detailed models of phonological variation. Lee has suggested [8] that modeling multiple pronunciations is not worthwhile because (1) it makes systems run too slowly, (2) it is impossible to estimate pronunciation probabilities, and (3) it unfairly penalizes words with too many pronunciations. Although we believe that improvements can certainly still be made in the way we estimate and use our pronunciation probabilities, it is clear from our studies that modeling pronunciation has a significant positive impact on recognition performance without an excessive cost in speed. We suggest that the reason for our opposite conclusions lies in the difference between the multiple-pronunciation word networks that CMU and SRI have tried. As shown in Table 1, SRI's best network models on average about 1.3 pronunciations per word. The network shown as an example in [8] allows thousands of pronunciations, partly because of the excessive detail in the example and partly because of the lack of constraint to correlate the many possibilities.</Paragraph>
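The contrast above, between a compact lexicon averaging about 1.3 pronunciations per word and a network allowing thousands, can be illustrated with a toy probabilistic pronunciation lexicon. This is a minimal sketch, not SRI's actual representation: the words, phone symbols, and probabilities are invented, and a real system would encode pronunciations as a network rather than a flat list.

```python
# Toy probabilistic pronunciation lexicon: each word maps to a list of
# (phone-sequence, probability) pairs. All entries here are invented.
LEXICON = {
    "the":    [(("dh", "ah"), 0.6), (("dh", "iy"), 0.3), (("dh", "ax"), 0.1)],
    "and":    [(("ae", "n", "d"), 0.7), (("ae", "n"), 0.3)],
    "speech": [(("s", "p", "iy", "ch"), 1.0)],
}

def avg_pronunciations(lexicon):
    """Average number of pronunciations per word, the statistic the
    paper reports (about 1.3 for SRI's best network)."""
    return sum(len(prons) for prons in lexicon.values()) / len(lexicon)

def best_pronunciation(lexicon, word):
    """Most probable pronunciation of a word under the lexicon."""
    return max(lexicon[word], key=lambda pair: pair[1])[0]

print(avg_pronunciations(LEXICON))        # 2.0 for this toy lexicon
print(best_pronunciation(LEXICON, "the")) # ('dh', 'ah')
```

Keeping the average low, as in the sketch, is what limits the search-time cost that Lee's objection (1) raises; the probabilities address objection (3) by penalizing unlikely variants rather than every additional pronunciation equally.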
    <Paragraph position="4"> As for the cross-word coarticulatory modeling we report on here, we believe that the performance improvement can be attributed to the following: (1) for short words (and the most frequent words are short, i.e., one to three phones long), the word boundaries form a significant portion of the context that should not be ignored, and (2) many triphones that otherwise would not be observed can be found across word boundaries, and the additional triphone training can help model the less frequent words.</Paragraph>
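The second point, that many triphones are only observed across word boundaries, can be made concrete with a small sketch: flattening a word sequence into a single phone string (padded with silence) exposes triphone contexts that span the junction between words. The lexicon and phone symbols here are invented for illustration, and real systems handle cross-word contexts inside the decoder rather than by naive expansion.

```python
def triphones(words, lexicon):
    """Expand a word sequence into (left, center, right) triphone
    contexts, including the cross-word triphones at word boundaries.
    Illustrative sketch only; lexicon entries are invented."""
    # Flatten the utterance into one phone string with silence padding,
    # so boundary phones get real left and right contexts.
    phones = ["sil"]
    for word in words:
        phones.extend(lexicon[word])
    phones.append("sil")
    return [(phones[i - 1], phones[i], phones[i + 1])
            for i in range(1, len(phones) - 1)]

LEX = {"we": ["w", "iy"], "saw": ["s", "ao"]}
# The middle two triphones span the "we"/"saw" boundary and would not
# appear in word-internal training at all.
print(triphones(["we", "saw"], LEX))
```

For the short, frequent words the paragraph singles out (one to three phones long), nearly every triphone in such an expansion touches a word boundary, which is why ignoring cross-word context discards so much of the relevant acoustic context.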
    <Paragraph position="5"> In sum, speech and linguistic knowledge sources can be used to improve the performance of HMM-based speech recognition systems, provided that care is taken to incorporate these knowledge sources appropriately.</Paragraph>
    <Paragraph position="6"> Acknowledgement. This work was principally supported by SRI IR&amp;D and investment funding. We also gratefully acknowledge the National Science Foundation and the Defense Advanced Research Projects Agency for partial support.</Paragraph>
  </Section>
</Paper>