LM Studies on Filled Pauses in Spontaneous Medical Dictation
Jochen Peters

2 Corpora

Our experiments are based on about 1.4 million words of real-life dictated medical reports from various US hospitals, partitioned into a Train, Dev, and Eval set (Table 1). The dictation style is fully spontaneous, with repairs, repetitions, partial words, and, most frequently, filled pauses (FP). Manual transcriptions of these data include the annotation of FP; however, tags distinguishing FP associated with hesitations, repairs, and restarts are missing. Here, as opposed to Switchboard, most FP are sentence-internal (ca. 70-80%).

A large background corpus provides formatted, i.e. non-spontaneous, reports which are mapped to the 60k word list of our recognition system. To train LMs including FP, this 'Report' corpus was stochastically enriched with FP: treating single or sequential FPs as hidden events in the reports, we randomly inserted them with their a-posteriori probabilities in the given word contexts. These probabilities are estimated using a bigram from the spontaneous training data. A similar approach was mentioned, without details, in (Gauvain et al., 1997). They report increasing error rates if too many FP are inserted into the LM training data by this method. This might be explained by the following observation: adding FP in a context-dependent fashion diminishes the number of observed bi- and trigrams, since words that typically precede or follow FP "lose individual contexts" when many FP are inserted. For our Report corpus, the number of dis- ...

Three approaches to handling FP in the LM are considered:

1. We treat FP as a regular word which is predicted by the LM and which conditions the following words.

2. We use the LM for both words and FP but discard all FP from the conditioning histories.

3. We use a fixed, context-independent probability for FP of 0.08 (FP unigram). Here, words are predicted with an FP-free LM, skipping FP in the history (as in approach 2). Normalization is achieved by scaling the word probabilities with (1 - p_fix(FP)). This simplistic approach relieves us of the need for FP-tagged corpora, but we clearly lose the discriminative prediction of FP.

Approaches 1 and 2 use count statistics with FP. As discussed above, the inclusion of FP "destroys" some possible word transitions. To exploit the knowledge about possible FP-cleaned transitions, we successfully tested merged counts: the sets of M-grams observed in the corpus with and without FP are joined, and the counts of common M-grams are added. (Doubled counts use modified discounting, and the reduced FP rate is compensated using marginal adaptation (Kneser et al., 1997).)

All reported results are obtained with linearly interpolated models trained on the spontaneous Train corpus and the non-spontaneous Report corpus. (For trigrams, the perplexities of these two component LMs are 95% and 19% above the perplexity of the interpolated LM.)
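To make the stochastic FP enrichment above concrete, the following is a minimal Python sketch that inserts a single hidden FP between adjacent words with its posterior probability under a bigram LM. The bigram.prob(word, previous_word) interface and the <FP> token name are illustrative assumptions, and sequential FP insertions are not handled:

    import random

    FP = "<FP>"  # illustrative filled-pause token; the corpus symbol may differ

    def fp_posterior(bigram, prev, nxt):
        # Posterior probability of a hidden FP between prev and nxt under a bigram LM
        # trained on the FP-annotated spontaneous data.
        with_fp = bigram.prob(FP, prev) * bigram.prob(nxt, FP)
        without_fp = bigram.prob(nxt, prev)
        return with_fp / (with_fp + without_fp)

    def enrich_report(sentence, bigram, rng=random):
        # Walk a formatted (non-spontaneous) sentence and randomly insert FP tokens.
        tokens = ["<s>"] + sentence + ["</s>"]
        out = []
        for prev, nxt in zip(tokens, tokens[1:]):
            if rng.random() < fp_posterior(bigram, prev, nxt):
                out.append(FP)
            if nxt != "</s>":
                out.append(nxt)
        return out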
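The three FP-handling approaches differ only in how the conditioning history is built and how FP itself is scored. A sketch, assuming a hypothetical lm.prob(word, history) interface that returns a conditional probability:

    import math

    FP = "<FP>"

    def logprob_approach1(lm, tokens):
        # Approach 1: FP is a regular word; it is predicted and stays in the history.
        lp, hist = 0.0, []
        for tok in tokens:
            lp += math.log(lm.prob(tok, hist))
            hist.append(tok)
        return lp

    def logprob_approach2(lm, tokens):
        # Approach 2: FP is still predicted by the LM, but dropped from the history.
        lp, hist = 0.0, []
        for tok in tokens:
            lp += math.log(lm.prob(tok, hist))
            if tok != FP:
                hist.append(tok)
        return lp

    def logprob_approach3(lm_fp_free, tokens, p_fix=0.08):
        # Approach 3: fixed FP probability; words come from an FP-free LM with FP
        # skipped in the history, rescaled by (1 - p_fix) for normalization.
        lp, hist = 0.0, []
        for tok in tokens:
            if tok == FP:
                lp += math.log(p_fix)
            else:
                lp += math.log((1.0 - p_fix) * lm_fp_free.prob(tok, hist))
                hist.append(tok)
        return lp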
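The merged-count construction amounts to joining the M-gram sets observed in the corpus with and without FP and adding the counts of M-grams common to both. A sketch (the modified discounting and the marginal adaptation of the FP rate are omitted):

    from collections import Counter

    def ngram_counts(sentences, order=3):
        # Collect all M-gram counts up to the given order from a tokenized corpus.
        counts = Counter()
        for sent in sentences:
            toks = ["<s>"] * (order - 1) + sent + ["</s>"]
            for n in range(1, order + 1):
                for i in range(len(toks) - n + 1):
                    counts[tuple(toks[i:i + n])] += 1
        return counts

    def merged_counts(counts_with_fp, counts_without_fp):
        # Join the two M-gram sets; updating one Counter with the other adds the
        # counts of M-grams that occur in both corpora.
        merged = Counter(counts_with_fp)
        merged.update(counts_without_fp)
        return merged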
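Finally, the reported results use a linear interpolation of the spontaneous Train LM and the Report LM. A sketch with an illustrative weight lam (the interpolation weight is not stated in this section):

    def interpolated_prob(lm_train, lm_report, word, history, lam=0.5):
        # Linear interpolation of the two component LMs; lam is a placeholder weight.
        return lam * lm_train.prob(word, history) + (1.0 - lam) * lm_report.prob(word, history)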