<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4007">
  <Title>Advances in Children's Speech Recognition within an Interactive Literacy Tutor</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Coleman Institute for Cognitive Disabilities. The views expressed in
</SectionTitle>
    <Paragraph position="0"> this paper do not necessarily represent the views of the NSF.</Paragraph>
    <Paragraph position="1"> results show that such automated reading tutors can improve student achievement (Mostow et al., 2003). Providing real-time feedback by highlighting words as they are read out loud is the basis of at least one commercial product today (http://www.soliloquy.com).</Paragraph>
    <Paragraph position="2"> Cole et al. (2003) and Wise et al. (in press) describe a new scientifically based literacy program, Foundations to Fluency, in which a virtual tutor (a lifelike 3D computer model) interacts with children in multimodal learning tasks to teach them to read. A key component of this program is the Interactive Book, which combines real-time speech recognition, facial animation, and natural language understanding capabilities to teach children to read and comprehend text. Interactive Books are designed to improve student achievement by helping students to learn to read fluently, to acquire new knowledge through deep understanding of what they read, to make connections to other knowledge, and to express their ideas concisely through spoken or written summaries. Transcribed spoken summaries can be graded automatically to provide feedback to the student about their comprehension.</Paragraph>
    <Paragraph position="3"> During read-aloud activities in Interactive Books, the goal is to design a computer interface and speech recognizer that combine to teach the student to read fluently and naturally. Here, speech recognition is used to track a child's position within the text during read-aloud sessions and to provide timing and confidence information that can be used for reading assessment. The speech recognizer must follow the student's verbal behaviors accurately and quickly, so that the cursor (or highlighted word) appears at the right place and right time when the student is reading fluently, and pauses when the student hesitates to sound out a word.</Paragraph>
    <Paragraph position="4"> The recognizer must also score mispronounced words accurately so that the student can revisit these words and receive feedback about their pronunciation after completing a paragraph or page (since highlighting hypothesized mispronounced words when reading out loud may disrupt fluent reading behavior).</Paragraph>
    <Paragraph position="5"> In this paper we focus on the problem of speech recognition to track and provide feedback during reading out loud and to transcribe spoken summaries of text.</Paragraph>
    <Paragraph position="6"> Specifically, we describe several new methods for incorporating language modeling knowledge into the read aloud task. In addition, through use of speaker adaptation, we also demonstrate the potential for significant gains in recognition accuracy. Finally, we leverage improvements in speech recognition for read aloud tracking to improve performance for spoken story summarization. Work reported here extends previous work in several important ways: by integrating the research advances into a real time system, and by including time-adaptive language modeling and time-adaptive acoustic modeling of the child's voice into the system.</Paragraph>
    <Paragraph position="7"> The paper is organized as follows. Sect. 2 describes our baseline speech recognition system and reading tracking method. Sect. 3 presents our rationale for using word-error-rate as a measure of performance. Sect. 4 describes the read aloud and story summarization corpora used in this work. Sect. 5 describes and evaluates proposed improvements in a read aloud speech recognition task. Sect. 6 describes how these improvements translate to improved recognition of story summaries produced by a child. Sect. 7 details our real-time system implementation.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Baseline System
</SectionTitle>
    <Paragraph position="0"> For this work we use the SONIC speech recognition system (Pellom, 2001; Pellom and Hacioglu, 2003).</Paragraph>
    <Paragraph position="1"> The recognizer implements an efficient time-synchronous, beam-pruned Viterbi token-passing search through a static re-entrant lexical prefix tree while utilizing continuous density mixture Gaussian HMMs.</Paragraph>
    <Paragraph position="2"> For children's speech, the recognizer has been trained on 46 hours of data from children in grades K through 9 extracted from the CU Read and Prompted speech corpus (Hagen et al., 2003) and the OGI Kids' speech corpus (Shobaki et al., 2000). Further, the baseline system utilizes PMVDR cepstral coefficients (Yapanel and Hansen, 2003) for improved noise robustness.</Paragraph>
    <Paragraph position="3"> During read-aloud operation, the speech recognizer models the story text using statistical n-gram language models. This approach gives the recognizer flexibility to insert/delete/substitute words based on acoustics and to provide accurate confidence information from the word-lattice. The recognizer receives packets of audio and automatically detects voice activity. When the child speaks, the partial hypotheses are sent to a reading tracking module. The reading tracking module determines the current reading location by aligning each partial hypothesis with the book text using a Dynamic Programming search. In order to allow for skipping of words or even skipping to a different place within the text, the search finds words that, when strung together, minimize a weighted cost function of adjacent word proximity and distance from the reader's last active reading location. The Dynamic Programming search additionally incorporates constraints to account for boundary effects at the ends of each partial phrase.</Paragraph>
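    <Paragraph position="4"> As a rough illustration of the reading tracking step described above, the following Python sketch aligns a partial hypothesis with the book text using a Viterbi-style dynamic program. The function name `track_position`, the weights `w_adj` and `w_loc`, and the exact form of the cost are assumptions made for illustration; they are not taken from the SONIC implementation, and the boundary constraints mentioned above are omitted.

```python
from typing import List, Optional


def track_position(hyp: List[str], book: List[str], last_loc: int,
                   w_adj: float = 1.0, w_loc: float = 0.1) -> Optional[int]:
    """Return the book index of the last aligned hypothesis word, or None.

    Hypothetical cost (an assumption, not the paper's exact function):
    w_adj * |gap between consecutive matched positions - 1|
    plus w_loc * |position - last_loc| for every matched word.
    """
    # Candidate book positions for each hypothesis word. Words that do
    # not occur in the book are skipped, which tolerates misrecognitions.
    cands = []
    for word in hyp:
        positions = [i for i, b in enumerate(book) if b == word]
        if positions:
            cands.append(positions)
    if not cands:
        return None

    # best[j] = minimal cost of a path ending at candidate j of the current word
    best = [w_loc * abs(p - last_loc) for p in cands[0]]
    for prev, cur in zip(cands, cands[1:]):
        new_best = []
        for p in cur:
            local = w_loc * abs(p - last_loc)
            # Adjacent book words (p == q + 1) incur no transition cost.
            trans = min(b + w_adj * abs(p - q - 1) for b, q in zip(best, prev))
            new_best.append(local + trans)
        best = new_best

    final = cands[-1]
    return final[min(range(len(final)), key=best.__getitem__)]
```

    In this sketch the adjacency term favors hypotheses that follow the book text word by word, while the location term keeps the tracker near the last reading position unless the acoustic evidence clearly points elsewhere.</Paragraph>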
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Evaluation Methodology
</SectionTitle>
    <Paragraph position="0"> There are many different ways in which speech recognition can be used to serve children. In computer-based literacy tutors, speech recognition can be used to measure children's ability to read fluently and pronounce words while reading out loud, to engage in spoken dialogues with an animated agent to assess and train comprehension, or to transcribe spoken summaries of stories that can be graded automatically. Because of the variety of ways of using speech recognition systems, it is critically important to establish common metrics that are used by the research community so that progress can be measured both within and across systems.</Paragraph>
    <Paragraph position="1"> For this reason, we argue that word error rate, computed using the NIST scoring software, provides the most widely accepted, easy-to-use, and highly valid metric. In this scoring procedure, word error rate is computed strictly by comparing the speech recognizer output against a known human transcription (or the text in a book). Of course, authors are free to define and report other measures, such as detection/false alarm curves for useful events such as reading miscues.</Paragraph>
    <Paragraph position="2"> However, such analyses should always supplement reports of word error rates using a single standardized measure. Adopting this strategy enables fair and balanced comparisons within and across systems for any speech data given a known word-level transcription.</Paragraph>
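    <Paragraph position="3"> For reference, word error rate is conventionally defined as the number of substitutions, deletions, and insertions in a minimum-edit-distance alignment, divided by the number of reference words. The following Python sketch computes that quantity. It is only an illustration and not the NIST scoring software; the function name `word_error_rate` is hypothetical, and no text normalization is applied.

```python
from typing import List


def word_error_rate(ref: List[str], hyp: List[str]) -> float:
    """WER = (substitutions + deletions + insertions) / len(ref)."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j - 1] + (r != h),  # match or substitution
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
            ))
        prev = cur
    return prev[-1] / len(ref)
```

    Because insertions are counted as errors, this measure can exceed 100% when the hypothesis is much longer than the reference.</Paragraph>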
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental Data
</SectionTitle>
    <Paragraph position="0"> For all experiments in this paper we use speech data and associated transcriptions from 106 children (grade 3: 17 speakers, grade 4: 28 speakers, and grade 5: 61 speakers) who were asked to read one of ten stories and to provide a spoken story summary. The 16 kHz audio data contains an average of 1054 words (min 532 words; max 1926 words) with an average of 413 unique words per story. The resulting summaries spoken by children contain an average of 168 words.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Improved Read-Aloud Recognition
</SectionTitle>
    <Paragraph position="0"> Baseline: Our baseline read-aloud system utilizes a trigram language model constructed from a normalized version of the story text. Text normalization consists primarily of punctuation removal and determination of sentence-like units. For example, It was the first day of summer vacation. Sue and Billy were eating breakfast. &quot;What can we do today?&quot; Billy asked. is normalized as:  The resulting text is used to estimate a back-off trigram language model. We stress that only the story text is used to construct the language model. Details on the story texts are provided in Hagen et al. (2003). Note that the sentence markers (&lt;s&gt; and &lt;/s&gt;) are used to represent positions of expected speaker pause. This baseline system is shown in Table 1(A) to produce a 17.4% word error rate.</Paragraph>
    <Paragraph position="1"> It is important in the context of this research to note that children do not pause between each estimated sentence boundary. Instead, many children read fluently across phrases and sentences, where more experienced readers would pause. For this reason, we improved upon our baseline system by estimating language model parameters using combined text material that is generated both with and without the contextual sentence markers (&lt;s&gt; and &lt;/s&gt;). Table 1(B) shows that this modification reduces the word error rate from 17.4% to 13.5%.</Paragraph>
    <Paragraph position="2"> Improved Word History Modeling: Most speech recognition systems operate on the utterance as a primary unit of recognition. Word history information typically is not maintained across segmented utterances. However, in our text example, the words &quot;do today&quot; should provide useful information to the recognizer that &quot;Billy asked&quot; may follow. We therefore modify the recognizer to incorporate knowledge of previous utterance word history. During token-passing search, the initial word-history tokens are modified to account for the fact that the incoming sentence may be either the beginning of a new sentence or a direct extension of the previous utterance's word-end history. Incorporating this constraint lowers the word error rate from 13.5% to 12.7% as shown in Table 1(C).</Paragraph>
    <Paragraph position="3"> Dynamic n-gram Language Modeling: During story reading we can anticipate words that are likely to be spoken next based upon the words in the text that are currently being read aloud. To account for this knowledge, we estimate a series of position-sensitive n-gram language models by partitioning the story into overlapping regions containing at most 150 words (i.e., each region is centered on 50 words of text with 50 words before and 50 words after). For each partition, we construct an n-gram language model by using the entire normalized story text in addition to a 10x weighting of text within the partition. Each position-sensitive language model therefore contains the entire story vocabulary. We also compute a general language model estimated solely from the entire story text (similar to Table 1(C)). At run-time, the recognizer implements a word-history buffer containing the most recent 15 recognized words. After decoding each utterance, the probability of the text within the word history buffer is computed using each of the position-sensitive language models. The language model with the highest probability is selected for the first-pass decoding of the subsequent utterance. This modification decreases the word error rate from 12.7% to 10.7% (Table 1(D)).</Paragraph>
    <Paragraph position="4"> Vocal Tract Normalization and Acoustic Adaptation: We further extend our baseline system by incorporating the Vocal Tract Length Normalization (VTLN) method described in Welling et al. (1999). Based on results shown in Table 1(E), we see that VTLN provides only a marginal gain (0.1% absolute). Our final set of acoustic models for the read-aloud task are both VTLN normalized and estimated using Speaker Adaptive Training (SAT). The SAT models are determined by estimating a single linear feature space transform for each training speaker (Gales, 1997). The means and variances of the VTLN/SAT models are then iteratively adapted using the SMAPLR algorithm (Siohan, 2002) to yield a final recognition error rate of 8.0% absolute (Table 1(G)). By combining all of these techniques, we achieved a 54% reduction in word error rate relative to the baseline system (from 17.4% to 8.0%).</Paragraph>
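    <Paragraph position="5"> The position-sensitive language model selection described under Dynamic n-gram Language Modeling can be sketched in Python as follows. To keep the example short, it substitutes add-one-smoothed bigram models for the back-off n-gram models used in this work and scores the word-history buffer with log probabilities. The names `build_models` and `select_model`, the smoothing scheme, and the way regions are stepped through the story are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple

Model = Tuple[Counter, Counter, int]  # (bigram counts, context counts, vocab size)


def weighted_counts(words: List[str], weight: float,
                    bigrams: Counter, contexts: Counter) -> None:
    """Add weighted bigram and context counts for a word sequence."""
    for a, b in zip(words, words[1:]):
        bigrams[(a, b)] += weight
        contexts[a] += weight


def build_models(story: List[str], core: int = 50, context: int = 50,
                 boost: float = 10.0) -> List[Model]:
    """Build one model per region of at most core + 2 * context words.

    Assumption: the whole story is counted with weight 1 and the
    region text is added on top with weight `boost`.
    """
    vocab = len(set(story))
    models = []
    for start in range(0, len(story), core):
        region = story[max(0, start - context):start + core + context]
        bigrams, contexts = Counter(), Counter()
        weighted_counts(story, 1.0, bigrams, contexts)
        weighted_counts(region, boost, bigrams, contexts)
        models.append((bigrams, contexts, vocab))
    return models


def log_prob(history: List[str], model: Model) -> float:
    """Add-one-smoothed bigram log probability of the history buffer."""
    bigrams, contexts, vocab = model
    return sum(math.log((bigrams[(a, b)] + 1.0) / (contexts[a] + vocab))
               for a, b in zip(history, history[1:]))


def select_model(recognized: List[str], models: List[Model],
                 buffer_size: int = 15) -> int:
    """Index of the model that best explains the most recent recognized words."""
    history = recognized[-buffer_size:]
    scores = [log_prob(history, m) for m in models]
    return max(range(len(models)), key=scores.__getitem__)
```

    Because every model is built from the full story text, each one covers the entire story vocabulary, so a reader who jumps to a distant passage can still be recognized and the selection can recover after the next utterance.</Paragraph>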
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Improved Story Summary Recognition
</SectionTitle>
    <Paragraph position="0"> One of the unique and powerful features of our interactive books is the notion of assessing and training comprehension by providing feedback to the student about a typed summary of text that the student has just read (Cole et al., 2003). Verbal input is especially important for younger children who often cannot type well. Utilizing summaries from the children's speech corpus, Hagen et al. (2003) showed that an error rate of 42.6% could be achieved. The previous work, however, did not consider utilizing the read story material to provide improved initial acoustic models for the summarization task. In Table 2 we demonstrate several findings using a language model trained on story text and example summaries produced by children (leaving out data from the child under test). Without any adaptation the error rate is 47.1%. However, utilizing adapted models from the read stories (see Table 1(G)) provides an initial performance gain of nearly 10% absolute. Further use of SMAPLR adaptation reduces the error rate to 36.1%.</Paragraph>
  </Section>
</Paper>