<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0204"> <Title>PLASER: Pronunciation Learning via Automatic Speech Recognition</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 PLASER: System Design </SectionTitle> <Paragraph position="0"> PLASER runs under Microsoft Windows (98, NT, 2000) with an easy-to-use web-like interface requiring only standard utilities such as Internet Explorer and Media Player. PLASER consists of 20 lessons, and each lesson teaches two American English phonemes as shown in Table 1. The two phonemes in a lesson are usually the most confusable pair among the 40 phonemes. PLASER contains many example words; for each word there are its English spelling, its Chinese translation, a picture, and a pronunciation video-clip (PVC) recorded by a native American English speaker. A user may read and listen to the materials for each word as many times as he likes, at his own pace. Besides descriptive materials, PLASER uses four types of exercises to teach pronunciation: Read-Along Exercise: Basic pronunciation drills with no assessment.</Paragraph> <Paragraph position="1"> Minimal-Pair Listening Exercise: This exercise is used to train the user's ear. A word from a minimal pair is randomly embedded in a sentence that makes perfect sense with either word of the pair. A user listens to recordings of such sentences and chooses between the two words.</Paragraph> <Paragraph position="2"> Minimal-Pair Speaking Exercise: Similar to the Minimal-Pair Listening Exercise, except that now only the minimal pairs are given and the user is asked to say them. A student may pick either of the two words to say, and PLASER must decide which word was spoken without confusing it with its counterpart in the pair; this is a two-class classification problem. Word-List Speaking Exercise: A student may pick any word from a list to say, and PLASER has to decide how well each phoneme in the word is pronounced.</Paragraph> <Paragraph position="3"> Fig.
1 shows a snapshot of PLASER running the Word-List Speaking Exercise in the lesson teaching the two phonemes &quot;ih&quot; and &quot;iy&quot;. The user has selected the word &quot;cheese&quot; to practise. The top left panel explains how to produce the phoneme &quot;iy&quot; with the help of an animated GIF that shows a cross-sectional view of the vocal tract during the phoneme's production. In the bottom right panel are the word's spelling, its Chinese translation, its picture, plus a recording button and a playback button. The word's PVC is shown in the top right panel. The middle panel of the screen is reserved for feedback. The feedback for the Word-List Speaking Exercise consists of an overall score for the practised word (&quot;cheese&quot; here) as well as a confidence score for each individual phoneme in the word, presented using a novel 3-color scheme. Confidence scores are derived from a log-likelihood ratio between the desired target and some reference. Garbage rejection is also implemented in a similar manner. Refer to Sections 4 and 5 for more details.</Paragraph> <Paragraph position="4"> Designed for use both as a self-learning tool and as a teaching aid, each lesson is intended to take about 25-30 minutes to complete. Students' performance is recorded for later review, by the students themselves if PLASER is used as a learning tool, or by teachers if PLASER is used as a teaching aid.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Acoustic Modelling </SectionTitle> <Paragraph position="0"> For the development of PLASER's acoustic models, additional speech data were collected from local high-school students: HKTIMIT: A set of TIMIT utterances collected from a group of 61 local (Cantonese) high-school students who spoke &quot;good&quot; English by the local standard. There are 29 females and 32 males, and each recorded 250 TIMIT sentences.
The data were divided into a training set of 9,163 utterances from 17 females and 20 males, and a test set of 6,015 utterances from 12 females and 13 males.</Paragraph> <Paragraph position="1"> MP-DATA: A superset of the words used in PLASER's minimal-pair exercises, recorded by eight high-school students, 4 males and 4 females, each speaking about 300 words for a total of 2,431 words.</Paragraph> <Paragraph position="2"> WL-DATA: A superset of the words used in PLASER's word exercises, recorded by the same eight students who recorded the MP-DATA, for a total of 2,265 words.</Paragraph> <Paragraph position="3"> All data were recorded under the same conditions as those of TIMIT. In addition, all utterances of MP-DATA and WL-DATA were phonetically transcribed.</Paragraph> <Paragraph position="4"> The standard American English TIMIT corpus was used together with the HKTIMIT corpus to develop Cantonese-accented English phoneme HMMs. The common 13 mel-frequency cepstral coefficients and their first- and second-order derivatives were used for the acoustic representation. All phoneme HMMs have three emitting states, and there are an additional 3-state silence model and a 1-state short-pause HMM. Three kinds of modelling techniques were investigated: Context-Independent HMM (CIHMM): One context-independent HMM was trained for each of the 40 phonemes taught in PLASER. Including the silence and short-pause models, there are 42 HMMs in total.</Paragraph> <Paragraph position="5"> Position-Dependent HMM (PDHMM): Due to concerns about the limited computing resources in local public schools, a restricted form of context-dependent modelling was chosen. Since PLASER only performs phoneme recognition on isolated words, we postulate that it may be important to capture the word-boundary effect of a phoneme.
Thus, three variants of each phoneme are modelled, depending on whether it appears at the beginning, in the middle, or at the end of a word.</Paragraph> <Paragraph position="6"> Minimum Classification Error (MCE) Discriminative Training: With the goal of minimizing classification errors on a development data set (WL-DATA in our case), the word-based MCE/GPD algorithm (Juang and Katagiri, 1992; Chou, 2000) was applied to improve the EM-trained acoustic models.</Paragraph> <Paragraph position="7"> We started with a baseline system using 40 monophones with 24 mixtures per state. It gives a phoneme recognition accuracy of 39.9% on the HKTIMIT test set. The low accuracy perhaps indicates a lower-than-expected English proficiency of local students as well as a large deviation of local English from native American English. We then investigated PDHMMs and MCE training, and gauged our progress by the classification accuracy on minimal pairs in the MP-DATA set. The results are tabulated in Table 2.</Paragraph> <Paragraph position="8"> By using PDHMMs, the inventory of models is only tripled, requiring little additional computational resources. Yet they result in a relative error reduction of 7.2%. MCE discriminative training gives an additional relative improvement of about 14-16%.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Confidence-based Phoneme Assessment </SectionTitle> <Paragraph position="0"> The assessment of pronunciation accuracy is cast as a phoneme verification problem. The posterior probability of a phoneme is used as the Goodness of Pronunciation (GOP) measure, which has been shown to be a good measure in many works (Witt and Young, 2000; Franco et al., 2000).
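The position-dependent phoneme modelling of Section 3 can be illustrated with a short sketch. The `_B`/`_M`/`_E` suffix convention and the handling of single-phoneme words are assumptions made for illustration, not PLASER's actual model naming:

```python
def positional_labels(phonemes):
    """Expand a word's phoneme transcription into position-dependent
    labels: B = word-beginning, M = word-middle, E = word-end.
    A single-phoneme word is treated as word-beginning here (an
    illustrative choice; the paper does not specify this case)."""
    n = len(phonemes)
    labels = []
    for i, p in enumerate(phonemes):
        if i == 0:
            labels.append(p + "_B")      # word-initial variant
        elif i == n - 1:
            labels.append(p + "_E")      # word-final variant
        else:
            labels.append(p + "_M")      # word-medial variant
    return labels

print(positional_labels(["ch", "iy", "z"]))  # "cheese" -> ['ch_B', 'iy_M', 'z_E']
```

Since every phoneme gets at most three variants, the model inventory only triples, consistent with the resource constraints mentioned above.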
PLASER computes both a GOP score and a normalized GOP score for two types of feedback, as will be discussed in Section 5.</Paragraph> <Paragraph position="1"> When a student runs a PLASER word exercise, s/he randomly picks a word from a list and watches its pronunciation video-clip (PVC). When s/he feels ready to try, s/he records her/his voice speaking the word. PLASER then computes a confidence-based GOP for each phoneme in the word as follows.</Paragraph> <Paragraph position="2"> STEP 1: PLASER consults its dictionary for the standard phonemic transcription of the word, which should be the same as that of its PVC.</Paragraph> <Paragraph position="3"> STEP 2: Based on the transcription, forced alignment is performed on the student's speech.</Paragraph> <Paragraph position="4"> STEP 3: For each acoustic segment Xu of phoneme yu (where u denotes the phoneme index), PLASER computes its GOP(yu), su, as its posterior probability by a duration-normalized log-likelihood ratio: su = GOP(yu) = (1/Tu) log [ p(Xu | yu) / p(Xu | y_jmax) ] (Equation 2), where Tu is the duration of segment Xu, N is the number of phonemes, and jmax indexes, among the N phoneme models, the one that gives the highest likelihood of the given segment. This GOP is used with some thresholds to decide if the phoneme is pronounced correctly.</Paragraph> <Paragraph position="5"> In practice, the denominator in Equation 2 is replaced by the Viterbi likelihood of the segment given by a phone loop. Notice that the Viterbi path of a segment may contain more than one phoneme model.</Paragraph> <Paragraph position="6"> STEP 4: Besides the raw GOP score, GOP(yu) = su computed in STEP 3, a normalized GOP score is also computed by normalizing the GOP score to the range [0.0 .. 1.0] using a sigmoid function.
That is, the normalized GOP for the phoneme yu is given by</Paragraph> <Paragraph position="8"> GOP'(yu) = 1 / (1 + exp(-alpha (su - beta))), where the parameters alpha and beta are found empirically.</Paragraph> <Paragraph position="9"> The current PLASER implementation has some modifications for practical reasons: the phone loop for computing the denominator of Equation 2 uses only the middle-position PDHMM of each phoneme, plus the silence and short-pause models, for faster computation. For greater computational savings, the phone loop may also be replaced by a single Gaussian Mixture Model (GMM) trained on all phoneme segments in the training data. In our experience, a GMM with 32 mixtures suffices, with only a slight degradation in performance.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Visualization of Recognition Results </SectionTitle> <Paragraph position="0"> Two kinds of feedback of different resolutions are given for the word exercise: + an overall phoneme score for the whole word; and, + a phoneme-by-phoneme assessment by a 3-color scheme.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Overall Phoneme Score of a Word </SectionTitle> <Paragraph position="0"> The use of the posterior probability as the GOP score for assessing the accuracy of a phoneme segment allows us to readily define an overall phoneme score (PS) for a word as a weighted sum of the normalized GOPs of its composing phonemes:</Paragraph> <Paragraph position="2"> PS = sum over k = 1, ..., N of wk GOP'(yk), where wk is the weighting of the k-th phoneme among the N phonemes composing the word.
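The scoring pipeline of Section 4 and the word-level score above can be sketched as follows. The values of alpha, beta, and the default equal weighting are illustrative assumptions; in PLASER the log-likelihoods come from forced alignment and the phone loop, not from hand-typed numbers:

```python
import math

def gop(loglik_target, loglik_loop, n_frames):
    """STEP 3: duration-normalized log-likelihood ratio.
    loglik_target: log p(Xu | yu) from forced alignment.
    loglik_loop:   Viterbi log-likelihood of the phone loop
                   (the denominator of Equation 2).
    n_frames:      length of the segment Xu in frames."""
    return (loglik_target - loglik_loop) / n_frames

def normalized_gop(s_u, alpha=1.0, beta=0.0):
    """STEP 4: sigmoid mapping of the raw GOP su into [0.0, 1.0].
    alpha and beta are empirically tuned (values here are placeholders)."""
    return 1.0 / (1.0 + math.exp(-alpha * (s_u - beta)))

def phoneme_score(norm_gops, weights=None):
    """Section 5.1: overall phoneme score of a word as a weighted sum
    of the normalized GOPs; equal weights by default, as in PLASER."""
    if weights is None:
        weights = [1.0 / len(norm_gops)] * len(norm_gops)
    return sum(w * g for w, g in zip(weights, norm_gops))

# Toy usage for a 3-phoneme word such as "cheese" (/ch/ /iy/ /z/):
raw = [gop(-95.0, -90.0, 10), gop(-80.0, -79.0, 8), gop(-60.0, -60.0, 6)]
print(phoneme_score([normalized_gop(s) for s in raw]))
```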
In the current PLASER, all phonemes in a word are equally weighted.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 A 3-Color Feedback Scheme for Phoneme Confidence </SectionTitle> <Paragraph position="0"> The usefulness of an overall confidence score for a word may be limited, as it does not pinpoint the pronunciation accuracy of each phoneme in the word; thus, the user still does not know how to correct his mistakes when the score is poor. Any attempt to report phoneme confidence scores has to face the following two problems: + unless users can read phonemic transcriptions, it is not clear how to report the confidence scores at the phoneme level; and, + unless the phoneme confidence scores are highly reliable, reporting their precise values may be too risky.</Paragraph> <Paragraph position="1"> Our solution is visual feedback that colors the letters in the word's spelling to indicate the pronunciation accuracy of their associated phonemes. To do that, STEP 1: We first designed a rule-based algorithm to map each phoneme in the transcription of a word to its spelling letters. For example, for the word &quot;beat&quot; with the phonemic transcription &quot;/b/ /iy/ /t/&quot;, the three phonemes are mapped to the letters &quot;b&quot;, &quot;ea&quot; and &quot;t&quot; respectively. On the other hand, for the word &quot;eve&quot; with the phonemic transcription &quot;/iy/ /v/&quot;, the two phonemes are mapped to the letters &quot;e&quot; and &quot;v&quot; respectively, while the last letter &quot;e&quot; is not mapped to any phoneme.</Paragraph> <Paragraph position="2"> STEP 2: A novel 3-color scheme was devised to reduce the precision with which phoneme confidence scores are reported. Two thresholds were found for each phoneme to label its confidence as good, fair, or bad. If the confidence score of a phoneme is good/fair/bad, its corresponding spelling letter(s) is/are painted in blue/green/red respectively.
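The letter coloring of STEPs 1 and 2 can be sketched as follows. The threshold values and the pre-computed phoneme-to-letter groups are illustrative assumptions; PLASER derives class-dependent thresholds from development data and computes the letter groups with its rule-based mapping algorithm:

```python
def color_of(norm_gop, t_bad, t_good):
    """Map a normalized GOP to the 3-color scheme: below the lower
    threshold is 'red' (bad), at or above the upper threshold is
    'blue' (good), and 'green' (fair) in between."""
    if norm_gop < t_bad:
        return "red"
    if norm_gop >= t_good:
        return "blue"
    return "green"

def color_word(letter_groups, norm_gops, t_bad=0.4, t_good=0.7):
    """letter_groups: the spelling letters each phoneme maps to,
    e.g. ['b', 'ea', 't'] for 'beat' (/b/ /iy/ /t/). Letters mapped
    to no phoneme (the final 'e' of 'eve') would be painted gray."""
    return [(letters, color_of(g, t_bad, t_good))
            for letters, g in zip(letter_groups, norm_gops)]

print(color_word(["b", "ea", "t"], [0.9, 0.5, 0.2]))
# [('b', 'blue'), ('ea', 'green'), ('t', 'red')]
```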
Two examples are shown in Fig. 2. The use of colors is also more appealing to users.</Paragraph> <Paragraph position="3"> To find the two thresholds in the 3-color scheme, we treated the problem as a bi-threshold verification problem. The detailed algorithm is beyond the scope of this paper and is only briefly described here; for details, please refer to (Ho and Mak, 2003).</Paragraph> <Paragraph position="4"> Firstly, one has to decide how forgiving one wants to be and specify the following two figures: + the false acceptance rate (FA) for an incorrectly pronounced phoneme; and, + the false rejection rate (FR) for a correctly pronounced phoneme.</Paragraph> <Paragraph position="5"> If one sets FA very low, it will be hard to get &quot;blue&quot; scores; on the other hand, if one sets FR very low, the scheme may be too forgiving and &quot;red&quot; scores will rarely show up. Due to the bi-threshold nature of the problem, a simple method of determining the two thresholds will, in such circumstances, result in predominantly &quot;green&quot; scores with few &quot;blue&quot; or &quot;red&quot; scores. The more sophisticated algorithm in (Ho and Mak, 2003) tries to avoid that.</Paragraph> <Paragraph position="6"> Furthermore, due to the scarcity of training data in the development data set, the phonemes were grouped into 9 phoneme classes in PLASER, and class-dependent thresholds were determined from the development data set. The 9 phoneme classes are: affricates, diphthongs, fricatives, nasals, semi-vowels, stops, back vowels, mid vowels, and front vowels.</Paragraph> <Paragraph position="7"> [Fig. 2 caption] The figure has to be read with color printouts, or electronically on a color display. The letters marked with &quot;bad&quot;, &quot;fair&quot;, &quot;good&quot;, and &quot;unused&quot; are painted in red, green, blue, and gray respectively.</Paragraph> </Section> </Section> </Paper>