<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1069"> <Title>Speaker-Independent Phone Recognition Using BREF</Title> <Section position="3" start_page="0" end_page="345" type="metho"> <SectionTitle> THE BREF CORPUS </SectionTitle> <Paragraph position="0"> BREF is a large read-speech corpus containing over 100 hours of speech material from 120 speakers. The text materials were selected verbatim from the French newspaper Le Monde, so as to provide a large vocabulary (over 20,000 words) and a wide range of phonetic environments [4]. Containing 1115 distinct diphones and over 17,500 triphones, BREF can be used to train vocabulary-independent (VI) phonetic models. Hon and Lee [5] concluded that for VI recognition, the coverage of triphones is crucial. Separate text materials, with similar distributional properties, were selected for training, development test, and evaluation purposes. The selected texts consist of 18 "all phoneme" sentences, approximately 840 paragraphs, 3300 short sentences (12.4 words/sentence), and 3800 longer sentences (21 words/sentence). The distributional properties of the 3 sets of texts, and of the combined total, are shown in Table 1: the sets are well matched in their coverage of word and subword units and quite similar in their phone and diphone distributions. For comparison, the last column of the table gives the distributional properties of the original text of Le Monde.</Paragraph> <Paragraph position="1"> Each of 80 speakers read approximately 10,000 words (about 650 sentences) of text, and an additional 40 speakers each read about half that amount. The speakers, chosen from a subject pool of over 250 persons in the Paris area, were paid for their participation. Potential subjects were given a short reading test containing selected sentences from Le Monde representative of the type of material to be recorded [6], and subjects judged to be incapable of the task were not recorded. The recordings were made in stereo in a sound-isolated room and were monitored to verify their contents.</Paragraph> <Paragraph position="2"> Thus far, 80 training, 20 development test, and 20 evaluation speakers have been recorded. The number of male and female speakers for each subcorpus is given in Table 2. The ages of the speakers range from 18 to 73 years, with 75% between the ages of 20 and 40 years. In these experiments only a subset of the training and development test data was used, reserving the evaluation data for future use.</Paragraph> <Section position="1" start_page="0" end_page="345" type="sub_section"> <SectionTitle> Labeling of BREF </SectionTitle> <Paragraph position="0"> In order to be used effectively for phonetic recognition, time-aligned phonetic transcriptions of the utterances in BREF are needed. Since hand-transcription of such a large amount of data is a formidable task, and inherently subjective, an automated procedure for labeling and segmentation is being investigated.</Paragraph> <Paragraph position="1"> The procedure for providing a time-aligned broad phonetic transcription for an utterance has two steps. First, a text-to-phoneme module [10] generates the phone sequence from the text prompt. The 35 phones (including silence) used by the text-to-phoneme system are given in Table 3. Since the automatic phone sequence generation cannot always accurately predict what the speaker said, the transcriptions must be verified. The most common errors in translation occur with foreign words, names, and acronyms. Other mispredictions arise in the reading of dates: for example, the year "1972" may be spoken as "mille neuf cent soixante douze" or as "dix neuf cent soixante douze." In the second step, the phone sequence is aligned with the speech signal using Viterbi segmentation.</Paragraph>
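<Paragraph position="2"> As a concrete illustration of the first step, the sketch below expands an ambiguous token into its candidate readings and maps each reading to a phone string. It is a minimal Python sketch, not the text-to-phoneme module of [10]: the lexicon, phone symbols, and exception entries are all invented for illustration.

    # Step 1 (sketch): predict candidate phone sequences from a text prompt.
    # LEXICON and EXCEPTIONS are toy stand-ins for the real system's
    # pronunciation dictionary and exception dictionary.
    LEXICON = {
        "mille": "m i l", "neuf": "n oe f", "cent": "s an",
        "soixante": "s w a s an t", "douze": "d u z", "dix": "d i s",
    }
    # Ambiguous tokens (dates, foreign words, acronyms) expand to several
    # candidate readings; a human verifier later selects or corrects one.
    EXCEPTIONS = {
        "1972": ["mille neuf cent soixante douze",
                 "dix neuf cent soixante douze"],
    }

    def text_to_phones(prompt):
        """Return one candidate phone sequence per possible reading."""
        candidates = []
        for reading in EXCEPTIONS.get(prompt, [prompt]):
            phones = []
            for word in reading.split():
                phones.extend(LEXICON[word].split())
            candidates.append(phones)
        return candidates

    for seq in text_to_phones("1972"):
        print(" ".join(seq))

The verified phone sequence is then passed to the second step, which time-aligns it with the speech signal by Viterbi segmentation.</Paragraph>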
<Paragraph position="3"> The training and test sentences used in these experiments were processed automatically and manually verified prior to segmentation. The manual verification only corrected "blatant errors" and did not attempt to make fine phonetic distinctions. Comparing the predicted and verified phone strings, 97.5% of the 38,397 phone labels were assessed to be correct, with an accuracy of 96.6%. However, during verification about 67% of the automatically generated phone strings were modified. This indicates that verification is a necessary step for accurate labeling. The exception dictionary used by the text-to-phoneme system has been updated accordingly to correct some of the prediction errors, thereby reducing the work entailed in verification.</Paragraph> <Paragraph position="4"> Table 4 summarizes the phone prediction accuracy of the text-to-phoneme translation. 86% of the errors are due to insertions and deletions by the text-to-phoneme system; liaison and the pronunciation of mute-e account for about 70% of these.</Paragraph> <Paragraph position="5"> Liaison is almost always optional and thus hard to predict accurately. While most speakers are likely to pronounce mute-e before a pause, it is not always spoken. Whether or not mute-e is pronounced depends on the context in which it occurs and upon the dialect of the speaker. Substitutions account for only 14% of the errors, the most common being between /z/ and /s/, and between /e/ and /E/.</Paragraph> <Paragraph position="6"> An unanticipated problem was that some of the French speakers pronounced the English words present in the text prompt using the correct English phonemes, phonemes that do not exist in French. These segments were transcribed using the "English phones" listed in Table 3, which were added to the 35-phone set. However, so few occurrences of these phones were observed that for training they were mapped to the "closest" French phone. In addition, a few cases were found where what the speaker said did not agree with the prompt text, and the orthographic text needed to be modified. These variations were typically the insertion or deletion of a single word, and usually occurred when the text was almost, but not quite, a very common expression.</Paragraph> </Section> <Section position="2" start_page="345" end_page="345" type="sub_section"> <SectionTitle> Validation of automatic segmentation </SectionTitle> <Paragraph position="0"> A subset of the training data (roughly 12 minutes of speech, from 20 of the training speakers) was manually segmented to bootstrap the training and segmentation procedures. In order to evaluate the Viterbi segmentation, the phone recognition accuracy obtained using the manual segmentation for training was compared to the recognition accuracy obtained using Viterbi resegmentation (3 iterations) on the same subset of training data. For this comparison, 35 context-independent phone models, with 8 mixture components and no duration model, were used. The recognizer was tested on data from 11 speakers in the development test speaker set, and the averaged results are given in Table 5. Performance is estimated by the phone accuracy, given by 1 - (substitutions + deletions + insertions) / (number of reference phones). The recognition accuracies are seen to be comparable, indicating that, at least for the purposes of speech recognition, the Viterbi algorithm can be used to segment the BREF corpus once the segment labels have been verified. Including a duration model increases the phone accuracy to 58.0% with the Viterbi segmentation.</Paragraph>
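<Paragraph position="1"> The phone accuracy quoted above can be computed from a standard Levenshtein alignment of the reference and recognized phone strings. The sketch below is illustrative only: the phone strings in it are invented, and substitutions, deletions, and insertions are simply counted together as edit operations, as the formula requires.

    # Phone accuracy = 1 - (subs + dels + ins) / number of reference phones,
    # with the error count taken from a minimum-edit-distance alignment.
    def phone_accuracy(ref, hyp):
        n, m = len(ref), len(hyp)
        # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            dp[i][0] = i                          # all deletions
        for j in range(m + 1):
            dp[0][j] = j                          # all insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1,          # insertion
                               dp[i - 1][j - 1] + sub)    # match / substitution
        return 1.0 - dp[n][m] / n

    ref = "m i l n oe f s an".split()   # 8 reference phones
    hyp = "m i n oe f f s an".split()   # one deletion, one insertion
    print(phone_accuracy(ref, hyp))     # 1 - 2/8 = 0.75</Paragraph>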
<Paragraph position="2"> The segmentations determined by the Viterbi algorithm have also been compared to the manual segmentations on a new, independent set of test data. To do so, the offset in number of frames was counted, using the manual segmentation as the reference; silence segments were ignored. The test data consisted of 115 sentences from 10 speakers (4m/6f) and contained 6517 segments. 71% of the segment boundaries were found to be identical. 91% of the automatically found boundary locations were within 1 frame (96% within 2 frames) of the hand boundary location. The automatic boundaries were located later than the hand location for 23% of the segments, and earlier for 5% of the segments. This asymmetry may be due to the minimum duration imposed by the phone models.</Paragraph> </Section> </Section> </Paper>