<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1075"> <Title>Collection and Analyses of WSJ-CSR Data at MIT</Title> <Section position="3" start_page="367" end_page="368" type="metho"> <SectionTitle> DATA COLLECTION </SectionTitle> <Section position="1" start_page="367" end_page="367" type="sub_section"> <SectionTitle> The Environment </SectionTitle> <Paragraph position="0"> All the MIT data are collected in an office environment, where the ambient noise level is approximately 50 dB on the A scale of a sound-level meter. All utterances are collected simultaneously using two microphones. A Sennheiser HMD-410 noise-cancelling microphone is always used for one of the channels. For the other channel, we rotate three microphones across sessions: a Crown PCC-160 phase-coherent cardioid desk-top microphone, a Crown PZM-6FS boundary desk-top microphone, and a Sony ECM-50PS electret condenser lavaliere microphone. The data are collected using a Sun SPARCstation-II, which has been augmented with an Ariel DSP S-32C board and a ProPort-656 audio interface unit for data capture. The sampling rate is 16 kHz, and the signal is lowpass filtered at 7.2 kHz. The input gain is held constant, for all subjects, at a setting that maximizes the signal-to-noise ratio without clipping. Rather than transferring each collected sentence immediately to a remote file server for storage, and thus increasing the amount of delay between sentences, we store the speech data temporarily on a 200 MByte local disk.</Paragraph> <Paragraph position="1"> (The BREF corpus, collected by French researchers [2], contains over 200 hours of speech from over 100 subjects.)</Paragraph> <Paragraph position="2"> The prompt text, i.e., the text used to elicit speech material from the subjects, has been preprocessed by Doug Paul of Lincoln Lab to remove reading ambiguities inherent in written text [5]. Approximately half of the prompt text contains verbalized punctuation, whereas the remainder does not. The prompt text is displayed one paragraph at a time in the hope that this will encourage the subjects to produce sentence-level prosodic phenomena. The sentence to be recorded is highlighted in yellow, and the highlighting automatically moves forward to the next sentence once the previous sentence has been accepted. Four buttons (icons that can be activated with the mouse) are available for the subject to record, playback, accept, or unaccept an utterance. A push-and-hold mechanism is used for recording. We developed this user interface environment in the hope that it would enable subjects to record the data with minimum supervision.</Paragraph> <Paragraph position="3"> Our experience with pilot data collection indicates that this is indeed the case. In fact, this software and hardware environment has also been adopted by one of the two remaining sites collecting WSJ-CSR data (the other two participants are SRI and Texas Instruments).</Paragraph> </Section>
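The input gain described above is fixed at a level that maximizes signal-to-noise ratio without clipping. As a purely illustrative aside (not part of the original collection software), the sketch below shows how such a setting could be sanity-checked on a short calibration recording; the function name, the 16-bit sample assumption, and the speech/noise regions are ours.

```python
import numpy as np

SAMPLE_RATE = 16000      # 16 kHz, as used for the WSJ-CSR recordings
FULL_SCALE = 32768.0     # assumes 16-bit linear PCM samples

def check_gain(waveform, speech_region, noise_region):
    """Rough clipping and SNR check for one calibration utterance.

    waveform      : 1-D numpy array of integer samples
    speech_region : (start, end) sample indices containing speech
    noise_region  : (start, end) sample indices containing only room noise
    """
    x = waveform.astype(np.float64)
    peak = np.max(np.abs(x))
    clipped = peak >= FULL_SCALE - 1.0            # any sample at full scale?
    headroom_db = 20.0 * np.log10(FULL_SCALE / max(peak, 1.0))

    def rms(region):
        start, end = region
        return np.sqrt(np.mean(x[start:end] ** 2))

    snr_db = 20.0 * np.log10(rms(speech_region) / max(rms(noise_region), 1e-9))
    return {"clipped": clipped, "headroom_db": headroom_db, "snr_db": snr_db}

# Example: raise the gain until headroom is small but no sample clips.
# stats = check_gain(calibration_samples, speech_region=(8000, 48000),
#                    noise_region=(0, 8000))
```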
<Section position="2" start_page="367" end_page="368" type="sub_section"> <SectionTitle> The Process </SectionTitle> <Paragraph position="0"> Subjects were recruited from the MIT community and vicinity via e-mail and posters. They were separated into three categories depending on how their data would be used for system development and evaluation: speaker-independent (SI), speaker-adaptive (SA), and speaker-dependent (SD). An attempt was made to balance the speakers by sex, dialect, and age, particularly for the latter two groups, since the total number of speakers in these groups is relatively small.</Paragraph> <Paragraph position="1"> Data were collected in sessions of approximately 100 utterances (about 40 minutes per session). Each new subject was asked to read a set of instructions introducing them to the task. After that, the experimenter helped the subjects practice using the mouse for recording. The entire introduction took about 5 minutes. The subjects were then asked to read the designated set of 40 speaker adaptation sentences provided by Dragon Systems. The experimenter monitored the recording of the adaptation sentences, and asked the subject to repeat a sentence if a mistake was made. All subsequent recordings were made without supervision. Approximately half of the prompt texts for each subject contained verbalized punctuation. Subjects belonging to the SA and SD categories returned for multiple sessions; however, the introduction and the reading of the adaptation sentences took place only during the first session.</Paragraph> <Paragraph position="2"> Once the data were recorded, they were authenticated. To this end, we developed an interactive environment in which an experimenter could listen to an utterance, visually examine the waveform to detect truncation, and edit the orthographic transcription when necessary. Finally, the speech data and the corresponding orthographic transcriptions were written onto CD-ROM-compatible WORM disks for distribution.</Paragraph> </Section> <Section position="3" start_page="368" end_page="368" type="sub_section"> <SectionTitle> The Status </SectionTitle> <Paragraph position="0"> We started the collection of WSJ-CSR data in early October 1991, and completed the pilot collection by year end. Figure 1 shows the geographical distribution of all the subjects that we have recorded thus far. Their ages range from 17 to 52 years, with an average of 27.1 years and a standard deviation of 6.6 years. A breakdown of the amount of data collected in each of the three categories is shown in Table 1. While we only committed ourselves at the outset to collect up to 50% of the pilot data, in the final analysis we were able to collect nearly twice as much data in all categories. All the data that we collected, totaling more than 8 GBytes, have been delivered to NIST and other research institutions for system development, training, and evaluation.</Paragraph> </Section> </Section> <Section position="4" start_page="368" end_page="370" type="metho"> <SectionTitle> DATA ANALYSES </SectionTitle> <Paragraph position="0"> Since the WSJ-CSR speech corpus differs in many dimensions from the other corpora that we have collected thus far in the DARPA community, we thought it would be useful to compute some of its vital statistics. In this section, we describe some of the analyses that we have performed thus far.</Paragraph> <Paragraph position="1"> All the analyses are based only on the data from the training set, including the SI, SA, and SD categories. The results are summarized in Table 2.
In addition to computing various measures for the entire data set, we have also analyzed the adaptation sentences, and those with and without verbalized punctuation.</Paragraph> <Paragraph position="2"> Table 2 indicates that the MIT training set contains nearly 15,000 sentences, and the numbers of sentences with and without verbalized punctuation are approximately equal. These sentences contain over 250,000 words, resulting in an average of approximately 17 words per sentence. The sentence length ranges from one word to 31 words, with a standard deviation of 6.6 words. The sentences are considerably longer than any of the data that we have collected in other domains [1,6,8]. The adaptation sentences are generally shorter than the WSJ sentences. Some speakers found them difficult to pronounce, and needed to be corrected repeatedly, whereas others uttered them with no apparent difficulty. On average, verbalizing the punctuation adds an extra 2.5 words to each sentence.</Paragraph> <Paragraph position="3"> To compute the duration of these sentences, we first passed each sentence through an automatic begin-and-end detector to remove any extraneous silences. Altogether, the MIT training set contains almost 100,000 seconds of speech material, or about 27 hours. The average duration of the sentences is 6.5 seconds. The corresponding speaking rate is 156 words per minute, which is 30% higher than that for the spontaneous speech that we have collected [6]. This discrepancy is presumably due to the inherent difference in the way speech is elicited.</Paragraph> <Paragraph position="4"> In collecting the WSJ-CSR data, we hoped to provide an interface that was easy for the subjects to use, so that costly on-line monitoring was not necessary. However, this potential cost reduction may be offset by the cost of authentication if the subjects produce too many errors. The sentences containing errors have the added disadvantage of not being well matched to the language model, which is constructed from the prompt text.</Paragraph> <Paragraph position="5"> To gain some insight into the magnitude of this problem, we tabulated the discrepancies between the final orthographic transcription and the corresponding prompt text. The results, summarized in the last row of Table 2, show that 697 words, or 0.27%, were read in error (including substitutions, insertions, and deletions). Note that, while the number of words read in error for the adaptation sentences was one-tenth of that for the WSJ sentences, the percentage of errors for the adaptation sentences is only about one-third of that for the WSJ sentences. Recall that the adaptation sentences were read with an experimenter monitoring the process and instructing the subject to repeat when an error was detected. Thus, while monitoring the data collection process can reduce the errors by a factor of three, the magnitude of the problem is relatively small. Therefore, we believe our original hypothesis was reasonable.</Paragraph> <Paragraph position="6"> Example confusions can be seen in Table 3, which lists all substitutions (computed by finding the best alignment between the prompt and spoken word strings) that occurred two or more times in the training portion of the corpus. Note that many of these are due to the speaker expanding abbreviations (&quot;R. I.&quot; becomes &quot;Rhode Island&quot;, for example). Since this would not occur in the verbalized punctuation text (the prompt would be &quot;R .period I .period&quot;), it is likely that these expanded abbreviations accounted for the slightly higher error rate in the non-verbalized punctuation portions.</Paragraph>
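The error tabulation above is, in effect, a word-level alignment of each final transcription against its prompt, with substitutions, insertions, and deletions read off the best alignment. The paper does not include its alignment code; the sketch below is a minimal dynamic-programming version of that idea (standard minimum-edit-distance alignment over words), and all names in it are our own.

```python
from collections import Counter

def align_words(prompt, spoken):
    """Minimum-edit-distance alignment of a prompt against a transcription.

    Returns per-sentence counts of substitutions, insertions, and deletions,
    plus the list of substituted (prompt_word, spoken_word) pairs.
    """
    p, s = prompt.split(), spoken.split()
    # dp[i][j] = (cost, op) for aligning p[:i] with s[:j]
    dp = [[(0, None)] * (len(s) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        dp[i][0] = (i, "del")
    for j in range(1, len(s) + 1):
        dp[0][j] = (j, "ins")
    for i in range(1, len(p) + 1):
        for j in range(1, len(s) + 1):
            sub_cost = 0 if p[i - 1] == s[j - 1] else 1
            best = (dp[i - 1][j - 1][0] + sub_cost, "ok" if sub_cost == 0 else "sub")
            if dp[i - 1][j][0] + 1 < best[0]:
                best = (dp[i - 1][j][0] + 1, "del")
            if dp[i][j - 1][0] + 1 < best[0]:
                best = (dp[i][j - 1][0] + 1, "ins")
            dp[i][j] = best
    # Trace back to collect error counts and the substituted word pairs.
    counts, subs = Counter(), []
    i, j = len(p), len(s)
    while i > 0 or j > 0:
        op = dp[i][j][1]
        if op in ("ok", "sub"):
            if op == "sub":
                counts["sub"] += 1
                subs.append((p[i - 1], s[j - 1]))
            i, j = i - 1, j - 1
        elif op == "del":
            counts["del"] += 1
            i -= 1
        else:  # "ins"
            counts["ins"] += 1
            j -= 1
    return counts, subs

# Example: tally errors over (prompt, transcription) pairs and keep the
# substitutions seen two or more times, as in Table 3.
# totals, pairs = Counter(), Counter()
# for prompt, spoken in corpus:
#     c, subs = align_words(prompt, spoken)
#     totals += c
#     pairs.update(subs)
# frequent_subs = {pair: n for pair, n in pairs.items() if n >= 2}
```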
<Paragraph position="7"> In the final analysis, the entire MIT training set, containing 27 hours of usable speech, was collected in approximately 125 40-minute sessions (approximately 30 minutes of speaking plus 10 minutes of setup and instruction). Thus, approximately three hours of subject time is required to collect one hour of speech. Adding the overhead of recruiting and scheduling subjects, authentication, and other related administrative matters, we estimate that 6-8 hours of effort is needed for each hour of speech.</Paragraph> </Section> <Section position="5" start_page="370" end_page="371" type="metho"> <SectionTitle> EXPERIMENT ON TEXT PREPROCESSING </SectionTitle> <Paragraph position="0"> As mentioned earlier, the WSJ-CSR pilot effort is intended to satisfy our near-term research needs, so that researchers can begin to develop very large vocabulary speech recognition algorithms and systems. The pilot effort also affords us the opportunity to experiment with prompt text preprocessing and data collection procedures, so that we can refine the procedure for the final, and considerably larger, data collection initiative. In this section, we describe an experiment that we have conducted concerning the preprocessing of the text prompts.</Paragraph> <Paragraph position="1"> The prompt text used for the pilot collection has been preprocessed by Lincoln Lab [5]. The rationale for this preprocessing step is at least two-fold. First, by converting numbers and abbreviations to a standard format, one removes any ambiguity concerning how these items should be read. Second, forcing the subjects to read the text in some pre-determined format will result in speech data that is consistent with the language model, which is derived from a considerably larger quantity of text data.</Paragraph> <Paragraph position="2"> However, some researchers felt that this preprocessing step may unnecessarily restrict the ways these items can be pronounced. Thus the data that we collect may not accurately reflect realistic situations in which a user is asked to dictate.</Paragraph> <Paragraph position="3"> In order to gain some understanding of the effect of this preprocessing step, we recently conducted a small experiment. We first selected 100 sentences in the training set that contain one or more items that are candidates for preprocessing. Examples of some of the selected sentences are shown in Table 4. These sentences were presented to the subjects, unprocessed, for recording.</Paragraph> <Paragraph position="4"> Following the recording, each utterance was carefully transcribed orthographically, and the resulting transcription was then compared with the processed prompt text used during the pilot data collection to determine whether any discrepancies exist. For this experiment, we recruited 12 subjects, 6 male and 6 female.
Three male and three female subjects had served previously as subjects for the pilot collection effort. Thus, 12 readings were obtained for each of the 100 sentences.</Paragraph> <Paragraph position="5"> [Table 4: Examples of the selected sentences, e.g., &quot;Back then the distribution was $2.10 annually.&quot; &quot;For the 1987 first 9 months, it had a $2.4 M net loss.&quot; &quot;A W-4 form can be revived whenever necessary.&quot;]</Paragraph> <Paragraph position="6"> [Figure 2: Histogram of the number of distinct renditions produced by the 12 subjects for the 100 sentences.]</Paragraph> <Paragraph position="7"> [Figure 3: Breakdown of the discrepancies with the processed prompt text.]</Paragraph> <Paragraph position="8"> The results of the experiment can be analyzed in several ways. Figure 2 shows a histogram of the number of distinct renditions produced by the 12 subjects for the 100 sentences. There is considerable variation in the production of these sentences. The average number of distinct renditions is 3.9, with a standard deviation of 2.4. The figure shows that only 12 of the 100 sentences resulted in readings that agreed unanimously with the processed prompt text. Approximately half of the sentence tokens (601 out of 1,200) are identical to the corresponding prompt text. (Although the data set size is small, we observed only small differences due to prior experience with the WSJ data collection: experienced subjects agreed with the processed prompt text 315 times, whereas new subjects agreed only 286 times.) Figure 2 also shows that, for almost 90% of the sentences used in this experiment, the subjects produced at least one rendition that differed in some way from the processed prompt text. But is the processed prompt text the way our subjects prefer to produce these sentences? To answer this question, we computed the rank of the processed prompt text for each sentence. This showed that the processed prompt text corresponds to (or is at least tied with) the most frequently produced rendition in over 60% of our sentences. Over 90% of the time, it is within the top three.</Paragraph> <Paragraph position="9"> A closer examination of the 100 sentences showed that there were 171 locations where there was a discrepancy between the processed prompt text and at least one of the 12 recorded orthographies. 49 of these seemed to be reading errors: they consisted of a single word deletion, insertion, or substitution, and were typically produced by only one of the 12 speakers. An additional 14 discrepancies were due to the addition of verbalized sentence punctuation (the subjects were not asked to verbalize punctuation).</Paragraph> <Paragraph position="10"> Figure 3 shows a breakdown of the orthographies associated with the remaining 108 discrepancies (which correspond to 1,296 substrings). 635, or 49%, of these substrings corresponded to the processed prompt text. Our analysis divided the majority of the remainder into three categories: numbers, abbreviations, and dates.</Paragraph>
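The rendition counts and prompt-text ranks above are simple tallies over the 12 orthographies collected per sentence. The sketch below is an illustrative reconstruction of that bookkeeping, not the authors' code; the data layout (one processed prompt plus a list of 12 transcription strings per sentence) and all names are assumptions made for the example.

```python
from collections import Counter

def rendition_stats(prompt, transcriptions):
    """Summarize how the readings of one sentence relate to its processed prompt.

    prompt         : the processed prompt text for the sentence
    transcriptions : list of orthographic transcriptions (one per subject)
    """
    counts = Counter(transcriptions)        # distinct renditions and their frequencies
    n_distinct = len(counts)
    prompt_tokens = counts.get(prompt, 0)   # sentence tokens identical to the prompt
    # Rank of the prompt among the renditions (1 = most frequent; ties share a rank).
    prompt_rank = 1 + sum(1 for c in counts.values() if c > prompt_tokens)
    unanimous = n_distinct == 1 and prompt_tokens == len(transcriptions)
    return {"distinct": n_distinct, "prompt_tokens": prompt_tokens,
            "prompt_rank": prompt_rank, "unanimous": unanimous}

# Example over the whole experiment, with sentences = [(prompt, [12 strings]), ...]:
# stats = [rendition_stats(p, ts) for p, ts in sentences]
# top_ranked = sum(s["prompt_rank"] == 1 for s in stats)   # "most frequent" cases
# top_three  = sum(s["prompt_rank"] <= 3 for s in stats)   # "within the top three"
# agree_all  = sum(s["unanimous"] for s in stats)          # unanimous agreement
```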
<Paragraph position="11"> Numbers were involved in 81 of the 108 discrepancies and, as shown in Figure 3, were mainly due to five factors. The most frequently occurring variation (169 instances) was where the word &quot;and&quot; was inserted into a string in order to break up a large number sequence (e.g., &quot;two hundred and thirty four&quot; instead of &quot;two hundred thirty four&quot;). The second most common source of variation (122 instances) involved monetary denominations; in these cases the word &quot;dollar&quot; was often deleted. The third factor involved variations in the way decimal numbers were spoken (108 instances). These changes typically involved changing a digit sequence to tens or teens (e.g., &quot;two point thirty four&quot; instead of &quot;two point three four&quot;), or substituting the word &quot;zero&quot; for the word &quot;oh&quot; (e.g., &quot;one point zero two&quot; instead of &quot;one point oh two&quot;). The remaining two factors involved 60 instances where the word &quot;zero&quot; was deleted from (or replaced by the word &quot;oh&quot; in) a purely decimal number (e.g., &quot;point three percent&quot; instead of &quot;zero point three percent&quot;), and 33 instances where the word &quot;one&quot; was replaced by &quot;a&quot; in a number or fraction beginning with a one (e.g., &quot;one and a half&quot; instead of &quot;one and one half&quot;).</Paragraph> <Paragraph position="12"> Abbreviations accounted for 20 additional discrepancies. As shown in Figure 3, eleven of these discrepancies involved 40 instances where subjects said the contracted form of an abbreviation (e.g., &quot;Corp&quot; or &quot;Inc&quot;) instead of the expanded form used in the processed prompt text. Conversely, there were five substrings where nearly half the subjects (a total of 29 out of 60 instances) did expand a string that was not expanded in the processed prompt text (e.g., &quot;E.S.T.&quot; spoken as &quot;Eastern Standard Time&quot;). The third factor accounting for variations in the way abbreviations were pronounced was the word &quot;slash&quot;, as in &quot;P.S. slash two&quot;. Subjects had a definite preference for deleting the slash in this context, although two returning subjects did remember to say the slash in 3 instances out of 24.</Paragraph> <Paragraph position="13"> The remaining seven discrepancies involved dates, and were nearly all due to the day being spoken as a cardinal number (e.g., &quot;ten&quot;) rather than the ordinal number (e.g., &quot;tenth&quot;) provided by the prompt text. The cardinal number was used 18 times in our data. The single exception to this was one instance where a subject said &quot;the seventh&quot; instead of &quot;seventh&quot;.</Paragraph> <Paragraph position="14"> Taken together, these nine factors were involved in 104 of the 108 discrepancies, and accounted for all but 44 of the 1,296 substrings uttered by the subjects (96.6%).</Paragraph> <Paragraph position="15"> These remaining differences nearly all involved numbers, and could, of course, be analyzed further (for instance, three of the remaining discrepancies involved report numbers, where the number was often spoken as a sequence of single digits). However, the results of our investigation indicated that, although there is large variation in the way the subjects spoke these unprocessed sentences, the types of variation are fairly limited. In addition, the magnitude of these variations would be smaller in the overall corpus, since we presented only unprocessed sentences that seemed to have ambiguous realizations. Nevertheless, we are still faced with the question of whether or not to preprocess the data. Before we can answer this question definitively, it is important that we conduct further study on a larger sample of sentences using a larger number of subjects.
In the end, the decision of whether to preprocess the text will have to be made by the community that will be the consumers of the resulting data, after considering the objectives of the research program and the trade-off between a more reliable language model and more realistic speech data.</Paragraph> </Section> </Paper>