<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1076">
  <Title>SPONTANEOUS SPEECH COLLECTION FOR THE CSR CORPUS</Title>
  <Section position="4" start_page="0" end_page="374" type="metho">
    <SectionTitle>
3. METHOD
</SectionTitle>
    <Paragraph position="0"> User Interface. The speech data collection was performed with user interface software designed by Mike Phillips for collection of read speech [2]. MIT provided this software to SRI, where it was slightly modified for use in collecting spontaneous speech. The interface requires a talk button to be pushed and held down in order to record speech; another button must be explicitly pushed to accept the sentence.</Paragraph>
    <Paragraph position="1"> The most recent sentence is always available for playback.</Paragraph>
    <Paragraph position="2"> Material Selection. Several issues seemed important in the selection of materials to be used in story generation.</Paragraph>
    <Paragraph position="3">  1. It seemed appropriate to select material that would match the content of the WSJ: its vocabulary and topics. 2. To ensure that speech is truly spontaneous and not just read from source material, it is best to provide material that gives enough information without presenting it in a format that would encourage subjects simply to read it aloud. Subjects need to come up with their own wording.</Paragraph>
    <Paragraph position="4"> 3. Subjects should be set up to maximize the likelihood of success. Any reasonable accommodation that produces appropriate spontaneous material is acceptable.</Paragraph>
    <Paragraph position="5"> The materials provided to subjects changed over the course of the experiment. At first, subjects were provided with recent news articles or letters to the editor and were asked to prepare an outline of the material, put aside the original article, and then dictate from the outline or notes. In later sessions most subjects preferred to work from notes or press releases they had brought themselves and were, therefore, familiar with. Thus, subjects were encouraged to come to the session with topics and notes prepared.</Paragraph>
    <Paragraph position="6"> Subjects. Twelve subjects generated spontaneous news stories. Four of the twelve generated two sets each of 80 spontaneous sentences. The other eight provided one set each of 80 sentences. The twelve included seven journalists  and three SRI employees. Three journalists were from the Stanford Daily and four from the Peninsula Times-Tribune (a local daily). The other two subjects were one journalist currently doing public relations work under contract and a former broadcast journalist.</Paragraph>
    <Paragraph position="7"> Subject Recruiting and Selection. We chose to try to use journalists for this task. Not only did the particular task of news-style dictation lend itself to the use of journalists, but news writers seemed likely to be able to perform the task without undue effort.</Paragraph>
    <Paragraph position="8"> Preference was given to individuals who had dictation experience. Given time constraints, we were unable to limit subjects to only those with dictation experience, and doing so might very well have also imposed an age constraint: most journalists who were in the field prior to the proliferation of PCs and word processors have dictated news stories; younger journalists have not.</Paragraph>
    <Paragraph position="9"> We found subjects by first contacting a couple of local newspapers and speaking with the editor-in-chief or whoever they referred us to. After briefly describing the project and our needs, we asked for feedback about the level of interest that we could expect at what rate of pay. Journalists at major papers (where we would be more likely to find large numbers of speakers with dictation experience) typically wanted $35-$50/hr. At smaller papers we were able to find interest in the $20-$30/hr. range.</Paragraph>
    <Paragraph position="10"> We were able to find enough people for the pilot study at a rate of $20/hr. Several of these speakers expressed an interest in coming back to do more dictation.</Paragraph>
    <Paragraph position="11"> Potential subjects were first screened over the phone. After describing the project and the time commitment involved, we then asked the potential subject to &quot;pick a topic of interest&quot; -- a story/column they are currently working on, or a current issue in the news -- and dictate a brief story on that topic over the phone. We typically asked them to do this two or three times, to give us an idea of how easily they could come up with material.</Paragraph>
    <Paragraph position="12"> Procedures. On arrival at the first recording session, the subject was asked to read a complete set of written instructions. The instructions are reproduced in the appendix. Next, subjects filled out a short information sheet about themselves, including a description of any prior dictation experience.</Paragraph>
    <Paragraph position="13"> The data collection software was then demonstrated, and the subject was allowed to practice using the push-to-talk button. The practice session included 1-2 paragraphs of Wall Street Journal read text without any verbalized punctuation, and 1-2 paragraphs with verbalized punctuation. For 10 of the 12 subjects, this was the only exposure to the Wall Street Journal read text material prior to producing spontaneous data.</Paragraph>
    <Paragraph position="14"> The first real data recorded from each subject consisted of 40 adaptation sentences that were read immediately following the practice session. Thus, by the time subjects were ready to start the spontaneous speech collection sessions, they were already fairly comfortable with the various controls available on the MIT collection software, including functions for reviewing, accepting, and rejecting utterances. The remainder of the first recording session was devoted to spontaneous speech collection. Each set of 80 sentences comprised one session of 40 sentences with no verbalized punctuation (NVP) and one session of 40 sentences with verbalized punctuation (VP) [3]. All subjects generated the 40 sentences without punctuation first. The decision to order the collection this way was based on subject feedback regarding anticipated difficulty of the two tasks, and the experimenter's observations that subjects did in fact tend to have more difficulty with the verbalized punctuation condition. Subjects were instructed to imagine that they were using a real speech-to-text dictation system to generate news-style articles as though intending to submit the articles for publication in a major newspaper. They were told that they could assume that the articles would be reviewed and edited before publication, so that they did not need to worry about making them perfect.</Paragraph>
    <Paragraph position="15"> One goal of this project was to learn something about what people would expect to do naturally if they were using a speech-to-text dictation system. For this reason, the experimenter tried to control the process as little as possible. Subjects were allowed to use their own judgment as to whether or not a sentence was &quot;good&quot; and should be accepted. The first two or three subjects were handed a variety of source materials and instructed to find topics of interest, jot down some notes, and then dictate from the notes. After some experience and feedback from these first few subjects, the experimenter began instructing subjects over the phone in advance to &quot;come prepared.&quot; We briefly described the task as one in which the subject would be asked to make up and dictate short news-style articles. We asked that they have several topics in mind about which they could create brief stories.</Paragraph>
    <Paragraph position="16"> Subjects were encouraged to use ideas from current stories they were working on and from recent articles they had done. It was suggested that they bring notes to work from as long as those notes were in &quot;cryptic&quot; form; they were specifically instructed not to bring in completed articles or notes in sentence form. Most subjects found this to be a much easier task than working from the SRI-supplied materials; however, the ability to control for content/vocabulary was lost. Most subjects thus did the majority of their &quot;stories&quot; from ideas they brought with them, and turned to the newspapers and other material provided by SRI only if they ran out of ideas before finishing the required 40 sentences. Subjects returned for at least one additional recording session during which they completed reading the text portion of the collection, and also read back their own spontaneously produced sentences. The four subjects who generated a second set of 80 spontaneous sentences did so after having completed a significant amount of read text.</Paragraph>
    <Paragraph position="17"> The schedule of a typical subject, then, was as follows:</Paragraph>
  </Section>
  <Section position="5" start_page="374" end_page="374" type="metho">
    <SectionTitle>
4. RESULTS
</SectionTitle>
    <Paragraph position="0"> The results of SRI's work with spontaneous speech are of three types: information about the cost in time or money to collect the material; information about the characteristics of the spontaneous material itself; and information about subject reactions to the procedure.</Paragraph>
    <Section position="1" start_page="374" end_page="374" type="sub_section">
      <SectionTitle>
4.1. Production Cost
</SectionTitle>
      <Paragraph position="0"> A principal concern about the collection of spontaneous speech is that the cost is high and variable. Because the pilot CSR data collection had pairs of collection sessions from the same speaker, one spontaneous session and another session during which a clean, written version of this spontaneous material was read, we have a good basis of comparison for the cost of spontaneous vs. read speech.</Paragraph>
      <Paragraph position="1"> Figure 1 displays the distributions of recording session times in four collection conditions: spontaneous vs. read spontaneous, with verbalized punctuation (VP) vs. no verbalized punctuation (NVP). The recording session time is approximated by the difference in time between the completion of the first sentence and the last sentence in a 40-sentence session. This measure leaves out the variable preparation time that often occurs before the first spontaneous sentence in a session. This preparation time was typically about 5 minutes. There are 16 sessions in each of the 4 conditions.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="374" end_page="376" type="metho">
    <SectionTitle>
4 conditions.
</SectionTitle>
    <Paragraph position="0"> Actual observed speaker time was about 5.5 hours for speaker-independent test conditions, which is slightly above our initial estimates. Data collection supervisor time for setup, microphone check, transcription, tape processing, scheduling, and other miscellaneous activities (including the elapsed data collection/speaker time) was approximately  It may help to make a direct comparison of the time it takes to collect and transcribe (orthographically) a single set of 80 sentences, when one set is read and the other is spontaneous. The following table gives times required for collection itself (Subject &amp; Experimenter), and the transcription times for generating prompt texts (.ptx) and detailed orthographic forms (.dot).</Paragraph>
    <Paragraph position="1">  There were several additional costs for the spontaneous speech. The speaker cost is higher because we needed to pay more to attract journalists. The spontaneous recordings also involve costs for preparing materials. These costs were minimal for this study, but for future efforts we expect to gather or create &quot;fact sheets&quot; and other prompt materials.</Paragraph>
    <Section position="1" start_page="375" end_page="376" type="sub_section">
      <SectionTitle>
4.2. Characteristics
</SectionTitle>
      <Paragraph position="0"> The material generated in the spontaneous sessions differed from real WSJ text. The differences occurred in several characteristics: content, vocabulary, paragraph size, speech rate. Furthermore, there were differences in both central tendency and in variability in most measures. The following table lists several obvious differences.</Paragraph>
      <Paragraph position="1">  First, the number of different paragraphic topics that comprise a 40-sentence session was 11 in the WSJ material and about 6 or 7 in the spontaneous material. That is, spontaneous speakers like to keep going on a topic for six or seven sentences, whereas the WSJ cuts stories into paragraphs of about three or four sentences. Second, in a similar number of sentences, the spontaneous talkers used more words and more different words to construct longer sentences. Even at the session level, speakers used more different word types.</Paragraph>
      <Paragraph position="2"> Third, and most characteristically, the sessions were much more variable in the spontaneous condition. The WSJ materials are relatively uniform, and the spontaneous materials are more varied, both in the range of word types (shown in the table) and in sentence length and other measures not shown.</Paragraph>
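      <Paragraph> The session-level measures discussed above (number of words, number of different words, and sentence length) can be computed along the following lines. This is a minimal sketch, not the procedure used in the study: the function name `session_stats`, the whitespace tokenization, and the case folding are assumptions, and the two example sentences are invented for illustration.

```python
from statistics import mean


def session_stats(sentences):
    """Summarize one session given its sentences as plain-text strings.

    Assumption: words are separated by whitespace and compared
    case-insensitively; the source does not say how words were counted.
    """
    per_sentence = [s.lower().split() for s in sentences]
    tokens = [word for words in per_sentence for word in words]
    return {
        "sentences": len(per_sentence),
        "tokens": len(tokens),          # running words
        "types": len(set(tokens)),      # different words
        "mean_sentence_length": mean(len(words) for words in per_sentence),
    }


# Hypothetical two-sentence session.
example = ["The council met on Monday", "The vote was delayed"]
print(session_stats(example))
```

Computing the same dictionary for every session in a condition would allow both the central tendency and the spread of each measure to be compared across conditions; an empty session raises an error from `mean`.</Paragraph>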
      <Paragraph position="3"> Figure 2, below, displays speech rate measured for materials recorded in four different conditions: read WSJ text vs. spontaneous text; and each with verbalized punctuation vs. with no verbalized punctuation. The speech rate for these materials is approximated by dividing the number of words in a sentence (including verbalized punctuation words) by the length of the file in time, without any endpointing or allowance for sentence internal silence. Thus, the measure is adequate for comparisons here, but cannot be taken as absolute or be compared to other figures.</Paragraph>
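      <Paragraph> The speech-rate approximation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `speech_rate`, the whitespace tokenization, the spelling of verbalized punctuation as ordinary words, and the words-per-second unit are all assumptions, and the example sentence and duration are invented.

```python
def speech_rate(transcript, file_duration_seconds):
    """Approximate speech rate for one sentence file, in words per second.

    Every whitespace-separated token counts as a word, including
    verbalized punctuation such as COMMA or PERIOD. The whole file
    duration is used: no endpointing and no allowance for silence
    inside the sentence, so leading and trailing silence lower the rate.
    """
    if not file_duration_seconds > 0:
        raise ValueError("file duration must be positive")
    word_count = len(transcript.split())
    return word_count / file_duration_seconds


# Hypothetical sentence with verbalized punctuation in a 4-second file.
rate = speech_rate("the market fell sharply COMMA analysts said PERIOD", 4.0)
print(rate)
```

Because silence is included in the denominator, the value is only meaningful for comparing conditions recorded the same way, which matches the caveat in the text.</Paragraph>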
      <Paragraph position="6"> As can be seen in Figure 2, the speech rates observed are slower for spontaneous speech than for read speech, and are slower for speech produced with verbalized punctuation in either form. Again, the spontaneous material is also more variable than the read material.</Paragraph>
    </Section>
    <Section position="2" start_page="376" end_page="376" type="sub_section">
      <SectionTitle>
4.3. Subject Reaction
</SectionTitle>
      <Paragraph position="0"> Materials. Subjects were most comfortable working from topics and materials that they brought with them. Most were not able, however, to be prepared on enough topics to do all of their collection this way. The next favored method was to use news releases or &quot;fact sheets,&quot; as most journalists are accustomed to using these as sources.</Paragraph>
      <Paragraph position="1"> An advantage of having subjects come with their own ideas and materials was that they tended to be more fluent and able to perform the task with greater ease when talking about topics with which they were familiar. In addition, they tended to produce longer and more complex sentences when covering familiar topics.</Paragraph>
      <Paragraph position="2"> Verbalized Punctuation. The dictation with verbalized punctuation was perceived by subjects to be more difficult than dictation without punctuation. Subjects also reported that including all punctuation did not seem natural. Speakers did seem to become more comfortable with the task with practice, but their use of punctuation was inconsistent. Certain types of punctuation, such as quotation marks and commas used to offset items in a series, were seldom left out.</Paragraph>
      <Paragraph position="3"> Other punctuation marks, such as hyphens and dashes, were often omitted or used incorrectly.</Paragraph>
      <Paragraph position="4"> The actual collected corpus does not really reflect the extent of these inconsistencies since either the speaker or the monitor would often catch the worst cases and the speaker would repeat the whole utterance.</Paragraph>
      <Paragraph position="5"> Collection Paradigm. Speakers complained about having to speak one sentence at a time. They wanted to speak non-stop while the thoughts were there, and not have to wait for the machine.</Paragraph>
      <Paragraph position="6"> Several subjects observed that a more natural way of doing dictation would be to speak in paragraphs, but with the capability to pause, i.e., stop recording while they think, then restart from the point where they left off. With this type of collection paradigm, some verbalized punctuation would be natural. In particular, it would seem natural to say &quot;PERIOD&quot; to indicate the end of one sentence before beginning the next.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="376" end_page="377" type="metho">
    <SectionTitle>
5. CONCLUSIONS
</SectionTitle>
    <Paragraph position="0"> Several results are evident:  1. The task can be done, and with a fairly predictable rate of production. The total cost per sentence is about three or four times greater than that of similar read material. 2. The journalists do seem better at this task than other subjects of similar educational level.</Paragraph>
    <Paragraph position="1"> 3. Solicitations at local papers did not generate a large number of interested subjects.</Paragraph>
    <Paragraph position="2"> 4. Subjects with prior dictation experience do better at this task than those without such experience.</Paragraph>
    <Paragraph position="3"> 5. Subjects with more experience produced longer, more complex sentences.</Paragraph>
    <Paragraph position="4"> 6. Given the current editing tools, most subjects produce rather smooth and fluent spontaneous materials, primarily by rejecting whole utterances.</Paragraph>
    <Paragraph position="5"> 7. The spontaneous material is spoken more slowly and is generally much more variable than the read WSJ material.  Summary. Relatively fluent, spontaneously generated news stories can be collected at about four times the cost of read materials. Analysis of the materials is incomplete because the collection was just finished and because the most important analysis will be done by the sites that use the data to run experiments.</Paragraph>
    <Paragraph position="6"> Ongoing Research. SRI is currently collecting speech from an additional 8 test speakers. The current work includes experimentation with different ordering of collection sessions, and different materials and instructions for eliciting spontaneous speech. Because of the negative feedback regarding the verbalized punctuation condition, we are having some of the current subjects collect an extra set of 80 spontaneous sentences with no instructions regarding punctuation.</Paragraph>
  </Section>
  <Section position="8" start_page="377" end_page="377" type="metho">
    <SectionTitle>
6. ACKNOWLEDGEMENT
</SectionTitle>
    <Paragraph position="0"> SRI acknowledges support for this work from DARPA through NIST contract 50SBNBOC6211 D.C. 1040. The government has certain rights to this material. Opinions, findings, conclusions or recommendations expressed here are those of the authors and do not necessarily reflect the views of any government agencies.</Paragraph>
  </Section>
</Paper>