<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2902">
  <Title>Analysis and Processing of Lecture Audio Data: Preliminary Investigations</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Corpus Creation and Annotation
</SectionTitle>
    <Paragraph position="0"> In our efforts to date, we have created an initial corpus of approximately 270 hours containing lectures from six different courses, and from over 80 seminars given on a variety of topics. On average, each course contained over 30 lecture sessions. These data were recorded with an omni-directional microphone (as part of a video recording), generally in a classroom environment.</Paragraph>
    <Paragraph position="1"> To provide data for acoustic and language model training, we are in the process of generating transcriptions for the lecture material we have collected to date. An initial set of transcriptions has been generated by an audio transcription service. The transcription service was instructed to pay careful attention to generating a correct literal transcription of what was spoken (and not a &amp;quot;clean&amp;quot; transcript with disfluencies such as filled pauses and false starts removed). In addition to the spoken words, the transcription service also provided the following annotations: (1) occasional time markers, usually at obvious pauses or sentence boundaries, (2) locations of speaker changes (labeled with speaker identities when known), and (3) punctuation based on the transcriber's subjective assessment of the structure of the spoken utterances.</Paragraph>
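    <Paragraph> Annotations like these lend themselves to a simple line-oriented file layout. The sketch below assumes a hypothetical pipe-delimited format (time marker, speaker label, text); the actual format delivered by the transcription service is not specified here.

```python
def parse_line(line):
    """Split a transcript line of the hypothetical form
    'HH:MM:SS | speaker | text' into its three fields."""
    time, speaker, text = (p.strip() for p in line.split("|", 2))
    return time, speaker, text

print(parse_line("00:12:34 | instructor | okay so, uh, let's begin"))
```
</Paragraph>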
    <Paragraph position="2"> In addition to the audio data, we have obtained electronic versions of texts associated with three of these courses, and over 100 summaries of lecture content for one of them. We have also obtained electronic notes and presentations for another course. These resources will be used for our research involving written and spoken data.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Analysis of Lecture Characteristics
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Qualitative Analysis
</SectionTitle>
      <Paragraph position="0"> As illustrated in Figure 1, lecture data has much in common with casual or spontaneous speech data, including false starts, extraneous filler words (such as &amp;quot;okay&amp;quot; and &amp;quot;well&amp;quot;), and non-lexical filled pauses (such as &amp;quot;uh&amp;quot; or &amp;quot;um&amp;quot;). One can also easily observe that the colloquial nature of the data is dramatically different in style from the presentation of the same material in a textbook. For example, one linear algebra textbook covers this material using a section header that reads, &amp;quot;8 Rules of Matrix Multiplication,&amp;quot; followed by text that reads, &amp;quot;The method for multiplying two matrices A and B to get C = AB can be summarized as follows...&amp;quot; The section header and introductory sentence express the same information as the ten utterances spoken in Figure 1. In other words, the textual format is typically more concise and better organized.</Paragraph>
      <Paragraph position="1"> Apart from poor planning at the sentence level, lecture speech often exhibits poor planning at higher structural levels as well. For example, tangential threads digressing from the current primary theme are common in spontaneous speech. This is exemplified by the brief diversion into matrix inversion in utterances (4), (5) and (6). This off-theme digression occurs only three utterances after the primary theme of &amp;quot;the rules for matrix multiplication&amp;quot; is introduced in (1).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Quantitative Analysis
</SectionTitle>
      <Paragraph position="0"> In order to better quantify the characteristics of lecture data, we have recently examined a set of 80 lectures taken from three undergraduate courses in math, physics, and computer science. The total number of words in each approximately one-hour lecture ranged between 5K and 12K words, with an average of nearly 7K words and a standard deviation of 1.5K words. The number of unique words used per lecture ranged from 500 to 1,100 words, with an average of 800 words and a standard deviation of 170 words. A preliminary assessment of spontaneous speech phenomena showed that there tended to be fewer filled pauses than in Switchboard (1% vs. 3%), although there were similar amounts of partial words (1%) and contractions (3-4% vs. 5%) in the data we observed. It is also clear that the behavior will very much depend on the lecturer. However, on the basis of these results, we hypothesize that in terms of spontaneous speech phenomena, the lecture data is closer to Switchboard quality than it is to a more carefully spoken corpus such as Broadcast News.</Paragraph>
      <Paragraph position="1"> [Figure 1: Ten consecutive utterances from a linear algebra lecture.] (1) I've been talking - I've been multiplying matrices already, but certainly time for me to discuss the rules for matrix multiplication. (2) And the interesting part is the many ways you can do it, and they all give the same answer. (3) So it's - and they're all important. (4) So matrix multiplication, and then, uh, come inverses. (5) So we're - uh, we - mentioned the inverse of a matrix, but there's - that's a big deal. (6) Lots to do about inverses and how to find them. (7) Okay, so I'll begin with how to multiply two matrices. (8) First way, okay, so suppose I have a matrix A multiplying a matrix B and - giving me a result - well, I could call it C. (9) A times B. Okay. (10) Uh, so, l- let me just review the rule for w- for this ...</Paragraph>
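      <Paragraph> The per-lecture statistics reported above (total words, unique words, filled-pause rate) can be computed with a few lines of code. The sketch below uses a naive lowercased whitespace tokenization, so its numbers would only approximate those in the paper.

```python
from collections import Counter

FILLED_PAUSES = {"uh", "um"}  # the non-lexical filled pauses noted above

def lecture_stats(transcript):
    """Per-lecture totals: word count, unique words, filled-pause rate."""
    words = transcript.lower().split()
    counts = Counter(words)
    total = len(words)
    fp = sum(counts[w] for w in FILLED_PAUSES)
    return {"total": total,
            "unique": len(counts),
            "filled_pause_rate": fp / total if total else 0.0}

print(lecture_stats("okay so uh let me uh review the rule okay"))
```
</Paragraph>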
      <Paragraph position="8"> As a preliminary examination of vocabulary usage, we measured the out-of-vocabulary (OOV) rate of the lecture material as a function of vocabulary size, where the words in the vocabulary were the most frequently occurring words for a given set of training data. Figure 2 displays the OOV rate vs. vocabulary size for a variety of speech and text training sources on the latter half of the computer science lectures (approximately 10 hours of speech). Each curve plots the OOV rate as a function of the most frequent words from a particular set of training material. Curves (A) and (B) show the results using the 64K-word Broadcast News and 27K-word Switchboard lexicons, respectively. Curve (C) was computed from the combined lectures from the math and physics courses. The remaining curves were all computed from subject-specific material. Curve (D) was computed from a companion textbook, while curve (E) was computed from the first half of the computer science lectures. Curve (F) was computed from a combination of the text and lecture transcripts from the course (i.e., (D)+(E)).</Paragraph>
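      <Paragraph> The OOV measurement described above amounts to ranking training words by frequency, truncating the ranked list at each vocabulary size, and counting the test tokens that fall outside the vocabulary. A minimal sketch with toy data and naive whitespace tokenization:

```python
from collections import Counter

def oov_curve(train_words, test_words, sizes):
    """OOV rate on test_words when the vocabulary is the n most
    frequent training words, for each n in sizes."""
    ranked = [w for w, _ in Counter(train_words).most_common()]
    rates = []
    for n in sizes:
        vocab = set(ranked[:n])
        oov = sum(1 for w in test_words if w not in vocab)
        rates.append(oov / len(test_words))
    return rates

train = "the rule for matrix multiplication the matrix".split()
test = "the inverse of a matrix".split()
print(oov_curve(train, test, [1, 5]))
```
</Paragraph>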
      <Paragraph position="9"> [Figure 2: OOV rate vs. vocabulary size as a function of training material. Each curve plots the OOV rate in lectures from the latter half of a computer science (CS) course as a function of the most frequent words from a particular set of training material. The vocabularies for curves D-F utilize subject-specific material from a textbook and/or the first half of the CS lectures.] If one considers the best vocabulary to be one that has a small OOV rate and a small size, the best matching data was obtained, not surprisingly, from subject-specific material. Even material from non-subject-related lectures matches the test data better than data from general human-human conversations or broadcast news. However, we have also observed (not plotted) that a combination of general lecture and conversational material, combined with related text material, can produce behavior similar to subject-specific speech material.</Paragraph>
      <Paragraph position="10"> In order to examine the impact of language model training data on predicting word usage in lecture material, we created a 3.3K-word vocabulary exactly covering the latter half of the computer science lectures. We then created trigram language models from a variety of sources (ignoring OOV words) using the SRILM Toolkit (Stolcke, 2002), and measured their perplexity on this data. The results, shown in Table 1, indicate again, not surprisingly, that spoken material provides the most constraints. Text material from Broadcast News or even the course textbook is a poor predictor of language usage. Models of general human conversations do significantly better, and data from general lectures is better still than arbitrary conversations. It was interesting to observe that a mixture of subject-specific textbook material and example lectures provided the most constraints for new lecture material, although there is still a considerable gap between this and the case of training the language model on the test set.</Paragraph>
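      <Paragraph> The experiments above use the SRILM Toolkit; the sketch below instead illustrates the underlying perplexity computation with a deliberately simple add-one-smoothed trigram model (not SRILM's actual smoothing), skipping OOV words as in the setup described above.

```python
import math
from collections import Counter

def train_trigram(words):
    """Collect trigram and bigram counts plus the training vocabulary."""
    tri = Counter(zip(words, words[1:], words[2:]))
    bi = Counter(zip(words, words[1:]))
    return tri, bi, set(words)

def perplexity(model, words):
    """Add-one-smoothed trigram perplexity, ignoring OOV words."""
    tri, bi, vocab = model
    words = [w for w in words if w in vocab]
    V = len(vocab)
    logp, n = 0.0, 0
    for h1, h2, w in zip(words, words[1:], words[2:]):
        p = (tri[(h1, h2, w)] + 1) / (bi[(h1, h2)] + V)
        logp += math.log(p)
        n += 1
    return math.exp(-logp / n) if n else float("inf")

model = train_trigram("a b c a b c a b c".split())
print(perplexity(model, "a b c".split()))
```
</Paragraph>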
      <Paragraph position="11"> [Table 1: Perplexities of trigram models over a 3.3K-word vocabulary trained with different text materials, and tested on 10 hours of CS lectures. Letter designations correspond to OOV measures plotted in Figure 2.] Finally, to investigate the nature of the OOV words for a general vocabulary, we created a vocabulary of 1,568 words that were common to all three courses. Table 2 lists the ten most frequent subject-specific words for each of the three courses (i.e., OOV words that were not in the common vocabulary), along with the rank of each of these words in the Broadcast News and Switchboard corpora. Not surprisingly, these OOV words tend to be subject-specific content words, and are likely to be important words for any kind of summarization or retrieval task.</Paragraph>
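      <Paragraph> The common-vocabulary construction can be sketched as a set intersection over the per-course word sets, with each course's subject-specific words then ranked by frequency outside that intersection (toy data for illustration):

```python
from collections import Counter

def common_vocab(courses):
    """Words appearing in every course's transcripts."""
    return set.intersection(*(set(words) for words in courses))

def top_subject_words(course_words, common, k=10):
    """Most frequent course words outside the shared vocabulary."""
    counts = Counter(w for w in course_words if w not in common)
    return [w for w, _ in counts.most_common(k)]

math_course = "the matrix inverse the matrix rule".split()
cs_course = "the parser token the parser".split()
common = common_vocab([math_course, cs_course])
print(top_subject_words(cs_course, common, k=2))
```
</Paragraph>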
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Preliminary Transcript Generation
</SectionTitle>
    <Paragraph position="0"> The speech recognition processing that has been used to generate transcripts of spoken lectures has largely been based on large-vocabulary continuous speech recognition technology (Hurst et al., 2002; Leeuwis et al., 2003; Kawahara et al., 2003; Yokoyama et al., 2003). Language modeling research has focused on mixing topic-dependent textual source material (e.g., conference papers) with unrelated or topic-independent spoken material (e.g., Switchboard data, or transcripts of other spoken material) (Kato et al., 2000).</Paragraph>
    <Paragraph position="1"> In our initial speech recognition experiments, we have developed a recognizer that has been used to align the transcripts with the speech signal for three courses (approximately 80 lectures) (Glass, 2003). Based on manual examination, we believe that the alignments of the 16 kHz wide-band speech are of good quality, and are on par with previous alignments we have performed on Broadcast News and Switchboard, as well as on our own internal spontaneous speech corpora. Using these data as training material, we have performed a baseline speech recognition experiment on one course. Using a 5,000-word vocabulary and trigram language model (perplexity 120) derived from a portion of the lecture transcriptions and the textbook, we obtained a 33% word error rate on unseen lectures. This result is in line with other lecture word error rates of 30-40% that have been reported in the literature (Leeuwis et al., 2003; Kawahara et al., 2003).</Paragraph>
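    <Paragraph> Word error rate, as reported above, is the word-level Levenshtein distance (substitutions, insertions, and deletions) divided by the number of reference words. A minimal sketch:

```python
def word_error_rate(ref, hyp):
    """Word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(r)][len(h)] / len(r)

print(word_error_rate("the rules for matrix multiplication", "the rule for matrix"))
```
</Paragraph>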
    <Paragraph position="2"> [Table 2: Ten most frequent subject-specific OOV words for each course, relative to the common 1.5K-word vocabulary. Frequency rank for the 64K-word Broadcast News (BN) and 27K-word Switchboard (SB) corpora also shown (-- means the word never occurred).]</Paragraph>
  </Section>
</Paper>