<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0805">
  <Title>The value of minimal prosodic information in spoken language</Title>
  <Section position="3" start_page="0" end_page="40" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> There is a range of choices to be made when deciding on the contents of spoken language corpora. Useful information is made available by including prosodic annotations, but these are difficult to automate. This paper examines the advantages that can be expected to accrue from the inclusion of minimal prosodic information: major and minor tone unit boundaries, or pauses. Capturing this information automatically does not present the same difficulties as producing a full prosodic annotation; a method is described by Huckvale and Fang (1996). One purpose of this paper is to show that the development of such methods warrants further investigation. The task which prompted this investigation will use trained speakers in a controlled environment, as described below.</Paragraph>
    <Paragraph position="1"> We investigate how much extra information is captured by representing major and minor pauses, as well as words, in a corpus of spoken English. We find that for speech such as broadcast news or talks the inclusion of this minimal prosodic annotation will lower the perplexity of a language model.</Paragraph>
    <Paragraph position="2"> This result is of general interest, and supports the development of improved language models for many applications. However, the specific issue which we address is a task in broadcasting technology: the semi-automated production of online subtitles for live television programmes.</Paragraph>
    <Paragraph position="3"> Contemporaneous subtitling of television programmes for the hearing impaired is set to increase significantly, and in the UK there is a mandatory requirement placed on TV broadcasters to cover a certain proportion of live programmes. This skilled task is currently done by highly trained stenographers (the subtitlers), but it could be semi-automated. First, the subtitlers can transfer broadcast speech to written captions using automated speech recognition systems instead of specialist keyboards. A similar approach is being introduced for some Court reporting, where trained speakers in a controlled environment may replace traditional stenographers. Secondly, the display of the captions can be automated, which is the problem we address here. It is necessary to place line breaks in appropriate places so that the subtitler can be relieved of this secondary task.</Paragraph>
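The caption line-breaking step described above can be sketched as a simple greedy segmenter. This is an illustrative assumption, not the authors' system: the pause markers `<maj>` and `<min>`, the function name `caption_lines`, and the 37-character line limit are all hypothetical choices for the sketch. The idea is to prefer breaking at pause boundaries, falling back to a forced break only when a line would overflow.

```python
# Hypothetical sketch: break a recognised word stream into caption lines,
# preferring breaks at pause markers. Marker names and the character limit
# are illustrative assumptions, not taken from the paper.
PAUSES = {"<maj>", "<min>"}  # major / minor tone unit boundaries

def caption_lines(tokens, max_chars=37):
    """Greedily group tokens into caption lines of at most max_chars,
    breaking at pause markers whenever one occurs."""
    lines, line = [], []
    for tok in tokens:
        if tok in PAUSES:
            if line:                      # preferred break: pause boundary
                lines.append(" ".join(line))
                line = []
            continue
        width = len(" ".join(line + [tok]))
        if line and width > max_chars:    # forced break mid-phrase
            lines.append(" ".join(line))
            line = [tok]
        else:
            line.append(tok)
    if line:
        lines.append(" ".join(line))
    return lines

toks = "good evening <maj> here is the nine o'clock news".split()
print(caption_lines(toks, 40))
```

A system trained on pause-annotated corpora could replace the hard character limit with a learned preference for breaking at the pauses a speaker actually produced.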
    <Paragraph position="4"> Based on previous work in this field (Lyon and Frank, 1997) a system will be developed to process the output of an ASR device. We need to collect examples of this output as corpora of training data, and have investigated the type of information that it will be useful to obtain.</Paragraph>
    <Paragraph position="5"> Commercially available speech recognizers typically output only the words spoken by the user, but as an intermediate stage in producing subtitles we may want to use prosodic information that has been made explicitly available.</Paragraph>
    <Paragraph position="6"> Information-theoretic techniques can be used to determine how useful it is to capture minimal prosodic information: major and minor pauses.</Paragraph>
    <Paragraph position="7"> The experiments reported here take a prosodically marked corpus of spoken English, the Machine Readable Spoken English Corpus (MARSEC). Different representations are compared, in which the language model either omits or includes major and minor pauses as well as words.</Paragraph>
    <Paragraph position="8"> We need to assess how much of the structure of language is captured by the different methods of representation, and an objective method of evaluation is provided by entropy indicators, described below.</Paragraph>
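The entropy-based comparison described above can be illustrated with a toy sketch. This is not the paper's metric or data: the add-one-smoothed bigram model, the `<maj>`/`<min>` pause tokens, and the tiny example sentences are all assumptions made for illustration. The point is only the mechanics: estimate per-token cross-entropy (in bits) for a token sequence with and without explicit pause tokens, and compare the resulting perplexities.

```python
# Illustrative sketch (toy data, not MARSEC): per-token cross-entropy of an
# add-one-smoothed bigram model, estimated and evaluated on the same sequence.
import math
from collections import Counter

def bigram_entropy(tokens):
    """Average -log2 probability per bigram transition, with add-one smoothing."""
    vocab_size = len(set(tokens))
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])
    bits = 0.0
    for a, b in zip(tokens, tokens[1:]):
        p = (bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size)
        bits += -math.log2(p)
    return bits / (len(tokens) - 1)

def perplexity(tokens):
    return 2 ** bigram_entropy(tokens)

words  = "the prime minister spoke to the press today".split()
marked = "the prime minister <min> spoke to the press <maj> today".split()
print(f"words only : perplexity {perplexity(words):.2f}")
print(f"with pauses: perplexity {perplexity(marked):.2f}")
```

On a real corpus the model would be trained and tested on separate material; the paper's finding is that, for speech such as broadcast news, the representation that includes pause tokens yields the lower perplexity.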
    <Paragraph position="9"> The relationship between prosody and syntax is well known (Arnfield, 1994; Fang and Huckvale, 1996; Ostendorf and Veilleux, 1994; Taylor and Black, 1998). Work in this field has typically focussed on the problem in speech synthesis of mapping a text onto natural sounding speech. Our work investigates the complementary problem of mapping speech onto segmented text.</Paragraph>
    <Paragraph position="10"> It is not claimed that pauses are only produced as structural markers: hesitation phenomena perform a number of roles, particularly in spontaneous speech. However, it has been shown that the placement of pauses provides clues to syntactic structure. We introduce a statistical measure that can help indicate whether it is worth going to the trouble of capturing certain types of prosodic information for processing the output of trained speakers.</Paragraph>
    <Section position="1" start_page="40" end_page="40" type="sub_section">
      <SectionTitle>
Contents of paper
</SectionTitle>
      <Paragraph position="0"> This paper is organised in the following way.</Paragraph>
      <Paragraph position="1"> In Section 2 we describe the MARSEC corpus, which is the basis for the analysis. Section 3 describes the entropy metric, and explains the theory behind its application. Section 4 describes the experiments that were done, and gives the results. These indicate that it will be worthwhile to capture minimal prosodic information. In Section 5 we describe the subtitling task which prompted this investigation, and the paper concludes with Section 6 which puts our work into a wider context.</Paragraph>
    </Section>
  </Section>
</Paper>