File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/93/h93-1064_intro.xml

Size: 8,855 bytes

Last Modified: 2025-10-06 14:05:27

<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1064">
  <Title>ON CUSTOMIZING PROSODY IN SPEECH SYNTHESIS: NAMES AND ADDRESSES AS A CASE IN POINT</Title>
  <Section position="3" start_page="0" end_page="317" type="intro">
    <SectionTitle>
2. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Text-to-speech synthesis could profitably be used to automate or create many information services, if only it were of better quality. Unfortunately it remains too unnatural and machine-like for all but the simplest and shortest texts. It has been described as sounding monotonous, boring, mechanical, harsh, disdainful, peremptory, fuzzy, muffled, choppy, and unclear. Synthesized isolated words are relatively easy to recognize, but when these are strung together into longer passages of connected speech (phrases or sentences) then it is much more difficult to follow the meaning: the task is unpleasant and the effort is fatiguing [1].</Paragraph>
    <Paragraph position="1"> This less-than-ideal quality seems paradoxical, because published evaluations of synthetic speech yield intelligibility scores that are very close to natural speech. For example, Greene, Logan and Pisoni [2] found the best synthetic speech could be transcribed with 96% accuracy; the several studies that have used human speech tokens typically report intelligibility scores of 96% to 99% for natural speech. (For a review see [1].)</Paragraph>
    <Paragraph position="2"> However, segmental intelligibility does not always predict comprehension. A series of experiments [3] compared two high-end commercially-available text-to-speech systems on application-like material such as news items, medical benefits information, and names and addresses. The result was that the one with the significantly higher segmental intelligibility had the lower comprehension scores.</Paragraph>
    <Paragraph position="3"> Although there may be several possible reasons for segmental intelligibility failing to predict comprehension, the current work focuses on the single most likely cause: synthesis of prosody. Prosody is the organization imposed onto a string of words when they are uttered as connected speech. It includes pitch, duration, pauses, tempo, rhythm, and every known aspect of articulation. When the prosody is incorrect then at best the speech will be difficult or impossible to understand [4]; at worst listeners will misunderstand it without being aware that they have done so.</Paragraph>
    <Paragraph position="4"> Arguments for the importance of prosody in language abound in the literature. However, the cited examples of prosodic resolution of ambiguity are usually either anecdotal or illustrated by small sets of carefully constructed sentences. It is not clear how important prosody is in more normal everyday texts. This brings us to the first question addressed in the current study: how much will prosody contribute to perception of synthetic speech for non-contrived, real-world textual material?</Paragraph>
    <Section position="1" start_page="0" end_page="317" type="sub_section">
      <SectionTitle>
2.1. Current Approaches to Prosody in Speech Synthesis
</SectionTitle>
      <Paragraph position="0"> Text-to-speech systems are typically designed to cope with &quot;unrestricted text&quot; [5]. Each sentence in the input text is analyzed independently, and the prosody that is applied is a trade-off that avoids, on the one hand, sounding too monotonous and, on the other hand, implementing the prosodic features so saliently that egregious errors occur when the wrong prosodic features are applied. The approach taken in these systems to generating the prosody has been to derive it from an impoverished syntactic analysis of the text to be spoken. Usually content words receive pitch-related prominence; function words do not. Small prosodic boundaries, marked with pitch falls and some lengthening of the syllables on the left, are inserted wherever there is a content word on the left and a function word on the right.</Paragraph>
      <Paragraph position="1"> Larger boundaries are placed at punctuation marks, accompanied by a short pause and preceded by either a falling-then-rising pitch shape to cue nonfinality in the case of a comma, or finality in the case of a period. Declination of pitch is imposed over the duration of each sentence.</Paragraph>
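The default rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the rule set of any actual synthesizer: the function-word list, the boundary labels, and the name default_prosody are all hypothetical, and real systems work from a syntactic analysis, not a word list.

```python
# Illustrative sketch only: the word list, labels, and function name are
# assumptions made for this example, not taken from any real system.
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "with", "by",
    "and", "or", "but", "is", "was",
}
PUNCTUATION = ".,;:!?"


def default_prosody(sentence):
    """Return a list of (word, accented, boundary_after) triples.

    accented is True for content words (anything not in FUNCTION_WORDS).
    boundary_after is one of "none", "minor", "major-nonfinal", "major-final".
    """
    tokens = sentence.split()
    result = []
    for i, raw in enumerate(tokens):
        word = raw.strip(PUNCTUATION).lower()
        is_content = word not in FUNCTION_WORDS
        if raw.endswith((".", "!", "?")):
            # Larger boundary cueing finality.
            boundary = "major-final"
        elif raw.endswith((",", ";", ":")):
            # Larger boundary cueing nonfinality.
            boundary = "major-nonfinal"
        elif (
            i + 1 != len(tokens)
            and is_content
            and tokens[i + 1].strip(PUNCTUATION).lower() in FUNCTION_WORDS
        ):
            # Content word on the left, function word on the right.
            boundary = "minor"
        else:
            boundary = "none"
        result.append((word, is_content, boundary))
    return result
```

For the input "The cat sat on the mat, and the dog barked." the sketch accents "cat", "sat", "mat", "dog", and "barked", places a minor boundary after "sat", and places larger boundaries after "mat" and "barked". Because the rules see only word class and punctuation, they assign the same pattern whatever the discourse context.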
      <Paragraph position="2"> There are several ways in which deviations from the above principles can be implemented to add variety and interest to an intonation contour. For example the declination may be partially reset at commas within a sentence. Or the extent of prominence-lending pitch excursions on content words may be varied according to their lexical class (higher pitch peaks on nouns or adjectives, lower on verbs) or their position in the phrase (alternating higher and lower peaks). These variations may be based on stochastically trained models.</Paragraph>
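As a rough illustration of declination with a partial reset at commas, the following Python sketch computes one baseline pitch value per word. The linear decline, the frequency values, and the parameter names (start_hz, end_hz, reset_fraction) are assumptions chosen for the example; the text above does not specify how any particular system implements this.

```python
def baseline_contour(follows_comma, start_hz=120.0, end_hz=90.0,
                     reset_fraction=0.5):
    """Return one baseline pitch value in Hz per word.

    follows_comma[i] is True when word i is followed by a comma.
    All numeric defaults are hypothetical values chosen for illustration.
    """
    n = len(follows_comma)
    if n == 0:
        return []
    # Linear declination: the step that would carry start_hz down to
    # end_hz across the sentence if no reset occurred.
    step = (start_hz - end_hz) / max(n - 1, 1)
    contour = []
    current = start_hz
    for comma in follows_comma:
        contour.append(current)
        current -= step
        if comma:
            # Partial reset: recover a fraction of the drop so far.
            current += reset_fraction * (start_hz - current)
    return contour
```

With five words and a comma after the third, the default values give 120, 112.5, 105, 108.75, and 101.25 Hz: the baseline falls, steps partway back up after the comma, and then resumes falling. Setting reset_fraction to 0 reproduces plain declination over the whole sentence.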
      <Paragraph position="3"> One problem with the above approach is that prosody is not a lexical property of English words - English is not a tone language. Neither is prosody completely predictable from English syntax - prosody is not a redundant encoding of already-inferable information.</Paragraph>
      <Paragraph position="4"> Rather, prosody annotates the information structure of the accompanying text string. It depends on the prior mutual knowledge of the speaker and listener, and on the role a particular utterance takes within its particular discourse. It marks which concepts are considered by the speaker to be new in the dialogue, which ones are topics, and which ones are comments. It encodes the speaker's expectations about how the current utterance relates to the listener's current knowledge, and it indicates focussed versus background information. This realm of information is very difficult to derive in an unrestricted text-to-speech system, and it is correspondingly difficult to generate correct discourse-relevant prosody. This is a primary reason why long passages of synthetic speech sound so unnatural.</Paragraph>
      <Paragraph position="5"> 2.2. Application-specific discourse constraints on prosody There are many different applications for synthetic speech, but what they tend to share is that usually within each application (i) the text is not unrestricted, but rather is confined to a constrained topic and a limited subset of the language, and (ii) the speech is spoken within a known discourse context. Therefore within the constraints of a particular application it is possible to make assumptions about the type of text structures to expect, the reasons the text is being spoken, and the expectations of the listener.</Paragraph>
      <Paragraph position="6"> These are just the types of information that are necessary to constrain the prosody. This brings us to the second aim of the current research: is it possible to create application-specific rules to improve the prosody in a real text-to-speech synthesis application? Prior work has shown that discourse characteristics of simulated applications can be used to constrain prosody. Young and Fallside [6] built a system that enabled remote access to status information about East Anglia's water supply system.</Paragraph>
      <Paragraph position="7"> This system answered queries by generating text around numerical data and then synthesizing the resulting sentences. The desired prosody was generated along with the text, rather than being left to the default rules of an unrestricted text-to-speech system. Silverman developed paragraph-level rules to vary pitch range and place accents based on a model of recently-activated concepts. Hirschberg and Pierrehumbert [7] generated the prosody in synthetic speech according to a block structure model of discourse in an automated tutor for the vi text editor. Davis [8] built a system that generated travel directions within the Boston metropolitan area. In one version of the system, elements of the discourse structure (such as given-versus-new, repetition, and grouping of sentences into larger units) were used to manipulate accent placement, boundary placement, and pitch range.</Paragraph>
      <Paragraph position="8"> Each of these pieces of research consists of a carefully-elaborated set of rules to improve synthetic speech quality. However, the evidence that the speech did indeed sound better was intuitive rather than based on formal perceptual assessments. Yet systematic and controlled evaluation is crucial in order to test whether hypothesized rules are correct, and whether they have a measurable effect on how the speech is perceived.</Paragraph>
      <Paragraph position="9"> The current work builds on the progress made in the above systems by evaluating prosodic modelling in the context of an existing information-provision service.</Paragraph>
    </Section>
  </Section>
</Paper>