<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1064">
  <Title>ON CUSTOMIZING PROSODY IN SPEECH SYNTHESIS: NAMES AND ADDRESSES AS A CASE IN POINT</Title>
  <Section position="4" start_page="317" end_page="319" type="metho">
    <SectionTitle>
3. PROSODY FOR A NAME AND ADDRESS
INFORMATION RETRIEVAL SERVICE
</SectionTitle>
    <Paragraph position="0"> The text domain for the current work is synthesis of names and addresses. The associated pronunciation rules and text processing are well understood, and there are many applications that require this type of information. At the same time this represents a particularly stringent test for the contribution of prosody to synthesis quality because names and addresses have such a simple linear structure. There is little structural ambiguity, no center-embedding, no relative clauses. There are no indirect speech acts. There are no digressions. Utterances are usually very short. In general, names and addresses contain few of the features common in cited examples of the centrality of prosody in spoken language. This class of text seems to offer little opportunity for prosody to aid perception.</Paragraph>
    <Paragraph position="1"> On the other hand, if prosody can be shown to influence synthetic speech quality even on such simple material as names and addresses, then it is all the more likely to be important in spoken language systems where the structure of the material is more complex and the discourse is richer.</Paragraph>
    <Paragraph position="2"> 3.1. The application dialogue This work took place within the context of a field trial of speech synthesis to automate NYNEX's reverse-directory service \[9\]. Callers are real users of the information service: they know the nature of the information-provision service before they call. They have 10-digit telephone numbers for which they want the associated listing information. At random, their call may arrive at the automated position. The dialogue with the automated system consists of two phases: information gathering and information provision. The information-gathering phase uses standard Voice Response Unit technology: callers hear recorded prompts and answer questions by pressing DTMF keys on their telephones. This phase establishes features of the discourse that are important for generating the prosody: callers are aware of the topic and purpose of the discourse and of the information they will be asked to supply by the interlocutor (in this case the automated voice). It also establishes that the interlocutor can and will use the telephone numbers as a key to indicate how the to-be-spoken information (the listings) relates to what the caller already knows (thus &amp;quot;555 1234 is listed to Kim Silverman, 555 2345 is listed to Sara Basson&amp;quot;).</Paragraph>
    <Paragraph position="3"> The second phase is information provision: the listing information for each telephone number is spoken by a speech synthesizer. Specifically, the number and its associated name and town are embedded in carrier phrases, as in: &lt;number&gt; is listed to &lt;name&gt; in &lt;town&gt; The resultant sentence is spoken by the synthesizer, after which a recorded human voice offers to repeat the listing, spell the name, or continue to the next listing.</Paragraph>
    <Paragraph position="4"> These features may seem too obvious to be worthy of comment, but they very much constrain likely interpretations of what is to be spoken, and similarly define what the appropriate prosody should be in order for the to-be-synthesized information to be spoken in a compliant way.</Paragraph>
    <Paragraph position="5"> 3.2. Rules for Prosody in Names and Addresses In the field trial, text fields from NYNEX's Customer Name and Address database (approximately 20 million entries) are sent to a text processor \[10\] which identifies and labels logical fields, corrects many errors, and expands abbreviations. For the current research, a further processor was written which takes the cleaned-up text which is output from that text processor, analyzes its information structure, and inserts prosodic markers into it before passing it on to a speech synthesizer. The prosodic markers control such things as accent type, accent location, overall pitch range, boundary tones, pause durations, and speaking rate. These are recognized by the synthesizer and will override that synthesizer's own inbuilt prosody rules.</Paragraph>
    <Paragraph position="6"> The prosodic choices were based on analyses of 371 interactions between real operators and customers. The operators use a careful, clear, deliberately helpful style when saying this information. The principles that underlie their choice of prosody, however, are general and apply to all of language. The tunes they use appear to be instances of tunes in the repertoire shared by all native speakers, their use of pitch range is consistent with observational descriptions in the ethnomethodology literature, and their pauses are neither unrepresentatively long nor rushed. What makes their prosody different from normal everyday speech is merely which tunes and categories they select from the repertoire, rather than the contents of the repertoire itself. This reflects the demand characteristics of the discourse.</Paragraph>
    <Paragraph position="7"> The synthesizer which was chosen for this prosodic preprocessor was DECtalk, within the DECvoice platform. This synthesizer has a reputation for very high segmental intelligibility \[2\]. It is widely used in applications and research laboratories, and has an international reputation.</Paragraph>
    <Paragraph position="8"> There are three categories of processing performed by the prosodic rules: (i) discourse-level shaping of the overall prosody; (ii) field-specific accent and boundary placement, and (iii) interactive adaptation of the speaking rate.</Paragraph>
    <Paragraph position="9"> (i) Discourse-level shaping of the prosody within a turn.</Paragraph>
    <Paragraph position="10"> That turn might be one short sentence, as in &amp;quot;914 555 2145 shows no listing&amp;quot;, or several sentences long, as in &amp;quot;The number 914 555 2609 is an auxiliary line. The main number is 914 555 2000. That number is handled by US Communications of Westchester doing business as Southern New York Holdings Incorporated in White Plains NY 10604.&amp;quot; The general principle here is that prosodic organization can span multiple intonational phrases, and therefore multiple sentences. These turns are all prosodically grouped together by systematic variation of the overall pitch range, lowering the final endpoint, deaccenting items in compounds (e.g. &amp;quot;auxiliary line&amp;quot;), and placing accents correctly to indicate backward references (e.g. &amp;quot;That number...&amp;quot;). The phone number which is being echoed back to the listener, which the listener keyed in only a few seconds prior, is spoken rather quickly (the 914 555 2145 in this example). The one which is new is spoken more slowly, with larger prosodic boundaries after the area code and local exchange, and an extra boundary between the eighth and ninth digits. This is the way native speakers say this type of information when it is new and important in the discourse.</Paragraph>
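The digit phrasing described above for a new telephone number can be sketched as follows. This is an illustrative sketch only: the function name is invented, and a plain "|" stands in for the prosodic boundary markers the real preprocessor would emit.

```python
def phrase_phone_number(digits):
    """Group a 10-digit number as the rules above describe:
    boundaries after the area code and the local exchange, plus an
    extra boundary between the eighth and ninth digits."""
    assert len(digits) == 10 and digits.isdigit()
    groups = [digits[0:3], digits[3:6], digits[6:8], digits[8:10]]
    return " | ".join(groups)
```

An echoed-back number, being given information, would instead be spoken quickly with little internal boundary structure.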
    <Paragraph position="11"> Another characteristic of this level of prosodic control is the type and duration of pauses within and between some of the sentences. Some pauses are inserted within intonational phrases, immediately prior to information-bearing words.</Paragraph>
    <Paragraph position="12"> These pauses are NOT preceded by boundary-related pitch tones, and only by a small amount of lengthening of the preceding material. They serve to alert the listener that something important is about to be spoken, thereby focusing the listener's attention. In the ToBI transcription system, these would be transcribed as a 2 or 2p boundary. Example locations of these pauses include: &amp;quot;The main number is...</Paragraph>
    <Paragraph position="13"> 914 555 2000&amp;quot; and &amp;quot;In... White Plains, NY 10604.&amp;quot; The duration of the sentence-final pause between names and their associated addresses is varied according to the length and complexity of the name. This allows listeners more time to finish processing the acoustic signal for the name (to perform any necessary backtracking, ambiguity resolution, or lexical access) before their auditory buffer is overwritten by the address.</Paragraph>
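The variable name-final pause can be sketched as below. The constants, and the use of word count as a proxy for name length and complexity, are assumptions; the paper does not state the actual mapping.

```python
def name_final_pause_ms(name, base_ms=300, per_word_ms=80):
    """Lengthen the pause after a name in proportion to how much
    material the listener must finish processing before the address
    begins. Word count is a crude stand-in for the length and
    complexity measure described in the text."""
    extra_words = max(0, len(name.split()) - 1)
    return base_ms + per_word_ms * extra_words
```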
    <Paragraph position="14"> (ii) Signalling the internal structure of labelled fields.</Paragraph>
    <Paragraph position="15"> The most complicated and extensive set of rules is for name fields. Rules for this field first of all identify word strings which are inferable markers of information structure, rather than being information-bearing in themselves, such as &amp;quot;... doing business as...&amp;quot;. The relative pitch range is reduced, the relative speaking rate is increased, and the stress is lowered. These features jointly signal to the listener the role that these words play. In addition, the reduced range allows the synthesizer to use its normal and boosted range to mark the start of information-bearing units on either side of these markers. These units themselves are either residential or business names, which are then analyzed for a number of structural features. Prefixed titles (Mr, Dr, etc.) are cliticized (assigned less salience so that they prosodically merge with the next word), unless they are head words in their own right (e.g. &amp;quot;Misses Incorporated&amp;quot;). Accentable suffixes (incorporated, the second, etc.) are separated from their preceding head and placed in an intermediate-level phrase of their own. After these are stripped off, the right-hand edge of the head itself is searched for suffixes that indicate a complex nominal. If one of these is found, it has its pitch accent removed, yielding for example Building Company, Plumbing Supply, Health Services, and Savings Bank. However, if the preceding word is a function word then it is NOT deaccented, to allow for constructs such as &amp;quot;John's Hardware and Supply&amp;quot; or &amp;quot;The Limited&amp;quot;. The rest of the head is then searched for a prefix on the right, in the form of &amp;quot;&lt;word&gt; and &lt;word&gt;&amp;quot;. If one is found, it is put into its own intermediate phrase, which separates it from the following material for the listener. 
This causes constructs like &amp;quot;A and P Tea Company&amp;quot; to NOT sound like &amp;quot;A, and P T Company&amp;quot; (prosodically analogous to &amp;quot;A, and P T Barnum&amp;quot;).</Paragraph>
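The complex-nominal deaccenting decision, including the function-word exception, can be sketched as below. The suffix and function-word lists here are small illustrative samples, not the actual lexicons used by the rules.

```python
def should_deaccent_final(words):
    """Deaccent the final word of a business-name head when it is a
    complex-nominal suffix (as in "Plumbing Supply" or "Savings
    Bank"), unless the preceding word is a function word, which
    protects constructs like "John's Hardware and Supply"."""
    suffixes = {"company", "supply", "services", "bank"}
    function_words = {"and", "the", "of"}
    if len(words) >= 2 and words[-1].lower() in suffixes:
        return words[-2].lower() not in function_words
    return False
```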
    <Paragraph position="16"> Within a head, words are prosodically separated from each other very slightly, to make the word boundaries clearer.</Paragraph>
    <Paragraph position="17"> The pitch contour at these separations is chosen to signal to the listener that although slight disjuncture is present, these words cohere together as a larger unit.</Paragraph>
    <Paragraph position="18"> Similar principles are applied within the other address fields. In address fields, for example, a longer address starts with a higher pitch than a shorter one, deaccenting is performed to distinguish &amp;quot;Johnson Avenue&amp;quot; from &amp;quot;Johnson Street&amp;quot;, ambiguities like &amp;quot;120 3rd Street&amp;quot; versus &amp;quot;100 23rd Street&amp;quot; versus &amp;quot;123rd Street&amp;quot; are detected and resolved with boundaries and pauses, and so on. In city fields, items like &amp;quot;Warren Air Force Base&amp;quot; have the accents removed from the right hand two words.</Paragraph>
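Detecting the house-number versus ordinal ambiguity mentioned above might look like the following heuristic; this is an invented illustration, not the paper's actual rule.

```python
import re

def ambiguous_street_number(address):
    """True when a house number is immediately followed by an
    ordinal street name, so the two digit strings could run
    together when spoken ("120 3rd Street" heard as "123rd
    Street"). Such cases get extra boundaries and pauses."""
    return bool(re.match(r"\d+ \d+(st|nd|rd|th)\b", address))
```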
    <Paragraph position="19"> An important component of signalling the internal structure of fields is to mark their boundaries. Rules concerning interfield boundaries prevent listings like &amp;quot;Sylvia Rose in Baume Forest&amp;quot; from being misheard as &amp;quot;Sylvia Rosenbaum Forest&amp;quot;.</Paragraph>
    <Paragraph position="20"> (iii) Adapting the speaking rate. Speaking rate is a powerful contributor to synthesizer intelligibility: it is possible to understand even an extremely poor synthesizer if it speaks slowly enough. But the slower it speaks, the more pathological it sounds. Moreover as listeners become more familiar with a synthesizer, they understand it better and become less tolerant of unnecessarily-slow speech. Consequently it is unclear what the appropriate speaking rate should be for a particular synthesizer, since this depends on the characteristics of both the synthesizer and the application.</Paragraph>
    <Paragraph position="21"> To address this problem, a module modifies the speaking rate from listing to listing on the basis of whether customers request repeats. Briefly, repeats of listings are presented faster than the first presentation, because listeners typically ask for a repeat in order to hear only one particular part of a listing. However if listener consistently requests repeats for several consecutive listings, then the starting rate for new listings within that call is slowed down. If this happens over sufficient consecutive calls, then the default starting rate for a new call is slowed down. Similarly, if over successive listings or calls there are no repeats, then the speaking rate will be increased again. By modelling three different levels of speaking rate in this way (within-listing, within-call, and across-calls), this module attempts to distinguish between a particularly difficult listing, a particularly confused listener, and an altogether-too-fast (or too slow) synthesizer.</Paragraph>
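A sketch of this three-level rate adaptation is given below. All names, rates, step sizes, and thresholds are invented placeholders; the paper describes the behaviour but gives no concrete values.

```python
class RateAdapter:
    """Tracks within-listing, within-call, and across-call speaking
    rates (in words per minute), adjusting on repeat requests."""

    def __init__(self, default_rate=160, step=10, threshold=3):
        self.default_rate = default_rate  # across-call starting rate
        self.step = step                  # adjustment per trigger
        self.threshold = threshold        # consecutive listings needed
        self.call_rate = default_rate     # starting rate for this call
        self.repeated_run = 0
        self.clean_run = 0

    def first_presentation_rate(self):
        return self.call_rate

    def repeat_rate(self):
        # Repeats are spoken faster: listeners usually want to
        # re-hear only one particular part of the listing.
        return self.call_rate + self.step

    def finish_listing(self, was_repeated):
        if was_repeated:
            self.repeated_run += 1
            self.clean_run = 0
            if self.repeated_run >= self.threshold:
                # Consistent repeats suggest a confused listener:
                # slow down new listings within this call.
                self.call_rate -= self.step
                self.repeated_run = 0
        else:
            self.clean_run += 1
            self.repeated_run = 0
            if self.clean_run >= self.threshold:
                # No repeats for a while: speed back up.
                self.call_rate += self.step
                self.clean_run = 0

    def finish_call(self, call_was_slowed):
        # Across-call level: if enough consecutive calls had to be
        # slowed, lower the default starting rate for future calls.
        if call_was_slowed:
            self.default_rate -= self.step
        self.call_rate = self.default_rate
```

This separation is what lets the module distinguish a difficult listing (repeat one listing faster), a confused listener (slow this call), and a synthesizer whose default rate is simply wrong (slow all future calls).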
    <Paragraph position="22"> In addition to the above prosodic controls, there is a specific module to control the way items are spelled when listeners request spelling. This works in two ways. Firstly, using the same prosodic principles and features as above, it employs variation in pitch range, boundary tones, and pause durations to separate the end of the spelling of one item from the start of the next (to prevent &amp;quot;Terrance C McKay Sr.&amp;quot; from being spelled &amp;quot;T-E-R-R-A-N-C-E-C, M-C-K-A Why Senior&amp;quot;), and it breaks long strings of letters into groups, so that &amp;quot;Silverman&amp;quot; is spelled &amp;quot;S-I-L, V-E-R, M-A-N&amp;quot;. Secondly, it spells by analogy letters that are ambiguous over the telephone, such as &amp;quot;F for Frank&amp;quot;, using context-sensitive rules to decide when to do this, so that it is not done when the letter is predictable by the listener. Thus N is spelled &amp;quot;N for Nancy&amp;quot; in a name like &amp;quot;Nike&amp;quot;, but not in a name like &amp;quot;Chang&amp;quot;. The choice of analogy itself also depends on the word, so that &amp;quot;David&amp;quot; is NOT spelled &amp;quot;D for David, A ..... &amp;quot;</Paragraph>
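The letter-grouping part of the spelling module can be sketched as follows. The group size of three matches the "S-I-L, V-E-R, M-A-N" example but is otherwise an assumption, and the analogy-spelling rules are not shown.

```python
def group_spelling(name, group_size=3):
    """Break a name into short letter groups for spelling aloud,
    e.g. "Silverman" becomes "S-I-L, V-E-R, M-A-N". Pauses and
    boundary tones would be realized at the commas."""
    letters = [c.upper() for c in name if c.isalpha()]
    groups = [letters[i:i + group_size]
              for i in range(0, len(letters), group_size)]
    return ", ".join("-".join(g) for g in groups)
```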
  </Section>
  <Section position="5" start_page="319" end_page="320" type="metho">
    <SectionTitle>
4. PRELIMINARY EVALUATION
</SectionTitle>
    <Paragraph position="0"> A transcription experiment was carried out to evaluate the impact of the prosodic rules on synthetic speech quality, in terms of both objective transcription accuracy and subjective ratings.</Paragraph>
    <Section position="1" start_page="320" end_page="320" type="sub_section">
      <SectionTitle>
4.1. Test material
</SectionTitle>
      <Paragraph position="0"> A set of twenty-three names and addresses had already been developed by Sara Basson (unpublished ms, 1992) for assessing the accuracy with which listeners can transcribe such material. This set had been constructed to represent the variation in internal structure and length that occurred in NYNEX's database. Although it did contain some material that would be ambiguous if synthesized with incorrect prosody, it was not intended to focus exclusively on prosodic variability and was developed before the prosodic processor was finished. It contained phonemic diversity; a variety of personal names, cities, and states; short and long name fields; and digit strings. There were roughly equal proportions of easy, moderate, and difficult listings, as measured by how well listeners could transcribe the material when spoken by a human. Henceforth each of these names and addresses shall be referred to as items.</Paragraph>
    </Section>
    <Section position="2" start_page="320" end_page="320" type="sub_section">
      <SectionTitle>
4.2. Procedure
</SectionTitle>
      <Paragraph position="0"> The 23 items were divided into two sets. Listeners were all native speakers of English with no known hearing loss, and all employees of NYNEX Science and Technology. On the basis of our previous experience with synthetic speech perception experiments, we expect these listeners to perform better on the transcription task than general members of the public. Thus the results of this transcription test represent a &amp;quot;best case&amp;quot; in terms of how well we can expect real users to understand the utterances.</Paragraph>
      <Paragraph position="1"> Listeners called the computer over the public telephone network from their office telephones; their task was to transcribe each of the 23 items. Each listener heard and transcribed the items in two blocks: one set of items spoken with DECtalk's default prosody rules, and the other spoken with application-specific prosody. The design was counterbalanced: roughly half of the listeners heard each prosody version in the first block, and roughly half heard each item set in the first block. For each item, listeners could request as many repeats as they wanted in order to transcribe the material as accurately as they felt was reasonably possible. Listeners were only allowed to request spelling in two of the items, which were constructed to sound like pronounceable names and to contain every letter of the alphabet.</Paragraph>
    </Section>
    <Section position="3" start_page="320" end_page="320" type="sub_section">
      <SectionTitle>
4.3. Dependent variables
</SectionTitle>
      <Paragraph position="0"> Transcription scores per item. Each word in each item could score up to 3 points. One point was deducted if the right-hand word boundary was misplaced, one point if one phoneme was wrong, and two points if more than one phoneme was wrong.</Paragraph>
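The per-word scoring rule can be written out directly as a straightforward transcription of the scheme above:

```python
def score_word(boundary_ok, phoneme_errors):
    """Score one transcribed word out of 3 points: deduct 1 for a
    misplaced right-hand word boundary, 1 for exactly one wrong
    phoneme, and 2 for more than one wrong phoneme."""
    score = 3
    if not boundary_ok:
        score -= 1
    if phoneme_errors == 1:
        score -= 1
    elif phoneme_errors > 1:
        score -= 2
    return score
```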
      <Paragraph position="1"> Number of repeats requested per item. For items that were spelled, this was the number of times after the first spelling.</Paragraph>
      <Paragraph position="2"> Perceived intelligibility. Each version of the synthesis was rated by each listener on a five-point scale labelled: &amp;quot;How easy was it to understand this voice?&amp;quot; (where 1 = &amp;quot;Consistently failed to understand much of the speech&amp;quot; and 5 = &amp;quot;Consistently effortless to understand&amp;quot;).</Paragraph>
      <Paragraph position="3"> Perceived naturalness. Each version was similarly rated, on a five-point scale labelled &amp;quot;How natural (i.e. like a human voice) did this voice sound? (where 1 = extremely unnatural and 5 = extremely natural).</Paragraph>
      <Paragraph position="4"> Preferences. Since each listener heard each voice, they were asked for which voice they preferred: voice 1, voice 2, or no preference.</Paragraph>
    </Section>
  </Section>
</Paper>