<?xml version="1.0" standalone="yes"?> <Paper uid="A92-1004"> <Title>A PARSER FOR REAL-TIME SPEECH SYNTHESIS OF CONVERSATIONAL TEXTS</Title> <Section position="3" start_page="0" end_page="25" type="metho"> <SectionTitle> 2. THE APPLICATION </SectionTitle> <Paragraph position="0"> Users of Telecommunications Devices for the Deaf (TDD's) can communicate with voice telephone users via services such as AT&T's Telecommunications Relay Service (TRS). During a TRS call, special operators read incoming TDD text to the voice telephone user and then type that person's spoken responses back to the TDD user; this makes for a three-way interaction in which the special operator is performing both text-to-speech and speech-to-text conversion. Text-to-speech synthesis</Paragraph> <Paragraph position="1"> makes it possible to automate part of this arrangement by reading the TDD text over the telephone to the voice user. The synthesizer thus replaces an operator on the TDD half of the conversation, providing increased privacy and control to the TDD user and, presumably, cost savings to the provider of the service.</Paragraph> <Paragraph position="2"> TDD texts present unusual challenges for text-to-speech. Except in laboratory experiments, large-scale applications of text-to-speech have tended to focus on name pronunciation and &quot;canned text&quot; such as catalogue orders. To the best of our knowledge, the TRS text-to-speech field trial in California represents the first large-scale attempt to use speech synthesis on spontaneously generated conversational texts, and also the first to use this technology on texts that are orthographically and linguistically non-standard. Unlike the written material that most text-to-speech systems are tested on, e.g. the AP newswire, TDD texts observe few of the writing conventions of English. All text is in upper case, and punctuation, even at major sentence boundaries, rarely occurs; spelling and typographical errors complicate the picture even further (Tsao 1990; Kukich 1992). In addition, nearly all texts employ special abbreviations and lingo, e.g., CU stands for see you, GA is the message terminator go ahead. The following example illustrates a typical TDD text:</Paragraph> </Section> <Section position="4" start_page="25" end_page="25" type="metho"> <SectionTitle> OH SURE PLS CALL ME ANYTIME AFTER SAT MORNING AND I WILL GIVE U THE NAMES AND PHONE NOS OK Q GA </SectionTitle> <Paragraph position="0"> (Oh sure, please call me anytime after Saturday morning and I will give you the names and phone numbers. OK? Go ahead.) Finally, many texts are written in a variety of English that departs from expected lexical and syntactic patterns of the standard dialect (Charrow 1974). For example, WHEN DO I WILL CALL BACK U Q GA is a short TDD text that we believe most native speakers of English would recognize as When should I call you back? Go ahead. The (attested) example below is less clear, but interpretable:</Paragraph> </Section> <Section position="5" start_page="25" end_page="25" type="metho"> <SectionTitle> I WISH THAT DAY I COULD LIKE TO MEETING DIFFERENT PEOPLE WHO DOES THIS JOB AND THE WAY I WANT TO SEE HOW THEY DO IT LIKE THAT BUT THIS PLACES WAS FROM SAN FRANCISCO I GUESS </SectionTitle> <Paragraph position="0"> Syntactic variation in such texts is systematic and consistent (Bachenko 1989, Charrow 1974). 
Although a complete account has yet to be formulated, Suri (1991) reports that aspects of the variation may be explained by the influence of a native language--ASL--on a second language--English.</Paragraph> <Paragraph position="1"> Figure 1 above summarizes the points about TDD texts. Spelling error estimates come from Kukich (1992) and Tsao (1990).</Paragraph> <Paragraph position="2"> Timing creates an additional obstacle since we expect TRS text-to-speech to synthesize the text while it is being typed, much as an operator would read it at the TRS center. How to chunk the incoming text now becomes a critical question. Word-by-word synthesis, where the listener hears a pause after each word, is the easiest approach but one that many people find nerve-wracking. N-word synthesis, where the listener hears a pause after some arbitrary number of words, is nearly as simple but runs the risk of creating unacceptably high levels of ambiguity and, for long texts, may be as irritating as single-word synthesis. Our solution was to build a TDD parser that uses linguistic rules to break up the speech into short, natural-sounding phrases. With partial buffering of incoming text, the parser is able to work in near real time as well as to perform lexical regularization of abbreviations and a small number of non-standard forms.</Paragraph> </Section> <Section position="6" start_page="25" end_page="27" type="metho"> <SectionTitle> 3. A TEXT-TO-SPEECH PARSER 3.1. PARSER STRUCTURE AND RULES </SectionTitle> <Paragraph position="0"> In constructing the parser, our goal was to come up with a system that (a) replaces non-standard and abbreviated items with standard, pronounceable words, and (b) produces the most plausible phrasing with the simplest possible mechanism. Extensive data collection has been the key to success in regularizing lexical material, e.g. the conversion of fwy (pronounced &quot;f-wee&quot;) to freeway.</Paragraph> <Paragraph position="1"> Phrasing is accomplished by a collection of rules derived from the prosodic phrase grammar of Bachenko and Fitzpatrick (1990), with some important modifications.</Paragraph> <Paragraph position="2"> The most radical of these is that the TDD phrasing rules build no hierarchical structure. Instead they rely on string adjacency, part of speech, word subclass and length to make inferences about possible syntactic constituency and to create enough prosodic cohesion to determine the location of phrase boundaries.</Paragraph> <Paragraph position="3"> The parser works deterministically (Marcus 1980, Hindle 1983). It uses a small three-element buffer that can contain either words or structures; once a lexical or prosodic structure is built it cannot be undone. As TDD text is typed, incoming words are collected in the buffer where they are formed into structures by rules described below. Phrasing rules then scan buffer structures. If a phrasing rule applies, all text up to the element that triggered the rule is sent to the synthesizer while, during synthesis, the buffer is reset and the rules restart anew.</Paragraph> <Paragraph position="4"> Once a structure has moved out of the buffer it cannot be recovered for examination by later phrasing rules.</Paragraph> <Paragraph position="5"> Our approach differs from other recent efforts to build small parsers for text-to-speech, e.g. O'Shaughnessy (1988) and Emorine and Martin (1988), where savings are sought in the lexicon rather than in processing.</Paragraph>
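To make the buffering scheme concrete, the sketch below shows one way the three-element buffer and rule loop just described could be organized. It is a minimal illustration of the stated behavior, not the authors' implementation; the function names, the callable arguments, and the final flush step are assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of the deterministic
# three-element buffer of Section 3.1: words enter the buffer, rules build
# structures that are never undone, and when a phrasing rule fires, all text
# up to the triggering element is sent to the synthesizer and the rules
# restart on the reset buffer.

def parse_stream(words, build_structures, phrasing_rules, speak):
    buffer = []                       # at most three words or structures
    pending = []                      # elements bumped out, awaiting synthesis
    for word in words:
        if len(buffer) == 3:          # oldest element leaves the buffer for good
            pending.append(buffer.pop(0))
        buffer.append(word)
        buffer = build_structures(buffer)   # lexical/prosodic structures, monotonic
        for rule in phrasing_rules:         # rules are strictly ordered
            cut = rule(buffer)              # index of the triggering element, or None
            if cut is not None:
                speak(pending + buffer[:cut])       # phrase goes to the synthesizer
                pending, buffer = [], buffer[cut:]  # reset; rules restart anew
                break
    if pending or buffer:
        speak(pending + buffer)       # flush whatever remains (end of input or timeout)
```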
<Paragraph position="6"> O'Shaughnessy (1988) (henceforth O.) describes a non-deterministic parser that builds sentence-level structure using a dictionary of 300 entries and a medium-sized grammar, which we guess to be slightly under 100 rules. The lexicon is augmented by a morphological component of 60 word suffixes used principally to derive part of speech; for example, -ship and -ness are considered good indicators that a word of two or more syllables has the category 'noun'. O. gives a thorough account of his parser. Much of his exposition focuses on technical details of the syntactic analysis, and supporting linguistic data are plentiful. However, evaluation of O.'s proposals for speech synthesis is difficult since he gives us only a vague indication of how the parsed sentences would be prosodically phrased in a text-to-speech system. Without an explicit description of the syntax/prosody relation, we cannot be sure how to assess the suitability of O.'s analysis for speech applications.</Paragraph> <Paragraph position="7"> The system described by Emorine and Martin (1988) (henceforth E&M) incorporates a 300-entry dictionary and approximately 50 rules for identifying syntactic constituents and marking prosodic phrase boundaries. The rules in this system build sentence-level structures that are syntactically simpler than those given in O. but more geared to the requirements of phrasing in that prosodic events (e.g. pause) are explicitly mentioned in the rules.</Paragraph> <Paragraph position="8"> Unfortunately, E&M share few technical details about their system and, like O., provide no examples of the prosodic phrasing produced by their system, making evaluation an elusive task.</Paragraph> <Paragraph position="9"> Applications such as TRS, which requires near real-time processing, make systems based on sentence-level analyses infeasible. In our parser, decisions about phrasing are necessarily local--they depend on lexical information and word adjacency but not upon relations among non-contiguous elements. This, combined with the need for lexical regularization in TDD texts, motivates a much stronger lexicon than that of O. or E&M. In addition, our parser incorporates a small number of part-of-speech disambiguation rules to make additional lexical information available to the phrasing rules. Let us briefly describe each of the three components that make up the grammar: lexicon, disambiguation rules, and phrasing rules.</Paragraph> <Paragraph position="10"> 3.1.1. The lexicon contains 1029 entries consisting of words, abbreviations, and two- to three-word phrases.</Paragraph> <Paragraph position="11"> Each entry has four fields: the input word (e.g. u), the output orthography (you), lexical category (Noun), and a list of word subclasses (destress_pronoun short_subject).</Paragraph> <Paragraph position="12"> Word subclasses reflect co-occurrence patterns and may or may not have any relationship to lexical categories.</Paragraph> <Paragraph position="13"> For example, Interjection1 includes the phrase bye-bye for now, the adverb however, the noun phrase my goodness, and the verb smile, as in I APPRECIATE THE HELP SMILE THANK YOU SO MUCH. Both the lexical category and subclass fields are optional--either may be marked as NIL. Abbreviations and acronyms are usually nouns and make up 20% of the lexical entries.</Paragraph> <Paragraph position="14"> Nouns and verbs together make up about 50%.</Paragraph>
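A lexicon entry of this kind might be rendered as below. This is a hypothetical sketch of the four fields just described; the Python layout, the field names, and the entries other than u are illustrative assumptions, not the authors' data format.

```python
# Hypothetical rendering of a four-field lexicon entry as described in 3.1.1.
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class LexEntry:
    input_word: str                      # the form as typed, e.g. "u"
    output_orth: str                     # what the synthesizer should say, e.g. "you"
    category: Optional[str] = None       # lexical category, e.g. "Noun"; NIL -> None
    subclasses: List[str] = field(default_factory=list)   # co-occurrence subclasses

LEXICON = {
    "u":   LexEntry("u", "you", "Noun", ["destress_pronoun", "short_subject"]),
    "pls": LexEntry("pls", "please", "Adverb", []),   # category here is an assumption
    "fwy": LexEntry("fwy", "freeway", "Noun", []),
}

def lookup(token: str) -> LexEntry:
    """Return the lexicon entry, or a pass-through entry for unknown tokens."""
    return LEXICON.get(token.lower(), LexEntry(token, token, None, []))
```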
We expect that additions to the lexicon will consist mostly of new abbreviations and short phrases.</Paragraph> <Paragraph position="15"> 3.1.2. Lexical disambiguation rules identify part-of-speech and expand ambiguous abbreviations. Currently, part-of-speech disambiguation is performed by ten rules.</Paragraph> <Paragraph position="16"> Most apply to words lexically marked for both noun and verb, e.g. act, call, need, assigning a single category, either noun or verb, when a rule's contextual tests are satisfied. For example, if the third term of the buffer contains a word that is lexically marked as 'noun+verb', the word will be assigned the category 'verb' when the second buffer element is the word to and the first buffer element is either a verb or adverb. When applied to the word string expect to call, this rule correctly analyzes call as a verb. Other part-of-speech rules distinguish the preposition to from the use of to as an infinitive marker, and distinguish the preposition vs. verb uses of like.</Paragraph> <Paragraph position="17"> Ambiguous abbreviations are items such as no, which may signify either number or the negative particle. Since TDD texts lack punctuation, the only clue to usage in such cases is local context; e.g. the presence of the word the or phone before no is used as disambiguating context to identify no as number.</Paragraph> <Paragraph position="18"> 3.1.3. Phrasing rules consider part-of-speech, word subclass and length (as measured by word count) to identify phrase boundary locations. These rules are strictly ordered. In general, they instruct the synthesizer to set off interjections (e.g. wow, oh ok, etc.), and to insert a phrase boundary before non-lexical coordinate conjunctions (e.g. and in I don't recall that and am not sure, see Bachenko and Fitzpatrick (1990:163)), before sentences, and before subordinate conjunctions (after, during, etc.). Boundaries are also inserted at noun-verb junctures unless the noun is short, and at prepositional phrase boundaries unless the prepositional phrase is short. A short noun is a single word noun phrase such as a pronoun or demonstrative (this, that); a short prepositional phrase is one with a pronominal object (with me, about it, etc.). Hence the noun-verb rule will produce the phrasings below, where double bars mark phrase boundaries (this and the prepositional phrase rule are adaptations of the verb and length rules, respectively, given in Bachenko and Fitzpatrick (1990)).</Paragraph> <Paragraph position="19"> Our formulation of the phrasing rules assumes that, in the absence of syntactic structure, the subclass membership, part-of-speech and string position can provide sufficient information to infer structure in many cases.</Paragraph> <Paragraph position="20"> For example, we are assuming that the subclass 'nominative_pronoun', which includes he, she, we, etc., acts consistently as the leading edge of a sentence, so that the parser can compensate somewhat for the lack of punctuation by identifying and setting off some top-level sentential constituents.</Paragraph>
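As an illustration of how a context rule such as the noun/verb case of 3.1.2 might be encoded, the sketch below implements that single rule over a three-element buffer. The buffer layout, dictionary keys, and category tags are assumptions; only the rule's conditions come from the text.

```python
# Sketch of the noun+verb disambiguation rule from 3.1.2 (assumed encoding):
# a word marked noun+verb in buffer position 3 is tagged as a verb when
# position 2 holds "to" and position 1 holds a verb or adverb,
# as in "expect to call" -> call/VERB.

def disambiguate_noun_verb(buffer):
    """buffer is a list of three tdd-term-like dicts, oldest first."""
    if len(buffer) < 3:
        return
    first, second, third = buffer
    if ("NOUN" in third["categories"] and "VERB" in third["categories"]
            and second["text"] == "to"
            and first.get("category") in ("VERB", "ADV")):
        third["category"] = "VERB"

# e.g. applied to the word string "expect to call":
buf = [
    {"text": "expect", "category": "VERB", "categories": {"VERB"}},
    {"text": "to", "category": None, "categories": set()},
    {"text": "call", "category": None, "categories": {"NOUN", "VERB"}},
]
disambiguate_noun_verb(buf)
assert buf[2]["category"] == "VERB"
```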
<Paragraph position="21"> Similarly, prepositions are assumed to act consistently as the leading edge of a prepositional phrase; the parser guesses about prepositional phrase length by checking the word class of the element following the preposition to see if the object is pronominal.</Paragraph> <Paragraph position="22"> The phrase rules thus attempt to seek out major syntactic constituents. If there is evidence of constituency, the parser may look for a short constituent or it will simply insert a prosodic boundary at a presumed syntactic boundary (e.g. a verb phrase, sentence or subordinate conjunction).</Paragraph> </Section> <Section position="8" start_page="27" end_page="29" type="metho"> <SectionTitle> 3.2. PARSER IMPLEMENTATION 3.2.1. SYSTEM ARCHITECTURE </SectionTitle> <Paragraph position="0"> The quickest way to incorporate a TDD parser into a service using text-to-speech (TTS) synthesis is to implement the parser in a separate front-end module to the text-to-speech system. The parser filters the input stream from the TDD modem and sends the processed text to the text-to-speech system where it is synthesized for the voice telephone user, as shown in the block diagram in figure 2. This architecture minimizes the need to modify any existing equipment or system. Also, it allows us to maintain and change the parser module without introducing substantial, or unpredictable, changes elsewhere in the system.</Paragraph> <Paragraph position="1"> 3.2.2. IMPLEMENTATION Integrating the TDD parser into a near real-time system architecture is a difficult task. To achieve it, the parser must (a) filter the TDD input stream in real time in order to identify tokens, i.e. words, abbreviations, and expressions, that are suitable for processing by parser rules, and (b) group these tokens into natural-sounding phrases that can be sent to the text-to-speech system as soon as they are formed.</Paragraph> <Paragraph position="2"> In an ideal situation, it is desirable to parse the entire TDD input before sending the processed text to the text-to-speech synthesizer. But the practical situation demands that the voice user hear TDD text synthesized as soon as it is reasonably possible so that long periods of silence can be prevented. Figure 3 below shows the basic architecture chosen to implement the parser described in this paper.</Paragraph> <Paragraph position="3"> 3.2.2.1. The canonical input filter process has to deal with the TDD input characters as they are being typed. The output of the canonical filter consists of TDD word tokens, i.e. groups of characters separated by white spaces. Input characters arrive at irregular speeds with nondeterministic periods of pauses due to uneven typing by the TDD user. Also, incidences of spelling errors, typographical mistakes, and attempts to amend previously typed text occur at very irregular rates. Even the TDD modem can contribute text to the input stream that is seen by the canonical input filter. For instance, the TDD modem might periodically insert a carriage-return character to prevent text wraparounds on the special operator's terminal. 
Unfortunately, these carriage-return characters could split words typed by the TDD user into incoherent parts, e.g., advantage might become adva&lt;CR&gt;ntage.</Paragraph> <Paragraph position="4"> Since the voice telephone user needs to hear TDD text synthesized after some, hopefully short, interval of time, the input filter cannot wait indefinitely for TDD characters that are delayed in arriving, as might occur when the TDD user pauses to consider what to type next.</Paragraph> <Paragraph position="5"> Hence, the filter includes an input character timeout mechanism. The timeout interval is set to an appropriately short duration to ensure the timely synthesis of available TDD text, but still long enough to prevent the exclusion of forthcoming input characters.</Paragraph> <Paragraph position="6"> 3.2.2.2. Lexigraphical analysis examines the TDD word tokens to identify contiguous words that should be grouped together as individual units. The multi-word expressions include contractions (e.g. &quot;it ' s&quot;, which becomes &quot;it's&quot;) and commonly used short phrases that can be viewed as single lexical units (e.g. &quot;my goodness&quot;, &quot;as long as&quot;, and &quot;mother in law&quot;). A simple stacking mechanism is used to save tokens that are identified as potential elements of multi-word expressions. The tokens are stacked until the longest potential multi-word expression has been identified, with three words being the maximum, after which the stack is popped and the corresponding structures (described below) are constructed.</Paragraph> <Paragraph position="7"> 3.2.2.3. The lexical lookup process builds a tdd-term structure (record) from these tokenized words and multi-word expressions in preparation for invoking the phrasal segmentation rules. Fields in the structure include the tokenized input text (the original orthographic representation), the output orthography, lexical category (Noun, Verb, Adverb, NIL, etc.), word subclass, and other fields used internally by the phrasal segmentation process. At this point in the processing only the input text field has any non-null information. The output orthography, lexical category, and word subclass fields are filled via lexical lookup.</Paragraph> <Paragraph position="8"> The lexicon is organized into the four fields mentioned above. The tdd-term input text field is compared with the corresponding field in the lexicon until a match is found, and the three remaining fields in the matched entry are then copied into the tdd-term structure. If no match is found, then the input text field is copied into the output text field and the other two lexicon fields are set to NIL.</Paragraph> <Paragraph position="9"> As an illustration, if the single letter u is identified as our TDD token, the lexical lookup process might return with a tdd-term structure that looks like: input text: &quot;u&quot;; output text: &quot;you&quot;; lexical category: NOUN; subclasses: (DESTRESS_PRONOUN SHORT_SUBJECT); other fields: NIL.</Paragraph> <Paragraph position="10"> For the input text oic, the structure might look like: input text: &quot;oic&quot;; output text: &quot;oh, I see&quot;; lexical category: INTJ; subclasses: INTERJECTION1; other fields: NIL.</Paragraph>
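A minimal sketch of the canonical input filter of 3.2.2.1 is given below, assuming characters are delivered through a queue.Queue; the timeout value, the sentinel convention, and all names are assumptions. In the deployed system the timeout also flushes the downstream parser buffers into the synthesizer, which this fragment does not show.

```python
# Sketch (assumed, not the deployed code) of the canonical input filter of
# 3.2.2.1: characters arrive irregularly from the TDD modem, carriage returns
# inserted by the modem are dropped so they cannot split words, and a short
# timeout forces out whatever has accumulated so synthesis is not delayed.
import queue

TIMEOUT_SECS = 2.0        # illustrative value; the paper does not give one

def token_stream(char_queue: "queue.Queue[str]"):
    """Yield whitespace-delimited word tokens from a queue of typed characters."""
    word = []
    while True:
        try:
            ch = char_queue.get(timeout=TIMEOUT_SECS)
        except queue.Empty:
            if word:                     # timeout: flush the partial word
                yield "".join(word)
                word = []
            continue
        if ch is None:                   # sentinel: end of the call
            break
        if ch == "\r":                   # modem-inserted CR; drop it so words stay whole
            continue
        if ch.isspace():
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(ch)
    if word:
        yield "".join(word)
```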
<Paragraph position="11"> 3.2.2.4. The phrasal segmentation process applies a modest set of disambiguation and phrasing rules to a sliding window containing three contiguous tdd-term structures. In the start condition the sliding window will not have any tdd-term structures within it. Each new tdd-term structure generated by lexical lookup enters the first term position in the window, bumping existing terms forward one position, with the last (third) term discarded after its output orthography is copied into a text buffer awaiting transmission to the text-to-speech synthesizer. The various rules described in Section 3.1 above are then applied to the available tdd-term structures. After a pronounceable phrase is identified, the output orthography of all active tdd-terms is copied to the TTS text buffer, which is subsequently sent to the synthesizer for playback to the voice telephone user. Also, the invocation of a timeout alarm due to tardy TDD input text causes flushing of the sliding window and text buffer into the synthesizer. The sliding window and TTS text buffer are cleared and the rules restarted anew.</Paragraph> <Paragraph position="12"> Listed below are a few examples of TDD text processed by the parser.</Paragraph> <Paragraph position="13"> TDD: I DONT THINK SO I WILL THINK ABOUT IT GA</Paragraph> </Section> </Paper>
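To make the sliding-window bookkeeping of 3.2.2.4 concrete, the following sketch mirrors that behavior in Python. The class name, the rule and speak callables, and the term objects' output attribute are assumptions; this is an illustration of the described process, not the system's code.

```python
# Sketch (assumed names and layout) of the phrasal segmentation process in
# 3.2.2.4: a three-slot sliding window of tdd-term structures, a TTS text
# buffer, and a flush on either a detected phrase boundary or a timeout.

class PhrasalSegmenter:
    def __init__(self, rules, speak):
        self.rules = rules          # ordered disambiguation and phrasing rules
        self.speak = speak          # callable that sends text to the synthesizer
        self.window = []            # newest term first, at most three terms
        self.text_buffer = []       # output orthography awaiting transmission

    def add_term(self, term):
        """term is a tdd-term-like object with an .output field."""
        if len(self.window) == 3:                     # bump the oldest term out
            self.text_buffer.append(self.window.pop().output)
        self.window.insert(0, term)                   # new term takes position one
        for rule in self.rules:                       # rules from Section 3.1
            if rule(self.window):                     # rule signals a phrase boundary
                self.flush()
                break

    def flush(self):
        """Send buffered text plus all active terms; also called on timeout."""
        text = self.text_buffer + [t.output for t in reversed(self.window)]
        if text:
            self.speak(" ".join(text))
        self.window.clear()
        self.text_buffer.clear()
```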