<?xml version="1.0" standalone="yes"?> <Paper uid="P86-1022"> <Title>THE CONTRIBUTION OF PARSING TO PROSODIC PHRASING IN AN EXPERIMENTAL TEXT-TO-SPEECH SYSTEM</Title> <Section position="3" start_page="145" end_page="146" type="metho"> <SectionTitle> SYNTACTIC STRUCTURE AND PROSODIC PHRASING </SectionTitle> <Paragraph position="0"> Certain relations between syntax and prosody,</Paragraph> <Paragraph position="1"> especially at the word level, are well known. For example, the syntactic category of a word may affect its phonetic realization, as in the verb/adjective distinction of separate, approximate, and the verb/noun distinction of house, wind, lives. Likewise, syntactic category affects word stress, so that verbs such as progress, insert, object, and rebel receive final stress, whereas the corresponding nouns receive penultimate stress. Beyond the word level, however, there has been little investigation of systematic connections between syntactic structure and prosodic phrasing. The psycholinguistic and acoustic investigations of Cooper and Paccia-Cooper (1980), Umeda (1982) and Gee and Grosjean (1983) and the prosodic theory of Selkirk (1984) are among the more notable studies and represent the two main approaches to syntax/prosody relations. [Footnote 2: Note that without a syntactic analysis that correctly identifies grammatical functions, it is impossible to determine whether the word mark is a noun ending the subject phrase or the verb of the predicate phrase. Simple &quot;surface&quot; parsers, such as that described in Umeda and Teranishi (1974), will still fail to identify the prosodic boundary correctly.] 
In Cooper and Paccia-Cooper (1980) and Umeda (1982), the connection from syntax to prosodic phrasing is unmediated by any filtering process, i.e.,</Paragraph> <Paragraph position="2"> they propose that the details of prosodic phrasing can be determined directly from syntactic structure by associating particular syntactic nodes (or constituent boundaries) with a phonetic value, either pausing, segmental lengthening, or the blocking of the cross-word conditioning of phonological rules. By contrast, Gee and Grosjean (1983) and Selkirk (1984) believe that the syntax-prosody relation is indirect: prosodic phrasing is derived by rules that refer to left-to-right ordering, length (or branching patterns), and, in the case of Selkirk, grammatical function, as well as constituent membership in order to infer a hierarchical prosodic structure. But while their respective positions are quite clear, none of these studies is conclusive. All lack a syntactic framework sufficiently detailed and formalized to allow extensive testing, and most consider only a small number of sentences and sentence types. 3</Paragraph> <Paragraph position="3"> To develop our analysis, we first examined prosodic phrasing in the speech of one of us reading prose from various texts, including four instruction manuals. These texts were later augmented by a professional reading of a prose story. The boundaries between prosodic phrases were identified and then classed according to their syntactic context and semantic function.</Paragraph> <Paragraph position="4"> Our results, which are outlined below, indicate an organization of the prosodic phrases that supports the 'indirect relationship' approach of Gee and Grosjean (1983) and Selkirk (1984). 
We found that, in our corpus, prosodic phrasing depends on three aspects of structure: the breakdown into syntactic constituents, the grammatical function of a constituent, and constituent length. Let us review each of these factors.</Paragraph> <Paragraph position="5"> Syntactic Constituency.</Paragraph> <Paragraph position="6"> The possible constituents recognized by our parser are Noun Phrase (NP), Verb Phrase (VP), Adjective Phrase (AdjP), Adverb Phrase (AdvP), and Prepositional Phrase (PP). In general, we found that syntactic constituency is particularly important for predicting points at which a prosodic phrase boundary is not produced, i.e., the words within a syntactic constituent cohere. For example, the italicized phrases in (1)-(5) had no perceptible boundaries at the locations indicated by #: (1) Left-hand # power unit is connected ...</Paragraph> <Paragraph position="7"> (2) This procedure shows # you ...</Paragraph> <Paragraph position="8"> (3) An extremely # narrow opening ...</Paragraph> <Paragraph position="9"> (4) To spread powerload more # evenly (5) ... next # to any powered di-group. The single exception to word cohesion within syntactic constituents involved boundaries between the verb and its first or second object when the object in question was lengthy. We discuss this exception below. [Footnote 3: Gee and Grosjean (1983) use a corpus of 14 sentences. Umeda (1982) considers a large corpus but, like Gee and Grosjean, does not distinguish among grammatical functions. Although Selkirk cites many examples in her discussions of phrasal stress and word-level prosody, her description of prosodic phrasing focuses on only a single example.]</Paragraph> <Paragraph position="10"> Grammatical Functions.</Paragraph> <Paragraph position="11"> Our sample indicated that phrase boundaries are also determined by the grammatical relations among the syntactic constituents, i.e. the argument structure of the sentence. 
Four grammatical relations concern us: (a) subject-predicate, as in The 48-channel module -- has two di-groups.</Paragraph> <Paragraph position="12"> (b) head-complement, where the head can be a noun, verb, or adjective and may have one complement, e.g. has -- two di-groups, or two complements, e.g. shows -- you -- how to fly your kite. (c) sentence-adjunct, as in Insert unit into correct shelf location -- per detail instructions.</Paragraph> <Paragraph position="13"> (d) head-modifier, where the head can be a noun, verb, adverb, or adjective and the modifier can be one of several things, depending on the head (e.g., for nouns, the modifier can be a relative clause; for verbs, it can be a prepositional phrase; for adjectives and adverbs, the modifier can be a comparative).</Paragraph> <Paragraph position="14"> We observed a hierarchy among these relations with respect to the strength, or perceptibility, of a prosodic boundary, with the boundary between sentence and adjunct receiving the highest potential boundary strength, followed by the subject-predicate boundary, then the head-complement and head-modifier boundaries. Thus in (6), there is a strong boundary between subject and predicate, whereas in (7), due to the strong boundary between adjunct and core sentence, the subject-predicate boundary diminishes. (Dashes indicate the location of the boundary being discussed.) (6) The name of the character -- is not pronounced.</Paragraph> <Paragraph position="15"> (7) When this switch is off -- the name of the character is not pronounced.</Paragraph> <Paragraph position="16"> Constituent Length.</Paragraph> <Paragraph position="17"> While we may view each boundary as having an intrinsic strength based on constituency and grammatical function, the determination of actual strengths appears to depend on the interaction of the intrinsic strength of a boundary with the strengths of other boundaries in the sentence, as well as the distance between these boundaries. 
The most salient of the interactions we observed was between the placement of a boundary at the subject-predicate junction and the placement of a boundary following the verb-complement junction. The mediating factor in this interaction was the relative length of the subject with respect to the length of the verb's complements. Thus a sentence such as (8), with both a short subject and a single short object, generally is produced without a boundary in either position.</Paragraph> <Paragraph position="18"> (8) You have completed the task.</Paragraph> <Paragraph position="19"> But if, as in (9), the subject is long relative to the object, then a break occurs between the subject and predicate. Conversely, if the subject is short relative to the object, then a break will occur between the verb and the object, as in (10). Or, if there are two objects and the first is simple, the break will occur between them, as in (11). (9) The materials required -- are one kite kit.</Paragraph> <Paragraph position="20"> (10) How shall we judge -- the goodness of an algorithm? (11) This procedure shows you -- how to fly your kite.</Paragraph> </Section> <Section position="4" start_page="146" end_page="149" type="metho"> <SectionTitle> AN EXPERIMENTAL PROSODY SYSTEM </SectionTitle> <Paragraph position="0"> Our findings confirmed that syntactic structure plays a major role in determining prosodic structure, but the relationship is indirect--the exact influence of syntactic constituency varies according to the length and grammatical function of each constituent. To refine and test this idea, we implemented an experimental text-to-speech system in which rules apply to a parse tree to infer prosodic structure and then annotate the input string with phrasing information derived from the prosodic structure; this annotated input string is submitted to the Bell Labs text-to-speech programs, which convert it into a speech file. 
Our system comprises three components: a parser that builds syntactic structure, rules that derive prosody information from the syntactic structure, and the Bell Labs text-to-speech programs.</Paragraph> <Paragraph position="1"> The parser and speech programs are independent components. The prosody rules act as a filter between them, converting the syntactic information generated by the parser into prosodic information that can be supplied to the text-to-speech programs.</Paragraph> <Paragraph position="2"> Parsing.</Paragraph> <Paragraph position="3"> Our parser is a version of Fidditch (Hindle 1983), a moderate-coverage parser based on the deterministic model described in Marcus (1980). To build syntactic structure, Fidditch uses a grammar that requires the representations produced by lexical and syntactic rules to be consistent with the (semantic) predicate-argument structure. The surface syntactic structures generated by the parser represent the argument structure of a phrase or sentence, i.e. the &quot;core&quot; constituents of a sentence (its subject (NP), modality (AUX), and predicate (VP)) and the complements of phrasal heads. The structure is determined, for the most part, by rules that refer to argument information that is specified in the lexicon for the content words (nouns, verbs, adjectives, adverbs), and by rules that insert null terminals such as the &quot;trace&quot; of wh-movement. In general, the grammar is consistent with the government and binding framework of Chomsky (1981), as adapted to the needs of a parser.</Paragraph> <Paragraph position="4"> The input to the parser is a phrase or sentence (punctuation is optional). Its output is a surface structure tree in which the status of a constituent with respect to the predicate-argument structure of the sentence is indicated by the constituent's attachment to higher nodes in the tree. 
Thus only constituents that belong to the core are attached to the S node, and only complements of a phrasal head can become right-hand sisters of the head. Adjuncts and modifiers,</Paragraph> <Paragraph position="5"> whose role depends on semantic and pragmatic information about the discourse domain, have no assigned position within a structure and so are represented as &quot;orphan&quot; nodes in the tree.</Paragraph> <Paragraph position="6"> For example, Figure 1 shows the parse tree for Left-hand power unit on each shelf in 48-channel module can power only the echo cancelers that are in that shelf. 4 The structure in Figure 1 contains a single core sentence -- unit can power the cancelers -- with left-branching modifiers -- left-hand, power, and echo. The sentence also contains three modifiers -- the PPs on each shelf and in 48-channel module, and the adverb only -- which are unattached constituents. This is the significance of the unlabeled node dominating each of these constituents. The PPs are not attached because unit is not lexically marked to take a PP headed by on or in as a complement, and shelf is not lexically marked to take a PP complement headed by in. Nor is any constituent lexically marked to accept only as an argument.</Paragraph> <Paragraph position="7"> Figure 1 also contains a relative clause, that are in that shelf. In the relative clause, T is a null terminal that stands for the trace of the relativized subject NP; the * in tense stands for a null Aux element. Because nouns do not select relative clauses as arguments (any noun can be relativized), the parser does not identify the relations of the modifier constituent to the elements of the core sentence. 
Hence the relative clause is not attached to any other syntactic node in the tree.</Paragraph> <Paragraph position="8"> Text-to-speech Synthesis.</Paragraph> <Paragraph position="9"> The programs that make up the speech component are described in Liberman and Buchsbaum (personal communication). These programs take English text as input and produce digitized speech output. By annotating the input text to this system, many aspects of its operation can be overridden or modified: e.g. the location of major and minor phrase boundaries, the stress given to words, the transcription of words and the boundaries between them, the timing of segments, and details of the pitch contour. As we will show, with our prosody system we are able to produce strings in which four boundary levels are identified and perceptually distinguished, using the current text-to-speech system annotations.</Paragraph> <Paragraph position="10"> Prosodic Phrasing.</Paragraph> <Paragraph position="11"> The prosody rules use information about constituent structure, grammatical role, and length to map a surface structure such as that in Figure 1 onto a prosody tree such as that in Figure 2. 
The prosody tree identifies the location of phrase boundaries (signified by the φ nodes) and the relative strength of each boundary (signified by a number in the φ node).</Paragraph> <Paragraph position="12"> It is this information that is used to annotate the input text with escape sequences that provide the text-to-speech system with instructions about prosodic phrasing.</Paragraph> <Paragraph position="13"> In formulating our rules for building the prosodic structure, we began with the idea of simply implementing the model of Gee and Grosjean (1983).</Paragraph> <Paragraph position="14"> This model, initially proposed to predict a form of psychological data describing subjective sentence structure known as performance structure, determines prosodic boundaries from a syntactic tree, but assumes rather than explicitly presents a syntactic component. We were initially attracted to the Gee and Grosjean model because of its emphasis on relative boundary weighting, i.e., on the determination of the strength of a given boundary with respect to the other boundaries in the sentence. We found that in the data we had collected, this weighting played an important role. In fact, we incorporated directly into our system one method of doing this weighting, namely Gee and Grosjean's rule to determine the strengths of the prosodic phrase boundaries around a verb using relative length (as measured by terminal node count).</Paragraph> <Paragraph position="15"> As we extended Gee and Grosjean's model to create an algorithm adequate for use in a general purpose system, our algorithm diverged from its starting point, reflecting our attempts to correct weaknesses and lacunae that we encountered in the Gee and Grosjean model. 
That we encountered these problems is not surprising given the difference between our goals and those of Gee and Grosjean.</Paragraph> <Paragraph position="16"> The most important difference between the Gee and Grosjean model and our current algorithm involves the factors determining boundary weight.</Paragraph> <Paragraph position="17"> Gee and Grosjean assume that this weighting is dependent only on the number of syntactic nodes, their left-to-right ordering and, in the case of the verb phrase, on constituent length. In contrast, our data, in agreement with Selkirk's (1984) theoretical analysis, indicated that boundary strength is dependent on the grammatical functions that the constituents in a given sentence play. In particular, we observed a hierarchy among these functions with respect to boundary strength, as discussed below. 5 In addition to incorporating grammatical function information into our system, we fleshed out the model of Gee and Grosjean to deal with syntactic structures that they do not explicitly consider. [Footnote 5: As an example of the effect that grammatical functions have on prosodic phrasing, consider the sentence Finally the strange young man left. We view this sentence as consisting of two grammatical relations: subject-predicate and adjunct-sentence. In our hierarchy of grammatical relations, the boundary between the adjunct and the sentence is more salient than the boundary between the subject and the predicate. The system reflects this by assigning a stronger boundary following Finally than following man.</Paragraph> <Paragraph position="18"> If we exclude any effects of grammatical functions and assume a simple left-to-right attachment of the three constituents Finally, the strange young man and left, to the prosody tree, we would assign a stronger boundary following man than following Finally. It is not clear that Gee and Grosjean make this left-to-right assumption in such examples. They view adverbial phrases like Finally as dominated by the complementizer node in the syntax tree, and it is difficult to determine whether they integrate the material in the complementizer with the material in the core sentence as they are analyzing the material in the core sentence or after that analysis is completed. If they integrate the complementizer with the core sentence, then they assume that Finally bundles with the sentence in a left-to-right manner and predict, incorrectly, that the stronger boundary occurs after man. If they complete the prosodic analysis of the core sentence before bundling the sentence with the complementizer, then they incorrectly predict that there is a strong boundary after wh-phrases in the complementizer. In particular, they would incorrectly predict that in sentences like At the outset what problems did you expect the most perceptible boundary would be after problems.</Paragraph> <Paragraph position="19"> Furthermore, assuming that an adjunct in sentence-initial position is dominated by the complementizer node and in sentence-final position by S-bar creates an inconsistent description, which hampers the value of the model as an experimental tool.]</Paragraph> <Paragraph position="20"> In particular, Gee and Grosjean's strictly left-to-right building of the prosodic tree left certain questions open. For example, their model does not deal with sentences embedded in the middle of a main sentence (as in The notion [that he would refrain from such an act] was incorrect). We incorporate embedded sentences into the prosodic tree in a cyclic manner to ensure that the material in the embedded sentence is processed before that in the main sentence. 6 In addition, Gee and Grosjean leave open the treatment of the multiple rightward embedding of non-sentential constituents, e.g., the NP embedding in The destruction of the good name of his father. 
Our approach is to handle these cases recursively, from the most deeply embedded phrase up, in order to preserve the prosodic cohesion of the entire NP.</Paragraph> <Paragraph position="21"> Our adjunction rules are derived for the most part from Selkirk's account. We have also made use of the idea, which Gee and Grosjean (1983) take largely from the work of Selkirk, that certain syntactic heads mark off phonological phrase boundaries, and provide the basic prosodic constituents for higher level analysis.</Paragraph> <Paragraph position="22"> Our prosody rules run in four independent stages.</Paragraph> <Paragraph position="23"> Each stage builds on the previous stage, so that the rules can refer to both syntactic and prosodic structure as they build successively higher levels of prosodic structure.</Paragraph> <Paragraph position="24"> (i) Adjunction Rules combine orthographically distinct words into phonological constituents with no internal word boundary. They join a word to its left or right neighbor depending on (a) the category of the word, and (b) its structural relation to other words. In general, adjoinable words are the function words: articles, complementizers, auxiliary verbs, conjunctions, prepositions and pronouns (except for the &quot;strong&quot; possessives, mine, hers, theirs, yours, ours, which are treated as regular NPs).</Paragraph> <Paragraph position="25"> Adjunction occurs six times for the sentence in Figure 2 to create six multiple word groups, all right-adjoining: on each, in 48-channel, can power, the echo, that are and in that. These groups of adjoined words appear as terminals in the prosody tree in Figure 2. In subsequent processing the boundaries between the words in these groups are marked so that the text-to-speech system does not produce the prosodic indications of a word boundary. 
In addition, these groups are treated as single words in further analyses.</Paragraph> <Paragraph position="26"> (ii) φ-phrasing Rules construct phonological (or φ) phrases, which are the building blocks of the prosody tree. These rules identify groups of words that cohere strongly in speech and thus should not be separated by phrase boundaries. In the present implementation, each φ phrase is constructed by a left-to-right process that collects the words formed by adjunction until it reaches a noun or verb. At this point, a φ phrase is created that consists of the collected words plus the noun or verb, which acts as head of the phrase. For example, in that shelf, in Figure 2, is a single φ phrase consisting of two words.</Paragraph> <Paragraph position="27"> [Footnote 6: Having taken this strong approach, we now understand the limited exceptions to this mechanism, which we discuss below.] In Figure 2, the φ nodes marked with a syntactic category are the minimal phonological constituents with respect to later rules that build the prosodic 
phrases; these @ phrases have an internal structure, but the structure plays no role in further processing.</Paragraph> <Paragraph position="28"> Note that neither adjectives nor adverbs are allowed to be the head of a * phrase, so that three additional open slots is a single * phrase consisting of four words.</Paragraph> <Paragraph position="29"> Examples such as Someone tall walked into the room, however, suggest that our treatment of these categories is not detailed enough and that, in future versions of the system, some adjectives and adverbs should act as * heads.</Paragraph> <Paragraph position="30"> (iii) Prosody-phrasing rules use information about phrases and syntactic structure to create a new organization of the sentence and to assign strength values to the boundaries between successive * phrases.</Paragraph> <Paragraph position="31"> The process of building the prosody tree starts with the sentence node (S or Sbar) that is most deeply embedded in the utterance, transforming it into a prosody subtree. This process continues through successively higher levels of sentence nodes until all top-level sentences have been transformed into prosody subtrees. All the processing of each successive sentence is done before the relation of the sentences to each other is considered7 Within a sentence, the * phrases are processed from left to right. This stage of the analysis uses a window that allows access to three adjacent nodes.</Paragraph> <Paragraph position="32"> Pattern-action rules, which are described below, apply to the nodes in the window and build prosody subtrees that replace the syntax nodes. These subtrees are headed by a * node containing a number that represents node count; the number is determined by counting the number of nodes contained in the prosodyasubtree, plus 1 for the * node that heads the subtree. In general, the prosody phrase rules do three things: (a) Balance prosodic phrases by referring to constituent length. 
This rule applies only when building the prosody subtree that contains the verb. If the node count for subject plus verb is less than the node count of the verb's complement, then subject and verb are grouped together in a prosodic subtree; this gives the phrasing in The characters on the right -- mark the salient features. Otherwise, the verb is grouped with its complement in a prosodic subtree; an example of this grouping is the subtree for can power only the echo cancelers in Figure 2. (b) Combine the φ phrase daughters of the major constituents, excluding VP, into a prosodic subtree.</Paragraph> <Paragraph position="33"> At present, this rule only applies to NP and PP since adjectives and adverbs are currently not treated as φ heads. For example, the name of the character, which forms two φ phrases under NP (the name and of the character), becomes a single prosody phrase that replaces the NP.</Paragraph> <Paragraph position="34"> [Footnote 7: We have found at least one class of phrases for which this order of processing appears inappropriate. In these, the head of the top-level phrase is epistemic -- e.g., believe, know, belief, knowledge -- and its complement is a sentence. In most cases, the current processing order for embedded sentences will produce a break between a head and a following embedded sentence. For this class of sentences, however, the break does not seem to be appropriate. While it would be straightforward to handle this as an exception, we are currently examining whether there is a more principled way to describe what must be done in these cases.] [Footnote 8: Only the top-level φ nodes, those which contain the head of the syntactic phrase, are counted in computing the node count. For example, in Figure 2, the sub-phrasal branching of Left-hand and power unit does not contribute to the node count.] 
(c) Bundle together prosodic constituents (φ phrases) from left to right if no other rules apply.</Paragraph> <Paragraph position="35"> This rule integrates the constituents left unattached by the parser into the prosodic structure. It accounts for the prosodic structure of left-hand power unit on each shelf in 48-channel module in Figure 2, which is formed by first bundling left-hand power unit with on each shelf, into φ-3, and then bundling the result with in 48-channel module into φ-5. The final application of bundling replaces the Sigma node with the top level prosody node, which is φ-13 in Figure 2.</Paragraph> <Paragraph position="36"> (iv) Prosody conversion rules map the boundary strength indices onto three phonological mechanisms.</Paragraph> <Paragraph position="37"> Boundary indices in the low range, e.g. the φ-3 nodes in Figure 2, are realized as a phrase accent (Pierrehumbert 1980). Mid-range indices such as φ-5 and φ-9 in Figure 2 are realized as changes in pitch range. High indices are realized with modulations in both pitch range and duration. Thus the hierarchical organization of a structure such as that in Figure 2 can be reflected directly in the synthesized speech.</Paragraph> </Section> <Section position="5" start_page="149" end_page="149" type="metho"> <SectionTitle> PHENOMENA NOT TREATED </SectionTitle> <Paragraph position="0"> Several phenomena have been omitted from this preliminary version of the system. Some of these omissions arise from the fact that we concentrated on sentence analysis rather than discourse analysis.</Paragraph> <Paragraph position="1"> Others involve phenomena that characterize spoken English, and thus did not occur in our original corpus of technical repair manuals.</Paragraph> <Paragraph position="2"> Contrastive stress is an example of prosodic phrasing based on discourse analysis. 
In our system's analysis, the phrase from India does not receive contrastive stress in (12).</Paragraph> <Paragraph position="3"> (12) Passengers from several countries entered the terminal.</Paragraph> <Paragraph position="4"> Finally a man from India walked in.</Paragraph> <Paragraph position="5"> In designing the current system, we have concentrated on the level of sentence analysis. Handling the contrasts involved in data like (12) necessitates an additional level of discourse analysis.</Paragraph> <Paragraph position="6"> In addition, the system never explicitly manipulates segment durations or overall speech rate. For example, we have yet to explore whether lengthening of the segment before a mid-range boundary value is appropriate, or whether increasing the duration of constituents of the core sentence might enhance the natural sound of the system.</Paragraph> </Section> </Paper>