File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/w00-0310_metho.xml
Size: 17,020 bytes
Last Modified: 2025-10-06 14:07:20
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0310"> <Title>Using Dialogue Representations for Concept-to-Speech Generation</Title> <Section position="3" start_page="0" end_page="48" type="metho"> <SectionTitle> 2 Theoretical Foundations </SectionTitle> <Paragraph position="0"> In this work, we implement and extend the compositional theory of intonational meaning proposed by Pierrehumbert and Hirschberg (1986; 1990), who sought to identify correspondences between the Grosz and Sidner (1986) computational model of discourse interpretation and Pierrehumbert's prosodic grammar for American English (1980). [Authors' e-mail: {chn | jencc}@research.bell-labs.com]</Paragraph> <Paragraph position="1"> In the present work, certain aspects of the original theories are modified and adapted to the architecture of the dialogue system in which the concept-to-speech (CTS) component is embedded. Below, we present the fundamental definitions and principles of intonation underlying our CTS system.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Intonational System </SectionTitle> <Paragraph position="0"> In our CTS system, the prosodic elements that are computed are based on the intonational system of Pierrehumbert (1980), who defined a formal language for describing American English intonation using the following regular grammar: Inton Phrase → (Interm Phrase)+ Bndry Tone; Interm Phrase → (Pitch Acc)+ Phrase Acc. Major phrases, or intonational phrases, are made up of one or more minor phrases, or intermediate phrases. Melodic movements in intermediate and intonational phrases are in turn expressed by three kinds of tonal elements. 
These include six pitch accents: a low pitch excursion (L*), a high pitch excursion (H*), or a combination of both low and high excursions (L*+H, L+H*, H*+L, H+L*); two phrase accents: a high (H-) or low (L-) tonal target that guides the interpolation of the melodic contour from final pitch accent to intermediate phrase ending; and two boundary tones: a high (H%) or low (L%) tonal target that guides interpolation from phrase accent to intonational phrase ending.</Paragraph> </Section> <Section position="2" start_page="0" end_page="48" type="sub_section"> <SectionTitle> 2.2 Intonational Meaning </SectionTitle> <Paragraph position="0"> Theoretical work on intonational meaning has attempted to relate the grammatical elements of Pierrehumbert's system (pitch accent, phrase accent and boundary tone) to interpretive processes at different levels of discourse and dialogue structure.</Paragraph> <Paragraph position="1"> Hirschberg and Pierrehumbert (1986) conjectured that the absence or presence of accentuation conveys discourse focus status, while the tonal properties of the accent itself (i.e. pitch accent type) convey semantic focus information.</Paragraph> <Paragraph position="2"> [Figure 1: example dialogue with MIMIC. MIMIC: hello this is mimic the movie information system how can I help you. User: where in hoboken is october sky playing. MIMIC: october sky is playing at hoboken cinema in hoboken; can I help you with anything else. User: when is it playing there. MIMIC: october sky is playing at hoboken cinema in hoboken at 3:45pm, 5:50pm, 7:10pm, and 10pm; can i help you with anything else.] In later work, pitch accent type was said to express whether the accented information was intended by the speaker to be &quot;predicated&quot; or not by the hearer (Pierrehumbert and Hirschberg, 1990). Non-predicated information was said to bear low-star accentuation (L*, L*+H, H+L*), while predicated information would be marked by high-star accents (H*, L+H*, H*+L). 
The theory further stated that L*+H conveys uncertainty or lack of speaker commitment to the expressed propositional content, while L+H* marks correction or contrast. The complex accent H*+L was said to convey that an inference path was required to support the predication; usage of H+L* was similarly said to imply an inference path, but did not suggest a predication of a mutual belief. Finally, phrase accents and boundary tones were said to reflect aspects of discourse structure.</Paragraph> </Section> </Section> <Section position="4" start_page="48" end_page="48" type="metho"> <SectionTitle> 3 Systems Foundations </SectionTitle> <Paragraph position="0"> Our task is to improve the communicative competence of a spoken dialogue agent by drawing on our knowledge of intonational meaning, dialogue processing and the relations between the two. Of course, a worthwhile CTS system must also outperform out-of-the-box text-to-speech (TTS) systems that may determine prosodic mark-up in linguistically sophisticated ways. As in (Nakatani, 1998), we take the prosodic output of an advanced research system that implements the Pierrehumbert theory of intonation, namely the Bell Labs TTS system, as our baseline experimental system to be enhanced by CTS algorithms. We embed the CTS system in MIMIC, a working spoken dialogue system representing state-of-the-art dialogue management practices, to develop CTS algorithms that can eventually be evaluated realistically using task-based performance metrics.</Paragraph> <Section position="1" start_page="48" end_page="48" type="sub_section"> <SectionTitle> 3.1 Dialogue System: Mixed-Initiative Movie Information Consultant (MIMIC) </SectionTitle> <Paragraph position="0"> The dialogue system whose baseline speech generation capabilities we enhance is the Mixed-Initiative Movie Information Consultant (MIMIC) (Chu-Carroll, 2000). 
MIMIC provides movie listing information involving knowledge about towns, theaters, movies and showtimes, as demonstrated in Figure 1. MIMIC currently utilizes template-driven text generation, and passes on text strings to a stand-alone TTS system. In the version of MIMIC enhanced with concept-to-speech capabilities, MIMIC-CTS, contextual knowledge is used to modify the prosodic features of the slot and filler material in the templates; we are currently integrating the algorithms in MIMIC-CTS with a grammar-driven generation system. Further details of MIMIC are presented in the relevant sections below, but see (Chu-Carroll, 2000) for a complete overview.</Paragraph> </Section> <Section position="2" start_page="48" end_page="48" type="sub_section"> <SectionTitle> 3.2 TTS: The Bell Labs System </SectionTitle> <Paragraph position="0"> For default prosodic processing and speech synthesis realization, we use a research version of the Bell Labs TTS System, circa 1992 (Sproat, 1997), that generates intonational contours based on Pierrehumbert's intonation theory (1980), as described in (Pierrehumbert, 1981). Of relevance is the fact that the various pitch accent types, phrase accents and boundary tones in Pierrehumbert's theory are directly implemented in this system, so that by generating a Pierrehumbert-style prosodic transcription, the work of the CTS system is done. 
More precisely, MIMIC-CTS computes prosodic annotations that override the default prosodic processing performed by the Bell Labs TTS system.</Paragraph> <Paragraph position="1"> To our knowledge, the intonation component of the Bell Labs TTS system utilizes more linguistic knowledge to compute prosodic annotations than any other unrestricted TTS system, so it is reasonable to assume that improvements upon it are meaningful in practice as well as in theory.</Paragraph> </Section> </Section> <Section position="5" start_page="48" end_page="49" type="metho"> <SectionTitle> 4 MIMIC's Concept-to-Speech Component (MIMIC-CTS) </SectionTitle> <Paragraph position="0"> In MIMIC-CTS, the MIMIC dialogue system is enhanced with a CTS component to better communicate the meaning of system replies through contextually conditioned prosodic features. MIMIC-CTS makes use of three distinct levels of dialogue representations to convey meaning through intonation.</Paragraph> <Paragraph position="1"> MIMIC's semantic representations allow MIMIC-CTS to decide which information to prosodically highlight. MIMIC's task model in turn determines how to prosodically highlight selected information, based on the pragmatic properties of the system reply. MIMIC's dialogue strategy selection process informs various choices in prosodic contour and accenting that convey logico-semantic aspects of meaning, such as contradiction.</Paragraph> <Section position="1" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 4.1 Highlighting Information using Semantic Representations </SectionTitle> <Paragraph position="0"> MIMIC employs a statistically-driven semantic interpretation engine to &quot;spot&quot; values for key attributes that make up a valid MIMIC query in a robust fashion (footnote 1). To simplify matters, for each utterance, MIMIC computes an attribute-value matrix (AVM) representation, identifying important pieces of information for accomplishing a given set of tasks. 
The AVM created from the following utterance, &quot;When is October Sky playing at Hoboken Cinema in Hoboken?&quot;, for example, is given in Figure 2. [Figure 2 caption fragment: ... by MIMIC's semantic interpreter.]</Paragraph> <Paragraph position="1"> Attribute names and attribute values are critical to the task at hand. In MIMIC-CTS, attribute names and values that occur in templates are typed, so that MIMIC-CTS can highlight these items in the following way: 1. All lexical items realizing attribute values are accented.</Paragraph> <Paragraph position="2"> 2. Attribute values are synthesized at a slower speaking rate.</Paragraph> <Paragraph position="3"> 3. Attribute values are set off by phrase boundaries. 4. Attribute names are always accented.</Paragraph> <Paragraph position="4"> These modifications are entirely rule-based, given a list of attribute names and typed attribute values. [Footnote 1: Specifically, MIMIC uses an n-dimensional call router front-end (Chu-Carroll, 2000), which is a generalization of the vector-based call-routing paradigm of semantic interpretation (Chu-Carroll and Carpenter, 1999); that is, instead of detecting one concept per utterance, MIMIC's semantic interpretation engine detects multiple (n) concepts or classes conveyed by a single utterance, by using n call routers in parallel.]</Paragraph> <Paragraph position="5"> Even such minimal use of dialogue information can make a difference. For example, changing the default accent for the following utterance highlights the kind of information that the system is seeking, instead of highlighting the semantically vacuous main verb, like (footnote 2): Default TTS: what movie would you LIKE; MIMIC-CTS: what MOVIE would you like.</Paragraph> </Section> <Section position="2" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 4.2 Conveying Information Status using the Task Model </SectionTitle> <Paragraph position="0"> MIMIC performs a set of information-giving tasks, i.e. 
what, where, when, location, that are concisely defined by a task model. MIMIC processes the AVM for each utterance and then evaluates whether it should perform a database query based on the task specifications given in Figure 3. The task model defines which attribute values must be filled in (Y), must not be filled in (N), or may optionally be filled in (-), to &quot;license&quot; a database query action. If no task is &quot;specified&quot; by the current AVM state, MIMIC employs various strategies to progress toward a complete and valid task specification.</Paragraph> <Paragraph position="1"> For example, in response to the following user utterance, MIMIC initiates an information-seeking subdialogue to instantiate the theater attribute value to accomplish a when task: User: when is october sky playing in hoboken; MIMIC-CTS: what THEATER would you like. To better convey the structure of the task model, which is learned by the user through interaction with the system, we define four information statuses based on properties of the task model, which align on a scale of given and new in the following order:</Paragraph> </Section> </Section> <Section position="6" start_page="49" end_page="51" type="metho"> <SectionTitle> OLD INFERRABLE KEY HEARER-NEW </SectionTitle> <Paragraph position="0"> (The scale above runs from [given] on the left to [new] on the right.) KEY information is that which is necessary to formulate a valid database query, and is exchanged and (implicitly or explicitly) confirmed between the system and user. INFERRABLE information is not explicitly exchanged between the system and user, but is derived by MIMIC's limited inference engine that seeks to instantiate as many attribute values as possible. For instance, a theater name may be inferred given a town name, if there is only one theater in the given town. [Figure 4: User: where in montclair is analyze this playing. MIMIC: analyze this is playing at wellmont theatre and clearviews screening zone in montclair. Caption fragment: ... bold-faced reply string, generated by MIMIC-CTS.] 
OLD information is inherited from the discourse history, based on updating rules relying on confidence scores for attribute values. HEARER-NEW information (cf. (Prince, 1988)) is that which is requested by the user, and constitutes the only new information on the scale. But note that KEY information, while given, is still clearly in discourse focus, along with HEARER-NEW information.</Paragraph> <Paragraph position="1"> The next step is to map the information statuses, ordered from given to new, to a scale of pitch accents, or accent melodies, ordered from given to new as follows: [given] L* L*+H L+H* H* [new]. Table 1 summarizes this original mapping of information statuses to pitch accent melodies, and Figure 4 illustrates the use of this mapping in an example. It obeys the general principle of Pierrehumbert and Hirschberg's work, that low tonality signifies discourse givenness and high tonality signifies discourse newness, but extends this principle beyond its vague definition in terms of predication of mutual beliefs. Instead, the principle is operationalized here in a practically motivated manner that is consistent with and perhaps illuminating of the theory.</Paragraph> <Section position="1" start_page="49" end_page="51" type="sub_section"> <SectionTitle> 4.3 Assigning &quot;Dialogue Prosody&quot; using Dialogue Strategies </SectionTitle> <Paragraph position="0"> As in earlier CTS systems, special logico-semantic relations, such as contrast or correction, are effectively conveyed in MIMIC-CTS by prosodic cues. In MIMIC-CTS, however, these situations are not stipulated in an ad hoc manner, but can be determined to a large degree by MIMIC's dialogue strategy selection process that identifies appropriate dialogue acts to realize a dialogue goal. 
(See footnote 3.) For example, the dialogue act ANSWER may be selected to achieve the dialogue goal of providing an answer to a successful user query, while the dialogue act NOTIFYFAILURE may be performed to achieve the dialogue goal of providing an answer in situations where no movie listing in the database matches the user query. The template associated with the dialogue act NOTIFYFAILURE, when compared with that for ANSWER, contains an additional negative auxiliary associated with the key attribute responsible for the query failure, in an utterance conveying a contradiction in beliefs between the user and system (namely, the presupposition on the part of the user that the query can be satisfied).</Paragraph> <Paragraph position="1"> Theoretical work on intonational interpretation leads us to prosodically mark the negative auxiliary, as well as the associated focus position (Rooth, 1985). We choose to mark the negative auxiliary &quot;not&quot; with the L+H* pitch accent to convey correction, while marking the material in the associated focus position with the L*+H pitch accent to convey (the system's) lack of commitment to the (user's) presupposition at hand. [Footnote 3: Importantly, MIMIC's adaptive dialogue strategy selection algorithm takes into account the outcome of an initiative tracking module that we do not discuss here (see (Chu-Carroll, 2000)).]</Paragraph> <Paragraph position="2"> [Figure 5: User: where is the corruptor playing in cranford. MIMIC: the corruptor is not playing in cranford; the corruptor is playing at lincoln cinemas in arlington. Caption fragment: ... version of the bold-faced reply string, generated by MIMIC-CTS. Note the diacritic &quot;!&quot; denotes a downstepped accent (see (Pierrehumbert, 1980)).]</Paragraph> <Paragraph position="3"> Finally, the NOTIFYFAILURE dialogue act is conveyed by assigning the so-called rise-fall-rise contradiction contour, L*+H L-H%, to the utterance at large (cf. (Hirschberg and Ward, 1991)). An example generated by MIMIC-CTS appears in Figure 5. 
Note that pitch accent types for the remaining attribute values are assigned using the task model, as described in section 4.2. Thus in Figure 5, the movie title is treated as KEY information, marked by the L+H* pitch accent.</Paragraph> <Paragraph position="4"> MIMIC-CTS contains additional prosodic rules for logical connectives, and for clarification and confirmation subdialogues.</Paragraph> </Section> </Section> </Paper>
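<!--
Editorial illustration, kept in a comment so the XML stays well-formed. The rule-based highlighting of section 4.1 (rules 1 and 3) and the status-to-accent scale of section 4.2 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names Slot, annotate_slot and render_reply are hypothetical, the word/ACCENT and | notation is invented for illustration and is not Bell Labs TTS markup, and the pairing of statuses with accents is read off by position from the two given-to-new scales (only KEY with L+H* is confirmed in the text). The statuses assigned in the example are one plausible reading of a "where" query.

```python
from dataclasses import dataclass
from typing import List, Union

# Information statuses paired by position with the accent scale,
# both ordered from given to new. The pairing is an assumption
# based on the order of the two scales in section 4.2.
STATUS_TO_ACCENT = {
    "OLD": "L*",
    "INFERRABLE": "L*+H",
    "KEY": "L+H*",
    "HEARER-NEW": "H*",
}


@dataclass(frozen=True)
class Slot:
    """A typed attribute value in a reply template (hypothetical type)."""
    attribute: str  # attribute name, e.g. "movie"
    value: str      # attribute value, e.g. "october sky"
    status: str     # one of the keys of STATUS_TO_ACCENT


def annotate_slot(slot: Slot) -> str:
    """Accent every word of the value and set it off by phrase boundaries."""
    accent = STATUS_TO_ACCENT[slot.status]
    words = " ".join(f"{word}/{accent}" for word in slot.value.split())
    # "|" stands in for a phrase boundary in this invented notation.
    return f"| {words} |"


def render_reply(parts: List[Union[str, Slot]]) -> str:
    """Fill a template: plain strings pass through, slots get annotated."""
    rendered = [annotate_slot(p) if isinstance(p, Slot) else p for p in parts]
    return " ".join(rendered)


# Reply to "where in hoboken is october sky playing": movie and town are
# treated as KEY, and the requested theater as HEARER-NEW.
reply = render_reply([
    Slot("movie", "october sky", "KEY"),
    "is playing at",
    Slot("theater", "hoboken cinema", "HEARER-NEW"),
    "in",
    Slot("town", "hoboken", "KEY"),
])
print(reply)
```

Keeping the mapping in a single table means a different reading of Table 1 only requires changing STATUS_TO_ACCENT; speaking rate (rule 2) and attribute-name accenting (rule 4) are left out of the sketch.
-->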