<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0418"> <Title>Matchmaking: dialogue modelling and speech generation meet*</Title> <Section position="3" start_page="0" end_page="172" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Task-independent computational dialogue modelling (see e.g., \[5, 10, 35\]) seldom makes contact with natural language generation (exceptions being, e.g., \[7, 9, 22\]), and much less so with speech generation/synthesis.</Paragraph> <Paragraph position="1"> Conversely, speech synthesis, being predominantly concerned with rendering text to speech, rarely considers actual full-scale generation.</Paragraph> <Paragraph position="2"> In this article we introduce an approach under development in a joint collaborative project between the Technical Universities of Darmstadt and Budapest ('SPEAK!') that combines the dialogue modelling paradigm with NL generation and speech synthesis in an information retrieval system. The novelty of the approach pursued lies in the move away from text-to-speech and concept-to-speech generation towards communicative-context-to-speech generation (see Section 2) and in the integration of dialogue representation, NL generation, and speech synthesis. Our principal concern is the selection of appropriate intonation. (* Authors appear in alphabetical order. This work was partially funded by Copernicus, Project No. 10393 ('SPEAK!').) More specifically, from our representation of communicative</Paragraph> <Paragraph position="3"> context, we derive constraints on interpersonal meaning, which are then expressed through intonation contour (or tone contour, or simply tone).</Paragraph> <Paragraph position="4"> We have taken two existing systems, the COR dialogue model (\[31\]) and the KOMET-PENMAN multilingual text generator \[33\], to build the backbone of an integrated dialogue-based interface to an information system. 
The linguistic generation resources of German have been enhanced by a systemic-functionally motivated \[14, 15, 24\] grammar of speech that includes knowledge about intonational patterns \[12, 34\]. Section 3 presents our dialogue model and the intonational resources.</Paragraph> <Paragraph position="5"> In Section 4 we first apply a bottom-up approach: we determine the kinds of knowledge the generator needs in order to make intonational choices, and based on this we develop a stratified model with three strata: grammar, semantics, and extra-linguistic context. Second, we apply a top-down approach: we determine how this knowledge can be obtained from the dialogue model and dialogue history, i.e., from the extra-linguistic context, and thereby verify the applicability of our overall model. Section 5 concludes the paper with a summary and a number of questions that have been left untouched.</Paragraph> <Paragraph position="6"> 2 State of the art in speech generation
In this section we give a survey of existing speech generation systems for German, arguing that their syntax-based approach does not suffice to generate &quot;natural&quot; speech in dialogue systems.</Paragraph> <Paragraph position="7"> In information-seeking dialogues that use spoken language for interaction, intonation is often the only means to distinguish between different dialogue acts, which makes the selection of the appropriate intonation crucial to the success of the information-seeking process (see e.g., \[26\] for English). To illustrate this point, imagine an information-seeking dialogue in which the user wants to know a specific train connection. At some point in the interaction, the system produces a sentence like Sie fahren um drei Uhr von Darmstadt nach Heidelberg (&quot;You travel at three o'clock from Darmstadt to Heidelberg.&quot;). There are several interpretations of this utterance, the most obvious being that the system presents some kind of information to the hearer. 
However, the same sentence, uttered with a different intonation, could be part of a clarification dialogue, in which the system wants to make sure that it got the user's request right. In this case, the user would be expected to react, i.e., either to confirm or to reject this statement. Only by means of intonation can the user interpret the system's expectation correctly and react accordingly.</Paragraph> <Paragraph position="8"> Even though current speech synthesizers can support sophisticated variation of intonation, no existing text-to-speech or concept-to-speech system for German provides the semantic or pragmatic guidance necessary for selecting intonation appropriately. The major shortcoming is that traditional text-to-speech systems (e.g., \[16, 18, 23\]) and concept-to-speech systems \[6\] alike use purely syntactic information to control prosodic features. Moreover, with text-to-speech systems, where the syntactic structure has to be reconstructed from the written text by means of syntactic analysis, the resulting data is seldom complete or unambiguous. Concept-to-speech systems avoid the latter problem by generating spoken output from a pre-linguistic conceptual structure. Yet, most current implementations of the concept-to-speech approach use the conceptual representation only to avoid syntactic ambiguities, with the assignment of intonational features still based on the written text (see \[6\]).</Paragraph> <Paragraph position="9"> A common shortcoming of all these systems is that they are often too expressive, in that too many words are stressed, mainly due to the lack of discourse information, for instance on the focus domain or the given/new distinction. 
A number of discourse-model-based speech generation systems have been proposed that address exactly this problem, for example NewSpeak \[17, 26\].</Paragraph> <Paragraph position="10"> However, the problem with these systems is that they still start from a given text, and are hence restricted to those kinds of discourse information that can be reconstructed from that text. Moreover, since they assume a one-to-one mapping between syntactic structure and intonational features, they cannot account for those phenomena, frequent in our domain, in which the same syntactic structure can be realized with differing intonations (see the example above).</Paragraph> <Paragraph position="11"> Assuming that intonation is more than a mere reflection of the surface linguistic form (see \[14, 30, 19, 24\]), and further, that intonation is selected to express particular communicative goals and intentions, effective control of intonation requires synthesizing from meanings rather than from word sequences, as the systems discussed above do.</Paragraph> <Paragraph position="12"> This fact is acknowledged by \[1\], whose SYNPHONICS system is based on the assumption that prosodic features have a function independent of syntax. The authors of \[1\] replace the idea of syntax-dependent prosody, which is implicit in all the approaches discussed so far, with the notion of the linguistic function of prosodic features, including intonation. Thus, this approach allows prosodic features to be controlled by various factors other than syntax, e.g., by information structure such as focus-background or topic-comment structure.</Paragraph> <Paragraph position="13"> However, the function of intonation is still restricted to what is called the grammatical function, more specifically the textual function of intonation, without considering aspects like communicative goals and the speaker's attitude, i.e., the interpersonal function of intonation (\[14\]). 
Yet, in the context of generating speech in information-seeking dialogues, where intonational features are often the only means to signal a dialogue act, these aspects have to be taken into account.</Paragraph> <Paragraph position="14"> Furthermore, in a dialogue situation such as the one given in our approach, it is not sufficient to look at isolated sentences; instead, one has to look at each utterance in its context, as part of a larger interaction. Intonation is not only used to mark sentence-internal information structure; it can additionally be employed in the management of the communicative demands of the interaction partners.</Paragraph> <Paragraph position="15"> Therefore, we also have to consider the function of intonation with respect to the whole conversational interaction, taking into account the discourse (dialogue) history (see also \[7\]). Intonation as the realization of interactional features thus draws on the discourse and user models as the source of constraints.</Paragraph> <Paragraph position="16"> Only an approach to speech generation that starts from the communicative context and maps it to intonational features provides the intonational control needed in dialogue systems to produce speech that human hearers would find acceptable.</Paragraph> </Section> </Paper>