<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1421"> <Title>Towards Multilingual Protocol Generation For Spontaneous Speech Dialogues*</Title> <Section position="3" start_page="198" end_page="198" type="metho"> <SectionTitle> 2 An overview of VERBMOBIL </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="198" end_page="198" type="sub_section"> <SectionTitle> 2.1 Some terminology </SectionTitle> <Paragraph position="0"> A turn is a contribution of one dialogue participant. It may be divided into segments, which sometimes resemble linguistic units such as clauses or sentences. The basic processing entity for some components is the so-called dialogue act \[Bunt 1981, Jekat et al. 1995, Alexandersson et al. 1997\]. For this work we use a set of 18 dialogue acts, some purely illocutionary, e.g., requesting a proposal for a date.</Paragraph> <Paragraph position="1"> Some comprise propositional content, e.g., proposing a date. An important property of the dialogue act is its language independence: it should be usable for the annotation of dialogues in any language. Linguistic information is encoded in an abstract data type, the so-called VIT</Paragraph> </Section> </Section> <Section position="4" start_page="198" end_page="199" type="metho"> <SectionTitle> (VERBMOBIL Interface Term, \[Bos et al. 1996, Dorna 1996\]). A VIT is a semantic representation </SectionTitle> <Paragraph position="0"> formalism following the Discourse Representation Theory (DRT) of \[Kamp and Reyle 1993\]. A VIT consists of a set of semantic conditions (i.e. 
predicates, roles, operators and quantifiers) and allows for underspecification with respect to scope and subordination, or inherent underspecification.</Paragraph> <Paragraph position="1"> Each discourse individual is formally represented by a discourse referent (also called instance).</Paragraph> <Paragraph position="2"> Information about the individual is encoded by one (or more) VIT-condition(s), combining a predicate with the discourse referent (see section 4).</Paragraph> <Paragraph position="3"> Propositional information (currently time expressions) is encoded in a knowledge representation language \[Küssner and Stede 1995\]. It is a fairly surface-oriented language, but contains some interlingua-like expressions for, e.g., month-of-year (moy), weekdays (dow), time-of-day (tod) and part-of-day (pod). Figure 1 gives some examples of so-called &quot;tempex-expressions&quot;.</Paragraph> <Paragraph position="4"> The first of February: \[moy:2, dom:1\]. Two o'clock on Thursday: \[tod:14:00, dow:thu\]. Tomorrow between 8 in the morning and 14 hours: \[dow:tomorrow, boundaries(\[tod:08:00\],</Paragraph> <Paragraph position="6"/> <Section position="1" start_page="198" end_page="199" type="sub_section"> <SectionTitle> 2.2 The VERBMOBIL system </SectionTitle> <Paragraph position="0"> The VERBMOBIL system is a flexible, speaker-independent speech-to-speech translation system for spontaneous speech. To support the robustness of the overall system, the translation process is subdivided into several processing tracks. The most accurate translation track is a deep linguistic analysis in combination with semantic transfer and syntactic generation (see figure 2). When this track fails, the translation is performed by other, shallow translation components. In this paper we consider only the dialogue-act-based analysis and transfer \[Block 1997\]. 
Effects of spontaneous speech, such as hesitations and corrected or revised utterance parts, do not provide any translation-relevant, and thereby protocol-relevant, information. They have to be recognized and filtered out during the analysis and/or translation process. Figure 2 sketches some of the linguistic components. The translation process consists of two tracks, the deep and the shallow translation. The deep linguistic translation track, whose modules all exchange linguistic information encoded in VITs, consists of three components: an HPSG parser combined with a robust semantic component, the semantic-based transfer component \[Dorna and Emele 1996\] and the generation component \[Becker et al. 1998\], an efficient multi-lingual generator (some more details below).</Paragraph> <Paragraph position="1"> The shallow track bases its translation on dialogue acts and the propositional content (currently time expressions). Based on the input string, the dialogue act is determined, which in</Paragraph> <Paragraph position="3"> combination with the propositional content is transferred using fixed templates. The DIAKON component exchanges contextual information with 15 different components. It consists of two subcomponents, the Context Disambiguation component and the Dialogue component. The former is responsible for, e.g., the disambiguation of semantic predicates and the computation of time expressions. The latter, following a hybrid approach (more details in the next section), supports, for instance, the analysis component with top-down predictions of which dialogue act is to come next.</Paragraph> </Section> </Section> <Section position="5" start_page="199" end_page="203" type="metho"> <SectionTitle> 3 Requirements for the protocol generation </SectionTitle> <Paragraph position="0"> The protocol may contain original system translations as well as paraphrases of the user utterances, depending on the translation system's internal information about them. 
Additionally, thematically irrelevant parts of the dialogue, i.e. dialogue contributions not relating to the communicative goal, should not be reflected in the protocol. Moreover, some information is condensed in the protocol structure (in comparison to the original dialogue contributions) or removed from the protocol structure following different criteria of &quot;protocol relevance&quot;. Examples of the latter are utterances annotated with dialogue acts such as FEEDBACK_* or DELIBERATE_*, which can be removed under most circumstances. We pay particular attention to this, since the removal of utterances and turns must not threaten the correctness of the protocol (e.g. the dialogue must still reflect the actual negotiation). Parts of the dialogue, e.g., clarification sub-dialogues, can under some circumstances be left out of the protocol. Stereotypical dialogue phases, like the greeting phase, are reflected in the protocol by meta comments. This procedure has the following advantages: that the system has &quot;understood&quot; the user contributions.</Paragraph> <Paragraph position="1"> * Furthermore, by utilizing different sources of information, e.g., deep and shallow processing, we can always generate a protocol. The planning procedure of a protocol formulation depends on whether the original turn segment was given in the output language of the protocol or not. We have to keep in mind that the dialogue partners speak different languages, which means that about half of the protocol has to be reconstructed (i.e. condensed and paraphrased) from utterances in the output language of the protocol. The other half of the protocol formulations may consist of utterances which were translated into the output language of the protocol by the system. 
Consequently, the planning procedure operates as follows:</Paragraph> <Paragraph position="3"> System Translations If the system translated the user utterance into the output language of the protocol, this translation is chosen as the protocol formulation. In a system translation all irrelevant effects of spontaneous speech (e.g., hesitations, corrected or revised utterance parts) are already removed, so that the selected protocol formulation is reduced to thematically relevant information.</Paragraph> <Paragraph position="4"> Deep VIT-representation In the opposite case (i.e. the user utterance was spoken in the output language of the protocol) the original VIT-representation of the user input produced by the deep analysis component (if such a VIT exists) is used. The phrase is then produced by re-generating from the VIT. Again, effects of spontaneity are removed in such a VIT.</Paragraph> <Paragraph position="5"> Tempex-VIT In all other cases (i.e. there is no deep analysis of the user utterance) we use the language-independent tempex mechanism, whose handling with respect to the planning of protocol formulations is described below.</Paragraph> <Paragraph position="6"> Up to this point only the third type has been implemented.</Paragraph> <Paragraph position="7"> For the extraction of protocol-relevant data, we utilize a part of the DIAKON module, namely the dialogue module \[Alexandersson, Reithinger, and Maier 1997\], a hybrid component consisting of a dialogue memory, a statistical component, and a plan processor. Its processing is centered around dialogue acts \[Jekat et al. 1995\] - it is assumed that every utterance can be attributed one (or more) dialogue acts. An important part of the dialogue module is the sequence memory, which contains data structures for turns and segments (see figure 3). Each turn keeps information like speaker and source language. 
In each segment, information like the dialogue act, thematic information and VITs for different languages is stored.</Paragraph> <Section position="1" start_page="201" end_page="201" type="sub_section"> <SectionTitle> 4.1 The Plan Processor </SectionTitle> <Paragraph position="0"> We use the plan processor \[Alexandersson and Reithinger 1997\] for the selection of protocol-relevant data. It incrementally traverses the sequence memory, building a structure which we call the intentional structure. It is a tree-like structure mirroring different abstractions of the dialogue, like segment (dialogue act), turn (turn class), greeting phase. Figure 3 shows a sketch of the intentional structure: it divides into 4 distinct levels, where the top-most spans the whole dialogue and the next distinguishes the dialogue phases greeting, negotiation and closing. The third level connects segments within a turn and distinguishes its turn class. Finally, the fourth implements (with some minor extensions) the dialogue act hierarchy. The leaves correspond to the utterance(s) of a turn. For each translation track, one instance of the processor is used, but the figure shows just one. The plan processor uses a set of plan operators (currently about 175) (see figures 4 and 5).</Paragraph> <Paragraph position="1"> Both hand-coded operators and operators automatically derived from the VERBMOBIL corpus are used to build its structure. The operators are incrementally expanded left to right using a mixed top-down and bottom-up strategy. A plan operator can be attributed with constraints and actions.</Paragraph> <Paragraph position="2"> The constraints are used to check the relevance of a certain operator in a certain context, whereas the actions mostly affect the context. 
Examples of the latter are: marking an utterance as being protocol relevant or not, or setting the dialogue phase.</Paragraph> </Section> <Section position="2" start_page="201" end_page="202" type="sub_section"> <SectionTitle> 4.2 Central Contents of a Turn </SectionTitle> <Paragraph position="0"> By determining the central contents we remove unimportant segments and merge segments in such a way that the intended meaning of the turn is preserved. The plan processor performs this operation on two levels: based on segments and based on turns. By looking at, for instance, the dialogue act, it can be determined whether the segment can be removed or not. More abstract plan operators are responsible for removing whole turns. An example of the latter is clarification sub-dialogues. This, however, turns out to be very difficult when scaling up, due to irregularities in the dialogues and, more problematically, recognition errors. In the current implementation, the plan processor performs two operations on the context: (i) marking a segment as (not) relevant for the protocol, and (ii) merging two or more segments into one.</Paragraph> <Paragraph position="1"> Figure 4 shows a plan operator designed for processing an utterance which has been annotated with the dialogue act FEEDBACK_ACKNOWLEDGMENT. The operator is designed to be applied when the utterance is the first one in the turn, and it states that, unless this utterance is the only one in the turn, it is marked as not relevant for the protocol. Another example of actions is shown in figure 5. This operator is used for merging two successive utterances of type ACCEPT_DATE into one. This is done when not more than one of the utterances contains propositional contents. To clarify the concept of &quot;central contents&quot; further, consider the following turn taken from our corpus (figure 6). 
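As a rough illustration only, the two context operations described above (marking a turn-initial FEEDBACK_ACKNOWLEDGMENT as not relevant, and merging successive ACCEPT_DATE segments) might be sketched in Python as follows. The Segment class, its fields and both function names are hypothetical; they are not the plan operators of figures 4 and 5, and the example turn is invented for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """Hypothetical stand-in for a segment entry of the sequence memory."""
    dialogue_act: str
    tempex: Optional[str] = None  # propositional content, if any
    relevant: bool = True         # protocol relevance flag


def mark_initial_feedback(turn: List[Segment]) -> List[Segment]:
    """Mark a turn-initial FEEDBACK_ACKNOWLEDGMENT as not relevant,
    unless it is the only segment of the turn."""
    if len(turn) > 1 and turn[0].dialogue_act == "FEEDBACK_ACKNOWLEDGMENT":
        turn[0].relevant = False
    return turn


def merge_accepts(turn: List[Segment]) -> List[Segment]:
    """Merge successive ACCEPT_DATE segments into one when at most
    one of the two carries propositional content."""
    merged: List[Segment] = []
    for seg in turn:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.dialogue_act == "ACCEPT_DATE"
                and seg.dialogue_act == "ACCEPT_DATE"
                and (prev.tempex is None or seg.tempex is None)):
            merged[-1] = Segment("ACCEPT_DATE", prev.tempex or seg.tempex,
                                 prev.relevant or seg.relevant)
        else:
            merged.append(seg)
    return merged


# Invented example turn: a feedback followed by two acceptances.
turn = [Segment("FEEDBACK_ACKNOWLEDGMENT"),
        Segment("ACCEPT_DATE", tempex="[tod:09:00]"),
        Segment("ACCEPT_DATE")]
result = merge_accepts(mark_initial_feedback(turn))
print([(s.dialogue_act, s.tempex, s.relevant) for s in result])
```

The sketch keeps the two operations as separate functions so that relevance marking and merging can be applied independently; the actual plan processor expresses them as constraints and actions on plan operators.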
If we use operators similar to the one in figure 4 for processing the second and third segment - as pointed out before, deliberations and feedbacks can under most circumstances be removed</Paragraph> <Paragraph position="3"> - the protocol relevance of the segments is affected: The first three segments are considered as not relevant for the protocol at all, whereas the last two following segments (ACCEPT) are merged into one segment. For the generation of a progress protocol, just one segment would thus be considered, namely one expressing an acceptance of nine o'clock. The result of the generation does not necessarily contain the spoken words, but mirrors the illocution behind the utterance. Figure 6, transcription: MAW004: &lt;P&gt; ja , (FEEDBACK_ACKNOWLEDGMENT) (Ok) da mu&quot;s ich mal eben kucken . (DELIBERATE_EXPLICIT) (I have to look) +/tier/+ das ist ein&lt;Z&gt; &lt;A&gt; Samstag (DELIBERATE_IMPLICIT) (That's a Saturday) das ist bei mir kein Problem &lt;t&gt; . (ACCEPT_DATE) (That's no problem for me)</Paragraph> </Section> <Section position="3" start_page="202" end_page="203" type="sub_section"> <SectionTitle> 4.3 Turning a segment into a VIT </SectionTitle> <Paragraph position="0"> We want to prepare the input for the surface generator in the format that it is already able to cope with in translation mode: VITs. Furthermore, we prefer a format that is usable not just for one language, but for all three.</Paragraph> <Paragraph position="1"> If we construct, e.g., German VITs for protocol formulations, we can utilize the transfer component to transfer the VITs into any other VERBMOBIL language, and then make use of the surface generator as it is. 
This task is split into two steps, where the first consists of generating partial language-independent VITs (henceforth tempex-VITs) on the basis of the information in the selected segments (see below). The second step involves enriching the partial VITs with language-dependent information, e.g., verb information (see section 5). Figure 7: VIT-semantics of a time expression.</Paragraph> <Paragraph position="2"> Additionally, the plan processor draws more global inferences, such as which dialogue phase a segment is part of. It also determines the turn class, information which we utilize while preparing the VITs, for instance when determining the sentence mood for a segment. For the utterance &quot;How about at 2 o'clock&quot; the corresponding information could be the dialogue act SUGGEST_SUPPORT_DATE and the tempex expression \[tod:02:00\]. A graphical representation of the partial VIT for this segment is shown in figure 7.</Paragraph> <Paragraph position="3"> The index/3 is the entry point from where the VIT can be traversed. From the entry point we can reach the first grouping (15). It points at the representation of the instance i2, which contains the representation for 2 o'clock (ctime). The udef predicate stands for indefiniteness, with scope over ctime (via grouping 13). Finally, temp_loc is underspecified between a temporal and a locational reading; this underspecification is resolved in the second step or by the surface generator.</Paragraph> </Section> <Section position="4" start_page="203" end_page="203" type="sub_section"> <SectionTitle> 4.4 Input to the syntactical generator </SectionTitle> <Paragraph position="0"> We can now produce the input structure for the generator. It contains on the one hand general information about the dialogue (which is initially stored in the dialogue memory) and on the other hand an ordered list of protocol-relevant information about the individual segments of the turns. 
The following information is included in the abstract protocol representation: * the date, time and location where the dialogue took place, * the beginning time and the end time of the dialogue, * a detailed list describing the thematic contents of all turns of the dialogue. Each of the turn descriptions includes the information about (i) the speaker, and (ii) a detailed description of the individual segments of the turn, each of which contains the following information: dialogue phase, dialogue act, sentence mood, turn class, a deep VIT of the original user utterance (if it exists), the original system translation, and the tempex-VIT as described above.</Paragraph> </Section> </Section> <Section position="6" start_page="203" end_page="205" type="metho"> <SectionTitle> 5 Surface Generation of the Protocol </SectionTitle> <Paragraph position="0"> The general parts of the representation produced by the DIAKON module can be handed over to the protocol more or less directly, while the information about the individual turns and their segments is abstract and therefore has to serve as the source for planning an appropriate protocol formulation. In order to keep the original ordering of turns and their segments intact, our module generates one protocol formulation for each segment. The planning of an appropriate protocol formulation for one segment has to consider different cases, which require different planning procedures to come up with a formulation.</Paragraph> <Section position="1" start_page="203" end_page="205" type="sub_section"> <SectionTitle> 5.1 From Tempex-VIT to Protocol Representations </SectionTitle> <Paragraph position="0"> The processing proceeds on the semantic level, based on the VIT formalism, but now switches to language-specific operations. 
We defined a structured set of VIT-patterns whose task is to insert a verb and its obligatory verb arguments into the tempex-VIT (which only contains time expressions) in order to come up with a complete sentence. This final VIT-representation will then be handed over to the generation module of the VERBMOBIL system for verbalization. The core planning step consists of the selection and application of an appropriate VIT-pattern for a segment, which is determined by the following three main criteria: * The first criterion to find an applicable pattern is the dialogue act of the segment, which means that for all 18 possible dialogue acts there is a structured list of applicable patterns.</Paragraph> <Paragraph position="1"> * The second criterion is the sentence mood of the user utterance.</Paragraph> <Paragraph position="2"> * The third one is the question whether the time expression contained in the tempex-VIT can be assigned directly to a verb argument position or not. This distinction is important because it requires different semantic handling of the tempex-VIT (e.g., the subject argument of the main verb).</Paragraph> <Paragraph position="4"> necessarily the case - as in figure 6, an ACCEPT can contain a temporal expression or not.) and :abstr-tempex checks the mentioned distinction whether the time expression of a tempex-VIT can be assigned directly to a verb argument position or not. Sentence moods above are quest (question) and decl (declarative). The keyword VALUE indicates, e.g., (verb+ppron-subj+tempex-pp freihaben). This is a pattern function which will be applied to the tempex-VIT. In this case the verb freihaben and a personal pronoun in subject position (ppron-subj) will be added to the tempex-VIT, which will presumably be realized as a prepositional phrase (tempex-pp). 
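A minimal sketch of the selection step described above, written in Python under stated assumptions: the table keys, the tuple encoding of VIT conditions, the referent names i1 and i3, the value chosen for the abstr-tempex criterion and all function names are hypothetical, and the tempex-VIT is simplified. It only illustrates how the three criteria could jointly pick a pattern function that adds a verb and a pronoun subject; it is not the VERBMOBIL pattern language.

```python
from typing import Callable, Dict, List, Optional, Tuple

Condition = Tuple[str, ...]   # hypothetical encoding of one VIT condition
Vit = List[Condition]


def verb_ppron_subj_tempex_pp(tempex_vit: Vit, verb: str) -> Vit:
    """Return a copy of the tempex-VIT extended by a main verb and a
    personal pronoun in subject position (illustrative referents)."""
    return list(tempex_vit) + [(verb, "i1"), ("arg1", "i1", "i3"), ("pron", "i3")]


# Hypothetical table:
# (dialogue act, sentence mood, abstr-tempex) -> (pattern function, verb)
PATTERNS: Dict[Tuple[str, str, bool], Tuple[Callable[[Vit, str], Vit], str]] = {
    ("SUGGEST_SUPPORT_DATE", "quest", False): (verb_ppron_subj_tempex_pp, "freihaben"),
}


def apply_pattern(dialogue_act: str, mood: str, abstr_tempex: bool,
                  tempex_vit: Vit) -> Optional[Vit]:
    """Select a pattern by the three criteria and apply it.
    Returns None when no pattern is registered for the combination."""
    entry = PATTERNS.get((dialogue_act, mood, abstr_tempex))
    if entry is None:
        return None
    pattern, verb = entry
    return pattern(tempex_vit, verb)


# Simplified, invented tempex-VIT for an elliptical time question.
tempex_vit = [("temp_loc", "i1", "i2"), ("udef", "i2")]
german_vit = apply_pattern("SUGGEST_SUPPORT_DATE", "quest", False, tempex_vit)
print(german_vit)
```

A plain dictionary lookup is used here because the text describes selection by three discrete criteria; a real implementation would also need a strategy for choosing among several applicable patterns.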
The application of verb+ppron-subj+tempex-pp to the example VIT in figure 7 (which is a (language-independent) representation for &quot;at 2 o'clock?&quot; as an elliptical question) results in the language-specific (German) VIT given in figure 9.</Paragraph> <Paragraph position="5"> The semantic structure of the time [...] condition temp_loc for discourse referent i1 of the tempex-VIT has been replaced by a condition for the German (temporally intended) preposition um (at). i1 is furthermore extended by the main verb freihaben and a condition for its verb argument in subject position (arg1), which is realized by the personal pronoun (pron) in i3.</Paragraph> <Paragraph position="6"> The verbalization of this VIT will be [...] (Footnote 2: Currently, we are working on heuristics for the choice between multiple applicable patterns.)</Paragraph> </Section> <Section position="2" start_page="205" end_page="205" type="sub_section"> <SectionTitle> 5.2 Syntactic Generation and Protocol Formatting </SectionTitle> <Paragraph position="0"> The VIT-semantics of the protocol formulation is then handed over to the syntactic generator VM-GECO \[Becker et al. 1998\] for verbalization. VM-GECO is a highly efficient multi-lingual generation component which consists of a language-independent kernel syntactic generator and language-specific declarative knowledge sources for syntactic and lexical choices. The last step of our protocol generation module is the formatting of the protocol into an easily understandable and readable format. As the protocol consists of global information about the dialogue itself as well as paraphrased turn segments, we chose a protocol format which allows for a clear distinction between these parts of the protocol. Furthermore, it is important to assign the speaker's name (if it is known) to the protocol formulations of each turn. 
There are three different formatting devices.</Paragraph> <Paragraph position="1"> The most prominent one is the production of an HTML format of the protocol. Additionally, LaTeX and ASCII versions are available.</Paragraph> <Paragraph position="2"> Figure 10 shows an example dialogue and the respective system translations. Figure 11 shows the HTML-format of the protocol 3 of this dialogue.</Paragraph> <Paragraph position="3"> Obviously, some of the user utterances are not correctly understood and translated by the system, which is reflected in the protocol. However, with respect to the available data the protocol is correct. All protocol formulations have been generated based on the tempex mechanism. The protocol consists of three major parts. First there is a title (VERBMOBIL VERLAUFSPROTOKOLL NR.</Paragraph> </Section> </Section> <Section position="7" start_page="205" end_page="205" type="metho"> <SectionTitle> 3 - VERBMOBIL PROGRESS PROTOCOL NO. </SectionTitle> <Paragraph position="0"> 3), followed by general information about the dialogue: date (Datum) and time (Uhrzeit).</Paragraph> <Paragraph position="1"> The main content (GESPRÄCHSVERLAUF -</Paragraph> </Section> <Section position="8" start_page="205" end_page="205" type="metho"> <SectionTitle> PROGRESS OF THE DIALOGUE) are the </SectionTitle> <Paragraph position="0"> individual turns which consist of the paraphrased segments. The individual dialogue acts of the segments are noted for debugging purposes.</Paragraph> </Section> </Paper>