<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0216"> <Title>Multi-tasking and Collaborative Activities in Dialogue Systems</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The WITAS Dialogue System </SectionTitle> <Paragraph position="0"> In our current application, the autonomous system is the WITAS2 UAV ('unmanned aerial vehicle') a small robotic helicopter with on-board planning and deliberative systems, and vision capabilities (for details see e.g. (Doherty et al., 2000)). This robot helicopter will ultimately be controlled by the dialogue system developed at CSLI, though at the moment we interact with a simulated3 UAV. Mission goals are provided by a human operator, and an on-board planning system then responds. While the helicopter is airborne, an on-board active vision system interprets the scene or focus below to interpret ongoing events, which may be reported (via NL generation) to the operator (see Section 6). The robot can carry out various &quot;activities&quot; such as flying to a location, or following a vehicle, or landing. These activities are specified by the user during dialogue, or can be initiated by the UAV's on-board AI. In any case, a major component of the dialogue, and a way of maintaining its coherence, is tracking the state of current or planned activities of the device.</Paragraph> <Paragraph position="1"> A more interesting and problematic notion is that of &quot;joint-activities&quot; between an autonomous system and a human operator. These are activities which the autonomous system cannot complete alone, but which require some human intervention. In our current scenarios, the UAV's vision system is not good enough to determine whether a particular vehicle is the one sought-after, and only the human operator has the authority to determine this, so that human and robot must collaborate in order to find and track a vehicle. The dialogue in Figure 2 shows how a typical interaction works4 (other capabilities, such as clarification subdialogues, are covered in (Lemon et al., 2001)). Note here that the user is able to make explicit queries about the robot's activities (both current and future), that there are concurrent activities, and that conversational initiative centers around the joint activities currently being specified and executed. null</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Activity Models </SectionTitle> <Paragraph position="0"> The idea of Activity Modelling in our system is the vision that dialogue systems can, in generality, be built for 'devices' which carry out certain well-</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Multi-modal Utterances Dialogue Move </SectionTitle> <Paragraph position="0"> Operator (O): Our job is to look for a red car Command (Joint Activity) UAV (U): Ok. I am looking for one. Report (Confirm Activity) O: Fly here please [+click on map] Command (Deictic) U: Okay. I will fly to waypoint one Report (Confirm Activity) U: Now taking off and flying there. Report (Current Activity) O: Stop that. Go to the tower instead. Command, Revision U: I have cancelled flying to waypoint one. I will fly to the tower.</Paragraph> <Paragraph position="1"> Report (Activity status) O: What are you doing? Wh-question (Current Activity) U: I am searching for a red car and flying to the tower Answer (Current Activity) O: What will you do next? 
Wh-question (Planned Activity) U: I have nothing planned. Answer(Planned Activity) U: I see a red car on main street [display on map, show video images], Is this the right car? Report, Yn-question (Activity) O: Yes, that's the right car Yn-answer (Positive) U: Okay. I am following it . Report (Current activity) defined activities (e.g. switch lights on, record on channel a0 , send email a1 to a2 , search for vehicle a3 ), and that an important part of the dialogue context to be modelled in such a system is the device's planned activities, current activities, and their execution status5. We choose to focus on building this class of dialogue systems because we share with (Allen et al., 2001), a version of the the Practical Dialogue 5Compare this with the motivation behind the &quot;Pragmatic Adapter&quot; idea of (LuperFoy et al., 1998).</Paragraph> <Paragraph position="2"> Hypothesis: &quot;The conversational competence required for practical dialogues, although still complex, is significantly simpler to achieve than general human conversational competence.&quot; null We also share with (Rich et al., 2001) the idea that declarative descriptions of the goal decomposition of activities (COLLAGEN's &quot;recipes&quot;, our &quot;Activity Models&quot;) are a vital layer of representation, between a dialogue system and the device with which it interacts.</Paragraph> <Paragraph position="3"> In general we assume that a device is capable of performing some &quot;atomic&quot; activities or actions (possibly simultaneously), which are the lowest-level actions that it can perform. Some devices will only know how to carry out sequences of atomic activities, in which case it is the dialogue system's job to decompose linguistically specified high-level activities (e.g. &quot;record the film on channel 4 tonight&quot;) into a sequence of appropriate atomic actions for the device. In this case the dialogue system is provided with a declarative &quot;Activities Model&quot; (see e.g. Figure 3) for the device which states how high-level linguistically-specified activities can be decomposed into sequences of atomic actions. This model contains traditional planning constraints such as preconditions and postconditions of actions. In this way, a relatively &quot;stupid&quot; device (i.e. with little or no planning capabilities) can be made into a more intelligent device when it is dialogue-enabled.</Paragraph> <Paragraph position="4"> At the other end of the spectrum, more intelligent devices are able to plan their own sequences of atomic actions, based on some higher level input. In this case, it is the dialogue system's role to translate natural language into constraints (including temporal constraints) that the device's planner recognizes. The device itself then carries out planning, and informs the dialogue manager of the sequence of activities that it proposes. Dialogue can then be used to re-specify constraints, revise activities, and monitor the progress of tasks. We propose that the process of decomposing a linguistically specified command (e.g. 
&quot;vacuum in the main bedroom and the lounge, and before that, the hall&quot;) into an appropriate sequence of constraints for the device's on-board planner is an aspect of &quot;conversational intelligence&quot; that can be added to devices by dialogue-enabling them.</Paragraph> <Paragraph position="5"> We are developing a single representation and reasoning scheme to cover this spectrum of cases, from devices with no planning capabilities to devices with more impressive on-board AI. Both the dialogue manager and the robot/device have access to a single &quot;Activity Tree&quot;, which is a shared representation of current and planned activities and their execution status, involving temporal and hierarchical ordering (in fact, one can think of the Activity Tree as a Hierarchical Task Network for the device). This tree is built top-down by processing verbal input from the user, and its nodes are then expanded by the device's planner (if it has one). In cases where no planner exists, the dialogue manager itself expands the whole tree (via the Activity Model for the device) until only leaves with atomic actions are left for the device to execute in sequence. The device reports completion of the activities that it is performing and any errors that occur for an activity.</Paragraph> <Paragraph position="6"> Note that because the device and the dialogue system share the same representation of the device's activities, they are always properly coordinated. They also share responsibility for different aspects of constructing and managing the whole Activity Tree.</Paragraph> <Paragraph position="7"> Note also that some activities can themselves be speech acts, and that this allows us to build collaborative dialogue into the system. For example, in Figure 3 the ASK-COMPLETE activity is a speech act, generating a yes-no question to be answered by the user.</Paragraph> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 An example Activity Model </SectionTitle> <Paragraph position="0"> An example LOCATE activity model for the UAV is shown in Figure 3. It is used when constructing parts of the Activity Tree involving commands such as &quot;search for&quot;, &quot;look for&quot;, and so on. For instance, if the user says &quot;We're looking for a truck&quot;, that utterance is parsed into a logical form involving the structure (locate, np[det(a),truck]).</Paragraph> <Paragraph position="1"> The dialogue manager then accesses the Activity Model for LOCATE and adds a node to the Activity Tree describing it. The Activity Model specifies what sub-activities should be invoked, under what conditions they should be invoked, and what the postconditions of the activity are. Activity Models are similar to the &quot;recipes&quot; of (Rich et al., 2001).
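As a rough illustration, the sketch below (hypothetical Python with invented names, not the notation of Figure 3 or of the implemented system) shows how such a declarative Activity Model and its expansion into atomic actions might be represented:

LOCATE_MODEL = {
    "resources": ["camera"],                  # other camera-using activities must be suspended
    "preconditions": ["airborne", "fuel_ok", "engine_ok"],
    "skip_if": "locked_on(sought_object)",    # the whole activity can be skipped
    "postconditions": ["locked_on(sought_object)"],
    "subactivities": ["WATCH-FOR", "FOLLOW-OBJ", "ASK-COMPLETE"],   # sequential
}

def expand(activity, models):
    """Expand a high-level activity into a sequence of atomic actions, as the
    dialogue manager does for a device that has no planner of its own."""
    model = models.get(activity)
    if model is None:
        return [activity]                     # already an atomic action
    leaves = []
    for sub in model["subactivities"]:
        leaves.extend(expand(sub, models))
    return leaves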
For example, in Figure 3 the Activity Model for LOCATE states that: (i) it uses the camera resource (so that any other activity using the camera must be suspended, or a dialogue about the resource conflict must be initiated); (ii) the preconditions of the activity are that the UAV must be airborne, with fuel and engine indicators satisfactory; (iii) the whole activity can be skipped if the UAV is already &quot;locked-on&quot; to the sought object; (iv) the postcondition of the activity is that the UAV is &quot;locked-on&quot; to the sought object; and (v) the activity breaks into three sequential sub-activities: WATCH-FOR, FOLLOW-OBJ, and ASK-COMPLETE.</Paragraph> <Paragraph position="2"> Nodes on the Activity Tree can be in one of five states: active, complete, failed, suspended, or canceled. Any change in the state of a node (typically because of a report from the robot) is placed onto the System Agenda (see Section 5) for possible verbal report to the user, via the message selection and generation module (see Section 6).</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 The Dialogue Context Model </SectionTitle> <Paragraph position="0"> Dialogue management falls into two parts - dialogue modelling (representation) and dialogue control (algorithm). In this section we focus on the representational aspects; Section 5.2 surveys the main algorithms. As a representation of conversational context, the dialogue manager uses a set of data structures which together make up the dialogue Information State, principally the Dialogue Move Tree, the Activity Tree, the System Agenda, the Pending List, and the Salience List. Figure 4 shows how the Dialogue Move Tree relates to the other parts of the dialogue manager as a whole. The solid arrows represent possible update functions, and the dashed arrows represent query functions. For example, the Dialogue Move Tree can update the Salience List, System Agenda, Pending List, and Activity Tree, while the Activity Tree can update only the System Agenda and send execution requests to the robot, and it can query the Activity Model (when adding nodes). Likewise, the Message Generation component queries the System Agenda and the Pending List, and updates the Dialogue Move Tree whenever a synthesized utterance is produced.</Paragraph> <Paragraph position="1"> Figure 5 shows an example Information State logged by the system, displaying the interpretation of the system's utterance &quot;now taking off&quot; as a report about an ongoing &quot;go to the tower&quot; activity (the Pending List and System Agenda are empty, and thus are not shown).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 The Dialogue Move Tree </SectionTitle> <Paragraph position="0"> Dialogue management uses a set of abstract dialogue move classes which are domain independent (e.g. command, activity-query, wh-question, revision, etc.). Any ongoing dialogue constructs a particular Dialogue Move Tree (DMT) representing the current state of the conversation, whose nodes are instances of the dialogue move classes, and which are linked to nodes on the Activity Tree where appropriate, via an activity tag (see below).</Paragraph> <Paragraph position="1"> Incoming logical forms (LFs) from the parsing process are always tagged with a dialogue move (see e.g. (Ginzburg et al., 2001)), which precedes more detailed information about an utterance.
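Schematically, such a dialogue-move-tagged LF can be pictured as a move label wrapping a content term, together with an optional activity tag; the sketch below is purely illustrative (hypothetical Python), and the system's own term notation is shown in the examples that follow:

from dataclasses import dataclass
from typing import Optional

@dataclass
class TaggedLF:
    move: str                            # e.g. "command", "report", "wh-question"
    content: dict                        # the detailed logical form of the utterance
    activity_tag: Optional[str] = None   # link to a node on the Activity Tree, if any

go_to_tower = TaggedLF(
    move="command",
    content={"verb": "go", "destination": {"det": "the", "noun": "tower"}},
)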
For instance the logical form: command([go], [param-list ([pp-loc(to, arg([np(det([def],the), [n(tower,sg)])]))])]) corresponds to the utterance &quot;go to the tower&quot;, which is flagged as a command.</Paragraph> <Paragraph position="2"> A slightly more complex example is: report(inform, agent([np([n(uav,sg)])]), complactivity([command([take-off])])) which corresponds to &quot;I have taken off&quot; - a report from the UAV about a completed 'taking-off' activity.</Paragraph> <Paragraph position="3"> The first problem in dialogue management is to figure out how these incoming &quot;Conversational Moves&quot; relate to the current dialogue context. In other words, what dialogue moves do they constitute, and how do they relate to previous moves in the conversation? In particular, given multi-tasking, to which thread of the conversation does an incoming utterance belong? We use the Dialogue Move Tree to answer these questions: 1. A DMT is a history or &quot;message board&quot; of dialogue contributions, organized by &quot;thread&quot;, based on activities.</Paragraph> <Paragraph position="4"> 2. A DMT classifies which incoming utterances can be interpreted in the current dialogue context, and which cannot be. It thus delimits a space of possible Information State update functions.</Paragraph> <Paragraph position="5"> 3. A DMT has an Active Node List which controls the order in which this function space is searched.</Paragraph> <Paragraph position="6"> 4. A DMT classifies how incoming utterances are to be interpreted in the current dialogue context.</Paragraph> <Paragraph position="7"> In general, then, we can think of the DMT as representing a function space of dialogue Information State update functions. The effects of any particular update function are determined by the node type (e.g. command, question) and the incoming dialogue move type and their contents, as well as the values of Activity Tag and Agent.</Paragraph> <Paragraph position="8"> Note that this notion of &quot;Dialogue Move Tree&quot; is quite different from previous work on dialogue trees, in that the DMT does not represent a &quot;parse&quot; of the dialogue using a dialogue grammar (e.g. (Ahrenberg et al., 1990)), but instead represents all the threads in the dialogue, where a thread is the set of utterances which serve a particular dialogue goal. In the dialogue grammar approach, new dialogue moves are attached to a node on the right frontier of the tree, but in our approach a new move can attach to any thread, no matter where it appears in the tree. This means that the system can flexibly interpret user moves which are not directly related to the current thread (e.g. a user can ignore a system question and give a new command, or ask their own question). Finite-state representations of dialogue games have the restriction that the user is constrained by the dialogue state to follow a particular dialogue path (e.g. state the destination, clarify, state the preferred time, and so on). No such restriction exists with DMTs, where dialogue participants can begin and discontinue threads at any time.</Paragraph> <Paragraph position="9"> We discuss this further below.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Interpretation and State Update </SectionTitle> <Paragraph position="0"> The central algorithm controlling dialogue management has two main steps, Attachment and Process Node:
1. Attachment: Process the incoming conversational move M with respect to the current DMT and Active Node List, and &quot;attach&quot; a new node N interpreting M to the tree if possible.</Paragraph> <Paragraph position="1"> 2. Process Node: Process the new node N, if it exists, with respect to the current Information State, performing an Information State update using the dialogue move type and content of N. When an update function exists, its effects depend on the details of the incoming input M (in particular, on the dialogue move type and the contents of the logical form) and the DMT node to which it attaches. The possible attachments can be thought of as adjacency pairs, and each dialogue move class contains information about which node types it can attach. For instance, the command node type can attach confirmation, yn-question, wh-question, and report nodes.</Paragraph> <Paragraph position="2"> Examples of the different attachments available in our current system can be seen in Figure 6 (where Activity Tags are not specified, attachment does not depend on sharing of Activity Tags). For example, the first entry in the table states that a command node, generated by the user, with activity tag A, is able to attach any system confirmation move with the same activity tag, any system yes-no question with that tag, any system wh-question with that tag, or any system report with that activity tag. Similarly, the rows for wh-question nodes state that a system wh-question with activity tag A can attach a user's wh-answer (if it is a possible answer for that activity), and that a user's wh-question can attach a system wh-answer, with no particular activity needing to be specified. These possible attachments delimit the ways in which dialogue move trees can grow, and thus classify the dialogue structures which can be captured in the current system. As new dialogue move types are added to the system, this table is being extended to cover other conversation types (e.g. tutoring (Clark et al., 2001)).</Paragraph> <Paragraph position="3"> It is worth noting that the node type created after attachment may not be the same as the dialogue move type of the incoming conversational move M. Depending on the particular node which attaches the new input, and the move type of that input, the created node may be of a different type. For example, if a wh-question node attaches an input which is simply a command, the wh-question node may interpret the input as an answer, and attach a wh-answer. These interpretation rules are local to the node to which the input is attached. In this way, the DMT interprets new input in context, and the pragmatics of each new input is contextually determined, rather than completely specified via parsing using conversational move types. Note that Figure 6 does not state what move type a new input is attached as, when it is attached.</Paragraph> <Paragraph position="5"> In the current system, if the user produces an utterance which can attach to several nodes on the DMT, only the &quot;most active&quot; node (as defined by the Active Node List) will attach the incoming move.
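A minimal sketch of this attachment regime is given below (hypothetical Python class and method names, loosely reflecting the first row of Figure 6): the Active Node List is traversed in order, so the most active node whose type and Activity Tag license the incoming move is the one that attaches and interprets it.

class DialogueNode:
    # which move types each node type can attach (cf. the first row of Figure 6)
    ATTACHABLE = {"command": ["confirmation", "yn-question", "wh-question", "report"]}

    def __init__(self, move_type, activity_tag=None):
        self.move_type = move_type
        self.activity_tag = activity_tag
        self.children = []

    def can_attach(self, move):
        allowed = self.ATTACHABLE.get(self.move_type, [])
        tag_ok = self.activity_tag is None or self.activity_tag == move.get("activity_tag")
        return move["move_type"] in allowed and tag_ok

    def interpret(self, move):
        # the created node's type may differ from the incoming move's own type
        child = DialogueNode(move["move_type"], move.get("activity_tag"))
        self.children.append(child)
        return child

def attach(incoming_move, active_node_list):
    for node in active_node_list:             # "most active" node first
        if node.can_attach(incoming_move):
            return node.interpret(incoming_move)
    return None                               # the move cannot be interpreted in this context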
Such cases, in which an utterance could attach to more than one node, would be interesting to explore as triggers for clarification questions in future work.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Message generation </SectionTitle> <Paragraph position="0"> Since the robot is potentially carrying out multiple activities at once, a particular problem is how to generate appropriate utterances about those activities in a way which does not overload the user with information, yet which establishes and maintains appropriate context in a natural way.</Paragraph> <Paragraph position="1"> Generation for dialogue systems in general is problematic in that dialogue contributions arise incrementally, often in response to another participant's utterances. For this reason, generation of large pieces of text is not appropriate, especially since the user is able to interrupt the system. Other differences abound; for example, aggregation rules must be sensitive to the incremental nature of message generation.</Paragraph> <Paragraph position="2"> As well as the general problems of message selection and aggregation in dialogue systems, this particular type of application domain presents specific problems in comparison with, say, travel-planning dialogue systems, e.g. (Seneff et al., 1991). An autonomous device will, in general, need to communicate about its own activities and perceptions as they unfold, not only in response to user queries. For these reasons, the message selection and generation component of such a system needs to have wider coverage and be more flexible than template-based approaches, while remaining in real, or near-real, time (Stent, 1999). As well as this, the system must potentially be able to deal with a large-bandwidth stream of communications from the robot, and so must be able to filter them intelligently for &quot;relevance&quot;, so that the user is not overloaded with unimportant information or repetitious utterances.</Paragraph> <Paragraph position="3"> In general, the system should appear as 'natural' as possible from the user's point of view - using the same language as the user where possible (&quot;echoing&quot;), using anaphoric referring expressions where possible, and aggregating utterances where appropriate.</Paragraph> <Paragraph position="4"> A 'natural' system should also exhibit &quot;variability&quot;, in that it can convey the same content in a variety of ways. A further desirable feature is that the system's generated utterances should be within the coverage of the dialogue system's speech recognizer, so that system-generated utterances effectively prime the user to speak in-grammar.</Paragraph> <Paragraph position="5"> Consequently we attempted to implement the following features in message selection and generation: relevance filtering; recency filtering; echoing; variability; aggregation; symmetry; real-time generation. Our general method is to take as inputs to the process various communicative goals of the system, expressed as logical forms, and use them to construct a single new logical form to be input to Gemini's Semantic Head-Driven Generation algorithm (Shieber et al., 1990), which produces strings for Festival speech synthesis.
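As a rough end-to-end illustration of this method (all names below are hypothetical placeholder stubs; in the implemented system the final steps are Gemini's Semantic Head-Driven Generation and Festival synthesis, not these functions):

def select(goals):
    # message selection: warnings always pass; other goals must be marked relevant (Section 6.1)
    return [g for g in goals if g.get("priority") == "warn" or g.get("relevant", False)]

def aggregate(goals):
    # stand-in for incremental aggregation over logical forms (Section 6.2)
    return {"move": "report", "items": [g["content"] for g in goals]}

def realize(lf):
    # stand-in for grammar-based generation plus speech synthesis
    return "; ".join(lf["items"])

def generate_turn(system_agenda):
    selected = select(system_agenda)
    if not selected:
        return None
    return realize(aggregate(selected))

print(generate_turn([{"priority": "inform", "relevant": True, "content": "I am flying to the tower"}]))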
We now describe how we use this complex dialogue context to produce natural generation in multi-tasking contexts.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Message selection - filtering </SectionTitle> <Paragraph position="0"> Inputs to the selection and generation module are &quot;concept&quot; logical forms (LFs) describing the communicative goals of the system. These are structures consisting of context tags (e.g. activity identifier, dialogue move tree node, turn tag) and a content logical form consisting of a Dialogue Move (e.g. report, wh-question), a priority tag (e.g. warn or inform), and some additional content tags (e.g. for objects referred to).</Paragraph> <Paragraph position="1"> An example input logical form is &quot;report(inform, agent(AgentID), cancelactivity(ActivityID))&quot;, which corresponds to the report &quot;I have cancelled flying to the tower&quot; when AgentID refers to the robot and ActivityID refers to a &quot;fly to the tower&quot; task.</Paragraph> <Paragraph position="2"> Items which the system will consider for generation are placed (either directly by the robot, or indirectly by the Activity Tree) on the &quot;System Agenda&quot; (SA), which is the part of the dialogue Information State that stores the communicative goals of the system. Communicative goals may also exist on the &quot;Pending List&quot; (PL), which is the part of the Information State that stores questions the system has asked but the user has not answered, so that they may be re-raised by the system. Only questions previously asked by the system can exist on the Pending List.</Paragraph> <Paragraph position="3"> Due to multi-tasking, at any time there are a number of &quot;Current Activities&quot; which the user and system are performing (e.g. fly to the tower, search for a red car). These activities are topics of conversation (defining threads of the DMT) represented in the dialogue Information State, and the system's reports can be generated by them (in which case they are tagged with that activity label) or can be relevant to an activity by virtue of being about an object which is in focus because it is involved in that activity.</Paragraph> <Paragraph position="4"> Some system reports are more urgent than others (e.g. &quot;I am running out of fuel&quot;), and these carry the label warning. Warnings are always relevant, no matter what activities are current - they always pass the recency and relevance filters.</Paragraph> <Paragraph position="5"> Echoing (for noun phrases) is achieved by accessing the Salience List whenever generating referential terms, and using whatever noun phrase (if any) the user has previously employed to refer to the object in question. If the object is at the top of the Salience List, the generator will select an anaphoric expression.</Paragraph> <Paragraph position="6"> The end result of our selection and aggregation module (see Section 6.2) is a fully specified logical form which is sent to the Semantic-Head-Driven Generation component of Gemini (Shieber et al., 1990). The bi-directionality of Gemini (i.e. that we use the same grammar for both parsing and generation) automatically confers a useful &quot;symmetry&quot; property on the system - that it only utters sentences which it can also understand. This means that the user will not be misled by the system into employing out-of-vocabulary items or out-of-grammar constructions.
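Returning to the echoing strategy described above, a minimal sketch (with assumed, hypothetical data shapes) of how a referring expression might be chosen using the Salience List and the noun phrases the user has previously used:

def referring_expression(obj, salience_list, user_nps, default_nps):
    """salience_list: most salient object first; user_nps/default_nps map objects to noun phrases."""
    if salience_list and salience_list[0] == obj:
        return "it"                              # most salient object: use an anaphor
    return user_nps.get(obj, default_nps[obj])   # echo the user's own phrasing, else a default description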
A further side effect of this bi-directionality is that the system's utterances prime the user to make in-grammar utterances, thus enhancing co-ordination between user and system in the dialogues.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Incremental aggregation </SectionTitle> <Paragraph position="0"> Aggregation combines and compresses utterances to make them more concise, avoid repetitious language structure, and make the system's speech more natural and understandable. In a dialogue system, aggregation should function incrementally, because utterances are generated on the fly. When constructing an utterance we often have no information about the utterances that will follow it, and thus the best we can do is to compress it or &quot;retro-aggregate&quot; it with utterances that preceded it. Only occasionally does the System Agenda contain enough unsaid utterances to perform reasonable &quot;pre-aggregation&quot;.</Paragraph> <Paragraph position="1"> Each dialogue move type (e.g. report, wh-question) has its own aggregation rules, stored in the class for that LF type. For each type, the rules specify which other dialogue move types can aggregate with it, and exactly how aggregation works. The rules note identical portions of LFs and unify them, and then combine the non-identical portions appropriately.</Paragraph> <Paragraph position="2"> For example, the LF that represents the phrase &quot;I will fly to the tower and I will land at the parking lot&quot; will be converted to one representing &quot;I will fly to the tower and land at the parking lot&quot; according to the compression rules. Similarly, &quot;I will fly to the tower and fly to the hospital&quot; gets converted to &quot;I will fly to the tower and the hospital&quot;.</Paragraph> <Paragraph position="3"> The &quot;retro-aggregation&quot; rules result in sequences of system utterances such as &quot;I have cancelled flying to the school. And the tower. And landing at the base.&quot;</Paragraph> </Section> </Section> </Paper>