<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1083"> <Title>Taking Account of the User's View in 3D Multimodal Instruction Dialogue</Title> <Section position="4" start_page="573" end_page="573" type="metho"> <SectionTitle> 3 Related work </SectionTitle> <Paragraph position="0"> There are many multimodal systems, such as multimedia presentation systems and animated agents (Maybury, 1993; Lester et al., 1997; Bares and Lester, 1997; Stone and Lester, 1996; Towns et al., 1998), all of which use 3D graphics and 3D animations. In some of them (Maybury, 1993; Wahlster et al., 1993; Towns et al., 1998), planning is used in generating multimodal presentations including graphics and animations. They are similar to MID-3D in that they use planning mechanisms in content planning. However, in presentation systems, unlike dialogue systems, the user just watches the presentation without changing her/his view.</Paragraph> <Paragraph position="1"> Therefore, these studies are not concerned with changing the content of the discourse to match the user's view.</Paragraph> <Paragraph position="2"> In some studies of dialogue management (Rich and Sidner, 1998; Stent et al., 1999), the state of the dialogue is represented using Grosz and Sidner's framework (Grosz and Sidner, 1986). We also adopt this theory in our dialogue management mechanism. However, they do not keep track of the user's viewpoint information as a part of the dialogue state because they were not concerned with dialogue management in virtual environments.</Paragraph> <Paragraph position="3"> Studies on pedagogical agents have goals closer to ours. In (Rickel and Johnson, 1999), a pedagogical agent demonstrates the sequential operation of complex machinery and answers some follow-up questions from the student. Lester et al. (1999) propose a life-like pedagogical agent that supports problem-solving activities. Although these studies are concerned with building interactive learning environments using natural language, they do not discuss how to decide the course of on-going instruction dialogues in an incremental and coherent way.</Paragraph> </Section> <Section position="5" start_page="573" end_page="574" type="metho"> <SectionTitle> 4 Overview of the System Architecture </SectionTitle> <Paragraph position="0"> This section describes the architecture of MID-3D. This system instructs users how to dismantle the steering system of a car. The system steps through the procedure and the user can interrupt the system's instructions at any time. Figure 3 shows the architecture and a snapshot of the system. The 3D virtual environment is viewed through an application window. A 3D model of a part of the car is provided and a frog-like character is used as the pedagogical agent (Johnson et al., 2000). The user herself/himself can also appear in the virtual environment as an avatar. The buttons to the right of the 3D screen are operation buttons for changing the viewpoint. By using these buttons, the user can freely change her/his viewpoint at any time.</Paragraph> <Paragraph position="1"> This system consists of six main modules: Input Analyzer, Domain Plan Reasoner, Content Planner (CP), Sentence Planner, Dialogue Manager (DM), and Virtual Environment Controller.
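The flow between these modules, described in the following paragraphs, can be pictured with a minimal Python sketch; the class and method names below are our own illustrative assumptions, not the system's actual interfaces:

    # Hypothetical sketch of one user turn flowing through the MID-3D modules.
    def handle_user_turn(raw_speech, env_event, m):
        # Input Analyzer: voice-recognizer strings + 3D events -> semantics
        semantics = m.input_analyzer.interpret(raw_speech, env_event)
        # Dialogue Manager: update context (incl. the user's view), pick next goal
        goal = m.dialogue_manager.decide_next_goal(semantics)
        # Content Planner: refine the goal into subgoals (see Section 5)
        subgoals = m.content_planner.expand(goal)
        m.dialogue_manager.agenda[:0] = subgoals
        # Sentence Planner + Virtual Environment Controller: realize the output
        text, action = m.sentence_planner.realize(subgoals[0])
        m.voice_synthesizer.speak(text)
        m.environment_controller.render(action)  # HyCLASS animates in real time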
First of all, the user's inputs are interpreted through the Input Analyzer. It receives strings of characters from the voice recognizer and the user's inputs from the Virtual Environment Controller. It interprets these inputs, transforms them into a semantic representation, and sends them to the DM.</Paragraph> <Paragraph position="2"> The DM, working as a dialogue management mechanism, keeps track of the dialogue context, including the user's view, and decides the next goal (or action) of the system. Upon receiving an input from the user through the Input Analyzer, the DM sends it to the Domain Plan Reasoner (DPR) to get discourse goals for responding to the input. For example, if the user requests some instruction, the DPR decides the sequence of steps that realizes the procedure by referring to domain knowledge. The DM then adds the discourse goals to the goal agenda.</Paragraph> <Paragraph position="3"> If the user does not submit a new topic, the DM continues to expand the instruction plan by sending a goal in the goal agenda to the CP. Details of the DM are given in Section 6.</Paragraph> <Paragraph position="4"> After the goal is sent to the CP, it decides the appropriate contents of the instruction dialogue by employing a refinement-driven hierarchical linear planning technique. When it receives a goal from the DM, it expands the goal and returns its subgoals to the DM. By repeating this process, the dialogue contents are gradually specified. Therefore, the CP provides the scenario for the instruction based on the control provided by the DM. Details of the CP are provided in Section 5.</Paragraph> <Paragraph position="5"> The Sentence Planner generates surface linguistic expressions coordinated with action (Kato et al., 1996). The linguistic expressions are output through a voice synthesizer. Actions are realized through the Virtual Environment Controller as 3D animation.</Paragraph> <Paragraph position="6"> For the Virtual Environment Controller, we use HyCLASS (Kawanobe et al., 1998), which is a 3D simulation-based environment for educational activities. Several APIs are provided for controlling HyCLASS. By using these interfaces, the CP and the DM can discern the user's view and issue action commands in order to change the virtual environment. When HyCLASS receives an action command, it interprets the command and renders the 3D animation corresponding to the action in real time.</Paragraph> </Section> <Section position="6" start_page="574" end_page="575" type="metho"> <SectionTitle> 5 Selecting the Content of Instruction Dialogue </SectionTitle> <Paragraph position="0"> In this section, we introduce the CP and show how the instruction dialogue is decided in an incremental way to match the user's view.</Paragraph> <Section position="2" start_page="574" end_page="575" type="sub_section"> <SectionTitle> 5.1 Content Planner </SectionTitle> <Paragraph position="0"> In MID-3D, the CP is called by the DM. When a goal is put to the CP from the DM, it selects a plan operator for achieving the goal, applies the operator to find new subgoals, and returns them to the DM. The subgoals are then added to the goal agenda maintained by the DM. Therefore, the CP provides the scenario for the instruction dialogue to the DM and enables MID-3D to output coherent instructions. Moreover, the Content Planner employs depth-first search with a refinement-driven hierarchical linear planning algorithm as in (Cawsey, 1992).
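As a concrete illustration of this control loop, consider the following minimal Python sketch; the agenda representation and function names are assumptions made for exposition, not the actual MID-3D code:

    # Sketch of refinement-driven hierarchical linear planning (cf. Cawsey, 1992).
    def expand_step(agenda, operators, cp_select, execute_primitive):
        """One DM-CP cycle: pop a goal, then refine it or realize it directly."""
        goal = agenda.pop(0)             # agenda: list of pending discourse goals
        op = cp_select(goal, operators)  # CP picks a matching plan operator
        if op is None:                   # no operator applies: goal is primitive
            execute_primitive(goal)
            return
        # Depth-first refinement: subgoals are prepended, so the plan is
        # developed incrementally and can still change as the dialogue evolves.
        agenda[:0] = list(op.main_acts) + list(op.subsidiary_acts)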
The advantage of this method is that the plan is developed incrementally, and can be changed while the conversation is in progress. Thus, by applying this algorithm to 3D dialogues, it becomes possible to set instruction dialogue strategies that are contingent on the user's view.</Paragraph> </Section> <Section position="3" start_page="575" end_page="575" type="sub_section"> <SectionTitle> 5.2 Considering the User's View in Content Selection </SectionTitle> <Paragraph position="0"> In order to decide the dialogue content according to the user's view, we extend the description of the content plan operator (André and Rist, 1993) by using the user's view as a constraint in plan operator selection. We also modify the constraint checking functions of the previous planning algorithm such that HyCLASS is queried about the state of the virtual environment. Figure 4 shows examples of content plan operators. Each operator consists of the name of the operator (Header), the effect resulting from plan execution (Effect), the constraints for executing the plan (Constraints), the essential subgoals (Main-acts), and the optional subgoals (Subsidiary-acts). As shown in (Operator 1) in Figure 4, we use the constraint (Visible-p (Visible ?object t)) to check whether the object is visible from the user's viewpoint.</Paragraph> <Paragraph position="1"> Actually, the CP asks HyCLASS to examine whether the object is in the student's field of view.</Paragraph> <Paragraph position="2"> If an object is bound to the ?object variable by referring to the knowledge base, and the object is visible to the user, (Operator 1) is selected. As a result, two Main-Acts (looking at the user and requesting to try to do the action) and two Subsidiary-Acts (showing how to do the action, then resetting the state) are set as subgoals and returned to the DM.</Paragraph> <Paragraph position="3"> In contrast, if the object is not visible to the user, (Operator 2) is selected. In this case, a goal for making the user identify the object is added to the Main-Acts: (Make-recognize S H (Object ?object) MM).</Paragraph> <Paragraph position="4"> As shown above, the user's view is considered in deciding the instruction strategy. In addition to the above example, the distance between the target object and the user, as well as the three-dimensional overlapping of objects, can also be considered as constraints related to the user's view.</Paragraph> <Paragraph position="5"> Although the user's view is also considered in selecting locative expressions of objects in the Sentence Planner in MID-3D, we do not discuss this issue here because surface generation is not the focus of this paper.</Paragraph> </Section> </Section>
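A rough rendering of such an operator and of the visibility check might look as follows. The field names mirror Figure 4, while the helper functions and the HyCLASS query are assumed stand-ins; this is a sketch, not the actual implementation:

    # Hypothetical encoding of the content plan operators of Figure 4.
    from dataclasses import dataclass, field

    @dataclass
    class PlanOperator:
        header: str              # name of the operator
        effect: str              # state resulting from plan execution
        constraints: list        # e.g., [("visible", "?object")]
        main_acts: list          # essential subgoals
        subsidiary_acts: list = field(default_factory=list)  # optional subgoals

    def constraint_holds(constraint, hyclass):
        kind, obj = constraint
        if kind == "visible":
            # The CP asks HyCLASS whether the object is in the field of view.
            return hyclass.in_field_of_view(obj)  # assumed HyCLASS-style query
        return True

    def select_operator(goal, operators, hyclass):
        """Return the first operator for this goal whose constraints hold."""
        for op in operators:
            if op.header == goal and all(constraint_holds(c, hyclass)
                                         for c in op.constraints):
                return op
        return None

Under this scheme, (Operator 1) and (Operator 2) would differ only in their visibility constraint and in the extra Make-recognize Main-Act.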
<Section position="7" start_page="575" end_page="576" type="metho"> <SectionTitle> 6 Managing Interruptive Subdialogues </SectionTitle> <Paragraph position="0"> The DM controls the other components of MID-3D based on a discourse model that represents the state of the dialogue. This section describes the DM and shows how the user's view is used in managing the instruction dialogue.</Paragraph> <Section position="2" start_page="575" end_page="576" type="sub_section"> <SectionTitle> 6.1 Maintaining the Discourse Model </SectionTitle> <Paragraph position="0"> The DM maintains a discourse model for tracking the state of the dialogue. The discourse model consists of the discourse goal agenda (agenda), the focus stack, and the dialogue history. The agenda is a list of goals that should be achieved through a dialogue between the user and the system. If all the goals in the agenda are accomplished, the instruction dialogue finishes successfully. The focus stack is a stack of discourse segment frames (DSFs). Each DSF is a frame structure that stores the following information as slot values:</Paragraph> <Paragraph position="1"> - utterance content (UC): A list of utterance contents constructing a discourse segment. Physical actions are also regarded as utterance contents (Ferguson and Allen, 1998).</Paragraph> <Paragraph position="2"> - discourse purpose (DP): The purpose of a discourse segment.</Paragraph> <Paragraph position="3"> - goal state (GS): A state (or states) which should be accomplished to achieve the discourse purpose of the segment.</Paragraph> <Paragraph position="4"> In addition to these, we add the user's viewpoint slot to the DSF description in order to track the user's viewpoint information: - user's viewpoint (UV): The current user's viewpoint, which is represented as the position and orientation of the camera. The position consists of x-, y-, and z-coordinates. The orientation consists of x-, y-, and z-angles of the camera.</Paragraph> <Paragraph position="5"> The basic algorithm of the DM is to repeat (a) the performing-actions step and (b) the updating-the-discourse-model step, until there is no unsatisfied goal in the agenda (Traum, 1994). In the performing-actions step, the DM decides what to do next in the current dialogue state, and then performs the action. When continuing the system explanation, the DM posts the first goal in the agenda to the CP. If the user's response is needed in the current state, the DM waits for the user's input.</Paragraph> <Paragraph position="6"> The other step in the DM algorithm is to update the discourse model according to the state that results from the actions performed by the user as well as the actions performed by the system. Although we do not detail this step here, the following operations could be executed depending on the case. If the current discourse purpose is accomplished, the top-level DSF is popped and added to the dialogue history. The system then assumes that the user understands the instruction and adds the assumption to the user model. If a new discourse purpose is introduced from the CP, the DM creates a new DSF by setting the header of the selected plan operator in the discourse purpose slot and the effect of the operator in the goal state slot. The DSF is then pushed onto the focus stack. If the current discourse purpose is continued, the DM updates the information of the top-level DSF.</Paragraph> </Section>
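The discourse model of this subsection could be rendered roughly as follows (Python; the slot names follow the paper, while the class layout and helper names are our own simplifying assumptions):

    # Hypothetical data structures for the discourse model of Section 6.1.
    from dataclasses import dataclass, field
    from typing import Optional, Tuple

    Viewpoint = Tuple[Tuple[float, float, float], Tuple[float, float, float]]

    @dataclass
    class DSF:                    # discourse segment frame
        dp: str                   # discourse purpose (operator header)
        gs: list                  # goal state(s) (operator effect)
        uc: list = field(default_factory=list)  # utterances and physical actions
        uv: Optional[Viewpoint] = None          # ((x, y, z), (x-, y-, z-angle))

    @dataclass
    class DiscourseModel:
        agenda: list = field(default_factory=list)      # unsatisfied goals
        focus_stack: list = field(default_factory=list)
        history: list = field(default_factory=list)

        def push_purpose(self, op):    # new discourse purpose from the CP
            self.focus_stack.append(DSF(dp=op.header, gs=[op.effect]))

        def pop_accomplished(self):    # current discourse purpose achieved
            self.history.append(self.focus_stack.pop())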
<Section position="3" start_page="576" end_page="576" type="sub_section"> <SectionTitle> 6.2 Considering the User's View in Coping with Interruptive Subdialogues </SectionTitle> <Paragraph position="0"> The main difference between the Dialogue Manager of our system and previous ones is that it maintains the user's viewpoint information and uses it in managing the dialogue. When the DM updates the information of the current DSF, it observes the user's viewpoint at that point and renews the UV slot; it also adds the semantic representation of the utterance (or action) to the UC slot. As a result, it becomes possible to update the user's viewpoint information at each turn, and to track the user's viewpoint in an on-going dialogue.</Paragraph> <Paragraph position="1"> By using this mechanism, the DM can cope with interruptive subdialogues. In resuming from a subdialogue, the user may become confused if the dialogue is resumed but the observed state differs from what the user remembers. In order to match the view to the resumed dialogue, the DM refers to the UV slot of the top DSF and puts the user's view back to that point. This ensures that the user experiences a smooth transition back to the previous topic. Figure 5 shows an example of the state of a dialogue. DSF12 represents a discourse segment that describes how to remove the left tie rod end. DSF121 represents the user-initiated interruptive subdialogue about where the right knuckle arm boot is. Immediately before starting DSF121, the user's viewpoint in DSF12 is ((-38, -22, -259) (0, -0.33, 0)). After completing the subdialogue by answering the user's question, DSF121 is popped and the system resumes DSF12. At this time, the DM gets the viewpoint value of the top DSF (DSF12), and commands HyCLASS to change the viewpoint to that view, which is in this case ((-38, -22, -259) (0, -0.33, 0)). The system then restarts the interrupted dialogue.</Paragraph> </Section> </Section>
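Resuming from an interruptive subdialogue then amounts to popping the finished DSF and restoring the stored view, roughly as below; set_viewpoint stands in for the relevant HyCLASS command, whose real API we do not reproduce:

    # Hypothetical sketch of the resume step of Section 6.2.
    def resume_interrupted_dialogue(model, hyclass, replan):
        model.pop_accomplished()          # e.g., DSF121 leaves the focus stack
        resumed = model.focus_stack[-1]   # e.g., DSF12 becomes current again
        if resumed.uv is not None:
            position, orientation = resumed.uv
            # Assumed command; per the paper, the current system approximates
            # arbitrary camera moves with the nearest predefined viewpoint.
            hyclass.set_viewpoint(position, orientation)
        replan(resumed.dp)                # re-plan the interrupted goal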
<Section position="8" start_page="576" end_page="577" type="metho"> <SectionTitle> 7 Example </SectionTitle> <Paragraph position="0"> In order to illustrate the behavior of MID-3D, an example is shown in Figure 6. [Figure 6, excerpt: [14] System: The left knuckle arm is removed like this. (with the animation showing the left knuckle arm coming off) [15] User: (after moving the viewpoint to Figure 1 and clicking the right knuckle arm) What is this? [16] System: This is the right knuckle arm. [17] User: OK. [18] System: Now, let's continue the explanation. (with changing the view to the one in utterance [14]) [19] System: The left knuckle arm is removed like this. (with the animation showing the left knuckle arm coming off)] This is a part of an instruction dialogue on how to dismantle the steering system of a car. The current topic is removing the left knuckle arm. In utterance [14], the system describes how to remove this part in conjunction with an animation created by HyCLASS.</Paragraph> <Paragraph position="1"> In [15], the user interrupted the system's instruction and asked "What is this?" by clicking the right knuckle arm. At this point, the user's speech input was interpreted in the Input Analyzer and a user-initiative subdialogue started by pushing another DSF onto the focus stack. (In the current system, it is not possible to move the camera to an arbitrary point because of the limitations of the virtual environment controller employed. Accordingly, this function is approximated by selecting the nearest of several predefined viewpoints.)</Paragraph> <Paragraph position="2"> In order to answer the question, the DM asked the Domain Plan Reasoner how to answer the user's question. As a result, a discourse goal was returned to the DM and added to the agenda. The DM then sent the goal (Describe-name S H (object knuckle_arm_r)) to the CP. This goal generated utterance [16].</Paragraph> <Paragraph position="3"> In system utterance [18], in order to resume the dialogue, a meta-comment, "Now let's continue the explanation", was generated and the viewpoint returned to the previous one in [14] as noted in the DSF. After returning to the previous view, the interrupted goal was re-planned. As a result, utterance [19] was generated.</Paragraph> <Paragraph position="4"> After completing this operation in [23], the next step, removing the right tie rod end, is started. At this time, if the user is viewing the left side (Figure 2) and the system has the goal (Instruct-act S H remove-tierod_end_r MR), (Operator 2) in Figure 4 is applied because the target object, the right tie rod end, is not visible from the user's viewpoint. Thus a goal of making the user view the right tie rod end is added as a subgoal and utterances [24] and [25] are generated.</Paragraph> </Section> </Paper>