<?xml version="1.0" standalone="yes"?>
<Paper uid="P00-1013">
  <Title>Spoken Dialogue Management Using Probabilistic Reasoning</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Dialogue Systems and POMDPs
</SectionTitle>
    <Paragraph position="0"> A Partially Observable Markov Decision Process (POMDP) is a natural way of modelling dialogue processes, especially when the state of the system is viewed as the state of the user. The partial observability capabilities of a POMDP policy allows the dialogue planner to recover from noisy or ambiguous utterances in a natural and autonomous way. At no time does the machine interpreter have any direct knowledge of the state of the user, i.e, what the user wants. The machine interpreter can only infer this state from the user's noisy input. The POMDP framework provides a principled mechanism for modelling uncertainty about what the user is trying to accomplish.</Paragraph>
    <Paragraph position="1"> The POMDP consists of an underlying, unobservable Markov Decision Process. The MDP is specified by:  The actions represent the set of responses that the system can carry out. The transition probabilities form a structure over the set of states, connecting the states in a directed graph with arcs between states with non-zero transition probabilities. The rewards define the relative value of accomplishing certain actions when in certain states.</Paragraph>
    <Paragraph position="2"> The POMDP adds:</Paragraph>
    <Paragraph position="4"> a56 the set of rewards with rewards conditioned on observations as well: a84a86a85a42a57a88a87a89a70a107a87a108a96a86a90a92a94a93 The observations consist of a set of keywords which are extracted from the speech utterances.</Paragraph>
    <Paragraph position="5"> The POMDP plans in belief space; each belief consists of a probability distribution over the set of states, representing the respective probability that the user is in each of these states. The initial belief specified in the model is updated every time the system receives a new observation from the user.</Paragraph>
    <Paragraph position="6"> The POMDP model, as defined above, first goes through a planning phase, during which it finds an optimal strategy, or policy, which describes an optimal mapping of action a71 to belief a82 a75a76a62a53a85a109a62a8a95a32a58a6a57a59a79 , for all possible beliefs. The dialogue manager uses this policy to direct its behaviour during conversations with users. The optimal strategy for a POMDP is one that prescribes action selection that maximises the expected reward. Unfortunately, finding an optimal policy exactly for all but the most trivial POMDP problems is computationally intractable.</Paragraph>
    <Paragraph position="7"> A near-optimal policy can be computed significantly faster than an exact one, at the expense of a slight reduction in performance. This is often done by imposing restrictions on the policies that can be selected, or by simplifying the belief state and solving for a simplified uncertainty representation. null In the Augmented MDP approach, the POMDP problem is simplified by noticing that the belief state of the system tends to have a certain structure. The uncertainty that the system has is usually domain-specific and localised. For example, it may be likely that a household robot system can confuse TV channels ('ABC' for 'NBC'), but it is unlikely that the system will confuse a TV channel request for a request to get coffee. By making the localised assumption about the uncertainty, it becomes possible to summarise any given belief vector by a pair consisting of the most likely state, and the entropy of the belief state.</Paragraph>
    <Paragraph position="9"> The entropy of the belief state approximates a sufficient statistic for the entire belief state 1. Given this assumption, we can plan a policy for every possible such a60 state, entropya69 pair, that approximates the POMDP policy for the corresponding</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Example Domain
</SectionTitle>
    <Paragraph position="0"> The system that was used throughout these experiments is based on a mobile robot, Florence 1Although sufficient statistics are usually moments of continuous distributions, our experience has shown that the entropy serves equally well.</Paragraph>
    <Paragraph position="1"> Nightingale (Flo), developed as a prototype nursing home assistant. Flo uses the Sphinx II speech recognition system (Ravishankar, 1996), and the Festival speech synthesis system (Black et al., 1999). Figure 1 shows a picture of the robot.</Paragraph>
    <Paragraph position="2"> Since the robot is a nursing home assistant, we use task domains that are relevant to assisted living in a home environment. Table 1 shows a list of the task domains the user can inquire about (the time, the patient's medication schedule, what is on different TV stations), in addition to a list of robot motion commands. These abilities have all been implemented on Flo. The medication schedule is pre-programmed, the information about the TV schedules is downloaded on request from the web, and the motion commands correspond to pre-selected robot navigation sequences.</Paragraph>
    <Paragraph position="3">  If we translate these tasks into the framework that we have described, the decision problem has 13 states, and the state transition graph is given in Figure 2. The different tasks have varying levels of complexity, from simply saying the time, to going through a list of medications. For simplicity, only the maximum-likelihood transitions are shown in Figure 2. Note that this model is handcrafted. There is ongoing research into learning policies automatically using reinforcement learning (Singh et al., 1999); dialogue models could be learned in a similar manner. This example model is simply to illustrate the utility of the POMDP approach.</Paragraph>
    <Paragraph position="4"> There are 20 different actions; 10 actions correspond to different abilities of the robot such as going to the kitchen, or giving the time. The remaining 10 actions are clarification or confirmation actions, such as re-confirming the desired TV channel. There are 16 observations that correspond to relevant keywords as well as a nonsense observation. The reward structure gives the most reward for choosing actions that satisfy the user request.</Paragraph>
    <Paragraph position="5"> These actions then lead back to the beginning state. Most other actions are penalised with an equivalent negative amount. However, the confirmation/clarification actions are penalised lightly (values close to 0), and the motion commands are penalised heavily if taken from the wrong state, to illustrate the difference between an undesirable action that is merely irritating (i.e., giving an inappropriate response) and an action that can be much more costly (e.g., having the robot leave the room at the wrong time, or travel to the wrong destination).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 An Example Dialogue
</SectionTitle>
      <Paragraph position="0"> Table 2 shows an example dialogue obtained by having an actual user interact with the system on the robot. The left-most column is the emitted observation from the speech recognition system.</Paragraph>
      <Paragraph position="1"> The operating conditions of the system are fairly poor, since the microphone is on-board the robot and subject to background noise as well as being located some distance from the user. In the final two lines of the script, the robot chooses the correct action after some confirmation questions, despite the fact that the signal from the speech recogniser is both very noisy and also ambiguous, containing cues both for the &amp;quot;say hello&amp;quot; response and for robot motion to the kitchen.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> We compared the performance of the three algorithms (conventional MDP, POMDP approximated by the Augmented MDP, and exact POMDP) over the example domain. The metric used was to look at the total reward accumulated over the course of an extended test. In order to perform this full test, the observations and states from the underlying MDP were generated stochastically from the model and then given to the policy. The action taken by the policy was returned to the model, and the policy was rewarded based on the state-action-observation triplet. The experiments were run for a total of 100 dialogues, where each dialogue is considered to be a cycle of observation-action utterances from the start state request_begun through a sequence of states and back to the start state. The time was normalised by the length of each dialogue cycle.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 The Restricted State Space Problem
</SectionTitle>
      <Paragraph position="0"> Observation True State Belief Entropy Action Reward flo hello request begun 0.406 say hello 100 flo what is like start meds 2.735 ask repeat -100 flo what time is it for will the want time 0.490 say time 100 flo was on abc want tv 1.176 ask which station -1 flo was on abc want abc 0.886 say abc 100 flo what is on nbc want nbc 1.375 confirm channel nbc -1 flo yes want nbc 0.062 say nbc 100 flo go to the that pretty good what send robot 0.864 ask robot where -1 flo that that hello be send robot bedroom 1.839 confirm robot place -1 flo the bedroom any i send robot bedroom 0.194 go to bedroom 100 flo go it eight a hello send robot 1.110 ask robot where -1 flo the kitchen hello send robot kitchen 1.184 go to kitchen 100  utterance is both noisy and ambiguous.</Paragraph>
      <Paragraph position="1"> dra et al., 1997). The solver was unable to complete a solution for the full state space, so we created a much smaller dialogue model, with only 7 states and 2 task domains: time and weather information. null Figure 3 shows the performance of the three algorithms, over the course of 100 dialogues.</Paragraph>
      <Paragraph position="2"> Notice that the exact POMDP strategy outperformed both the conventional MDP and approximate POMDP; it accumulated the most reward, and did so with the fastest rate of accumulation.</Paragraph>
      <Paragraph position="3"> The good performance of the exact POMDP is not surprising because it is an optimal solution for this problem, but time to compute this strategy is high: 729 secs, compared with 1.6 msec for the MDP and 719 msec for the Augmented MDP.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 The Full State Space Problem
</SectionTitle>
      <Paragraph position="0"> Figure 4 demonstrates the algorithms on the full dialogue model as given in Figure 2. Because of the number of states, no exact POMDP solution could be computed for this problem; the POMDP  for the exact POMDP, POMDP approximated by the Augmented MDP, and the conventional MDP for the 7 state problem. In this case, the time is measured in dialogues, or iterations of satisfying user requests.</Paragraph>
      <Paragraph position="1"> policy is restricted to the approximate solution. The POMDP solution clearly outperforms the conventional MDP strategy, as it more than triples the total accumulated reward over the lifetime of the strategies, although at the cost of taking longer to reach the goal state in each dialogue.</Paragraph>
      <Paragraph position="2">  13 state problem. Again, the time is measured in number of actions.</Paragraph>
      <Paragraph position="3"> Table 3 breaks down the numbers in more detail. The average reward for the POMDP is 18.6 per action, which is the maximum reward for most actions, suggesting that the POMDP is taking the right action about 95% of the time. Furthermore, the average reward per dialogue for the POMDP is 230 compared to 49.7 for the conventional MDP, which suggests that the conventional MDP is making a large number of mistakes in each dialogue.</Paragraph>
      <Paragraph position="4"> Finally, the standard deviation for the POMDP is much narrower, suggesting that this algorithm is getting its rewards much more consistently than the conventional MDP.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Verification of Models on Users
</SectionTitle>
      <Paragraph position="0"> We verified the utility of the POMDP approach by testing the approximating model on human users. The user testing of the robot is still preliminary, and therefore the experiment presented here cannot be considered a rigorous demonstration. However, Table 4 shows some promising results. Again, the POMDP policy is the one provided by the approximating Augmented MDP.</Paragraph>
      <Paragraph position="1"> The experiment consisted of having users interact with the mobile robot under a variety of conditions. The users tested both the POMDP and an implementation of a conventional MDP dialogue manager. Both planners used exactly the same model. The users were presented first with one manager, and then the other, although they were not told which manager was first and the order varied from user to user randomly. The user labelled each action from the system as &amp;quot;Correct&amp;quot; (+100 reward), &amp;quot;OK&amp;quot; (-1 reward) or &amp;quot;Wrong&amp;quot; (100 reward). The &amp;quot;OK&amp;quot; label was used for responses by the robot that were questions (i.e., did not satisfy the user request) but were relevant to the request, e.g., a confirmation of TV channel when a TV channel was requested.</Paragraph>
      <Paragraph position="2"> The system performed differently for the three test subjects, compensating for the speech recognition accuracy which varied significantly between them. In user #2's case, the POMDP manager took longer to satisfy the requests, but in general gained more reward per action. This is because the speech recognition system generally had lower word-accuracy for this user, either because the user had unusual speech patterns, or because the acoustic signal was corrupted by background noise.</Paragraph>
      <Paragraph position="3"> By comparison, user #3's results show that in the limit of good sensing, the POMDP policy approaches the MDP policy. This user had a much higher recognition rate from the speech recogniser, and consequently both the POMDP and conventional MDP acquire rewards at equivalent rates, and satisfied requests at similar rates.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>