<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1024"> <Title>Learning More Effective Dialogue Strategies Using Limited Dialogue Move Features</Title>
<Section position="5" start_page="185" end_page="187" type="metho"> <SectionTitle> 2 The Experimental Framework </SectionTitle>
<Paragraph position="0"> Each experiment is executed using the DIPPER Information State Update dialogue manager (Bos et al., 2003) (which here is used to track and update dialogue context rather than to decide which actions to take), a Reinforcement Learning program (which determines the next dialogue action to take), and various user simulations. In sections 2.3 and 2.4 we give more details about the reinforcement learner and the user simulations.</Paragraph>
<Section position="2" start_page="185" end_page="185" type="sub_section"> <SectionTitle> 2.1 The action set for the learner </SectionTitle>
<Paragraph position="0"> Below is a list of all the different actions that the RL dialogue manager can take and must learn to choose between based on the context:
1. An open question, e.g. 'How may I help you?'
2. Ask the value for any one of slots 1...n.
3. Explicitly confirm any one of slots 1...n.
4. Ask for the nth slot whilst implicitly confirming[1] either slot value n-1, e.g. 'So you want to fly from London to where?', or slot value n+1.
5. Give help.</Paragraph>
<Paragraph position="1"> 6. Pass to human operator.</Paragraph>
<Paragraph position="2"> 7. Database Query.</Paragraph>
<Paragraph position="3"> There are a couple of restrictions regarding which actions can be taken in which states: an open question is only available at the start of the dialogue, and the dialogue manager can only try to confirm non-empty slots.
[1] Where n = 1 we implicitly confirm the final slot, and where n = 4 we implicitly confirm the first slot. This action set does not include actions that ask for the nth slot whilst implicitly confirming slot value n-2. These will be added in future experiments as we continue to increase the action and state space.</Paragraph> </Section>
<Section position="3" start_page="185" end_page="186" type="sub_section"> <SectionTitle> 2.2 The Reward Function </SectionTitle>
<Paragraph position="0"> We employ an all-or-nothing reward function, which is as follows:
1. Database query, all slots confirmed: +100
2. Any other database query: -75
3. User simulation hangs up: -100
4. DIPPER passes to a human operator: -50
5. Each system turn: -5
To maximise the chances of a slot value being correct, it must be confirmed rather than just filled. The reward function reflects the fact that a successful dialogue manager must maximise its chances of getting the slots correct, i.e. they must all be confirmed. (Walker et al., 2000) showed with the PARADISE evaluation that confirming slots increases user satisfaction.</Paragraph>
<Paragraph position="1"> The maximum reward that can be obtained for a single dialogue is 85, i.e. +100 minus three system turns at -5 each (the dialogue manager prompts the user, the user replies by filling all four of the slots in a single utterance, and the dialogue manager then confirms all four slots and submits a database query).</Paragraph> </Section>
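To make the scoring concrete, the following minimal Python sketch implements this all-or-nothing reward scheme. The slot names, the action labels, and the choice to stack the per-turn -5 cost with the outcome rewards are assumptions made for this illustration, not the authors' actual implementation.

    def turn_reward(system_action, slot_status, user_hung_up=False):
        """Return the reward earned by one system turn.

        slot_status maps each slot name to 'empty', 'filled' or 'confirmed'.
        """
        reward = -5                                  # every system turn costs 5 points
        if user_hung_up:
            reward += -100                           # user simulation hangs up
        elif system_action == "pass_to_operator":
            reward += -50                            # handed over to a human operator
        elif system_action == "database_query":
            if all(v == "confirmed" for v in slot_status.values()):
                reward += 100                        # all slots confirmed: success
            else:
                reward += -75                        # any other database query
        return reward

    # Example: a premature database query with one slot still only filled scores -80.
    print(turn_reward("database_query",
                      {"dest_city": "confirmed", "depart_city": "filled",
                       "depart_date": "confirmed", "depart_time": "confirmed"}))

Under these assumptions a best-case dialogue (prompt, confirm, query) accumulates 3 x -5 + 100 = 85, matching the maximum reward stated above.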
<Section position="4" start_page="186" end_page="186" type="sub_section"> <SectionTitle> 2.3 The Reinforcement Learner's Parameters </SectionTitle>
<Paragraph position="0"> When the reinforcement learner agent is initialized, it is given a parameter string which includes the following:
1. Step Parameter: α = decreasing
2. Discount Factor: γ = 1
3. Action Selection Type = softmax (the alternative is ε-greedy)
4. Action Selection Parameter: temperature = 15
5. Eligibility Trace Parameter: λ = 0.9
6. Eligibility Trace = replacing (the alternative is accumulating)
7. Initial Q-values = 25
The reinforcement learner updates its Q-values using the Sarsa(λ) algorithm (see (Sutton and Barto, 1998)). The first parameter is the step parameter α, which may be a value between 0 and 1, or specified as decreasing. If it is decreasing, as it is in our experiments, then for any given Q-value update α is 1/k, where k is the number of times that the state-action pair for which the update is being performed has been visited. This kind of step parameter ensures that, given a sufficient number of training dialogues, each of the Q-values will eventually converge. The second parameter, the discount factor γ, may take a value between 0 and 1. For the dialogue management problem we set it to 1 so that future rewards are taken into account as strongly as possible.</Paragraph>
<Paragraph position="1"> Apart from updating Q-values, the reinforcement learner must also choose the next action for the dialogue manager, and the third parameter specifies whether it does this by ε-greedy or softmax action selection (here we have used softmax). The fifth parameter, the eligibility trace parameter λ, may take a value between 0 and 1, and the sixth parameter specifies whether the eligibility traces are replacing or accumulating. We used replacing traces because they produced faster learning for the slot-filling task. The seventh parameter supplies the initial Q-values.</Paragraph> </Section>
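As a rough illustration of how these settings fit together, the Python sketch below shows one Sarsa(λ) backup with softmax action selection, the decreasing 1/k step size, and replacing traces. The data structures, state and action names are assumptions made for this example, not the authors' implementation.

    import math
    import random
    from collections import defaultdict

    GAMMA = 1.0          # discount factor
    LAMBDA = 0.9         # eligibility trace parameter
    TEMPERATURE = 15.0   # softmax temperature
    INITIAL_Q = 25.0     # initial Q-values

    Q = defaultdict(lambda: INITIAL_Q)   # Q[(state, action)]
    visits = defaultdict(int)            # visit counts for the decreasing step size
    traces = defaultdict(float)          # eligibility traces e[(state, action)]

    def softmax_action(state, actions):
        """Pick an action with probability proportional to exp(Q / temperature)."""
        prefs = [math.exp(Q[(state, a)] / TEMPERATURE) for a in actions]
        r = random.random() * sum(prefs)
        for a, p in zip(actions, prefs):
            r -= p
            if r <= 0:
                return a
        return actions[-1]

    def sarsa_lambda_update(s, a, reward, s_next, a_next):
        """Apply one Sarsa(lambda) backup with replacing traces and alpha = 1/k."""
        delta = reward + GAMMA * Q[(s_next, a_next)] - Q[(s, a)]
        traces[(s, a)] = 1.0                       # replacing, not accumulating
        visits[(s, a)] += 1
        for key in list(traces):
            alpha = 1.0 / visits[key] if visits[key] else 1.0
            Q[key] += alpha * delta * traces[key]
            traces[key] *= GAMMA * LAMBDA          # decay all traces
            if traces[key] < 1e-4:
                del traces[key]                    # prune negligible traces

    # Example: choose an action, observe the per-turn reward, and back it up.
    state, actions = ("all_slots_empty",), ["open_question", "ask_slot_1", "give_help"]
    a = softmax_action(state, actions)
    sarsa_lambda_update(state, a, -5, ("slot_1_filled",), "ask_slot_2")

With γ = 1 and the all-or-nothing reward of section 2.2, each backup of this kind propagates the final success or failure of the dialogue back to the state-action pairs visited along the way.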
<Section position="5" start_page="186" end_page="187" type="sub_section"> <SectionTitle> 2.4 N-Gram User Simulations </SectionTitle>
<Paragraph position="0"> Here user simulations, rather than real users, interact with the dialogue system during learning. This is because thousands of dialogues may be necessary to train even a simple system (here we train on up to 50000 dialogues), and for a proper exploration of the state-action space the system should sometimes take actions that are not optimal for the current situation, making it a sadistic and time-consuming procedure for any human training the system. (Eckert et al., 1997) were the first to use a user simulation for this purpose, but it was not goal-directed and so could produce inconsistent utterances. The later simulations of (Pietquin, 2004) and (Scheffler and Young, 2001) were to some extent goal-directed and also incorporated an ASR error simulation. The user simulations interact with the system via intentions. Intentions are preferred because they are easier to generate than word sequences and because they allow error modelling of all parts of the system, for example ASR error modelling and semantic errors. The user and ASR simulations must be realistic if the learned strategy is to be directly applicable in a real system.</Paragraph>
<Paragraph position="1"> The n-gram user simulations used here (see (Georgila et al., 2005a) for details and evaluation results) treat a dialogue as a sequence of pairs of speech acts and tasks. They take as input the n-1 most recent speech-act/task pairs in the dialogue history, and based on n-gram probabilities learned from the COMMUNICATOR data (automatically annotated with speech acts and Information States (Georgila et al., 2005b)), they then output a user utterance as a further speech-act/task pair. These user simulations incorporate the effects of ASR errors, since they are built from the user utterances as they were recognized by the ASR components of the original COMMUNICATOR systems. Note that the user simulations do not provide instantiated slot values, e.g. a response to provide a destination city is the speech-act/task pair &quot;[provide info] [dest city]&quot;. We cannot assume that two such responses in the same dialogue refer to the same destination city. Hence, in the dialogue manager's Information State, where we record whether a slot is empty, filled, or confirmed, we only update from filled to confirmed when the slot value is implicitly or explicitly confirmed. An additional function maps the user speech-act/task pairs to a form that can be interpreted by the dialogue manager. Post-mapping user responses are made up of one or more of the following types of utterance: (1) Stay quiet, (2) Provide 1 or more slot values, (3) Yes, (4) No, (5) Ask for help, (6) Hang up, (7) Null (out-of-domain or no ASR hypothesis).</Paragraph>
<Paragraph position="2"> The quality of the 4- and 5-gram user simulations has been established through a variety of metrics and against the behaviour of the actual users of the COMMUNICATOR systems; see (Georgila et al., 2005a).</Paragraph>
<Paragraph position="3"> The user and ASR simulations are a fundamentally important factor in determining the nature of the learned strategies. For this reason we should note the limitations of the n-gram simulations used here. A first limitation is that we cannot be sure that the COMMUNICATOR training data is sufficiently complete, and a second is that the n-gram simulations only use a window of n moves in the dialogue history. This second limitation becomes a problem when the user simulation's current move ought to take into account something that occurred at an earlier stage in the dialogue. It might result in the user simulation repeating a slot value unnecessarily, or in the chance of an ASR error for a particular word being independent of whether the same word was previously recognised correctly. The latter case means we cannot simulate, for example, a particular slot value always being liable to misrecognition. These limitations will affect the nature of the learned strategies: different state features may assume more or less importance than they would if the simulations were more realistic. This is a point that we will return to in the analysis of the experimental results. In future work we will use the more accurate user simulations recently developed following (Georgila et al., 2005a), and we expect that these will improve our results still further.</Paragraph>
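To illustrate the simulation interface just described, here is a small Python sketch in which the next user move is sampled from conditional probabilities over speech-act/task pairs given the n-1 most recent pairs in the history, together with the filled-to-confirmed Information State update. The toy probability table, the pair labels and the function names are placeholders; the real model is estimated from the annotated COMMUNICATOR data.

    import random

    N = 4  # a 4-gram simulation conditions on the 3 most recent speech-act/task pairs

    # ngram_probs[context] -> {next_pair: probability}, learned offline from the corpus
    ngram_probs = {
        (("greeting", "meta"), ("request_info", "dest_city"), ("null", "null")):
            {("provide_info", "dest_city"): 0.7, ("null", "null"): 0.3},
    }

    def next_user_move(history):
        """Sample the next user speech-act/task pair, e.g. ('provide_info', 'dest_city')."""
        context = tuple(history[-(N - 1):])
        candidates = ngram_probs.get(context)
        if not candidates:
            return ("null", "null")           # out-of-domain / no ASR hypothesis
        pairs = list(candidates)
        weights = [candidates[p] for p in pairs]
        return random.choices(pairs, weights=weights, k=1)[0]

    def update_slot(slot_status, slot, confirmed_by_system=False):
        """A provided value moves a slot from empty to filled; it only becomes
        confirmed after explicit or implicit confirmation by the system."""
        if slot_status[slot] == "empty":
            slot_status[slot] = "filled"
        elif slot_status[slot] == "filled" and confirmed_by_system:
            slot_status[slot] = "confirmed"
        return slot_status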
</Section> </Section>
<Section position="6" start_page="187" end_page="187" type="metho"> <SectionTitle> 3 Experiments </SectionTitle>
<Paragraph position="0"> First we learned strategies with the 4-gram user simulation and tested with the 5-gram simulation, and then did the reverse. We experimented with different feature sets, exploring whether better strategies could be learned by adding limited context features. We used two baselines for comparison:
* The performance of the original COMMUNICATOR systems in the data set (Walker et al., 2001).</Paragraph>
<Paragraph position="1"> * An RL baseline dialogue manager learned using only slot-status features, i.e. for each of slots 1-4, is the slot empty, filled or confirmed?</Paragraph>
<Paragraph position="2"> We then learned two further strategies: one adding the system's last dialogue move to the state, and one adding both the user's and the system's last dialogue moves. The possible system and user dialogue moves were those given in sections 2.1 and 2.4 respectively, and the reward function was that described in section 2.2.</Paragraph>
<Section position="1" start_page="187" end_page="187" type="sub_section"> <SectionTitle> 3.1 The COMMUNICATOR data baseline </SectionTitle>
<Paragraph position="0"> We computed the scores for the original hand-coded COMMUNICATOR systems as was done by (Henderson et al., 2005), and we call this the HLG05 score. This scoring function is based on task completion and dialogue length rewards as determined by the PARADISE evaluation (Walker et al., 2000). The function gives 25 points for each slot which is filled, another 25 for each that is confirmed, and deducts 1 point for each system action. In this case the maximum possible score is 197, i.e. 200 minus 3 actions (the system prompts the user, the user replies by filling all four of the slots in one turn, and the system then confirms all four slots and offers the flight). The average score for the 1242 dialogues in the COMMUNICATOR dataset where the aim was to fill and confirm only the same four slots as we have used here was 115.26. The other COMMUNICATOR dialogues involved different slots, relating to return flights, hotel bookings and car rentals.</Paragraph> </Section> </Section> </Paper>