<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1024">
  <Title>Learning More Effective Dialogue Strategies Using Limited Dialogue Move Features</Title>
  <Section position="7" start_page="187" end_page="190" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> Figure 1 tracks the improvement of the 3 learned strategies for 50000 training dialogues with the 4-gram user simulation, and gure 2 for 50000 training dialogues with the 5-gram simulation. They show the average reward (according to the function of section 2.2) obtained by each strategy over intervals of 1000 training dialogues.</Paragraph>
    <Paragraph position="1"> Table 1 shows the results for testing the strategies learned after 50000 training dialogues (the baseline RL strategy, strategy 2 (UDM) and strategy 3 (USDM)). The 'a' strategies were trained with the 4-gram user simulation and tested with  trained with 5-gram and tested with 4-gram; (av) = average; * signi cance level p &lt; 0.025; ** signi cance level p &lt; 0.005; *** Note: The Hybrid RL scores (here updated from (Henderson et al., 2005)) are not directly comparable since that system has a larger action set and fewer policy constraints. the 5-gram, while the 'b' strategies were trained with the 5-gram user simulation and tested with the 4-gram. The table also shows average scores for the strategies. Column 2 contains the average reward obtained per dialogue by each strategy over 1000 test dialogues (computed using the function of section 2.2).</Paragraph>
    <Paragraph position="2"> The 1000 test dialogues for each strategy were divided into 10 sets of 100. We carried out t-tests and found that in both the 'a' and 'b' cases, strategy 2 (UDM) performs signi cantly better than the RL baseline (signi cance levels p &lt; 0.005 and p &lt; 0.025), and strategy 3 (USDM) performs signi cantly better than strategy 2 (UDM) (significance level p &lt; 0.005). With respect to average performance, strategy 2 (UDM) improves over the RL baseline by 4.9%, and strategy 3 (USDM) improves by 7.8%. Although there seem to be only negligible qualitative differences between strategies 2(b) and 3(b) and their 'a' equivalents, the former perform slightly better in testing. This suggests that the 4-gram simulation used for testing the 'b' strategies is a little more reliable in lling and con rming slot values than the 5-gram.</Paragraph>
    <Paragraph position="3"> The 3rd column HLG05 shows the average scores for the dialogues as computed by the reward function of (Henderson et al., 2005). This is done for comparison with that work but also with the COMMUNICATOR data baseline. Using the HLG05 reward function, strategy 3 (USDM) improves over the original COMMUNICATOR systems baseline by 65.9%. The components making up the reward are shown in the nal 3 columns of table 1. Here we see that all of the RL strategies are able to ll and con rm all of the 4 slots when conversing with the simulated COMMUNICATOR users. The only variation is in the average length of dialogue required to con rm all four slots. The COMMUNICATOR systems were often unable to con rm or ll all of the user slots, and the dialogues were quite long on average. As stated in section 2.4.1, the n-gram simulations do not simulate the case of a particular user goal utterance being unrecognisable for the system. This was a problem that could be encountered by the real COMMUNICATOR systems.</Paragraph>
    <Paragraph position="4"> Nevertheless, the performance of all the learned strategies compares very well to the COMMUNICATOR data baseline. For example, in an average dialogue, the RL strategies lled and con rmed all four slots with around 9 actions not including offering the ight, but the COMMUNICATOR systems took an average of around 33 actions per dialogue, and often failed to complete the task.</Paragraph>
    <Paragraph position="5"> With respect to the hybrid RL result of (Henderson et al., 2005), shown in the nal row of the table, Strategy 3 (USDM) shows a 34% improvement, though these results are not directly comparable because that system uses a larger action set and has fewer constraints (e.g. it can ask how may I help you? at any time, not just at the start of a dialogue).</Paragraph>
    <Paragraph position="6"> Finally, let us note that the performance of the RL strategies is close to optimal, but that there is some room for improvement. With respect to the HLG05 metric, the optimal system score would be 197, but this would only be available in rare cases where the simulated user supplies all 4 slots in the  rst utterance. With respect to the metric we have used here (with a [?]5 per system turn penalty), the optimal score is 85 (and we currently score an average of 55.57). Thus we expect that there are still further improvments that can be made to more fully exploit the dialogue context (see section 4.3).</Paragraph>
    <Section position="1" start_page="189" end_page="189" type="sub_section">
      <SectionTitle>
4.1 Qualitative Analysis
</SectionTitle>
      <Paragraph position="0"> Below are a list of general characteristics of the learned strategies:  mations wherever possible. This allows them to ll and con rm the slots in fewer turns than if they simply asked the slot values and then used explicit con rmation.</Paragraph>
      <Paragraph position="1"> 4. As a result of characteristic 3, which slots can be asked and implicitly con rmed at the same time in uences the order in which the learned strategies attempt to ll and con rm each slot, e.g. if the status of the third slot is ' lled' and the others are 'empty', the learner learns to ask for the second or fourth slot  rather than the rst, since it can implicitly con rm the third while it asks for the second or fourth slots, but it cannot implicitly conrm the third while it asks for the rst slot. This action is not available (see section 2.1).</Paragraph>
    </Section>
    <Section position="2" start_page="189" end_page="190" type="sub_section">
      <SectionTitle>
4.2 Emergent behaviour
</SectionTitle>
      <Paragraph position="0"> In testing the UDM strategy (2) lled and conrmed all of the slots in fewer turns on average than the RL baseline, and strategy 3 (USDM) did this in fewer turns than strategy 2 (UDM).</Paragraph>
      <Paragraph position="1"> What then were the qualitative differences between the three strategies? The behaviour of the three strategies only seems to really deviate when a user response fails to ll or con rm one or more slots. Then the baseline strategy's state has not changed and so it will repeat its last dialogue move, whereas the state for strategies 2 (UDM) and 3 (USDM) has changed and as a result, these may now try different actions. It is in such circumstances that the UDM strategy seems to be more effective than the baseline, and strategy 3 (USDM) more effective than the UDM strategy. In gure 3 we show illustrative state and learned action pairs for the different strategies. They relate to a situation where the rst user response(s) in the dialogue has/have failed to ll a single slot value.</Paragraph>
      <Paragraph position="2"> NB: here 'emp' stands for 'empty' and ' ll' for ' lled' and they appear in the rst four state variables, which stand for slot states. For strategy 2 (UDM), the fth variable represents the user's last  dialogue move, and the for strategy 3 (USDM), the fth variable represents the system's last dialogue move, and the sixth, the user's last dialogue move.</Paragraph>
      <Paragraph position="3">  gies and emergent behaviours: focus switching (for UDM) and giving help (for USDM) Here we can see that should the user responses continue to fail to provide a slot value, the baseline's state will be unchanged and so the strategy will simply ask for slot 2 again. The state for strategy 2 (UDM) does change however. This strategy switches focus between slots 3 and 1 depending on whether the user's last dialogue move was 'null' or 'quiet' NB. As stated in section 2.4, 'null' means out-of-domain or that there was no ASR hypothesis. Strategy 3 (USDM) is different again. Knowledge of the system's last dialogue move as well as the user's last move has enabled the learner to make effective use of the 'give help' action, rather than to rely on switching focus. When the user's last dialogue move is 'null' in response to the system move 'askSlot3', then the strategy uses the 'give help' action before returning to ask for slot 3 again. The example described here is not the only example of strategy 2 (UDM) employing focus switching while strategy 3 (USDM) prefers to use the 'give help' action when a user response fails to ll or con rm a slot. This kind of behaviour in strategies 2 and 3 is emergent dialogue behaviour that has been learned by the system rather than explicitly programmed.</Paragraph>
    </Section>
    <Section position="3" start_page="190" end_page="190" type="sub_section">
      <SectionTitle>
4.3 Further possibilities for improvement over the RL baseline
</SectionTitle>
      <Paragraph position="0"> over the RL baseline Further improvements over the RL baseline might be possible with a wider set of system actions. Strategies 2 and 3 may learn to make more effective use of additional actions than the baseline e.g. additional actions that implicitly con rm one slot whilst asking another may allow more of the switching focus described in section 4.1. Other possible additional actions include actions that ask for or con rm two or more slots simultaneously.</Paragraph>
      <Paragraph position="1"> In section 2.4.1, we highlighted the fact that the n-gram user simulations are not completely realistic and that this will make certain state features more or less important in learning a strategy. Thus had we been able to use even more realistic user simulations, including certain additional context features in the state might have enabled a greater improvement over the baseline. Dialogue length is an example of a feature that could have made a difference had the simulations been able to simulate the case of a particular goal utterance being unrecognisable for the system. The reinforcement learner may then be able to use the dialogue length feature to learn when to give up asking for a particular slot value and make a partially complete database query. This would of course require a reward function that gave some reward to partially complete database queries rather than the all-or-nothing reward function used here.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>