<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1127">
<Title>Learning Mixed Initiative Dialog Strategies By Using Reinforcement Learning On Both Conversants</Title>
<Section position="6" start_page="1014" end_page="1017" type="evalu">
<SectionTitle> 5 Results </SectionTitle>
<Paragraph position="0"> In this section, we present the results of the dialog policies that we learned. We first present three baseline policy pairs to which we will compare the performance of our learned policies. We then present results for policies learned with different weights in the objective function, in comparison to the baseline policies. As we are learning a pair of policies, one for the system and one representing the user, we also explore how well the system policy does against handcrafted policies that represent what a user might do, rather than testing it only against its learned counterpart.</Paragraph>
<Section position="1" start_page="1014" end_page="1015" type="sub_section">
<SectionTitle> 5.1 Baseline Policies </SectionTitle>
<Paragraph position="0"> In order to provide comparative data to evaluate the effectiveness of our approach, we will compare the performance of the policies learned for the system and user against several pairs of handcrafted policies. The first pair implements the unrestricted initiative strategy of Yang and Heeman. Here, one conversant, A, proposes an item and then the other, B, informs A of any violated preferences. B then proposes an alternative and A informs B of any violated preferences. The process repeats until an item is proposed that does not violate any of the other agent's preferences. The second pair of policies implements the restricted initiative strategy of Yang and Heeman, in which A proposes an item and B informs A of any violated preferences. However, the conversants do not switch roles: it is always A who proposes items and B who informs of preferences and accepts.</Paragraph>
<Paragraph position="1"> These two pairs represent successful handcrafted dialog policies. The third pair represents a minimum performance: A proposes an item and B simply accepts it. This is repeated for all 5 items, with A making all of the proposals. This pair is an uncollaborative approach that represents how well A can do on its own.</Paragraph>
</Section>
<Section position="2" start_page="1015" end_page="1015" type="sub_section">
<SectionTitle> 5.2 Impact of Weights on Learned Policy </SectionTitle>
<Paragraph position="0"> We first explore the ability of the reinforcement learning algorithm to learn a dialog policy pair that is optimal with respect to the objective function. The only important aspect of the weights is the ratio between the two, w2/w1, which we varied from 0.1 to 0.5 in increments of 0.02. For each weight setting, we learned 66 policy pairs and tested each policy pair on 1000 different task configurations. We compared the average objective function score of the learned policy pairs with that of the baseline unrestricted policy pair (cf. Scheffler and Young, 2002). Figure 1 shows the percentage of the learned policy pairs that perform at least as well as the unrestricted baseline pair at each weight setting.</Paragraph>
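As an illustration of the evaluation just described, the following Python sketch shows one way the weight sweep could be organized. It assumes a linear objective of the form w1 * solution_quality - w2 * dialog_length (the actual objective function is defined earlier in the paper), and the helpers learn_policy_pair and evaluate_pair are hypothetical stand-ins for the learner and the dialog simulator; this is a sketch of the procedure described in the text, not the implementation used in the experiments.

# Sketch of the weight-ratio sweep (Section 5.2); learn_policy_pair and
# evaluate_pair are hypothetical stand-ins for the learner and simulator.

def objective_score(solution_quality, dialog_length, w1=1.0, w2=0.1):
    # Assumed linear trade-off; only the ratio w2/w1 matters.
    return w1 * solution_quality - w2 * dialog_length

def percent_at_least_as_good(ratio, learn_policy_pair, evaluate_pair,
                             baseline_score, n_pairs=66, n_configs=1000):
    # Learn n_pairs policy pairs at this w2/w1 ratio and return the percentage
    # whose average objective score matches or beats the baseline pair.
    good = 0
    for seed in range(n_pairs):
        pair = learn_policy_pair(w1=1.0, w2=ratio, seed=seed)
        avg_score = evaluate_pair(pair, n_configs=n_configs,
                                  score_fn=objective_score)
        if avg_score >= baseline_score:
            good += 1
    return 100.0 * good / n_pairs

# Weight ratios from 0.1 to 0.5 in increments of 0.02, as in the text.
ratios = [round(0.1 + 0.02 * i, 2) for i in range(21)]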
<Paragraph position="1"> Interestingly, there is a clear lack of convergence in the learning process: no weight ratio learns a good policy pair 100% of the time. Additionally, we see that as the weight ratio increases (putting more emphasis on shorter dialogs), the ability of the algorithm to learn good policies decreases. As the objective function gives dialog length more weight, it becomes more difficult for the learning algorithm to capture the importance of solution quality. We think this lack of convergence is due to learning both the system policy and a simulated user at the same time, which is a more difficult reinforcement learning problem than learning just the policy for the system against a fixed user.</Paragraph>
</Section>
<Section position="3" start_page="1015" end_page="1016" type="sub_section">
<SectionTitle> 5.3 Lack of Convergence </SectionTitle>
<Paragraph position="0"> To better understand the lack of convergence, we examine what happens when a single weight setting is chosen for the objective function. For this analysis, we restricted ourselves to a w2/w1 ratio of 0.1, one of the best performing weights from Section 5.2. For this setting, we learned a number of policy pairs, each trained on a different sequence of task configurations. We then tested each policy pair on 1000 task configurations, in which actions are selected strictly according to the learned policies. This gives us 1000 dialogs for each policy pair. We then computed the average objective function score for each policy pair and plotted the scores as a histogram in Figure 2. As can be seen, at this weight setting, 63% of the learned policies achieved an objective function score around 44.8. However, the rest achieved substantially lower scores. Hence, the reinforcement learning procedure does not always converge on an optimal solution.</Paragraph>
<Paragraph position="1"> To better understand why reinforcement learning is not always converging, we examined the components of the objective function score: solution quality and dialog length. Figure 3 uses the same x-axis as Figure 2, average objective function score; the y-axis plots the average solution quality and average dialog length. We see that at this weight ratio, all learned policy pairs are very consistent in solution quality, and that the differences in objective function scores are mainly due to differences in dialog length.</Paragraph>
<Paragraph position="2"> This is consistent with our earlier observation that the learning algorithm has more difficulty with the dialog length component of the objective function than with solution quality.</Paragraph>
<Paragraph position="3"> [Figure 3: Solution quality and dialog length versus objective function score for policies learned with w2/w1 = 0.1.]</Paragraph>
</Section>
<Section position="4" start_page="1016" end_page="1016" type="sub_section">
<SectionTitle> 5.4 Consistency of Policies </SectionTitle>
<Paragraph position="0"> For the weight ratio of 0.1, the reinforcement learning algorithm usually finds a good policy pair. To further improve the likelihood of this happening, we could learn multiple policy pairs and then pick the best performing one. In this section, we compare learned policies chosen in this way against the restricted baseline pair. We learned 10 sets of 10 policy pairs, ran each pair on 1000 task configurations, and chose the best performing policy pair in each set. We then ran the resulting 10 policy pairs on another set of 1000 task configurations. Table 1 gives the average objective function score for each of the 10 learned policy pairs and the 3 baseline pairs.</Paragraph>
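The selection procedure described above can be sketched as follows. This is only an illustration: learn_policy_pair and evaluate_pair are hypothetical helpers (a learner run with a given random seed, and a simulator that returns the average objective score over a set of task configurations), not the code used in the experiments.

# Sketch of the best-of-set selection (Section 5.4); helpers are hypothetical.

def select_best_pairs(learn_policy_pair, evaluate_pair,
                      n_sets=10, pairs_per_set=10,
                      n_validation=1000, n_test=1000):
    # Learn n_sets groups of policy pairs, keep the best pair of each group
    # (by average objective score on validation configurations), and return
    # each survivor's average score on a fresh set of test configurations.
    test_scores = []
    for s in range(n_sets):
        candidates = [learn_policy_pair(w1=1.0, w2=0.1, seed=(s, i))
                      for i in range(pairs_per_set)]
        best = max(candidates,
                   key=lambda pair: evaluate_pair(pair, n_configs=n_validation))
        test_scores.append(evaluate_pair(best, n_configs=n_test))
    return test_scores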
<Paragraph position="1"> From the table, we see that the learned policy pairs perform almost as well as the restricted policy pair, for both solution quality and dialog length.</Paragraph>
</Section>
<Section position="5" start_page="1016" end_page="1017" type="sub_section">
<SectionTitle> 5.5 Robustness of Learned Policies </SectionTitle>
<Paragraph position="0"> All of the results so far have used the learned policy for the system interacting with the corresponding policy that was learned for the user. However, there is no guarantee that a real user will behave like the learned user policy. Thus, the true test of our approach is to run the learned system policy against actual users. The problem with testing our policies against actual users is that there are a number of aspects of dialog that we have not modeled, such as nonunderstandings, misunderstandings, and even parsing sentences into the action specification and generating sentences from it. Thus, as a simplification, we tested our learned system policy against the handcrafted baseline policies.</Paragraph>
<Paragraph position="1"> For the weight ratio of 0.1, we learned 10 sets of 10 policy pairs and chose the best policy pair from each set. For each of the 10 policy pairs, we ran the system policy against the 6 individual policies from the 3 baseline policy pairs. We changed the handcrafted policies slightly from Yang and Heeman so that they would not fail if they encountered unexpected input. For example, for the restricted policy for A (the conversant who proposes but never informs), if the learned policy proposes an item, A always rejects it. For the restricted policy for B (the conversant who informs but never proposes), if the learned policy releases the turn when no item has been proposed, B simply releases the turn back to the learned policy.</Paragraph>
<Paragraph position="2"> Figure 4 shows the resulting average objective function scores over 1000 dialog runs. For each baseline pair, we show the performance of the pair on its own, and then of each side of the pair interacting with the learned policy. We see that although the performance of the learned policy is not as good as that of the handcrafted pair, it is close, with the major shortcoming being a general increase in dialog length. Thus, the policies that we have learned are robust against different strategies a user might want to use.</Paragraph>
</Section>
</Section>
</Paper>
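A minimal sketch of the cross-play evaluation in Section 5.5, under the assumption that the six handcrafted policies are available as callable objects: run_dialogs is a hypothetical simulator that pairs the learned system policy with one handcrafted policy on a set of task configurations and returns the average objective function score. It is an illustration of the protocol described in the text, not the experimental code.

# Sketch of the robustness (cross-play) evaluation in Section 5.5;
# run_dialogs and the policy objects are hypothetical stand-ins.

def cross_play_scores(learned_system_policy, baseline_pairs, run_dialogs,
                      n_configs=1000):
    # baseline_pairs maps a name (e.g. "restricted") to a (policy_A, policy_B)
    # tuple; the learned system policy is run against each side separately.
    scores = {}
    for name, (policy_a, policy_b) in baseline_pairs.items():
        scores[name + "/A"] = run_dialogs(learned_system_policy, policy_a, n_configs)
        scores[name + "/B"] = run_dialogs(learned_system_policy, policy_b, n_configs)
    return scores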