<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1127">
  <Title>Learning Mixed Initiative Dialog Strategies By Using Reinforcement Learning On Both Conversants</Title>
  <Section position="4" start_page="1012" end_page="1012" type="metho">
    <SectionTitle>
3 Task Specification
</SectionTitle>
    <Paragraph position="0"> For our experiment we use the task presented in Yang and Heeman (2004), which is a modification of the DesignWorld task of Walker (1995). The task requires 2 conversants to agree on 5 pieces of furniture to place in a room. Both conversants know all of the furniture items that can be chosen, which differ by color, type and point value. Each conversant also has private preferences about which furniture items it wants in the room; such as 'if there is a red couch in the room, I also want a lamp in the room'.</Paragraph>
    <Paragraph position="1"> Each preference has a score. As this is a collaborative task, the conversants have the goal of finding the 5 furniture items that have the highest score, where the score is the sum of the point value of each of the 5 chosen furniture items less the scores for any violated preferences of either conversant.</Paragraph>
    <Paragraph position="2"> The conversational agents work to achieve their goal by performing the following actions: propose, accept, reject, inform, and release turn. If there is not a current proposal, either agent can propose an item, which makes that item into the current proposal. If there is a current proposal, the other conversant can accept it or reject it. Accepting an item results in that item being included in the task solution and removes it as the current proposal. Rejecting a proposed item removes it as the current proposal.</Paragraph>
    <Paragraph position="3"> When an item has been rejected it remains a valid choice for future proposals. In addition to accepting or rejecting a proposal, either conversant may inform the other conversant of preferences that are violated by the current proposal. A preference is violated by the current proposal if the addition of that proposed item to the solution set would cause the solution set to violate the preference. When a conversant informs of a violated preference, that preference becomes mutually known and so affects future decisions by both participants. Only preferences that are not known by the other conversant are communicated. For turn taking, we include the action release turn, which the conversant that currently has the turn can perform to signal that it is relinquishing the turn (cf. Traum and Hinkelman, 1992). Note that after a release turn, the other agent must make the next move, which could itself be a release turn. The inclusion of this action allows conversants to perform multiple actions in a row, such as a reject, an inform, and a propose. Our approach to turn taking differs slightly from Yang and Heeman, as they make it an implicit part of other actions.</Paragraph>
    <Paragraph position="4"> In order to successfully utilize these actions in a dialog, some reasoning effort is required of the conversants. Conversants must be able to determine what preferences are violated by a pending proposal and which of the remaining items makes the best proposal. In order to keep the reasoning effort manageable, we follow Yang and Heeman and use a greedy algorithm to pick the item that results in the best score for the item plus the set of items already accepted. The conversants do not consider interactions with the items that will be subsequently added to the plan. Conversants using this greedy approach can construct a plan that is very close to optimal.</Paragraph>
  </Section>
  <Section position="5" start_page="1012" end_page="1014" type="metho">
    <SectionTitle>
4 Learning Specification
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1012" end_page="1013" type="sub_section">
      <SectionTitle>
4.1 Agent Specification
</SectionTitle>
      <Paragraph position="0"> In order to apply reinforcement learning to this task we must formalize the conversants as reinforcement  learning agents, specifying their state and actions, as well as the environment they will interact in. In order to reduce the size of the state space for this task we simplified the representation of the state in a manner similar to that done by Levin (2004). We formulated the state of the dialog agents with many of the more specific details of the actual state of the task removed. For instance the agent state does not include specific information about the furniture item that is the pending proposal, rather the agent's state only indicates that there is a pending proposal.</Paragraph>
      <Paragraph position="1"> The state specification for each agent includes the following binary variables: Pending-Proposal, I-Proposed, Violated-Preference, Prior-Violated-Preferences, and Better-Alternative. Pending-Proposal indicates whether an item has been proposed but not accepted or rejected. I-Proposed indicates if the agent made the most recent proposal.</Paragraph>
      <Paragraph position="2"> Violated-Preference indicates that the pending proposal has caused one or more violations of the conversant's private preferences. Prior-Violated-Preferences indicates whether the conversant had one or more violated preferences when the pending proposal was made. This variable allows the agent to remember what its original response to a proposal was, even after it may have shared all of its preferences that were violated (thus creating a state where it no longer has any violated personal preferences).</Paragraph>
      <Paragraph position="3"> Better-Alternative indicates that the agent thinks it knows an item that would achieve a better score than the item currently proposed.</Paragraph>
      <Paragraph position="4"> The actions from Section 3 can be sequenced in a number of different orders, leading to different policies. Unlike Yang and Heeman, who compared handcrafted policies, we use reinforcement learning to learn policy pairs, one part of the pair for the system, and the other for the simulated user. We have restricted the space of policies that can be learned.</Paragraph>
      <Paragraph position="5"> First, we reduce the space by only considering legal sequences of actions. For example, if there is a pending proposal, another item cannot be proposed.</Paragraph>
      <Paragraph position="6"> Second, after 5 items have been accepted, the dialog is automatically ended. Third, to keep the space of dialog policies small, we force an inform to inform of all violated preferences at once.</Paragraph>
      <Paragraph position="7"> The Reinforcement Learning states and actions of our dialog agents capture a subset of the true state of the dialog. Our agents do not have the ability to distinguish between, or develop distinct policies in response to, the proposal of a blue chair versus a red desk. Since our formulation of the dialog agents do not encode specific information about items or preferences, the dialog environment must maintain these details. This extra information that must include the currently proposed item, what each agent's private and currently violated preferences are, what preferences are shared between each agent, what items have been accepted as part of the task solution, and what items are still available for selection. This technique of generalizing the state space is the same as the one used by Levin (2000), and allows us to keep the state space at a manageable size for our task.</Paragraph>
    </Section>
    <Section position="2" start_page="1013" end_page="1014" type="sub_section">
      <SectionTitle>
4.2 Reinforcement Learning
</SectionTitle>
      <Paragraph position="0"> For our Reinforcement Learning algorithm we chose to use an on-policy Monte Carlo method (Sutton and Barto, 1998). Our chosen task is naturally episodic since the two agents agreeing upon five items indicates task completion and thus the end of the dialog, which constitutes one learning episode. We also imposed a limit of 500 interactions per dialog in order to ensure that each learning episode was finite even if the task was not successfully completed. For some state-action pairs our task does not allow the accurate specification of the resulting state. In fact, due to the way that our state representation simplifies the true task environment an action choice for many states will necessarily lead to different states depending upon the task environment. For instance, proposing an item will sometimes lead to that items acceptance and sometimes it will be rejected. Given this uncertainty our learning approach necessarily had to learn the expected rewards of actions instead of states.</Paragraph>
      <Paragraph position="1"> At the end of each dialog the interaction is given a score based on the evaluation function and that score is used to update the dialog policy of both agents. The state-action history for each agent is iterated over separately and the score from the recent dialog is averaged in with the expected return from the existing policy. We chose not to include any discounting factor to the dialog score as we progressed back through the dialog history. The decision to equally weight each state-action pair in the dialog history was made because an action's contribution to the dialog score is not dependent upon its  proximity to the end of the task. An action that accepts a proposed item at the beginning of the dialog should be rewarded as much as an action that accepts a proposed item later in the same dialog.</Paragraph>
      <Paragraph position="2"> In order for the learning agents to obtain a large enough variety of experiences to fully explore the state space some exploration technique must be used. We chose to use e-greedy action selection in order to achieve this goal. With this approach the dialog agent makes an on policy action choice with probability 1-e and a random valid action choice the rest of the time.</Paragraph>
      <Paragraph position="3"> Training both agents simultaneously causes each agent to learn its policy as an optimal response to the opposing agent. This can create problems in the initial stages of training as each agent has an immature policy that is based on little experience. In this situation each of the agents will associate weights with state action pairs based on action choices of the opposing agent that are themselves not well developed.</Paragraph>
      <Paragraph position="4"> As training progresses the eccentricities of the initial immature policies are perpetuated and the learning process does not converge on an effective dialog policy for either agent.</Paragraph>
      <Paragraph position="5"> In order to combat the problem of converging to an effective policy we divided up the agent training process into multiple epochs. Each epoch is composed of a number of training episodes. The initial epsilon value is set to a large value and for each successive epoch the epsilon value for action selection is decreased. With an initially high epsilon value the agents are able to develop a policy that is initially weighted more heavily towards a response to random action selection than the immature policy of the other agent. As the epsilon value decreases, each agent slowly adjusts its learning to be weighted more heavily towards a response to the other agent's policy. This approach allows the agents to develop a minimally coherent dialog policy before beginning to rely too heavily upon the response of the opposing agent.</Paragraph>
      <Paragraph position="6"> Utilizing this strategy of continuously decreasing epsilon values we were able to get both agents to converge to an effective and coherent dialog policy.</Paragraph>
      <Paragraph position="7"> The initial epsilon value was set to 80</Paragraph>
    </Section>
    <Section position="3" start_page="1014" end_page="1014" type="sub_section">
      <SectionTitle>
4.3 Objective Function
</SectionTitle>
      <Paragraph position="0"> In the reinforcement learning process the objective function provides the dialog agents with feed-back on the success of each dialog. The specification of this function requires input from a human. For our learning specification we crafted a simple function that attempted to model a human perception of a dialog's quality. Our objective function is linear combination of the solution quality (S) and the dialog length (L), taking the form:</Paragraph>
      <Paragraph position="2"> where w1 and w2 are positive constants. As higher values for S and lower values for L indicate better dialogs, we subtract w2L from w1S. Instead of attempting to hand pick the constants in the objective function, we explored the effects of different values, which we report in Section 5.2.</Paragraph>
      <Paragraph position="3"> For our experiment we trained the dialog agents for 200 epochs, where each epoch consisted of 200 training episodes. After the training the agents, we then had them perform 5000 dialogs with 100% on-policy action selection (i.e. strictly following the learned policy). The results of these 5000 dialogs were then combined to obtain an average plan score and average number of interactions for the policy of the agents. These two values are then combined according to the objective function to obtain a numeric score for the learned policy.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>