<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1035">
  <Title>Comparing the Utility of State Features in Spoken Dialogue Using Reinforcement Learning</Title>
  <Section position="4" start_page="0" end_page="272" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"> We follow past lines of research (such as (Levin and Pieraccini, 1997) and (Singh et al., 1999)) for describing a dialogue a0 as a trajectory within a Markov Decision Process (MDP) (Sutton and Barto, 1998).</Paragraph>
    <Paragraph position="1">  A MDP has four main components: 1: states a0 , 2: actions a1 , 3: a policy a2 , which specifies what is the best action to take in a state, and 4: a reward function a3 which specifies the worth of the entire process. Dialogue management is easily described using a MDP because one can consider the actions as actions made by the system, the state as the dialogue context (which can be viewed as a vector of features, such as ASR confidence or dialogue act), and a reward which for many dialogue systems tends to be task completion success or dialogue length.</Paragraph>
    <Paragraph position="2"> Another advantage of using MDP's to model a dialogue space, besides the fact that the primary MDP parameters easily map to dialogue parameters, is the notion of delayed reward. In a MDP, since rewards are often not given until the final states, dynamic programming is used to propagate the rewards back to the internal states to weight the value of each state (called the V-value), as well as to develop an optimal policy a2 for each state of the MDP. This propagation of reward is done using the policy iteration algorithm (Sutton and Barto, 1998) which iteratively updates the V-value and best action for each state based on the values of its neighboring states.</Paragraph>
    <Paragraph position="3"> The V-value of each state is important for our purposes not only because it describes the relative worth of a state within the MDP, but as more data is added when building the MDP, the V-values should stabilize, and thus the policies stabilize as well. Since, in this paper, we are comparing policies in a fixed data set it is important to show that the policies are indeed reliable, and not fluctuating.</Paragraph>
    <Paragraph position="4"> For this study, we used the MDP infrastructure designed in our previous work which allows the user to easily set state, action, and reward parameters. It then performs policy iteration to generate a policy and V-values for each state. In the following sections, we discuss our corpus, methodology, and results. null</Paragraph>
  </Section>
class="xml-element"></Paper>