<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1035">
  <Title>Comparing the Utility of State Features in Spoken Dialogue Using Reinforcement Learning</Title>
  <Section position="5" start_page="272" end_page="273" type="metho">
    <SectionTitle>
3 Corpus
</SectionTitle>
    <Paragraph position="0"> For our study, we used an annotated corpus of 20 human-computer spoken dialogue tutoring sessions (for our work we use the ITSPOKE system (Litman and Silliman, 2004) which uses the text-based Why2-ATLAS dialogue tutoring system as its &amp;quot;back-end&amp;quot; (VanLehn et al., 2002)). The content  of the system, and all possible dialogue paths, were authored by physics experts. Each session consists of an interaction with one student over 5 different college-level physics problems, for a total of 100 dialogues. Before each session, the student is asked to read physics material for 30 minutes and then take a pretest based on that material. Each problem begins with the student writing out a short essay response to the question posed by the computer tutor. The fully-automated system assesses the essay for potential flaws in the reasoning and then starts a dialogue with the student, asking questions to help the student understand the confused concepts. The tutor's response and next question is based only on the correctness of the student's last answer. Informally, the dialogue follows a question-answer format. Once the student has successfully completed the dialogue section, he is asked to correct the initial essay. Each of the dialogues takes on average 20 minutes and 60 turns. Finally, the student is given a posttest similar to the pretest, from which we can calculate their normalized learning gain: a4a6a5a8a7a10a9a12a11a14a13a16a15a18a17a19a17a21a20a22a15a23a17a25a24a26a11a28a27a16a20a23a17a21a20a22a15a18a17a29</Paragraph>
    <Paragraph position="2"> Prior to our study, the corpus was annotated for Tutor Moves, which can be viewed as Dialogue Acts (Forbes-Riley et al., 2005) 1 and consisted of Tutor Feedback, Question and State Acts. In this corpus, a turn can consist of multiple utterances and thus can be labeled with multiple moves. For example, a tutor can give positive feedback and then ask a question in the same turn. What type of question to ask will be the action choice addressed in this paper.</Paragraph>
    <Paragraph position="3"> As for features to include in the student state, we  SAQ &amp;quot;Good. What is the direction of that force relative to your fist?&amp;quot; CAQ &amp;quot;What is the definition of Newton's Second Law?&amp;quot; Mix &amp;quot;Good. If it doesn't hit the center of the pool what do you know about the magnitude of its displacement from the center of the pool when it lands? Can it be zero? Can it be nonzero?&amp;quot; NoQ &amp;quot;So you can compare it to my response...&amp;quot;  emotion related features, certainty and frustration, were annotated manually prior to this study (Forbes-Riley and Litman, 2005) 2. Certainty describes how confident a student seemed to be in his answer, while frustration describes how frustrated the student seemed to be when he responded. We include three other automatically extracted features for the Student state: (1) Correctness: whether the student was correct or not; (2) Percent Correct: percentage of correctly answered questions so far for the current problem; (3) Concept Repetition: whether the system is forced to cover a concept again which reflects an area of difficulty for the student.</Paragraph>
  </Section>
  <Section position="6" start_page="273" end_page="274" type="metho">
    <SectionTitle>
4 Experimental Method
</SectionTitle>
    <Paragraph position="0"> The goal of this study is to quantify the utility of adding a feature to a baseline state space. We use the following four step process: (1) establish an action set and reward function to be used as constants throughout the test since the state space is the one MDP parameter that will be changed during the tests; (2) establish a baseline state and policy, and (3) add a new feature to that state and test if adding the feature results in policy changes. Every time we create a new state, we make sure that the generated V-values converge. Finally, (4), we evaluate the effects of adding a new feature by using three metrics: (1) number of policy changes (diffs), (2) % policy change, and (3) Expected Cumulative Reward. These three metrics are discussed in more detail in Section 5.2. In this section we focus on the first three steps of the methodology.</Paragraph>
    <Section position="1" start_page="273" end_page="273" type="sub_section">
      <SectionTitle>
4.1 Establishing Actions and Rewards
</SectionTitle>
      <Paragraph position="0"> We use questions as our system action a1 in our MDP. The action size is 4 (tutor can ask a simple answer question (SAQ), a complex answer question 2In a preliminary agreement study, a second annotator labeled the entire corpus for uncertain versus other, yielding 90% inter-annotator agreement (0.68 Kappa).</Paragraph>
      <Paragraph position="1"> (CAQ), or a combination of the two (Mix), or not ask a question (NoQ)). Examples from our corpus can be seen in Table 2. We selected this as the action because what type of question a tutor should ask is of great interest to the Intelligent Tutoring Systems community, and it generalizes to dialogue systems since asking users questions of varying complexity can elicit different responses.</Paragraph>
      <Paragraph position="2"> For the dialogue reward function a3 we did a median split on the 20 students based on their normalized learning gain, which is a standard evaluation metric in the Intelligent Tutoring Systems community. So 10 students and their respective 5 dialogues were assigned a positive reward of 100 (high learners), and the other 10 students and their respective 5 dialogues were assigned a negative reward of -100 (low learners). The final student turns in each dialogue were marked as either a positive final state (for a high learner) or a negative final state (for a low learner). The final states allow us to propagate the reward back to the internal states. Since no action is taken from the final states, their V-values remain the same throughout policy iteration.</Paragraph>
    </Section>
    <Section position="2" start_page="273" end_page="274" type="sub_section">
      <SectionTitle>
4.2 Establishing a Baseline State and Policy
</SectionTitle>
      <Paragraph position="0"> Currently, our tutoring system's response to a student depends only on whether or not the student answered the last question correctly, so we use correctness as the sole feature in our baseline dialogue state. A student can either be correct, partially correct, or incorrect. Since partially correct responses occur infrequently compared to the other two, we reduced the state size to two by combining Incorrect and Partially Correct into one state (IPC) and keeping Correct (C).</Paragraph>
      <Paragraph position="1"> With the actions, reward function, and baseline state all established, we use our MDP tool to generate a policy for both states (see Table 3). The second column shows the states, the third, the policy determined by our MDP toolkit (i.e. the optimal action to  take in that state with respect to the final reward) and finally how many times the state occurs in our data (state size). So if a student is correct, the best action is to give something other than a question immediately, such as feedback. If the student is incorrect, the best policy is to ask a combination of short and complex answer questions.</Paragraph>
      <Paragraph position="2">  The next step in our experiment is to test whether the policies generated are indeed reliable. Normally, the best way to verify a policy is to conduct experiments and see if the new policy leads to a higher reward for new dialogues. In our context, this would entail running more subjects with the augmented dialogue manager, which could take months. So, instead we check if the polices and values for each state are indeed converging as we add data to our MDP model. The intuition here is that if both of those parameters were varying between a corpus of 19 students versus one of 20 students, then we can't assume that our policy is stable, and hence is not reliable. null To test this out, we made 20 random orderings of our students to prevent any one ordering from giving a false convergence. Each ordering was then passed to our MDP infrastructure such that we started with a corpus of just the first student of the ordering and then determined a MDP policy for that cut, then incrementally added one student at a time until we had added all 20 students. So at the end, 20 random orderings with 20 cuts each provides 400 MDP trials.</Paragraph>
      <Paragraph position="3"> Finally, we average each cut across the 20 random orderings. The first graph in Figure 1 shows a plot of the average V-values against a cut. The state with the plusses is the positive final state, and the one at the bottom is the negative final state. However, we are most concerned with how the non-final states converge. The plot shows that the V-values are fairly stable after a few initial cuts, and we also verified that the policies remained stable over the 20 students as well (see our prior work (Tetreault and Litman, 2006) for details of this method). Thus we can be sure that our baseline policy is indeed reliable for our corpus.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>