<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2044">
<Title>Evolving optimal inspectable strategies for spoken dialogue systems</Title>
<Section position="3" start_page="173" end_page="173" type="metho">
<SectionTitle> 2 Learning Classifier Systems and XCS </SectionTitle>
<Paragraph position="0"> Learning Classifier Systems were introduced by John Holland in the 1970s as a framework for learning rule-based knowledge representations (Holland, 1976). In this model, a rule base consists of a population of N state-action rules known as classifiers.</Paragraph>
<Paragraph position="1"> The state part of a classifier is represented by a ternary string over the set {0,1,#}, while the action part is composed from {0,1}. The # symbol acts as a wildcard, allowing a classifier to aggregate states; for example, the state string 1#1 matches the states 111 and 101. Classifier systems have been applied to a number of learning tasks, including data mining, optimisation and control (Bull, 2004).</Paragraph>
<Paragraph position="2"> Classifier systems combine two machine learning techniques to find the optimal rule set. A genetic algorithm is used to evaluate and modify the population of rules, while reinforcement learning is used to assign rewards to existing rules. The search for better rules is guided by the strength parameter associated with each classifier. This parameter serves as a fitness score for the genetic algorithm and as a predictor of future reward (payoff) for the reinforcement learning algorithm. This evolutionary learning process searches the space of possible rule sets to find an optimal policy as defined by the reward function.</Paragraph>
<Paragraph position="4"> XCS makes a number of modifications to Holland's original framework (Wilson, 1995). In this system, a classifier's fitness is based on the accuracy of its payoff prediction rather than on the prediction itself. Furthermore, the genetic algorithm operates within action sets rather than on the population as a whole. These aspects of XCS result in a more complete map of the state-action space than would be the case with strength-based classifier systems. Consequently, XCS often outperforms strength-based systems in sequential decision problems (Kovacs, 2000).</Paragraph>
</Section>
<Section position="4" start_page="173" end_page="174" type="metho">
<SectionTitle> 3 Experimental Methodology </SectionTitle>
<Paragraph position="0"> In this section we present a simple slot-filling system based on the hotel booking domain. The goal of the system is to acquire the values for three slots: the check-in date, the number of nights the user wishes to stay and the type of room required (single, twin, etc.). In slot-filling dialogues, an optimal strategy is one that interacts with the user in a satisfactory way while trying to minimise the length of the dialogue.</Paragraph>
<Paragraph position="1"> A fundamental component of user satisfaction is the system's prevention and repair of any miscommunication between it and the user. Consequently, our hotel booking system focuses on evolving essential slot confirmation strategies.</Paragraph>
<Paragraph position="2"> We devised an experimental framework for modelling the hotel system as a sequential decision task and used XCS to evolve three behaviours. Firstly, the system should execute its dialogue acts in a logical sequence: it should greet the user, ask for the slot information, present the query results and then finish the dialogue, in that order (Experiment 1).
Secondly, the system should try to acquire the slot values as quickly as possible while taking account of the possibility of misrecognition (Experiments 2a and 2b). Thirdly, to increase the likelihood of acquiring the slot values correctly, each one should be confirmed at least once (Experiments 3 and 4).</Paragraph>
<Paragraph position="3"> The reward function for Experiments 1, 2a and 2b was the same. During a dialogue, each non-terminal system action received a reward of zero. At the end of each dialogue, the final reward comprised three parts: (i) -1000 for each system turn; (ii) 100,000 if all slots were filled; (iii) 100,000 if the first system act was a greeting. In Experiments 3 and 4, an additional reward of 100,000 was assigned if all slots were confirmed.</Paragraph>
<Paragraph position="4"> The transition probabilities were modelled using two versions of a hand-coded simulated user. A very large number of test dialogues is usually required for learning optimal dialogue strategies, so simulated users are a practical alternative to employing human test users (Scheffler and Young, 2000; Lopez-Cozar et al., 2002). Simulated user A represented a fully cooperative user, always giving the slot information that was asked for. User B was less cooperative, giving no response 20% of the time. This allowed us to perform a two-fold cross-validation of the evolved strategies.</Paragraph>
<Paragraph position="5"> For each experiment we allowed the system's strategy to evolve over 100,000 dialogues with each simulated user. Dialogues were limited to a maximum of 30 system turns. We then tested each strategy with a further 10,000 dialogues, logging the total reward (payoff) for each test dialogue. Each experiment was repeated ten times.</Paragraph>
<Paragraph position="6"> In each experiment, the presentation of the query results and the closure of the dialogue were combined into a single dialogue act. Therefore, the dialogue acts available to the system in the first experiment were: Greeting, Query+Goodbye, Ask(Date), Ask(Duration) and Ask(RoomType). Four boolean variables were used to represent the state of the dialogue: GreetingFirst, DateFilled, DurationFilled, RoomFilled.</Paragraph>
<Paragraph position="7"> Experiment 2 added a new dialogue act: Ask(All).</Paragraph>
<Paragraph position="8"> The goal here was to ask for all three slot values at once if the probability of recognising them correctly was reasonably high; if the probability was low, the system should ask for the slots one at a time as before. This information was modelled in the simulated users by two probability variables, including Prob1SlotCorrect. Experiment 3 introduced explicit confirmation of the slot values, together with corresponding state variables such as RoomConfirmed; the goal here was for the system to learn to confirm each slot value after the user had first given it. Experiment 4 sought to reduce the dialogue length further by allowing the system to confirm one slot value while asking for another. Two new dialogue acts were available in this last experiment: Implicit Confirm(Date)+Ask(Duration) and Implicit Confirm(Duration)+Ask(RoomType).</Paragraph>
</Section>
</Paper>
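To make the ternary rule representation in Section 2 concrete, here is a minimal Python sketch of wildcard matching; the function name and representation are illustrative assumptions rather than the authors' implementation:

    def matches(condition: str, state: str) -> bool:
        """Return True if a ternary classifier condition matches a binary state.

        Each position of `condition` is '0', '1' or the wildcard '#';
        '#' matches either bit, so one condition aggregates several states.
        """
        return len(condition) == len(state) and all(
            c == '#' or c == s for c, s in zip(condition, state)
        )

    # The example from Section 2: the condition 1#1 matches states 111 and 101.
    assert matches("1#1", "111")
    assert matches("1#1", "101")
    assert not matches("1#1", "011")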
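The reward function quoted for the experiments can likewise be summarised in a short sketch. The figures come from Section 3 (-1000 per system turn, 100,000 for filling all slots, 100,000 for greeting first, and a further 100,000 in Experiments 3 and 4 when all slots are confirmed); the function signature and argument names are assumptions:

    def final_reward(system_turns: int,
                     all_slots_filled: bool,
                     greeted_first: bool,
                     all_slots_confirmed: bool = False,
                     confirmation_rewarded: bool = False) -> int:
        """Terminal reward for a completed dialogue; non-terminal actions receive zero."""
        reward = -1000 * system_turns              # (i) penalty per system turn
        if all_slots_filled:
            reward += 100_000                      # (ii) all three slots filled
        if greeted_first:
            reward += 100_000                      # (iii) greeting was the first system act
        if confirmation_rewarded and all_slots_confirmed:
            reward += 100_000                      # Experiments 3 and 4 only
        return reward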
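Finally, because XCS conditions are binary strings, the four boolean state variables of Experiment 1 (GreetingFirst, DateFilled, DurationFilled, RoomFilled) map naturally onto a 4-bit state that conditions like the one above can match. The bit ordering and helper below are assumptions for illustration:

    def encode_state(greeting_first: bool, date_filled: bool,
                     duration_filled: bool, room_filled: bool) -> str:
        """Pack the four boolean dialogue-state variables into a 4-bit string
        over {0,1} that a classifier condition over {0,1,#} can match."""
        bits = (greeting_first, date_filled, duration_filled, room_filled)
        return "".join("1" if b else "0" for b in bits)

    # After the greeting, with only the date slot filled:
    # encode_state(True, True, False, False) -> "1100",
    # which is matched by conditions such as 11## or 1#0#.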