<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-2009">
  <Title>An ISU Dialogue System Exhibiting Reinforcement Learning of Dialogue Policies: Generic Slot-filling in the TALK In-car System</Title>
  <Section position="3" start_page="0" end_page="119" type="metho">
    <SectionTitle>
2 System Overview
</SectionTitle>
    <Paragraph position="0"> The baseline dialogue system is built around the DIPPER dialogue manager (Bos et al., 2003). This system is initially used to conduct information-seeking dialogues with a user (e.g. find a particular hotel and restaurant), using hand-coded dialogue strategies (e.g.</Paragraph>
    <Paragraph position="1"> always use implicit confirmation, except when ASR confidence is below 50%, then use explicit confirmation). We have then modified the DIPPER dialogue manager so that it can consult learnt strategies (for example strategies learnt from the 2000 and 2001 COMMUNICATOR data (Lemon et al., 2005)), based on its current information state, and then execute dialogue actions from those strategies. This allows us to compare hand-coded against learnt strategies within the same system (i.e. the other components such as the speech synthesiser, recogniser, GUI, etc. all remain fixed).</Paragraph>
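    <Paragraph>The hand-coded confirmation rule mentioned above can be illustrated with a minimal sketch. The function name and the 0.5 default are assumptions made for illustration only; they are not taken from the DIPPER implementation.

```python
# Hypothetical sketch of the hand-coded confirmation rule described in the text.
# The function name and default threshold are illustrative, not part of DIPPER.

def choose_confirmation(asr_confidence: float, threshold: float = 0.5) -> str:
    """Return 'explicit' when ASR confidence is below the threshold, else 'implicit'."""
    if threshold > asr_confidence:
        return "explicit"
    return "implicit"
```

A learnt strategy would replace this single fixed rule with a decision that depends on the whole information state.</Paragraph>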
    <Section position="1" start_page="119" end_page="119" type="sub_section">
      <SectionTitle>
2.1 Overview of System Features
</SectionTitle>
      <Paragraph position="0"> The following features are currently implemented:</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="119" end_page="119" type="metho">
    <SectionTitle>
3 Research Issues
</SectionTitle>
    <Paragraph position="0"> The work presented here explores a number of research themes, in particular: using learnt dialogue policies, learning dialogue policies in online interaction with users, fragmentary clarification, and reconfigurability.</Paragraph>
    <Section position="1" start_page="119" end_page="119" type="sub_section">
      <SectionTitle>
3.1 Moving between Domains:
COMMUNICATOR and In-car Dialogues
</SectionTitle>
      <Paragraph position="0"> The learnt policies in (Henderson et al., 2005) focussed on the COMMUNICATOR system for flight-booking dialogues. There we reported learning a promising initial policy for COMMUNICATOR dialogues, but the issue arises of how we could transfer this policy to new domains - for example the in-car domain.</Paragraph>
      <Paragraph position="1"> In the in-car scenarios the genre of &quot;information seeking&quot; is central. For example the SACTI corpora (Stuttle et al., 2004) have driver information requests (e.g. searching for hotels) as a major component.</Paragraph>
      <Paragraph position="2"> One question we address here is to what extent dialogue policies learnt from data gathered for one system, or family of systems, can be re-used or adapted for use in other systems. We conjecture that the slot-filling policies learnt from our experiments with COMMUNICATOR will also be good policies for other slot-filling tasks - that is, that we are learning &quot;generic&quot; slot-filling or information-seeking dialogue policies. In section 5 we describe how the dialogue policies learnt for slot filling on the COMMUNICATOR data set can be abstracted and used in the in-car scenarios.</Paragraph>
    </Section>
    <Section position="2" start_page="119" end_page="119" type="sub_section">
      <SectionTitle>
3.2 Fragmentary Clarifications
</SectionTitle>
      <Paragraph position="0"> Another research issue we have been able to explore in constructing this system is the generation of fragmentary clarifications. The system can be run with this feature switched on or off (off for comparison with COMMUNICATOR systems). Instead of a system simply saying &quot;Sorry, please repeat that&quot; or some similar simple clarification request when there is a speech recognition failure, we were able to use the word confidence scores output by the ATK speech recogniser to generate more intelligent fragmentary clarification requests such as &quot;Did you say a cheap Chinese restaurant?&quot;. This works by obtaining an ASR confidence score for each recognised word. We are then able to try various techniques for clarifying the user utterance.</Paragraph>
      <Paragraph position="1"> Many possibilities arise, for example: explicitly clarify only the highest scoring content word below the rejection threshold, or, implicitly clarify all content words and explicitly clarify the lowest scoring content word.</Paragraph>
      <Paragraph position="2"> The current platform enables us to test alternative strategies, and develop more complex ones.</Paragraph>
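      <Paragraph>The two example strategies above can be sketched as word-selection functions over per-word confidence scores. This is only an illustration under stated assumptions: the tuple layout, the function names, the threshold, and the confidence values are hypothetical and do not reflect the ATK output format or the system's actual code.

```python
from typing import List, Optional, Tuple

# Hypothetical word record: (token, ASR confidence, is_content_word).
Word = Tuple[str, float, bool]

def highest_below_threshold(words: List[Word], threshold: float) -> Optional[str]:
    """Pick the highest-scoring content word that is still below the threshold."""
    candidates = [(conf, tok) for tok, conf, is_content in words
                  if is_content and threshold > conf]
    if not candidates:
        return None
    return max(candidates)[1]

def lowest_content_word(words: List[Word]) -> Optional[str]:
    """Pick the lowest-scoring content word, to be clarified explicitly."""
    candidates = [(conf, tok) for tok, conf, is_content in words if is_content]
    if not candidates:
        return None
    return min(candidates)[1]

# Made-up confidence values for the example utterance in the text.
utterance = [("a", 0.9, False), ("cheap", 0.35, True),
             ("chinese", 0.45, True), ("restaurant", 0.8, True)]
```

With a rejection threshold of 0.5, the first function selects "chinese" and the second selects "cheap" for this made-up input; generating the actual clarification request from the selected word is left out of the sketch.</Paragraph>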
    </Section>
  </Section>
  <Section position="5" start_page="119" end_page="119" type="metho">
    <SectionTitle>
4 The &quot;In-car&quot; Scenario
</SectionTitle>
    <Paragraph position="0"> The scenario we have designed the system to cover is that of information seeking about a town, for example its hotels, restaurants, and bars. We imagine a driver who is travelling towards this town, or is already there, who wishes to accomplish relatively complex tasks, such as finding an Italian restaurant near their hotel, or finding all the wine bars in town, and so on. The driver/user should be able to specify queries using natural dialogue, and will receive system output that is a mixture of spoken and graphical information (e.g. a description of an item and a map showing its location).</Paragraph>
    <Paragraph position="1"> The example town is taken from the (Stuttle et al., 2004) corpus collection materials, and contains a number of hotels, bars, restaurants, and tourist attractions. The user should be able to get information on a range of locations in the town, and the dialogue system will be used to specify and refine the user queries, as well as to present information to the user. See the example dialogue in table 1.</Paragraph>
    <Paragraph position="2"> We now describe the dialogue system components.</Paragraph>
  </Section>
  <Section position="6" start_page="119" end_page="121" type="metho">
    <SectionTitle>
5 Component-level Description
</SectionTitle>
    <Paragraph position="0"> This section describes the components of the baseline in-car dialogue system. Communication between components is handled by OAA's asynchronous hub architecture (Cheyer and Martin, 2001). The major compo-</Paragraph>
    <Section position="1" start_page="119" end_page="119" type="sub_section">
      <SectionTitle>
5.1 Dialogue Policy Learner Agent
</SectionTitle>
      <Paragraph position="0"> This agent acts as an interface between the DIPPER dialogue manager and the system simulation based on RL. In particular it has the following solvable: callRLsimulation(IS file name, conversational domain, speech act, task, result).</Paragraph>
      <Paragraph position="1"> The first argument is the name of the file that contains all information about the current information state, which is required by the RL algorithm to produce an action. The action returned by the RL agent is a combination of conversational domain, speech act, and task. The last argument shows whether the learnt policy will continue to produce more actions or release the turn. When run in online learning mode the agent not only produces an action when supplied with a state, but at the end of every dialogue it uses the reward signal to update its learnt policy. The reward signal is defined in the RL agent, and is currently a linear combination of task success metrics combined with a fixed penalty for dialogue length (see (Henderson et al., 2005)).</Paragraph>
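      <Paragraph>The shape of such a reward signal can be sketched as follows. The metric names, weights, and the reading of the "fixed penalty for dialogue length" as a constant cost per turn are assumptions for illustration; the actual definition is the one given in (Henderson et al., 2005).

```python
from typing import Dict

def dialogue_reward(metrics: Dict[str, float], weights: Dict[str, float],
                    num_turns: int, turn_penalty: float) -> float:
    """Weighted sum of task success metrics minus a constant penalty per turn.

    All names and the per-turn reading of the length penalty are hypothetical.
    """
    success = sum(weights[name] * value for name, value in metrics.items())
    return success - turn_penalty * num_turns
```

For example, with hypothetical weights of 10 per filled slot and 20 for a successful database query, a dialogue that fills 4 slots, retrieves results, and lasts 8 turns at a penalty of 1 per turn would score 52.</Paragraph>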
      <Paragraph position="2"> This agent can be called whenever the system has to decide on the next dialogue move. In the original hand-coded system this decision is made by way of a dialogue plan (using the &quot;deliberate&quot; solvable). The RL agent can be used to drive the entire dialogue policy, or can be called only in certain circumstances. This makes it usable for whole dialogue strategies, but also, if desired, it can be targeted only on specific dialogue management decisions (e.g. implicit vs. explicit confirmation, as was done by (Litman et al., 2000)).</Paragraph>
      <Paragraph position="3"> One important research issue is that of transferring learnt strategies between domains. We learnt a strategy for the COMMUNICATOR flight booking dialogues (Lemon et al., 2005; Henderson et al., 2005), but these dialogues arise from rather different scenarios than the in-car dialogues. However, both are &quot;slot-filling&quot; or information-seeking applications. We defined a mapping (described below) between the states and actions of both systems, in order to construct an interface between the learnt policies for COMMUNICATOR and the in-car baseline system.</Paragraph>
    </Section>
    <Section position="2" start_page="119" end_page="121" type="sub_section">
      <SectionTitle>
5.2 Mapping between COMMUNICATOR and
the In-car Domains
</SectionTitle>
      <Paragraph position="0"> There are 2 main problems to be dealt with here: (1) mapping between in-car system information states and COMMUNICATOR information states, and (2) mapping between learnt COMMUNICATOR system actions and in-car system actions.</Paragraph>
      <Paragraph position="1"> The learnt COMMUNICATOR policy tells us, based on a current IS, what the optimal system action is (for example request info(dest city) or acknowledgement). Obviously, in the in-car scenario we have no use for task types such as &quot;destination city&quot; and &quot;departure date&quot;. Our method therefore is to abstract away from the particular details of the task type, but to maintain the information about dialogue moves and the slot numbers that are under discussion. That is, we construe the learnt COMMUNICATOR policy as a policy concerning how to fill up to 4 (ordered) informational slots, and then access a database and present results to the user. We also note that some slots are more essential than others. For example, in COMMUNICATOR it is essential to have a destination city, otherwise no results can be found for the user.</Paragraph>
      <Paragraph position="2"> Likewise, for the in-car tasks, we consider the food-type, bar-type, and hotel-location to be more important to fill than the other slots. This suggests a partial ordering on slots via their importance for an application. In order to do this we define the mappings shown in table 2 between COMMUNICATOR dialogue actions and in-car dialogue actions, for each sub-task type of the in-car system.</Paragraph>
      <Paragraph position="3"> Note that we treat each of the 3 in-car sub-tasks (hotels, restaurants, bars) as a separate slot-filling dialogue thread, governed by COMMUNICATOR actions. This means that the very top level of the dialogue (&quot;How may I help you&quot;) is not governed by the learnt policy. Only when we are in a recognised task do we ask the COMMUNICATOR policy for the next action. Since the COMMUNICATOR policy is learnt for 4 slots, we &quot;prefill&quot; a slot in the IS when we send it to the Dialogue Policy Learner Agent in order to retrieve an action.</Paragraph>
      <Paragraph position="4"> As for the state mappings, these follow the same principles. That is, we abstract from the in-car states to form states that are usable by COMMUNICATOR. This means that, for example, an in-car state where food-type and food-price are filled with high confidence is mapped to a COMMUNICATOR state where dest-city and depart-date are filled with high confidence, and all other state information is identical (modulo the task names). Note that in a future version of the in-car system where task switching is allowed we will have to maintain a separate view of the state for each task.</Paragraph>
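      <Paragraph>The state abstraction described above amounts to renaming slots while keeping their status. The sketch below assumes a flat dictionary representation of slot status, which is a simplification of a real information state; only the two slot pairs named in the text are included, and the underscore spellings and function name are hypothetical.

```python
from typing import Dict

# Only these two pairs are stated in the text; any further pairs would have to
# come from the paper's mapping table. Underscore spellings are illustrative.
RESTAURANT_TO_COMMUNICATOR = {
    "food_type": "dest_city",
    "food_price": "depart_date",
}

def to_communicator_state(in_car_state: Dict[str, str],
                          slot_map: Dict[str, str]) -> Dict[str, str]:
    """Rename in-car slots to COMMUNICATOR slots, keeping each slot's status."""
    return {slot_map[slot]: status
            for slot, status in in_car_state.items() if slot in slot_map}
```

Slots with no entry in the map are dropped in this sketch; a fuller version would also carry over the task-independent parts of the state unchanged.</Paragraph>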
      <Paragraph position="5"> In terms of the integration of the learnt policies with the DIPPER system update rules, we have a system flag which states whether or not to use a learnt policy. If this flag is present, a different update rule fires when the system determines what action to take next. For example, instead of using the deliberate predicate to access a dialogue plan, we instead call the Dialogue Policy Learner Agent via OAA, using the current Information State of the system. This will return a dialogue action to the DIPPER update rule.</Paragraph>
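      <Paragraph>The flag-based choice between the two routes can be pictured as a simple dispatch. This is a schematic sketch only: DIPPER update rules are not written in Python, and the function and parameter names below are hypothetical stand-ins for the deliberate predicate and the OAA call to the Dialogue Policy Learner Agent.

```python
from typing import Callable, Dict

State = Dict[str, str]

def next_action(state: State, use_learnt_policy: bool,
                deliberate: Callable[[State], str],
                call_policy_agent: Callable[[State], str]) -> str:
    """Choose the next dialogue action from the plan or from the learnt policy."""
    if use_learnt_policy:
        # Stand-in for the call to the Dialogue Policy Learner Agent.
        return call_policy_agent(state)
    # Stand-in for consulting the hand-coded dialogue plan.
    return deliberate(state)
```

Passing the two routes in as callables keeps the rest of the system unchanged when the flag is toggled, which mirrors the comparison setup described in the text.</Paragraph>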
      <Paragraph position="6"> In current work we are evaluating how well the learnt policies work for real users of the in-car system.</Paragraph>
    </Section>
  </Section>
</Paper>