<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0210">
  <Title>Adaptive Dialogue Systems - Interaction with Interact</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Agent-based architecture
</SectionTitle>
    <Paragraph position="0"> To allow system development with reusable modules, flexible application building and easy combination of different techniques, the framework must itself be designed specifically to support adaptivity. We argue in favour of a system architecture using highly specialized agents, and use the Jaspis adaptive speech application framework (Turunen and Hakulinen, 2000; Turunen and Hakulinen, 2001a). Compared to e.g.</Paragraph>
    <Paragraph position="1"> Galaxy (Seneff et al., 1998), the system supports more flexible component communication. The system is depicted in Figure 1.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Information Storage
</SectionTitle>
      <Paragraph position="0"> The Jaspis architecture contains several features which support adaptive applications. First of all, the information about the system state is kept in a shared knowledge base called Information Storage. This blackboard-type information storage can be accessed by each system component via the Information Manager, which allows them to utilize all the information that the system contains, such as dialogue history and user profiles, directly. Since the important information is kept in a shared place, system components can be stateless, and the system can switch between them dynamically. Information Storage thus facilitates the system's adaptation to different internal situations, and it also enables the most suitable component to be chosen to handle each situation.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Flexible Component Management
</SectionTitle>
      <Paragraph position="0"> The system is organized into modules which contain three kinds of components: managers, agents and evaluators. Each module contains one manager which co-ordinates component interaction inside the module. The present architecture implements e.g. the Input/Output Manager, the Dialogue Manager and the Presentation Manager, and they have different priorities which allow them to react to the interaction flow differently. The basic principle is that whenever a manager stops processing, all managers can react to the situation, and based on their priorities, one of them is selected. There is also the Interaction Manager which coordinates applications on the most general level.</Paragraph>
      <Paragraph position="1"> The number and type of modules that can be connected to the system is not limited. The Interaction Manager handles all the connections between modules and the system can be distributed for multiple computers. In Interact we have built a demonstration application on bus-timetable information which runs on several platforms using different operating systems and programming languages. This makes the system highly modular and allows experiments with different approaches from multiple disciplines.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Interaction Agents and Evaluators
</SectionTitle>
      <Paragraph position="0"> Inside the modules, there are several agents which handle various interaction situations such as speech output presentations and dialogue decisions. These interaction agents can be very  specialized, e.g. they deal only with speech recognition errors or outputs related to greetings. They can also be used to model different interaction strategies for the same task, e.g. different dialogue agents can implement alternative dialogue strategies and control techniques. Using specialized agents it is possible to construct modular, reusable and extendable interaction components that are easy to implement and maintain. For example, different error handling methods can be included to the system by constructing new agents which handle errors using alternative approaches. Similarly, we can support multilingual outputs by constructing presentation agents that incorporate language specific features for each language, while implementing general interaction techniques, such as error correction methods, to take care of error situations in speech applications in general (Turunen and Hakulinen, 2001b).</Paragraph>
      <Paragraph position="1"> The agents have different capabilities and the appropriate agent to handle a particular situation at hand is selected dynamically based on the context. The choice is done using evaluators which determine applicability of the agents to various interaction situations. Each evaluator gives a score for every agent, using a scale between [0,1]. Zero means that an agent is not suitable for the situation, one means that an agent is perfectly suitable for the situation, values between zero and one indicate the level of suitability. Scaling functions can be used to emphasize certain evaluators over the others The scores are then multiplied, and the final score, a suitability factor, is given for every agent. Since scores are multiplied, an agent which receives zero from one evaluator is useless for that situation. It is possible to use different approaches in the evaluation of the agents, and for instance, the dialogue evaluators are based on reinforcement learning.</Paragraph>
      <Paragraph position="2"> Simple examples of evaluators are for instance presentation evaluators that select presentation agents to generate suitable implicit or explicit confirmations based on the dialogue history and the system's knowledge of the user. Another example concerns dialogue strategies: the evaluators may give better scores for system-initiative agents if the dialogue is not proceeding well with the user-initiative dialogue style, or the evaluators may prefer presentation agents which give more detailed and helpful information, if the users seem to have problems in communicating with the application.</Paragraph>
      <Paragraph position="3"> Different evaluators evaluate different aspects of interaction, and this makes the evaluation process highly adaptive itself: there is no single evaluator which makes the final decision. Instead, the choice of the appropriate interaction agent is a combination of different evaluations.</Paragraph>
      <Paragraph position="4"> Evaluators have access to all information in the Information Storage, for example dialogue history and other contextual information, and it is also possible to use different approaches in the evaluation of the agents (such as rule-based and statistical approaches). Evaluators are the key concept when considering the whole system and its adaptation to various interaction situations.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Distributed Input and Output
</SectionTitle>
      <Paragraph position="0"> The input/output subsystem is also distributed which makes it possible to use several input and output devices for the same purposes. For example, we can use several speech recognition engines, each of which with different capabilities, to adapt the system to the user's way of talking. The system architecture contains virtual devices which abstract the actual devices, such as speech recognizers and speech synthesizers. From the application developers viewpoint this makes it easy to experiment with different modalities, since special agents are used to add and interpret modality specific features. It is also used for multilingual inputs and outputs, although the Interact project focuses on Finnish speech applications.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Natural Language Capabilities
</SectionTitle>
    <Paragraph position="0"> The use of Finnish as an interaction language brings special problems for the system's natural language understanding component. The extreme multiplicity of word forms prevents the use of all-including dictionaries. For instance, a Finnish noun can theoretically have around 2200, and a verb around 12000 different forms (Karlsson, 1983). In spoken language these numbers are further increased as all the different ways to pronounce any given word come into consideration (Jauhiainen, 2001). Our dialogue system is designed to understand both written and spoken input.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Written and spoken input
</SectionTitle>
      <Paragraph position="0"> The different word forms are analyzed using Fintwol, the two-level morphological analyzer for Finnish (Koskenniemi, 1983). The forms are currently input to the syntactic parser CPARSE (Carlson, 2001). However, the flexible system architecture also allows us to experiment with different morphosyntactic analyzers, such as TextMorfo (Kielikone Oy 1999) and Conexor FDG (Conexor Oy 1997-2000), and we plan to run them in parallel as separate competing agents to test and compare their applicability as well as the Jaspis architecture in the given task.</Paragraph>
      <Paragraph position="1"> We use the Lingsoft Speech Recognizer for the spoken language input. The current state of the Finnish speech recognizer forces us to limit the user's spoken input to rather restricted vocabulary and utterance structure, compared to the unlimited written input. The system uses full word lists which include all the morphological forms that are to be recognized, and a strict context-free grammar which dictates all the possible utterance structures. We are currently exploring possibilities for a HMM-based language model, with the conditional probabilities determined by a trigram backoff model.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Language analysis
</SectionTitle>
      <Paragraph position="0"> The task of the parsing component is to map the speaker utterances into task-relevant domain concepts which are to be processed by the dialogue manager. The number of domain concepts concerning the demonstration system's application domain, bus-timetables, is rather small and contains e.g. bus, departure-time and arrival-location. However, semantically equivalent utterances can of course vary in the lexical elements they contain, and in written and especially in spoken Finnish the word order in almost any given sentence can also be changed without major changes on the semantic level understood by the system (the difference lies in the information structure of the utterance). For instance, the request How does one get to Malmi? can be realised as given in Table 1.</Paragraph>
      <Paragraph position="1"> There are two ways to approach the problem: on one hand we can concentrate on finding the keywords and their relevant word forms, on the other hand we can use more specialized syntactic analyzers. At the moment we use CPARSE as the syntactic analyzer for text-based input.</Paragraph>
      <Paragraph position="2"> The grammar has been adjusted for the demon- null p&amp;quot;a&amp;quot;asee Malmille bussilla? 'How does-one-get to-Malmi by bus? stration system so that it especially looks for phrases relevant to the task at hand. For instance, if we can correctly identify the inflected word form Malmille from the input string, we can be quite certain of the user wishing to know something about getting to Malmi.</Paragraph>
      <Paragraph position="3"> The current speech input does not go through any special morpho-syntactic analysis because of the strict context-free grammar used by the speech recognizer. The dictionary used by the recognizer is tagged with the needed morphological information and the context-free rules are tagged with the needed syntactic information.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Language generation
</SectionTitle>
      <Paragraph position="0"> The language generation function is located in the system's Presentation Manager module. Unlike language analysis, for which different existing Finnish morphosyntactic analyzers can be used, there are no readily available general-purpose Finnish language generators. We are therefore developing specific generation components for this project. The flexible system architecture allows us to experiment with different generators.</Paragraph>
      <Paragraph position="1"> Unfortunately the existing Finnish syntactic analyzers have been designed from the outset as &amp;quot;parsing grammars&amp;quot;, which are difficult or impossible to use for generation. However, the two-level morphology model (Koskenniemi, 1983) is in principle bi-directional, and we are working towards its use in morphological generation. Fortunately there is also an existing Finnish speech synthesis project (Vainio, 2001), which we can use together with the language generators. null Some of our language generation components use the XML-based generation framework described by Wilcock (2001), which has the advantage of integrating well with the XML-based system architecture. The generator starts from an agenda which is created by the dialogue manager, and is available in the system's Information Storage in XML format. The agenda contains a list of semantic concepts which the dialogue manager has tagged as Topic or NewInfo.</Paragraph>
      <Paragraph position="2"> From the agenda the generator creates a response plan, which passes through the generation pipeline stages for lexicalization, aggregation, referring expressions, syntactic and morphological realization. At all stages the response specification is XML-based, including the final speech markup language which is passed to the speech synthesizer.</Paragraph>
      <Paragraph position="3"> The system architecture allows multiple generators to be used. In addition to the XML-based pipeline components we have some pregenerated outputs, such as greetings at the start and end of the dialogue or meta-acts such as wait-requests and thanking. We are also exploiting the agent-based architecture to increase the system's adaptivity in response generation, using the level of communicative confidence as described by Jokinen and Wilcock (2001).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Recognition of Discussion Topic
</SectionTitle>
    <Paragraph position="0"> One of the important aspects of the system's adaptivity is that it can recognize the correct topic that the user wants to talk about. By 'topic' we refer to the general subject matter that a dialogue is about, such as 'bus timetables' and 'bus tickets', realized by particular words in the utterances. In this sense, individual documents or short conversations may be seen to have one or a small number of topics, one at a time.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Topically ordered semantic space
</SectionTitle>
      <Paragraph position="0"> Collections of short documents, such as newspaper articles, scientific abstracts and the like, can be automatically organized onto document maps utilizing the Self-Organizing Map algorithm (Kohonen, 1995). The document map methodology has been developed in the WEB-SOM project (Kohonen et al., 2000), where the largest map organized consisted of nearly 7 million patent abstracts.</Paragraph>
      <Paragraph position="1"> We have applied the method to dialogue topic recognition by carring out experiments on 57 Finnish dialogues, recorded from the customer service phone line of Helsinki City Transport and transcribed manually into text. The dialogues are first split into topically coherent segments (utterances or longer segments), and then organized on a document map. On the ordered map, each dialogue segment is found in a specific map location, and topically similar dialogue segments are found near it. The document map thus forms a kind of topically ordered semantic space. A new dialogue segment, either an utterance or a longer history, can likewise be automatically positioned on the map. The coordinates of the best-matching map unit may then be considered as a latent topical representation for the dialogue segment.</Paragraph>
      <Paragraph position="2"> Furthermore, the map units can be labeled using named topic classes such as 'timetables' and 'tickets'. One can then estimate the probability of a named topic class for a new dialogue segment by construing a probability model defined on top of the map. A detailed description of the experiments as well as results can be found in (Lagus and Kuusisto, 2002).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Topic recognition module
</SectionTitle>
      <Paragraph position="0"> The topical semantic representation, i.e. the map coordinates, can be used as input for the dialogue manager, as one of the values of the current dialogue state. The system architecture thus integrates a special topic recognition module that outputs the utterance topic in the Information Storage. For a given text segment, say, the recognition result from the speech recognizer, the module returns the coordinates of the best-matching dialogue map unit as well as the most probable prior topic category (if prior categorization was used in labeling the map).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Dialogue Management
</SectionTitle>
    <Paragraph position="0"> The main task of the dialogue manager component is to decide on the appropriate way to react to the user input. The reasoning includes recognition of communicative intentions behind the user's utterances as well as planning of the system's next action, whether this is information retrieval from a database or a question to clarify an insufficiently specified request. Natural interaction with the user also means that the system should not produce relevant responses only in terms of correct database facts but also in terms of rational and cooperative reactions. The system could learn suitable interaction strategies from its interaction with the user, showing adaptation to various user habits and situations.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Constructive Dialogue Model
</SectionTitle>
      <Paragraph position="0"> A uniform basis for dialogue management can be found in the communicative principles related to human rational and coordinated interaction (Allwood et al., 2000; Jokinen, 1996).</Paragraph>
      <Paragraph position="1"> The speakers are engaged in a particular activity, they have a certain role in that activity, and their actions are constrained by communicative obligations. They act by exchanging new information and constructing a shared context in which to resolve the underlying task satisfactorily. null The model consists of a set of dialogue states, defined with the help of dialogue acts, observations of the context, and reinforcement values. Each action results in a new dialogue state. The dialogue act, Dact, describes the act that the speaker performs by a particular utterance, while the topic Top and new information NewInfo denote the semantic content of the utterance and are related to the task domain. Together these three create a useful first approximation of the utterance meaning by abstracting over possible linguistic realisations. Unfilled task goals TGoals keep track of the activity related information still necessary to fulfil the underlying task (a kind of plan), and the speaker information is needed to link the state to possible speaker characteristics. The expectations, Expect are related to communicative obligations, and used to constrain possible interpretations of the next act. Consequently, the system's internal states can be reduced to a combination of these categories, all of which form an independent source of information for the system to decide on the next move.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Dialogue agents and evaluators
</SectionTitle>
      <Paragraph position="0"> A dialogue state and all agents that contribute to a dialogue state are shown in Figure 2. The Dialogue Model is used to classify the current utterance into one of the dialogue act categories (Jokinen et al., 2001), and to predict the next dialogue acts (Expect). The Topic Model recognizes the domain, or discussion topic, of the user input as described above.</Paragraph>
      <Paragraph position="1">  All domains out of the system's capabilities are handled with the help of a special OutOfDomain-agent which informs the user of the relevant tasks and possible topics directly. This allows the system to deal with error situations, such as irrelevant user utterances, efficiently and flexibly without invoking the Dialogue Manager to evaluate appropriate dialogue strategies. The information about error situations and the selected system action is still available for dialogue and task goal management through the shared Information Storage.</Paragraph>
      <Paragraph position="2"> The utterance Topic and New Information (Topic, NewInfo) of the relevant user utterances are given by the parsing unit, and supplemented with discourse knowledge by ellipsis and anaphora resolution agents (which are Input Agents). Task related goals are produced by Task Agents, located in a separate Task Manager module. They also access the backend database, the public transportation timetables of Helsinki.</Paragraph>
      <Paragraph position="3"> The Dialogue Manager (DM) consists of agents corresponding to possible system actions (Figure 3). There are also some agents for internal system interaction, illustrated in the figure with a stack of agents labeled with Agent1. One agent is selected at a time, and the architecture permits us to experiment with various competing agents for the same subtask: the evaluators are responsible for choosing the one that best fits in the particular situation.</Paragraph>
      <Paragraph position="4">  Two types of evaluators are responsible for choosing the agent in DM, and thus implementing the dialogue strategy. The QEstimate evaluator chooses the agent that has proven to be most rewarding so far, according to a Q-learning (Watkins and Dayan, 1992) algorithm with on-line epsilon1-greedy policy (Sutton and Barto, 1998). That agent is used in the normal case and the decision is based on the dialogue state presented in Figure 2. The underlying structure of the QEstimate evaluator is illustrated in Figure 4.</Paragraph>
      <Paragraph position="5"> The evaluator is based on a table of real values, indexed by dialogue states, and updated after each dialogue. The agent with the highest  value for the current dialogue state gets selected. Adaptivity of the dialogue management comes from the reinforcement learning algorithm of this evaluator.</Paragraph>
      <Paragraph position="6"> On the other hand, if one of the error evaluators (labeled with Error1..N) detects that an error has occurred, the QEstimate evaluator is overridden and a predetermined agent is selected to handle the error situation (Figure 5). In these cases, only the the correct agent is given a non-zero value, forcing the dialogue manager to select that agent. Examples of such errors include situations when the user utterance is not recognized by the speech recognizer, its topic is irrelevant to the current domain, or its interpretation is inconsistent with the dialogue context.</Paragraph>
      <Paragraph position="7">  Because all possible system actions are reusable agents, we can easily implement a different dialogue management strategy by adding evaluators, or replacing the current QEstimate evaluator. We are developing another strategy based on recurrent self-organizing maps, that learns to map dialogue states to correct actions by fuzzy clustering, minimizing the amount of human labor in designing the dialogue strategy.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Demo System and Future Work
</SectionTitle>
    <Paragraph position="0"> The project deals with both speech and text communication, interfacing with the user by telephone and web-applications. An initial integrated system with basic functionality and interaction capabilities concerning bus timetable inquiries has been built. Sample interactions are given in Figures 6 and 7. Various other application types such as intelligent question-answer systems, automatic call-centers, and other services that would benefit from flexible natural interaction can also be envisaged.</Paragraph>
    <Paragraph position="1"> The current system is the first prototype and intended to provide useful basis for further experimentation and system development. The planned activities include further investigations of the limitations and advantages of various approaches for language technology, as well as usability studies concerning real users and users with special needs.</Paragraph>
    <Paragraph position="2"> (1) UT: mmill&amp;quot;as bussilla m&amp;quot;a p&amp;quot;a&amp;quot;asen malamille 'with which bus do I get to Malami' UR: mill&amp;quot;a bussilla p&amp;quot;a&amp;quot;asen malmille 'with which bus do I get to Malmi' SU: mist&amp;quot;a olittekaan l&amp;quot;ahd&amp;quot;oss&amp;quot;a malmille 'where do you want to leave for Malmi'  SU: olen pahoillani osaan vastata vain bussiaikatauluja koskeviin kysymyksiin 'sorry I can reply only to questions concerning bus timetables' UT: aha no joo jaa kiitos hei 'I see ok thank you bye' UR: onko aijaa kiitos hei 'is I see thank you bye' SU: kiitos hei</Paragraph>
    <Paragraph position="4"/>
  </Section>
class="xml-element"></Paper>