<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1033"> <Title>Flexible Guidance Generation using User Model in Spoken Dialogue Systems</Title> <Section position="3" start_page="0" end_page="1" type="metho"> <SectionTitle> 2 Kyoto City Bus Information System </SectionTitle> <Paragraph position="0"> We have developed the Kyoto City Bus Information System, which locates the bus a user wants to take, and tells him/her how long it will take before its arrival. The system can be accessed via telephone including cellular phones . From any places, users can easily get the bus information that changes every minute. Users are requested to input the bus stop to get on, the destination, or the bus route number by speech, and get the corresponding bus information. The bus stops can be specified by the name of famous places or public facilities nearby. Figure 1 shows a simple example of the dialogue.</Paragraph> <Paragraph position="1"> Figure 2 shows an overview of the system.</Paragraph> <Paragraph position="2"> The system operates by generating VoiceXML scripts dynamically. The real-time bus information database is provided on the Web, and can be accessed via Internet. Then, we explain the modules in the following.</Paragraph> <Paragraph position="3"> VWS (Voice Web Server) The Voice Web Server drives the speech recognition engine and the TTS (Text-To-Speech) module according to the specifications by the generated VoiceXML.</Paragraph> <Paragraph position="4"> based on specified grammar rules and vocabulary, which are defined by VoiceXML at each dialogue state.</Paragraph> <Paragraph position="5"> Dialogue Manager The dialogue manager generates response sentences based on speech recognition results (bus stop names or a route number) received from the VWS. If sufficient information to locate a bus is obtained, it retrieves the corresponding information from the real-time bus information database.</Paragraph> <Paragraph position="6"> VoiceXML Generator This module dynamically generates VoiceXML files that contain response sentences and specifications of speech recognition grammars, which are given by the dialogue manager.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> User Model Identifier </SectionTitle> <Paragraph position="0"> This module classifies user's characters based on the user models using features specific to spoken dialogue as well as semantic attributes.</Paragraph> <Paragraph position="1"> The obtained user profiles are sent to the dialogue manager, and are utilized in the dialogue management and response generation.</Paragraph> </Section> </Section> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 Response Generation using User Models </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 Classification of User Models </SectionTitle> <Paragraph position="0"> We define three dimensions as user models listed below. null Skill level to the system Knowledge level on the target domain Degree of hastiness Skill Level to the System Since spoken dialogue systems are not widespread yet, there arises a difference in the skill level of users in operating the systems. It is desirable that the system changes its behavior including response generation and initiative management in accordance with the skill level of the user. 
<Paragraph position="1"> Knowledge Level on the Target Domain Users also differ in their knowledge level on the target domain. Thus, it is necessary for the system to change the information it presents to users. For example, it is not cooperative to present overly detailed information to visitors unfamiliar with the area. On the other hand, for residents, it is useful to omit obvious information and to add supplementary information. Therefore, we introduce a dimension that represents the knowledge level on the target domain. Degree of Hastiness In speech communication, it is more important to present information promptly and concisely than in other communication modes such as Web browsing. Especially in the bus system, conciseness is preferred because the bus information is urgent for most users. Therefore, we also take the user's degree of hastiness into account and change the system's responses accordingly.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.2 Response Generation Strategy using User Models </SectionTitle> <Paragraph position="0"> Next, we describe the response generation strategies adapted to individual users based on the proposed user models: skill level, knowledge level, and hastiness. The basic design of the dialogue management is mixed-initiative dialogue, in which the system asks follow-up questions and gives guidance when necessary while allowing the user to speak freely. Adding various contents to system responses as cooperative responses has been investigated in (Sadek, 1999).</Paragraph> <Paragraph position="1"> Such additional information is usually cooperative, but some users may find such responses redundant.</Paragraph> <Paragraph position="2"> Thus, we introduce the user models and control the generation of additional information. By introducing the proposed user models, the system changes the generated responses in two respects: the dialogue procedure and the contents of responses.</Paragraph> </Section> <Section position="3" start_page="1" end_page="1" type="sub_section"> <SectionTitle> Dialogue Procedure </SectionTitle> <Paragraph position="0"> The dialogue procedure is changed based on the skill level and the hastiness. If a user is identified as having a high skill level, the dialogue management is carried out in a user-initiated manner; namely, the system generates only open-ended prompts. On the other hand, when the user's skill level is detected as low, the system takes the initiative and prompts for the necessary items in order.</Paragraph> <Paragraph position="1"> When the degree of hastiness is low, the system confirms the input contents. Conversely, when the hastiness is detected as high, such a confirmation procedure is omitted.</Paragraph>
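As a rough illustration of this procedure selection, the sketch below (reusing the UserModel and Level classes introduced earlier) chooses between an open-ended prompt and slot-by-slot prompting from the skill level, and decides from the hastiness whether to confirm each input. The function names, slot names, and prompt strings are illustrative assumptions, not the actual implementation.

```python
# (uses the Level / UserModel classes defined in the earlier sketch)
def next_prompt(user_model: "UserModel", frame: dict) -> str | None:
    """Pick the next system prompt from the user model and the slot-filling state."""
    # High skill: user-initiated dialogue with a single open-ended prompt.
    if user_model.skill == Level.HIGH:
        return "Please tell me your bus stop, destination, or bus route number."
    # Low skill: system-initiated dialogue, prompting for the missing items in order.
    questions = {
        "boarding_stop": "Which bus stop will you get on at?",
        "destination": "Where will you get off the bus?",
        "route_number": "Which bus route number will you take?",
    }
    for slot, question in questions.items():
        if frame.get(slot) is None:
            return question
    return None  # all slots filled; proceed to the database lookup

def needs_confirmation(user_model: "UserModel") -> bool:
    """Confirm each input unless the user is identified as hasty."""
    return user_model.hastiness != Level.HIGH
```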
<Paragraph position="2"> Information that should be included in the system response can be classified into the following two items.</Paragraph> <Paragraph position="3"> 1. Dialogue management information 2. Domain-specific information
The dialogue management information specifies how to carry out the dialogue, including instructions on the user's expressions, such as &quot;Please reply with either yes or no.&quot;, and explanations about the subsequent dialogue procedure, such as &quot;Now I will ask in order.&quot; This dialogue management information is determined by the user's skill level to the system. The domain-specific information is generated according to the user's knowledge level on the target domain. Namely, for users unacquainted with the local area, the system adds an explanation about the nearest bus stop, and omits complicated contents such as a proposal of an alternative route.</Paragraph> <Paragraph position="4"> The contents described above are also controlled by the hastiness. For users who are not in a hurry, the system generates the additional contents as cooperative responses. On the other hand, for hasty users, these contents are omitted in order to prevent the dialogue from becoming redundant.</Paragraph> </Section> <Section position="4" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.3 Classification of User based on Decision Tree </SectionTitle> <Paragraph position="0"> In order to implement the proposed user models as a classifier, we adopt a decision tree. It is constructed with the decision tree learning algorithm C5.0 (Quinlan, 1993) from data collected by our dialogue system.</Paragraph> <Paragraph position="1"> Figure 3 shows the derived decision tree for the skill level.</Paragraph> <Paragraph position="2"> We use the features listed in Figure 4. They include not only semantic information contained in the utterances but also information specific to spoken dialogue systems, such as the silence duration prior to the utterance and the presence of barge-in. Except for the last category of Figure 4, which includes &quot;attribute of specified bus stops&quot;, most of the features are domain-independent.</Paragraph> <Paragraph position="3"> The classification of each dimension is done for every user utterance, except for the knowledge level. The model of a user can therefore change during a dialogue. Features extracted from utterances are accumulated as history information during the session.</Paragraph> <Paragraph position="4"> Figure 4: Features used for classification of the user models.
Features obtained from a single utterance:
- dialogue state (defined by the already filled slots)
- presence of barge-in
- elapsed time of the current utterance
- recognition result (something recognized / uncertain / no input)
- score of the speech recognizer
- the number of slots filled by the current utterance
Features obtained from the session:
- the number of utterances
- dialogue state of the previous utterance
- elapsed time from the beginning of the session
- the number of repetitions of the same question
- the average number of repetitions of the same question
- ratio of the total time of user utterances to the whole elapsed time
- ratio of barge-in occurrences to the whole number of utterances
- recognition result of the previous utterance
- ratio of something recognized
- ratio of uncertain results
- ratio of no input
- the number of barge-ins
- the number of something-recognized results
- the number of uncertain results
- the number of no-input results
- average of the recognition scores
- the maximum number of slots filled by a single utterance
Features obtained from the session (used only in the classification of knowledge level):
- attribute of the specified bus stops
- the way of specifying bus stops (whether a bus stop is specified by its exact name or not)
- ratio of each attribute of the specified bus stops
- ratio of each way of specifying bus stops</Paragraph>
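To illustrate how such features might be accumulated during a session and handed to the decision tree, here is a small hypothetical sketch; the record fields and derived feature names loosely follow Figure 4 but are assumptions, not the actual feature extractor.

```python
def session_features(history: list[dict]) -> dict:
    """Derive session-level features from per-utterance records (hypothetical)."""
    n = len(history)
    return {
        "num_utterances": n,
        "barge_in_ratio": sum(u["barge_in"] for u in history) / n,
        "no_input_ratio": sum(u["result"] == "no input" for u in history) / n,
        "avg_recognition_score": sum(u["score"] for u in history) / n,
        "max_filled_slots": max(u["filled_slots"] for u in history),
    }

# Example per-utterance records; the field names are illustrative assumptions.
history = [
    {"barge_in": False, "result": "recognized", "score": 0.72, "filled_slots": 1},
    {"barge_in": True,  "result": "no input",   "score": 0.0,  "filled_slots": 0},
]
print(session_features(history))  # fed, with per-utterance features, to the classifier
```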
<Paragraph position="5"> Figure 5 shows an example of the system behavior with the proposed user models. The skill level is classified as low by the decision tree because the first user utterance includes only one content word, so the dialogue procedure is changed to the system-initiated one. Similarly, the hastiness is classified as low by the decision tree, and the system therefore includes an explanation of the dialogue procedure and instructions on the expected expressions in its responses. These are omitted if the hastiness is identified as high.</Paragraph> <Paragraph position="6"> Figure 5: Example dialogue with the proposed user models. case: [skill level: low] [knowledge level: low] [hastiness: low]
Sys: Please tell me your current bus stop, your destination, or the specific bus route number.
User: Maruyama Park.
Sys: Let me confirm one by one. (The explanation is added by [skill level: low].) Will you get on at Maruyama Park? (The content is confirmed by [hastiness: low].) Please reply with yes or no. (The instruction is added by [skill level: low].)
User: Yes.
Sys: Where will you get off the bus? Please specify by a bus stop name or a well-known landmark. (The explanation is added by [skill level: low].)
User: Tofukuji temple.
Sys: Will you get off the bus at Tofukuji temple? Please reply with yes or no.
User: Yes.
Sys: The following buses are located. (The explanation is added by [skill level: low].) The nearest bus stop to Maruyama Park is Gion. (The domain information is added by [knowledge level: low].) Bus number 202 bound for Nishioji-Kujo has departed Higashiyama-Sanjo, which is two stops away.</Paragraph> </Section>
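The way the response contents in Figure 5 depend on the user model can be sketched roughly as follows. This is a hypothetical illustration reusing the UserModel and Level classes from the earlier sketch, with strings borrowed from the Figure 5 example only for concreteness; it is not the actual generation module.

```python
# (uses the Level / UserModel classes defined in the earlier sketch)
def assemble_guidance(user_model: "UserModel", boarding_stop: str,
                      nearest_stop: str, bus_info: str) -> str:
    """Compose the final guidance from user-model flags (hypothetical sketch)."""
    parts = []
    hurried = user_model.hastiness == Level.HIGH
    if user_model.skill == Level.LOW and not hurried:
        # explanation added for unskilled users ([skill level: low])
        parts.append("The following buses are located.")
    if user_model.knowledge == Level.LOW and not hurried:
        # local information added for users unfamiliar with the area ([knowledge level: low])
        parts.append(f"The nearest bus stop to {boarding_stop} is {nearest_stop}.")
    parts.append(bus_info)  # the core bus information is always presented
    return " ".join(parts)

print(assemble_guidance(UserModel(skill=Level.LOW), "Maruyama Park", "Gion",
                        "Bus number 202 bound for Nishioji-Kujo has departed "
                        "Higashiyama-Sanjo, which is two stops away."))
```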
<Section position="5" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.4 Decision Tree Learning for User Models </SectionTitle> <Paragraph position="0"> We train and evaluate the decision tree for the user models using dialogue data collected by our system.</Paragraph> <Paragraph position="1"> The data was collected from December 10th, 2001 to May 10th, 2002. The number of sessions (telephone calls) is 215, and the total number of utterances included in the sessions is 1492. We annotated the subjective labels by hand. The annotator judged the user models for every utterance based on the recorded speech data and logs. For each of the three dimensions described in section 3.3, a label was chosen from 'high', 'indeterminable', and 'low'. The annotated model of a user can change during a dialogue, especially from 'indeterminable' to 'low' or 'high'. The number of labeled utterances is shown in Table 1.</Paragraph> <Paragraph position="2"> Using the labeled data, we evaluated the classification accuracy of the proposed user models. All the experiments were carried out by 10-fold cross validation: one tenth of the data is used as the test data, the remainder is used as the training data, the process is repeated ten times, and the average accuracy is computed. The results are shown in Table 2. The conditions #1, #2 and #3 in Table 2 are as follows.</Paragraph> <Paragraph position="3"> #1: The 10-fold cross validation is carried out per utterance.</Paragraph> <Paragraph position="4"> #2: The 10-fold cross validation is carried out per session (call).</Paragraph> <Paragraph position="5"> #3: The accuracy is calculated under a more realistic condition: not over the three classes (high / indeterminable / low) but over the two classes that actually affect the dialogue strategies. For example, the accuracy for the skill level is calculated for the two classes: low and the others. For the knowledge level, the accuracy is calculated per dialogue session, because features such as the attribute of a specified bus stop are not obtained for every utterance. Moreover, in order to compensate for the unbalanced distribution of the training data, a cost proportional to the reciprocal of the number of samples in each class is introduced. With this cost, the chance rate of the two classes becomes 50%.</Paragraph> <Paragraph position="6"> The difference between conditions #1 and #2 is whether the training is carried out in a speaker-closed or speaker-open manner; the former shows better performance. The result under condition #3 shows useful accuracy for the skill level. The following features play an important part in the decision tree for the skill level: the number of slots filled by the current utterance, the presence of barge-in, and the ratio of no input. For the knowledge level, the recognition result (something recognized / uncertain / no input), the ratio of no input, and the way of specifying bus stops (whether a bus stop is specified by its exact name or not) are effective. The hastiness is classified mainly by three features: the presence of barge-in, the ratio of no input, and the elapsed time of the current utterance.</Paragraph> </Section> </Section>
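A comparable evaluation can be sketched with off-the-shelf tools. The snippet below uses scikit-learn's CART decision tree as a stand-in for C5.0, 10-fold cross validation per utterance (condition #1) and per session (condition #2), and class_weight="balanced" as an inverse-frequency cost analogous to the one described for condition #3. The arrays are random placeholders, not the actual corpus.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, GroupKFold

X = np.random.rand(1492, 20)                    # placeholder: one row per utterance
y = np.random.choice(["low", "other"], 1492)    # two-class setting of condition #3
sessions = np.random.randint(0, 215, 1492)      # placeholder session (call) ids

# class_weight="balanced" plays the role of the cost that is inversely
# proportional to the class frequencies, so the chance rate becomes 50%.
clf = DecisionTreeClassifier(class_weight="balanced")

# Condition #1: 10-fold cross validation per utterance (speaker-closed).
acc_per_utterance = cross_val_score(clf, X, y, cv=10).mean()

# Condition #2: 10-fold cross validation per session (speaker-open).
acc_per_session = cross_val_score(clf, X, y, cv=GroupKFold(n_splits=10),
                                  groups=sessions).mean()
print(acc_per_utterance, acc_per_session)
```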
<Section position="5" start_page="1" end_page="1" type="metho"> <SectionTitle> 4 Experimental Evaluation of the System with User Models </SectionTitle> <Paragraph position="0"> We evaluated the system with the proposed user models using 20 novice subjects who had never used the system before. The experiment was performed in a controlled laboratory environment. A headset microphone was used for the speech input.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.1 Experiment Procedure </SectionTitle> <Paragraph position="0"> First, we explained the outline of the system to the subjects and gave them a document describing the experimental conditions and the scenarios. We prepared two sets of eight scenarios. Subjects were requested to acquire the bus information using the system with and without the user models. In the scenarios, neither the concrete names of bus stops nor the bus numbers were given. For example, one of the scenarios was as follows: &quot;You are in Kyoto for sightseeing. After visiting the Ginkakuji temple, you go to Maruyama Park. Supposing such a situation, please get information on the bus.&quot; We also set constraints in order to vary the subjects' hastiness, such as &quot;Please hurry as much as possible in order to save the charge of your cellular phone.&quot; The subjects were also told to look over the questionnaire items before the experiment, and they filled them in after using each system. This aims to reduce the subjects' cognitive load and possible confusion due to switching between the systems (Over, 1999). The questionnaire consisted of eight items, for example, &quot;When the dialogue did not go well, did the system provide intelligible guidance?&quot; Each item was rated on a seven-point scale, from which the subject selected one value.</Paragraph> <Paragraph position="1"> Furthermore, subjects were asked to write down the obtained information: the name of the bus stop to get on, the bus number, and how long it would take before the bus arrives. With this procedure, we intended to make the experimental conditions close to realistic ones.</Paragraph> <Paragraph position="2"> (Table: dialogue duration in seconds and number of turns; group 1 with UM: 51.9 sec., 4.03 turns.) The subjects were divided into two groups; one half (group 1) used the system first with and then without the user models, and the other half (group 2) used it in the reverse order.</Paragraph> <Paragraph position="3"> The dialogue management in the system without user models is also based on mixed-initiative dialogue. The system generates follow-up questions and guidance when necessary, but behaves in a fixed manner. Namely, the additional cooperative contents corresponding to the skill level described in section 3.2 are not generated, and the dialogue procedure is changed only after recognition errors occur. The system without user models behaves equivalently to the initial state of the user models: the hastiness is low, the knowledge level is low, and the skill level is high.</Paragraph> </Section> </Section> </Paper>