<?xml version="1.0" standalone="yes"?> <Paper uid="P03-2005"> <Title>An Adaptive Approach to Collecting Multimodal Input</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The MMIF module </SectionTitle> <Paragraph position="0"> We developed a multimodal input fusion module to perform a user study. The MMIF module is based on the model proposed by Gupta (2003). The MMIF receives semantic information in the form of typed feature structures (Carpenter, 1992) from the individual modalities. It combines typed feature structures received from different modalities during a complete turn using an extended unification algorithm (Gupta et al., 2002). The output is a joint interpretation of the multimodal input that is sent to a dialogue manager (DM), which can perform reasoning and provide suitable system replies.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 End of turn prediction </SectionTitle> <Paragraph position="0"> Based on current approaches, the following methods were chosen for analysis to determine a suitable method for predicting the end of a user turn: 1. Windowing - In this method, after receiving an input, the MMIF waits a specified time (3 seconds in this study) for further input; the collected input is then integrated and sent to the DM. This is similar to Johnston et al. (2002), who use a 1-second wait period.</Paragraph> <Paragraph position="1"> 2. Two Inputs - In this method, multimodal input is assumed to consist of two inputs from two modalities. After inputs from two modalities have been received, the integration process is performed and the result sent to the DM. A window of 3 seconds is used after receiving the first input (Oviatt et al., 1997). 3. Information evaluation - In this method, integration is performed after receiving each input, and the result is evaluated to determine whether the information can be transformed into a command that the system can understand. If transformation is possible, the work of the MMIF is deemed complete and the information is sent to the DM. In the case of an incomplete transformation, a windowing technique is used. This approach is similar to that of Vo and Waibel (1997). A sketch of these three strategies is given below.</Paragraph> </Section> </Section>
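To make the three strategies concrete, the following is a minimal Python sketch of how they could be expressed. It is an illustration rather than the module's actual implementation; the TurnCollector class, the WINDOW_SECONDS constant, and the can_form_command hook are names introduced here for exposition.

# Illustrative sketch (not the authors' code) of the three end-of-turn
# strategies compared in Section 3.1. Inputs are assumed to arrive as
# (modality, feature_structure) pairs with their arrival time recorded.

import time

WINDOW_SECONDS = 3.0  # the 3-second window used in the study


class TurnCollector:
    """Collects inputs for one user turn and decides when the turn ends."""

    def __init__(self, strategy, can_form_command=None):
        # strategy: "windowing" | "two_inputs" | "info_evaluation"
        # can_form_command: hypothetical hook that checks whether the inputs
        # collected so far unify into a complete, executable command.
        self.strategy = strategy
        self.can_form_command = can_form_command
        self.inputs = []
        self.last_arrival = None

    def add_input(self, modality, feature_structure):
        self.inputs.append((modality, feature_structure))
        self.last_arrival = time.monotonic()

    def turn_complete(self):
        if not self.inputs:
            return False
        window_expired = (time.monotonic() - self.last_arrival) >= WINDOW_SECONDS
        if self.strategy == "windowing":
            # 1. Windowing: wait a fixed time after the most recent input.
            return window_expired
        if self.strategy == "two_inputs":
            # 2. Two inputs: a turn is assumed to be two inputs from two
            #    modalities, with the window as a fallback after the first.
            modalities = {m for m, _ in self.inputs}
            return len(modalities) >= 2 or window_expired
        if self.strategy == "info_evaluation":
            # 3. Information evaluation: integrate after every input and stop
            #    as soon as the result maps to a complete command; otherwise
            #    fall back to the window.
            if self.can_form_command and self.can_form_command(self.inputs):
                return True
            return window_expired
        raise ValueError(f"unknown strategy: {self.strategy}")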
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Use case study </SectionTitle> <Paragraph position="0"> We used a multimodal in-car navigation system (Gupta et al., 2002), developed using the MMIF module and a dialogue manager (Thompson and Bliss, 2000), to perform this study. Users can interact with a map-based display to get information on various locations and driving instructions. The interaction is performed using speech, handwriting, touch and gesture, either simultaneously or sequentially. The system was set up on a 650MHz computer with 256MB of RAM and a touch screen.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Subjects and Task </SectionTitle> <Paragraph position="0"> The subjects for the study were male and female, aged 25-35. All the subjects worked in technical fields and had daily interaction with computer-based systems at work.</Paragraph> <Paragraph position="1"> Before using the system, each subject was briefed about the tasks they needed to perform and given a demonstration of the system.</Paragraph> <Paragraph position="2"> The tasks performed by the subjects were: * Dialogue with the system to specify a few different destinations, e.g. a gas station, a hotel, an address, etc., and * Issue commands to control the map display, e.g. zoom to a certain area on the map.</Paragraph> <Paragraph position="3"> Some of the tasks could be completed either unimodally or multimodally, while others required multiple inputs from the same modality, e.g. providing multiple destinations using touch. We asked the users to perform certain tasks in both a unimodal and a multimodal manner. The users were free to choose their preferred mode of interaction for a particular task. We observed users' behavior during the interaction. The subjects answered a few questions after every interaction on the acceptability of the system response; if it was not acceptable, we asked for their preference.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Observations </SectionTitle> <Paragraph position="0"> The following observations were made during and after analysis of the user study, based on aggregate results from all three methods of collecting multimodal input.</Paragraph> <Paragraph position="1"> Multimodality These observations were critical to understanding the nature of multimodal input.</Paragraph> <Paragraph position="2"> * Multimodal commands and dialogue usually consisted of two or three segments of information from the modalities.</Paragraph> <Paragraph position="3"> * Users tried to maintain synchronization between their inputs in multiple modalities by closely following cross-modal references with the referred object. Each user almost consistently preferred either to speak first and then touch, or vice versa, implying a preferred interaction style.</Paragraph> <Paragraph position="4"> * Sometimes it took a long time for some modalities to produce a semantic representation after capturing information (e.g. when there was a long spoken input or when the system was used on lower-end machines). In such cases the MMIF module did not collect all the inputs in that turn, because it received some input after a long interval from the previous input(s).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> User preference </SectionTitle> <Paragraph position="0"> * Users became impatient when the system did not respond within a certain time, and they tried to re-enter the input because the system state was not being displayed to them.</Paragraph> <Paragraph position="1"> * During certain stages of interaction, the user could only interact with the system unimodally. In those cases they preferred that the system did not wait.</Paragraph> <Paragraph position="2"> Performance of various schemes The performance of the various methods for predicting the completion of the user turn depended on the kind of activity the user was performing. A multimodal command is defined as multimodal input that can be translated into a system action without the need for dialogue, for example, zooming in on a certain area of a map. 
On the other hand, multimodal dialogue involved multi-turn interaction in which the user guided the system (or was guided by the system) to provide information or to perform some action.</Paragraph> <Paragraph position="3"> * When a multimodal command was issued, the user preferred the &quot;information evaluation&quot; and &quot;two inputs&quot; methods. This was because most multimodal commands were issued using two modalities. The &quot;windowing&quot; method suffered from a delayed response from the system, and the user got the impression that the system had not captured their input.</Paragraph> <Paragraph position="4"> * During multimodal dialogue the performance of the &quot;two inputs&quot; method was poor, as a multimodal turn sometimes had more than two inputs. Multimodal dialogue usually did not result in the evaluation of a complete command, so the performance of the &quot;information evaluation&quot; technique was similar to that of &quot;windowing&quot;.</Paragraph> <Paragraph position="5"> Efficiency * When users acted unimodally, it took them longer than the average time required to provide the same information multimodally.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Measurements </SectionTitle> <Paragraph position="0"> Several statistical measures were extracted from the data collected during the user study.</Paragraph> <Paragraph position="1"> Multimodality The total number of user turns was 112, of which 83% contained multimodal input. This shows an overwhelming preference for multimodal interaction, comparable to the 86% recorded by Oviatt et al. (1997). 95% of the time users used only two modalities in a turn, usually with multiple inputs in the same modality. Of the multimodal turns, 75% had only two inputs, and the rest had more than two. To provide multimodal input, speech and touch/gesture were used 80% of the time, handwriting and gesture 15% of the time, and speech and handwriting 5% of the time.</Paragraph> <Paragraph position="2"> Temporal analysis During multimodal interaction, 45% of inputs overlapped each other in time, while the remaining 55% followed the previous input after some delay. This reinforces the earlier finding of 42% simultaneous multimodal inputs (Oviatt et al., 1997). The average time between the start of simultaneous inputs in two different modalities was 1.5 seconds. This also matches the earlier observation of a 1.4-second lag between the end of pen input and the start of speech (Oviatt et al., 1997). The average duration of a multimodal turn was 2.5 seconds, not including the time delay needed to determine the end of turn. The average delay to determine the end of the user turn during multimodal interaction was 2.3 seconds.</Paragraph> <Paragraph position="3"> Efficiency We observed that unimodal commands took 18% longer to issue than multimodal commands, implying that multimodal input is faster. For example, it is easier to point to a location on a map using touch than to describe it using speech, and a long sentence also decreases the probability of correct recognition. This is consistent with the 10% faster task performance for multimodal interaction recorded by Oviatt et al. (1997). 
</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Robustness </SectionTitle> <Paragraph position="0"> We labeled as errors the cases where the MMIF did not produce the expected result or where not all the inputs were collected. In 8% of the observed turns, users tried to repeat their input because the observed response from the system was slow. In another 6% of observed turns, not all the input from that turn was collected properly: 4% because an input modality took a long time to process user input (possibly due to a resource shortfall), and the remaining 2% because the user took a long time between multimodal inputs.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Analysis </SectionTitle> <Paragraph position="0"> Following an analysis of the above observations and measurements, we came to the following conclusions: * Multimodal input is segmented, with the user making a conscious effort to synchronize inputs in multiple modalities. The synchronization technique applied is unique to each user. Multimodal input is likely to have a limited number of segments provided in different modalities.</Paragraph> <Paragraph position="1"> * Processing time can be a key element for MMIF when deploying multimodal interactive systems on devices with limited resources.</Paragraph> <Paragraph position="2"> * Knowledge of the availability of the current modalities and of the task at hand can improve the performance of MMIF. Based on the current task for which the user has provided input, different techniques should be applied to determine the end of the user turn.</Paragraph> <Paragraph position="3"> * Users need to be made aware of the status of the MMIF and the modes available to them. A uniform interface design methodology should be used, keeping all modalities available at all times.</Paragraph> <Paragraph position="4"> * Timing between inputs in different modalities is critical to determining the exact relationship between the reference and the referred object.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Temporal relationship </SectionTitle> <Paragraph position="0"> Based on the observations, a fine-grained classification of the temporal relationship between user inputs is proposed. The temporal relationship is defined as the way in which the modalities are used in time during interaction. Figure 2 shows the various temporal relationships between feature structures received from the modalities; A, B, C, D, E and F are all feature structures, and their extent denotes the capture period. These relationships will allow for a better prediction of when, and in which modality, the user is likely to provide the next input; a sketch of the four relations as predicates is given after the list.</Paragraph> <Paragraph position="1"> * Temporally subsumes - A feature structure X temporally subsumes another feature structure Y if all time points of Y are contained in X. In the figure, D temporally subsumes E.</Paragraph> <Paragraph position="2"> * Temporally intersects - A feature structure X temporally intersects another feature structure Y if there is at least one time point contained in both of them; however, the end point of X is not contained in Y and the start point of Y is not contained in X. In the figure, B and C temporally intersect each other.</Paragraph> <Paragraph position="3"> * Temporally disjoint - A feature structure X is temporally disjoint from another feature structure Y if there are no time points in common between X and Y. In the figure, B and F are temporally disjoint.</Paragraph> <Paragraph position="4"> * Contiguous - A feature structure X is contiguous with another feature structure Y if X starts immediately after Y ends. The two events have no time points in common, but there is no time point between them. For example, in the figure, A is contiguous with B.</Paragraph> </Section> </Section>
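As a concrete reading of these definitions, the following minimal Python sketch expresses the four relations as predicates over feature structures that record the start and end of their capture period. The FeatureStructure type and the function names are assumptions introduced here for illustration, not the paper's API.

# Sketch of the temporal relations of Section 5.1 over capture intervals.
from dataclasses import dataclass


@dataclass
class FeatureStructure:
    start: float  # capture start time (seconds)
    end: float    # capture end time (seconds)


def subsumes(x, y):
    """X temporally subsumes Y: every time point of Y lies inside X."""
    return x.start <= y.start and y.end <= x.end


def intersects(x, y):
    """X and Y overlap in time, but neither subsumes the other
    (partial overlap, in either order)."""
    return (x.start < y.start < x.end < y.end) or (y.start < x.start < y.end < x.end)


def disjoint(x, y):
    """X and Y have no time points in common (a gap separates them)."""
    return x.end < y.start or y.end < x.start


def contiguous(x, y):
    """X starts immediately after Y ends: no shared points, no gap."""
    return x.start == y.end


# Example in the spirit of Figure 2 (values are hypothetical):
# D = FeatureStructure(1.0, 4.0), E = FeatureStructure(2.0, 3.0)
# subsumes(D, E) -> True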
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Enhancement to MMIF </SectionTitle> <Paragraph position="0"> We proposed to augment the MMIF component with a wait mechanism that collects information from the input modalities and adaptively determines the time at which no further input is expected. The following factors were used in the design of the adaptive wait mechanism: 1. If the modality is specialized (i.e. it is usually used unimodally), then the likelihood of getting information in another modality is greatly reduced.</Paragraph> <Paragraph position="1"> 2. If the modality usually occurs in combination with other modalities, then the likelihood of receiving information in another modality is increased.</Paragraph> <Paragraph position="2"> 3. If the number of segments of information within a turn is more than two or three, then the likelihood of receiving further information from other modalities is reduced.</Paragraph> <Paragraph position="3"> 4. If the duration of information in a certain modality is greater than usual, it is likely that the user has provided most of the information in that modality in a unimodal manner.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Dynamic Time Windows </SectionTitle> <Paragraph position="0"> The enhanced method is the same as the information evaluation method, except that instead of a static time window, a dynamic time window based on the current input and previous learning is used.</Paragraph> <Paragraph position="1"> Time Window prediction A statistical linear predictor was incorporated into the MMIF. This linear predictor provided a dynamic time window estimate of the time to wait for further information. The linear prediction (see Figure 2) was based on statistical averages such as the time required by a modality i to process information (AvgProcTime_i) and the time between modalities i and j becoming active (AvgTimeDiff_ij). The forward prediction coefficients (c_i) were based on the modalities predicted to be used or active next, the current modality used, and the temporal relationship between the predicted and current modality; a sketch of this computation is given below.</Paragraph> </Section>
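The paper does not give the predictor's exact form, so the following sketch only illustrates the idea under stated assumptions: the wait time is a coefficient-weighted combination of per-modality statistics such as AvgProcTime_i and AvgTimeDiff_ij, with the coefficient chosen from the expected temporal relationship between the current and predicted modalities. All numeric values, names, and the exact combination are illustrative, not those of the paper.

# Hedged sketch of a dynamic time window estimate (Section 6.1).
AVG_PROC_TIME = {"speech": 1.2, "handwriting": 1.8, "touch": 0.3}     # AvgProcTime_i (s), assumed
AVG_TIME_DIFF = {("speech", "touch"): 1.5, ("touch", "speech"): 1.4}  # AvgTimeDiff_ij (s), assumed


def predict_wait_time(current_modality, predicted_modality, relationship,
                      default_window=3.0):
    """Return how long the MMIF should wait for further input, in seconds."""
    if predicted_modality is None:
        # No further modality is expected: do not extend the wait.
        return 0.0
    # The forward prediction coefficient depends on how the predicted modality
    # usually relates in time to the current one (values are assumptions).
    coeff = {"temporally_intersects": 0.5,
             "contiguous": 1.0,
             "temporally_disjoint": 1.5}.get(relationship, 1.0)
    proc = AVG_PROC_TIME.get(predicted_modality, default_window)
    lag = AVG_TIME_DIFF.get((current_modality, predicted_modality), 0.0)
    return coeff * (proc + lag)


# Example: after a touch input, speech is predicted to follow after a lag.
window = predict_wait_time("touch", "speech", "temporally_disjoint")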
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Bayesian Learning </SectionTitle> <Paragraph position="0"> Machine learning techniques were employed to learn the preferred interaction style of each user.</Paragraph> <Paragraph position="1"> The preferred user interaction style included the most probable modality (or modalities) to be used next and their temporal relationship. Since there is considerable uncertainty in the knowledge of the preferred interaction style, a Bayesian network approach to learning was used. The nodes in the Bayesian network were the following: a) the modality currently being used; b) the type of the current input (i.e. the type of semantic structure); c) the number of inputs within the current turn; d) the time spent since the beginning of the current turn (made discrete in four segments); e) the modality to be used next; f) the temporal relationship with the next modality; and g) whether the time spent in the current modality is greater than average (true or false). Learning was applied to the network using data collected during previous user testing. Learning was also applied online using data from previous user turns, thus adapting to the current user.</Paragraph> </Section> </Section> </Paper>