<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2051"> <Title>Spontaneous Speech Understanding for Robust Multi-Modal Human-Robot Communication</Title> <Section position="5" start_page="391" end_page="392" type="metho"> <SectionTitle> 3 Situated Dialog Corpus </SectionTitle> <Paragraph position="0"> With our robot BIRON we want to improve social and functional behavior by enabling the system to carry out a more sophisticated dialog for handling instructions. One scenario is a home-tour where a user is supposed to show the robot around the home. Another scenario is a plant-watering task, where the robot is instructed to water different plants. There is only little research on multi-modal HRI with speech-based robots. A study how users interact with mobile of ce robots is reported in (Hcurrency1uttenrauch et al., 2003). However, in this evaluation, the integration of different modalities was not analyzed explicitly. But even though the subjects were not allowed to use speech and gestures in combination, the results support that people tended to communicate in a multi-modal way, nevertheless.</Paragraph> <Paragraph position="1"> To receive more detailed information about the instructions that users are likely to give to an assistant in home or of ce we simulated this scenario and recorded 14 dialogs from German native speakers. Their task was to instruct the robot to water plants. Since our focus in this stage of the development of our system lies on the situatedness of the conversation, the robot was simply replaced by a human pretending to be a robot. The subjects were asked to act as if it would be a robot. As proposed in (Lauriar et al., 2001), a preliminary user study is necessary to reduce the number of repair dialogs between user and system, such as queries.</Paragraph> <Paragraph position="2"> The corpus provides data necessary for the design of the dialog components for multi-modal interaction. We also determined the lexicon and obtained the SSUs that describe the scene and tasks for the robot.</Paragraph> <Paragraph position="3"> The recorded dialogs feature the speci c nature of dialog situations in multi-modal communication situations. The analysis of the corpus is presented in more detail in (Hcurrency1uwel and Kummert, 2004). It con rms that spontaneously spoken utterances seldom respect the standard grammar and structure of written sentences. People tend to use short phrases or single words. Large pauses often occur during an utterance or the utterance is incomplete. More interestingly, the multi-modal data shows that 13 out of 14 persons used pointing gestures in the dialogs to refer to objects. Such utterances cannot be interpreted without additional information of the scene. For example, an utterance such as this one is used with a pointing gesture to an object in the environment. We realize, of course, that for more realistic behavior towards a robot a real experiment has to be performed. However this time- and resource-ef cient procedure allowed us to build a system capable of facilitating situated communication with a robot.</Paragraph> <Paragraph position="4"> The implemented system has been evaluated with a real robot (see section 7). 
In the earlier version the dialog system used German; it has now been adapted to English.</Paragraph> </Section> <Section position="6" start_page="392" end_page="393" type="metho"> <SectionTitle> 4 The Robot Assistant BIRON </SectionTitle> <Paragraph position="0"> The aim of our project is to enable intuitive interaction between a human and a mobile robot. The basis for this project is the robot system BIRON (Haasch et al., 2004). The robot is able to visually track persons and to detect and localize sound sources.</Paragraph> <Paragraph position="1"> The robot expresses its focus of attention by turning the camera towards the person currently speaking. From the orientation of the person's head it is deduced whether or not the speaker addresses the robot. The main modality of the robot system is speech, but the system can also detect gestures and objects. Figure 1 gives an overview of the architecture of BIRON's multi-modal interaction system. For the communication between these modules we use an XML-based communication framework (Fritsch et al., 2005).</Paragraph> <Paragraph position="2"> In the following we briefly outline the modules of the dialog system that interact with the speech understanding component.</Paragraph> <Paragraph position="3"> Speech recognition: If the user addresses BIRON by looking in its direction and starting to speak, the speech recognition system starts to analyze the speech data. This means that once the attention system has detected that the user is probably addressing the robot, it routes the speech signal to the speech recognizer. The end of the utterance is detected by a voice activity detector. Since both components can produce errors, the speech signal sent to the recognizer may contain wrong or truncated parts of speech. The speech recognition itself is performed with an incremental speaker-independent system (Wachsmuth et al., 1998) based on Hidden Markov Models. It combines statistical and declarative language models to compute the most likely word chain.</Paragraph> <Paragraph position="4"> Dialog manager: The dialog manager serves as the interface between speech analysis and the robot control system. It also generates answers for the user. The speech analysis system transforms utterances, with respect to gestural and scene information such as pointing gestures or objects in the environment, into instructions for the robot. The dialog manager in our application is agent-based and enables a multi-modal, mixed-initiative interaction style (Li et al., 2005). It is based on semantic entities which reflect the information the user uttered as well as discourse information based on speech acts. The dialog system classifies this input into different categories, e.g., instruction, query, or social interaction. For this purpose we use the discourse segments proposed by Grosz and Sidner (1986) to describe the kind of utterances during the interaction. The dialog manager can then react appropriately, since it knows whether the user asked a question or instructed the robot. As gesture and object detection in our scenario is not very reliable and is time-consuming, the system needs verbal hints of scene information, such as pointing gestures or object descriptions, in order to gather information from the gesture detection and object attention systems.</Paragraph>
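As an illustration of the category-based reaction described above, the following is a minimal sketch in Python; it is not the system's actual implementation, and the class, function, and message names as well as the reliability threshold are hypothetical. Only the discourse categories themselves are taken from the text.

    # Minimal sketch of category-based dispatch in a dialog manager.
    # The categories follow the paper (instruction, query, socialization);
    # all names, messages, and the 0.5 threshold are assumptions.

    from dataclasses import dataclass, field
    from typing import Dict


    @dataclass
    class Interpretation:
        """Result of speech understanding as delivered to the dialog manager."""
        category: str                                        # e.g. "instruction", "query", "socialization"
        slots: Dict[str, str] = field(default_factory=dict)  # e.g. {"action": "look", "object": "cup"}
        score: float = 0.0                                    # reliability rating of the interpretation


    def react(interp: Interpretation) -> str:
        """Choose a reaction depending on the discourse category."""
        if interp.score < 0.5:
            return "clarify: Sorry, could you repeat that?"
        if interp.category == "instruction":
            # Forward a command to the robot control system,
            # e.g. "look for a blue object" or "follow person".
            return f"command: {interp.slots}"
        if interp.category == "query":
            return "answer: generate an answer for the user"
        if interp.category == "socialization":
            return "answer: respond socially (greeting, small talk)"
        # Unknown or fragmentary input: ask for more information.
        return "clarify: Could you give me more information?"


    if __name__ == "__main__":
        print(react(Interpretation("instruction", {"action": "look", "object": "cup"}, 0.9)))
        print(react(Interpretation("fragment", {}, 0.2)))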
</Section> <Section position="7" start_page="393" end_page="394" type="metho"> <SectionTitle> 5 Situated Concept Representations </SectionTitle> <Paragraph position="0"> Based on the situated conversational data, we designed situated semantic units (SSUs) which are suitable for fast and automatic speech understanding. These SSUs basically establish a network of strong (mandatory) and weak (optional) relations between semantic concepts which represent world and discourse knowledge. They also provide ontological information and additional structures for the integration of other modalities. Our structures are inspired by the idea of frames, which provide semantic relations between parts of sentences (Fillmore, 1976).</Paragraph> <Paragraph position="1"> So far, about 1,300 lexical entries related to 150 SSUs are stored in our database. Both are represented as XML structures.</Paragraph> <Paragraph position="2"> The lexicon and the concept database are based on our experimental data of situated communication (see section 3) and also on data from a home-tour scenario with a real robot. This data has been annotated by hand with the aim of providing an appropriate foundation for human-robot interaction.</Paragraph> <Paragraph position="3"> It is also planned to integrate more tasks for the robot, e.g., a courier service. This can be done simply by adding new lexical entries and corresponding SSUs, without spending much time on reorganization. Each lexical entry in our database contains a semantic association to the related SSUs.</Paragraph> <Paragraph position="4"> Therefore, several lexical entries with the same word form are provided for homonyms, as they are associated with different concepts.</Paragraph> <Paragraph position="5"> In figure 2 the SSU Showing has an open link to the SSUs Actor and Object. [Figure 2: SSUs for utterances like "I show you my poster tomorrow".]</Paragraph> <Paragraph position="6"> Missing links to strongly connected SSUs are interpreted as missing information and are thus indicators for the dialog management system to initiate a clarification question or to look for information already stored in the scene model (see fig. 1). The SSUs also have connections to optional arguments, but these are less important for the overall understanding process. The SSUs also include ontological information, so that the relations between SSUs can be described as generally as possible. For example, the SSU Building subpart is a sub-category of Object.</Paragraph> <Paragraph position="7"> In our scenario this is important because, for example, the unit Building subpart related to the concept wall has a fixed position and, in contrast to other objects, can be used as navigation support. The top category is stored in the entry top, a special item of the SSU. Through the ontological information, SSUs also differentiate between task-related and communication-related information and thereby support the strategy of the dialog manager to decouple the task structure from the communication structure. This is important in order to make the dialog system independent of the task and to enable scalable interaction capabilities. For example, the SSU Showing belongs to the discourse type Instruction. Other types important for our domain are Socialization, Description, Confirmation, Negation, Correction, and Query.</Paragraph> <Paragraph position="8"> Further types may be included, if necessary.</Paragraph>
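To make the structure of such units more concrete, here is a minimal sketch of an SSU as a data structure, assuming Python; the actual system stores SSUs as XML, and the field names used here (top, parent, discourse_type, strong_links, weak_links) are hypothetical. Only the example units and their relations are taken from the text above.

    # Minimal sketch of a situated semantic unit (SSU), assuming Python.
    # The real system represents SSUs and the lexicon as XML; the field
    # names below are assumptions for illustration only.

    from dataclasses import dataclass, field
    from typing import List, Optional


    @dataclass
    class SSU:
        name: str                     # e.g. "Showing"
        top: str                      # top ontological category stored in the entry "top"
        parent: Optional[str] = None  # e.g. "Building subpart" is a sub-category of "Object"
        discourse_type: Optional[str] = None                    # e.g. "Instruction", "Query"
        strong_links: List[str] = field(default_factory=list)   # mandatory relations
        weak_links: List[str] = field(default_factory=list)     # optional relations


    # "Showing" requires an Actor and an Object; a temporal argument is optional.
    showing = SSU(name="Showing", top="Action", discourse_type="Instruction",
                  strong_links=["Actor", "Object"], weak_links=["Time"])

    # A wall is a Building subpart, a sub-category of Object with a fixed
    # position, and can therefore serve as navigation support.
    wall = SSU(name="Building subpart", top="Object", parent="Object")


    # A missing strong link (e.g. no Object found in the utterance) signals the
    # dialog manager to ask a clarification question or consult the scene model.
    def missing_info(ssu: SSU, filled: List[str]) -> List[str]:
        return [link for link in ssu.strong_links if link not in filled]

    print(missing_info(showing, filled=["Actor"]))   # -> ['Object']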
<Paragraph position="9"> In our domain, missing information in an utterance can often be acquired from the scene. For example, the utterance "look at this" and a pointing gesture towards a table will be merged into the meaning "look at the table". To resolve this meaning, we use hints of co-verbal gestures in the utterance. Words such as "this one" or "here" are linked to the SSU Potential gesture, indicating a relation between speech and gesture. The timestamp of the utterance enables the temporal alignment of speech and gesture. Since gesture recognition is expensive in computing time and often not well-defined, such linguistic hints can reduce these costs dramatically.</Paragraph> <Paragraph position="10"> The word "that" can also represent an anaphor and is therefore analyzed in both ways, as anaphor and as gesture hint. Only if there is no gesture does the dialog manager decide that the word was probably used in an anaphoric manner.</Paragraph> <Paragraph position="11"> Since we focus on spontaneous speech, we cannot rely on the grammar, and therefore the semantic units serve as the connections between the words in an utterance. If there are open connections interpretable as missing information, it can be inferred what is missing, and this can be integrated from contextual knowledge. This structure makes it easy to merge the constituents of an utterance solely by semantic relations, without additional knowledge of the syntactic properties. In doing so, we lose information that might in some cases be necessary for the disambiguation of complex utterances. However, spontaneous speech is hard to parse, especially since speech recognition errors often occur on syntactically relevant morphemes.</Paragraph> <Paragraph position="12"> We therefore neglect these cases, which tend to occur very rarely in HRI scenarios.</Paragraph> </Section> <Section position="8" start_page="394" end_page="395" type="metho"> <SectionTitle> 6 Semantic Processing </SectionTitle> <Paragraph position="0"> In order to generate a semantic interpretation of an utterance, we use a special mechanism which unifies the words of an utterance into a single structure. The system also considers the ontological information of the SSUs to generate the most likely interpretation of the utterance. For this purpose, the mechanism first associates the lexical entries of all words in the utterance with the corresponding SSUs. Then the system tries to link all SSUs together into one connected structure. Some SSUs provide open links to other SSUs, which can be filled by semantically related SSUs. The SSU Beside, for example, provides an open link to Object.</Paragraph> <Paragraph position="1"> This SSU can be linked to all Object entities and to all subtypes of Object. Thus, an utterance such as "next to the door" can be linked together to form a single structure (see fig. 3). The SSUs which possess open links are central to this mechanism: they represent roots for parts of the utterance. These roots can in turn be connected by other roots, generating a tree that represents the semantic relations inside an utterance.</Paragraph> <Paragraph position="2"> The fusion mechanism runs in linear time in the best case and in quadratic time in the worst case.</Paragraph> <Paragraph position="3"> A scoring function underlies the mechanism: the more words that can be combined, the better the rating. The system finally chooses the structure with the highest score. Thus, it is possible to handle semantic variations of an utterance in parallel, such as homonyms. Additionally, the rating is helpful for deciding whether the speech recognition result is reliable or not. In that case, the dialog manager can ask the user for clarification. In the next version we will use a more elaborate evaluation technique to yield better results, such as rating the number of concept relations and missing relations, distinguishing between important and optional relations, and preferring relations to nearby words.</Paragraph>
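The following is a minimal sketch of such a fusion step, assuming Python; the toy lexicon, the greedy link-filling strategy, and the word-count scoring are simplified stand-ins for the mechanism described above, not the system's actual implementation.

    # Minimal sketch of the fusion mechanism, assuming Python.
    # Lexicon entries map word forms to SSUs; SSUs with open links act as
    # roots that absorb semantically compatible SSUs. The score simply
    # counts how many words could be combined, as described above.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple


    @dataclass
    class Node:
        ssu: str                          # e.g. "Beside", "Object"
        word: str                         # surface form, e.g. "next to", "door"
        open_links: List[str] = field(default_factory=list)
        children: List["Node"] = field(default_factory=list)


    # Hypothetical lexicon: word form -> (SSU name, open links).
    LEXICON: Dict[str, Tuple[str, List[str]]] = {
        "next to": ("Beside", ["Object"]),
        "door": ("Object", []),
        "show": ("Showing", ["Actor", "Object"]),
    }


    def fuse(words: List[str]) -> List[Node]:
        """Greedily attach each new node to an earlier node with a matching open link."""
        roots: List[Node] = []
        for w in words:
            ssu, links = LEXICON.get(w, ("Unknown", []))
            node = Node(ssu=ssu, word=w, open_links=list(links))
            for root in roots:
                if node.ssu in root.open_links:       # fill an open (strong) link
                    root.open_links.remove(node.ssu)
                    root.children.append(node)
                    break
            else:
                roots.append(node)                    # becomes a new root
        return roots


    def score(roots: List[Node]) -> int:
        """More combined words -> better rating (count words attached under a root)."""
        def count(n: Node) -> int:
            return 1 + sum(count(c) for c in n.children)
        return sum(count(r) for r in roots if r.children)


    roots = fuse(["next to", "door"])
    print(roots[0].ssu, "->", [c.ssu for c in roots[0].children])  # Beside -> ['Object']
    print("score:", score(roots))

In the real system, alternative structures (e.g., for homonyms) are built in parallel and the highest-scored one is passed on, as described above.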
<Paragraph position="4"> A converter forwards the result of the mechanism as an XML structure to the dialog manager. A segment of the result for the dialog manager is presented in Figure 4. Using the category descriptions, the dialog module can react quickly to the user's utterance without any further computation. It uses them to create inquiries to the user or to send commands to the robot control system, such as "look for a gesture", "look for a blue object", or "follow person". If the interpreted utterance does not fit any category, it receives the value "fragment". Such utterances are currently treated in the same way as partial understandings, and the dialog manager asks the user to provide more meaningful information.</Paragraph> <Paragraph position="5"> Figure 1 illustrates the entire architecture of the speech understanding system and its interfaces to other modules. The SSUs and the lexicon are stored in external XML databases. When the speech understanding module starts, it first reads these databases and converts them into internal data structures stored in a hash table for fast access. As soon as the module receives results from speech recognition, it starts the merging process. The mechanism also uses a history in which former parts of utterances are stored; these are integrated into the fusion mechanism as well. The speech understanding system then converts the best-scored result into a semantic XML structure (see Figure 4) for the dialog manager.</Paragraph> <Paragraph position="6"> [Figure 4: Resulting structures for the utterances "what can you do" and "this is a green cup".]</Paragraph> <Section position="1" start_page="395" end_page="395" type="sub_section"> <SectionTitle> 6.1 Situated Speech Processing </SectionTitle> <Paragraph position="0"> Our approach has various advantages for dealing with spontaneous speech. Doubly uttered words, as in the utterance "look - look here", are ignored by our approach: the system can still interpret the utterance, and then only one of the words is linked to the other words. Corrections inside an utterance, as in "the left em right cube", are handled similarly. The system generates two interpretations of the utterance, one containing "left", the other "right". It chooses the latter, since we assume that corrections occur later in time and therefore further to the right. The system deals with pauses inside utterances by integrating former parts of utterances stored in the history. The mechanism also processes incomplete or syntactically incorrect utterances. To prevent wrong interpretations from being sent to the dialog manager, the scoring function rates the quality of the interpretation as described above. In our system we also use scene information to evaluate the overall correctness, so that we do not have to rely on the speech input alone. In case of doubt, the dialog manager queries the user.</Paragraph>
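As a rough illustration of these repair heuristics, the following sketch (Python) collapses doubled words and resolves self-corrections by preferring the later alternative. It is not the actual mechanism, which builds competing interpretations and scores them; the helper names and the set of correction markers are assumptions.

    # Minimal sketch of two repair heuristics for spontaneous speech, assuming
    # Python: doubled words are collapsed, and for self-corrections the later
    # (rightmost) alternative is preferred. Marker words are hypothetical.

    from typing import List

    CORRECTION_MARKERS = {"em", "uh", "er"}   # assumed filler/correction cues


    def collapse_doubles(words: List[str]) -> List[str]:
        """'look look here' -> 'look here': ignore immediately repeated words."""
        out: List[str] = []
        for w in words:
            if not out or out[-1] != w:
                out.append(w)
        return out


    def resolve_correction(words: List[str]) -> List[str]:
        """'the left em right cube' -> 'the right cube': keep the later alternative."""
        out: List[str] = []
        for i, w in enumerate(words):
            if w in CORRECTION_MARKERS and out and i + 1 < len(words):
                out.pop()          # drop the corrected word ("left")
                continue           # the correction ("right") is appended normally
            out.append(w)
        return out


    print(collapse_doubles("look look here".split()))              # ['look', 'here']
    print(resolve_correction("the left em right cube".split()))    # ['the', 'right', 'cube']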
<Paragraph position="1"> For future work it is planned to integrate additional information sources, e.g., inquiries of the dialog manager to the user. The module will also store this information in the history, which will be used for anaphora resolution and can also be used to verify the output of the speech recognition.</Paragraph> <Paragraph position="2"> [Figure: Excerpts of user utterances from the evaluation. User1: "Robot look - do you see? This - is a cow. Funny. Do you like it? ..." User2: "Look here robot - a cup. Look here a - a keyboard. Let's try that one. ..." User3: "Can you walk in this room? Sorry, can you repeat your answer? How fast can you move? ..."]</Paragraph> </Section> </Section> </Paper>