<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1017">
  <Title>WIT: A Toolkit for Building Robust and Real-Time Spoken Dialogue Systems</Title>
  <Section position="4" start_page="150" end_page="155" type="metho">
    <SectionTitle>
3 Architecture of WIT-Based Spoken Dialogue Systems
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="150" end_page="150" type="sub_section">
      <Paragraph position="0"> Here we explain how the modules in WIT work by exploiting domain-dependent knowledge and how they interact with each other.</Paragraph>
    </Section>
    <Section position="2" start_page="150" end_page="151" type="sub_section">
      <SectionTitle>
3.1 Speech Recognition
</SectionTitle>
      <Paragraph position="0"> The speech recognition module is a phoneme-HMM-based speaker-independent continuous speech recognizer that incrementally outputs face Toolldt.</Paragraph>
      <Paragraph position="1"> word hypotheses. As the recogn/fion engine, either VoiceRex, developed by NTI&amp;quot; (Noda et al., 1998), or HTK from Entropic Research can be used. Acoustic models for HTK is trained with the continuous speech database of the Acoustical Society of Japan (Kobayashi et al., 1992). This recognizer incrementally outputs word hypotheses as soon as they are found in the best-scored path in the forward search (Hirasawa et al., 1998) using the ISTAR (Incremental Structure Transmitter And Receiver) protocol, which conveys word graph information as well as word hypotheses. This incremental output allows the language understanding module to process recognition results before the speech interval ends, and thus real-time responses are possible. This module continuously runs and outputs recognition results when it detects a speech interval. This enables the language generation module to react immediately to user interruptions while the system is speaking.</Paragraph>
      <Paragraph position="2"> The language model for speech recognition is a network (regular) grammar, and it allows each speech interval to be an arbitrary number of phrases. A phrase is a sequence of words, which is to be defined in a domain-dependent way. Sentences can be decomposed into a couple of phrases. The reason we use a repetition of phrases instead of a sentence grammar for the language model is that the speech recognition module of a robust spoken dialogue system sometimes has to recognize spontaneously spoken utterances, which include self-repairs and repetition. In Japanese, bunsetsu is appropriate for defining phrases. A bunsetsu consists of one content word and a number (possibly zero) of function words. In the meeting room reservation system we have developed, examples of defined phrases are bunsetsu to specify the room to be reserved and the time of the reservation and bunsetsu to express affirmation and negation.</Paragraph>
      <Paragraph position="3"> When the speech recognition module finds a phrase boundary, it sends the category of the phrase to the language understanding module, and this information is used in the parsing process. null It is possible to hold multiple language models and use any one of them when recognizing a speech interval. The language models are  switched according to the requests from the language understanding module. In this way, the speech recognition success rate is increased by using the context of the dialogue.</Paragraph>
      <Paragraph position="4"> Although the current version of WIT does not exploit probabilistic language models, such models can be incorporated without changing the basic WIT architecture.</Paragraph>
    </Section>
    <Section position="3" start_page="151" end_page="151" type="sub_section">
      <SectionTitle>
3.2 Language Understanding
</SectionTitle>
      <Paragraph position="0"> The language understanding :module receives word hypotheses from the speech recognition module and incrementally understands the sequence of the word hypotheses to update the dialogue state, in which the resnlt of understanding and discourse information are represented by a frame (i.e., attribute-value pairs). The understanding module utilizes ISSS (Incremental</Paragraph>
    </Section>
    <Section position="4" start_page="151" end_page="152" type="sub_section">
      <SectionTitle>
Significant-utterance Sequence Search) (Nakano
</SectionTitle>
      <Paragraph position="0"> et al., 1999b), which is an integrated parsing and discourse processing method. ISSS enables the incremental understanding of user utterances that are not segmented into sentences prior to parsing by incrementally finding the most plausible sequence of sentences (or significant utterances in the ISSS terms) out of the possible sentence sequences for the input word sequence. ISSS also makes it possible for the language generation module to respond in real time because it can output a partial result of understanding at any point in time.</Paragraph>
      <Paragraph position="1"> The domain-dependent knowledge used in this module consists of a unification-based lexicon and phrase structure rules. Disjunctive feature descriptions are also possible; WIT incorporates an efficient method for handling disjunctions (Nakano, 1991). When a phrase boundary is detected, the feature structure for a phrase is computed using some built-in rules from the feature structure rules for the words in the phrase. The phrase structure rules specify what kind of phrase sequences can be considered as sentences, and they also enable computing the semantic representation for found sentences. Two kinds of sentenees can be considered; domain-related ones that express the user's intention about the reser- null vafion and dialogue-related ones that express the user's attitude with respect to the progress of the dialogue, such as confirmation and denial. Considering the meeting room reservation system, examples of domain-related sentences are &amp;quot;I need to book Room 2 on Wednesday&amp;quot;, &amp;quot;I need to book Room 2&amp;quot;, and &amp;quot;Room 2&amp;quot; and dialogue-related ones are &amp;quot;yes&amp;quot;, &amp;quot;no&amp;quot;, and &amp;quot;Okay&amp;quot;. The semantic representation for a sentence is a command for updatingthe dialogue state. The dialogue state is represented by a list of attribute-value pairs. For example, attributes used in the meeting room reservation system include task-related attributes, such as the date and time of the reservation, as well as attributes that represent discourse-related information, such as confirmation and grounding.</Paragraph>
    </Section>
    <Section position="5" start_page="152" end_page="152" type="sub_section">
      <SectionTitle>
3.3 Language Generation
</SectionTitle>
      <Paragraph position="0"> How the language generation module works varies depending on whether the user or system has the initiative of turn taking in the dialogue 2. Precisely speaking, the participant having the initiative is the one the system assumes has it in the dialogue.</Paragraph>
      <Paragraph position="1"> The domain-dependent knowledge used by the language generation module is generation procedures, which consist of a set of dialogue-phase definitions. For each dialogue phase, an initial function, an action function, a time-out function, and a language model are assigned. In addition, phase definitions designate whether the user or the system has the initiative. In the phases in which the system has the initiative, only the initial function and the language model are assigned. The meeting room reservation system, for example, has three phases: the phase in which the user tells the system his/her request, the phase in which the system confirms it, and the phase in which the system tells the user the result of the database access. In the first two phases, the user holds the initiative, and in the last phase, the systern holds the initiative.</Paragraph>
      <Paragraph position="2"> Functions defined here decide what string should be spoken and send that string to the speech output module based on the current dialogue state. They can also shift the dialogue 2The notion of the initiative in this paper is different from that of the dialogue initiative of Chu-Carroll (2000). phase and change the holder of the initiative as well as change the dialogue state. When the dialogue phase shifts, the language model foi&amp;quot; speech recognition is changed to get better speech recognition performance. Typically, the language generation module is responsible for database access. The language generation module works as follows. It first checks which dialogue participant has the initiative. If the initiative is held by the user, it waits until the user's speech interval ends or a duration of silence after the end of a system utterance is detected. The action function in the dialogue phase at that point in time is executed in the former case; the time-out function is executed in the latter case. Then it goes back to the initial stage. If the system holds the initiative, the module executes the initial function of the phase. In typical question-answer systems, the user has the initiative when asking questions and the system has it when answering.</Paragraph>
      <Paragraph position="3"> Since the language generation module works in parallel with the language understanding module, utterance generation is possible even while the system is listening to user utterances and that utterance understanding is possible even while it is speaking (Nakano et al., 1999a). Thus the system can respond immediately after user pauses when the user has the initiative. When the system holds the initiative, it can immediately react to an interruption by the user because user utterances are understood in an incremental way (Dohsaka and Shimazu, 1997).</Paragraph>
      <Paragraph position="4"> The time-out function is effective in moving the dialogue forward when the dialogue gets stuck for some reason. For example, the system may be able to repeat the same question with another expression and may also be able to ask the user a more specific question.</Paragraph>
    </Section>
    <Section position="6" start_page="152" end_page="153" type="sub_section">
      <SectionTitle>
3.4 Speech Output
</SectionTitle>
      <Paragraph position="0"> The speech output module produces speech according to the requests from the language generation module by using the correspondence table between strings and pre-recorded speech data. It also notifies the language generation module that speech output has finished so that the language generation module can take into account the timing of the end of system utterance. The meeting room reservation system uses speech files of short</Paragraph>
    </Section>
    <Section position="7" start_page="153" end_page="155" type="sub_section">
      <SectionTitle>
4.1 Domain-Dependent System Specifications
</SectionTitle>
      <Paragraph position="0"> Spoken dialogue systems can be built with WIT by preparing several domain-dependent specifications. Below we explain the specifications.</Paragraph>
      <Paragraph position="1"> Feature Definitions: Feature definitions specify the set of features used in the grammar for language understanding. They also specify whether each feature is a head feature or a foot feature (Pollard and Sag, 1994). This information is used when constructing feature structures for phrases in a built-in process.</Paragraph>
      <Paragraph position="2"> The following is an example of a feature definition. Here we use examples from the specification of the meeting room reservation system.</Paragraph>
      <Paragraph position="3"> (case head) It means that the case feature is used and it is a head feature 3.</Paragraph>
      <Paragraph position="4"> Lexieal Descriptions: Lexical descriptions specify both pronunciations and grammatical features for words. Below is an example lexical item for the word 1-gatsu (January).</Paragraph>
      <Paragraph position="5"> (l-gatsu ichigatsu month nil i) The first three elements are the identifier, the pronunciation, and the grammatical category of the word. The remaining two elements are the case and semantic feature values.</Paragraph>
      <Paragraph position="6"> Phrase Definitions: Phrase definitions specify what kind of word sequence can be recognized as a phrase. Each definition is a pair comprising a phrase category name and a network of word categories. In the example below, month-phrase is the phrase category name and the remaining part is the network of word categories. opt means an option and or means a disjunction. For instance, a word sequence that consists of a word in the month category, such as 1-gatsu (January), and a word in the adraoninalparticle category, such as no (of), forms a phrase in the month-phrase category.</Paragraph>
      <Paragraph position="7">  Network Definitions: Network definitions specify what kind of phrases can be included in each language model. Each definition is a pair comprising a network name and a set of phrase category names.</Paragraph>
      <Paragraph position="8"> Semantic-Frame Specifications: The result of understanding and dialogue history can be stored in the dialogue state, which is represented by a flat frame structure, i.e., a set of attribute-value pairs. Semantic-frame specifications define the attributes used in the frame. The meeting room reservation system uses task-related attributes. Two are start and end, which represent the user's intention about the start and end times of the reservation for some meeting room. It also has attributes that represent discourse information. One is confirmed, whose value indicates whether if the system has already made an utterance to confirm the content of the task-related attributes. null  These roles are similar to DCG (Pereira and Warren, 1980) rules; they can include logical variables and these variables can be bound when these rules are applied. It is possible to add to the rules constraints that stipulate relationships that must hold among variables (Nakano, 199 I), but we do not explain these constraints in detail in this  paper. The priorities are used for disambiguating interpretation in the incremental understanding method (Nakano et al., 1999b).</Paragraph>
      <Paragraph position="9"> When the command on the right-hand side of the arrow is a frame operation command, phrases to which this rule can be applied can be considered a sentence, and the sentence's semantic representation is the command for updating the dialogue state. The command is one of the following: null * A command to set the value of an attribute of the frame, * A command to increase the priority, Conditional commands (If-then-else type command, the condition being whether the value of an attribute of the flame is or is not equal to a specified value, or a conjunction or disjunction of the above condition), or * A list of commands to be sequentially executed. null Thanks to conditional commands, it is possible to represent the semantics of sentences contextdependently. null The following rule is an example.</Paragraph>
      <Paragraph position="10">  The name of this rule is start-end-timescommand. The second and third elements are child feature structures. In these elements, time-phrase is a phrase category, : from and ( : or : to nil ) are case feature values, and *start and *end are semantic feature values. Here :or means a disjunction, and symbols starting with an asterisk are variables. The right-hand side of the arrow is a command to update the frame. The second element of the command, (set :start *start), changes the :start atttribute value of the frame to the instance of *start, which should be bound when applying this rule to the child feature structures. Phase Definitions: Each phase definition consists of a phase name, a network name, an initiative holder specification, an initial function, an action function, a maximum silence duration, and a time-out function. The network name is the identifier of the language model for the speech recognition. The maximum silence duration specifies how long the generation module should wait until the time-out function is invoked.</Paragraph>
      <Paragraph position="11"> Below is an example of a phase definition.</Paragraph>
      <Paragraph position="12"> The first element request is the name of this phase, &amp;quot;frar_request&amp;quot; is the name of the network, and move-to-reques t-phase and request-phase-action are the names of the initial and action functions. In this phase, the maximum silence duration is ten seconds and the name of the time-out function is requestphas e- t imeou t.</Paragraph>
      <Paragraph position="13"> (request &amp;quot;fmr_request&amp;quot; move- to-reques t -phase request-phase-action  request-phase- t imeout ) For the definitions of these functions, WIT provides functions for accessing the dialogue state, sending a request to speak to the speech output module, generating strings to be spoken using surface generation templates, shifting the dialogue phase, taking and releasing the initiative, and so on. Functions are defined in terms of the Common Lisp program.</Paragraph>
      <Paragraph position="14"> Surface-generation Templates: Surface-generation templates are used by the surface generation library function, which converts a list-structured semantic representation to a sequence of strings. Each string can be spoken, i.e., it is in the list of pre-recorded speech files. For example, let us consider the conversion of the semantic representation (date (dateexpression 3 15) ) to strings using the following template.</Paragraph>
      <Paragraph position="16"> The surface generation library function matches the input semantic representation with the first element of the template and checks if a sequences  of strings appear in the speech file list. It returns (' '3gagsul5nichi'') (March 15th) if the string &amp;quot;3gatsul5nichi&amp;quot; is in the list of pre-recorded speech files, and otherwise, returns ( ' ' 3gatsu .... 15nichi' ' ) when these strings are in the list.</Paragraph>
      <Paragraph position="17"> List of Pre-recorded Speech Files: The list of pre-recorded speech files should show the correspondence between strings and speech files to be played by the speech output module.</Paragraph>
    </Section>
    <Section position="8" start_page="155" end_page="155" type="sub_section">
      <SectionTitle>
4.2 Compiling System Specifications
</SectionTitle>
      <Paragraph position="0"> From the specifications explained above, domain-dependent knowledge sources are created as indicated by the dashed arrows in Figure 1. When creating the knowledge sources, WIT checks for several kinds of consistency. For example, the set of word categories appearing in the lexicon and the set of word categories appearing in phrase deftnifions are compared. This makes it easy to find errors in the domain specifications.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="155" end_page="155" type="metho">
    <SectionTitle>
5 Implementation
</SectionTitle>
    <Paragraph position="0"> WIT has been implemented in Common Lisp and C on UNIX, and we have built several experimental and demonstration dialogue systems using it, including a meeting room reservation system (Nakano et al., 1999b), a video-recording programming system, a schedule management system (Nakano et al., 1999a), and a weather information system (Dohsaka et al., 2000). The meeting room reservation system has vocabulary of about 140 words, around 40 phrase structure rules, nine attributes in the semantic frame, and around 100 speech files. A sample dialogue between this system and a naive user is shown in Figure 2. This system employs HTK as the speech recognition engine. The weather information system can answer the user's questions about weather forecasts in Japan. The vocabulary size is around 500, and the number of phrase structure rules is 31. The number of attributes in the semantic flame is 11, and the number of the files of the pre-recorded speech is about 13,000.</Paragraph>
  </Section>
  <Section position="6" start_page="155" end_page="156" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> As explained above, the architecture of WIT allows us to develop a system that can use utterances that are not clearly segmented into sentences by pauses and respond in real time. Below we discuss other advantages and remaining problems. null</Paragraph>
    <Section position="1" start_page="155" end_page="155" type="sub_section">
      <SectionTitle>
6.1 Descriptive Power
</SectionTitle>
      <Paragraph position="0"> Whereas previous finite-state-model-based toolkits place many severe restrictions on domain descriptions, WIT has enough descriptive power to build a variety of dialogue systems. Although the dialogue state is represented by a simple attribute-value matrix, since there is no limitation on the number of attributes, it can hold more complicated information. For example, it is possible to represent a discourse stack whose depth is limited. Recording some dialogue history is also possible. Since the language understanding module utilizes unification, a wide variety of linguistic phenomena can be covered. For example, speech repairs, particle omission, and fillers can be dealt with in the framework of unification grammar (Nakano et al., 1994; Nakano and Shimazu, 1999). The language generation module features Common Lisp functions, so there is no limitation on the description. Some of the systems we have developed feature a generation method based on hierarchical planning (Dohsaka and Shirnazu, 1997). It is also possible to build a simple finite-state-model-based dialogue system using WIT. States can be represented by dialogue phases in WIT.</Paragraph>
    </Section>
    <Section position="2" start_page="155" end_page="155" type="sub_section">
      <SectionTitle>
6.2 Consistency
</SectionTitle>
      <Paragraph position="0"> In an agglutinative language such as Japanese, there is no established definition of words, so dialogue system developers must define words. This sometimes causes a problem in that the definition of word, that is, the word boundaries, in the speech recognition module are different from that in the language understanding module. In WIT, however, since the common lexicon is used in both the speech recognition module and language understanding module, the consistency between them is maintained.</Paragraph>
    </Section>
    <Section position="3" start_page="155" end_page="156" type="sub_section">
      <SectionTitle>
6.3 Avoiding Information Loss
</SectionTitle>
      <Paragraph position="0"> In ordinary spoken language systems, the speech recognition module sends just a word hypothesis to the language processing module, which  donoy6na goy6ken desh6 ka (how may I help you?) kaigishitsu o yoyaku shitai ndesu ga (I'd like to make a reservation for a meeting room) hai (uh-huh) san-gatsujfini-nichi (on March 12th) hal (uh-huh) jayo-ji kara (from 14:00) hai (uh-huh) jashichi-ji sanjup-pun made (to 17:30) hai (uh-huh) dai-kaigishitsu (the large meeting room) san-gatsu jani-nichi, j~yo-ji kara, jashichi-ji sanjup-pun made, dai-kaigishitsu toyfi koto de yoroshf deshrka (on March 12th, from 14:00 to 17:30, the large meeting room, is that right?) &amp;quot; hai (yes) kashikomarimashitd (all right) An example dialogue of an example system must disambiguate word meaning and find phrase boundaries by parsing. In contrast, the speech recognition module in WIT sends not only words but also word categories, phrase boundaries, and phrase categories. This leads to less expensive and better language understanding.</Paragraph>
    </Section>
    <Section position="4" start_page="156" end_page="156" type="sub_section">
      <SectionTitle>
6.4 Problems and Limitations
</SectionTitle>
      <Paragraph position="0"> Several problems remain with WIT. One of the most significant is that the system developer must write language generation functions. If the generation functions employ sophisticated dialogue strategies, the system can perform complicated dialogues that are not just question answering.</Paragraph>
      <Paragraph position="1"> WIT, however, does not provide task-independent facilities that make it easier to employ such dialogue strategies.</Paragraph>
      <Paragraph position="2"> There have been several efforts aimed at developing a domain-independent method for generating responses from a frame representation of user requests (Bobrow et al., 1977; Chu-CarroU, 1999). Incorporating such techniques would deo crease the system developer workload. However, there has been no work on domain-independent response generation for robust spoken dialogue systems that can deal with utterances that might include pauses in the middle of a sentence, which WIT handles well. Therefore incorporating those techniques remains as a future work.</Paragraph>
      <Paragraph position="3"> Another limitation is that WIT cannot deal with multiple speech recognition candidates such as those in an N-best list. Extending WIT to deal with multiple recognition results would improve the performance of the whole system. The ISSS preference mechanism is expected to play a role in choosing the best recognition result.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>