<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1302">
  <Title>Multi-Domain Spoken Dialogue System with Extensibility and Robustness against Speech Recognition Errors</Title>
  <Section position="4" start_page="9" end_page="10" type="metho">
    <SectionTitle>
2 Architecture used for Multi-Domain Spoken Dialogue Systems
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="9" end_page="10" type="sub_section">
      <Paragraph position="0"> In multi-domain spoken dialogue systems, the system design is more complicated than in single domain systems. When the designed systems are closely related to each other, a modification in a certain domain may affect the whole system. This type of a design makes it difficult to modify existing domains or to add new domains. Therefore, a distributed-type architecture has been previously proposed (Lin et al., 2001), which enables system developers to design each domain independently.</Paragraph>
      <Paragraph position="1"> In this architecture, the system is composed of two kinds of components: a part that can be designed independently of all other domains, and a part in which relations among domains should be considered. By minimizing the latter component, a system developer can design each domain semiindependently, which enables domains to be easily added or modified. Many existing systems are based on this architecture (Lin et al., 2001; O'Neill et al., 2004; Pakucs, 2003; Nakano et al., 2005).</Paragraph>
      <Paragraph position="2"> Thus, we adopted the distributed-type architecture (Nakano et al., 2005). Our system is roughly composed of two parts, as shown in Figure 1: several experts that control dialogues in each domain, and a central module that controls each expert.</Paragraph>
      <Paragraph position="3"> When a user speaks to the system, the central module drives a speech recognizer, and then passes the result to each domain expert. Each expert, which controls its own domains, executes a language understanding module, updates its dialogue states based on the speech recognition result, and returns the information required for domain selection null  . Based on the information obtained from the experts, the central module selects an appropriate domain for giving the response. An expert then takes charge of the selected domain and determines the next dialogue act based on its dialogue state. The central module generates a response based on the dialogue act obtained from the expert, and outputs the synthesized speech to the user.</Paragraph>
      <Paragraph position="4"> Communications between the central module and each expert are realized using method-calls in the central module. Each expert is required to have several methods, such as utterance understanding or response selection, to be considered an expert  Dialogue states in a domain that are not selected during domain selection are returned to their previous states.  in this architecture.</Paragraph>
      <Paragraph position="5"> As was previously described, the central module is not concerned with processing the speech recognition results; instead, the central module leaves this task to each expert. Therefore, it is important that the central module selects an expert that is committed to the process of the speech recognition result. Furthermore, information used during domain selection should also be domain independent, because this allows easier domain modification and addition, which is, after all, the main advantage of distributed-type architecture.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="10" end_page="11" type="metho">
    <SectionTitle>
3 Extensible and Robust Domain Selection
</SectionTitle>
    <Paragraph position="0"> Selection Domain selection in the central module should also be performed within an extensible framework, and also should be robust against speech recognition errors.</Paragraph>
    <Paragraph position="1"> In many conventional methods, domain selection is based on estimating the most likely domains based on the speech recognition results. Since these methods are heavily dependent on the performance of the speech recognizers, they are not robust because the systems will fail when a speech recognizer fails. To behave robustly against speech recognition errors, the success of speech recognition and of domain selection should be treated separately. Furthermore, in some conventional methods, accurate language models are required to construct the domain selection parts before new domains are added to a multi-domain system. This means that they are not extensible.</Paragraph>
    <Paragraph position="2"> When selecting a domain, other studies have used the information on the domain in which a previous response was made. Lin et al. (2001) gave preference to the domain selected in the previous turn by adding a certain score as an award when comparing the N-best candidates of the speech recognition for each domain. Lane and Kawahara (2005) also assigned a similar preference in the classification with Support Vector Machine (SVM). A system described in (O'Neill et al., 2004) does not change its domain until its sub-task is completed, which is a constraint similar to keeping dialogue in one domain. Since these methods assume that the previous domain is most likely the correct domain, it is expected that these methods keep a system in the domain despite errors due to speech recognition problems. Thus, should domain selection be erroneous, the damage due to the  error is compounded, as the system assumes that the previous domain is always correct. Therefore, we solve this problem by considering features that represent the confidence of the previously selected domain.</Paragraph>
    <Paragraph position="3"> We define domain selection as being based on the following 3-class categorization: (I) the previous domain, (II) the domain in which the speech recognition results can be accepted with the highest recognition score, which is different from the previous domain, and (III) other domains. Figure 2 depicts the three choices. This framework includes the conventional methods as choices (I) and (II). Furthermore, it considers the possibility that the current interpretations may be wrong, which is represented as choice (III). This framework also has extensibility for adding new domains, since it treats domain selection not by detecting each domain directly, but by defining only a relative relationship between the previous and current domains. null Since our framework separates speech recognition results and domain selection, it can keep dialogues in the correct domain even when speech recognition results are wrong. This situation is represented as choice (I). An example is shown in Figure 3. Here, the user's first utterance (U1) is about the restaurant domain. Although the second utterance (U2) is also about the restaurant domain, an incorrect interpretation for the restaurant domain is obtained because the utterance contains an out-of-vocabulary word and is incorrectly recognized. Although a response for utterance U2 should ideally be in the restaurant domain, the system control shifts to the temple sightseeing information domain, in which an interpretation is obtained based on the speech recognition result. This  a19a16 U1: Tell me bars in Kawaramachi area.</Paragraph>
    <Paragraph position="4"> (domain: restaurant) S1: Searching for bars in Kawaramachi area.</Paragraph>
    <Paragraph position="5"> 30 items found.</Paragraph>
    <Paragraph position="6"> U2: I want Tamanohikari (name of liquor).</Paragraph>
    <Paragraph position="7"> (domain: restaurant) Tamanohikari is out-of-vocabulary word, and misrecognized as Tamba-bashi (name of place).</Paragraph>
    <Paragraph position="8"> (domain: temple) S2 (bad): Searching spots near Tamba-bashi. 10 items found. (domain: temple) S2 (good): I do not understand what you said. Do you have any other preferences? (domain: restaurant)  ate in spite of speech recognition error is shown as utterance S2 (bad). In such cases, our framework is capable of behaving appropriately.</Paragraph>
    <Paragraph position="9"> This is shown as S2 (good), which is made by selecting choice (I). Accepting erroneous recognition results is more harmful than rejecting correct ones for the following reasons: 1) a user needs to solve the misunderstanding as a result of the false acceptance, and 2) an erroneous utterance affects the interpretation of the utterances following it. Furthermore, we define choice (III), which detects the cases where normal dialogue management is not suitable, in which case the central module selects an expert based on either the previous domain or the domain based on the speech recognition results. The situation corresponds to a succession of recognition errors. However, this problem is more difficult to solve than merely detecting a simple succession of the errors because the system needs to distinguish between speech recognition errors and domain selection errors in order to generate appropriate next utterances. Figure 4 shows an example of such a situation. Here, the user's utterances U1 and U2 are about the temple domain, but a speech recognition error occurred in U2, and system control shifts to the hotel domain. The user again says (U3), but this results in the same recognition error. In this case, a domain that should ideally be selected is neither the domain in the previous turn nor the domain determined based on the speech recognition results. If this situation can be detected, the system should be able to generate an appropriate response, like S3 (good), and prevent inappropriate responses based  on an incorrect domain determination. It is possible for the system to restart from two utterances before (U1), after asking a confirmatory question (S4) about whether to return to it or not. After that, repetition of similar errors can also be avoided if the system prohibits transition to the hotel domain.</Paragraph>
  </Section>
  <Section position="6" start_page="11" end_page="13" type="metho">
    <SectionTitle>
4 Domain Selection using Dialogue History
</SectionTitle>
    <Paragraph position="0"> History We constructed a classifier that selects the appropriate domains using various features, including dialogue histories. The selected domain candidates are based on: (I) the previous domain, (II) the domain in which the speech recognition results can be accepted with the highest recognition score, or (III) other domains. Here, we describe the features present in our domain selection method.</Paragraph>
    <Paragraph position="1"> In order to not spoil the system's extensibility, an advantage of the distributed-type architecture, the features used in the domain selection should not depend on the specific domains. We categorize the features used into three categories listed below:  R1: best posteriori probability of the N-best candidates interpreted in the previous domain R2: best posteriori probability for the speech recognition result interpreted in the domain, that is the domain with the highest score R3: average of word's confidence scores for the best candidate of speech recognition results in the domain, that is, the domain with the highest score R4: difference of acoustic scores between candidates selected as (I) and (II) R5: ratio of averages of words' confidence scores between candidates selected as (I) and (II) * Features representing the situation after domain selection (Table 3) We can take into account the possibility that a current estimated domain might be erroneous, by using features representing the confidence in the previous domain. Each feature from P1 to P9 is defined to represent the determination of whether an estimated domain is reliable or not. Specifically, if there are many affirmative responses from a user or many changes of slot values during interactions in the domain, we regard the current domain as reliable. Conversely, the domain is not reliable if there are many negative answers from a user after entering the domain.</Paragraph>
    <Paragraph position="2"> We also adopted the feature P10 to represent the state of the task, because the likelihood that a domain is changed depends on the state of the task. We classified the tasks that we treat into two categories using the following classifications first made by Araki et al. (1999). For a task categorized as a &amp;quot;slot-filling type&amp;quot;, we defined the dialogue states as one of the following two types: &amp;quot;not completed&amp;quot;, if not all of the requisite slots have been filled; and &amp;quot;completed&amp;quot;, if all of the  C1: dialogue state after the domain selection after selecting previous domain C2: whether the interpretation of the user's utterance is negative in previous domain C3: number of changed slots after selecting previous domain C4: dialogue state after selecting the domain with the highest speech recognition score C5: whether the interpretation of the user's utterance is negative in the domain with the highest speech recognition score C6: number of changed slots after selecting the domain with the highest speech recognition score C7: number of common slots (name of place, here) changed after selecting the domain with the highest speech recognition score C8: whether the domain with the highest speech recog- null nition score has appeared before requisite slots have been filled. For a task categorized as a &amp;quot;database search type&amp;quot;, we defined the dialogue states as one of the following two types: &amp;quot;specifying query conditions&amp;quot; and &amp;quot;requesting detailed information&amp;quot;, which were defined in (Komatani et al., 2005a).</Paragraph>
    <Paragraph position="3"> The features which represent the user's speech recognition result are listed in Table 2 and correspond to those used in conventional studies. R1 considers the N-best candidates of speech recognition results that can be interpreted in the previous domain. R2 and R3 represent information about a domain with the highest speech recognition score.</Paragraph>
    <Paragraph position="4"> R4 and R5 represent the comparisons between the above-mentioned two groups.</Paragraph>
    <Paragraph position="5"> The features that characterize the situations after domain selection correspond to the information each expert returns to the central module after understanding the speech recognition results. These are listed in Table 3. Features listed from C1 to C3 represent a situation in which the previous domain (choice (I)) is selected. Those listed from C4 to C8 represent a situation in which a domain with the highest recognition score (choice (II)) is selected.</Paragraph>
    <Paragraph position="6"> Note that these features listed here have survived after feature selection. A feature survives if the performance in the domain classification is degraded when it is removed from a feature set one by one. We had prepared 32 features for the initial set.</Paragraph>
  </Section>
class="xml-element"></Paper>