<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0707">
  <Title>Flexible and Personalizable Mixed-Initiative Dialogue Systems</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Spoken dialogue systems are emerging as an effective means for humans to access information spaces through natural spoken interaction with computers. These systems are usually implemented with a static knowledge space, or one that is only augmented through manual intervention by the system developers. A significant enhancement to the usability of such systems would be the ability to automatically acquire new knowledge through interaction with their end users and their available knowledge resources. We believe, in fact, that the main barrier to wide acceptance of spoken dialogue systems is their current lack of flexibility and personalization.</Paragraph>
    <Paragraph position="1"> Over the past decade, researchers in the Spoken Language Systems Group at MIT have been developing human language technologies for mixed-initiative conversational systems, which are distinguished from the emerging deployed commercial systems in that the interaction is natural and flexible, modelled after the style of human-human dialogue (Zue and Glass, 2000). The development of the Galaxy Communicator architecture (Seneff et al., 1998) has greatly accelerated the pace at which we as experts can configure complex dialogue systems in a wide range of different domains. As the underlying technology components have matured, our research focus has evolved to include issues related to portability, modularity, and dynamic configurability of system components.</Paragraph>
    <Paragraph position="2"> We believe that the ability of naive system developers, and even end users, to reconfigure existing systems to meet their personal needs will be crucial for the successful use of these technologies.</Paragraph>
    <Paragraph position="3"> We see several different ways in which such flexible reconfiguration will become feasible in the near future.</Paragraph>
    <Paragraph position="4"> Perhaps most critical is the initial preparation of a new domain, where available on-line databases will be the catalyst for defining the vocabulary and language models of the domain, as well as the nature of the dialogue interaction needed to guide the user through the information space (Polifroni et al., 2003). However, the ability to dynamically reconfigure based on new information will also be extremely valuable. For instance, a hotel domain for the entire U.S. might initially restrict name recognition to the major chains such as Sheraton and Hyatt, since it would not be feasible to support all hotel names in the U.S. Once the user defines a specific geographic region, the system can enhance the level of detail of its working vocabulary in a way particular to the user's narrower request, now supporting explicit mention of perhaps all hotels by name within that designated region. Users would be able to select hotels based on a number of attributes as guided by the hotel database, and to specify constraints in general world knowledge areas such as dates and prices.</Paragraph>
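The hotel example above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the system's actual mechanism: the hotel list, the `MAJOR_CHAINS` set, and the function name `working_vocabulary` are all hypothetical, and a real recognizer would also need to update its language model and pronunciations.

```python
# Hypothetical toy database of (hotel name, city) pairs, invented for illustration.
HOTELS = [
    ("Sheraton Boston", "Boston"),
    ("Hyatt Regency Boston", "Boston"),
    ("Harborside Inn", "Boston"),
    ("Sheraton Seattle", "Seattle"),
    ("Hotel Monaco", "Seattle"),
]

# Names supported before the user has narrowed the request (assumed set).
MAJOR_CHAINS = {"sheraton", "hyatt"}


def working_vocabulary(region=None):
    """Return the hotel-name words the recognizer should currently support.

    With no region, only the major chains are active. Once a region is
    given, every word of every hotel name in that region is added.
    """
    vocab = set(MAJOR_CHAINS)
    if region is None:
        return vocab
    for name, city in HOTELS:
        if city.lower() == region.lower():
            vocab.update(word.lower() for word in name.split())
    return vocab
```

Calling `working_vocabulary()` yields only the chain names, while `working_vocabulary("Boston")` additionally covers the words of the Boston hotel names, mirroring the move from a coarse nationwide vocabulary to a detailed regional one.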
    <Paragraph position="5"> User: I'm interested in a restaurant in Boston's North End.
Sys: I know of 53 restaurants in the North End.</Paragraph>
    <Paragraph position="6"> Of these, 46 are Italian restaurants. 28 of them are located on either Hanover or Salem Street.</Paragraph>
    <Paragraph position="7"> User: Tell me about the ones on Hanover Street.</Paragraph>
    <Paragraph position="8"> Sys: There is one seafood restaurant on Hanover Street. The others are Italian.</Paragraph>
    <Paragraph position="9"> User: I'm interested in the Seafood restaurant.</Paragraph>
    <Paragraph position="10"> Sys: The Daily Catch is located on 323 Hanover Street. The phone number is 617-523-8567.</Paragraph>
    <Paragraph position="11"> The price range is between $12 and $18.</Paragraph>
    <Paragraph position="12"> Figure 1: Dialogue between a user and a system in a restaurant domain.</Paragraph>
    <Paragraph position="13"> Users should be able to further personalize the system by adding new words instantaneously to the working vocabulary via spoken dialogue. This might also include specifying the word's semantic class: &quot;I want to add the name John Doe to my rolodex.&quot; When feasible, a user-specified named entity, such as a restaurant, would be verified against Web sources to improve the system's ability to understand their request.</Paragraph>
    <Paragraph position="14"> In order for this vision to become a reality, a number of specific technology goals must be met. First and foremost, it is essential to develop tools that will enable rapid configuration of dialogue systems in new domains of knowledge, guided mainly by domain-dependent information sources. Our efforts in generic dialogue development represent a strong initiative toward that goal (Polifroni and Chung, 2002). Second, we need to be able to support incremental update of vocabularies and language models for speech recognition and understanding, essentially instantaneously (Schalkwyk et al., 2003; Seneff et al., 1998; Chung et al., 2003). This would allow great flexibility within a single dialogue where the user might ask about a named entity that is not yet known to the system. Third, while we can make use of a large lexical resource for pronunciation modeling, we must have available as well a high-performance letter-to-sound capability, integrating multiple knowledge sources such as a Web page, a spoken name, a spoken spelling of the name, and/or a key-padded name (Chung and Seneff, 2002).</Paragraph>
    <Paragraph position="15"> Fourth, we need to have intelligent knowledge acquisition systems, capable of populating a database from Web sources, and extracting and organizing key elements from the database (Polifroni et al., 2003).</Paragraph>
    <Paragraph position="16"> These ideas can best be illustrated through a couple of example scenarios. In Figure 1, the user begins with a request for a restaurant in a neighborhood of Boston.</Paragraph>
    <Paragraph position="17"> The system then rapidly configures itself to support the appropriate sub-language, and is able to summarize lists of restaurants meeting the constraints of the user's subsequent queries, eventually leading to a unique selection.</Paragraph>
    <Paragraph position="18"> User: Can you tell me the phone number of the Thaiku restaurant in Seattle?
Sys: I may not know the name of the restaurant. Can you spell it for me?
User: t h a i k u
Sys: The phone number of Thaiku is 206-706-7807.</Paragraph>
    <Paragraph position="19"> For the scenario in Figure 2, the user has asked about the phone number for a restaurant they already know about. The system parses the name within a complete parse, but with a generic &quot;unknown word&quot; as a stand-in for the restaurant name. It can at this point go to the Web and download a set of candidate restaurant names for Seattle, to form additional constraints on a solicited spelling. The integration of the spelling, the spoken pronunciation, and the Web listing, we argue, potentially provides enough constraint to solve the specific problem with high accuracy. The system can now retrieve the requested information from the Web.</Paragraph>
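As a rough illustration of how a downloaded candidate list can constrain a solicited spelling, the sketch below scores each candidate name against the spelled letters and keeps the best match. It is a simplification under stated assumptions: plain string similarity from Python's `difflib` stands in for the integration of spelling, pronunciation, and letter-to-sound evidence described above, the function name `resolve_spelled_name` is hypothetical, and every candidate other than Thaiku is invented for the example.

```python
import difflib


def resolve_spelled_name(spelled_letters, candidates):
    """Pick the candidate name that best matches a spoken spelling.

    spelled_letters: recognized letters separated by spaces, e.g. "t h a i k u".
    candidates: names assumed to have been gathered from a Web listing.
    Returns (best_name, similarity), or (None, 0.0) if nothing matches at all.
    """
    spelled = "".join(spelled_letters.split()).lower()
    best_name, best_score = None, 0.0
    for name in candidates:
        # Compare on letters and digits only, ignoring case and spaces.
        normalized = "".join(ch for ch in name.lower() if ch.isalnum())
        score = difflib.SequenceMatcher(None, spelled, normalized).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Hypothetical candidate list for Seattle; only "Thaiku" comes from the text.
SEATTLE_CANDIDATES = ["Thai Tom", "Taiko Sushi", "Thaiku"]
```

With an exact spelling, `resolve_spelled_name("t h a i k u", SEATTLE_CANDIDATES)` selects Thaiku, and the candidate list also lets a spelling with one misrecognized letter still resolve to the same name.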
  </Section>
</Paper>