<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1038">
  <Title>RECENT DEVELOPMENTS IN THE EXPERIMENTAL &quot;WAXHOLM&quot; DIALOG SYSTEM</Title>
  <Section position="4" start_page="0" end_page="207" type="metho">
    <SectionTitle>
2. THE DEMONSTRATOR APPLICATION
</SectionTitle>
    <Paragraph position="0"> The demonstrator application, which we call WAXHOLM, gives information on boat traffic in the Stockholm archipelago (see Figure 1). It references time-tables for a fleet of some twenty boats from the Waxholm company, which connects about two hundred ports. Different days of the week have different time-tables. * The Waxholm group consists of staff and students at the</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Department of Speech Communication and Music Acoustics,
</SectionTitle>
      <Paragraph position="0"> KTH. Most of the work is done part time. The members of the group, in alphabetical order, are: Mats Blomberg, Rolf Carlson, Kjell Elenius, Björn Granström, Joakim Gustafson, Sheri Hunnicutt, Jesper Högberg, Roger Lindell, Lennart Neovius, Lennart Nord, Antonio de Serpa-Leitao and Nikko Ström. Besides the speech recognition and synthesis components, the system contains modules that handle graphic information such as pictures, maps, charts, and time-tables. This information can be presented to the user at his/her request. The application has great similarities to the ATIS domain within the ARPA community and other similar tasks in Europe, for example SUNDIAL. The possibility of expanding the task in many directions is an advantage for our future research on interactive dialog systems. An initial version of the system based on text input has been running since September 1992.</Paragraph>
      <Paragraph position="1"> 2.1. The database In addition to boat time-tables, the database also contains information about port locations, hotels, camping places, and restaurants in the Stockholm archipelago. This information is accessed through the standardized query language SQL (Oracle).</Paragraph>
      <Paragraph position="2"> The time-table, which is the primary part of the database, brings some inherent difficulties to our application. One is that a boat can go in &quot;loops,&quot; i.e. it uses the same port more than once for departure or arrival. This has been solved by giving unique tour identification numbers to different &quot;loops.&quot; Another problem is that the port Waxholm may be used as a &quot;transit port&quot; for many destinations, and to avoid redundancy transit tours are not included in the database. Transits are instead handled by searching for tours from the departure port to Waxholm, and (backwards) from the destination port to Waxholm, that require less than 20 minutes at the transit point [2].</Paragraph>
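      <Paragraph> The transit rule above can be illustrated with a small Python sketch. It is not the SQL used in the system; the tour records, the field layout and the helper name find_transits are hypothetical. Only the rule itself comes from the text: a tour into Waxholm is joined with a tour out of Waxholm when the wait at the transit point is under 20 minutes.
<![CDATA[
```python
from datetime import datetime, timedelta

# Hypothetical direct tours: (tour_id, from_port, to_port, departure, arrival).
TOURS = [
    (1, "Stockholm", "Waxholm", "10:00", "10:50"),
    (2, "Waxholm", "Grinda", "11:00", "11:40"),
    (3, "Waxholm", "Grinda", "12:30", "13:10"),
]

MAX_WAIT = timedelta(minutes=20)  # limit stated in the text


def parse(hhmm):
    """Turn an 'HH:MM' string into a datetime on an arbitrary common day."""
    return datetime.strptime(hhmm, "%H:%M")


def find_transits(origin, destination, tours=TOURS, hub="Waxholm"):
    """Pair tours origin->hub with tours hub->destination.

    A pair is kept when the second leg leaves no earlier than the first
    arrives and the wait at the hub is shorter than MAX_WAIT.
    """
    first_legs = [t for t in tours if t[1] == origin and t[2] == hub]
    second_legs = [t for t in tours if t[1] == hub and t[2] == destination]
    pairs = []
    for leg1 in first_legs:
        for leg2 in second_legs:
            wait = parse(leg2[3]) - parse(leg1[4])
            if timedelta(0) <= wait < MAX_WAIT:
                pairs.append((leg1[0], leg2[0]))
    return pairs
```
]]>
 With the sample data, find_transits("Stockholm", "Grinda") pairs tour 1 with tour 2 (a 10-minute wait) and rejects tour 3, which leaves 100 minutes after the arrival.</Paragraph>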
    </Section>
    <Section position="2" start_page="0" end_page="207" type="sub_section">
      <SectionTitle>
2.2. Implementation
</SectionTitle>
      <Paragraph position="0"> The dialog system is implemented as a number of independent and specialized modules that run as servers on our HP computer system. A notation has been defined to control the information flow between them. The structure makes it possible to run the system in parallel on different machines and facilitates the implementation and testing of alternate models within the same framework. The communication software is based on UNIX de facto standards, which will facilitate the reuse and portability of the components.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="207" end_page="208" type="metho">
    <SectionTitle>
3. SPEECH RECOGNITION
</SectionTitle>
    <Paragraph position="0"> The speech recognition component, which so far has not been integrated in the system during data collection, will handle continuous speech with a vocabulary of about 1000 words. The work on recognition has been carried out along two main lines: artificial neural networks and a speech production oriented approach. Since neural nets are general classification tools, it is quite feasible to combine the two approaches.</Paragraph>
    <Paragraph position="1"> 3.1. Speech production approach Our system uses a speech synthesis technique to generate spectral prototypes of words in a given vocabulary, see [3]. A speaker-independent recognition system has been built according to the speech production approach, using a formant-based speech production module including a voice source model. Whole word models are used to describe intra-word phonemes, while triphones (three-phoneme clusters) are used to model the phonemes at word boundaries. An important part of the system is a method of dynamic voice-source adaptation. The recognition errors have been significantly reduced by this method.</Paragraph>
    <Paragraph position="2"> 3.2. Artificial neural networks We have tested different types of artificial neural networks for performing acoustic-phonetic mapping of speech signals, see [4], [5], and [6]. The tested strategies include self-organizing nets and nets using the error back-propagation (BP) technique.</Paragraph>
    <Paragraph position="3"> The use of simple recurrent BP-networks has been shown to substantially improve performance. The self-organizing nets learn faster than the BP-networks, but they are not as easily transformed to recurrent structures.</Paragraph>
    <Paragraph position="4"> 3.3. Lexical search The frame-based outputs from the neural network form the input to the lexical search. There is one output for each of the 40 Swedish phonemes used in our lexicon. Each word in the lexicon is described on the phonetic level. The lexicon may include alternate pronunciations of each word. The outputs are seen as the a posteriori probabilities of the respective phonemes in each frame. We have implemented an A* N-best search using a simple bigram language model. In a second stage the speech production approach mentioned above will be used to reorder the N-best list according to speaker-specific criteria. A tight coupling between the parser and the recognizer is a long-term goal in the project. This will naturally influence the search algorithms.</Paragraph>
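    <Paragraph> As an illustration of how per-frame phoneme posteriors can be matched against a phonetic lexicon entry, the Python sketch below scores one pronunciation with a plain Viterbi alignment. This is not the A* N-best search itself and it ignores the bigram language model; the function names, the probability floor and the toy data are assumptions made for the example.
<![CDATA[
```python
import math


def word_log_score(frames, phonemes):
    """Best log-probability of aligning `phonemes` to `frames`.

    frames:   list of dicts mapping phoneme label -> posterior probability
    phonemes: pronunciation of one lexicon entry, in order
    Each phoneme must cover at least one frame and frames are consumed
    left to right (a Viterbi alignment without transition costs).
    """
    n_frames, n_ph = len(frames), len(phonemes)
    if n_ph == 0 or n_frames < n_ph:
        return -math.inf
    floor = 1e-10  # avoids log(0) for phonemes missing from a frame
    # best[j] = best score with the current frame assigned to phoneme j
    best = [-math.inf] * n_ph
    best[0] = math.log(frames[0].get(phonemes[0], floor))
    for t in range(1, n_frames):
        new = [-math.inf] * n_ph
        for j in range(n_ph):
            stay = best[j]                              # same phoneme continues
            move = best[j - 1] if j > 0 else -math.inf  # advance to phoneme j
            prev = max(stay, move)
            if prev > -math.inf:
                new[j] = prev + math.log(frames[t].get(phonemes[j], floor))
        best = new
    return best[-1]


def best_word(frames, lexicon):
    """Return the lexicon word whose pronunciation aligns best."""
    return max(lexicon, key=lambda w: word_log_score(frames, lexicon[w]))
```
]]>
 A full search would combine such acoustic scores with language model probabilities over word sequences and keep several hypotheses instead of a single best word.</Paragraph>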
  </Section>
  <Section position="6" start_page="208" end_page="208" type="metho">
    <SectionTitle>
4. SPEECH SYNTHESIS
</SectionTitle>
    <Paragraph position="0"> For the speech-output component we have chosen the multi-lingual text-to-speech system developed in an earlier project [7]. The system is modified for this application. The application vocabulary must be checked for correctness, especially considering the general problem of name pronunciation.</Paragraph>
    <Paragraph position="1"> Speaker-specific aspects are important for the acceptability of the synthetic speech. The WAXHOLM dialog system will focus our efforts on modelling the speaking style and speaker characteristics of one reference speaker. Since the recognition and synthesis modules have the same need of semantic, syntactic and pragmatic information, the lexical information will, to a great extent, be shared. The linguistic module, STINA, will also be used for improved phrase parsing, compared to the simple function-word based methods that have been used so far in the synthesis project. However, in dialog applications such as the proposed WAXHOLM demonstrator, information on phrasing and prosodic structure can be supplied by the application control software itself, rather than by a general module meant for text-to-speech. In a man-machine dialog situation we have a much better base for prosodic modelling than in ordinary text-to-speech, since in such an environment we have access to much more information than if we used an unknown text as input to the speech synthesizer.</Paragraph>
  </Section>
  <Section position="7" start_page="208" end_page="208" type="metho">
    <SectionTitle>
5. NATURAL LANGUAGE COMPONENT
</SectionTitle>
    <Paragraph position="0"> Our initial work on a natural language component is focused on a sublanguage grammar, a grammar limited to a particular subject domain: that of requesting information from a transportation database.</Paragraph>
    <Paragraph position="1"> The fundamental concepts are inspired by TINA, a parser developed at MIT [8]. Our parser, STINA, i.e., Swedish TINA, is knowledge-based and is designed as a probabilistic language model [9]. It contains a context-free grammar which is compiled into an augmented transition network (ATN). Probabilities are assigned to each arc after training. Features of STINA are a stack-decoding search strategy and a feature-passing mechanism to implement unification.</Paragraph>
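    <Paragraph> One simple way to assign arc probabilities after training is a relative-frequency estimate over the node sequences seen in parsed training sentences. The Python sketch below assumes that estimate; the text does not state which estimator STINA uses, and the function name and the node names in the usage note are hypothetical.
<![CDATA[
```python
from collections import Counter, defaultdict


def train_arc_probabilities(paths):
    """Estimate P(next_node | node) by relative frequency.

    paths: iterable of node-name sequences, one per training parse.
    Returns {node: {next_node: probability}}.
    """
    counts = defaultdict(Counter)
    for path in paths:
        # Count every transition between consecutive nodes in the path.
        for node, nxt in zip(path, path[1:]):
            counts[node][nxt] += 1
    return {
        node: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for node, followers in counts.items()
    }
```
]]>
 For example, if two of three training paths go from START to SUBJECT and one goes from START to VERB, the arc START to SUBJECT receives probability 2/3.</Paragraph>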
    <Paragraph position="2"> In the implementation of the parser and the dialog management, we have stressed an interactive development environment. This makes it easier to have control over the system's progress as more components are added. It is possible to study the parsing and the dialog flow step by step when a tree is built. It is even possible to use the collected log files as scripts to repeat a collected dialog including all graphic displays and acoustic outputs.</Paragraph>
    <Section position="1" start_page="208" end_page="208" type="sub_section">
      <SectionTitle>
5.1. Lexicon
</SectionTitle>
      <Paragraph position="0"> The lexicon entries are generated by processing each word in the Two-Level Morphology (TWOL) lexical analyzer ([10] and [11]). Each entry is then corrected by removing all unknown homographs. New grammatical and semantic features, which are used by our algorithm and special application, are then added.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="208" end_page="208" type="metho">
    <SectionTitle>
5.2. Features
</SectionTitle>
    <Paragraph position="0"> The basic grammatical features can be positive, negative or unspecified. Unspecified features match both positive and negative features.</Paragraph>
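    <Paragraph> The three-valued matching described above can be sketched in a few lines of Python. The encoding (True, False and None for positive, negative and unspecified) and the feature names in the usage note are assumptions made for the example.
<![CDATA[
```python
# Feature values: True (positive), False (negative), None (unspecified).

def values_match(a, b):
    """Unspecified (None) matches anything; otherwise values must agree."""
    return a is None or b is None or a == b


def features_match(left, right):
    """Compare two feature dicts; a missing feature counts as unspecified."""
    names = set(left) | set(right)
    return all(values_match(left.get(n), right.get(n)) for n in names)
```
]]>
 Under this encoding, {"PLURAL": True} matches {"PLURAL": None} but not {"PLURAL": False}.</Paragraph>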
    <Paragraph position="1"> Semantic features can be divided into two different classes. The basic features like BOAT and PORT give a simple description of the semantic property of a word. These features are hierarchically structured. Figure 2 gives an example of a semantic feature tree. During the unification process in STINA, all features which belong to the same branch are considered. Thus, a unification of the feature PLACE engages all semantic &quot;non-shaded&quot; features in Figure 2.</Paragraph>
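    <Paragraph> A possible reading of &quot;same branch&quot; is that two features unify when one lies on the path from the other to the root of the tree. The Python sketch below implements that reading. The tree is hypothetical (Figure 2 is not reproduced here), as are the helper names, and the interpretation itself is an assumption.
<![CDATA[
```python
# Hypothetical feature tree, stored as child -> parent.
PARENT = {
    "PLACE": "THING",
    "PORT": "PLACE",
    "HOTEL": "PLACE",
    "BOAT": "THING",
}


def ancestors(feature):
    """Return the feature followed by all of its ancestors up to the root."""
    chain = [feature]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def same_branch(a, b):
    """True when one feature lies on the path from the other to the root."""
    return a in ancestors(b) or b in ancestors(a)
```
]]>
 In this toy tree PLACE unifies with PORT and with HOTEL, while PORT and HOTEL, which are siblings, do not unify with each other.</Paragraph>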
    <Paragraph position="2"> Another type of semantic feature controls which nodes can be used in the syntactic analysis. For example, the node DEPARTURE TIME cannot be used in connection with verbs that imply an arrival time. This is also a powerful method to control the analysis of responses to questions from the dialog module. The question &quot;Where do you want to go?&quot; conditions the parser to accept a simple port name as a possible response from the user.</Paragraph>
  </Section>
  <Section position="9" start_page="208" end_page="209" type="metho">
    <SectionTitle>
6. DIALOG MANAGEMENT
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="208" end_page="209" type="sub_section">
      <SectionTitle>
6.1. Dialog rules
</SectionTitle>
      <Paragraph position="0"> Dialog management based on grammar rules and lexical semantic features has recently been implemented in STINA. The notation to describe the syntactic rules has been expanded to cover some of our special needs to model the dialog. The STINA parser is running with two different time scales during data collection corresponding both to the words in each utterance and to the turns in the dialog. Syntactic nodes and dialog states are processed according to transition networks with probabilities on each arc.</Paragraph>
      <Paragraph position="1"> Each dialog topic is explored according to the rules. These rules define which constraints have to be fulfilled and what action should be taken depending on the dialog history. Each dialog node is specified according to Figure 3.</Paragraph>
      <Paragraph position="2"> The constraint evaluation is described in terms of features and the content in the semantic frame. If the frame needs to be expanded with additional information, a system question is synthesized. During recognition of a response to such a question, the grammar is controlled with semantic features in order to allow incomplete sentences. If the response from the subject does not clarify the question, the robust parsing is temporarily disconnected so that specific information can be given to the user about syntactic or unknown word problems. At the same time a complete sentence is requested, giving the dialog manager the possibility of evaluating whether the chosen topic is a bad choice.</Paragraph>
      <Paragraph position="3"> A positive response from the constraint evaluation clears the way for the selected action to take place. The node function list in the figure gives examples of such actions.</Paragraph>
    </Section>
    <Section position="2" start_page="209" end_page="209" type="sub_section">
      <SectionTitle>
6.2. Topic selection
</SectionTitle>
      <Paragraph position="0"> In Figure 4 some of the major topics are listed. The decision about which path to follow in the dialog is based on several factors such as the dialog history and the content of the specific utterance. The utterance is coded in the form of a &quot;semantic frame&quot; with slots corresponding to both the grammatical analysis and the specific application. The structure of the semantic frame is automatically created based on the rule system.</Paragraph>
      <Paragraph position="1"> TIME_TABLE Goal: to get a time-table presented with departure and arrival times specified between two specific locations.</Paragraph>
      <Paragraph position="2"> Example: När går båten? (When does the boat leave?)</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="209" end_page="209" type="metho">
    <SectionTitle>
GET_POSITION
</SectionTitle>
    <Paragraph position="0"> Goal: to get a chart or a map displayed with the place of interest shown.</Paragraph>
    <Paragraph position="1"> Example: Var ligger Vaxholm? (Where is Vaxholm?)</Paragraph>
  </Section>
  <Section position="11" start_page="209" end_page="210" type="metho">
    <SectionTitle>
EXIST
</SectionTitle>
    <Paragraph position="0"> Goal: to display the availability of lodging and dining possibilities.</Paragraph>
    <Paragraph position="1"> Example: Var finns det vandrarhem? (Where are there hostels?) OUT_OF_DOMAIN Goal: to inform the user that the subject is out of the domain for the system.</Paragraph>
    <Paragraph position="2"> Example: Kan jag boka rum? (Can I book a room?) Each semantic feature found in the syntactic and semantic analysis is considered in the form of a conditional probability to decide on the topic. The probability for each topic is expressed as p(topic|F), where F is a feature vector including all semantic features used in the utterance. Thus, the BOAT feature can be a strong indication for the TIME_TABLE topic, but this can be contradicted by a HOTEL feature.</Paragraph>
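    <Paragraph> The text does not say how p(topic|F) is estimated. As one possibility, the Python sketch below uses a naive Bayes model with add-one smoothing in which only the features present in the utterance are scored. The model choice, the function names and the training examples in the usage note are all assumptions, not the method used in the system.
<![CDATA[
```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (topic, set_of_features) pairs."""
    topic_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for topic, features in examples:
        topic_counts[topic] += 1
        for f in features:
            feature_counts[topic][f] += 1
            vocabulary.add(f)
    return topic_counts, feature_counts, vocabulary


def topic_posteriors(features, model):
    """Return {topic: p(topic | F)} under a naive Bayes assumption."""
    topic_counts, feature_counts, vocabulary = model
    total = sum(topic_counts.values())
    log_scores = {}
    for topic, n in topic_counts.items():
        score = math.log(n / total)  # prior p(topic)
        for f in features:
            if f in vocabulary:
                # add-one smoothed estimate of p(feature present | topic)
                score += math.log((feature_counts[topic][f] + 1) / (n + 2))
        log_scores[topic] = score
    # Normalise so that the posteriors sum to one.
    norm = math.log(sum(math.exp(s) for s in log_scores.values()))
    return {t: math.exp(s - norm) for t, s in log_scores.items()}
```
]]>
 Trained on two TIME_TABLE utterances containing BOAT and two EXIST utterances containing HOTEL, this model gives TIME_TABLE a posterior of 0.75 for an utterance with BOAT alone and 0.5 when HOTEL is present as well, which mirrors the way one feature can contradict another.</Paragraph>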
    <Paragraph position="3"> 6.3. Introduction of a new topic The rule-based and to some extent probabilistic approach we are exploring makes the addition of new topics relatively easy. However, we do not know at this stage where the limits are for this approach. In this section we will give a simple example of how a new topic can be introduced.</Paragraph>
    <Paragraph position="4"> Suppose we want to create a topic called &quot;out of domain.&quot; Figure 5 illustrates the steps that need to be taken. First a topic node is introduced in the rule system. Some words will need to be included in the lexicon and labelled with a semantic feature showing that the system does not know how to deal with the subjects these words relate to. Then a synthesis node might be added with a text informing the user about the situation. Example sentences must be created that illustrate the problem. The dialog parser must be trained with these sentences labelled with the &quot;out of domain&quot; topic.</Paragraph>
    <Paragraph position="5"> Since the topic selection is done by a probabilistic approach that needs application-specific training, data collection is of great importance for the progress of the project.</Paragraph>
    <Paragraph position="6"> [Figure 5. How to introduce a new topic: introduce a new dialog grammar parent node; expand the semantic feature set if needed; specify dialog children nodes and their function and add to ...] The dialog will be naturally restricted by application-specific capabilities and the limited grammar. So far we also assume that the human subjects will be co-operative in pursuing the task. Recovery in case of human-machine &quot;misunderstandings&quot; will be aided by informative error messages generated upon the occurrence of lexical, parsing or retrieval errors. This technique has been shown to be useful in helping subjects to recover from an error through rephrasing of their last input [12].</Paragraph>
  </Section>
  <Section position="12" start_page="210" end_page="211" type="metho">
    <SectionTitle>
7. DATA COLLECTION
</SectionTitle>
    <Paragraph position="0"> We are currently collecting speech and text data using the WAXHOLM system. Initially, a &quot;Wizard of Oz&quot; (a human simulating part of a system) is replacing the speech recognition module (see Figure 6). The user is placed in a sound-treated room in front of a terminal screen. The wizard sitting outside the room can observe the subject's screen on a separate display.</Paragraph>
    <Paragraph position="1"> The user is initially requested to pronounce a number of sentences and digit sequences to practice talking to a computer.</Paragraph>
    <Paragraph position="2"> This material will be used for speaker adaptation experiments.</Paragraph>
    <Paragraph position="3"> After this the subject is presented with a task to be carried out.</Paragraph>
    <Paragraph position="4"> The scenario is presented both as text and as synthetic speech.</Paragraph>
    <Paragraph position="5"> An advantage of this procedure is that the subject becomes familiar with the synthetic speech. During the data collection, utterance-size speech files are stored together with the transcribed text entered by the wizard.</Paragraph>
    <Paragraph position="6"> The stored speech files and their associated label files are processed by our text-to-speech system to generate a possible phonetic transcription. This transcription is then aligned and manually corrected. (For a description of this process see [13].) The collected corpus is being used for grammar development, for training of probabilities in the language model in STINA, and also for generation of an application-dependent bigram model to be used by the recognizer. It is also being used to train word collocation probabilities. Our plan is to replace explicit formulations of semantic coupling by a collocation probability matrix.</Paragraph>
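    <Paragraph> Generating a bigram model from the transcribed utterances can be sketched as follows in Python. The sketch assumes an unsmoothed relative-frequency estimate with sentence boundary markers; the marker names BOS and EOS, the function name and the sample sentences are hypothetical, and a model used by a recognizer would normally add smoothing for unseen word pairs.
<![CDATA[
```python
from collections import Counter


def train_bigrams(sentences):
    """Relative-frequency bigram model with sentence boundary markers.

    sentences: list of token lists. Returns a function p(word, previous).
    No smoothing is applied, so unseen bigrams get probability 0.
    """
    context_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        padded = ["BOS"] + list(tokens) + ["EOS"]
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1

    def probability(word, previous):
        if context_counts[previous] == 0:
            return 0.0
        return bigram_counts[(previous, word)] / context_counts[previous]

    return probability
```
]]>
 Trained on the two token lists ["nar", "gar", "baten"] and ["nar", "gar", "farjan"], the returned function gives probability 1.0 to "gar" after "nar" and 0.5 to "baten" after "gar".</Paragraph>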
  </Section>
</Paper>