<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1044"> <Title>Combining Acoustic and Pragmatic Features to Predict Recognition Performance in Spoken Dialogue Systems</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The WITAS Dialogue System </SectionTitle> <Paragraph position="0"> The WITAS dialogue system (Lemon et al., 2002) is a multimodal command and control dialogue system that allows a human operator to interact with a simulated &quot;unmanned aerial vehicle&quot; (UAV): a small robotic helicopter. The human operator is provided with a GUI - an interactive (i.e. mouse-clickable) map - and specifies mission goals using natural language commands spoken into a headset, or by using combinations of GUI actions and spoken commands. The simulated UAV can carry out different activities such as flying to locations, following vehicles, and delivering objects. The dialogue system uses the Nuance 8.0 speech recognizer with language models compiled from a grammar (written using the Gemini system (Dowding et al., 1993)), which is also used for parsing and generation.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 WITAS Information States </SectionTitle> <Paragraph position="0"> The WITAS dialogue system is part of a larger family of systems that implement the Information State Update (ISU) approach to dialogue management (Traum et al., 1999). The ISU approach has been used to formalize different theories of dialogue and forms the basis of several dialogue system implementations in domains such as route planning, home automation, and tutorial dialogue. The ISU approach is a particularly useful testbed for our technique because it collects information relevant to dialogue context in a central data structure from which it can be easily extracted. (Lemon et al., 2002) describe in detail the components of Information States (IS) and the update procedures for processing user input and generating system responses.</Paragraph> <Paragraph position="1"> Here, we briefly introduce the parts of the IS which are needed to understand the system's basic workings, and from which we will extract dialogue-level and task-level information for our learning experiments: * Dialogue Move Tree (DMT): a tree-structure, in which each subtree of the root node represents a &quot;thread&quot; in the conversation, and where each node in a subtree represents an utterance made either by the system or the user (for further details, see (Lemon and Gruenstein, 2004)).</Paragraph> <Paragraph position="2"> * Active Node List (ANL): a list of nodes in the DMT that indicate conversational contributions that are still in some sense open, and to which new utterances can attach.</Paragraph> <Paragraph position="3"> * Activity Tree (AT): a tree-structure representing the current, past, and planned activities that the back-end system (in this case a UAV) performs.</Paragraph> <Paragraph position="4"> * Salience List (SL): a list of NPs introduced in the current dialogue, ordered by recency.</Paragraph> <Paragraph position="5"> * Modality Buffer (MB): a temporary store that registers click events on the GUI.</Paragraph> <Paragraph position="6"> The DMT and AT are the core components of Information States. The SL and MB are subsidiary data-structures needed for interpreting and generating anaphoric expressions and definite NPs. Finally, the ANL plays a crucial role in integrating new user utterances into the DMT.</Paragraph> </Section> </Section>
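The Information State components listed above can be pictured as a handful of simple data structures. The following Python sketch is purely illustrative: the class and field names are our own (not the identifiers of the actual WITAS implementation), and the AT, SL, and MB are reduced to flat lists.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DialogueMoveNode:
    """A node in the Dialogue Move Tree (DMT): one system or user utterance."""
    speaker: str        # "system" or "user"
    move_type: str      # e.g. "command", "yn-question", "yn-answer"
    utterance: str
    children: List["DialogueMoveNode"] = field(default_factory=list)

@dataclass
class InformationState:
    """Minimal sketch of a WITAS-style Information State (illustrative names only)."""
    dmt_threads: List[DialogueMoveNode] = field(default_factory=list)       # DMT: one subtree per conversational thread
    active_node_list: List[DialogueMoveNode] = field(default_factory=list)  # ANL: open contributions, most active first
    activity_tree: List[str] = field(default_factory=list)                  # AT: current/past/planned UAV activities (flattened)
    salience_list: List[str] = field(default_factory=list)                  # SL: NPs ordered by recency
    modality_buffer: List[str] = field(default_factory=list)                # MB: recent GUI click events

    def most_active_node(self) -> Optional[DialogueMoveNode]:
        """The node at the top of the ANL, used for grammar switching (Section 5.1)."""
        return self.active_node_list[0] if self.active_node_list else None
```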
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Data Collection </SectionTitle> <Paragraph position="0"> For our experiments, we use data collected in a small user study with the grammar-switching version of the WITAS dialogue system (Lemon, 2004).</Paragraph> <Paragraph position="1"> In this study, six subjects from Edinburgh University (4 male, 2 female) had to solve five simple tasks with the system, resulting in 30 complete dialogues.</Paragraph> <Paragraph position="2"> The subjects' utterances were recorded as 8kHz, 16bit waveform files, and all aspects of the Information State transitions during the interactions were logged as HTML files. Altogether, 303 utterances were recorded in the user study (≈ 10 user utterances per dialogue).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Labeling </SectionTitle> <Paragraph position="0"> We transcribed all user utterances and parsed the transcriptions offline using WITAS' natural language understanding component in order to get a gold-standard labeling of the data. Each utterance was labeled as either in-grammar or out-of-grammar (oog), depending on whether or not its transcription could be parsed, or as crosstalk: a special marker indicating that the input was not directed at the system (e.g. noise, laughter, self-talk, or the system accidentally recording itself). For all in-grammar utterances we stored their interpretations (quasi-logical forms) as computed by WITAS' parser. Since the parser uses a domain-specific semantic grammar designed for this particular application, each in-grammar utterance has an interpretation that is &quot;correct&quot; with respect to the WITAS application.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Simplifying Assumptions </SectionTitle> <Paragraph position="0"> The evaluations in the following sections make two simplifying assumptions. First, we consider a user utterance correctly recognized only if the logical form of the transcription is the same as the logical form of the recognition hypothesis. This assumption may be too strong because the system might react appropriately even if the logical forms are not literally the same. Second, if a transcribed utterance is out-of-grammar, we assume that the system cannot react appropriately. Again, this assumption might be too strong because the recognizer can accidentally map an utterance to a logical form that is equivalent to the one intended by the user.</Paragraph> </Section> </Section>
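As a rough illustration of how the gold-standard labels of Section 4.1 and the first simplifying assumption of Section 4.2 fit together, the sketch below labels an utterance from its transcription and counts a hypothesis as correctly recognized only if its logical form equals that of the transcription. The toy grammar and all names are invented for the example; the real system uses the Gemini-based WITAS parser and quasi-logical forms.

```python
from typing import Optional

# Toy stand-in for the WITAS parser (illustrative only): maps a few strings to
# invented logical forms and returns None for out-of-grammar input.
TOY_GRAMMAR = {
    "fly to the tower": "(command (fly (dest tower)))",
    "yes": "(yn-answer yes)",
}

def parse(text: str) -> Optional[str]:
    return TOY_GRAMMAR.get(text.lower().strip())

def gold_label(transcription: str, directed_at_system: bool) -> str:
    """Gold-standard label of a user utterance (Section 4.1)."""
    if not directed_at_system:
        return "crosstalk"   # noise, laughter, self-talk, the system hearing itself
    return "in-grammar" if parse(transcription) is not None else "oog"

def correctly_recognized(transcription: str, hypothesis: str) -> bool:
    """First simplifying assumption (Section 4.2): a hypothesis only counts as
    correct if its logical form is identical to that of the transcription."""
    lf_gold, lf_hyp = parse(transcription), parse(hypothesis)
    return lf_gold is not None and lf_gold == lf_hyp
```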
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 The Baseline System </SectionTitle> <Paragraph position="0"> The baseline for our experiments is the behavior of the WITAS dialogue system that was used to collect the experimental data (using dialogue context as a predictor of language models for speech recognition, see below). We chose this baseline because it has been shown to perform significantly better than an earlier version of the system that always used the same (i.e. full) grammar for recognition (Lemon, 2004).</Paragraph> <Paragraph position="1"> We evaluate the performance of the baseline by analyzing the dialogue logs from the user study.</Paragraph> <Paragraph position="2"> With this information, it is possible to decide how the system reacted to each user utterance. We distinguish between the following three cases: 1. accept: the system accepted the recognition hypothesis of a user utterance as correct.</Paragraph> <Paragraph position="3"> 2. reject: the system rejected the recognition hypothesis of a user utterance given a fixed confidence rejection threshold.</Paragraph> <Paragraph position="4"> 3. ignore: the system did not react to a user utterance at all.</Paragraph> <Paragraph position="5"> These three classes map naturally to the gold-standard labels of the transcribed user utterances: the system should accept in-grammar utterances, reject out-of-grammar input, and ignore crosstalk.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Context-sensitive Speech Recognition </SectionTitle> <Paragraph position="0"> In the WITAS dialogue system, the &quot;grammar-switching&quot; approach to context-sensitive speech recognition (Lemon, 2004) is implemented using the ANL. At any point in the dialogue, there is a &quot;most active node&quot; at the top of the ANL. The dialogue move type of this node defines the name of a language model that is used for recognizing the next user utterance. For instance, if the most active node is a system yes-no-question, then the appropriate language model is defined by a small context-free grammar covering phrases such as &quot;yes&quot;, &quot;that's right&quot;, &quot;okay&quot;, &quot;negative&quot;, &quot;maybe&quot;, and so on. The WITAS dialogue system with context-sensitive speech recognition showed significantly better recognition rates than a previous version of the system that used the full grammar for recognition at all times ((Lemon, 2004) reports an 11.5% reduction in overall utterance recognition error rate). Note, however, that an inherent danger with grammar-switching is that the system may have wrong expectations and thus might activate a language model which is not appropriate for the user's next utterance, leading to misrecognitions or incorrect rejections.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Results </SectionTitle> <Paragraph position="0"> Table 1 summarizes the evaluation of the baseline system.</Paragraph> <Paragraph position="1"> Table 1 should be read as follows: looking at the first row, in 154 cases the system understood and accepted the correct logical form of an in-grammar utterance by the user. In 22 cases, the system accepted a logical form that differed from the one for the transcribed utterance (for the computation of accuracy and weighted f-scores, these were counted as wrongly accepted out-of-grammar utterances). In 8 cases, the system rejected an in-grammar utterance, and in 4 cases it did not react to an in-grammar utterance at all. The second row of Table 1 shows that the system accepted 45, rejected 43, and ignored 4 user utterances whose transcriptions were out-of-grammar and could not be parsed. Finally, the third row of the table shows that the baseline system accepted 12 utterances that were not addressed to it, rejected 9, and ignored 2. Table 1 shows that a major problem with the baseline system is that it accepts too many user utterances. In particular, the baseline system accepts the wrong interpretation for 22 in-grammar utterances, 45 utterances which it should have rejected as out-of-grammar, and 12 utterances which it should have ignored. All of these cases will generally lead to unintended actions by the system.</Paragraph> </Section> </Section>
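The grammar-switching mechanism of Section 5.1 amounts to a lookup from the dialogue move type of the most active node to a language model, with the full grammar as a fallback. The sketch below is a schematic reconstruction that reuses the Information State sketch given earlier; the model names are our own assumptions, not the Nuance grammar identifiers actually used by WITAS.

```python
# Illustrative mapping from the most active node's dialogue move type to the name
# of a pre-compiled language model; all model names are invented for this sketch.
DM_TYPE_TO_LM = {
    "yn-question": "lm_yes_no_answers",    # small CFG covering "yes", "that's right", "negative", ...
    "wh-question": "lm_wh_answers",
    "command":     "lm_command_followups",
}
FULL_GRAMMAR_LM = "lm_full_grammar"        # fallback: the full recognition grammar

def next_language_model(information_state) -> str:
    """Pick the language model used to recognize the next user utterance."""
    node = information_state.most_active_node()   # see the Information State sketch above
    if node is None:
        return FULL_GRAMMAR_LM
    return DM_TYPE_TO_LM.get(node.move_type, FULL_GRAMMAR_LM)
```

If the system's expectation is wrong (e.g. the user does not answer the yes-no-question), the chosen model will not cover the utterance, which is exactly the failure mode noted at the end of Section 5.1.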
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Classifying and Selecting N-best Recognition Hypotheses </SectionTitle> <Paragraph position="0"> We aim at improving over the baseline results by considering the n-best recognition hypotheses for each user utterance. Our methodology consists of two steps: i) we automatically classify the n-best recognition hypotheses for an utterance as either correctly or incorrectly recognized, and ii) we use a simple selection procedure to choose the &quot;best&quot; hypothesis based on this classification. In order to get multiple recognition hypotheses for all utterances in the experimental data, we re-ran the speech recognizer with the full recognition grammar and 10-best output and processed the results offline with WITAS' parser, obtaining a logical form for each recognition hypothesis (every hypothesis has a logical form since the language models are compiled from the parsing grammar).</Paragraph> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Hypothesis Labeling </SectionTitle> <Paragraph position="0"> We labeled all hypotheses with one of the following four classes, based on the manual transcriptions of the experimental data: in-grammar, oog (WER ≤ 50), oog (WER > 50), or crosstalk. The in-grammar and crosstalk classes correspond to those described for the baseline. However, we decided to divide the out-of-grammar class into the two classes oog (WER ≤ 50) and oog (WER > 50) to get a more fine-grained classification. In order to assign hypotheses to the two oog classes, we compute the word error rate (WER) between each recognition hypothesis and the transcription of the corresponding user utterance.</Paragraph> <Paragraph position="1"> If the WER is ≤ 50%, we label the hypothesis as oog (WER ≤ 50), otherwise as oog (WER > 50).</Paragraph> <Paragraph position="2"> We also annotate all misrecognized hypotheses of in-grammar utterances with their respective WER scores.</Paragraph> <Paragraph position="3"> The motivation behind splitting the out-of-grammar class into two subclasses, and for annotating misrecognized in-grammar hypotheses with their WER scores, is that we want to distinguish between different &quot;degrees&quot; of misrecognition that can be used by the dialogue system to decide whether it should initiate clarification instead of rejection. (The current system does not carry out this type of clarification dialogue; the WER annotations are therefore only of theoretical interest. However, an extended system could easily use this information to decide when clarification should be initiated.)</Paragraph> <Paragraph position="4"> We use a threshold (50%) on a hypothesis' WER as an indicator for whether hypotheses should be clarified or rejected. This is adopted from (Gabsdil, 2003), based on the fact that WER correlates with concept accuracy (CA, (Boros et al., 1996)).</Paragraph> <Paragraph position="5"> The WER threshold can be set differently according to the needs of an application. Ideally, one would set a threshold directly on CA scores for this labeling, but these are currently not available for our data.</Paragraph> <Paragraph position="6"> We also introduce the distinction between out-of-grammar (WER ≤ 50) and out-of-grammar (WER > 50) in the gold standard for the classification of (whole) user utterances. We split the out-of-grammar class into two sub-classes depending on whether the 10-best recognition results include at least one hypothesis with a WER ≤ 50 compared to the corresponding transcription. Thus, if there is a recognition hypothesis which is close to the transcription, an utterance is labeled as oog (WER ≤ 50). In order to relate these classes to different system behaviors, we define that utterances labeled as oog (WER ≤ 50) should be clarified and utterances labeled as oog (WER > 50) should be rejected by the system. The same is done for all in-grammar utterances for which only misrecognized hypotheses are available.</Paragraph> </Section>
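The two oog classes can be assigned mechanically once the word error rate between a hypothesis and the corresponding transcription is known. The sketch below uses a standard word-level edit-distance WER and the 50% threshold from the text; it covers only the oog split, not the full four-way labeling, and the function names are ours.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

def oog_class(hypothesis: str, transcription: str) -> str:
    """Split out-of-grammar hypotheses by the 50% WER threshold of Section 6.1."""
    return "oog (WER <= 50)" if word_error_rate(transcription, hypothesis) <= 50.0 else "oog (WER > 50)"

# Example: one substitution and one deletion against a five-word transcription
# give a WER of 40%, so the hypothesis falls into the oog (WER <= 50) class.
print(oog_class("fly to the car", "fly to the tower now"))
```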
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Classification: Feature Groups </SectionTitle> <Paragraph position="0"> We represent recognition hypotheses as 20-dimensional feature vectors for automatic classification. The feature vectors combine recognizer confidence scores, low-level acoustic information, information from the WITAS system's Information States, and domain knowledge about the different tasks in the scenario. The following list gives an overview of all features (described in more detail below).</Paragraph> <Paragraph position="1"> 1. Recognition (6): nbestRank, hypothesisLength, confidence, confidenceZScore, confidenceStandardDeviation, minWordConfidence 2. Utterance (3): minAmp, meanAmp, RMS-amp 3. Dialogue (9): currentDM, currentCommand, mostActiveNode, DMBigramFrequency, qaMatch, aqMatch, #unresolvedNPs, #unresolvedPronouns, #uniqueIndefinites 4. Task (2): taskConflict, #taskConstraintConflict. All features are extracted automatically from the output of the speech recognizer, utterance waveforms, IS logs, and a small library of plan operators describing the actions the UAV can perform. The recognition (REC) feature group includes the position of a hypothesis in the n-best list (nbestRank), its length in words (hypothesisLength), and five features representing the recognizer's confidence assessment. Similar features have been used in the literature (e.g. (Litman et al., 2000)). The minWordConfidence and standard deviation/zScore features are computed from individual word confidences in the recognition output. We expect them to help the machine learners decide between the different WER classes (e.g. a high overall confidence score can sometimes be misleading). The utterance (UTT) feature group reflects information about the amplitude in the speech signal (all features are extracted with the UNIX sox utility). The motivation for including the amplitude features is that they might be useful for detecting crosstalk utterances which are not directly spoken into the headset microphone (e.g. the system accidentally recognizing itself).</Paragraph>
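The amplitude features themselves are computed with the UNIX sox utility in the paper. As an illustration of what minAmp, meanAmp, and RMS-amp measure, the sketch below derives comparable statistics directly from a waveform with the Python standard library; it is not the original extraction script, and sox's stat effect may define the statistics slightly differently.

```python
import math
import wave

def amplitude_features(path: str) -> dict:
    """Rough equivalents of the minAmp, meanAmp, and RMS-amp features (sketch).
    Assumes a mono 16-bit PCM wav file, like the 8kHz recordings of the study."""
    with wave.open(path, "rb") as w:
        raw = w.readframes(w.getnframes())
    # Interpret the bytes as signed 16-bit little-endian samples scaled to [-1, 1).
    samples = [int.from_bytes(raw[i:i + 2], "little", signed=True) / 32768.0
               for i in range(0, len(raw), 2)]
    if not samples:
        return {"minAmp": 0.0, "meanAmp": 0.0, "RMS-amp": 0.0}
    return {
        "minAmp": min(samples),
        "meanAmp": sum(samples) / len(samples),
        "RMS-amp": math.sqrt(sum(s * s for s in samples) / len(samples)),
    }
```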
<Paragraph position="2"> The dialogue features (DIAL) represent information derived from Information States and can be coarsely divided into two sub-groups. The first group includes features representing general coherence constraints on the dialogue: the dialogue move types of the current utterance (currentDM) and of the most active node in the ANL (mostActiveNode), the command type of the current utterance (currentCommand, if it is a command, null otherwise), statistics on which move types typically follow each other (DMBigramFrequency), and two features (qaMatch and aqMatch) that explicitly encode whether the current and the previous utterance form a valid question-answer pair (e.g. a yn-question followed by a yn-answer). The second group includes features that indicate how many definite NPs and pronouns cannot be resolved in the current Information State (#unresolvedNPs, #unresolvedPronouns, e.g. &quot;the car&quot; if no car was mentioned before) and a feature indicating the number of indefinite NPs that can be uniquely resolved in the Information State (#uniqueIndefinites, e.g. &quot;a tower&quot; where there is only one tower in the domain). We include these features because (short) determiners are often confused by speech recognizers. In the WITAS scenario, a misrecognized determiner/demonstrative pronoun can lead to confusing system behavior (e.g. a wrongly recognized &quot;there&quot; will cause the system to ask &quot;Where is that?&quot;). Finally, the task features (TASK) reflect conflicting instructions in the domain. The feature taskConflict indicates a conflict if the current dialogue move type is a command and that command already appears as an active task in the AT. #taskConstraintConflict counts the number of conflicts that arise between the currently active tasks in the AT and the hypothesis. For example, if the UAV is already flying somewhere, the preconditions of the action operator for take off (altitude = 0) conflict with those for fly (altitude ≠ 0), so that &quot;take off&quot; would be an unlikely command in this context.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.3 Learners and Selection Procedure </SectionTitle> <Paragraph position="0"> We use the memory-based learner TiMBL (Daelemans et al., 2002) and the rule induction learner RIPPER (Cohen, 1995) to predict the class of each of the 10-best recognition hypotheses for a given utterance. We chose these two learners because they implement different learning strategies, are well established, fast, freely available, and easy to use. In a second step, we decide which (if any) of the classified hypotheses we actually want to pick as the best result and how the user utterance should be classified as a whole. This task is decided by the following selection procedure (see Figure 1; a schematic implementation is sketched below), which implements a preference ordering accept > clarify > reject > ignore. (Note that in a dialogue application one would not always need to classify all n-best hypotheses in order to select a result, but could stop as soon as a hypothesis is classified as correct, which can save processing time.) 1. Scan the list of classified n-best recognition hypotheses top-down. Return the first result that is classified as accept and classify the utterance as accept.</Paragraph> <Paragraph position="1"> 2. If 1. fails, scan the list of classified n-best recognition hypotheses top-down. Return the first result that is classified as clarify and classify the utterance as clarify.</Paragraph> <Paragraph position="2"> 3. If 2. fails, count the number of rejects and ignores in the classified recognition hypotheses. If the number of rejects is greater than or equal to the number of ignores, classify the utterance as reject.</Paragraph> <Paragraph position="3"> 4. Else classify the utterance as ignore.</Paragraph>
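The selection procedure translates directly into code. The sketch below assumes the classifier output is available as a list of (hypothesis, predicted class) pairs in n-best order; the function name and data layout are illustrative, not taken from the original implementation.

```python
from typing import List, Optional, Tuple

def select_hypothesis(classified: List[Tuple[str, str]]) -> Tuple[Optional[str], str]:
    """Implements the preference ordering accept > clarify > reject > ignore.

    `classified` is the n-best list in recognizer order, each entry a pair
    (hypothesis, predicted class) with the class being one of
    "accept", "clarify", "reject", or "ignore".
    Returns (selected hypothesis or None, utterance-level class)."""
    # Steps 1 and 2: return the first hypothesis classified as accept, else clarify.
    for target in ("accept", "clarify"):
        for hypothesis, predicted in classified:
            if predicted == target:
                return hypothesis, target
    # Steps 3 and 4: no usable hypothesis; reject unless ignores outnumber rejects.
    rejects = sum(1 for _, predicted in classified if predicted == "reject")
    ignores = sum(1 for _, predicted in classified if predicted == "ignore")
    return None, ("reject" if rejects >= ignores else "ignore")
```

For instance, if none of the ten hypotheses is classified as accept or clarify and there are six rejects against four ignores, no hypothesis is returned and the utterance as a whole is rejected.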
<Paragraph position="4"> This procedure is applied to choose from the classified n-best hypotheses for an utterance, independent of the particular machine learner, in all of the following experiments.</Paragraph> <Paragraph position="5"> Since we have a limited amount of experimental data in this study (10 hypotheses for each of the 303 user utterances), we use a &quot;leave-one-out&quot; cross-validation setup for classification. This means that we classify the 10-best hypotheses for a particular utterance based on the 10-best hypotheses of all 302 other utterances and repeat this 303 times.</Paragraph> </Section> </Section> </Paper>