<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1028"> <Title>Non-Native Users in the Let's Go!! Spoken Dialogue System: Dealing with Linguistic Mismatch</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Overview of the System 2.1 The CMU Let's Go!! Bus Information System </SectionTitle> <Paragraph position="0"> In order to study the use of spoken dialogue systems by non-native speakers in a realistic setting, we built Let's Go!!, a spoken dialogue system that provides bus schedule information for the Pittsburgh area(Raux et al., 2003).</Paragraph> <Paragraph position="1"> As shown in Figure 1, the system is composed of five basic modules: the speech recognizer, the parser, the dialog manager, the language generator, and the speech synthesizer. Speech recognition is performed by the Sphinx II speech recognizer (Huang et al., 1992). The Phoenix parser (Ward and Issar, 1994) is in charge of natural language understanding. The dialogue manager is based on the RavenClaw framework (Bohus and Rudnicky, 2003). Natural language generation is done by a simple template-based generation module, and speech synthesis by the Festival speech synthesis system (Black et al., 1998). The original system uses a high quality limited-domain voice recorded especially for the project but for some experiments, lower quality, more flexible voices formation system.</Paragraph> <Paragraph position="2"> have been used. All modules communicate through the Galaxy-II (Seneff et al., 1998) framework.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Definition of the Domain </SectionTitle> <Paragraph position="0"> The Port Authority of Allegheny County, which manages the buses in Pittsburgh provided the full database of bus routes and schedules. Overall, this database contains more than 10,000 bus stops but we restricted our system to 5 routes and 559 bus stops in areas where international students are likely to travel since they are our main target population at present.</Paragraph> <Paragraph position="1"> In order to improve speech recognition accuracy, we concatenated the words in the name of each bus stop (e.g. &quot;Fifth and Grant&quot;) and made them into a single entry in the recognizer's lexicon. Because there are usually several variant names for each bus stop and since we included other places such as landmarks and neighborhoods, the total size of the lexicon is 9914 words.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Data Collection Experiments </SectionTitle> <Paragraph position="0"> To gather enough data to train and test acoustic and language models, we had the system running, advertising it to international students at our university, as well as conducting several studies. In those studies, we gave scenarios to the participants in the form of a web page with maps indicating the places of departure and destination, as well as additional time and/or route preferences. There was as little written English as possible in the description of the scenarios to prevent influencing the language habits of the participants. Participants then called the system over the phone to get the required information. One experiment conducted in June 2003 netted 119 calls from 11 different non-native speakers (5 of them were from India and 6 from Japan), as well as 25 calls from 4 native speakers of American English. 
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Data Collection Experiments </SectionTitle>
<Paragraph position="0"> To gather enough data to train and test acoustic and language models, we kept the system running and advertised it to international students at our university, in addition to conducting several controlled studies. In those studies, we gave scenarios to the participants in the form of a web page with maps indicating the places of departure and destination, as well as additional time and/or route preferences. The scenario descriptions contained as little written English as possible, to avoid influencing the language habits of the participants. Participants then called the system over the phone to get the required information. One experiment conducted in June 2003 netted 119 calls from 11 different non-native speakers (5 from India and 6 from Japan), as well as 25 calls from 4 native speakers of American English. Another experiment in August 2003 yielded 47 calls from 6 non-native speakers of various linguistic backgrounds.</Paragraph>
<Paragraph position="1"> The rest of the non-native data comes from unsolicited individual callers labelled as non-native by a human annotator who transcribed their speech. The total size of the spontaneous non-native corpus is 1757 utterances.</Paragraph>
</Section> </Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Recognition and Understanding of Non-Native Speech </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Recognition Accuracy </SectionTitle>
<Paragraph position="0"> We used acoustic models trained on data consisting of phone calls to the CMU Communicator system (Rudnicky et al., 2000). The data was split into gender-specific sets and corresponding models were built. At recognition time, the system runs the two sets of models in parallel and, for each utterance, selects the result with the highest recognition score, as computed by Sphinx. The language model is a class-based trigram model built on 3074 utterances from past calls to the Let's Go!! system, in which place names, time expressions and bus route names are each replaced by a generic class name to compensate for the lack of training data.</Paragraph>
<Paragraph position="1"> In order to evaluate the performance of these models on native and non-native speakers, we used 449 utterances from non-native users (from the August experiment and the unsolicited calls) and 452 from native users of the system. The results of recognition on the two data sets are given in Table 1. [Table 1 caption fragment: ... native language model on native and non-native data] Even for native speakers, performance was not very high, with a word error rate of 20.4%. Yet, this is acceptable given the small amount of training data for the language model and the conversational nature of the speech. However, performance degrades significantly for non-native speakers, with a word error rate of 52.0%. The two main potential reasons for this loss are acoustic mismatch and linguistic mismatch. Acoustic mismatch arises from the variations between the native speech on which the acoustic models were trained and non-native speech, which often include different accents and pronunciations. Linguistic mismatch, on the other hand, stems from variations or errors in syntax and word choice between the native corpus on which the language model was trained and non-native speech.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Impact of Linguistic Mismatch on the Performance of the Language Model </SectionTitle>
<Paragraph position="0"> To analyze the effect of linguistic mismatch, we compared the number of out-of-vocabulary words (OOVs) and the perplexity of the model on the transcriptions of the test utterances. Table 2 shows the results. The statistical significance of the differences between the native and non-native sets is computed using the chi-square test for equality of distributions. The percentage of OOVs is 3.09% for non-native speakers, more than 2.5 times higher than it is for native speakers, which shows the difference in word choices made by each population.</Paragraph>
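The OOV comparison and significance test just described can be sketched as follows. This is a minimal sketch under stated assumptions: whitespace-tokenized transcriptions, a fixed language-model vocabulary, and SciPy's chi-square test on a 2x2 contingency table as one plausible reading of the chi-square test for equality of distributions; the authors' exact procedure is not specified.

# Minimal sketch (assumed data layout, not the authors' scripts): compare the
# OOV rate of native vs. non-native transcriptions against the LM vocabulary
# and test whether the difference is statistically significant.
from scipy.stats import chi2_contingency

def oov_counts(transcripts, vocab):
    """Return (OOV token count, total token count) over a list of utterances."""
    oov = total = 0
    for utt in transcripts:
        for token in utt.lower().split():
            total += 1
            if token not in vocab:
                oov += 1
    return oov, total

def compare_oov_rates(native_utts, nonnative_utts, vocab):
    n_oov, n_tot = oov_counts(native_utts, vocab)
    nn_oov, nn_tot = oov_counts(nonnative_utts, vocab)
    # 2x2 contingency table: rows are speaker groups, columns are OOV vs. in-vocabulary.
    table = [[n_oov, n_tot - n_oov], [nn_oov, nn_tot - nn_oov]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return {"native_oov_rate": n_oov / n_tot,
            "nonnative_oov_rate": nn_oov / nn_tot,
            "p_value": p_value}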
<Paragraph position="1"> Such differences include words that are correctly used but are not frequent in native speech. For example, when referring to bus stops by street intersections, all native speakers in our training set simply used &quot;A and B&quot;, hence the word &quot;intersection&quot; was not in the language model. On the other hand, many non-native speakers used the full expression &quot;the intersection of A and B&quot;. Note that differences inside the place name itself (e.g. &quot;A and B&quot; vs &quot;A at B&quot;) are abstracted away by the class-based model, since all variants are replaced by the same class name (words like &quot;intersection&quot; and &quot;corner&quot; were kept out of the class to reduce the number of elements in the &quot;place&quot; class). In other cases, non-native speakers used inappropriate words, such as &quot;bus timing&quot; for &quot;bus schedule&quot;, which were not in the language model. Ultimately, OOVs affect 14.0% of the non-native utterances, as opposed to 5.9% of the native utterances. This matters because an utterance containing an OOV is more likely to contain recognition errors even on its in-vocabulary words, as the OOV prevents the language model from accurately matching the utterance. The differences between the native and non-native sets in both the OOV rate and the ratio of utterances containing OOVs were statistically significant.</Paragraph>
<Paragraph position="2"> We computed the perplexity of the model on the utterances that did not contain any OOV. The perplexity of the model on this subset of the non-native test set is 36.55, 59.7% higher than that on the native set. This reflects differences in syntax and in the constructions selected by the two populations. For example, although native speakers almost always used the same expression to request a bus departure time (&quot;When does the bus leave ...?&quot;), non-natives used a wider variety of sentences (e.g. &quot;Which time I have to leave?&quot;, &quot;What the next bus I have to take?&quot;). Both the difference between native and non-native language and the larger variability of non-native language account for the larger perplexity of the model over the non-native set. This result seems to disagree with what (Wang and Schultz, 2003) found in their study, where the perplexity was larger on the native set. Unfortunately, they do not describe the data used to train their language model, so it is hard to draw firm conclusions. One main difference, however, is that their experiment focused only on German speakers of English, whereas we collected data from a much more diverse population.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Impact of the Linguistic Mismatch on Language Understanding </SectionTitle>
<Paragraph position="0"> The Phoenix parser used in the natural language understanding module of the system is a robust, context-free grammar-based parser. Grammar rules, including optional words, are compiled into a grammar network that is used to parse user input. When no complete parse is found, which is often the case with spoken language, Phoenix looks for partial parses and returns the parse forest that it is most confident in. Confidence is based on internal measures such as the number of words covered by the parses and the number of parse trees in the parse forest (for an equal number of covered words, a smaller number of parse trees is preferred).</Paragraph>
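The confidence heuristic just described (prefer greater word coverage, and for equal coverage prefer fewer parse trees) can be expressed as a simple ranking. The snippet below is a sketch of the stated preference ordering only, not Phoenix's actual internals.

# Sketch of the parse-forest selection heuristic described above (not the
# real Phoenix code): more covered words is better; for equal coverage,
# fewer parse trees is better.
from dataclasses import dataclass

@dataclass
class ParseForest:
    trees: list          # simplified: one element per partial parse tree
    words_covered: int   # number of input words covered by the forest

def most_confident(forests):
    return max(forests, key=lambda f: (f.words_covered, -len(f.trees)))

candidates = [ParseForest(trees=["t1", "t2"], words_covered=5),
              ParseForest(trees=["t1"], words_covered=5),
              ParseForest(trees=["t1", "t2", "t3"], words_covered=4)]
print(most_confident(candidates))   # the 5-word forest made of a single tree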
<Paragraph position="1"> The grammar rules were hand-written by the developers of the system. Initially, since no data was available, choices were made based on their intuition and on a small-scale Wizard-of-Oz experiment. Then, after the first version of the system was made available, the grammar was extended according to actual calls to the system. The grammar has thus undergone continuous change, as is often the case in spoken dialogue systems.</Paragraph>
<Paragraph position="2"> The grammar used in this experiment (the &quot;native&quot; grammar) was designed for native speech, without adaptation to non-native data. It provides full parses of sentences like &quot;When is the next bus going to the airport?&quot;, but also, due to the robustness of the parser, partial parses of ungrammatical sentences like &quot;What time bus leave airport?&quot;. Once compiled, the grammar network consisted of 1537 states and 3076 arcs. The two bottom rows of Table 2 show the performance of the parser on human-transcribed native and non-native utterances.</Paragraph>
<Paragraph position="3"> Both the proportion of words that could be parsed and the proportion of sentences for which a full parse was obtained are larger for native speakers (63.3% and 56.4%, respectively) than for non-native speakers (56.0% and 49.7%), although the relative differences are not as large as those observed for the language model. This can be attributed to the inherent difficulty of the task, since even native speech contains many disfluencies that make it difficult to parse. As a consequence, robust parsers such as Phoenix, which are designed to be flexible enough to handle native disfluencies, can deal with some of the specificities of non-native speech. Yet, the chi-square test shows that the difference between the native and non-native sets is very significant for the ratio of words parsed and mildly so for the ratio of fully parsed sentences. The weak significance of the latter can be partly explained by the small number of utterances in the corpora.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Effect of Additional Non-Native Data on Language Modeling and Parsing </SectionTitle>
<Paragraph position="0"> In order to study the improvement in performance provided by mixing native and non-native data in the language model, we built a second language model (the &quot;mixed&quot; model) by adding to the 3074 sentences of the native model 1308 sentences collected from non-native calls to the system that were not included in the test set. Using this model, we were able to reduce the OOV rate by 56.6% and the perplexity by 23.6% on our non-native test set. While the additional data also improved the performance of the model on native utterances, the improvement was relatively smaller than for non-native speakers (12.1%). As can be seen by comparing Tables 2 and 3, this observation is also true of the OOV rate (56.6% improvement for non-native vs 50.0% for native) and the proportion of sentences with OOVs (43.1% vs 55.7%).</Paragraph>
<Paragraph position="1"> Figure 2 shows the relative improvement due to the mixed LM over the native LM on the native and non-native sets.</Paragraph>
[Caption fragment (Figure 2 or Table 3): ... using a language model and grammar that includes some non-native data over the original purely native model, on transcribed native and non-native speech]
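The construction of the mixed, class-based model can be sketched as follows. This is a minimal illustration under stated assumptions: the class patterns, the tagging function and the plain trigram counts stand in for whatever class inventory, toolkit and smoothing the authors actually used; the only point being illustrated is that non-native sentences are pooled with the native training sentences before the class-based counts are collected.

# Illustrative sketch (not the authors' LM training setup): pool native and
# non-native training sentences, replace place/route/time expressions by
# generic class tokens, and collect trigram counts for a class-based model.
import re
from collections import Counter

# Hypothetical class patterns; the real system presumably derives these from
# its database of stops, routes and time expressions.
CLASS_PATTERNS = [
    ("[place]", re.compile(r"\b(the airport|fifth and grant|downtown)\b")),
    ("[route]", re.compile(r"\b(28x|61c)\b")),
    ("[time]", re.compile(r"\b(\d{1,2}(:\d{2})? ?(am|pm)|noon|midnight)\b")),
]

def apply_classes(sentence):
    """Lower-case, strip simple punctuation and replace concrete
    place/route/time expressions by class tokens."""
    s = re.sub(r"[?.!,]", " ", sentence.lower())
    for token, pattern in CLASS_PATTERNS:
        s = pattern.sub(token, s)
    return s.split()

def trigram_counts(sentences):
    counts = Counter()
    for sent in sentences:
        words = ["BOS", "BOS"] + apply_classes(sent) + ["EOS"]
        for i in range(2, len(words)):
            counts[tuple(words[i - 2:i + 1])] += 1
    return counts

def mixed_model_counts(native_sents, nonnative_sents):
    # The "mixed" model simply pools the two corpora before counting.
    return trigram_counts(list(native_sents) + list(nonnative_sents))

counts = mixed_model_counts(["When does the next 61C leave downtown?"],
                            ["Which time I have to leave the airport?"])
print(counts.most_common(3))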
<Paragraph position="2"> We also evaluated the impact of additional non-native data on natural language understanding. In this case, since we wrote the grammar manually and incrementally over time, it is not possible to directly &quot;add the non-native data&quot; to the grammar. Instead, we compared the June 2003 version of the grammar, which is mostly based on native speech, to its September 2003 version, which contains modifications based on the non-native data collected during the summer. This part of the study is therefore an evaluation of the impact of the grammar revisions made by the authors in response to the additional non-native data. At that point, the compiled grammar had grown to 1719 states and 3424 arcs, an increase of 11.8% and 11.3%, respectively, over the &quot;native&quot; grammar. Modifications include the addition of new words (e.g. &quot;reach&quot; as a synonym of &quot;arrive&quot;), new constructs (e.g. &quot;What is the next bus?&quot;) and the relaxation of some syntactic constraints to accept ungrammatical sentences (e.g. &quot;I want to arrive the airport at five&quot; instead of &quot;I want to arrive at the airport at five&quot;). Using this new grammar, the proportion of words parsed and of sentences fully parsed improved by 10.4% and 11.3%, respectively, for the native set and by 17.3% and 11.7% for the non-native set. We believe that, as for the language model, the reduction in the number of OOVs is the main explanation for the larger improvement in word coverage observed on the non-native set compared to the native set.</Paragraph>
<Paragraph position="3"> The reduction of the difference between the native and non-native sets is also reflected in the weaker significance levels in Table 3 for all ratios except that of fully parsed utterances; larger p-values mean that there is a larger probability that the differences between the ratios are due to spurious differences between the corpora rather than to their (non-)nativeness.</Paragraph>
<Paragraph position="4"> This confirms that, even for populations with a wide variety of linguistic backgrounds, adding non-native data does reduce the linguistic mismatch between the model and new, unseen non-native speech. Another explanation is that, in a narrow domain such as bus schedule information, the linguistic variance of non-native speech is much larger than that of native speech; therefore, less data is required to accurately model native speech than non-native speech. It also appears from these results that, in the context of task-based spoken dialogue systems, higher-level modules such as the natural language understanding module are less sensitive to explicit modeling of non-nativeness. This can be explained by the fact that such modules were designed to be flexible in order to compensate for speech recognition errors. This flexibility benefits non-native speakers as well, regardless of their additional recognition errors.</Paragraph>
</Section>
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.5 Effect of Additional Non-Native Data on Speech Recognition </SectionTitle>
<Paragraph position="0"> Unfortunately, the reduction of linguistic mismatch was not observed in the recognition results. While using the new language model improved the word error rate on both native and non-native utterances (to 17.8% and 47.8%, respectively; see Figure 3), the impact was relatively larger for native speech.</Paragraph>
[Figure 3: word error rate on native and non-native data using a native and a mixed language model]
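To make &quot;relatively larger&quot; concrete, the relative word error rate reductions implied by the figures reported here and in Section 3.1 (native: 20.4% down to 17.8%; non-native: 52.0% down to 47.8%) can be computed directly; the snippet below is only a worked restatement of those published numbers.

# Relative WER reduction implied by the reported figures.
def relative_reduction(before, after):
    return 100.0 * (before - after) / before

print(round(relative_reduction(20.4, 17.8), 1))  # about 12.7 (native)
print(round(relative_reduction(52.0, 47.8), 1))  # about 8.1 (non-native)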
<Paragraph position="1"> This is an indication that acoustics play a prominent role in the loss of speech recognition accuracy on non-native speech. Acoustic differences between native and non-native speakers are likely to be larger than the linguistic ones since, particularly in such a limited and common domain, it is easier for non-native speakers to master syntax and word choice than to improve their accent and pronunciation habits. Differences among non-native speakers of different origins are also very large in the acoustic domain, making it hard to create a single acoustic model matching all non-native speakers. Finally, the fact that additional non-native data improves performance on native speech is a sign that, generally speaking, the lack of training data for the language model is a limiting factor for recognition accuracy. Indeed, if there were enough data to model native speech, additional non-native data would be expected to increase the variance, and therefore the perplexity, on native speech.</Paragraph>
</Section> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Adaptive Lexical Entrainment as a Solution to Linguistic Mismatch </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Gearing the User To the System's Language </SectionTitle>
<Paragraph position="0"> The previous section described the issue of recognizing and understanding non-native speech and solutions for adapting traditional systems to non-native speakers. Another approach is to help non-native users adapt to the system by learning appropriate words and expressions. Lexical entrainment is the phenomenon by which, in a conversation, speakers negotiate a common ground of expressions to refer to objects or topics. Developers of spoken dialogue systems frequently take advantage of lexical entrainment to help users produce utterances that are within the language model of the system. This is done by carefully designing the system prompts to contain only words that are recognized by the recognition and understanding modules (Gustafson et al., 1997). However, in the case of non-native users, there is no guarantee that users actually know the words the system wants them to use. Moreover, even if they do, some non-native speakers might prefer to use other words, which they pronounce better or know better how to use. For those reasons, we believe that, to be optimal, the system must try to match the user's choice of words in its own prompts. This idea is motivated by the observations of (Bortfeld and Brennan, 1997), who showed that this type of adaptation occurs in human-human conversations between native and non-native speakers.</Paragraph>
<Paragraph position="1"> The role of the system's &quot;native&quot; prompts is to take the users along the shortest path from their current linguistic state to the system's expectations. In fact, this is not only true for non-native speakers: lexical entrainment is often described as a negotiation process between the speakers (Clark and Wilkes-Gibbs, 1986). However, while it is possible for the designers of a limited-domain system to establish a set of words and constructions that are widely used among native speakers, the variable nature of the expressions mastered by non-native speakers makes adaptation a desirable feature of the system.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Automatic Generation of Corrective Prompts </SectionTitle>
<Paragraph position="0"> In this study, not all prompts were modified to match the user's choice of words.
Instead, the focus was placed on confirmation prompts, which both ensure proper understanding between the user and the system and lexically entrain the user towards the system's expected input. Two questions arise: how to generate the prompts and when to trigger them. Our approach has been to design a list of target prompts that fit the system's language model and grammar, and to find the closest target prompt to each user input. The distance between a user utterance, as recognized by Sphinx, and each of the target utterances is computed by the same dynamic programming algorithm that is traditionally used to compute word error rate in speech recognition evaluation. It determines the number of word insertions, deletions and substitutions that lead from the target prompt to the user's utterance. The target prompt that is closest, i.e. the one that requires the fewest operations to match the input, is selected. In addition, words that represent important concepts, such as places, times or bus route numbers, are given additional weight. This follows the assumption that a target sentence is not appropriate if it has a missing or an extra concept compared to the utterance. We also used this heuristic to answer the second question: when to trigger the confirmation prompts.</Paragraph>
<Paragraph position="1"> The system asks for a confirmation whenever a target sentence is found that contains the same concepts as the user input and differs from it by at least one word. In this case, a prompt like &quot;Did you mean ...&quot; followed by the target sentence is generated. Finally, the dynamic programming algorithm used to align the utterances also locates the words that actually differ between the input and the target. This information is sent to the speech synthesizer, which puts particular emphasis on the words that differ.</Paragraph>
<Paragraph position="2"> To provide natural emphasis, the intonation of all sentences is generated by the method described in (Raux and Black, 2003), which concatenates portions of natural intonational contours from recorded utterances into a contour appropriate for each prompt. Since the limited-domain voice recorded for the project allows us neither to generate non-recorded prompts nor to modify the contour of the utterances, we used a different, generic voice for this version of the system.</Paragraph>
</Section>
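The following sketch illustrates the concept-weighted, word-level alignment and the confirmation trigger described above. It is a minimal re-implementation under stated assumptions: the concept word list, the weights and the exact trigger condition are illustrative, since the paper does not specify the system's scoring details.

# Sketch of the prompt-selection idea described above (illustrative only; the
# concept classes, weights and trigger rule details are assumptions, not the
# system's exact implementation).

CONCEPT_WORDS = {"airport", "downtown"}   # hypothetical concept tokens
CONCEPT_WEIGHT = 10.0   # concept mismatches are penalized much more heavily
WORD_WEIGHT = 1.0

def cost(word):
    return CONCEPT_WEIGHT if word in CONCEPT_WORDS else WORD_WEIGHT

def weighted_edit_distance(hyp, target):
    """Word-level Levenshtein distance with heavier weights on concept words."""
    n, m = len(hyp), len(target)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost(hyp[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost(target[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if hyp[i - 1] == target[j - 1] else max(cost(hyp[i - 1]), cost(target[j - 1]))
            d[i][j] = min(d[i - 1][j] + cost(hyp[i - 1]),      # deletion from hypothesis
                          d[i][j - 1] + cost(target[j - 1]),   # insertion of target word
                          d[i - 1][j - 1] + sub)               # match or substitution
    return d[n][m]

def concepts(words):
    return {w for w in words if w in CONCEPT_WORDS}

def select_confirmation(recognized, target_prompts):
    """Return a confirmation prompt, or None if no confirmation is needed."""
    hyp = recognized.lower().split()
    best, best_dist = None, None
    for prompt in target_prompts:
        tgt = prompt.lower().split()
        # Only consider targets carrying exactly the same concepts as the input.
        if concepts(tgt) != concepts(hyp):
            continue
        dist = weighted_edit_distance(hyp, tgt)
        if best is None or best_dist > dist:
            best, best_dist = prompt, dist
    # Trigger a confirmation only if the closest target differs by at least one word.
    if best is not None and best_dist > 0:
        return "Did you mean: " + best + "?"
    return None

targets = ["I want to go to the airport", "When is the next bus to downtown"]
print(select_confirmation("I want to go the airport", targets))
# Did you mean: I want to go to the airport?

In this toy run, the only target sharing the input's concept (airport) differs from the recognized utterance by a single function word, so a confirmation prompt is produced; a target with mismatched concepts is never proposed, which mirrors the trigger rule stated above.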
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Application and Example </SectionTitle>
<Paragraph position="0"> The method described in the previous section was implemented in the system and tested in a small pilot study.</Paragraph>
<Paragraph position="1"> We manually wrote 35 different target prompts describing departure and destination places, times and route numbers, based on our knowledge of the system's language model and grammar. An example of a confirmation dialogue obtained with one of these prompts is given in Figure 4. In the first user utterance, the preposition &quot;to&quot; is missing, either because it was not pronounced by the user or because it was not recognized by the speech recognition module. As a consequence, the utterance cannot be fully parsed by the language understanding module.</Paragraph>
<Paragraph position="2"> In parallel, the confirmation module computes the distance between the user's input and each of the 35 target prompts, and identifies the closest one as &quot;I want to go to the airport&quot;. At the same time, it finds that the user's utterance is obtained from the target by deleting the word &quot;to&quot;, and it therefore stresses that word in the confirmation prompt.</Paragraph>
<Paragraph position="3"> Figure 4: Example of a confirmation dialogue. The capitalized &quot;TO&quot; indicates that the word was emphasized by the system.
S: What can I do for you?
U: I want to go the airport.
S: Sorry, I didn't get that. Did you mean: I want to go TO the airport?
U: Yes
S: To the airport. Where are you leaving from?
U: ...</Paragraph>
<Paragraph position="4"> Once the user answers &quot;yes&quot; to the confirmation prompt, the target prompt is sent to the parser as if it had been uttered by the user, and the state of the dialogue is updated accordingly. If the user answers &quot;no&quot;, the prompt is simply discarded. We found that this method works well when speech recognition is only slightly degraded and/or when the recognition errors mostly concern grammar and function words. In such cases, this approach is often able to repair utterances that would not otherwise be parsed correctly. However, when too many recognition errors occur, or when they affect the values of the concepts (i.e. the system recognizes one place name instead of another), users receive too many confirmation prompts to which they must respond negatively. Combined with the difficulty that non-native speakers have in understanding unexpected synthesized utterances, this results in cognitive overload for the user. Yet, this method provides an easy way (since the designer only has to provide the list of target prompts) to generate adaptive confirmation prompts that are likely to help lexical entrainment.</Paragraph>
</Section> </Section>
</Paper>