<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3703"> <Title>Speech to Speech Translation for Medical Triage in Korean</Title>
<Section position="5" start_page="1" end_page="1" type="metho"> <SectionTitle> 1 US Census Bureau, 2000 </SectionTitle>
<Paragraph position="0"> patients in this study were identified as needing interpreters during their inpatient stay and medical interpreters were available.</Paragraph>
<Paragraph position="1"> Although the evidence favors using trained medical interpreters, there is a gap between best practice and reality. Many patients needing an interpreter do not get one, and many must rely on ad hoc interpreters. In a study of 4,161 uninsured patients who received care in 23 hospitals in 16 cities, more than 50% of those who needed an interpreter did not get one (Andrulis et al., 2002).</Paragraph>
<Paragraph position="2"> Another study surveyed 59 residents in a pediatric residency program in an urban children's hospital (O'Leary and Hampers, 2003). Forty of the 59 residents surveyed spoke little or no Spanish. Again, it is important to note that this hospital had in-house medical interpreters. Of this group of nonproficient residents:
* 100% agreed that the hospital interpreters were effective; however, 75% &quot;never&quot; or only &quot;sometimes&quot; used the hospital interpreters.
* 53% used their inadequate language skills in the care of patients &quot;often&quot; or &quot;every day.&quot;
* 53% believed the families &quot;never&quot; or only &quot;sometimes&quot; understood their child's diagnosis.
* 43% believed the families &quot;never&quot; or only &quot;sometimes&quot; understood discharge instructions.
* 40% believed the families &quot;never&quot; or only &quot;sometimes&quot; understood the follow-up plan.</Paragraph>
<Paragraph position="3"> * 28% believed the families &quot;never&quot; or only &quot;sometimes&quot; understood the medications.
* 53% reported calling on their Spanish-proficient colleagues &quot;often&quot; or &quot;every day&quot; for help.</Paragraph>
<Paragraph position="4"> * 80% admitted to avoiding communication with non-English-speaking families.</Paragraph>
<Paragraph position="5"> The conclusion of the study was as follows: &quot;Despite a perception that they are providing suboptimal communication, nonproficient residents rarely use professional interpreters. Instead, they tend to rely on their own inadequate language skills, impose on their proficient colleagues, or avoid communication with Spanish-speaking families with LEP.&quot; Virtually every study on language barriers suggests that these residents are not unique. Physicians and staff at several hospitals have told Sehda that they are less likely to use a medical interpreter or telephone-based interpreter because it takes too long and is too inconvenient. Sehda believes that bridging this gap requires 2-way speech translation solutions that are immediately available, easy to use, accurate, and consistent in interpretation.</Paragraph>
<Paragraph position="6"> The need for speech translation exists in healthcare, and much work has been done on speech translation over the past two decades. Carnegie Mellon University has been experimenting with spoken language translation in its JANUS project since the late 1980s (Waibel et al., 1996). The University of Karlsruhe, Germany, has also been involved in an expansion of JANUS.
In 1992, these groups joined ATR in the C-STAR consortium (Consortium for Speech Translation Advanced Research), and in January 1993 they gave a successful public demonstration of telephone translation between English, German, and Japanese within the limited domain of conference registrations (Woszczyna, 1993). A number of other large companies and laboratories, including NEC in Japan (Isotani et al., 2003), the Verbmobil Consortium (Wahlster, 2000), the NESPOLE! Consortium (Florian et al., 2002), AT&T (Bangalore and Riccardi, 2001), and ATR (Yasuda et al., 2003), have been making their own research efforts. LC-Star and TC-Star are two recent European efforts to gather the data and the industrial requirements needed to enable pervasive speech-to-speech translation (Zhang, 2003). Most recently, the DARPA TransTac program (previously known as Babylon) has been focusing on developing deployable systems for English to Iraqi Arabic.</Paragraph> </Section>
<Section position="6" start_page="1" end_page="5" type="metho"> <SectionTitle> 3 System Description </SectionTitle>
<Paragraph position="0"> Unlike other systems that try to solve the speech translation problem under the assumption that a moderate amount of data is available, S-MINDS focuses on the rapid building and deployment of speech translation systems for languages where little or no data is available. S-MINDS allows the user to communicate easily in a question-and-answer, interview-style conversation across languages in limited domains such as border control, hospital admissions, medical triage, and other narrow interview fields.</Paragraph>
<Paragraph position="1"> S-MINDS uses a number of speaker-independent speech recognition engines, with the choice of engine depending on the languages and the particular domain. These engines include Nuance 8.5 and HTK (http://htk.eng.cam.ac.uk/). A dialog/translation creation tool allows us to compile and run our created dialogs with any of these engines, which frees our developers from the nuances of whichever engine is deployed. S-MINDS uses a combination of grammars and language models with these engines, depending on the task and the availability of training data. For the system described in this paper, we used Nuance 8.5 for both English and Korean speech recognition.</Paragraph>
<Paragraph position="2"> We use our own semantic parser, which identifies keywords and phrases that are tagged by the user; these in turn are fed into an interpretation engine. Because of the limited context, we can achieve high translation accuracy with the interpretation engine. However, as the name suggests, this engine does not directly translate users' utterances but interprets what they say and paraphrases their statements. Finally, we use a voice generation system (which splices human recordings) along with the Festival TTS engine to output the translations. The Festival engine has recently been replaced by the Cepstral TTS engine.</Paragraph>
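To make the pipeline concrete, here is a minimal sketch of the keyword-spotting/frame/paraphrase idea in Python. The patterns, frame fields, and templates are invented for illustration; they are not Sehda's actual tag sets or templates, which were far larger and engine-specific.

```python
import re

# Hypothetical keyword/phrase patterns, each paired with a frame builder.
# These are invented for illustration, not Sehda's actual tag set.
PATTERNS = [
    (re.compile(r"\b(?:hurt|pain)\w*.*?\b(arm|leg|chest|head)\b", re.I),
     lambda m: {"concept": "report_pain", "location": m.group(1).lower()}),
    (re.compile(r"\ballergic to (\w+)", re.I),
     lambda m: {"concept": "report_allergy", "agent": m.group(1).lower()}),
]

# Paraphrase templates keyed by concept: the interpretation engine
# re-renders the frame instead of translating the utterance directly.
TEMPLATES = {
    "report_pain": "The patient reports pain in the {location}.",
    "report_allergy": "The patient is allergic to {agent}.",
}

def interpret(utterance):
    """Spot tagged keywords, fill a semantic frame, paraphrase the frame."""
    for pattern, build_frame in PATTERNS:
        match = pattern.search(utterance)
        if match:
            frame = build_frame(match)
            return TEMPLATES[frame.pop("concept")].format(**frame)
    return None  # no frame matched: the utterance is out of domain

print(interpret("It hurts in my leg"))           # pain frame, paraphrased
print(interpret("I am allergic to penicillin"))  # allergy frame, paraphrased
print(interpret("What time is it?"))             # None (out of domain)
```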
<Paragraph position="3"> Additionally, S-MINDS includes a set of tools to modify and augment the existing system with additional words and phrases in the field in a matter of minutes.</Paragraph>
<Paragraph position="4"> The initial task given to us was a medical disaster recovery scenario that might occur near an American military base in Korea. We were given about 270 questions and an additional 90 statements that might occur on the interviewer side. Since our system is an interview-driven system (sometimes referred to as &quot;1.5-way&quot;), the second-language speaker is not given the option of initiating conversations. The questions and statements given to us covered several domains related to the task above, including medical triage, force protection at the installation gate, and some disaster recovery questions. In addition to the 270 assigned questions, we created 120 of our own in order to make the domains more complete.</Paragraph>
<Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 3.1 Data Collection </SectionTitle>
<Paragraph position="0"> Since we assumed that we could internally generate the English-language data used to ask the questions but not the language data on the Korean side, our entire focus for the data collection task was on Korean. As such, we collected about 56,000 utterances from 144 people answering the 390 questions described above. This data collection was conducted over the course of two months via a telephone-based computer system that the native Korean speakers could call. The system first introduced the purpose of the data collection and then presented the participants with 12 different scenarios. The participants were then asked a subset of the questions after each of the scenarios. One advantage of the phone-based system - in addition to the savings in administrative costs - was that the participants were free to do the data collection at any time of day or night, from any location.</Paragraph>
<Paragraph position="1"> The system also allowed participants to hang up and call back at a later time. The participants were paid only if they completed all the scenarios.</Paragraph>
<Paragraph position="2"> Of this data, roughly 7% was unusable and was discarded. Another 31% consisted of one-word answers (like &quot;yes&quot;). The rest of the data consisted of utterances 2 to 25 words long. Approximately 85% of the usable data was used for training; the remainder was used for testing.</Paragraph>
<Paragraph position="3"> The transcription of the data started one week after the start of the data collection, and we started building the grammars three weeks later.</Paragraph> </Section>
<Section position="2" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 3.2 System Development </SectionTitle>
<Paragraph position="0"> We have an extensive set of tools that allow nonspecialists, with a few days of training, to build complete mission-oriented domains. In this project, we used three bilingual college graduates who had no background in linguistics. We spent the first 10 days training them and the next two weeks closely supervising their work. Their work involved taking the sentences produced by the data collection and building grammars for them until the &quot;coverage&quot; of our grammars - that is, the proportion of utterances from the training set that our system would handle - exceeded a set threshold (generally between 80% and 90%). Because of the scarcity of Korean-language data, we built this system entirely on grammar-based language models rather than statistical language models. Grammars are generally more rigid than statistical language models, and as such they tend to have higher in-domain accuracy and much lower out-of-domain accuracy. (Note that many factors affect both grammar-based and statistical-language-model-based speech recognition, including noise, word perplexity, and acoustic confusability; the statement above has been true in some of our experiments, but we cannot claim that it is universally true.) This means that system performance depends greatly on how well our grammars cover the domains.</Paragraph>
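The coverage figure lends itself to a simple mechanical check: run every training utterance through the current grammar and report the fraction accepted, then add rules until the threshold is cleared. Below is a toy sketch of that check, using regexes as stand-in rules; the real grammars were written in the deployed engine's own grammar format, and the rules and utterances here are invented.

```python
import re
from typing import Iterable

def coverage(grammar: Iterable[re.Pattern], utterances: list[str]) -> float:
    """Fraction of training utterances matched by at least one grammar rule."""
    covered = sum(
        1 for utt in utterances
        if any(rule.fullmatch(utt) for rule in grammar)
    )
    return covered / len(utterances)

# Hypothetical toy grammar for yes/no answers; real rules would be written
# in the recognition engine's grammar format, not as regexes.
grammar = [
    re.compile(r"(yes|yeah|affirmative)( sir| ma'am)?"),
    re.compile(r"(no|nope|negative)( sir| ma'am)?"),
]

training = ["yes", "no sir", "maybe", "yeah", "negative"]
print(f"coverage = {coverage(grammar, training):.0%}")  # 80%: keep adding
# rules until the figure clears the project threshold (80-90% here).
```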
<Paragraph position="1"> The semantic tagging and the paraphrase translations were built simultaneously with the grammars. This involved finding and tagging the semantic classes as well as the key concepts in each utterance. Frame-based translations were performed via concept and semantic transfer. Because our tools allowed the developers to see the resulting frame translations right away, they were able to fix the system as they were building it; hence, the system-building time was greatly reduced.</Paragraph>
<Paragraph position="2"> We used about 15% of the collected telephone data for batch testing. Before deployment, our average word accuracy on the batch results was 92.9%. The translation results were harder to measure directly, mostly because of time constraints.</Paragraph>
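Word accuracy here is the standard 1 - WER, where WER counts the substitutions, deletions, and insertions in a Levenshtein alignment of each hypothesis against its reference transcript. A minimal sketch of that computation (our illustration; not necessarily the exact scoring tool behind the 92.9% figure):

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy = 1 - WER over a Levenshtein alignment of word lists."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 1.0 - dist[len(ref)][len(hyp)] / len(ref)

print(word_accuracy("are you allergic to any medications",
                    "are you allergic to medications"))  # 1 - 1/6 = 0.833...
```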
</Section>
<Section position="3" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 3.3 System Testing </SectionTitle>
<Paragraph position="0"> We tested our system with 11 native Korean speakers, gathering 968 utterances from them. The results of the test are shown in Table 1. Most of the valid rejections occurred because participants spoke too softly, too loudly, before the prompt, or in English. Note that there was one utterance with a bad translation; that problem and a number of others were fixed before the actual field testing.</Paragraph>
<Paragraph position="1"> [Table 1: Results for the 11 native Korean speakers.]</Paragraph> </Section> </Section>
<Section position="7" start_page="5" end_page="5" type="metho"> <SectionTitle> 4 Experimental Setup </SectionTitle>
<Paragraph position="0"> A military medical group used S-MINDS during a medical training exercise in January 2005 in Carlsbad, California. The testing of speech translation systems was integrated into the exercise to assess the viability of such systems in realistic situations.</Paragraph>
<Paragraph position="1"> The scenario involved a medical aid station near the front lines treating badly injured civilians. The medical facilities were designed to quickly triage severely wounded patients, provide life-saving surgery if necessary, and transfer the patients to a safer area as soon as possible.</Paragraph>
<Section position="1" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 4.1 User Training </SectionTitle>
<Paragraph position="0"> The success or failure of these interactive systems is often determined by how well the users are trained on the systems' features.</Paragraph>
<Paragraph position="1"> Training and testing on S-MINDS took place from November 2004 through January 2005. The training had three parts: a system demonstration in November, two to three hours of training per person in December, and another three-hour training session in January. About 30 soldiers were exposed to S-MINDS during this period. Because of the tsunami in Southeast Asia, many of the people who attended the November demo and the December training were not available for the January training and the exercise. Nine service members used S-MINDS during the exercise. Most of them had attended only the training session in January.</Paragraph> </Section>
<Section position="2" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 4.2 Test Scenarios </SectionTitle>
<Paragraph position="0"> Korean-speaking 'patients' arrived by military ambulance. They were received into one of three tents, where they were (notionally) triaged, treated, and prepared for surgery. The tents were about 20 feet wide by 25 feet deep, and each had six to eight cots for patients. The tents had lights and electricity.</Paragraph>
<Paragraph position="1"> The environment was noisy, sandy, and 'bloody.' The patients' makeup coated our handsets by the end of the day. There were many soldiers available to help and watch. Nine service members used S-MINDS during a four-hour period.</Paragraph>
<Paragraph position="2"> All of the 'patients' spoke both English and Korean. A few 'patients' were native Korean speakers, and two were American service members who spoke Korean fairly fluently but with an accent.</Paragraph>
<Paragraph position="3"> The 'patients' were all presented as severely injured from burns, explosions, and cuts and in need of immediate trauma care.</Paragraph>
<Paragraph position="4"> The 'patients' were instructed to act as if they were in great pain. Some did, and they sounded quite realistic. In fact, their recorded answers to questions were sometimes hard for a native Korean speaker to understand. The background noise in the tents was quite loud (because of the number of people involved, the screaming patients, and the close quarters). Although we did not directly measure the noise, we estimate it ranged from 65 to 75 decibels.</Paragraph> </Section>
<Section position="3" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 4.3 Physical and Hardware Setup </SectionTitle>
<Paragraph position="0"> S-MINDS is a flexible system that can be configured in different ways depending on the needs of the end user. Because of the limited time available for training, the users were trained on a single hardware setup, tailored to our understanding of how the exercises would be conducted. Diagrams available before the exercises showed that each tent would have a &quot;translation station&quot; where Korean-speaking patients would be brought. The experimenters (two of the authors) had expected that the tents would be positioned at least 40 feet apart.</Paragraph>
<Paragraph position="1"> In reality, the tents were positioned about 5 feet apart, and there was no translation station.</Paragraph>
<Paragraph position="2"> Our original intent was to use S-MINDS on a Sony U-50 tablet computer mounted on a computer stand with a keyboard and mouse at the translation station, together with a prototype wireless device - based on a Bluetooth-like technology - that we had built previously to eliminate the need for wires between the patient and the system. However, because of changes in the conduct of the exercise, the experimenters had to step in and quickly set up two of the S-MINDS systems without the wireless system (because of the close proximity of the tents) and without the computer stands. The keyboards and mice were also removed so that the S-MINDS systems could be made portable. The medics worked in teams of two; one medic would hold the computer and headset for the injured patient while the other medic conducted the interview.</Paragraph> </Section> </Section>
<Section position="8" start_page="5" end_page="5" type="metho"> <SectionTitle> 5 Results </SectionTitle>
<Paragraph position="0"> The nine participants used our system to communicate with 'patients' over a four-hour period. We analyzed both qualitative problems with using the system and quantitative results on translation accuracy.</Paragraph>
<Section position="1" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 5.1 Problems with System Usage </SectionTitle>
<Paragraph position="0"> We observed a number of problems with our system in the test scenarios. These represent some of the more common problems with the S-MINDS system, and the authors suspect they may be endemic to all such systems.</Paragraph>
<Paragraph position="1"> Users had been trained on the wireless units, which interfered with each other when used in close proximity. For the exercise, we therefore had to set up the units without the wireless devices, a configuration the users had not been trained on. As a result, service members were forced to use a different system from the one they were trained on.</Paragraph>
<Paragraph position="2"> Also, the users had difficulty navigating to the right domain. S-MINDS has multiple domains, each optimized for a particular scenario (medical triage, pediatrics, etc.), but the user training did not include navigation among domains.</Paragraph>
<Paragraph position="3"> The user interface and the system's feedback messages caused unnecessary confusion for the interviewers. The biggest problem was that the system responded with &quot;I'm sorry, I didn't hear that clearly&quot; whenever an utterance wasn't recognized. This made the users think they should just repeat their utterance over and over. In fact, the problem was that they were saying something that was out of domain or did not fit any dialogs in S-MINDS, so no matter how many times they repeated the phrase, it would not be recognized. This caused the users significant frustration.</Paragraph> </Section>
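The natural fix for this failure mode, which Sections 6 and 7 return to, is to make the feedback depend on why recognition failed. A sketch of the idea, assuming the recognizer can report "no dialog matched" separately from "matched with low confidence"; the threshold and message texts are invented:

```python
# Hypothetical feedback selection distinguishing the two failure modes that
# the single "I didn't hear that clearly" prompt conflated.
REJECT_THRESHOLD = 0.30  # illustrative cutoff, not S-MINDS's tuned value

def feedback_prompt(confidence, matched_dialog):
    """Pick a feedback message from the recognizer's outcome.

    matched_dialog is False when the engine found no grammar/dialog match
    at all, i.e., the utterance is out of domain; repeating it cannot help.
    """
    if not matched_dialog:
        return ("I don't know how to translate that. "
                "Please rephrase, or switch to another domain.")
    if confidence < REJECT_THRESHOLD:
        return "I didn't hear that clearly. Please repeat."
    return None  # recognized well enough to translate

print(feedback_prompt(0.9, matched_dialog=False))  # out-of-domain guidance
print(feedback_prompt(0.1, matched_dialog=True))   # genuine mis-hearing
```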
<Section position="2" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 5.2 Quantitative Analysis </SectionTitle>
<Paragraph position="0"> During the system testing, there were 363 recorded interactions for the English speakers. Unfortunately, the system was not set up to record the utterances that received a very low confidence score (as determined by the Nuance engine); in those cases the user was simply asked to repeat the utterance. Here is the rough breakdown for all of the English interactions:</Paragraph>
<Paragraph position="1"> [Figure 1: English utterances and percentage breakdown for each category.]</Paragraph>
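A side effect of this setup was that the lowest-confidence interactions could not be analyzed afterward. A sketch of the record-everything, verify-the-uncertain-band policy that the later deployment moved toward (see Section 7); the thresholds are invented for illustration:

```python
# Sketch of an accept/verify/reject policy: log every utterance, translate
# confident ones, and verbally verify only the uncertain middle band.
# Threshold values are invented, not S-MINDS's actual tuning.
ACCEPT_THRESHOLD = 0.75
VERIFY_THRESHOLD = 0.40

def handle_utterance(text, confidence, log):
    """Decide what to do with a recognition result, logging it regardless."""
    log.append((text, confidence))  # keep low-confidence results too,
                                    # so failures can be analyzed afterward
    if confidence >= ACCEPT_THRESHOLD:
        return "translate"
    if confidence >= VERIFY_THRESHOLD:
        # Verbal verification ("Did you say: ...?"); gating it to this band
        # limits the user frustration that over-verification causes.
        return "verify"
    return "reprompt"

log = []
print(handle_utterance("are you allergic to any medications", 0.55, log))  # verify
```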
<Paragraph position="2"> The Korean speakers' responses to each of the questions that were recognized and translated are analyzed in Figure 2. Note that the accuracy for the non-rejected responses is 78.3%.</Paragraph>
<Paragraph position="3"> [Figure 2: Translation of the Korean utterances and percentage breakdown for each category.]</Paragraph> </Section> </Section>
<Section position="9" start_page="5" end_page="5" type="metho"> <SectionTitle> 6 Discussion </SectionTitle>
<Paragraph position="0"> Although these results are less than impressive, a close evaluation pointed to three areas where a concentration of effort would significantly improve translation accuracy and reduce mistranslations. These areas were:
1) Data collection with English speakers to increase coverage of the dialogs.
a) 34% of the things the soldiers said were things S-MINDS was not designed to translate.
b) We had assumed that our existing English system would have adequate coverage without any additional data collection.
2) User verification on low-confidence results.
3) Improved feedback prompts when a phrase is not recognized. For example:
a) One user said, &quot;Are you allergic to any allergies?&quot; three times before he caught himself and said, &quot;Are you allergic to any medications?&quot;
b) Another user said, &quot;How old are you?&quot; seven times before realizing he needed to switch to a different domain, where he was able to have the phrase translated.
c) Another user repeated, &quot;What is your name?&quot; nine times before giving up on the phrase (this phrase wasn't in the S-MINDS Korean medical mission set).</Paragraph>
<Paragraph position="1"> Beyond improving the coverage, the system's primary problem seemed to be the voice user interface, since even the trained users had a difficult time using the system.</Paragraph>
<Paragraph position="2"> The attempt at realism in playing out a high-trauma scenario may have detracted from the effectiveness of the event as a test of the systems' abilities under more routine (but still realistic) conditions.</Paragraph> </Section>
<Section position="10" start_page="5" end_page="5" type="metho"> <SectionTitle> 7 New Results </SectionTitle>
<Paragraph position="0"> Based on the results of this experiment, we had a secondary deployment of a very similar system in a medical setting.</Paragraph>
<Paragraph position="1"> We applied what we had learned and achieved better results in a few areas. For example:
1. Data collection in English helped tremendously. S-MINDS recognized about 40% more concepts than it had been able to recognize using only grammars created by subject-matter experts.
2. Verbal verification of the recognized utterance was added to the system, which improved user confidence, although too much verification tended to frustrate the users.
3. Feedback prompts were redesigned to give more specific feedback, which seemed to reduce user frustration and the number of mistakes.</Paragraph>
<Paragraph position="2"> Overall, the system performance seemed to improve. We continue to gather data on this task, and we believe it will enable us to identify the next set of problems that need to be solved.</Paragraph> </Section> </Paper>