<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3702">
  <Title>Marianne.Starlander@eti.unige.ch</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The MedSLT system
</SectionTitle>
    <Paragraph position="0"> MedSLT (MedSLT, 2005; Bouillon et al., 2005) is a unidirectional, grammar-based medical speech translation system intended for use in doctor-patient diagnosis dialogues. The system is built on top of Regulus (Regulus, 2006), an Open Source platform for developing grammar-based speech applications. Regulus supports rapid construction of complex grammar-based language models using an example-based method (Rayner et al., 2003; Rayner et al., 2006), which extracts most of the structure of the model from a general linguistically motivated resource grammar. Regulus-based recognizers are reasonably easy to maintain, and grammar structure is shared automatically across different subdomains. Resource grammars are now available for several languages, including English, Japanese (Rayner et al., 2005b), French (Bouillon et al., 2006) and Spanish.</Paragraph>
    <Paragraph position="1"> MedSLT includes a help module, whose purpose is to add robustness to the system and guide the user towards the supported coverage. The help module uses a second backup recognizer, equipped with a statistical language model; it matches the results from this second recognizer against a corpus of utterances, which are within system coverage and have already been judged to give correct translations. In previous studies (Rayner et al., 2005a; Starlander et al., 2005), we showed that the grammar-based recognizer performs much better than the statistical one on in-coverage utterances, and rather worse on out-of-coverage ones. We also found that having the help module available approximately doubled the speed at which subjects learned to use the system, measured as the average difference in semantic error rate between the results for their first quarter-session and their last quarter-session. It is also possible to recover from recognition errors by selecting one of the displayed help sentences; in the cited studies, we found that this increased the number of acceptably processed utterances by about 10%.</Paragraph>
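    <Paragraph> The backup matching step described above can be sketched as a simple similarity search over the judged in-coverage corpus. The actual MedSLT matching algorithm is not specified here, so the Jaccard word-overlap measure and the function names below are illustrative assumptions only:

```python
# Illustrative sketch: rank in-coverage corpus sentences by word overlap
# with the statistical recognizer's output. Hypothetical, not MedSLT code.

def word_overlap(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def help_sentences(recognized, corpus, n=3):
    """Return the n in-coverage sentences closest to the recognized string."""
    ranked = sorted(corpus, key=lambda s: word_overlap(recognized, s), reverse=True)
    return ranked[:n]

corpus = [
    "is the pain in the side of the head",
    "does the pain radiate to the neck",
    "do you usually have headaches in the morning",
]
print(help_sentences("pain radiate the neck", corpus, n=1))
```

In a deployed system the corpus would hold the full set of utterances already judged to translate correctly, and the matcher would run on the output of the second, statistical recognizer. </Paragraph>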
    <Paragraph position="2"> The version of MedSLT used for the experiments described in the present paper was configured to translate from spoken French into spoken English in the headache subdomain. Coverage is based on standard headache-related examination questions obtained from a doctor, and consists mostly of yes/no questions. WH-questions and elliptical constructions are also supported. A typical short session with MedSLT might be as follows:
- is the pain in the side of the head?
- does the pain radiate to the neck?
- to the jaw?
- do you usually have headaches in the morning?
The recognizer's vocabulary is about 1000 surface words; on in-grammar material, Word Error Rate is about 8% and semantic error rate (per utterance) about 10% (Bouillon et al., 2006). Both the main grammar-based recognizer and the statistical recognizer used by the help system were trained from the same corpus of about 975 utterances. Help sentences were also taken from this corpus.</Paragraph>
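    <Paragraph> The Word Error Rate figure quoted above is the standard measure: insertions, deletions and substitutions from an edit-distance alignment, divided by the reference length. As a reference point, a minimal dynamic-programming sketch (not MedSLT code):

```python
# Standard WER: Levenshtein distance over words, normalized by reference length.

def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

print(wer("does the pain radiate to the neck",
          "does the pain radiate to neck"))  # one deletion over 7 words
```
</Paragraph>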
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> In previous work, we have shown how to build a robust and extendable speech translation system.</Paragraph>
    <Paragraph position="1"> We have focused on performance metrics defined in terms of recognition and translation quality, and tested the system on naive users without any medical background (Bouillon et al., 2005; Rayner et al., 2005a; Starlander et al., 2005).</Paragraph>
    <Paragraph position="2"> In this paper, our primary goal was rather to focus on task performance evaluation using plausible potential users. The basic methodology used is common in evaluating usability in software systems in general, and spoken language systems in particular (Cohen et al., 2000). We defined a simulated situation, where a French-speaking doctor was required to carry out a verbal examination of an English-speaking patient who claimed to be suffering from a headache, using the MedSLT system to translate all their questions. The patients were played by members of the development team, who had been trained to answer questions consistently with the symptoms of different medical conditions which could cause headaches. We recruited eight native French-speaking medical students to play the part of the doctor. All of the students had completed at least four years of medical school; five of them were already familiar with the symptoms of different types of headaches, and were experienced in real diagnosis situations.</Paragraph>
    <Paragraph position="3"> The experiment was designed to study how well users were able to perform the task using the MedSLT system. In particular, we wished to determine how quickly they could adapt to the restricted language and limited coverage of the system. As a comparison point, representing near-perfect performance, we also carried out the same test on two developers who had been active in implementing the system, and were familiar with its coverage.</Paragraph>
    <Paragraph position="4"> Since it seemed reasonable to assume that most users would not interact with the system on a daily basis, we conducted testing in three sessions, with an interval of two days between each session. At the beginning of the first session, subjects were given a standardized 10-minute introduction to the system. This consisted of instruction on how to set up the microphone, a detailed description of the MedSLT push-to-talk interface, and a video clip showing the system in action. At the end of the presentation, the subject was given four sample sentences to get familiar with the system.</Paragraph>
    <Paragraph position="5"> After the training was completed, subjects were asked to play the part of a doctor, and conduct an examination through the system. Their task was to identify the headache-related condition simulated by the &amp;quot;patient&amp;quot;, out of nine possible conditions. Subjects were given definitions of the simulated headache types, which included conceptual information about location, duration, frequency, onset and possible other symptoms the particular type of headache might exhibit.</Paragraph>
    <Paragraph position="6"> Subjects were instructed to signal the conclusion of their examination when they were sure about the type of simulated headache. The time required to reach a conclusion was noted in the experiment protocols by the experiment supervisor.</Paragraph>
    <Paragraph position="7"> The subjects repeated the same diagnosis task on different predetermined sets of simulated conditions during the second and third sessions. The sessions were concluded either when a time limit of 30 minutes was reached, or when the subject completed three headache diagnoses. At the end of the third session, the subject was asked to fill out a questionnaire.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> Performance of a speech translation system is best evaluated by looking at system performance as a whole, and not separately for each subcomponent in the system's processing pipeline (Rayner et al., 2000, pp. 297-312). In this paper, we consequently focus our analysis on objective and subjective usability-oriented measures.</Paragraph>
    <Paragraph position="1"> In Section 4.1, we present objective usability measures obtained by analyzing user-system interactions and measuring task performance. In Section 4.2, we present subjective usability figures and a preliminary analysis of translation quality.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Objective Usability Figures
</SectionTitle>
      <Paragraph position="0"> Most of our analysis is based on data from the MedSLT system log, which records all interactions between the user and the system. An interaction is initiated when the user presses the &amp;quot;Start Recognition&amp;quot; button. The system then attempts to recognize what the user says. If it can do so, it next attempts to show the user how it has interpreted the recognition result, by first translating it into the Interlingua, and then translating it back into the source language (in this case, French). If the user decides that the back-translation is correct, they press the &amp;quot;Translate&amp;quot; button. This results in the system attempting to translate the Interlingua representation into the target language (in this case, English), and speak it using a Text-To-Speech engine. The system also displays a list of &amp;quot;help sentences&amp;quot;, consisting of examples that are known to be within coverage, and which approximately match the result of performing recognition with the statistical language model. The user has the option of choosing a help sentence from the list, using the mouse, and submitting this to translation instead.</Paragraph>
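      <Paragraph> The interaction flow just described can be summarized in code. The components below are deliberately trivial English-language stand-ins (the real system recognizes French, translates via the Interlingua, and uses a statistical backup recognizer), so every name and translation pair here is hypothetical:

```python
# Toy model of one MedSLT interaction: recognize, back-translate for
# confirmation, then translate and speak. All components are stand-ins.

IN_COVERAGE = {
    "does the pain radiate to the neck": "is the pain spreading to your neck",
}

def recognize(audio):
    # Grammar-based recognition: succeeds only on in-coverage input.
    return audio if audio in IN_COVERAGE else None

def run_interaction(audio, user_confirms=True):
    source = recognize(audio)
    if source is None:
        return None              # recognition failed; user retries or picks a help sentence
    # Back-translation: source -> Interlingua -> source (identity in this toy model).
    back_translation = source
    if user_confirms and back_translation == source:
        # User presses "Translate": Interlingua -> target, then spoken via TTS.
        return IN_COVERAGE[source]
    return None

print(run_interaction("does the pain radiate to the neck"))
```
</Paragraph>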
      <Paragraph position="1"> We classify each interaction as either &amp;quot;successful&amp;quot; or &amp;quot;unsuccessful&amp;quot;. An interaction is defined to be unsuccessful if either i) the user re-initiates recognition without asking the system for a translation, or ii) the system fails to produce a correct translation or back translation.</Paragraph>
      <Paragraph position="2"> Our definition of &amp;quot;unsuccessful interaction&amp;quot; includes instances where users accidentally press the wrong button (i.e. &amp;quot;Start Recognition&amp;quot; instead of &amp;quot;Translate&amp;quot;), press the button and then say nothing, or press the button and change their minds about what they want to ask halfway through. We observed all of these behaviors during the tests.</Paragraph>
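      <Paragraph> Under the definition above, success hinges solely on whether the interaction ended with a correct translation. A minimal sketch over hypothetical log events (the real MedSLT log format is not shown here):

```python
# Sketch of the success/failure classification applied to logged events.
# Event names are hypothetical stand-ins for the actual log format.

def classify(events, translation_correct=True):
    """An interaction is successful iff a correct translation was produced."""
    if "translation_produced" in events and translation_correct:
        return "successful"
    # Unsuccessful cases: recognition re-initiated without translating,
    # empty input, wrong button, or an incorrect (back-)translation.
    return "unsuccessful"

print(classify(["start_recognition", "recognized", "translation_produced"]))
print(classify(["start_recognition", "start_recognition"]))  # retried without translating
```
</Paragraph>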
      <Paragraph position="3"> Interactions where the system produced a translation were counted as successful, irrespective of whether the translation came directly from the user's spoken input or from the help list. In at least some examples, we found that when the translation came from a help sentence it did not correspond directly to the sentence the user had spoken; to our surprise, it could even be the case that the help sentence expressed the directly opposite question to the one the user had actually asked. This type of interaction was usually caused by some deficiency in the system, normally bad recognition or missing coverage. Our informal observation, however, was that, when this kind of thing happened, the user perceived the help module positively: it enabled them to elicit at least some information from the patient, and was less frustrating than being forced to ask the question again.</Paragraph>
      <Paragraph position="4"> Table I to Table III show the number of total interactions per session, the proportion of successful interactions, and the proportion of interactions completed by selecting a sentence from the help list. The total number of interactions required to complete a session decreased over the three sessions, declining from an average of 98.6 interactions in the first session to 63.4 in the second (36% relative) and 53.9 in the third (45% relative). It is interesting to note that interactions involving the help system did not decrease in frequency, but remained almost constant over the first two sessions (15.5% and 14.0%), and were in fact most common during the third session (21.7%).</Paragraph>
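      <Paragraph> The relative figures quoted above follow directly from the session means:

```python
# Reproducing the quoted relative declines from the per-session means.

def relative_decline(first, later):
    """Relative decrease of a later session mean versus the first session."""
    return (first - later) / first

sessions = {"first": 98.6, "second": 63.4, "third": 53.9}
for name in ("second", "third"):
    print(f"{name}: {relative_decline(sessions['first'], sessions[name]):.0%} relative decrease")
```
</Paragraph>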
      <Paragraph position="5"> [Table I: total interactions, successful interactions, and interactions involving the help system, by subject, for the 1st session]
[Table II: total interactions, successful interactions, and interactions involving the help system, by subject, for the 2nd session]
[Table III: total interactions, successful interactions, and interactions involving the help system, by subject, for the 3rd session]
In order to establish a performance baseline, we also analyzed interaction data for two expert users, who performed the same experiment. The expert users were two native French-speaking system developers, who were both familiar with the diagnosis domain. Table IV summarizes the results for these users. One of our expert users, listed as Expert 2, is the French grammar developer, and had no failed interactions. This confirms that recognition is very accurate for users who know the coverage.
[Table IV: total interactions, proportion of successful interactions, and interactions involving the help component]
The expert users were able to complete the experiment using an average of 33 interaction rounds. Similar performance levels were achieved by some subjects during the second and third sessions, which suggests that it is possible for at least some new users to achieve performance close to expert level within a few sessions.</Paragraph>
      <Paragraph position="6"> One of the important performance indicators for end users is how long it takes to perform a given task. During the experiments, the experiment supervisor noted the completion time required to reach a definite diagnosis in the experiment log. Table V shows task completion times, categorized by session (columns) and task within the session (rows).</Paragraph>
      <Paragraph position="7"> In the last two sessions, after subjects had acclimatized to the system, a diagnosis took an average of about four minutes to complete. This compares with the three-minute average required by our expert users.</Paragraph>
      <Paragraph position="8">  Table VI shows the percentage of in-coverage sentences uttered by the users on interactions that did not involve invocation of the help component.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
IN-COVERAGE SENTENCES
</SectionTitle>
    <Paragraph position="0"> This indicates that subjects learn and adapt to the system coverage as they use the system more.</Paragraph>
    <Paragraph position="1"> The average proportion of in-coverage utterances is about ten percentage points higher during the third session than during the first session.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Subjective Usability Measures
</SectionTitle>
      <Paragraph position="0"> After finishing the third session, subjects were asked to fill in a short questionnaire, where responses were on a five-point scale ranging from 1 (&amp;quot;strongly disagree&amp;quot;) to 5 (&amp;quot;strongly agree&amp;quot;). The results are presented in Table VII.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
STATEMENT SCORE
</SectionTitle>
    <Paragraph position="0"> - I quickly learned how to use the system. (4.4)
- System response times were generally satisfactory.</Paragraph>
    <Paragraph position="1"> - When the system did not understand me, the help system usually showed me another way to ask the question.
- When I knew what I could say, the system usually recognized me correctly.
- I was often unable to ask the questions I wanted.</Paragraph>
    <Paragraph position="2"> - I could ask enough questions that I was sure of my diagnosis.</Paragraph>
    <Paragraph position="3"> - This system is more effective than non-verbal communication using gestures.
- I would use this system again in a similar situation.</Paragraph>
    <Paragraph position="4">  Scores are on a 5-point scale, averaged over all answers.</Paragraph>
    <Paragraph position="5"> Answers were in general positive, and most of the subjects were clearly very comfortable with the system after just an hour and a half of use. Interestingly, even though most of the subjects agreed with the statement &amp;quot;I was often unable to ask the questions I wanted&amp;quot;, the good performance of the help system appeared to compensate adequately for the missing coverage.</Paragraph>
    <Paragraph position="6"> In order to evaluate the translation quality of the newly developed French-to-English system, we conducted a preliminary performance evaluation, similar to the evaluation method described in (Bouillon et al., 2005).</Paragraph>
    <Paragraph position="7"> We performed translation judgment in two rounds. In the first round, an English-speaking judge was asked to categorize target utterances as comprehensible or not, without looking at the corresponding source sentences. 91.1% of the sentences were judged comprehensible. The remaining 8.9% consisted of sentences where the terminology used was not familiar to the judge, and of sentences where the translation component failed to produce a sufficiently good translation. An example sentence is
- Are the headaches better when you experience dark room?
which stems from the French source sentence
- Vos maux de tête sont-ils soulagés par l'obscurité ?
In the second round, English-speaking judges, sufficiently fluent in French to understand the source language utterances, were shown the French source utterance and asked to decide whether the target language utterance correctly reflected its meaning. They were also asked to judge the style of the target language utterance. Specifically, judges were asked to classify sentences as &amp;quot;BAD&amp;quot; if the meaning of the English sentence did not reflect the meaning of the French sentence. Sentences were categorized as &amp;quot;OK&amp;quot; if the meaning was transferred correctly and the sentence was comprehensible, but the style of the resulting English sentence was not perfect. Sentences were judged &amp;quot;GOOD&amp;quot; when they were comprehensible, and both meaning and style were considered completely correct. Table VIII summarizes the results of the two judges.</Paragraph>
    <Paragraph position="8"> [Table VIII: judgments of the translations of 546 utterances]
It is apparent that translation judging is a highly subjective process. When translations were marked as &amp;quot;BAD&amp;quot;, the problem most often seemed to be related to lexical items for which it was challenging to find an exact correspondence between French and English. Two common examples were &amp;quot;troubles de la vision&amp;quot;, which was translated as &amp;quot;blurred vision&amp;quot;, and &amp;quot;faiblesse musculaire&amp;quot;, which was translated as &amp;quot;weakness&amp;quot;. It is likely that a more careful choice of lexical translation rules would deal with at least some of these cases.</Paragraph>
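    <Paragraph> A minimal sketch of how per-judge proportions over the GOOD/OK/BAD labels can be tallied (the labels below are hypothetical, not data from this study):

```python
# Tally a judge's GOOD/OK/BAD labels into proportions.
from collections import Counter

def tally(judgments):
    counts = Counter(judgments)
    total = sum(counts.values())
    return {label: counts[label] / total for label in ("GOOD", "OK", "BAD")}

judge_1 = ["GOOD", "GOOD", "OK", "BAD", "GOOD"]   # hypothetical labels
print(tally(judge_1))
```
</Paragraph>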
  </Section>
</Paper>