<?xml version="1.0" standalone="yes"?>
<Paper uid="A83-1031">
  <Title>INTERACTIVE NATURAL LANGUAGE PROBLEM SOLVING: A PRAGMATIC APPROACH</Title>
  <Section position="2" start_page="0" end_page="181" type="metho">
    <SectionTitle>
SYSTEM OVERVIEW
</SectionTitle>
    <Paragraph position="0"> The basic system design includes modules to do the following tasks:  (i) token acquisition (2) parsing (3) noun group resolution (4) imperative verb execution (5) flow-of-control semantics (6) system output  The token acquisition phase receives typed inputs, word guesses ~com the voice recognizer, and screen coordinates from the touch panel. These inputs are preprocessed and passed tO the parser which uses an augmented transition network to ~iscover the structure of the command and the roles of the individual tokens. ~oun group resolution attempts to discover what domain objects are being referred to, and the verb execution module transforms those objects as requested by the imperative verb. The flow-of-control semantics module manages the execution of metaimperative verbs such as ~, and handles user-defined imperatives. Finally, system output displays the state of the world on the screen. Any module may issue prompts and error messages via text or spoken output. Backup from any given module to an earlier stage may occur in unusual situations. More details appear in Ballard \[i\], Biermann \[5\], Biermann and Ballard \[6\], and Eallard and 8iermann \[3\].</Paragraph>
  </Section>
  <Section position="3" start_page="181" end_page="182" type="metho">
    <SectionTitle>
SPEECH INPUT
</SectionTitle>
    <Paragraph position="0"> An automatic speech recognizer such as the DP-200 or V-5000 recognizes speech by means of pattern matching algorithms.</Paragraph>
    <Paragraph position="1"> A subject is introduced to the device for a training session, and asked to repeat the various words of the vocabulary into a microphone. The device extracts and stores bit patterns corresponding to each vocabulary word uttered by that particular speaker. After training, when a speaker wishes to use the device, the appropriate bit patterns are loaded. Each utterance of the speaker is compared with the pre-stored bit patterns and the best match above a threshold limit is presented as the recognized word. Depending on the device being used, the speaker may be required to talk with discrete or connected speech. The results described below were obtained primarily in the discrete mode with a pause of at least 200 milliseconds after each word.</Paragraph>
    <Paragraph position="2"> Error Handlin~ The major difficulty facing users of automatic speech recognition equipment is the high error rate. Even the best devices in the best of circumstances are not entirely free of error, and when circumstances are less than optimal, and more like the real world, the error rate rises.</Paragraph>
    <Paragraph position="3"> Thus, a good part of the project effort has gone into coping with errors in recognition. In our view the speech recognition device is a component of the larger natural language computing system, and our goal is to reduce the system error rate as much as possible. We have therefore designed error correction software that corrects for certain kinds of errors, and error messages that elicit repetition from the human subject in less tractable cases.</Paragraph>
    <Paragraph position="4"> Error correction essentially functions by starting with a sequence of word guesses from the input system and filtering out the meaningless alternatives at the appropriate stages of processing.</Paragraph>
    <Paragraph position="5"> Beginning in the token acquisition phase, certain unacceptable word sequences can be disallowed. For example, a noun such as &amp;quot;matrix&amp;quot; or &amp;quot;row&amp;quot; would be disallowed as the first word in the sentence since this is illegal in the system grammar. In the parsing phase, a grammatical sequence of words is selected from the incoming sets of word guesses. Thus all ungrammatical word sequences are eliminated. The parser also disallows phrases containing certain semantically unacceptable relationships such as the second row in 6.</Paragraph>
    <Paragraph position="6"> or phrases containing disallowed operations such as Add the matrix to 6.</Paragraph>
    <Paragraph position="7"> In the noun group processor and Later stages, various other semantic errors can be eliminated such as references to nonexistent objects or impossible operations. For discrete mode operations, errors are classified into four types: a. Substitutions.</Paragraph>
    <Paragraph position="8"> The device reports word B when word A was actually spoken.</Paragraph>
    <Paragraph position="9"> b. Re~ections.</Paragraph>
    <Paragraph position="10"> The device sends a rejection code when a vocabulary word was spoken.</Paragraph>
    <Paragraph position="11"> c. Insertions.</Paragraph>
    <Paragraph position="12"> The device reports a vocabulary word when a non-vocabulary word, or noise, was uttered: d. Fusions. Two (or more) words are spoken but only one word is reported.</Paragraph>
    <Section position="1" start_page="181" end_page="182" type="sub_section">
      <SectionTitle>
Substitution Errors
</SectionTitle>
      <Paragraph position="0"> Substitution errors are the easiest to correct since the substituted word often resembles the actual word phonetically. Some of the substitutions are fairly predictable, e.g. &amp;quot;by&amp;quot; for &amp;quot;five&amp;quot;, &amp;quot;and&amp;quot; for &amp;quot;add&amp;quot;, or &amp;quot;up&amp;quot; for &amp;quot;of&amp;quot;. We have coined the term synophone to describe such sets. Many synophone pairs are symmetrically interchangable: however, some are not. For example, with some speakers, the word &amp;quot;a&amp;quot; is frequently reported as &amp;quot;eight&amp;quot; although the converse seldom occurs.</Paragraph>
      <Paragraph position="1"> Synophones of a particular word utterance come from two sources: alternate guesses offered by the recognition device based on its pattern matching computation, and a set of words stored in the system that are known to be confused with the selected word. Whenever a token is collected by the scanner, its synophone  llst is compiled. Passing the complete set of synophones for each word to the parser would result in excessive parse time so it is desirable to eliminate beforehand any synophones whose occurrence can be determined to be impossible based on grammatical or contextual considerations. For example the syntax of English (and of NLC) prevents certai~ words from occurring next to each other, or beginning or ending sentences. This information is recorded in a table of adJacencies. If there is a synophone in a word slot that cannot be preceded by any of the synophones in the previous word slot that synophone is deleted. This process is repeated until no more deletlons are possible. On average, roughly one-half of the candidate synophones are deleted.</Paragraph>
      <Paragraph position="2"> Since parsing time may increase exponentlally with the number of candidate synophones, and this table driven elimination process is very quick, considerable savings result.</Paragraph>
      <Paragraph position="3"> For reasons of indivldual speech variation some vocabulary words will have synophones peculiar to an individual speaker. The set of synophones of each vocabulary word is therefore augmented to accommodate this situation so that each speaker has personalized synophone sets.</Paragraph>
      <Paragraph position="4"> Early training includes a tutorial introductlon, part of which requires the sub-Ject to repeat sentences word for word.</Paragraph>
      <Paragraph position="5"> In this mode, the software has a priori knowledge of the correct token--for each word slot. If a given word slot does not contain the correct token, the substituted word can be added to the appropriate synophone set for that subject. Thereafter, if the same substitution error recurs during a session with that subject, the correct word will be included in the synophone list for that word slot.</Paragraph>
    </Section>
    <Section position="2" start_page="182" end_page="182" type="sub_section">
      <SectionTitle>
Rejection Errors
</SectionTitle>
      <Paragraph position="0"> The occurrence of one or more rejections in a sentence almost always results in a request for repetition. However, we are designing a number of facilities to handle rejections. In some cases, the rejected word can be determined from context, and processing can continue uninterrupted. Otherwise, the current plan is to handle a single rejection by returning an audio response that repeats all of the sentence with the word &amp;quot;what&amp;quot; in place of the rejected element. The speaker will then .be able to choose to repeat the rejected word or, in case other errors are apparent, to repeat the entire utterance.</Paragraph>
      <Paragraph position="1"> In cases of multiple rejection errors, the speaker is requested to repeat the entire utterance. In all cases previous utterances will not be. discarded. The scanner will merge them, complete with synophones, in an attempt to eliminate reJectione and provide the broadest amount of information from which to extract what the speaker actually said. For example, if the actual utterance were</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="182" end_page="182" type="metho">
    <SectionTitle>
ABC D E FG
</SectionTitle>
    <Paragraph position="0"> and the recognizer returned Am * Z E* G where * stands for rejection, the speaker will be asked to repeat. If</Paragraph>
  </Section>
  <Section position="5" start_page="182" end_page="185" type="metho">
    <SectionTitle>
ABC* EFH
</SectionTitle>
    <Paragraph position="0"> is then recognized, it will be combined with the first utterance so that the scanner considers the seven word slots to contain= s(A) sis) sic) s(z) sis) s(F) s(G) sin) where siX) is the union of X with its synophones. (Hopefully D is in s(Z).) If subsequent utterances are so different from previous ones that they are unlikely to be word-for-word repetitions (for example, by containing a different number of words), previous utterances will be discarded and processing will be started over.</Paragraph>
    <Paragraph position="1"> It may also be possible to predict a rejected word with some degree of certainty based on semantic or pragmatic information. (We consider pragmatics to involve discourse dependent contextual factors.) For example suppose the scanner receives from the recognizer: Double * nine and add column four to it. The most likely possibilities for the rejection are entry, row and column.</Paragraph>
    <Paragraph position="2"> Entry can be ellmz~'~ate~--on semantic grounds since it is meaningless to a&lt;\]d a column to an entry. Row is semantically possible, but pragma-'~cally less likely than column since adding columns to columns is much more common than adding columns to rows. Thus column may be chosen. Furthermore if t-h-e matrix in focus is six by seven, then the nine is a substitution error, and the sentence will be rejected on pragmatic grounds initially. However, since five is a synophone of nine the sentence ~ be tried with flve 'in the place of nine. Ultimate!y t't~'~e~e user will see displayS, on the screen the result From: Double column five and add column four to it.</Paragraph>
    <Paragraph position="3">  The activity described above is transparent to the user. If the results are unsatisfactory to the user, the command &amp;quot;backup&amp;quot; will undo them.</Paragraph>
    <Paragraph position="4"> An additional source of pragmatic error correction comes from utterances in historically similar dialogs. We are developing a method for utilizing this type of information. Considering the last example, if the user had been adding columns to rows quite freque~: &amp;quot;~-&amp;quot; in the current and/or recent sessions, but rarely if ever adding columns to columns, the system would choose row as the rejected word.</Paragraph>
    <Paragraph position="5"> Insertion Errors and Fusion Errors Most speech recognizers allow the threshold value to be adjusted that determines whether the best match is &amp;quot;recognized&amp;quot; or is rejected. Since rejections are harder to correct for than substitutions there is reason to lower this value. Too low a value, however, aggravates the insertion problem. When the speaker utters a non-vocabulary word, or emits a grunt or uncouth sound, the correct response is a rejection. A non-rejection in this situation may be difficult to deal with.</Paragraph>
    <Paragraph position="6"> In our experience users have little trouble in confining themselves to the trained vocabulary. Most insertion errors occur between sentences, rather than between words within a sentence. This results in extraneous &amp;quot;words&amp;quot; in the first one or two word slots. These can often be eliminated because neither they nor their synophones can begin a sentence in the NLC grammar. Timing considerations, too, could be used to eliminate, or at least cast suspicion on, inter-sentence insertions, though we have not found the need for such measures.</Paragraph>
    <Section position="1" start_page="183" end_page="183" type="sub_section">
      <SectionTitle>
Raw Error Rate
</SectionTitle>
      <Paragraph position="0"> Although a good deal of our interest is in correcting or compensating for the various kinds of errors in recognition, we are also working on ways to reduce the actual number of errors made by the recognition devices (the raw error rate).</Paragraph>
      <Paragraph position="1"> Careful vocabulary choice and proper tuning of the hardware such as threshold level selections are crucial factors.</Paragraph>
      <Paragraph position="2"> It is important to choose vocabulary words as widely separated phonetic~lly as circumstances allow. Additionally, we have found that words containing nonstrident fricatives (e.g. the th in fifth), affricates (e.g. the c---h in c u-'~-~r'ch), liquids (r and I) and nasals-'(m,n an~q) are mort diffTcult to recognize than words containing other sounds.</Paragraph>
      <Paragraph position="3"> Monosyllabic words, in general, are not recognized as readily as polysyllabic ones, though words that are long and difficult to pronounce (e.g. anaesthetist) are also to be avoided. Often the domain leaves little latitude for vocabulary choice. If ordinal numbers are needed it is necessary to have fifth and sixth, which are difficult to--~inguish. But instead of a word like rate which is easily confused with eig t~-~-, tax rate or rate-of-pay (pronounced as a single--'~rd) m%--~t~ a better choice.</Paragraph>
      <Paragraph position="4"> Correct training procedures are instrumental in reducing the raw error rate as are such factors as whether the user receives immediate feedback from the recognizer, the form and frequency of error messages requesting repetition, and the degree of comfort fett by the user insofar as attitude toward computers is concerned. Some of these are discussed below in the section Measurin@ System Performance. null We have observed fusion errors in discrete mode. They arise when the speaker neglects to pause long enough between words. In our experience they occur so infrequently we have not tried to compensate for them. This type of error is more crucial when operating in connected mode. It may be the case that two (or possibly more) words are reported as a single word different from either of the two originally uttered words. It may also happen that two words, A and B, are reported as either A or 8. In this case the fusion error takes on the appearance of an omission. Our connected speech parser, currently under construction, will have the ability z9 guess an omission and insert a correction if sufficient contextual information is available.</Paragraph>
    </Section>
    <Section position="2" start_page="183" end_page="185" type="sub_section">
      <SectionTitle>
Some Miscellaneous Questions
</SectionTitle>
      <Paragraph position="0"> Apart from error correction, a number of other questions have arisen during our implementation of the voice driven system.</Paragraph>
      <Paragraph position="1"> Among these are: a) How is the beginning of a sentence detected? b) How is the end of a sentence detected? c) How can a user make a correction in mid-sentence? Currently a sentence begins with any input after the end of the previous sentence. The instances of inter- or presentence insertions were discussed above. Sentences are terminated by the mete-word over. This word has few syno- null phones in the current word set and has the advantage of being widely understood to mean &amp;quot;end of transmission.&amp;quot; However, we plan to experiment with other kinds of termination such as use of touch input or timing information.</Paragraph>
      <Paragraph position="2"> A user may misspeak in instructing the computer to perform a task and may wish to repeat all or part of the command. Also, if the words from the woice recognizer are displayed as they are spoken, the user may desire to correct a misrecognition. '~ne metaword correction is currently used to implement this facility. There are several levels of correction. Some may be accomplished by the scanner, while others require more information than is available to the scanner and must therefore be handled by the parser. The simplest type of correction consists of changing one word at the end of the sentence: null Add row one to row four correction three.</Paragraph>
      <Paragraph position="3"> Here the scanner merely deletes the word slot before the metaword. If several words follow &amp;quot;correction&amp;quot; as in Add row one to row two correction row one to column three.</Paragraph>
      <Paragraph position="4"> the scanner detects this fact and scans backward in the sentence, attempting tO match the largest possible number of word slots before and immediately after the metaword. In this example the tokens for row, one and to match, so the scanner copies--t'~e last ~rt of the sentence into the earlier part of the buffer to arrive at Add row one to column three.</Paragraph>
      <Paragraph position="5"> In the case of an utterance such as Add row one to row two correction column three.</Paragraph>
      <Paragraph position="6"> it is impossible to match the tokens before and after the metaword. The scanner therefore deletes the token \[~Ime,\]iately before the metaword, flags the word slot preceding that token and passes the result to the parser. In the example, Add row one to row column three.</Paragraph>
      <Paragraph position="7"> is passed, with the word slot containing ro w flagged. The parser attempts to make  sense of the set of tokens passed. If it cannot, the flagged word slot is deleted, the word previous to it is flagged and another parse is attempted. The process is repeated until a successful Parse is found. If none is found, an error message is issued. Thus in the example, after failing tO Parse the tokens as passed, the parser tries Add row one to column three.</Paragraph>
      <Paragraph position="8"> which is parsed successfully.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="185" end_page="186" type="metho">
    <SectionTitle>
TOUCH INPUT
</SectionTitle>
    <Paragraph position="0"> An important aspect of natural language communication is pointing, which is often used in connection with words such as this, that, here and there.</Paragraph>
    <Paragraph position="1"> Pointing may---'f'~ncto~-o'n--as em-'m'~asis, as in Put the dog out.</Paragraph>
    <Paragraph position="2"> where either the dog, the outside, or possibly both are pointed to. Pointing also functions to put objects into focus, allowing subsequent references to use a definite pronoun: for example, Move that there and cover it.</Paragraph>
    <Paragraph position="3"> with a point to the object to be moved and covered.</Paragraph>
    <Paragraph position="4"> A pointing ability would fit in very nicely with voice driven NLC and our pro-Ject includes a touch sensitive screen so that the user can say &amp;quot;double this&amp;quot;, point to a row, and cause the processor to double every element in that row. More complex sentences such as Add this row to that row putting the results here. (with three touchee) also become possible.</Paragraph>
    <Paragraph position="5"> Apart from being &amp;quot;natural&amp;quot; in the sense that ordinary language users point often, pointing may increase the efficiency of communication.</Paragraph>
    <Paragraph position="6"> There has been a good deal of interest among human factors scientists as to the efficiency of various modes of communication. Past experiments, for example, have compared the efficiency of typed versus voice messages (voice messages are more efficient). We carried out an experiment to verify the hypothesis that voice input together with touch input is more efficient than voice input alone, and we attempted to quantify the results, we solved eight different types of matrix problems including Gaussian elimination, divided differences and matrix inversion, using NLC without touch. We then went back and rewrote the solutions using the touch facility, but without any other changes. On the average 29% fewer words were needed to solve the problem, and individual sentences were shortened by 23%.</Paragraph>
    <Paragraph position="7"> A number of interesting problems arise when a touch facility is implemented. One is how to pair up tactile and verbal input in the way intended by the user. Another problem is identifying the actual object the user intends to refer to once the tactile and verbal input have been resolved.</Paragraph>
    <Paragraph position="8"> An example of the latter problem would be the command</Paragraph>
    <Section position="1" start_page="185" end_page="186" type="sub_section">
      <SectionTitle>
Double this
</SectionTitle>
      <Paragraph position="0"> accompanied by a touch of element &lt;3,2&gt; of a displayed matrix. Does the user want to double element (3,2&gt;, double row 3, double column 2, or even double the entire matrix? The same touch paired with Double this entry.</Paragraph>
      <Paragraph position="1"> Double this matrix.</Paragraph>
      <Paragraph position="2"> Double this column.</Paragraph>
      <Paragraph position="3"> or Double this matrix.</Paragraph>
      <Paragraph position="4"> would be unambiguous. If the demonstrative is not accompanied by a nominal some strategy is needed to process the sentence. We opt for the smallest possible noun group encompassed by the touch (the &lt;3,2&gt; entry in the above case), and rely on our &amp;quot;backup&amp;quot; facility in case the user's intentions are not fulfilled. If the utterance &amp;quot;double this&amp;quot; is accompanied by a touch of the displayed name of a row, column or matrix, then the named object will be referenced.</Paragraph>
      <Paragraph position="5"> Pairing up touches with spoken phrases is straightforward when a single noun group is used with a single touch, as in &amp;quot;double this entry.&amp;quot; In a more complicated case we might have Add this entry to that row and put the result here.</Paragraph>
      <Paragraph position="6"> accompanied by three touches. The strategy here is to -air touches and utterances in the order given by the user. In the last example all touches functioned to establish focus or resol~=e no,~n group reference. If the emphasis function of touch is mixed in, a more difficult situation arises. If three touches accompany null Add this entry to the first row and put the result here.</Paragraph>
      <Paragraph position="7"> then the second touch was presumably to emphasize the first row or even to establish a rhythm of touching. In any case the facility to match touches with nondeictic expressions iS needed. If only two touches accompany this last sentence then the focusing function should take precedence, and the touches should be matched with &amp;quot;this entry&amp;quot; and &amp;quot;here.&amp;quot; The situation is made even more complex by the ability to establish focus verbally. In NLC the user can say Consider row four.</Paragraph>
      <Paragraph position="8"> Double that row.</Paragraph>
      <Paragraph position="9"> and the expression &amp;quot;that row&amp;quot; will refer to row four. ~f the same utterance is accompanied by a touch to a row other than four a potential conflict results. Our strategy is to give precedence to touch, since it is the more immediate focussing mechanism. Thus the sequence Consider row four.</Paragraph>
      <Paragraph position="10"> Double that row. (touching row three) will result in the doubling of row three. When both verbal and touch focus are present, nearly unresolvable ambiguities may result. The sequence Consider row four.</Paragraph>
      <Paragraph position="11"> Add this row to that row.</Paragraph>
      <Paragraph position="12"> accompanied by one touch, gives rise to the problem as to which demonstrative noun group to associate with row four, and which to associate with the touch. One strategy is to associate with a demonstrative noun group the touch that occurred closest to the time of utterance. Another possible strategy is to assume that the expression with that refers to the more distant element in focus (the one established verbally in this case). This takes advantage of the ~act that this and that can be distinguished in English grammar by the feature +NEAR. Unfortunately by a simple change iF stress pattern a speaker can undo this fairly weak regularity. Thus the sequence Consider row four.</Paragraph>
      <Paragraph position="13"> ~dd th{s row to that row.</Paragraph>
      <Paragraph position="14">  plus a single touch, where this bears prlmary stress and that bears secondary stress, should flnd t-- e~touch referring to &amp;quot;this row.&amp;quot; If the stress pattern were Add this row to that row.</Paragraph>
      <Paragraph position="15"> with primary stress on Add, the touch would more llkely be assoc--~-ated with that row. It is unfortunate that to date we ~w of no voice equipment sensitive enough to distinguish between two such stress patterns.</Paragraph>
      <Paragraph position="16"> Somewhat more complicated cases are possible= Consider row three.</Paragraph>
      <Paragraph position="17"> ~ld this row to that row and put the result in the first row.</Paragraph>
      <Paragraph position="18"> accompanied by two touches. Since we allow a touch to occur with expressions such as &amp;quot;the first row,&amp;quot; and since it is possible to disregard the element in verbal focus altogether, such a case produces multiple ambiguities. Although we foresee being able to resolve these ambiguities effectively, and ca~ always fall back on our &amp;quot;backup&amp;quot; facility in case of mistakes, we also believe that such complex cases will be extremely rare. No sentence of such complexity was produced in our solutions to the eight problems mentioned above. With a voice and touch facility, sentences tend to be shorter and simpler. NLC has implemented plurals, but we have not considered their use in touch input. Such sentences as or Multiply these elements by this element.</Paragraph>
      <Paragraph position="19"> Add these elements up.</Paragraph>
      <Paragraph position="20"> with multiple touches, would be useful. In the trial run of eight problems, the introduction of plurality resulted in up to fifty percent reduction in number of words needed and sentence length.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="186" end_page="188" type="metho">
    <SectionTitle>
MEASURING SYSTEM PERFORMANCE
</SectionTitle>
    <Paragraph position="0"> Progress in any endeavor is greatly aided if the level of accomplishment can be measured in some meaningful way. It is desirable to give a figure of merit for a system both so that a project can indicate to the world the degree of the achievement and also so that the project can internally Judge its own improvements over time. In voice language processing, one can attempt to measure performance by the word and sentence error rates. However, experience shows that these measures are highly dependent on two factors and that almost any level of performance can be reached if those factors are appropriately adjusted. Those factors are (a) the environment and type of test within which the measurement is made, and (b} the level of training of the system user.</Paragraph>
    <Section position="1" start_page="186" end_page="187" type="sub_section">
      <SectionTitle>
Type of Testing Environment
</SectionTitle>
      <Paragraph position="0"> Considering (a), we tend to classify the type of test for a recognizer into one of the following five categories and we expect significant differences in device response in each case.</Paragraph>
      <Paragraph position="1">  (1) Lists of words are read in tests performed by the manufacturer.</Paragraph>
      <Paragraph position="2"> (2) Lists of words are read in our laboratory.</Paragraph>
      <Paragraph position="3"> (3) Sentences are read in our laboratory. (discrete or connected) (4) Sentences are uttered in a problem solving situation in our laboratory. (discrete or connected) null (5) Sentences are uttered in a prob- null lem solving situation in the user environment. (discrete or connected) null In the first situation, a manufacturer is interested in advertising the best performance achievable. Tests are performed in controlled conditions with microphone placement and all system parameters set for optimum performance, and an expert speaker is used. In our laboratory, we are not interested in the best possible system performance but rather what we can realistically expect. The parameters are set at medium levels, there is some ambient noise, the microphone ~ay move during the test, and the user wil\] be anyone we happen to bring in regardless of their speech characteristics.</Paragraph>
      <Paragraph position="4"> As soon as the sequential words become organized as sentences, situation (3), the speaker begins to impose inflections on the utterance that will affect recognition. Certain words may be stressed, and intonation may rise an~\] fall as the sequential parts of each sentence are voiced. Training samples based on reading lists of vocabulary items tend to be inaccurate templates for words spoken in context. When sentences are spoken in a problem solving environment, situation (4), these effects increase and other aspects of word pronunciation change.</Paragraph>
      <Paragraph position="5"> When voice control stops being the central concern of the speaker, largeT variations in speech are bound to occur with accompanying larger error rates.</Paragraph>
      <Paragraph position="6"> The most difficult situation of all occurs in situation (5) where the user might not even be a person who could be brought into a voice laboratory. In this case, the user has only one concern, achieving the desired machine performance. Encouragement to speak carefully could be met with impatience, and a few system errors could result in even worse speech quality and further degraded performance. Our experience has been that word error rates increase from about three to seven percent as one moves to each more difficult situation type depending on the vocabulary, t~e equipment, and other factors. Consequently, we tend to distrust any figures gathered in the easier classes of environments and attempt to do our own testing in the more difficult and more interesting situations. Most of our recent data is of type (4) and we hope to gain some type (5) experience in the coming year.</Paragraph>
      <Paragraph position="7"> Training the System User The second major factor affecting voice recognition performance is the level of training of the system user. Humans are extremely adaptive and capable of learning behaviors to a high degree of perfection. Thus the designer of a voice system might, over the years, learn to chat with it like an old friend whereas others might not be able to use the system at all. ~gain, almost any level of system performance can be observed depending on the quality of training of the user.</Paragraph>
      <Paragraph position="8"> Our approach to controlling this factor has been to develop a standardized training procedure and to only report statistics on uninitiated users whose experience with the system is limited to this procedure. Ideally this procedure would be administered by machine to obtain maximum uniformity in training but this has not yet been possible.</Paragraph>
      <Paragraph position="9"> The training procedure has two parts.</Paragraph>
      <Paragraph position="10"> The first part is an informal session in which the user is told how to speak individual words to the system and examples of the complete vocabulary are collected by the recognition system. ~he second part is administered very mechanically by reading a tutorial document to the user and requesting the utterance of trial sentences. This portion of the training introduces the user to the interactive system's capabilities and is specifically designed to be administered by the machine.</Paragraph>
    </Section>
    <Section position="2" start_page="187" end_page="188" type="sub_section">
      <SectionTitle>
Some Performance Data
</SectionTitle>
      <Paragraph position="0"> An experiment was run during the summer of 1982 to obtain DP-200 performance data in an environment of type (4) as described above. Beca~ise no voice interactive system was yet available, a system simulation was used. After the first part of the training session in which the voice samples were collected, the subject was placed in a room behind a display terminal with a head mounted microphone. The voice tutorial was read to the subject through a loudspeaker at the terminal introducing the capabilities of the simulated system and the types of voice commands that could be executed.</Paragraph>
      <Paragraph position="1"> The subject's commands were recognized by the DP-200 and executed by the simulation.</Paragraph>
      <Paragraph position="2"> Thus each user command resulted in either appropriate action visible on the screen or a voice error message. In the final portion of the experiment, the subject was asked to solve an invoice pro61em that involved computing costs for a series of individual items and finding the tax and total. The experiment gave a reasonably accurate simulation of the expected NLC system behavior when it becomes completely voice interactive. The experiment attempted to simulate a syntactic level of voice error correction but nothing deeper.</Paragraph>
      <Paragraph position="3"> It was fo,lnd that the DP-~00 word error rate rose to about 20 percent in this test with about 14 of the 20 percent being automatically correctable. The vocabulary size was 80, with three samples of most words, and six samples of a few of the difficult words, stn1=ed in the DP-200.</Paragraph>
      <Paragraph position="4"> This means that roughly every two to four sentences will have a single word error not correctable at shallow levels. This data comes from the first two hours o~ usage for these subjects and we expect significant improvement as usage experience increases over time.</Paragraph>
      <Paragraph position="5"> More recently, the ~LC system has become operative in a voice driven mode and subject testing has begun using the same training procedure. It is too early to report results but it appears that the performance predicted in the simulatiou will be approximately achieved. This experiment will include longer usage by the subjects and thus indicate how much error rates decrease over time.</Paragraph>
      <Paragraph position="6">  In conclusion, we have at this time only fragmentary information regarding what levels of performance can be achieved. HOwever, we have developed some tools for making measurements and will report the results as they become available. null systems has been refined to the point that it could actually support user interactions in real time as we are attempting to do. Our project uses well developed speaker dependent voice recognition equipment with a small enough vocabulary to achieve usable accuracy rates.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>