File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/83/a83-1030_metho.xml

Size: 10,616 bytes

Last Modified: 2025-10-06 14:11:30

<?xml version="1.0" standalone="yes"?>
<Paper uid="A83-1030">
  <Title>SPEECH INTERFACES: SESSION INTRODUCTION</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SPEECH INTERFACES: SESSION INTRODUCTION
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> The speech interface is the natural one for the human user and is beginning to be used in a limited way in many applications. Some of these applications are experimental; still others have achieved the status of cost-effective utility. A brief summary of the current state-of-the-art of speech input and output are presented. The two papers in the session represent specific examples of current work. Some comments on the need for linguistically oriented development conclude the paper.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="178" type="metho">
    <SectionTitle>
I INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Over the past four decades it has often been felt that the solution to the problem of &amp;quot;machine recognition of speech&amp;quot; is &amp;quot;.. just around the corner.&amp;quot; When the sound spectrograph was invented (a little less than forty years ago) engineers, acousticians, phoneticists, and linguists were certain that the mysteries of speech were about to be unveiled. When powerful computers could be brought to bear (say - twenty years ago) there was a renewed feeling that such tools would provide the ~eans to a near term solution. When artificial intelligence was the buzzword (a little over ten years ago) it was clear that now the solution of the recognition problem was at hand. Where are we today? A number of modest, and modestly priced, speech recognition systems are on the market and in use. This has come about because technology has permitted some brute force methods to be used and because simple applications have been found to be cost effective.</Paragraph>
    <Paragraph position="1"> In speech output systems a similar pattern has emerged. Crude synthesizers such as the ~askins pattern playback of thirty years ago were capable of evoking &amp;quot;correct&amp;quot; responses from listeners. Twenty-five years ago it was thought that reading machines for the blind could be constructed by concatenating words. Twenty years ago formant synthesizers sounded extremely natural when their control was a &amp;quot;copy&amp;quot; of a natural utterance. Modern synthesizers are one one-thousandth the size and cost; they still only sound natural when a human utterance is analyzed and then resynthesized as a complete entity. Concatenatin 8 words is still no better, though cheaper, than it was twenty years ago.</Paragraph>
    <Paragraph position="2"> A. Speech Input There are now several speech recognition systems on the market which are intended to recognize isolated words and which have been trained for an individual speaker. The vocabulary sizes are on the order of 100 words are phrases. Accuracy is always quoted at &amp;quot;99+%.&amp;quot; These recognizers use a form of template matching within a space which has the dimensions of features versus time. The &amp;quot;true&amp;quot; accuracy is a function of the vocabulary size, the degree of cooperativeness of the speaker, and the innate dissimilarity of the vncab ulary. Since the systems are recognizing known words by known speakers the major source of varia billty in successive words is the time axis.</Paragraph>
    <Paragraph position="3"> The same word may (and will) be spoken at different speaking rates. Unfortunately, different speaking rates do not result in a linear speed change in all parts of a word; the voiced portions of the word, loosely speaking the vowels, respond more to speed change; the unvoiced portions of the word, loosely the consonants, respond less to speed change. As a result, a non-linear time adjustment is desired when matching templates. This sort of time adjustment is carried out with a mathematical process known as dynamic programming which permits exploration of all plausible non-linear matches at the expense of (approximately) squaring the compu rational complexity in contrast to the comblna torlal computational growth that would otherwise be required. The medium and high performance speech recognizers usually contain some form of dynamic programming. In some cases more than one level of dynamic programming is used to provide for recognition of short sequences of words.</Paragraph>
    <Paragraph position="4"> The actual use of these recognizers has developed a number of consequences. Many of them, including the first paper in this session involve the use of speech recognition during hands-andeyes busy operations. These applications will almost always be interactive in nature; the system response may be visual or aural. Prompt response saying what the system &amp;quot;heard&amp;quot; is crucial for improving the speaker's performance. A cooperative speaker clearly adapts to the system. To date, many applications are found where a restricted interactive speech dialog is useful and economical. At this time the speech recognition  mechanism is relatively inexpensive; the expensive component is the initial cost of developing the dialog for the appllcaClon and interfacing the recognition element Co the host computer system.</Paragraph>
    <Paragraph position="5"> At the present tlme recognition is not accomplished in units smaller than the word. It has been hoped chat it might be possible to segment speech into phonemes. These would be recognized, albeit with some errors; the strings of phonemes would then be matched with a lexicon. To date, adequate segmentation for this sort of approach has not been achieved. In fact, in continuous fluent speech good word boundaries are not readily found by any algorithmic means.</Paragraph>
    <Paragraph position="6"> B. Speech Output There are relatively few speech synthesizers in the pure sense of the word. There are many speech output devices which produce speech as the inverse of a previously formed analysis process.</Paragraph>
    <Paragraph position="7"> The analysis may have been performed by encodln&amp; techniques in the tlme domain; alternatively, it may be the result of soma form of extracting a vocal source or excitation function and a vocal tract descrlptlou. When the analysis is performed on a whole phrase the prosodic features of the indivdual uttering the phrase are preserved; the speech sounds natural. When individual words produced by such an analysls-synthesls process are concatenated the speech does not sound natural.</Paragraph>
    <Paragraph position="8"> In any event, the process described above does not allow for the open ended case, synthesis of unrestricted text. This process requires that a number of steps be carried out in a satisfactory way. First, orthographic text must be interpreted; e.g. we read &amp;quot;NFL&amp;quot; as a sequence of three words but we pronounce the word &amp;quot;FORTRAN', we automatically expand out the abreviation &amp;quot;St.&amp;quot;, etc. Second, the orthography must be converted Co pronunciation, a distinctly non-trlvial task in En~llsh. This is normally accomplished by a set of rules together with a table of exceptions to those rules. Although pronouncing dictionaries do exist in machine form, they are still coo large for random access memory technology, although thls will not be true in the reasonably near future.</Paragraph>
    <Paragraph position="9"> Proper nouns, especially names of people and places, will often not be amenable to the rules for normal English. Third, the pronunciation of the word must be mapped into sequences drawn from an inventory of smaller units. At various times these units have been allophones, phonemes, dlphones (phoneme pairs), demlsyllables, and syllables. The units are connected with procedures which range from concatenation to smooth interpolation. Finally, it is necessary to develop satisfactory prosody for a whole phrase or sentence.</Paragraph>
    <Paragraph position="10"> This is normally interpreted as providin&amp; the information about inflection, timing, and stress.</Paragraph>
    <Paragraph position="11"> This final step is the one in which the greatest difficulty exists at the present time and which presents the strongest bar to natural sounding speech. The second paper in thls session deals wlth the development of stress rules for prosody, one component of =he overall problem.</Paragraph>
    <Paragraph position="12"> llI LINGUISTIC NEEDS IN SPEECH INTERFACES A, Current Research Moat of the current high end work in speech recognition attempts Co c6nstrain the allowable sequence of words by the application of some kind of grammar. This may be a very artificial grammar, for example the interaction wlch an airline reservation system. Other research efforts attempt Co develop models of the language through an information cheoretlc analysis. Coming full circle we find words being analyzed as a Markov process; Merkov, of course, was analyzing language when he developed thls &amp;quot;mathematically defined&amp;quot; procese.</Paragraph>
    <Paragraph position="13"> Normalizing recognition to the speaker is being approached in two ways. The first, currently being explored at the word reco&amp;nitlon level consists of developing enough samples of each word from many speakers so chat clustering techniques will permit the speaker space to be spanned with a dozen or so examples. The second approach attempts to enroll a speaker in a recognltlon system by speaking &amp;quot;enough&amp;quot; text so tha~ the system is able to develop a model of that person's speech.</Paragraph>
    <Paragraph position="14"> In research on speech synthesis considerable attention is now being &amp;iven to try, by analysis, to determine rules for prosody. Application of these rules requires grammatical analysis of the text which is to be converted co speech.</Paragraph>
    <Paragraph position="15"> 8. The Future As both of the speech interface tasks become more and more open-ended It is clear that satisfactory performance will require very substantial aid from linguistic reseacrh. In the case of recognition this is necessary to reduce the number of hypotheses that must be explored at any given point in a stream of unknown words. In the case of text-to-speech, understandin~ of what iS being said will contribute to producing more natural and acceptable speech.</Paragraph>
  </Section>
class="xml-element"></Paper>