<?xml version="1.0" standalone="yes"?>
<Paper uid="J79-1060">
  <Title>PITCH CONTOUR GENERATION IN SPEECH SYNTHESIS: A JUNCTION GRAMMAR APPROACH</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
PITCH CONTOUR GENERATION
IN SPEECH SYNTHESIS
A JUNCTION GRAMMAR APPROACH
ALAN K. MELBY, WILLIAM J. STRONG,
ELDON G. LYTLE, AND RONALD MILLETT
Translation Sciences Institute
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="3" type="metho">
    <SectionTitle>
SUMMARY
</SectionTitle>
    <Paragraph position="0"> Computer-based text synthesis systems require a means for generating sentence-level pitch contours. These contours must have a certain degree of &quot;human fidelity&quot; if the synthetic speech is to sound natural and not too machine-like.</Paragraph>
    <Paragraph position="1"> The pitch contours in currently operational text synthesis systems are still not perfectly natural-sounding, and thus computer generation of pitch contours is a topic of current interest. The introduction includes a survey of current work in this area by researchers at MIT, Bell Labs, Stanford, etc., describing their general approaches.</Paragraph>
    <Paragraph position="2"> The research described in this paper uses Junction Grammar as a theoretical base and Linear Predictor Coefficient (LPC) methods as an analysis-synthesis technique. Motivations for these decisions are presented. Section I begins with an explanation of some sentences which are being studied. For example, there is likely a stress on &quot;study&quot; in the sentence &quot;The boys who study get good grades,&quot; if the context is &quot;but the boys who don't get bad grades.&quot; On the other hand, if the context is &quot;but the girls who study get poor grades,&quot; then there is probably stress on &quot;boys.&quot; The various readings of &quot;The boys who study...&quot; and other sentences are explained within the Junction Grammar framework. An overview is given of a system for generating pitch contours for a sentence from a Junction Grammar semantico-syntactic representation.</Paragraph>
    <Paragraph position="3"> Section I also includes a description of an extension of Junction Grammar which defines an object called an articulation tree, corresponding to each junction tree. A junction tree contains semantico-syntactic information but no lexical information. An articulation tree contains segmental information about each lexical item and suprasegmental or prosodic information combining the lexical items into prosodic units. Semantic distinctions in junction trees are recoded as distinctions in the prosodic structure of articulation trees, and then articulation trees are used to generate pitch contours. Junction trees and articulation trees are included as figures for several sentences.</Paragraph>
    <Paragraph position="4"> Section II describes how pitch contours are generated, including the recoding of junction trees as articulation trees, the assignment of initial and final pitch levels and pitch at nuclear syllables, and how the generated contours are combined with analysis parameters and synthesized into speech. It should be noted that, in the current implementation, the junction trees are entered manually rather than by automatic analysis. The text includes several graphs of natural pitch contours as well as contours generated by the computer system.</Paragraph>
    <Paragraph position="5"> The pitch contour system produces a synthesis output for each reading of a sentence. Thirty-five sentences, some with natural, some with hand-drawn, and some with machine-generated pitch contours, were evaluated for naturalness and &quot;intelligibility&quot; of intonation in four types of tests. Results of testing several subjects showed that the generated pitch contours were judged nearly as natural as human-produced contours, and except for some specific problems involving duration, the generated contours were intelligible in the sense of causing the listener to perceive the intended reading of the sentence. The text includes a quantitative summary of the results of the evaluation.</Paragraph>
    <Paragraph position="6"> For the corpus of sentences treated so far, Junction Grammar provides a satisfactory theoretical base for generating pitch contours and defines some specific cases where pitch alone is insufficient to make distinctions and must be used with duration, pause and intensity.</Paragraph>
    <Paragraph position="7"> Appendices: A. Suggested background reading in acoustic speech processing and Junction Grammar.</Paragraph>
    <Paragraph position="8"> B. Glossary of terms, e.g. LPC, F0, Hertz, etc.</Paragraph>
    <Paragraph position="9"> C. Description of the computer implementation (on a PDP-15 with a VT-15 graphics display unit).</Paragraph>
    <Paragraph position="10"> D. More details on the evaluation procedure.</Paragraph>
    <Paragraph position="11"> For the convenience of the reader, a recent paper on Junction Theory presented at a BYU Linguistics Symposium is reprinted at the end of the microfiche.</Paragraph>
  </Section>
  <Section position="3" start_page="3" end_page="3" type="metho">
    <SectionTitle>
TABLE OF CONTENTS
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="4" start_page="3" end_page="58" type="metho">
    <SectionTitle>
INTRODUCTION AND SURVEY OF RESEARCH IN PITCH CONTOURS
Section
</SectionTitle>
    <Paragraph position="0"> I. THEORY; II. METHOD; III. EVALUATION AND DISCUSSION</Paragraph>
  </Section>
  <Section position="5" start_page="58" end_page="69" type="metho">
    <SectionTitle>
REPRINT OF ONE OF THE REFERENCES (Lytle, 1976)
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="6" start_page="69" end_page="69" type="metho">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> All computer-based text synthesis systems require a means for generating sentence-level pitch contours. These contours must have a certain degree of &quot;human fidelity&quot; if the synthetic speech is to sound natural, that is, not too machine-like. The pitch contours in currently operational text synthesis systems are still not perfectly natural-sounding, and thus computer generation of pitch contours is a topic of current interest. This interest is shown, for example, by Allen as he discusses pause and duration in text synthesis and then goes on to say: &quot;If temporal control presents great problems in the description of speech, then the problems of fundamental frequency (F0), or pitch control, are at least as difficult. Once again, problems arise due to the fact that the F0 is correlated with many factors, including vowel tongue height, previous consonant, breath group contour, syntactic and semantic content of words, whether a sentence is a question, intonation effects, and word boundary glottalization.&quot;</Paragraph>
    <Paragraph position="1"> (Allen, 1976: 440) Given the need for further research in pitch control, a question remains of how to approach the problem. The authors feel it is important to work within a linguistic model that interrelates semantic and phonetic phenomena. Later in his article, Allen makes the following statement (which coincides with our philosophy): &quot;The current use of sophisticated means for pitch recording, coupled with increased interaction between linguistics and speech researchers, should, however, lead to significantly improved pitch control programs which are based on sound linguistically motivated theory.&quot;</Paragraph>
    <Paragraph position="2"> (Allen, 1976: 441) The need for interaction between linguistics and speech research is further explained by Umeda (1976: 450): &quot;The message realization forms one structure as a whole.</Paragraph>
    <Paragraph position="3"> Its constituents - acoustic realization, higher level prosody, and syntax-semantics - interact with each other very closely; a decision made at any level derives immediately from the obtained result at the level above, and affects a decision at the level below.&quot;</Paragraph>
    <Paragraph position="4"> The remainder of this section consists of a survey of some of the current work in this area in the USA (at MIT, Bell Labs, and Stanford University), in Germany, and in the USSR. Then the section will conclude with an introduction to the present research.</Paragraph>
    <Paragraph position="5"> A. MIT At MIT, Allen (1976) is working on pitch control as an element in his overall plan to produce a system capable of producing synthetic speech from unrestricted English text. He points out that although a syntactic and semantic analysis is needed, no existing automatic algorithm can provide that analysis reliably for entire sentences of unrestricted text. So he has elected to do a local analysis of the sentence first and then tie together the local analyses into a sentence-level analysis if possible. The analyzer is thus designed so that if at some point complete sentence analysis is blocked, the partial analyses are still useful in generating the pitch contour and other prosodic controls such as duration and pause. In response to the need for a theoretical framework for relating a text and its pitch contour, Allen is using the ideas of Halliday (1970) (e.g. discourse focus) to investigate such questions as when and why elements of a verb string are stressed. For example, he notes that the sentence &quot;A farmer was eating the carrot&quot; will receive emphasis on &quot;eating&quot; if it is in response to a question about what the farmer is doing. Allen correctly notes that: &quot;The discovery and coordination of all these effects is a large and continuing effort, and it is clear that substantial semantic and discourse-level knowledge is needed to correctly predict prosodic parameters.&quot; (Allen, 1976: 441) B. Bell Labs Several workers at Bell Labs have attacked the problem of controlling pitch in speech synthesis. Olive (1975) describes a system for generating pitch contours for the sentence type &quot;article-subject-verb-article-object&quot; with an optional adjective on the subject or object. His method for generating the pitch contour was to record several sentences of the specified type using random words and to average the natural pitch contours to obtain prototype contours. 
Then the contour for each word was approximated by a fourth-order polynomial to &quot;facilitate linear stretching and compression of the fundamental frequency contour.&quot; Olive reports that by using this pitch contour generation system, in conjunction with a word concatenation scheme in which the words are stored in linear predictor coefficient (LPC) code, the synthesized sentences were of high quality. Umeda, at Bell Labs, is also concerned with pitch contours, asserting that &quot;Among acoustic components, pitch (the fundamental frequency of the voice) shows the most direct relation to higher level prosody, stress and boundaries&quot; (Umeda, 1976: 448). Umeda's algorithm for controlling prosodic parameters is based on a syntactic analysis of the input text. The analyzer fits each clause into a template consisting of the following optional slots: sentence modifier, subject, verb, object or complement, rail modifier, and punctuation mark. A point where the above order of template elements is violated is marked as a boundary, and boundaries are later used to assign pauses and intonation (Umeda, 1975).</Paragraph>
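    <Paragraph> The role of the polynomial in Olive's scheme can be illustrated with a minimal sketch. It assumes that each word contour is stored as polynomial coefficients over a normalized time axis running from 0 to 1, so that stretching or compressing a word only changes the number of sample points. The function names and the coefficient values are hypothetical and are not taken from Olive (1975).

```python
def eval_poly(coeffs, t):
    """Evaluate a polynomial at t by Horner's rule; coeffs are highest order first."""
    value = 0.0
    for c in coeffs:
        value = value * t + c
    return value


def sample_contour(coeffs, n_frames):
    """Sample a word contour defined on normalized time 0..1 at n_frames points."""
    if n_frames == 1:
        return [eval_poly(coeffs, 0.0)]
    return [eval_poly(coeffs, i / (n_frames - 1)) for i in range(n_frames)]


# Hypothetical fourth-order coefficients (Hz), not values from Olive (1975).
word_coeffs = [-20.0, 30.0, -40.0, 10.0, 120.0]

short_version = sample_contour(word_coeffs, 5)  # compressed word: 5 frames
long_version = sample_contour(word_coeffs, 9)   # stretched word: 9 frames
```

Both versions start and end at the same frequencies and share the same shape; only the number of frames differs, which is presumably what makes a polynomial form convenient for duration adjustment.</Paragraph>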
    <Paragraph position="6">  C. Stanford [...] prosodics in the Institute for Mathematical Studies in the Social Sciences (IMSSS). Researchers on this project are developing a system which, ultimately, is intended to do synthesis in real time for use in computer-assisted instruction at IMSSS (Levine, 1976). Their technique is to compile a lexicon of words in LPC code (Atal and Hanauer, 1971) and then, when a given sentence is to be synthesized, concatenate the code for each word, adjusting durations and pitch contours as needed. While Olive throws away the original pitch contour of each word, the IMSSS approach is to adjust the original contour of the word and then further smooth the contour so that each word will not sound sentence-final. The IMSSS group uses the ideas of Leben (1976), who relates English prosody to tone languages in that he views both tone languages and English as having a suprasegmental melody which is combined with the segmental phonological elements. The IMSSS group (Levine, 1976: 3) defines melody as a sequence of &quot;autosegmental tones (autonomous from the phonological segments) selected from the tonal repertoire of the language.&quot; These tones are treated theoretically as discrete fundamental frequency levels, but then they are realized phonetically as continuous contours. In order to assign tones to key syllables, a program analyzes the sentence to be synthesized using a simple phrase structure grammar which brackets phrases, clauses and other complex constituents, and indicates boundaries between major constituents.</Paragraph>
    <Paragraph position="7"> D. Germany Complementary to pitch contour generation is the study of the perception of pitch contours.</Paragraph>
    <Paragraph position="8"> In Germany, Isacenko and Schadlich (1970) performed an interesting series of experiments on the perception of German intonation. Natural sentences illustrating different intonation patterns were recorded and monotonised at various fundamental frequencies (e.g. 150 Hertz and 178.6 Hertz). Then the tapes of the monotone versions were cut and spliced at various points. The spliced tapes thus had an artificially simplified intonation of exactly two tone levels. The team found that they could change the way listeners perceived certain ambiguous sentences by changing only the points at which tone switches occurred.</Paragraph>
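    <Paragraph> The two-level stimuli can be sketched as follows. The sketch assumes a contour represented as one F0 value per analysis frame; the frame counts, switch positions, and function name are hypothetical, and only the two frequencies come from the description above.

```python
def two_tone_contour(n_frames, switch_frames, low_hz=150.0, high_hz=178.6,
                     start_high=False):
    """Return a per-frame F0 list that toggles between two levels at each switch."""
    switches = set(switch_frames)
    level_is_high = start_high
    contour = []
    for frame in range(n_frames):
        if frame in switches:
            level_is_high = not level_is_high  # splice point: change tone level
        contour.append(high_hz if level_is_high else low_hz)
    return contour


# Same two levels, different switch points (hypothetical frame indices).
early = two_tone_contour(8, [2])
late = two_tone_contour(8, [6])
```

The two contours use identical frequencies and differ only in where the switch occurs, which is the single variable the experiments manipulated.</Paragraph>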
    <Paragraph position="9"> E. USSR In the USSR, fiaavel et al. (1976) have also performed some experiments in manipulating pitch contours while leaving other parameters constant. They are interested in finding ways to &quot;decrease the amount of information necessary for the description of pitch curves without distorting the parameters interpreted by man as prosodic characteristics of a sentence.&quot; They base this search on the assumption that man has only a limited short-term memory available for storing the pitch contour and so makes decisions concerning the prosody of a sentence by extracting prosodic features which contain considerably less information than that needed to reconstruct exactly the same pitch contour. They conclude from these experiments that decisions such as declarative versus interrogative are based on the position of the rise or fall in pitch and not on the difference in pitch from high to low. They also conclude that in determining emphasis, the position of the peak value of the second derivative of the pitch contour is very significant.</Paragraph>
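    <Paragraph> The finding about emphasis can be illustrated with a discrete approximation. The sketch below assumes a pitch contour sampled at equal intervals and uses the second difference as a stand-in for the second derivative; the contour values and function names are hypothetical, not data from the experiments.

```python
def second_difference(contour):
    """Discrete second derivative: f[i-1] - 2*f[i] + f[i+1] for interior frames."""
    return [contour[i - 1] - 2.0 * contour[i] + contour[i + 1]
            for i in range(1, len(contour) - 1)]


def peak_curvature_frame(contour):
    """Frame index at which the second difference reaches its peak value."""
    curvature = second_difference(contour)
    return curvature.index(max(curvature)) + 1  # +1: interior frames start at 1


# Hypothetical contours in Hz: the same rise position with two excursion sizes.
large_rise = [100.0, 100.0, 100.0, 130.0, 130.0, 130.0]
small_rise = [100.0, 100.0, 100.0, 110.0, 110.0, 110.0]
```

Both contours give the same peak position even though the size of the rise differs, which parallels the reported result that position matters more than the high-to-low difference.</Paragraph>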
    <Paragraph position="10"> F. Brigham Young University (BYU) The research in pitch contour generation to be described in this paper addresses basically the same questions as the various projects surveyed above: (1) What theoretical base might one use to represent syntactic and semantic information? (2) How does one convert linguistic information, both at sentence level and discourse level, to the algorithmic control of prosodic parameters? (3) What aspects of the pitch contour (e.g. 1st and 2nd derivatives, transitions relative to key syllables, and actual frequency) are significant in causing intonation and emphasis options to be perceived? (4) What synthesis technique should be used to incorporate the prosodic controls into a working system (e.g. LPC synthesis, formant synthesis, or articulatory synthesis)? We have chosen to use Junction Grammar (JG) as a theoretical framework within which to look for answers to questions (1) and (2) above. Junction Grammar refers to a linguistic model formulated by Lytle (1974). Subsequently, Junction Theory has been used to formulate a new theory of phonology in which a semantico-syntactic representation (called a junction-tree) is recoded as a general articulatory representation (called an articulation-tree) (Lytle, 1976). Junction Grammar extended to include Junction Phonology was selected for use in the BYU project because it seems to provide some significant insights and a flexible framework for our research.</Paragraph>
    <Paragraph position="11"> It should be pointed out that at present there is no completely automatic algorithm for obtaining a detailed and powerful representation of syntax-semantics from general English text. For this reason, other researchers (e.g., Allen at MIT, Umeda at Bell Labs, and Levine at Stanford) have chosen to use a simple representation which can be obtained automatically. The authors' research, however, takes advantage of a larger project (Lytle, 1975) which uses man-machine interaction to obtain a more powerful representation than can be obtained automatically. Therefore, it was decided to use the full power of Junction Grammar representations in hopes of a future automatic analyzer rather than use some restricted version of Junction Grammar and be forced to add to it piece by piece to account for more and more phenomena. To gain insight into topic (3) above (concerning which aspects of the pitch contour are significant to perception), we experimented with manually specified pitch contours.</Paragraph>
    <Paragraph position="12"> In answer to question (4) above (concerning the choice of an analysis-synthesis technique), we have chosen to work initially with an LPC synthesis technique (as did Olive at Bell Labs and Levine at Stanford) because an LPC software package was already available at BYU. But long-range plans include the use of an articulatory functional model (Flanagan, 1975).</Paragraph>
    <Paragraph position="13"> I. THEORY We now turn our attention to certain linguistic phenomena which we consider especially interesting. First, we will illustrate the phenomena with sample sentences which will be discussed in intuitive terms and then in terms of Junction Grammar junction-trees (J-trees) and articulation-trees (A-trees). The section will conclude with a block diagram of what a fully developed Junction Grammar text synthesis system would look like and a block diagram of the system as currently implemented.</Paragraph>
    <Paragraph position="14"> A. Intuitive Presentation of Some Test Sentences Consider the sentence &quot;John drove to the store.&quot; This sentence can be read several different ways depending on the discourse context. Figure 1 shows five possible readings and their context. Whatever system is used to represent the linguistics of this sentence, it should be possible to represent each of these five readings uniquely.</Paragraph>
    <Paragraph position="15"> Sentence / Possible context. 1a. John drove to the store. / What happened? 1b. John drove to the store. / Who drove to the store? 1c. John drove to the store. / How did John get to the store?</Paragraph>
    <Paragraph position="16"> 1d. John drove to the store. / Where did John drive? 1e. John drove to the store? / John drove to the store, you know.</Paragraph>
    <Paragraph position="17"> (Are you sure that's what you meant to say?) Figure 1. John drove to the store.</Paragraph>
    <Paragraph position="18"> Now consider the question &amp;quot;Did John or Mary come?&amp;quot; Suppose that you heard someone come in but you did not see who it was.</Paragraph>
    <Paragraph position="19"> Nevertheless, you are sure that it was either John or Mary. In this context, you would put stress on &quot;John&quot; and on &quot;Mary&quot; and a falling pitch at the end of the sentence. Then you would expect a reply of &quot;John&quot; or &quot;Mary.&quot; (If you receive as a reply simply &quot;yes,&quot; then the person responding either did not understand or is trying to be funny.) On the other hand, suppose a whole crowd came to a party and you have a message which you must deliver to either John or Mary. In this context, you may or may not stress &quot;John&quot; and &quot;Mary,&quot; but you would certainly end the sentence with a rising pitch. Then you would expect a yes/no reply, or perhaps a yes/no with additional volunteered information such as &quot;Yes, John is over there in the corner.&quot; Again, we would like our system of representation to handle this distinction. The two readings of &quot;Did John or Mary come?&quot; are summarized in Figure 2.</Paragraph>
    <Section position="1" start_page="69" end_page="69" type="sub_section">
      <SectionTitle>
Sentence
</SectionTitle>
      <Paragraph position="0"> Finally, consider the sentence &quot;The boys who study get good grades.&quot; What difference in meaning is there in stressing &quot;study&quot; as opposed to stressing &quot;boys&quot;? The difference can be illustrated by expanding the sentence to &quot;The boys who study get good grades but the others do not.&quot; If &quot;study&quot; is stressed, &quot;others&quot; is interpreted as &quot;boys,&quot; namely the boys who do not study. If, however, &quot;boys&quot; is stressed, &quot;others&quot; may no longer be interpreted as &quot;boys,&quot; but it can be interpreted as &quot;girls&quot; or &quot;men who study&quot; or some other group of students in contrast with boys. Once again, our system of representation needs to handle this distinction, and handle it in a way consistent with the treatment of other distinctions. Three readings of this sentence are summarized in Figure 3.</Paragraph>
    </Section>
    <Section position="2" start_page="69" end_page="69" type="sub_section">
      <SectionTitle>
Sentence Possible continuation
</SectionTitle>
      <Paragraph position="0"> B. Junction Grammar Representations of the Same Sentences We now discuss how Junction Grammar represents the above distinctions in its representations. If the reader is not as yet familiar with Junction Grammar, it might be advisable to consult Appendix A before reading this section. As indicated therein, some recent refinements of Junction Grammar are not yet available in published form. We therefore briefly discuss two of them here. One is the specialization of subjunction in J-trees, and the other is the explicit representation of modalizers.</Paragraph>
      <Paragraph position="1"> Direction of Subjunction. First consider the three major specializations of subjunction shown in Figure 4.</Paragraph>
      <Paragraph position="2">  A right subjunction (* 0) often signifies that information is to be entered into the hearer's memory net. For example, when we read the sentence &quot;I saw a lost child with a scraped knee this morning, and I helped him find his mother,&quot; we enter (according to Junction theory) into our memory a slot for a child who was lost. The junction between &quot;a&quot; and &quot;child&quot; would be N (&quot;a&quot;) *- N (&quot;child&quot;). If we next read the sentence, &quot;The child had been crying for two hours, the poor thing,&quot; we would recover the slot for the child and add to it the information that he had been crying. The junction between &quot;the&quot; and &quot;child&quot; in this case would be N (&quot;the&quot;) ** N (&quot;child&quot;). The third type of subjunction (**.) would be used, for example, in the sentence &quot;John, our mailman, is going to retire in March,&quot; to show that &quot;John&quot; and &quot;our mailman&quot; are defining the same person independently (cf. the traditional restrictive/non-restrictive distinction).</Paragraph>
      <Paragraph position="3"> In the above examples, we considered full subjunctions (e.g.</Paragraph>
      <Paragraph position="4"> &quot;John, our mailman&quot;), but the same specializations apply to interjunctions (e.g. &quot;John, who is our mailman&quot;). In a normal, restrictive modification, a left subjunction is used. For example, in &quot;Please give me the yellow book on the second shelf,&quot; &quot;yellow&quot; and &quot;book&quot; would be joined as follows (Fig. 5).</Paragraph>
      <Paragraph position="6"> For an explanation of the various nodes in this representation for a simple phrase see Lytle (1975).</Paragraph>
      <Paragraph position="7"> In the sentence &quot;Of Tom, John and Rudolph, John drove to the store,&quot; the prepositional phrase &quot;of Tom, John and Rudolph&quot; does not restrict the meaning of &quot;John&quot; in the way &quot;yellow&quot; restricted &quot;book&quot; in the previous example. Actually, in this case, &quot;John&quot; restricts the scope of the prepositional phrase. As a reflection of this, the prepositional phrase is interjoined with &quot;John&quot; using a right subjunction as illustrated in Figure 6.</Paragraph>
      <Paragraph position="8">  We call this an example of Frame II modification because the</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="69" end_page="69" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> right subjunction is relating &quot;John&quot; to a second frame of reference (i.e.</Paragraph>
    <Paragraph position="1"> Tom, John and Rudolph). On the other hand, &quot;yellow book&quot; is a Frame I modification because it restricts &quot;book&quot; within its own frame of reference (i.e. it determines which book we are talking about).</Paragraph>
    <Paragraph position="2"> Remainder. The second type of specialization mentioned in Figure 4 is an indication of remainder. The concept of remainder (Lytle, 1974) is concerned with whether all or only part of a set is referred to.</Paragraph>
    <Paragraph position="3"> If one desires to indicate whether there is a remainder in a subjunction, he simply replaces the dot with either a hyphen or an equals sign.</Paragraph>
    <Paragraph position="4"> The Hyphen option. For example, from the sentence &quot;Please give me the yellow book on the second shelf,&quot; we must assume that there are books of some color other than yellow on the second shelf. These other colored books are the remainder, and we could diagram &quot;yellow book&quot; more specifically than before as follows (Figure 7).</Paragraph>
    <Paragraph position="5"> The Equals option. One common case of the equals option is for explicit modalizers (e.g. articles). For example, the phrase &quot;The child&quot; could be diagrammed as follows (Figure 8). The identity of &quot;child&quot; is retrieved and placed in the article &quot;the,&quot; filling it entirely and leaving no remainder. However, for our purposes, we will leave the modalizers implicit and simply use N (the) cat. This brief discussion of specialized subjunction and modalizers will suffice for us to reexamine the three sample sentences presented at the beginning of the chapter, but this time in terms of J-trees and A-trees.</Paragraph>
    <Paragraph position="6"> &quot;John drove to the store.&quot; Figure 9 shows the J-tree and A-tree for the neutral reading of &quot;John drove to the store&quot; (sentence 1a of Figure 1). The J-tree (a semantico-syntactic representation) is consistent with the version of Junction Grammar described by Lytle (1975). The A-tree (a phonological representation) is consistent with Junction Phonology (Lytle, 1976), except that the internal structure of the V3 nodes is not shown. This A-tree specifies that the sentence is to be pronounced in two units, &quot;John&quot; and &quot;drove to the store,&quot; and &quot;drove to the store&quot; is further divided into &quot;drove&quot; and &quot;to the store.&quot; The subjunctions numbered 1 and 2 indicate the relations between the sub-phrases. In an articulation tree, a left subjunction between H constituents indicates that the right operand is prosodically subordinate to the left operand. As for the pitch contour, a left subjunction causes a downward pitch shift. Similarly, a right subjunction causes an upward shift. The extra subjunction at the top of the A-tree is available for adding prosodic feature specifications relevant to the entire sentence. The A-tree system of representation is very flexible, and a different A-tree could be used if it were decided to group the elements of the sentence differently. At the bottom of Figure 9 is a simplified version of the A-tree, which is used throughout the rest of this paper to make the trees easier to read. But it should be noted that the computer implementation uses the trees in their full form.</Paragraph>
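    <Paragraph> The relation between subjunction direction and pitch shift can be sketched in a few lines. This is an illustration under stated assumptions, not the implemented algorithm: the tuple encoding of the simplified A-tree, the unit step size, the rule that a shift applies to the right operand, and the direction assigned to both subjunctions are all hypothetical.

```python
STEP = 1  # hypothetical size of one pitch shift, in abstract level units


def assign_levels(node, level=0, out=None):
    """Walk a simplified A-tree and give each leaf a relative pitch level.

    A leaf is a string. An internal node is a tuple (direction, left, right),
    where direction is "left" or "right" for the type of subjunction.
    """
    if out is None:
        out = []
    if isinstance(node, str):
        out.append((node, level))
        return out
    direction, left, right = node
    # Assumed rule: left subjunction shifts pitch down, right shifts it up.
    shift = -STEP if direction == "left" else STEP
    assign_levels(left, level, out)
    assign_levels(right, level + shift, out)
    return out


# Phrasing from the text; both subjunction directions are assumed.
a_tree = ("left", "John", ("left", "drove", "to the store"))
levels = assign_levels(a_tree)
```

Under these assumptions each successive unit sits one step lower than the one before it, giving a gradually falling contour for the neutral reading.</Paragraph>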
    <Paragraph position="7"> Having described the J-tree and A-tree for the neutral form of &quot;John drove to the store,&quot; we now consider how the trees differ for the four other versions shown in Figure 1. In versions b, c and d we stress &quot;John,&quot; &quot;drove&quot; and &quot;to the store&quot; respectively. This stress is the reflection of an implicit Frame II modifier in the J-tree (see Figure 10). For example, according to Junction theory, when the context is &quot;Who drove to the store?&quot;, &quot;John&quot; is implicitly modified by a right interjunction which indicates that John has been selected out of a set of possibilities. A possible explicit Frame II modifier would be: &quot;Of the persons who might have gone to the store, John drove to the store.&quot;</Paragraph>
    <Paragraph position="8"> At this point, it is worth discussing a very general relationship that has been observed between J-trees and English prosodic stress</Paragraph>
    <Paragraph position="10"> (1) In a full subjunction, any time a remainder is induced (i.e. by *- or -*) in an operand, the other operand receives a stress (e.g. two *- boys).</Paragraph>
    <Paragraph position="11"> (2) In an interjunction, any right interjunction causes a stress on the primary operand, and a left hyphen subjunction causes a stress on the V3 of the subordinate part of the interjunction to which the topic is joined as an enclitic.</Paragraph>
    <Paragraph position="12"> Figure 11. J-trees and English prosodic stress. In the case of the sentence at hand, the implicit Frame II modifier, being a right interjunction, causes the primary operand, that is, the element to which the Frame II feature is applied, to be stressed. Thus we have accounted for the three stressed versions of &quot;John drove to the store.&quot; The interrogative version (version 1e of Figure 1) has a [+ verify] feature on the top of the J-tree. That is, the listener is asking for verification of what was said. This feature is recoded as a prosodic [+ verify] feature in the A-tree. Figure 12 shows the A-trees for these five versions.</Paragraph>
    <Paragraph position="13"> Having covered this first example in detail, let us examine the two other sample sentences in a more abbreviated fashion. id John or Mary Come?&amp;quot; Figure 13 shows the J-tree and A-tree for each version of &amp;quot;Did John or Mary come?&amp;quot;. As seen in these figures, the semantico-syntactic difference between the two versions is where the interrogative is placed, on the whole sentence or on the conjoined subject. The prosodic difference is that in version 2a, &amp;quot;John&amp;quot; and 'Wary&amp;quot; are stressed (stimulated by the interrogation on the OR junction), while  in version Zb, the A-tree is marked [unfinished] because of the [yesno interrogative] feature on the J-tree. A &amp;quot;finished&amp;quot; version would be &amp;quot;Did John or Mary come or not?&amp;quot;.</Paragraph>
    <Paragraph position="14"> &amp;quot;The'boys who study.&amp;quot; Figure 14 shows J-trees and At-trees for  the three versions of &amp;quot;The boys who study get good grades. The J-trees differ only in the type of subjunction #between &amp;quot;boys&amp;quot; and &amp;quot;who&amp;quot;. In the A-tree, &amp;quot;boys&amp;quot; or &amp;quot;who study&amp;quot; is stressed according to the type of subjunction in the J-tree, following the rule stated above. This concludes our discussion of how Junction Grammar handles the three samp,le sentences presented at the beginning of the section.</Paragraph>
    <Paragraph position="15"> C. Text Synthesis Model We now consider a fully-developed Junction Grammar text synthesis system (Figure 15). This system incorporates the Junction Grammar model of translation so that the input text might be in Spanish and the output in English. In this full system, J-trees adjusted (transferred) for the target language would be needed as well as fully specified A-trees. The A-trees would include the internal structure of the V3 nodes, and the information in the A-tree would be converted into parameters that drive a functional analog of the vocal cords and tract. Clearly, putting together such a system would be a very ambitious project. A restricted version. At present, we have implemented only a restricted version of the full system, illustrated in Figure 16. In this system we have isolated the pitch contour from other control parameters. Thus, we have chosen to work with an entire sentence as a unit.</Paragraph>
    <Paragraph position="17"> Essentially, we LPC-analyze the spoken input sentence, enter a J-tree for the sentence, recode the J-tree as an A-tree, generate a pitch contour from the A-tree, replace the natural pitch contour with the generated one, and LPC-synthesize to produce a spoken output sentence.</Paragraph>
    <Paragraph position="18"> II. METHOD The model described in Section I provides a representation for the semantico-syntactic information underlying prosodic contrasts and a very flexible framework for representing phrasing and prosodic features at the general articulatory level. But we have not yet specified how a J-tree is recoded as an A-tree or how the pitch contour is actually obtained from the A-tree. This chapter will describe the computer algorithms that have been implemented to perform these two conversions. Of course, they should not be taken as any kind of final statement concerning the task, as they are under continuing development.</Paragraph>
    <Paragraph position="19"> A. Recoding a J-tree as an A-tree The general form of the A-tree is obtained by traversing the J-tree according to the language specific order stored in the J-tree. At each node the algorithm decides whether or not to declare a phrase, thus allowing nested phrases. The criteria for declaring a phrase are:  (1) The topmost node of the J-tree defines a phrase.</Paragraph>
    <Paragraph position="20"> (2) If the predicate consists of more than a single verb and a single object, the verb and object will be made into a phrase which will then be joined to the subject.</Paragraph>
    <Paragraph position="21"> (3) The contents of each subordinate tree of the J-tree (which is a forest of trees) are phrased under the dominating tree.</Paragraph>
    <Paragraph position="22"> (4) Each operand of a conjunction forms a phrase.</Paragraph>
    <Paragraph position="23"> The assignment of prosodic features to the A-tree (i.e. [+ stress], [+ unfinished phrase], and [+ verify contour]) is fairly straightforward. The criteria for assigning [+ stress] to a node are: (1) A Frame II feature in the J-tree, (2) A left or right hyphen subjunction (indicating remainder), (3) The operands of an &amp;quot;OR&amp;quot; interrogative.</Paragraph>
    <Paragraph position="24"> The directionality of the subjunctions between n-constituents in the A-tree is left except in the following situations: (1) There is a right subjunction between the A-tree phrases from a simple verb and its complex object in the J-tree, (2) If a phrase is marked [+ stress], the sub-phrases of the phrase are subordinated to it by adjusting the directionalities of the subjunctions.</Paragraph>
    <Paragraph position="25"> B. Background of the A-tree to Pitch Contour Algorithm With this overview of the J-tree to A-tree conversion algorithm, we describe an algorithm to obtain a pitch contour from an A-tree. The evolutionary phases in the development of this algorithm were:</Paragraph>
    <Paragraph position="26"> Plots. We plotted pitch and intensity against time for various readings of several sentences.</Paragraph>
    <Paragraph position="27"> Manual Contours. In order to determine which aspects of the pitch contour are essential to natural-sounding synthesis, we programmed a system to allow manual specification of the pitch contour with linear interpolation between specified points and to then permit listening comparison of synthesis outputs with natural versus manual contours.</Paragraph>
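    <Paragraph> As an illustration only, linear interpolation between manually specified points might be sketched in Python as follows. This is not the FORTRAN code used in the system. The function name, the (time in seconds, pitch in Hz) point format, and the 10 msec default frame step are assumptions made for the sketch, and at least two points with distinct times are required.

```python
from bisect import bisect_right

def interpolate_contour(points, frame_step=0.01):
    """Linearly interpolate (time_sec, pitch_hz) points onto a regular frame grid.

    Hypothetical helper: the names and the 10 msec default are assumptions.
    """
    points = sorted(points)
    times = [t for t, _ in points]
    start, end = times[0], times[-1]
    n_frames = int(round((end - start) / frame_step)) + 1
    contour = []
    for i in range(n_frames):
        t = min(start + i * frame_step, end)
        # Index of the segment whose left endpoint is at or before t,
        # limited so that the final point reuses the last segment.
        k = min(max(bisect_right(times, t) - 1, 0), len(points) - 2)
        (t0, p0), (t1, p1) = points[k], points[k + 1]
        frac = (t - t0) / (t1 - t0)
        contour.append(p0 + frac * (p1 - p0))
    return contour
```

    Calling the function with three points and a 50 msec step yields one pitch value per frame, rising to the middle point and falling to the last.</Paragraph>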
    <Paragraph position="28"> First Algorithm. Based on these initial experiments, we programmed a simple pitch contour algorithm that imposed on each phrase a contour selected from a fixed inventory of contours and algebraically added in a pitch &amp;quot;bubble&amp;quot; to the syllable of a prosodically stressed V3. In this initial system we were able to create multiple readings of sentences like &amp;quot;John drove to the store&amp;quot; from a single set of LPC analysis parameters, varying only the pitch contour. In other words, we concluded that although the perceptual phenomenon called prosodic or suprasegmental stress is well-known to be based on several acoustic parameters, including pitch (i.e. fundamental frequency), intensity and duration, in at least some cases changing only the pitch contour is sufficient to cause a word to be perceived as stressed or not stressed. However, after considerable theoretical discussions, we decided to abandon the approach of using a fixed inventory of prototype contours and try a more dynamic approach, which we will now describe.</Paragraph>
    <Paragraph position="29"> C. Current A-tree to Pitch Contour Algorithm Given an A-tree and an option code to indicate initial and final values and bounds on parameters, the algorithm assigns an initial and final pitch based on the option code. Then the A-tree is traversed in left-right order. Upon encountering each V3, we assign a pitch to the core of its nuclear syllable as follows: (1) The first V3 receives the initial pitch of the sentence.</Paragraph>
    <Paragraph position="30"> (2) A left subjunction causes a ratio decrement (about 0.90) to the last assigned pitch.</Paragraph>
    <Paragraph position="31"> (3) A right subjunction causes a ratio increment (about 1.12) in relation to the last assigned pitch.</Paragraph>
    <Paragraph position="32"> (4) A conjunction causes no change to date, but further research is needed.</Paragraph>
    <Paragraph position="33"> (5) An n-constituent dominating multiple V3's receives the average of the most recently assigned pitch level and the highest pitch assigned to any of its operands.</Paragraph>
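    <Paragraph> A minimal sketch of rules (1) through (4), written in Python purely for illustration, is given here. It is not the implemented algorithm: the A-tree is flattened into a hypothetical list of junction labels (one per V3 after the first), rule (5) is omitted, and clamping each value to absolute limits of 60 Hz and 200 Hz is an assumption about how the bounds supplied by the option code might be applied.

```python
LEFT_RATIO = 0.90   # approximate decrement quoted in rule (2)
RIGHT_RATIO = 1.12  # approximate increment quoted in rule (3)

def assign_nuclear_pitches(junctions, initial_pitch, low=60.0, high=200.0):
    """Return one pitch (Hz) per V3.

    junctions: hypothetical flat encoding, one of "left", "right" or "conj"
    for each V3 after the first, naming the junction that links it to the
    preceding material. low/high are assumed absolute pitch limits.
    """
    ratios = {"left": LEFT_RATIO, "right": RIGHT_RATIO, "conj": 1.0}
    pitch = min(max(initial_pitch, low), high)
    pitches = [pitch]                        # rule (1): first V3 gets the initial pitch
    for kind in junctions:
        pitch = pitch * ratios[kind]         # rules (2)-(4): scale the last assigned pitch
        pitch = min(max(pitch, low), high)   # assumed handling of the pitch limits
        pitches.append(pitch)
    return pitches
```

    Because every level is computed from the previous one, no fixed inventory of pitch levels is needed; only the starting pitch, the two ratios and the limits are supplied.</Paragraph>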
    <Paragraph position="34"> Then the contours between nuclear syllables are defined as valleys whose depth increases with the distance in time between the nuclear syllables they join. After the initial contour is defined, two types of contour adjustments are added: (1) Adjustments in the pitch contour caused by stop consonants. We call these stop discontinuities because when the speech waveform becomes voiced again after a stop, the pitch is significantly higher than when the stop began but soon settles down to a value which would be predicted by smooth interpolation of the pitch contour over the unvoiced segment.</Paragraph>
    <Paragraph position="35"> (2) The pitch &amp;quot;bubble&amp;quot; associated with a stressed V3. Although the above algorithm is not complete, it works reasonably well and does have one already mentioned aspect which we repeat here for emphasis: The pitch contour is generated from the A-tree in a completely dynamic manner. That is, there is no fixed inventory of pitch levels or phrase contours. Each new pitch level is assigned relative to previous values assigned and in accordance with preassigned absolute pitch limits (e.g. 60 Hz and 200 Hz) and the overall structure of the A-tree. This means that, although we have so far restricted ourselves to carefully spoken speech, this system may have the flexibility to eventually allow synthesis of varying speech rates, i.e. very slow and careful or very fast and sloppy speech, by appropriate option codes in the J-tree to A-tree algorithm and the A-tree to pitch contour algorithm. D. Sample Pitch Contours To conclude this chapter we present some graphs of pitch contours for the sentence &amp;quot;The boys who study get good grades.&amp;quot; Figure 17 shows a natural, a rule-generated and a manual pitch contour for sentence 3b (&amp;quot;The boys who study get good grades&amp;quot;). Figure 18 shows a natural and a rule-generated pitch contour for sentence 3c (&amp;quot;The boys who study get good grades&amp;quot;). Note that these two contours are imposed on the same set of LPC analysis parameters to produce the two readings. Figure 18 also shows a rule-generated and a natural contour for &amp;quot;The cat that the dog chased got away.&amp;quot;</Paragraph>
    <Paragraph position="36"> III. EVALUATION AND DISCUSSION We produced a demonstration tape of LPC synthesized speech using natural, monotone, and rule-generated pitch contours. Figure 19 shows the contents of the tape.</Paragraph>
    <Paragraph position="37"> Various subjects said that although the sentences with rule-generated pitch contours did not sound as natural as the natural versions, they could clearly perceive the same distinctions in the rule versions as were made in the natural versions. Thus we established two criteria of evaluation: naturalness of intonation, and &amp;quot;intelligibility&amp;quot; of intonation, by which we mean a human listener can correctly perceive which reading of a multiple-reading sentence was intended.</Paragraph>
    <Paragraph position="38"> A. Format of the Test In order to obtain a quantitative evaluation of the system, we devised the following four-part test, which was presented to 17 subjects. The sentences in the test consisted of 35 versions made from a dozen sets of LPC analysis parameters by imposing various natural, manual, monotone, and rule-generated pitch contours on them.</Paragraph>
    <Paragraph position="40"> In the first part, listeners were asked to rate readings of 34 sentences on a scale from 1 to 5, where &amp;quot;1&amp;quot; meant the intonation sounded mechanical or monotone, and &amp;quot;5&amp;quot; meant the intonation sounded natural. In the second part, listeners were presented with 24 sentence pairs and asked to indicate whether the first or second sentence sounded more natural.</Paragraph>
    <Paragraph position="41"> The third and fourth parts of the test dealt with intelligibility of intonation. In both of these parts, the subjects heard a sentence and indicated which of several possible readings the intonation was intended to convey.</Paragraph>
    <Paragraph position="42"> The only difference between these last two parts was the method of designating the different readings. In the third part, the readings were designated by underlining and using a period or question mark at the end. In the fourth part, the readings were designated by an indication of a typical context for that reading. (Appendix D contains additional details of the test and the results.)</Paragraph>
    <Paragraph position="43"> Figure 19. Contents of the demonstration tape: 1. John drove to the store. 2. John drove to the store. (monotone) 3. John drove to the store. 4. John drove to the store. 5. John drove to the store. 6. Did John or Mary come? 7. Did John or Mary come? 8. Did John or Mary come? (monotone) 9. Did John or Mary come? 10. Did John or Mary come? 11. The boys who study get good grades. 12. The boys who study get good grades. 13. The boys who study get good grades. (monotone) 14. The boys who study get good grades. 15. The boys who study get good grades. 16. They are eating apples. 17. They are eating apples. 18. They are eating apples. 19. They are eating apples. 20. I have one. 21. I have one. 22. I have one. 23. I have one. 24. The cat that the dog chased got away. 25. The cat that the dog chased got away. 26. John buys rice? 27. John buys rice?</Paragraph>
    <Paragraph position="58"> B. Test Results Table 1 gives the results of the first part, where sentences were rated on a scale from 1 (mechanical) to 5 (natural). Natural pitch contours received the highest score as expected, followed by manual contours based on the natural contour, rule-generated contours and monotone contours. A paired t-test applied to the average scores for natural and rule contours for each listener showed a statistically significant overall preference for natural contours.</Paragraph>
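    <Paragraph> For readers unfamiliar with the paired t-test, the statistic can be sketched in Python as follows. The function name is an assumption, and the three pairs of scores are hypothetical values chosen only to show the arithmetic; they are not the per-listener averages behind Table 1.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples (one pair of average scores per listener)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("each listener needs a score under both conditions")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    return mean(diffs) / (stdev(diffs) / sqrt(n))

# Hypothetical per-listener averages, NOT the data from the test.
natural = [4.0, 5.0, 3.0]
rule = [2.0, 4.0, 2.0]
t = paired_t_statistic(natural, rule)
```

    The resulting value is compared against a t table with n - 1 degrees of freedom, which would be 16 for the 17 subjects of the test.</Paragraph>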
    <Paragraph position="59"> In part 2, in a balanced subset of 12 paired comparisons where natural, manual and rule versions were paired in all possible ways, the natural contours received 87 votes, the manual ones received 76 and the rule contours received 41. Several subjects mentioned after the test that the natural, hand and rule versions of the second sentence (&amp;quot;The cat that the dog chased got away&amp;quot;) were indistinguishable in naturalness of intonation.</Paragraph>
    <Paragraph position="60"> Using a non-parametric sign test technique, we postulated that if there were a significant preference for one pitch contour method over another, the listeners would be consistent in their choice, regardless of the order of presentation. Specifically, if four or fewer subjects out of 17 changed their minds, we can conclude a preference for a given pair and its reverse.</Paragraph>
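    <Paragraph> One way to see why a cutoff of four out of 17 is reasonable is to compute the binomial tail probability under the assumption that each subject would be equally likely to switch or not if there were no real preference. That null model, and the helper name below, are assumptions of this sketch rather than details stated in the test description.

```python
from math import comb

def prob_at_most(k, n, p=0.5):
    """Binomial probability of k or fewer successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Chance that 4 or fewer of 17 listeners switch if each switch were a coin flip.
p_value = prob_at_most(4, 17)
```

    Under that assumption the probability is 3214/131072, or about 0.025, so four or fewer changes would be unlikely to arise by chance alone.</Paragraph>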
    <Paragraph position="61"> Using this criterion, we found that for the first sentence, the natural version was significantly preferred but for the second sentence, there was no clear preference for the natural over the rule version. In parts 3 and 4, we tested for &amp;quot;intelligibility&amp;quot; of intonation by presenting sentences and asking which of several possible readings was intended. We evaluated the results of this part by preparing confusion matrices (Figure 20). Each one deals with readings of a single sentence, showing reading transmitted and pitch contour method (N=natural, R=rule) compared to reading received by the listeners. All readings are listed in Appendix D.</Paragraph>
    <Paragraph position="62"> A simple Chi-Square test shows that for a given row of one of these confusion matrices, 24 correct votes out of 33 or 34 are sufficient to show significance at the .05 level. Results for part 4 were similar. C. Transmission Problems Some of the sentences were not well transmitted by the above definition. A consideration of these indicates the kinds of problems that arose. For example, since the first word of any normal declarative sentence receives some extra stress, the listeners had difficulty distinguishing &amp;quot;John drove to the store&amp;quot; from &amp;quot;John drove to the store.&amp;quot; Another problem sentence was &amp;quot;Did John or Mary come?&amp;quot; Although the two rule versions were clearly distinguishable (one with falling and one with rising terminal intonation), the listeners made many incorrect choices. This may have been due to either of the following two factors: (1) As with the other sentences, all the rule versions were based on a single set of analysis parameters, and duration was held constant. In this sentence, duration plays a greater role than in others, and this may have influenced judgment.</Paragraph>
    <Paragraph position="63"> (2) There may have been some confusion about what the versions meant, and there may have been confusion with a possible third reading in which &amp;quot;John&amp;quot; and &amp;quot;Mary&amp;quot; are stressed and yet the intonation is rising at the end.</Paragraph>
    <Section position="1" start_page="69" end_page="69" type="sub_section">
      <SectionTitle>
D. Termination Problems
</SectionTitle>
      <Paragraph position="0"> Another problem mentioned by several subjects was that the intonation on some versions (rule and hand versions only) was natural up until the very end of the sentence. We have determined that this is a problem in shaping the contour from the last nuclear syllable to the final pitch of the sentence, assigning an appropriate final pitch, and determining the interaction between the pitch of the last nuclear syllable and the sentence final pitch. Further research is needed in this area.</Paragraph>
      <Paragraph position="1"> E. Discussion This paper is the report of an attempt to generate pitch contours in speech synthesis using Junction Grammar as a theoretical base. Since the various readings of each sentence were made by imposing different pitch contours on the same analysis parameters without changing durations, some versions were less than natural. However, this was to be expected and we feel that it was even desirable in that it pointed out some specific cases in which duration adjustments are necessary.</Paragraph>
      <Paragraph position="2"> The evaluation also pointed out the need for further research on the shaping of the contour from the last V3 to the end of the sentence.</Paragraph>
      <Paragraph position="3"> We also realize the need to incorporate some refinements into the system in order to (1) make degrees of adjustment for fricatives and stops, (2) improve the naturalness of the contours between nuclear syllables, (3) make adjustments for the inherent pitch of vowels (Flanagan and Landgraf, 1968).</Paragraph>
      <Paragraph position="4"> Based on the results of the evaluation test, we feel it is appropriate to continue use of the Junction Grammar framework and to attempt to develop a word concatenation version with duration, pause and intensity calculations, to attempt better shaping of the contour after the last nuclear syllable, and to examine many more sentence types in order to further test the adequacy of this framework for dealing with the problem of generating prosodic control parameters in speech synthesis.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="69" end_page="69" type="metho">
    <SectionTitle>
ACKNOWLEDGMENTS
</SectionTitle>
    <Paragraph position="0"> The author would like to express deep appreciation to co-author Eldon Lytle for many theoretical discussions, to W.J. Strong for sharing his acoustical expertise and LPC analysis-synthesis programs, and to Ronald Millett for his excellent suggestions in our innumerable discussions during this research and for doing the FORTRAN coding of the J-tree input, display mechanism, the conversion algorithms from J-tree to A-tree, and from A-tree to pitch contour.</Paragraph>
  </Section>
  <Section position="9" start_page="69" end_page="69" type="metho">
    <SectionTitle>
APPENDICES
APPENDIX A
BACKGROUND READING
</SectionTitle>
    <Paragraph position="0"> If the reader desires further background in acoustic speech processing and/or Junction Grammar, the following sources may be helpful.</Paragraph>
  </Section>
  <Section position="10" start_page="69" end_page="69" type="metho">
    <SectionTitle>
ACOUSTIC SPEECH PROCESSING:
</SectionTitle>
    <Paragraph position="0"> (1) The Speech Chain, P.B. Denes and E.N. Pinson, (Garden City, N.Y.: Doubleday Anchor Books, 1973) (an excellent non-technical overview) (2) Speech Analysis Synthesis and Perception, J.L. Flanagan, (New York: Springer-Verlag, 1972) (a thorough technical presentation) (3) Speech Synthesis, edited by J. Flanagan and L. Rabiner, (Stroudsburg, Penn.: Dowden, Hutchinson and Ross, 1973) (a collection of key historical and current professional articles) JUNCTION GRAMMAR: (1) A Grammar of Subordinate Structures in English, (Lytle, 1974) (A description of Junction Grammar. The concepts discussed are still valid in Junction Grammar theory but the notation has changed significantly) (2) AJCL microfiche #26, &amp;quot;JG as a Base for Natural Language Processing.&amp;quot;</Paragraph>
    <Paragraph position="1"> (The first chapter is a good introduction to JG but does not go into much detail) (3) BYU Linguistics class textbooks. There are several Linguistics classes at BYU in Junction Grammar. Ling 426 is an introductory course and Ling 501 is an intermediate class. The textbooks are still in development and have not yet been published but if the reader would like more detail than is available in the first two sources, he can write the BYU Linguistics department for copies of class handouts for Ling 426 and Ling 501. The 501 textbook is the only available source on specialized subjunction. (4) &amp;quot;Junction Theory as a Base for Dynamic Phonological Representation.&amp;quot; BYU Linguistics Symposium, March 1976. (This is the only available document on the A-tree extension of JG. It is reprinted at the end of this microfiche, for the convenience of the reader.)  A syllable. See Lytle (1976) for a more precise definition.</Paragraph>
  </Section>
  <Section position="11" start_page="69" end_page="69" type="metho">
    <SectionTitle>
APPENDIX C
COMPUTER IMPLEMENTATION
</SectionTitle>
    <Paragraph position="0"> The pitch contour generation system described in this paper has been implemented on a PDP-15 computer, equipped with a variety of peripheral devices configured as shown in Figure 21. The VT-15 allows the user to call a package of subroutines from FORTRAN to plot points or draw lines or characters. The system uses the DEC supplied DOS-15 operating system.</Paragraph>
    <Paragraph position="1"> The PDP-15 is equipped with 32 K 18-bit words. This is not enough memory for our main pitch contour generation program so we use the DOS-15 CHAIN AND EXECUTE facility to overlay programs that need not be core resident simultaneously.</Paragraph>
    <Paragraph position="2"> As indicated in Figure 21, there are two disk drives on the system. One is a standard DOS-15 system pack for system programs and user files. The other drive is mainly for speech data. Data on packs mounted on this drive is accessed through special assembler subroutines that are not part of the DOS-15 operating system. This allows the user to store data contiguously at a higher transfer rate than possible using standard DOS-15 files. This is especially important in transferring large amounts of data from the A/D to disk or from the disk to the D/A in real time. Thus the system can deal with longer segments of speech than can be stored in in-core buffers at one time.</Paragraph>
    <Paragraph position="3"> In order to describe the pitch contour system, we will describe the major data files and off-line support programs the system requires. For each sentence to be processed, the system needs (1) an entry in a speech directory file (SPCDTR) which indicates the address and length on the speech data disk of the LPC analysis parameters, (2) an identification of the J-tree files for the various readings of the sentence, and (3) a J-tree file for each reading. The J-tree contains keys to obtain lexical information about each word from a master lexicon file.</Paragraph>
    <Paragraph position="4"> In order to prepare a sentence for processing, it is tape recorded, then digitized at a 10 kHz sampling rate using a program called DIGTXZ. Then it is LPC analyzed and optionally examined on the graphics display, using a program called ANAPLT. The &amp;quot;PLT&amp;quot; at the end of the name refers to the fact that this program will also produce a hard copy plot of the pitch contour if desired.</Paragraph>
    <Paragraph position="5"> The pitch contour generation program is called JTSPCH (&amp;quot;J-Tree to speech&amp;quot;). When this program is executed, it presents a list of available sentences and asks the user to indicate which reading to use in this case. Then the program reads the J-tree file and creates a J-tree in postfix notation. The program then optionally displays the J-tree on the graphics unit, depending on the status of the console sense switches. Then the J-tree is converted to an A-tree, which again is optionally displayed. Then a pitch contour is generated from the A-tree and displayed. Finally, the pitch contour is combined with the LPC analysis parameters retrieved from disk (gain factor, voiced/unvoiced decision and 12 linear predictor coefficients per 10 msec of speech waveform) and the combined parameters are used to synthesize a speech waveform which is stored on a temporary disk area and repeatedly played through the D/A converter to a loudspeaker or headphones for evaluation. If desired, the user can then save it permanently on disk. Another processing option is to create a manual pitch contour instead of generating it from an A-tree. The manual contour can be entered either by drawing it on the graphics unit with the light pen or by entering a list of time and pitch coordinates on the teletype to a subroutine that interpolates linearly between them. Of course, the sentence can also be synthesized using the natural pitch contour retrieved from the original analysis data.</Paragraph>
    <Paragraph position="6"> After saving several synthesized sentences, one can listen to a list of sentences with any desired pause between them using a multiple listening program called MULTIL. MULTIL can receive its control input from either the teletype or from a data file. This option allowed us to create a control file with the regular editing facilities of the operating system and then instruct MULTIL to read it, creating the evaluation test tape in one continuous recording session without any tape splicing.</Paragraph>
  </Section>
  <Section position="12" start_page="69" end_page="69" type="metho">
    <SectionTitle>
APPENDIX D
MORE DETAILS ON THE EVALUATION
</SectionTitle>
    <Paragraph position="0"> This appendix contains the following information: an edited version of the evaluation response form given to the subjects and then four tables showing all responses. Note that the parts of the response form are numbered IA, IB, IIA and IIB. This edited response form shows which versions were used throughout the test but does not contain certain unnecessary details present in the actual response form used. Each version is identified by a code consisting of a number (1-8), a letter (a-e), a letter (N, R, M or H) and possibly another number (1-4).</Paragraph>
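    <Paragraph> The code format just described is regular enough to be checked mechanically. The following Python sketch splits a code such as 2aH1 into its parts; the function name and the field names are hypothetical and were chosen only for this illustration.

```python
import re

# number (1-8), reading letter (a-e), method letter (N, R, M or H), optional number (1-4)
CODE_PATTERN = re.compile(r"([1-8])([a-e])([NRMH])([1-4])?")

def parse_version_code(code):
    """Split a version code such as '2aH1' into its fields, or raise ValueError."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError("not a valid version code: " + repr(code))
    sentence, reading, method, variant = match.groups()
    return {
        "sentence": int(sentence),   # hypothetical field names
        "reading": reading,
        "method": method,
        "variant": int(variant) if variant else None,
    }
```

    A code with no trailing number, such as 7aR, is returned with the last field set to None.</Paragraph>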
    <Paragraph position="1"> The first two characters identify the sentence and reading, as follows: (1) a. John drove to the store.</Paragraph>
    <Paragraph position="2"> b. John drove to the store.</Paragraph>
    <Paragraph position="3"> c. John drove to the store.</Paragraph>
    <Paragraph position="4"> d. John drove to the store.</Paragraph>
    <Paragraph position="5"> e. John drove to the store? (2) a. Did John or Mary come? (falling at end) b. Did John or Mary come? (rising at end)</Paragraph>
    <Paragraph position="6"> (3) a. The boys who study get good grades.</Paragraph>
    <Paragraph position="7"> b. The boys who study get good grades.</Paragraph>
    <Paragraph position="8"> c. The boys who study get good grades.</Paragraph>
    <Paragraph position="9"> (4) a. They are eating apples.</Paragraph>
    <Paragraph position="10"> b. They are eating apples.</Paragraph>
    <Paragraph position="11"> (5) a. I have one.</Paragraph>
    <Paragraph position="12"> b. I have one.</Paragraph>
    <Paragraph position="13"> (6) a. John, Joe and Fred buy rice.</Paragraph>
    <Paragraph position="14"> (7) a. The cat that the dog chased got away. (8) a. John buys rice.</Paragraph>
    <Paragraph position="15"> b. John buys rice.</Paragraph>
    <Paragraph position="16"> c. John buys rice.</Paragraph>
    <Paragraph position="17"> d. John buys rice.</Paragraph>
    <Paragraph position="18"> e. John buys rice?  A. Below are two lists of the same 34 sentences. You will hear the first list with a $ second pause after each sentence. Just listen and don't write anything. Then 10 seconds later, you will hear the second list with a 3 second pause after each sentence. This time, during the pauses, rate each sentence by writing down a number after it. The rating scale is 1 to 5. Remember that the evaluation criterion is intonation only.</Paragraph>
    <Paragraph position="19"> So please do not let your judgements be influenced by crackles or pops or hisses.</Paragraph>
    <Paragraph position="20"> A rating of 1 means the intonation sounded mechanical or unnatural, for example, monotone or the way computers talk in cartoons. A rating of 5 means the intonation sounded natural, that is, you can imagine the sentence was produced by a human speaker speaking carefully. Please try to distribute your scores over the entire range from 1 to 5. Before you begin, please read over the entire test to become familiar with it, because you will have only a few seconds to respond to each question.</Paragraph>
    <Paragraph position="21"> The test will last 17 minutes.</Paragraph>
    <Paragraph position="22"> (The following four pages are an edited, abbreviated form of the rest of the response sheets. The codes in parentheses were not on the actual response sheets.</Paragraph>
    <Paragraph position="23"> By consulting the key on the previous pages of this appendix, the reader can determine from the codes which version was used for each question.)  I A.</Paragraph>
    <Paragraph position="24"> 1. I have one.</Paragraph>
    <Paragraph position="25"> 2. The cat that the dog chased got away. 3. Did John or Mary come? etc.</Paragraph>
    <Paragraph position="26"> 33. The cat that the dog chased got away.</Paragraph>
    <Paragraph position="27"> 34. John drove to the store.</Paragraph>
    <Paragraph position="28"> SECOND TIME THROUGH: Rate each sentence (1)Mechanical to (5)Natural.</Paragraph>
    <Paragraph position="29"> 1. I have one. (5b~) 2. The cat that the dog chased got away. (7aR) 3. Did John or Mary come? (2bR) The rest of part IA will be shown in abbreviated form.</Paragraph>
    <Paragraph position="30"> 34. John drove to the store. (law Pair number; 1st sounded more natural; 2nd sounded more natural.</Paragraph>
    <Paragraph position="31"> 1. Did John or Mary come? (2aN) (2a~) 2. Did John or Mary come? (2aH1) (2aH2) 3. Did John or Mary come? (2aR) (2aN)  1. John buys rice (8dR) a. John buys rice.</Paragraph>
    <Paragraph position="32"> b. John buys rice.</Paragraph>
    <Paragraph position="33"> c. John buys rice.</Paragraph>
    <Paragraph position="34"> d. John buys rice.</Paragraph>
    <Paragraph position="35"> e. John buys rice? 2. Did John or Mary come (2aN) a. Did John or Mary come? b. Did John or Mary come? The rest of part IIA will be shown in abbreviated form.</Paragraph>
    <Paragraph position="36"> I1 B.</Paragraph>
    <Paragraph position="37"> 1. They are eating apples (4a~)I a. They are in the process of eating apples. b. These apples are a Variegy good for eating as opposed to baking.</Paragraph>
    <Paragraph position="38"> 2. They boys who study get good grades (36~) a. Neutral b. But the boys who play around get bad grades. c. But the girls who study don' t get good grades. 3. Did John or Mary come (2aR) a. S-omebody came. Was it John or was it Mary? b. Several people came. Did the group include John or 'Mary? 4. John drove to the store (IbR) a. In response to: &amp;quot;What h~ppened?&amp;quot; b. In resp~nse to: &amp;quot;Who drove to the store?&amp;quot; c. In response to : &amp;quot;How did John get to the store?&amp;quot; d. In response to: &amp;quot;Where did John drive?&amp;quot; e. To ask for verification of what was said. X have one (5bN) a. But YOU have t-hree.</Paragraph>
    <Paragraph position="39"> b. But you don't.</Paragraph>
    <Paragraph position="40"> John drove to the store (1cR) a. In response to: "What happened?" b. In response to: "Who drove to the store?" c. In response to: "How did John get to the store?" d. In response to: "Where did John drive?" e. To ask for verification of what was said. Did John or Mary come (2bN) a. Somebody came. Was it John or was it Mary? b. Several people came. Did the group include John or Mary?  They are eating apples (4bN) a. They are in the process of eating apples. b. These apples are a variety good for eating as opposed to baking.</Paragraph>
    <Paragraph position="41"> The boys who study get good grades. (3cR) a. Neutral b. But the boys who play around get bad grades. c. But the girls who study don't get good grades. I have one (5aR) a. But you have three.</Paragraph>
    <Paragraph position="42"> b. But you don't.</Paragraph>
    <Paragraph position="43"> Table D-1. The responses for part IA. Each row gives the responses of subjects 1 through 17 to a particular question. A zero response means the subject left that question blank.</Paragraph>
  </Section>
  <Section position="13" start_page="69" end_page="69" type="metho">
    <SectionTitle>
JUNCTION THEORY AS A BASE
FOR
DYNAMIC PHONOLOGICAL REPRESENTATION
</SectionTitle>
    <Paragraph position="0"> Orientation. MacNeilage has pointed up the difficulty of mediating between abstract unitary phonological representations and the continuous nature of the dynamic speech chain, suggesting that unitary phonological representations are analogous to a sequence of eggs conveyed to the wringer of a washing machine, while the scrambled mess that emerges from the wringer is what must actually be dealt with by those engaged in computer analysis and synthesis of voice. The question, as he states it, is: Given that there is a discrete linguistic input to the mechanism of speech production at some state, and given that the mechanism that transmits this input is incapable of discrete units of output, what is the nature of the transformation, at the peripheral stage, of one form to the other?</Paragraph>
    <Paragraph position="1">  Lieberman likewise notes a relative neglect of the phonetic level of speech, concluding that a quantitative and explicit phonetic theory has yet to be developed, and suggesting that a successful attempt to construct such a theory should be structured in terms of the anatomic, physiologic, and neural mechanisms of speech production and perception. Onn, similarly motivated by the notion that speech ought to be described in the context of the organic mechanisms responsible for it, suggests that: It may be argued that an abstract representation may be regarded as instructions for particular types of behavior of the speech-generating mechanism. When these instructions are carried out, the various reactions occurring between afferent physiological structures will yield a quasi-continuous gesture in which the discrete instructions initiating the gesture are no longer always observable as distinct components. Finally, the execution of these instructions produces the acoustic signal.</Paragraph>
    <Paragraph position="2"> The purpose of the present paper is to outline briefly a new system of phonological description currently being used as a basis for voice synthesis at BYU which attempts to satisfy the criteria suggested by MacNeilage, Lieberman, and Onn referenced above. The descriptive system in question is based on the Junction Grammar Model of language developed by myself and my colleagues over the past eight years.5 It is a model specifically structured in terms of speech-related organs, either as they are known or hypothesized. An Overview of the Junction Grammar Model. A fundamental tenet of junction theory is that linguistic description must involve not simply multiple stages of derivation, but multiple types of data and data processing required to simulate the functions of different body organs. (See Figure 1.) Thus, the semantic components of the grammar are designed to process data structured for specific semantic tracts, as it were; the articulatory component is designed to process data structured for the vocal tract; the audio component is designed to process data structured for the auditory tract; and so on.</Paragraph>
    <Paragraph position="3"> Of course, such a model requires distinct rule systems and procedures to operate on the different data types in the various tracts.</Paragraph>
    <Paragraph position="4"> Figure 3.</Paragraph>
    <Paragraph position="5"> A further tenet of junction theory is that data types may not be intermingled.</Paragraph>
    <Paragraph position="6"> To do so would, for example, be tantamount to feeding instructions for both the heart and diaphragm to the diaphragm. Of course, semantic instructions could not be executed by a vocal tract, nor could articulatory instructions be executed by a semantic tract. This means, in effect, that a "deep structure" is not transformed (in the usual sense of the word) into a surface structure, but rather that semantic data must be used to stimulate articulatory instructions, orthographic instructions, motor instructions required to produce gestures, to make one blush, etc. Thus, in JG semantic representations there are no lexical items, since these are considered to be articulatory instructions. Similarly, there is no semantic information in phonological representations, since these are a different data type. The various data types are considered to be symbolizations of each other, not transforms or derivations of each other. Data stimulation between the various tracts or components of the system is accomplished by context-sensitive coding/decoding procedures, which are intended to simulate the neural interfaces which coordinate the function of body organs involved in speech production.</Paragraph>
    <Paragraph position="7"> Junction Grammar takes its name from Junction Rules (J-rules). (See Figure 2.) J-rules structure data to be processed by the various components of the grammar. The essential ingredients of every J-rule are two or more operands, an operation specifying how the operands are to be joined, and a labelling operation which assigns a category to the operands taken as a unit. Thus, in junction grammar not only do rules for conjunction require an operation symbol (vis-a-vis the phrase structure rule S -> S &amp; S), but all J-rules,  A schematic of the model in its present form is given in Figure 3. Basic semantic data is presumed to reside in the form of an information net. Drawing upon information in the net, J-rules organize and structure information pragmatically, i.e. for use in specific utterances in specific discourse environments. Fillmore's arguments for semantic case relate specifically to the need to distinguish between basic semantic relations and pragmatically motivated grammatical relations. The semantic junction trees (J-trees) generated by J-rules then serve as the basis for coding up articulatory instructions, instructions to the arm and hand for writing, or motor instructions of sundry types necessary to produce body language.</Paragraph>
    <Paragraph position="8"> Incoming information, on the other hand, is decoded to obtain the pragmatic J-tree which stimulated it, and then each junction in the tree is executed by a semantic processor, resulting in additions to or changes in the information net.</Paragraph>
    <Paragraph position="9"> Junction trees occur in both semantic and articulatory data. However, the operands and operations are of a totally different nature from type to type, since in the semantic component they constitute complexes of instructions to be executed by the semantic processor, while in the articulatory component they constitute complexes of instructions to be executed by the vocal tract. The operands of semantic trees are sememes, i.e. units which define locations and states in the information net; the operands of articulation trees are articulemes, i.e. units which relate to locations and states of the vocal tract. Figures 4 and 5 are the semantic and articulation trees, respectively, for the utterance [~aysa iyt]. Notice, specifically, that while "Why did you" are not immediate semantic constituents, they are immediate articulatory constituents. The point again, of course, is that while articulatory structure and semantic structure are symbolically related, they are not the same and should not be confused or intermingled.</Paragraph>
    <Section position="1" start_page="69" end_page="69" type="sub_section">
      <SectionTitle>
Basic Junction Types
</SectionTitle>
      <Paragraph position="0"> Junction theory posits three basic junction operations and numerous subtypes depending upon the data type being described.</Paragraph>
      <Paragraph position="1"> (1) Adjunction results in the formation of certain nuclear units which serve as a skeleton to which other elements may attach. In semantic trees, predicates and predications are formed via adjunction.</Paragraph>
      <Paragraph position="2"> In articulation trees, semi-syllables and syllables are formed via adjunction. (2) Subjunction results in overlapping constituents of contrasting rank, i.e. where one is in some sense subordinate to the other.</Paragraph>
      <Paragraph position="3"> In semantic trees, modifiers in all their variety are subjoined.</Paragraph>
      <Paragraph position="4"> In articulation trees, clustered consonants are subjoined, as well as adjacent syllables having different degrees of stress. Segmental structures are also subjoined to prosodic constituents to account for the supra-segmental aspects of articulation.</Paragraph>
      <Paragraph position="5"> (3) Conjunction results in the formation of compounds consisting of units of the same category and rank. In semantic trees, compounds based on "and", "or", and "but" are formed via conjunction. In A-trees, conjunction yields evenly spaced non-overlapping units having the same degree of stress.</Paragraph>
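      <Paragraph> The ingredients of a J-rule described earlier (two or more operands, a joining operation, and a category label for the resulting unit) together with the three basic operations listed above can be sketched as a small data structure. This is only an illustrative sketch, not the authors' implementation: the class names, the abbreviations adj, sub, and conj, the bracket notation, and the operand names and labels in the example are all assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple, Union


class Operation(Enum):
    """The three basic junction operations named in the text."""
    ADJUNCTION = "adj"
    SUBJUNCTION = "sub"
    CONJUNCTION = "conj"


@dataclass(frozen=True)
class Junction:
    """One J-rule application: operands, a joining operation, and a label."""
    operation: Operation
    operands: Tuple[Union[str, "Junction"], ...]
    label: str  # category assigned to the operands taken as a unit

    def __post_init__(self):
        # The text requires two or more operands for every J-rule.
        if len(self.operands) in (0, 1):
            raise ValueError("a J-rule needs two or more operands")


def bracket(node: Union[str, Junction]) -> str:
    """Render a tree as a labelled bracketing (ad hoc notation, assumed here)."""
    if isinstance(node, str):
        return node
    joiner = " " + node.operation.value + " "
    inner = joiner.join(bracket(op) for op in node.operands)
    return "[" + node.label + ": " + inner + "]"


# Hypothetical operand names and labels, for illustration only.
compound = Junction(Operation.CONJUNCTION, ("JOHN", "MARY"), "N")
predication = Junction(Operation.ADJUNCTION, (compound, "COME"), "PREDICATION")
print(bracket(predication))
```

      Because every node carries its own operation, the same structure could hold either a semantic tree or an articulation tree; only the operand vocabulary (sememes versus articulemes) would differ, which keeps the two data types separate as the theory requires.</Paragraph>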
      <Paragraph position="6"> Now, in the context of this rather general introduction to the subject, let us consider dynamic phonological representations corresponding to the articulatory structure of syllables, words, and phrases.</Paragraph>
      <Paragraph position="7"> The Syllable. The intuitive articulatory unit of which words consist is the syllable, which is in turn composed of phonemes. Generally speaking, syllables have as their nuclear component a continuous phoneme with vocalic properties. This nuclear phoneme may be delimited both initially and finally by a phoneme having consonantal properties. Hence, we observe syllables of the following string types:</Paragraph>
      <Paragraph position="9"> If, however, we invoke the concept of a null delimiter $, then these four syllable patterns can be reduced to a single type, DVD, where D may be either null or non-null. The use of the null delimiter $ is actually more than a simplifying assumption, since in many cases non-null segmentals replace $ in the articulation stream either as full geminates or partials of neighboring delimiters.</Paragraph>
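      <Paragraph> The reduction of the four syllable patterns to a single delimiter-nucleus-delimiter type can be illustrated with a short sketch. The list of string types is missing from this copy of the text, so the sketch assumes the four patterns are V, CV, VC, and CVC. The function names, the use of ordinary letters as stand-ins for phonemes, and the five-letter vowel set are likewise assumptions made only for illustration.

```python
NULL = "$"              # the null delimiter, written as in the text
VOWELS = set("aeiou")   # stand-in for phonemes with vocalic properties


def to_delimited(syllable: str) -> tuple:
    """Split a syllable into (initial delimiter, nucleus, final delimiter).

    A missing delimiter is filled with the null delimiter, so every
    syllable comes back in the same three-slot shape.
    """
    positions = [i for i, ch in enumerate(syllable) if ch in VOWELS]
    if len(positions) != 1:
        raise ValueError("this sketch expects exactly one vocalic nucleus")
    i = positions[0]
    initial = syllable[:i] or NULL
    final = syllable[i + 1:] or NULL
    return (initial, syllable[i], final)


def surface_pattern(syllable: str) -> str:
    """Recover the traditional C/V string type from the three-slot form."""
    initial, _, final = to_delimited(syllable)
    parts = ["C" if initial != NULL else "", "V", "C" if final != NULL else ""]
    return "".join(parts)


for s in ["a", "ta", "at", "tat"]:
    print(s, to_delimited(s), surface_pattern(s))
```

      Keeping the null delimiter as an explicit slot, rather than dropping it, leaves a position that a later step could fill with a geminate or a partial of a neighboring delimiter, which is the motivation the text gives for treating $ as more than a simplifying assumption.</Paragraph>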
      <Paragraph position="10"> Articulatory Adjunction. As noted above, junction theory attributes to adjunction those kernel configurations upon which all else is built up. Since syllables are the intuitive units from which words and phrases are formed, we attribute them to adjunction.</Paragraph>
      <Paragraph position="11"> There are two basic syllable types, corresponding to whether the syllabic nucleus is joined to the initial or final delimiter. The two cases are illustrated in Figure 6.</Paragraph>
    </Section>
  </Section>
</Paper>