<?xml version="1.0" standalone="yes"?> <Paper uid="P96-1039"> <Title>An Information Structural Approach to Spoken Language Generation</Title> <Section position="4" start_page="0" end_page="296" type="metho"> <SectionTitle> 2 Information Structure </SectionTitle> <Paragraph position="0"> Information Structure refers to the organization of information within an utterance. In particular, it defines how the information conveyed by a sentence is related to the knowledge of the interlocutors and the structure of their discourse. Sentences conveying the same propositional content in different contexts need not share the same information structure. That is, information structure refers to how the semantic content of an utterance is packaged, and amounts to instructions for updating the models of the discourse participants. The realization of information structure in a sentence, however, differs from language to language. In English, for example, intonation carries much of the burden of information structure, while languages with freer word order, such as Catalan (Engdahl and Vallduvi, 1994) and Turkish (Hoffman, 1995), convey information structure syntactically.</Paragraph> <Paragraph position="1"> 1 In this example, and throughout the remainder of the paper, the intonation contour is informally noted by placing prosodic phrases in parentheses and marking pitch accented words with capital letters. The tunes are more formally annotated with a variant of Pierrehumbert's (1980) notation described in (Prevost, 1995). Three different pause lengths are associated with boundaries in the modified notation. '(%)' marks intra-utterance boundaries with very little pausing, '%' marks intra-utterance boundaries associated with clauses demarcated by commas, and '$' marks utterance-final boundaries. For the purposes of generation and synthesis, these distinctions are crucial. (3) Q: I know the AMERICAN amplifier produces MUDDY treble,</Paragraph>
<Section position="1" start_page="294" end_page="295" type="sub_section"> <SectionTitle> 2.1 Information Structure and Intonation </SectionTitle> <Paragraph position="0"> The relationship between intonational structure and information structure is illustrated by (3) and (4). In each of these examples, the answer contains the same string of words but different intonational patterns and information structural representations. The theme of each utterance is considered to be represented by the material repeated from the question. That is, the theme of the answer is what links it to the question and defines what the utterance is about. The rheme of each utterance is considered to be represented by the material that is new or forms the core contribution of the utterance to the discourse. By mapping the rise-fall tune (H* LL%) onto rhemes and the rise-fall-rise tune (L+H* LH%) onto themes (Steedman, 1991; Prevost and Steedman, 1994), we can easily identify the string of words over which these two prominent tunes occur directly from the information structure. While this mapping is certainly overly simplistic, the results presented in Section 4.3 demonstrate its appropriateness for the class of simple declarative sentences under investigation.</Paragraph> <Paragraph position="1"> Knowing the strings of words to which these two tunes are to be assigned, however, does not provide enough information to determine the location of the pitch accents (H* and L+H*) within the tunes.</Paragraph> <Paragraph position="2"> Moreover, the simple mapping described above does not account for the frequently occurring cases in which thematic material bears no pitch accents and is consequently unmarked intonationally. Previous approaches to the problem of determining where to place accents have utilized heuristics based on &quot;givenness.&quot; That is, content-bearing words (e.g. 
nouns and verbs) which had not been previously mentioned (or whose roots had not been previously mentioned) were assigned accents, while function words were de-accented (Davis and Hirschberg, 1988; Hirschberg, 1990). While these heuristics account for a broad range of intonational possibilities, they fail to account for accentual patterns that serve to contrast entities or propositions that were previously &quot;given&quot; in the discourse. Consider, for example, the intonational pattern in (5), in which the pitch accent on amplifier in the response cannot be attributed to its being &quot;new&quot; to the discourse. (5) Q: Do critics prefer the BRITISH amplifier?</Paragraph> <Paragraph position="4"> A: They prefer the AMERICAN amplifier. H* LL$</Paragraph> <Paragraph position="5"> For the determination of pitch accent placement, we rely on a secondary tier of information structure which identifies focused properties within themes and rhemes. The theme-foci and the rheme-foci mark the information that differentiates properties or entities in the current utterance from properties or entities established in prior utterances. Consequently, the semantic material bearing &quot;new&quot; information is considered to be in focus. Furthermore, the focus may include semantic material that serves to contrast an entity or proposition with alternative entities or propositions already established in the discourse. While the types of pitch accents (H* or L+H*) are determined by the theme/rheme delineation and the aforementioned mapping onto tunes, the locations of pitch accents are determined by the assignment of foci within the theme and rheme, as illustrated in (3) and (4). 
Note that it is in precisely those cases where thematic material, which is &quot;given&quot; by default, does not contrast with any other previously established properties or entities that this material is intonationally unmarked, as in (6).</Paragraph> <Paragraph position="6"> (6) Q: Which amplifier does Scott PREFER?</Paragraph> </Section> <Section position="2" start_page="295" end_page="296" type="sub_section"> <SectionTitle> 2.2 Contrastive Focus Algorithm </SectionTitle> <Paragraph position="0"> The determination of contrastive focus, and consequently the determination of pitch accent locations, is based on the premise that each object in the knowledge base is associated with a set of alternatives from which it must be distinguished if reference is to succeed. The set of alternatives is determined by the hierarchical structure of the knowledge base. For the present implementation, only properties with the same parent or grandparent class are considered to be alternatives to one another.</Paragraph> <Paragraph position="1"> Given an entity x and a referring expression for x, the contrastive focus feature for its semantic representation is computed on the basis of the contrastive focus algorithm described in (7), (8) and (9). The data structures and notational conventions are given below.</Paragraph> <Paragraph position="2"> (7) DElist: a collection of discourse entities that have been evoked in prior discourse, ordered by recency. The list may be limited to some size k so that only the k most recent discourse entities pushed onto the list are retrievable.</Paragraph> <Paragraph position="3"> ASet(x): the set of alternatives for object x, i.e. 
those objects that belong to the same class as x, as defined in the knowledge base.</Paragraph> <Paragraph position="4"> RSet(x, S): the set of alternatives for object x as restricted by the referring expressions in DElist and the set of properties S.</Paragraph> <Paragraph position="5"> CSet(x, S): the subset of properties of S to be accented for contrastive purposes.</Paragraph> <Paragraph position="6"> Props(x): a list of properties for object x, ordered by the grammar so that nominal properties take precedence over adjectival properties.</Paragraph> <Paragraph position="7"> The algorithm, which assigns contrastive focus in both thematic and rhematic constituents, begins by isolating the discourse entities in the given constituent. For each such entity x, the structures defined above are initialized as follows: (8) Props(x) := [P | P(x) is true in KB]</Paragraph> <Paragraph position="9"> The algorithm appears in pseudo-code in (9). 2</Paragraph> <Paragraph position="11"> In other words, given an object x, a list of its properties and a set of alternatives, the set of alternatives is restricted by including in the initial RSet only x and those objects that are explicitly referenced in the prior discourse. Initially, the set of properties to be contrasted (CSet) is empty. Then, for each property of x in turn, the RSet is restricted to include only those objects satisfying the given property in the knowledge base. 
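The restriction procedure described in this paragraph can be sketched in a few lines of Python. This is only an illustrative reading of the prose, not the authors' Prolog code or the pseudo-code of (9): the function name contrastive_focus, the dictionary KB, the helper holds, and the two-amplifier knowledge base are all assumptions introduced for the example.

```python
# Hypothetical knowledge base: for each property, the set of objects that satisfy it.
KB = {
    "amplifier": {"e1", "e2"},
    "solid_state": {"e1"},
    "tube": {"e2"},
}


def holds(prop, obj):
    """Return True if prop(obj) is recorded as true in the toy KB."""
    return obj in KB.get(prop, set())


def contrastive_focus(x, props, aset, delist):
    """Return the properties of x to accent contrastively (the CSet).

    props  -- Props(x), nominal properties first
    aset   -- ASet(x), the alternatives of x in the knowledge base
    delist -- entities evoked in prior discourse
    """
    # Initial RSet: x plus the alternatives explicitly evoked so far.
    rset = {x} | {y for y in aset if y in delist}
    cset = []
    for prop in props:
        restricted = {y for y in rset if holds(prop, y)}
        # A property is contrastive only if it removes some alternative.
        if restricted != rset:
            cset.append(prop)
        rset = restricted
    return cset


print(contrastive_focus("e2", ["amplifier", "tube"], {"e1", "e2"}, ["e1"]))
```

Under these assumptions, with the solid-state amplifier e1 already evoked, only the property tube separates e2 from it, so the call returns a list containing just that property; with an empty discourse list no property is selected.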
If imposing this restriction on the RSet for a given property decreases the cardinality of the RSet, then the property serves to distinguish x from other salient alternatives evoked in the prior discourse, and is therefore added to the contrast set (CSet).</Paragraph> <Paragraph position="12"> Conversely, if imposing the restriction on the RSet for a given property does not change the RSet, the property is not necessary for distinguishing x from its alternatives, and is not added to the CSet.</Paragraph> <Paragraph position="13"> Based on this contrastive focus algorithm and the mapping between information structure and intonation described above, we can view information structure as the representational bridge between discourse and intonational variability. The following sections elucidate how such a formalism can be integrated into the computational task of generating spoken language.</Paragraph> </Section> </Section> <Section position="5" start_page="296" end_page="296" type="metho"> <SectionTitle> 3 Generation Architecture </SectionTitle> <Paragraph position="0"> The task of natural language generation (NLG) has often been divided into three stages: content planning, in which high-level goals are satisfied and discourse structure is determined; sentence planning, in which high-level abstract semantic representations are mapped onto representations that more fully constrain the possible sentential realizations (Rambow and Korelsky, 1992; Reiter and Mellish, 1992; Meteer, 1991); and surface generation, in which the high-level propositions are converted into sentences.</Paragraph> <Paragraph position="1"> The selection and organization of propositions and their divisions into theme and rheme are determined by the content planner, which maintains discourse coherence by stipulating that semantic information must be shared between consecutive utterances whenever possible. 
That is, the content planner ensures that the theme of an utterance links it to material in prior utterances.</Paragraph> <Paragraph position="2"> The process of determining foci within themes and rhemes can be divided into two tasks: determining which discourse entities or propositions are in focus, and determining how their linguistic realizations should be marked to convey that focus. The first of these tasks can be handled in the content planning phase of the NLG model described above. The second of these tasks, however, relies on information, such as the construction of referring expressions, that is often considered the domain of the sentence planning stage. For example, although two discourse entities e1 and e2 can be determined to stand in contrast to one another by appealing only to the discourse model and the salient pool of knowledge, the method of contrastively distinguishing between them by the placement of pitch accents cannot be resolved until the choice of referring expressions has been made.</Paragraph> <Paragraph position="3"> Since referring expressions are generally taken to be in the domain of the sentence planner (Dale and Haddock, 1991), the present approach resolves issues of contrastive focus assignment at the sentence planning stage as well.</Paragraph> <Paragraph position="4"> During the content generation phase, the content of the utterance is planned based on the previous discourse. While template-based systems (McKeown, 1985) have been widely used, rhetorical structure theory (RST) approaches (Mann and Thompson, 1986; Hovy, 1993), which organize texts by identifying rhetorical relations between clause-level propositions from a knowledge base, have recently flourished. 
Sibun (1991) offers yet another alternative in which propositions are linked to one another not by rhetorical relations or pre-planned templates, but rather by physical and spatial properties represented in the knowledge base.</Paragraph> <Paragraph position="5"> The present framework for organizing the content of a monologue is a hybrid of the template and RST approaches. The implementation, which is presented in the following section, produces descriptions of objects from a knowledge base with context-appropriate intonation that makes proper distinctions of contrast between alternative, salient discourse entities. Certain constraints, such as the requirement that objects be identified or defined at the beginning of a description, are reminiscent of McKeown's schemata. Rather than imposing strict rules on the order in which information is presented, the system determines the order from domain-specific knowledge, the communicative intentions of the speaker, and beliefs about the hearer's knowledge. Finally, the system includes a set of rhetorical constraints that may rearrange the order in which information is presented in order to make certain rhetorical relationships salient. While this approach has proven effective in the present implementation, further research is required to determine its usefulness for a broader range of discourse types.</Paragraph> </Section> <Section position="6" start_page="296" end_page="299" type="metho"> <SectionTitle> 4 The Prolog Implementation </SectionTitle> <Paragraph position="0"> The monologue generation program produces text and contextually appropriate intonation contours to describe an object from the knowledge base. 
The system exhibits the ability to intonationally contrast alternative entities and properties that have been explicitly evoked in the discourse even when they are separated by several intervening sentences.</Paragraph> <Section position="1" start_page="296" end_page="297" type="sub_section"> <SectionTitle> 4.1 Content Generation </SectionTitle> <Paragraph position="0"> The architecture for the monologue generation program is shown in Figure 1, in which arrows represent the computational flow and lines represent dependencies among modules. The remainder of this section contains a description of the computational path through the system with respect to a single example. The input to the program is a goal to describe an object from the knowledge base, which in this case contains a variety of facts about hypothetical stereo components. In addition, the input provides a communicative intention for the goal which may affect its ultimate realization, as shown in (10).</Paragraph> <Paragraph position="1"> For example, given the goal describe(x), the intention persuade-to-buy(hearer,x) may result in a radically different monologue than the intention persuade-to-sell(hearer,x).</Paragraph> <Paragraph position="2"> (10) Goal: describe e1 Input: generate(intention(bel(h1, good-to-buy(e1)))) Information from the knowledge base is selected to be included in the output by a set of relations that determines the degree to which knowledge base facts and rules support the communicative intention of the speaker. 
For example, suppose the system &quot;believes&quot; that conveying the proposition in (11) moderately supports the intention of making hearer h1 want to buy e1, and further that the rule in (12) is known by h1.</Paragraph> <Paragraph position="4"> The program then consults the facts in the knowledge base, verifies that the property does indeed hold and consequently includes the corresponding facts in the set of properties to be conveyed to the hearer, as shown in (13).</Paragraph> <Paragraph position="5"> (13) holds(produce(e1, e7)).</Paragraph> <Paragraph position="6"> holds(isa(e7, watts-per-channel)).</Paragraph> <Paragraph position="7"> holds(amount(e7, 100)).</Paragraph> <Paragraph position="8"> The content generator starts with a simple description template that specifies that an object is to be explicitly identified or defined before other propositions concerning it are put forth. Other relevant propositions concerning the object in question are then linearly organized according to beliefs about how well they contribute to the overall intention. Finally, a small set of rhetorical predicates rearranges the linear ordering of propositions so that sets of sentences that stand in some interesting rhetorical relationship to one another will be realized together in the output. These rhetorical predicates employ information structure to assist in maintaining the coherence of the output. For example, the conjunction predicate specifies that propositions sharing the same theme or rheme be realized together in order to avoid excessive topic shifting. The contrast predicate specifies that pairs of themes or rhemes that explicitly contrast with one another be realized together. 
The result is a set of properties roughly ordered by the degree to which they support the given intention, as shown in (14).</Paragraph> <Paragraph position="10"> The top-level propositions shown in (14) were selected by the program because the hearer (h1) is believed to be interested in the design of the amplifier and the reviews the amplifier has received.</Paragraph> <Paragraph position="11"> Moreover, the belief that the hearer is interested in buying an expensive, powerful amplifier justifies including information about its cost and power rating. Different sets of propositions would be generated for other (perhaps thriftier) hearers. Additionally, note that the propositions praise(e4, e1) and revile(e5, e1) are combined into the larger proposition contrast(praise(e4, e1), revile(e5, e1)). This is accomplished by the rhetorical constraints that determine the two propositions to be contrastive because e4 and e5 belong to the same set of alternative entities in the knowledge base and praise and revile belong to the same set of alternative propositions in the knowledge base.</Paragraph> <Paragraph position="12"> The next phase of content generation recognizes the dependency relationships between the properties to be conveyed based on shared discourse entities. This phase, which represents an extension of the rhetorical constraints, arranges propositions to ensure that consecutive utterances share semantic material (cf. McKeown et al., 1994). 
This rule, which in effect imposes a strong bias for Centering Theory's continue and retain transitions (Grosz et al., 1986), determines the theme-rheme segmentation for each proposition.</Paragraph> </Section> <Section position="2" start_page="297" end_page="298" type="sub_section"> <SectionTitle> 4.2 Sentence Planning </SectionTitle> <Paragraph position="0"> After the coherence constraints from the previous section are applied, the sentence planner is responsible for making decisions concerning the form in which propositions are realized. This is accomplished by the following simple set of rules. First, definitional isa properties are realized by the matrix verb. Other isa properties are realized by nouns or noun phrases. Top-level properties (such as those in (14)) are realized by the matrix verb. Finally, embedded properties (those evoked for building referring expressions for discourse entities) are realized by adjectival modifiers if possible and otherwise by relative clauses.</Paragraph> <Paragraph position="1"> While there are certainly a number of linguistically interesting aspects to the sentence planner, the most important aspect for the present purposes is the determination of theme-foci and rheme-foci.</Paragraph> <Paragraph position="2"> The focus assignment algorithm employed by the sentence planner, which has access to both the discourse model and the knowledge base, works as follows. First, each property or discourse entity in the semantic and information structural representations is marked as either previously mentioned or new to the discourse. This assignment is made with respect to two data structures: the discourse entity list (DElist), which tracks the succession of entities through the discourse, and a similar structure for evoked properties. Certain aspects of the semantic form are considered unaccentable because they correspond to the interpretations of closed-class items such as function words. 
Items that are assigned focus based on their &quot;newness&quot; are assigned the o focus operator, as shown in (15).</Paragraph> <Paragraph position="3"> The second step in the focus assignment algorithm checks for the presence of contrasting propositions in the ISStore, a structure that stores a history of information structure representations. Propositions are considered contrastive if they contain two contrasting pairs of discourse entities, or if they contain one contrasting pair of discourse entities as well as contrasting functors.</Paragraph> <Paragraph position="4"> Discourse entities are determined to be contrastive if they belong to the same set of alternatives in the knowledge base, where such sets are inferred from the isa-links that define class hierarchies. While the present implementation considers only entities with the same parent or grandparent class to be alternatives for the purposes of contrastive stress, a graduated approach that entails degrees of contrastiveness may also be possible.</Paragraph> <Paragraph position="5"> The effects of the focus assignment algorithm are easily shown by examining the generation of an utterance that contrasts with the utterance shown in (15). That is, suppose the generation program has finished generating the output corresponding to the examples in (10) through (15) and is assigned the new goal of describing entity e2, a different amplifier. After applying the second step of the focus assignment algorithm, contrasting discourse entities are marked with the * contrastive focus operator, as shown in (16). 
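The alternative-set test behind this second step can be sketched in Python as follows. The sketch is a hedged illustration rather than the system's Prolog code: the ISA table, the class names in it, and the functions class_at and are_alternatives are hypothetical, and reading the same-parent-or-grandparent criterion as a level-by-level comparison is an assumption.

```python
# Hypothetical isa-links: each entity or class maps to its parent class.
ISA = {
    "e1": "amplifier",
    "e2": "amplifier",
    "e3": "speaker",
    "e4": "journal",
    "amplifier": "component",
    "speaker": "component",
    "journal": "publication",
}


def class_at(entity, level):
    """Follow isa-links upward: level 1 is the parent, level 2 the grandparent."""
    node = entity
    for _ in range(level):
        node = ISA.get(node)
        if node is None:
            return None
    return node


def are_alternatives(a, b):
    """True if distinct a and b share a parent class or a grandparent class."""
    if a == b:
        return False
    for level in (1, 2):
        shared = class_at(a, level)
        if shared is not None and shared == class_at(b, level):
            return True
    return False


print(are_alternatives("e1", "e2"))  # same parent class
print(are_alternatives("e1", "e3"))  # same grandparent class
print(are_alternatives("e1", "e4"))  # no shared class within two levels
```

In this toy hierarchy the two amplifiers count as alternatives through their shared parent class, an amplifier and a speaker through their shared grandparent class, and an amplifier and a journal not at all.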
Since e1 and e2 are both instances of the class amplifiers and c1 and c2 both describe the class amplifiers itself, these two pairs of discourse entities are considered to stand in contrastive relationships.</Paragraph> <Paragraph position="6"> While the previous step of the algorithm determined which abstract discourse entities and properties stand in contrast, the third step uses the contrastive focus algorithm described in Section 2.2 to determine which elements need to be contrastively focused for reference to succeed. This algorithm determines the minimal set of properties of an entity that must be &quot;focused&quot; in order to distinguish it from other salient entities. For example, although the representation in (16) specifies that e2 stands in contrast to some other entity, it is the property of e2 having a tube design rather than a solid-state design that needs to be conveyed to the hearer. After applying the third step of the focus assignment algorithm to (16), the result appears as shown in (17), with &quot;tube&quot; contrastively focused as desired.</Paragraph> <Paragraph position="7"> The final step in the sentence planning phase of generation is to compute a representation that can serve as input to a surface form generator based on Combinatory Categorial Grammar (CCG) (Steedman, 1991), as shown in (18). 3 (18) Theme: np(3, s) : (el^S) ^d#(el,.xh(el)~s)~u/rh Rheme: s : ( acU pres)^ indeI(el, ( amplifier(cl )&amp; * tube(el))~isa(el, el))\np(a, s): elerh</Paragraph> </Section> <Section position="3" start_page="298" end_page="299" type="sub_section"> <SectionTitle> 4.3 Results </SectionTitle> <Paragraph position="0"> Given the focus-marked output of the sentence planner, the surface generation module consults a CCG grammar which encodes the information structure/intonation mapping and dictates the generation of both the syntactic and prosodic constituents. The result is a string of words and the appropriate prosodic annotations, as shown in (19). 
The output of this module is easily translated into a form suitable for a speech synthesizer, which produces spoken output with the desired intonation. 4 (19) The X5 is a TUBE amplifier. L+H~ L(H%) H* LL$</Paragraph> <Paragraph position="1"> The modules described above and shown in Figure 1 are implemented in Quintus Prolog. The system produces the types of output shown in (20) and (21), which should be interpreted as a single (two-paragraph) monologue satisfying a goal to describe two different objects. 5 Note that both paragraphs include very similar types of information, but radically different intonational contours, due to the discourse context. In fact, if the intonational patterns of the two examples are interchanged, the resulting speech sounds highly unnatural.</Paragraph> <Paragraph position="2"> 3 A complete description of the CCG generator can be found in (Prevost and Steedman, 1993). CCG was chosen as the grammatical formalism because it licenses non-traditional syntactic constituents that are congruent with the bracketings imposed by information structure and intonational phrasing, as illustrated in (3).</Paragraph> <Paragraph position="3"> Several aspects of the output shown above are worth noting. Initially, the program assumes that the hearer has no specific knowledge of any particular objects in the knowledge base. Note, however, that every proposition put forth by the generator is assumed to be incorporated into the hearer's set of beliefs. Consequently, the descriptive phrase &quot;an audio journal,&quot; which is new information in the first paragraph, is omitted from the second. Additionally, when presenting the proposition 'Audiofad is an audio journal,' the generator is able to recognize the similarity with the corresponding proposition about Stereofool (i.e. both propositions are abstractions over the single variable open proposition 'X is an audio journal'). 
The program therefore interjects the other property and produces &quot;another audio journal.&quot; 5 The implementation assigns slightly higher pitch to accents bearing the subscript c (e.g. H~), which mark contrastive focus as determined by the algorithm described above and in (Prevost, 1995).</Paragraph> <Paragraph position="4"> Several aspects of the contrastive intonational effects in these examples also deserve attention. Because of the content generator's use of the rhetorical contrast predicate, items are eligible to receive stress in order to convey contrast before the contrasting items are even mentioned. This phenomenon is clearly illustrated by the clause &quot;PRAISED by STEREOFOOL&quot; in (20), which is contrastively stressed before &quot;REVILED by AUDIOFAD&quot; is uttered. Such situations are produced only when the contrasting propositions are gathered by the content planner in a single invocation of the generator and identified as contrastive when the rhetorical predicates are applied. Moreover, unlike systems that rely solely on word class and given/new distinctions for determining accentual patterns, the system is able to produce contrastive accents on pronouns despite their &quot;given&quot; status, as shown in (21).</Paragraph> </Section> </Section> </Paper>