<?xml version="1.0" standalone="yes"?> <Paper uid="P05-2007"> <Title>American Sign Language Generation: Multimodal NLG with Multiple Linguistic Channels</Title> <Section position="3" start_page="37" end_page="38" type="metho"> <SectionTitle> 2 ASL NLG: A Form of Multimodal NLG </SectionTitle> <Paragraph position="0"> NLG researchers think of communication signals in a variety of ways: some as a written text, others as speech audio (with prosody, timing, volume, and intonation), and those working in Multimodal NLG as text/speech with coordinated graphics (maps, charts, diagrams, etc.). Some Multimodal NLG focuses on &quot;embodied conversational agents&quot; (ECAs), computer-generated animated characters that communicate with users using speech, eye gaze, facial expression, body posture, and gestures (Cassell et al., 2000; Kopp et al., 2004).</Paragraph> <Paragraph position="1"> The output of any NLG system could be represented as a stream of values (or features) that change over time during a communication signal; some NLG systems specify more values than others. Because the English writing system does not record a speaker's prosody, facial expression, or gesture, a text-based NLG system specifies fewer communication stream values in its output than does a speech-based or ECA system. Some punctuation marks loosely correspond to intonation or pauses, but most prosodic information is lost; facial expression and gesture are generally not conveyed in writing, except perhaps for the occasional use of &quot;emoticons.&quot; ;-) A text-based NLG system requires literate users, to whom it can transfer some of the processing burden; they must mentally reconstruct more of the language performance than do users of speech or ECA systems.</Paragraph> <Paragraph position="2"> Since most writing systems are based on strings, text-based NLG systems can easily encode their output as a single stream, namely a sequence of words/characters. To generate more complex signals, multimodal systems decompose their output into several sub-streams - we'll refer to these as &quot;channels.&quot; Dividing a communication signal into channels can make it easier to represent the various choices the generator must make; generally, a different processing component of the system will govern the output of each channel. The trade-off is that these channels must be coordinated over time.</Paragraph> <Paragraph position="3"> Instead of thinking of channels as dividing a communication signal, we can think of them as groupings of individual values in the data stream that are related in some way. The channels of a multimodal NLG system generally correspond to natural perceptual/conceptual groupings called &quot;modalities.&quot; Coarsely, audio and visual parts of the output are thought of as separate modalities.</Paragraph> <Paragraph position="4"> When parts of the output appear on different portions of the display, they are also generally considered separate modalities. For instance, a multimodal NLG system for automobile driving directions may have separate processing channels for text, maps, other graphics, and sound effects.</Paragraph> <Paragraph position="5"> An ECA system may have separate channels for eye gaze, facial expression, manual gestures, and speech audio of the animated character.</Paragraph> <Paragraph position="6"> When a language has no commonly-known writing system - as is the case for ASL - it's not possible to build a text-based NLG system. 
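To make the preceding view of channels concrete - a communication signal as coordinated streams of time-indexed values - the following sketch shows one possible representation; the class and field names are illustrative assumptions, not part of any system described in this paper.

```python
# A minimal sketch (illustrative names only): a multichannel communication
# signal represented as time-stamped events grouped into named channels.
from dataclasses import dataclass

@dataclass
class Event:
    channel: str   # e.g. "words", "eye_gaze", "right_hand"
    value: str     # the value this channel carries during the interval
    start: float   # seconds from the start of the performance
    end: float

class MultichannelSignal:
    def __init__(self):
        self.events = []

    def add(self, channel, value, start, end):
        self.events.append(Event(channel, value, start, end))

    def channel(self, name):
        """All events on one channel, in temporal order."""
        return sorted((e for e in self.events if e.channel == name),
                      key=lambda e: e.start)

    def cooccur(self, a, b):
        """Do two events overlap in time, i.e., are they coordinated?"""
        return min(a.end, b.end) - max(a.start, b.start) > 0
```

In these terms, a text-based generator would populate only a &quot;words&quot; channel, while a speech-based or ECA generator would populate several channels that must be kept coordinated over time.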
We must produce an animation of a character (like an ECA) performing ASL, so we must specify how the hands, eye gaze, mouth shape, facial expression, head-tilt, and shoulder-tilt are coordinated over time. With no conventional string-encoding of ASL (that would compress the signal into a single stream), an ASL signal is spread over multiple channels of the output - a departure from most Multimodal NLG systems, which have a single linguistic channel/modality that is coordinated with other non-linguistic resources (Figure 1).</Paragraph> <Paragraph position="7"> Of course, we could invent a string-based notation for ASL so that we could use traditional text-based NLG technology. (Since ASL has no writing system, we would have to invent an artificial notation.) Unfortunately, since the users of the system wouldn't be trained in this new writing system, it could not be used as output; we would still need to generate a multimodal animation output.</Paragraph> <Paragraph position="8"> An artificial writing system could only be used for internal representation and processing. However, flattening a naturally multichannel signal into a single-channel string (prior to generating a multichannel output) can introduce its own complications to the ASL system's design. For this reason, this project has been exploring ways to represent the hierarchical linguistic structure of information on multiple channels of ASL performance (and how these structures are coordinated or uncoordinated across channels over time).</Paragraph> <Paragraph position="9"> Some multimodal systems have explored using linguistic structures to control (to some degree) the output of multiple channels. Research on generating animations of a speaking ECA character that performs meaningful gestures (Kopp et al., 2004) has similarities to this ASL project. First of all, the channels in the signal are basically the same; an animated human-like character is shown onscreen with information about eye, face, and arm movements being generated. However, an ASL system has no audio speech channel but potentially more fine-grained channels of detailed body movement.</Paragraph> <Paragraph position="10"> The less superficial similarity is that Kopp et al. (2004) have attempted to represent the semantic meaning of some of the character's gestures and to synchronize them with the speech output. This means that, as in an ASL NLG system, several channels of the signal are being governed by the linguistic mechanisms of a natural language.</Paragraph> <Paragraph position="12"> Unlike ASL, the gesture system uses the speech audio channel to convey nearly all of the meaning to the user; the other channels are generally used to convey additional/redundant information. Further, the internal structure of the gestures is not generally encoded in the system; they are typically atomic/lexical gesture events which are synchronized to co-occur with portions of speech output. A final difference is that gestures which co-occur with English speech (although meaningful) can be somewhat vague and are certainly less systematic and conventional than ASL body movements. 
So, while both systems may have multiple linguistic channels, the gesture system still has one primary linguistic channel (audio speech) and a few channels controlled in only a partially linguistic way.</Paragraph> </Section> <Section position="4" start_page="38" end_page="38" type="metho"> <SectionTitle> 3 This English-to-ASL MT Design </SectionTitle> <Paragraph position="0"> The linguistic and multimodal issues discussed above have had important consequences for the design of our English-to-ASL MT system. There are several unique features of this system caused by: (1) ASL having multiple linguistic channels that must be coordinated during generation, (2) ASL having both an LS and a CP form of signing, (3) CP signing visually conveying 3D spatial relationships in front of the signer's torso, and (4) ASL lacking a conventional written form. While ASL-particular factors influenced this design, section 5 will discuss how this design has implications for NLG of traditional written/spoken languages.</Paragraph> <Section position="1" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 3.1 Coordinating Linguistic Channels </SectionTitle> <Paragraph position="0"> Section 2 mentioned that this project is developing multichannel (non-string) encodings of ASL animation; these encodings must coordinate multiple channels of the signal as they are generated by the linguistic structures and rules of ASL. Kopp et al. (2004) have explored how to coordinate meaningful gestures with the speech signal during generation; however, their domain is somewhat simpler. Their gestures are atomic events without internal hierarchical structure. Our project is currently developing grammar-like coordination formalisms that allow complex linguistic signals on multiple channels to be conveniently represented. (Details of this work will be described in a future publication.)</Paragraph> </Section> <Section position="2" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 3.2 ASL Computational Linguistic Models </SectionTitle> <Paragraph position="0"> This project uses representations of discourse, semantics, syntax, and (sign) phonology tailored to ASL generation (Huenerfauth, 2004b). In particular, since this MT system will generate animations of classifier predicates (CPs), the system consults a 3D model of real-world scenes under discussion.</Paragraph> <Paragraph position="1"> Further, since multimodal NLG requires a form of scheduling (events on multiple channels are coordinated over a performance timeline), all of the linguistic models consulted and modified during ASL generation are time-indexed according to a timeline of the ASL performance being produced.</Paragraph> <Paragraph position="2"> Previous ASL phonological models were designed to represent non-CP ASL, but CPs use a reduced set of handshapes, standard eye-gaze and head-tilt patterns, and more complex orientations and motion paths. The phonological model developed for this system makes it easier to specify CPs. Because ASL signers can use the space in front of their body to visually convey information, it is possible during CPs to show the exact 3D layout of objects being discussed. (The use of channels representing the hands means that we can now indicate 3D visual information - not possible with speech or text.) 
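As a rough illustration of the kind of time-indexed, multichannel specification discussed in sections 3.1 and 3.2, the sketch below assigns values to several articulator channels over intervals of the performance timeline; the field names and handshape labels are hypothetical simplifications, not the system's actual phonological model.

```python
# Illustrative sketch of a time-indexed articulator specification for a
# classifier predicate; field and handshape names are assumptions.
from dataclasses import dataclass, field

@dataclass
class ArticulatorFrame:
    start: float                   # seconds on the performance timeline
    end: float
    right_handshape: str = None    # e.g. "Vehicle-3" or "Bulky-C"
    right_location: tuple = None   # (x, y, z) in the signing space
    right_orientation: tuple = None
    eye_gaze: tuple = None         # 3D point the eyes are directed toward
    head_tilt: tuple = None

@dataclass
class PerformanceTimeline:
    frames: list = field(default_factory=list)

    def schedule(self, frame):
        self.frames.append(frame)
        self.frames.sort(key=lambda f: f.start)

# A 0.8-second classifier predicate: the dominant hand takes a vehicle
# handshape at a 3D location while eye gaze is directed at the same point.
timeline = PerformanceTimeline()
timeline.schedule(ArticulatorFrame(0.0, 0.8,
                                   right_handshape="Vehicle-3",
                                   right_location=(0.2, 0.3, 0.4),
                                   eye_gaze=(0.2, 0.3, 0.4)))
```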
To represent this detailed 3D form of meaning, this system has an unusual semantic model for generating CPs. We populate the volume of space around the signer's torso with invisible 3D objects representing entities discussed by CPs being generated (Huenerfauth, 2004b). The semantic model is the set of placeholders around the signer (augmented with the CP handshape used for each). Thus, the semantics of the &quot;car parked next to the house&quot; example (section 1.1) is that a 'bulky' object occupies a particular 3D location and a 'vehicle' object moves toward it and stops.</Paragraph> <Paragraph position="3"> Of course, the system will also need more traditional semantic representations of the information to be conveyed during generation, but this 3D model helps the system select the proper 3D motion paths for the signer's hands when &quot;drawing&quot; the 3D scenes during CPs. Kopp et al. (2004) study gestures that convey spatial information during an English speech performance, but unlike this system, they use a logical-predicate-based semantics to represent information about objects referred to by gesture. Because ASL CPs indicate 3D layout in a linguistically conventional and detailed way, we use an actual 3D model of the objects being discussed. Such a 3D model may also be useful for ECA systems that wish to generate more detailed 3D spatial gesture animations.</Paragraph> <Paragraph position="4"> The discourse model in this ASL system records features not found in other NLG systems. It tracks whether a 3D location has been assigned to each discourse entity, where that location is around the signer, and whether the latest location of the entity has been indicated by a CP. The discourse model is not only relevant during CP performance; since ASL LS performance also assigns 3D locations to objects under discussion (for pronouns and verbal agreement), this model is also used for LS.</Paragraph> </Section> <Section position="1" start_page="39" end_page="40" type="sub_section"> <SectionTitle> 3.3 Generating 3D Classifier Predicates </SectionTitle> <Paragraph position="0"> An essential step in producing an animation of an ASL CP is the selection of 3D motion paths for the computer-generated signer's hands, eye gaze, and head tilt. The motion paths of objects in the 3D model described above are used to select corresponding motion paths for these parts of the signer's body during CPs. To build the 3D placeholder model, this system uses preexisting scene-visualization software to analyze an English text describing the motion of real-world objects and build a 3D model of how the objects mentioned in the text are arranged and move (Huenerfauth, 2004b).</Paragraph> <Paragraph position="1"> This model is &quot;overlaid&quot; onto the volume in front of the ASL signer (Figure 2). For each object in the scene, a corresponding invisible placeholder is positioned in front of the signer; the layout of placeholders mimics the layout of objects in the 3D scene. In the &quot;car parked next to the house&quot; example, a miniature invisible object representing a 'house' is positioned in front of the signer's torso, and another object (with a motion path terminating next to the 'house') is added to represent the 'car.' The locations and orientations of the placeholders are later used by the system to select the locations and orientations for the signer's hands while performing CPs about them. 
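A minimal sketch of this overlay step for the &quot;car parked next to the house&quot; example follows; the coordinate values, scaling factor, and entity names are assumptions for illustration, not the interface of the actual scene-visualization software.

```python
# Sketch: map scene coordinates from the visualization software onto a small
# volume in front of the signer's torso; all constants are illustrative.
SIGNING_SPACE_CENTER = (0.0, 0.35, 0.45)  # assumed point in front of the torso
SIGNING_SPACE_SCALE = 0.05                # shrink the scene into that volume

def to_signing_space(scene_point):
    cx, cy, cz = SIGNING_SPACE_CENTER
    x, y, z = scene_point
    return (cx + SIGNING_SPACE_SCALE * x,
            cy + SIGNING_SPACE_SCALE * y,
            cz + SIGNING_SPACE_SCALE * z)

# Hypothetical output of scene visualization for "the car parked next to
# the house": a location for the house, a motion path for the car.
scene = {
    "house": {"location": (0.0, 0.0, 0.0), "handshape": "Bulky"},
    "car": {"path": [(-3.0, 0.0, 0.0), (-0.8, 0.0, 0.0)], "handshape": "Vehicle-3"},
}

# Build invisible placeholders whose layout mimics the scene layout.
placeholders = {}
for name, info in scene.items():
    path = [to_signing_space(p) for p in info.get("path", [])]
    location = path[-1] if path else to_signing_space(info["location"])
    placeholders[name] = {"handshape": info["handshape"],
                          "location": location,
                          "path": path}
# The hand performing the car's classifier predicate would follow
# placeholders["car"]["path"], stopping next to placeholders["house"]["location"].
```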
So, the motion path calculated for the car will be the basis for the 3D motion path of the signer's hand during the classifier predicate describing the car's motion. Given the information in the discourse/semantic models, the system generates the hand motions, head-tilt, and eye-gaze for a CP. It stores a library containing templates representing a prototypical form of each CP the system can produce. The templates are planning operators (with logical pre-conditions, monitored termination conditions, and effects), allowing the system to &quot;trigger&quot; other elements of ASL signing performance that may be required during a CP. A planning-based NLG approach, described in Huenerfauth (2004b), is used to select a template, fill in its missing parameters, and build a schedule of the animation events on multiple channels needed to produce a sequence of CPs.</Paragraph> </Section> <Section position="2" start_page="40" end_page="40" type="sub_section"> <SectionTitle> 3.4 A Multi-Path Architecture </SectionTitle> <Paragraph position="0"> A multimodal NLG system may have several presentation styles it could use to convey information to its user; these styles may take advantage of the various output channels to different degrees. In ASL, there are multiple channels in the linguistic portion of the signal, and not surprisingly, the language has multiple sub-systems of signing that take advantage of the visual modality in different ways. ASL signers can select whether to convey information using lexical signing (LS) or classifier predicates (CPs) during an ASL performance (section 1.1). These two sub-systems use the space around the signer differently; during CPs, locations in space associated with objects under discussion must be laid out in a 3D manner corresponding to the topological layout of the real-world scene. Locations associated with objects during LS (used for pronouns and verb agreement) have no topological requirement. The layout of the 3D locations during LS may be arbitrary.</Paragraph> <Paragraph position="1"> The CP generation approach in section 3.3 is computationally expensive, so we would like to use this processing pathway only when necessary.</Paragraph> <Paragraph position="2"> English input sentences not producing classifier predicates would not need to be processed by the visualization software; in fact, most of these sentences could be handled using the more traditional MT technologies of previous systems. For this reason, our English-to-ASL MT system has multiple processing pathways (Huenerfauth, 2004a).</Paragraph> <Paragraph position="3"> The pathway for handling English input sentences that produce CPs includes the scene visualization software, while other input sentences undergo less sophisticated processing using a traditional MT approach (that is easier to implement). In this way, our CP generation component can actually be layered on top of a pre-existing English-to-ASL MT system to give it the ability to produce CPs. This multi-path design is equally applicable to the architecture of written-language MT systems. 
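The routing decision behind this multi-path design can be pictured as in the sketch below; the trigger phrases and function names are hypothetical stand-ins, not the criteria the system actually uses.

```python
# Sketch of multi-path routing: send only sentences that seem to call for a
# classifier predicate through the expensive visualization/planning pathway.
def needs_classifier_predicate(english_sentence):
    """Crude stand-in for the real routing decision (illustrative cues only)."""
    spatial_cues = ("parked next to", "drove up to", "walked around", "on top of")
    return any(cue in english_sentence.lower() for cue in spatial_cues)

def translate(english_sentence, cp_pathway, traditional_pathway):
    if needs_classifier_predicate(english_sentence):
        # resource-intensive path: scene visualization + planning-based CP generation
        return cp_pathway(english_sentence)
    # resource-light, broad-coverage path: traditional MT producing LS signing
    return traditional_pathway(english_sentence)
```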
The design allows an MT system to combine a resource-intensive deep-processing MT method for difficult (or important) inputs and a resource-light broad-coverage MT method for other inputs.</Paragraph> </Section> <Section position="3" start_page="40" end_page="40" type="sub_section"> <SectionTitle> 3.5 Evaluation of Multichannel NLG </SectionTitle> <Paragraph position="0"> The lack of an ASL writing system and the multichannel nature of ASL can make NLG or MT systems which produce ASL animation output difficult to evaluate using traditional automatic techniques. Many such approaches compare a string produced by a system to some human-produced 'gold-standard' string. While we could invent an artificial ASL writing system for the system to produce as output, it's not clear that human ASL signers could accurately or consistently produce written forms of ASL sentences to serve as 'gold standards' for such an evaluation. And of course, real users of the system would never be shown artificial &quot;written ASL&quot;; they would see full animations instead. User-based studies (where ASL signers evaluate animation output directly) may be a more meaningful measure of an ASL system.</Paragraph> <Paragraph position="1"> We are planning such an evaluation of a prototype CP-generation module of the system during the summer/fall of 2005. Members of the deaf community who are native ASL signers will view animations of classifier predicates produced by the system. As a control, they will also be shown animations of CPs produced using 3D motion capture technology to digitally record the performance of CPs by other native ASL signers. Their evaluation of animations from both sources will be compared to measure the system's performance. The multichannel nature of the signal also makes other interesting experiments possible. To study the system's ability to animate the signer's hands only, motion-captured ASL could be used to animate the head/body of the animated character, and the NLG system can be used to control only the hands of the character. Thus, channels of the NLG system can be isolated for evaluation - an experimental design only available to a multichannel NLG system.</Paragraph> </Section> </Section> <Section position="6" start_page="40" end_page="41" type="metho"> <SectionTitle> 4 Unique Design Features for ASL NLG </SectionTitle> <Paragraph position="0"> The design portion of this English-to-ASL project is nearly complete, and the implementation of the system is ongoing. Evaluations of the system will be available after the user-based study discussed above; however, the design itself has highlighted interesting issues about the requirements of NLG software for sign languages like ASL.</Paragraph> <Paragraph position="1"> The multichannel nature of ASL has led this project to study mechanisms for coordinating the values of the linguistic models used during generation (including the output animation specification itself). 
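Returning briefly to the channel-isolation evaluation described in section 3.5, the sketch below shows one way such a composite animation could be assembled; the function and channel names are assumptions for illustration, not the study's actual tooling.

```python
# Sketch of the channel-isolation idea from section 3.5: take most channels
# from motion-captured ASL and only the channels under test from the NLG
# system; names are illustrative assumptions.
def compose_animation(mocap_channels, generated_channels, channels_under_test):
    """Per-channel animation spec mixing motion capture and generated output."""
    composite = dict(mocap_channels)      # e.g. head, torso, face, eye gaze
    for name in channels_under_test:      # e.g. ["right_hand", "left_hand"]
        composite[name] = generated_channels[name]
    return composite
```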
The need to handle both the LS and CP subsystems of the language has motivated: a multi-path MT architecture, a discourse model that stores data relevant to both subsystems, a model of the space around the signer capable of storing both LS and CP placeholders, and a phonological model whose values can be specified by either subsystem.</Paragraph> <Paragraph position="2"> Since this English-to-ASL MT system is the first to address ASL classifier predicates, designing an NLG process capable of producing the 3D locations and paths in a CP animation has been a major design focus for this project. These issues have been addressed by the system's use of a 3D model of placeholders produced by scene-visualization software and a planning-based NLG process operating on templates of prototypical CP performance.</Paragraph> </Section> <Section position="7" start_page="41" end_page="41" type="metho"> <SectionTitle> 5 Applications Beyond Sign Language </SectionTitle> <Paragraph position="0"> Sign language NLG requires 3D spatial representations and multichannel coordinated output, but it's not unique in this requirement. In fact, generation of a communication signal for any language may require these capabilities (even for spoken languages like English). We have mentioned throughout this paper how gesture/speech ECA researchers may be interested in NLG technologies for ASL - especially if they wish to produce gestures that are more linguistically conventional, internally complex, or 3D-topologically precise.</Paragraph> <Paragraph position="1"> Many other computational linguistic applications could benefit from an NLG design with multiple linguistic channels (and indirectly benefit from ASL NLG technology). For instance, NLG systems producing speech output could encode prosody, timing, volume, intonation, or other vocal data as multiple linguistically-determined channels of the output (in addition to a channel for the string of words being generated). And so, ASL NLG research not only has exciting accessibility benefits for deaf users, but it also serves as a research vehicle for NLG technology to produce a variety of richer-than-text linguistic communication signals.</Paragraph> </Section> </Paper>