<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0903"> <Title>Software Re-Use and Evolution in Text Generation Applications</Title> <Section position="3" start_page="14" end_page="16" type="metho"> <SectionTitle> 3 A Common Architecture </SectionTitle> <Paragraph position="0"> While PLANDoc, FlowDoc, and ZEDDoc all share a common foundation, they embody distinctly different text generation applications. However, we aimed during the design of both FlowDoc and ZEDDoc to utilize as much of PLANDoc's architecture as possible, often adapting and generalizing modules that were originally written with only the PLANDoc system in mind.</Paragraph> <Paragraph position="1"> All three systems employ a modular pipeline architecture. A pipeline architecture is one that separates the functions involved in text generation, such as content planning, discourse organization, lexicalization, and syntactic realization, into distinct modules that operate in sequence. Modular pipeline architectures have a long history of use in text generation systems (Kukich, 1983a; McKeown, 1985; McDonald and Pustejovsky, 1986; Reiter, 1994), although recent work argues for the need for interaction between modules (Danlos, 1987; Rubinoff, 1992; McKeown et al., 1993). The most powerful argument for using pipeline architectures is the potential benefit of re-using individual modules for subsequent applications. However, with the exception of surface realization modules such as FUF/SURGE (Elhadad, 1992; Robin, 1994), actual code re-use has been minimal due to the lack of agreement about the order and grouping of subprocesses into modules.</Paragraph> <Paragraph position="2"> In PLANDoc, FlowDoc, and ZEDDoc, we utilize the following main modules, in the order listed below: * Message Generator: The message generator transcribes the raw data from LEIS-PLAN execution traces, ShowBiz, or ZED transaction logs into instances of message classes. We refer to simple collections of (possibly nested) attribute-value pairs pertaining to a single event as messages. Message classes are domain-specific (e.g., there are 30 of them in PLANDoc, 13 in FlowDoc, and 6 in ZEDDoc), but they all share the same representation as the basic content unit. In all three systems, generalization must occur at this level in order to create semantically concise messages from relatively large amounts of input data.</Paragraph> <Paragraph position="3"> * Ontologizer: In PLANDoc, a pipelined ontologizer enriches messages with domain-specific knowledge that is not explicitly present in the input. In FlowDoc and ZEDDoc, semantic enrichment is done at various stages by consulting external ontologies.</Paragraph> <Paragraph position="4"> * Discourse Organizer: The discourse organizer performs all the remaining functions prior to lexicalization and surface generation.2 Three sub-modules apply general discourse coherence constraints at the levels of discourse, sentence, and sentence constituent.
The first module performs aggregation and text linearization operations using an ontology of rhetorical predicates derived from Hobbs (1985) and Polanyi (1988).</Paragraph> <Paragraph position="5"> Linear order and prominence of the subconstituents are then determined, followed by constraints on subconstituents that affect lexical choice (e.g., centering and informational constraints, as in (Passonneau, 1996)).</Paragraph> <Paragraph position="6"> 2 In previous work we referred to this module as the Sentence Planner (Passonneau et al., 1996).</Paragraph> <Paragraph position="7"> * Lexicalizer: The lexicalizer maps message attributes into thematic/case roles, and chooses appropriate content (open-class) words for the values of these attributes.</Paragraph> <Paragraph position="8"> * Surface Generator: This module maps thematic roles into syntactic roles and builds syntactic constituents, chooses function (closed-class) words, ensures grammatical agreement, and linearizes to produce the final surface sentence.</Paragraph> <Paragraph position="9"> Our message generator modules are largely domain-specific, and we have made major changes to them while porting them to new applications. Even so, their ontological generalization technique, which produces semantically concise descriptions from frequency data, is domain-independent. Our final surface generation module is completely domain-independent; it employs the FUF/SURGE (Elhadad, 1991; Robin, 1994) text generation tools, and was re-used in all three systems with virtually no modifications. Modules near the middle of the pipeline provide the most interesting examples of code that can be re-used if it is general enough and relies on plug-and-play knowledge bases rather than hard-coded data. We return to this issue of code re-use and of the evolution of our modules to accommodate it in Section 5.</Paragraph> </Section> <Section position="4" start_page="16" end_page="17" type="metho"> <SectionTitle> 4 A Common Representation </SectionTitle> <Paragraph position="0"> All three systems employ a consistent, standardized attribute-value data format that persists from each module to the next. Examples of this internal data format were shown in Figures 1-3. This format is used for representing and processing conceptual-semantic, lexical-semantic, syntactic, and other linguistic information. Its persistent use facilitates inter-module communication and module independence, hence re-usability. Furthermore, it does not restrict the kinds of information that can be represented, and it is common to many non-NLP computational systems and languages (e.g., relational databases), thus making it easier for text generation systems to interface with existing applications.</Paragraph>
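To make the shared representation concrete, the sketch below shows one way the attribute-value messages and the five-module pipeline of Section 3 could be realized in Python. This is our illustration rather than code from PLANDoc, FlowDoc, or ZEDDoc: the message class name, its attributes, and the module names in the commented assembly are hypothetical.

```python
# Illustrative sketch only (not the systems' actual code or vocabulary).
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]  # a (possibly nested) collection of attribute-value pairs

example_message: Message = {
    "class": "equipment-addition",                      # hypothetical message class
    "route": "R1",
    "quarter": "1997Q1",
    "equipment": {"type": "fiber", "quantity": 48},     # nested attribute-value pairs
}

def run_pipeline(data: Any, modules: List[Callable[[Any], Any]]) -> Any:
    """Apply each module in sequence; every stage reads and writes the same
    attribute-value format, which keeps the modules independent of one another."""
    for module in modules:
        data = module(data)
    return data

# Hypothetical assembly mirroring the module order of Section 3:
# summary = run_pipeline(raw_records,
#                        [message_generator, ontologizer, discourse_organizer,
#                         lexicalizer, surface_generator])
```

Because every module consumes and produces the same kind of structure, a module can be swapped out, or the pipeline re-assembled for a new application, without changing the others.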
<Paragraph position="1"> The input to each of our three systems came from very different sources, some closer than others to attribute-value message format. PLANDoc's input came from n-tuple records representing program execution traces, so it required a filter to transform it into messages. FlowDoc's input came from ASCII representations of nodes and links in work flow diagrams, which were already essentially in attribute-value format. ZEDDoc's input, representing Web activity data, had been stored in an Oracle™ relational database by its application, so it too required little transformation.</Paragraph> </Section> <Section position="5" start_page="17" end_page="17" type="metho"> <SectionTitle> 5 Architectural Evolution </SectionTitle> <Paragraph position="0"> As discussed earlier, a practical goal for text generation research is to converge on a separation of functions into modules that can be independently re-used. Towards this goal, we have generalized and refined our architecture with each successive application. In fact, we significantly adapted our PLANDoc architecture for use in FlowDoc, but we were able to re-use the FlowDoc architecture and much of its code in ZEDDoc. Figure 4 contrasts the architecture of PLANDoc with those of FlowDoc and ZEDDoc.</Paragraph> [Figure 4: ... Text Generation Systems.] <Paragraph position="1"> The obvious architectural change from PLANDoc to FlowDoc (and ZEDDoc) is the extraction of ontological knowledge from the processing pipeline. Ontological knowledge is necessarily domain-specific, so this modification allowed us to implement significantly more general Message Generation and Discourse Organization modules and a somewhat more general Lexicalization module.</Paragraph> <Paragraph position="2"> These more general modules rely on external knowledge bases to supply the domain-specific information that was previously embedded in the code. Thus, we can replace the external knowledge base when moving to a new domain or application without having to modify the module itself. One of our future research goals is to further extract domain-specific lexical knowledge and further generalize the lexicalizer module (Jing et al., 1997).</Paragraph> <Paragraph position="3"> What is not so obvious from Figure 4 are the consistencies and shifts in function among the modules. In fact, the functions of the Lexicalization and Surface Generation modules remained constant across all three systems. But the functions of the first three modules shifted significantly from PLANDoc to FlowDoc. In particular, the function of message aggregation lay exclusively in the Discourse Organization module in PLANDoc (Shaw, 1995), whereas aggregation functions are executed in both the Message Generation and Discourse Organization modules in FlowDoc.</Paragraph> <Paragraph position="4"> Because the development of domain-independent, plug-and-play ontology modules is one of the major features that affected these shifts in function, and because such modules greatly increase the portability of the system, we devote the next section to a more detailed description of the function of ontological generalization.</Paragraph> </Section> <Section position="6" start_page="17" end_page="18" type="metho"> <SectionTitle> 6 Ontological Generalization </SectionTitle> <Paragraph position="0"> Ontological generalization refers to the problem of composing, with the help of an ontology, a concise description for a multi-set of concepts. For example, FlowDoc's output sentence shown in Figure 2, "The most frequent tasks in this workflow are those of creating, reviewing and saving documents," concisely describes a multi-set of ten specific task nodes in the flow diagram by locating superclass concepts in the ontology that encompass the specific predicates and objects of the task nodes.</Paragraph>
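As an illustration of what such a plug-and-play knowledge base might look like, the sketch below encodes a toy ontology as parent links plus one lexicalization per concept; the concept names are invented for this example and are not the systems' actual knowledge bases. A generalization or lexicalization module consults a table of this kind instead of hard-coded data, so moving to a new domain means replacing the table rather than the code.

```python
# Hypothetical plug-and-play ontology: parent links plus a lexicalization per
# concept. All concept names below are invented illustrations.
from typing import Dict, List, Optional

ONTOLOGY: Dict[str, Optional[str]] = {        # concept -> parent (None = root)
    "entity": None,
    "task": "entity",
    "create-document": "task",
    "review-document": "task",
    "save-document": "task",
    "document": "entity",
    "draft-document": "document",
    "sgml-draft-document": "draft-document",
}

LEXICALIZATIONS: Dict[str, str] = {           # at least one per concept
    "entity": "entity",
    "task": "task",
    "create-document": "creating documents",
    "review-document": "reviewing documents",
    "save-document": "saving documents",
    "document": "document",
    "draft-document": "draft document",
    "sgml-draft-document": "draft document in SGML format",
}

def superclasses(concept: str, ontology: Dict[str, Optional[str]]) -> List[str]:
    """All ancestors of a concept, nearest first; these are the candidate
    generalizations that subsume the concept."""
    chain: List[str] = []
    parent = ontology.get(concept)
    while parent is not None:
        chain.append(parent)
        parent = ontology.get(parent)
    return chain

# superclasses("sgml-draft-document", ONTOLOGY)
#   -> ["draft-document", "document", "entity"]
```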
<Paragraph position="1"> Our aim is to compose a description that is concise without sacrificing much in accuracy.</Paragraph> <Paragraph position="2"> While PLANDoc made extensive use of conjunction, ellipsis, and paraphrasing to produce a concise summary, ontological relations were not heavily used. For FlowDoc we implemented a more general, domain-independent solution. We were able to re-use this module with minor modifications in ZEDDoc, after replacing the ontological knowledge base.</Paragraph> <Paragraph position="3"> Our ontological generalization algorithm works as follows. Given a set C_O = {o_1, o_2, ..., o_N} of objects of a given predicate-class and an associated list (c_1, c_2, ..., c_N) of their occurrence counts, we compute an optimal set of concept generalizations {G_1, G_2, ..., G_M} such that each generalization replaces a subset of C_O while maintaining a reasonable trade-off between the accuracy, specificity, and verbosity of the resulting description.</Paragraph> <Paragraph position="4"> We consider as candidate concept generalizations the actual members of C_O and all the concepts in the domain ontology that subsume one or more of them. Each such candidate concept generalization is scored according to the semantic distance between each element of the subset and the candidate generalization.</Paragraph> <Paragraph position="5"> The semantic distance currently used is simply the number of levels between each object and the generalization in the domain ontology. It could easily be changed to an information-based distance, e.g., along the lines of the metrics proposed by Resnik (1995), who measures semantic distance between two concepts as a function of the lexical probabilities of their common superclasses.</Paragraph> <Paragraph position="6"> To compute the optimal set of generalizations, the algorithm starts by generating all possible partitions of the given set of objects,3 then locates the best single-term description for each subset in the partition by applying the procedure outlined above to each candidate generalization, and finally combines the single-term description scores into one number.</Paragraph> <Paragraph position="7"> The final score is adjusted by two additional penalties: * A verbosity penalty, penalizing descriptions with more than one generalization (exponentially more as the number of terms in the description increases).</Paragraph> <Paragraph position="8"> * A heterogeneity penalty, for descriptions that are locally optimal but significantly lower in the ontology (more specific) than the global specificity level.</Paragraph> <Paragraph position="9"> The global specificity level indicates the appropriate overall level of detail. It is computed by applying the above ontological generalization procedure to the collection of all the objects appearing in the input graph, across all actions. It implements the idea of "basic level" descriptions from (Rosch, 1978) for the application domain modeled by the work flow.</Paragraph>
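The sketch below is a simplified, illustrative reconstruction of this scoring scheme, not the authors' implementation: the count-weighted distance cost, the verbosity weight, the exponential penalty form, and the restriction to partitions of at most two subsets are our own assumptions, and the heterogeneity penalty is omitted for brevity.

```python
# Simplified reconstruction for illustration; formula details are assumptions.
from itertools import combinations
from typing import Dict, List, Optional, Tuple

def ancestors(concept: str, ontology: Dict[str, Optional[str]]) -> List[str]:
    """Ancestors of a concept under a parent-link ontology, nearest first."""
    chain: List[str] = []
    parent = ontology.get(concept)
    while parent is not None:
        chain.append(parent)
        parent = ontology.get(parent)
    return chain

def distance(obj: str, cand: str, ontology: Dict[str, Optional[str]]) -> Optional[int]:
    """Number of ontology levels separating obj from a subsuming candidate."""
    if obj == cand:
        return 0
    chain = ancestors(obj, ontology)
    return chain.index(cand) + 1 if cand in chain else None

def best_single_term(objects: List[str], counts: List[int],
                     ontology: Dict[str, Optional[str]]) -> Tuple[str, float]:
    """Best one-concept description of a multi-set: the object or subsuming
    concept with the smallest count-weighted total distance."""
    candidates = set(objects)
    for o in objects:
        candidates.update(ancestors(o, ontology))

    def cost(cand: str) -> float:
        total = 0.0
        for o, c in zip(objects, counts):
            d = distance(o, cand, ontology)
            if d is None:                  # candidate does not cover this object
                return float("inf")
            total += d * c
        return total

    return min(((cand, cost(cand)) for cand in candidates), key=lambda x: x[1])

def describe(objects: List[str], counts: List[int],
             ontology: Dict[str, Optional[str]],
             verbosity_weight: float = 10.0) -> List[str]:
    """Search partitions into at most two subsets (a stand-in for the paper's
    constrained partition search) and penalize multi-term descriptions
    exponentially; the heterogeneity penalty is omitted here."""
    index = list(range(len(objects)))
    partitions = [[tuple(index)]] + [
        [subset, tuple(i for i in index if i not in subset)]
        for r in range(1, len(objects))
        for subset in combinations(index, r)
    ]
    best_terms: List[str] = []
    best_score = float("inf")
    for partition in partitions:
        terms, score = [], 0.0
        for subset in partition:
            term, c = best_single_term([objects[i] for i in subset],
                                       [counts[i] for i in subset], ontology)
            terms.append(term)
            score += c
        score += verbosity_weight * (2 ** (len(terms) - 1) - 1)
        if score < best_score:
            best_terms, best_score = terms, score
    return best_terms

# Hypothetical usage:
#   ontology = {"document": None, "report": "document",
#               "memo": "document", "letter": "document"}
#   describe(["report", "memo", "letter"], [3, 3, 3], ontology)  # -> ["document"]
```

In the same spirit, the global specificity level described above could be approximated by running describe over all objects in the input graph, with a heterogeneity penalty then added for subsets whose chosen term is much more specific than that level.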
<Paragraph position="10"> For example, while processing a flow diagram which covers documents of many types, our algorithm will have a bias in favor of the generic term "Document" rather than the too-specific term "Draft document in SGML format"; a trade-off between the heterogeneity penalty and other components of the description score occurs if the latter term looks locally optimal.</Paragraph> <Paragraph position="11"> 3 With some performance-imposed constraints, since the number of possible partitions grows exponentially with the number of objects and the number of subsets in the partition.</Paragraph> <Paragraph position="12"> The same generalization method for sets of (concept, occurrence count) pairs was applied in ZEDDoc, but instead of actions or graph components, the concepts were Internet addresses or ZED page types. ZED requires semantic types to be assigned to WWW pages and ads to help determine which ads from its database can be inserted in pre-defined ad slots. When a ZEDDoc user requests a summary of activity pertaining to a particular set of ads for a given time period, the raw data consists in part of frequency lists indicating how many users from a given Internet node saw the relevant ads and how many of the displayed pages corresponded to particular semantic types. One minor change for ZEDDoc was the replacement of predefined absolute frequency thresholds for determining the salience of items with relative ones.</Paragraph> <Paragraph position="13"> To summarize the Internet domain or page type data, ZEDDoc relies on plug-and-play ontologies. Specialization subtrees rooted at certain concepts, e.g., the Internet domain, can be replaced so long as at least one lexicalization is provided for every concept. Our ontology for the Internet domain combined world knowledge with the implicit hierarchical structure of domain names. For example, through hand analysis of WWW logs we created a geographical categorization of university nodes, on the assumption that such demographic information is important to advertisers.</Paragraph> </Section> <Section position="7" start_page="18" end_page="19" type="metho"> <SectionTitle> 7 Component Re-Use Revisited </SectionTitle> <Paragraph position="0"> The major theme throughout this paper has been how we re-used components from our original PLANDoc system to implement the subsequent FlowDoc and ZEDDoc systems, significantly cutting development time. In this section, we summarize our experiences regarding code re-use.</Paragraph> <Paragraph position="1"> * The message generator offers limited possibilities for re-use because it directly interfaces to an application-specific external source. Limited code sharing was possible, however, because of our choice of a common representation format for all three systems.</Paragraph> <Paragraph position="2"> * As noted briefly in Section 3, the FlowDoc architecture had distinct modules pertaining to the three levels of discourse, sentence, and sentence constituent. Retaining this more general architecture in ZEDDoc proved useful with respect to one additional required functionality, namely the ability to produce plain text or HTML output.
The three levels of discourse organization were exploited in ZEDDoc primarily to distinguish between HTML commands that pertain to the overall layout (e.g., paragraph divisions) versus those that pertain to sentence-internal features (e.g., fonts).</Paragraph> <Paragraph position="3"> * At the lexicalization level, we achieved only partial generalization of the lexicalizer's code. Given the state of the art in natural language generation, the lexicon remains necessarily domain-specific. However, we are exploring ways to remove domain-specific lexical knowledge from the system pipeline, as we did with domain-specific ontological and discourse knowledge.</Paragraph> <Paragraph position="4"> We are building a large-scale general lexicon for generation, which provides syntactic and partial semantic knowledge and can be used to select the generated sentence structure and possible paraphrases (Jing et al., 1997). By using this general lexicon together with a smaller domain-specific lexicon or with information extracted from a corpus from the application domain, we expect to significantly simplify the development of the lexicalization module, improving its reliability and portability.</Paragraph> <Paragraph position="5"> * At the final surface generation level, we took advantage of prior progress in component standardization and used FUF (Functional Unification Formalism) and its corresponding extensive English surface grammar SURGE. As a result, the surface generation module was ported unchanged to the other systems.</Paragraph> </Section> <Section position="8" start_page="19" end_page="19" type="metho"> <SectionTitle> 8 Conclusion </SectionTitle> <Paragraph position="0"> By teasing apart some of PLANDoc's modules and partially re-configuring others, we were able to port our text generation system to two completely new domains, those of flow chart and WWW activity summarization. In the process, we devised domain-independent message aggregation and discourse restructuring modules for FlowDoc that we re-used intact for ZEDDoc. Indeed, we believe that our ontological generalization algorithm (i.e., message aggregation guided by quantitative formulas over plug-and-play ontologies) is generally domain-independent. We are exploring ways to introduce probability estimates in our weighting functions for message aggregation, linking the static ontology with corpus-observable variations in concept use and coverage.</Paragraph> <Paragraph position="1"> Re-usable tools and techniques can provide leverage for building practical text generation applications. They can also facilitate research leading to increasingly more general and more useful tools. This has been our experience in implementing the three text generation systems covered in this paper, which are all based on a common architecture, a common representation format, and a common, evolving foundation of text generation tools.</Paragraph> <Paragraph position="2"> At least three other factors that are critical to practical and commercial success should be mentioned, though we cannot discuss them here. Two of them, i) extensive user-needs analysis and feedback and ii) target corpus compilation and analysis, are highly correlated with the relative success of each of our systems. These two factors are discussed in more detail in previous papers (Kukich et al., 1994; Kukich, 1983b).
A third, undocumented factor, the rigorous pre-release testing of the system under conditions similar to its deployment environment, played a critical role in PLANDoc's success.</Paragraph> </Section> </Paper>