<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1507">
  <Title>Application-driven automatic subgrammar extraction</Title>
  <Section position="4" start_page="0" end_page="49" type="metho">
    <SectionTitle>
2 Grammar extraction algorithm
</SectionTitle>
    <Paragraph position="0"> Systemic Functional Grammar (SFG) (Halliday, 1985) is based on the assumption that the differentiation of syntactic phenomena is always determined by its function in the communicative context. This functional orientation has led to the creation of detailed linguistic resources that are characterized by an integrated treatment of content-related, textual and pragmatic aspects. Computational instances of systemic grammar are successfully employed in some of the largest and most influential text generation projects, for example PENMAN (Mann, 1983), COMMUNAL (Fawcett and Tucker, 1990), TECHDOC (Rösner and Stede, 1994), Drafter (Paris and Vander Linden, 1996), and Gist (Not and Stock, 1994).</Paragraph>
    <Paragraph position="1"> For our present purposes, however, it is the formal characteristics of systemic grammar and its implementations that are more important. Systemic grammar assumes multifunctional constituent structures representable as feature structures with coreferences. As shown in the following function structure example for the sentence &quot;The people that buy silver love it.&quot;, different functions can be filled by one and the same constituent:
    Given the notational equivalence of HPSG and systemic grammar first mentioned by (Carpenter, 1992) and (Zajac, 1992), and further elaborated in (Henschel, 1995), one can characterize a systemic grammar as a large type hierarchy with multiple (conjunctive and disjunctive) and multi-dimensional inheritance with an open-world semantics. The basic element of a systemic grammar, a so-called system, is a type axiom of the form (adopting the notation of CUF (Dörre et al., 1996)):
      entry = type_1 | type_2 | ... | type_n.</Paragraph>
    <Paragraph position="2"> where type_1 to type_n are exhaustive and disjoint subtypes of type entry. entry need not necessarily be a single type; it can be a logical expression over types formed with the connectors AND and OR. A systemic grammar therefore resembles a type lattice more than a type hierarchy in the HPSG tradition. In systemic grammar, these basic type axioms, the systems, are named; we will use entry(s) to denote the left-hand side of some named system s, and out(s) to denote the set of subtypes {type_1, type_2, ..., type_n}, the output of the system. The following type axioms taken from the large systemic English grammar NIGEL (Matthiessen, 1983) shall illustrate the nature of systems in a systemic grammar:</Paragraph>
    <Paragraph position="4"> The meaning of these type axioms is fairly obvious: nominal groups can be subcategorized into class-names and individual-names on the one hand, and, with respect to their WH-containment, into WH-containing nominal groups and nominal groups without a WH-element on the other. The singular/plural opposition is valid for class-names as well as for WH-containing nominal groups (be they class or individual names), but not for individual-names without a WH-element.</Paragraph>
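    <Paragraph> As a minimal sketch of how such named systems might be held as data, the Python fragment below encodes three systems that are consistent with the prose description above. The `System` class, the system names and the type name `nonwh-nominal` are illustrative assumptions reconstructed from that description, not the original NIGEL axioms. Entries are stored in disjunctive normal form, i.e. as a set of conjunctions; `who_has_in_entry` mirrors the lookup of the same name used by the extraction procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    """A named type axiom: entry = out_1 | out_2 | ... | out_n."""
    name: str
    entry: frozenset  # DNF: a set of conjunctions, each a frozenset of type names
    out: tuple        # exhaustive, disjoint subtypes


def dnf(*conjunctions):
    """Build a DNF entry from iterables of type names."""
    return frozenset(frozenset(c) for c in conjunctions)


# Hypothetical names inferred from the prose, not copied from NIGEL.
SYSTEMS = [
    System("nominal-class", dnf(["nominal-group"]),
           ("class-name", "individual-name")),
    System("wh-containment", dnf(["nominal-group"]),
           ("wh-nominal", "nonwh-nominal")),
    # entry is the disjunction: class-name OR wh-nominal
    System("number", dnf(["class-name"], ["wh-nominal"]),
           ("singular", "plural")),
]


def who_has_in_entry(type_name, systems):
    """Names of the systems whose entry mentions the given type."""
    return [s.name for s in systems if any(type_name in c for c in s.entry)]


print(who_has_in_entry("nominal-group", SYSTEMS))
print(who_has_in_entry("wh-nominal", SYSTEMS))
```

    Because two systems share the entry nominal-group, a nominal group is classified along two dimensions at once, which is what makes the grammar a lattice rather than a tree.</Paragraph>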
    <Paragraph position="5"> Systemic types inherit constraints with respect to appropriate features, their filler types, coreferences and order. Here are the constraints for some of the types defined above:
      nominal-group [Thing: noun]
      class-name [Thing: common-noun, Deictic: top]
      individual-name [Thing: proper-noun]
      wh-nominal [Wh: top]
    In systemic grammar, universal principles and rules are not factored out. The lexicon contains stem forms and has a detailed word class type hierarchy at its top. Morphology is also organized as a monotonic type hierarchy. Currently used implementations of SFG are the PENMAN system (Penman Project, 1989), the KPML system (Bateman, 1997) and WAG-KRL (O'Donnell, 1994).</Paragraph>
    <Paragraph position="6"> Our subgrammar extraction has been applied and tested in the context of the KPML environment.</Paragraph>
    <Paragraph position="7"> KPML adopts the processing strategy of the PENMAN system, so it is necessary to describe this strategy briefly. PENMAN performs a semantically driven top-down traversal through the grammatical type hierarchy for every constituent. The types passed during the traversal are collected and their feature constraints are unified to build a resulting feature structure. Substructure generation requires an additional grammar traversal controlled by the feature values given in the superstructure. In addition to the grammar in its original sense, the PENMAN system provides a particular interface between grammar and semantics.</Paragraph>
    <Paragraph position="8"> This interface is organized with the help of so-called choosers: decision trees associated with each system of the grammar which control the selection of an appropriate subtype during traversal. Choosers should be seen as a practical means of enabling applications (including text planners) to interact with the grammar using purely semantic specifications, even though a fully specified semantic theory may not yet be available for certain important areas necessary for coherent, fluent text generation. They also serve to enforce deterministic choice, an important property for practical generation (cf. (Reiter, 1994)).</Paragraph>
    <Paragraph position="9"> The basic form of a chooser node is as follows.</Paragraph>
    <Paragraph position="11"> The nodes in a chooser are queries to the semantics; the branches contain a set of actions, including embedded queries. Possible chooser actions are the following:</Paragraph>
    <Paragraph position="13"> (identify function concept) (copyhub function1 function2)
    A choose action of a chooser explicitly selects one of the output types of its associated system. In general, there can be several paths through a given chooser that lead to the selection of a single grammatical type: each such path corresponds to a particular configuration of semantic properties sufficient to motivate the grammatical type selected. Besides this (choose type), choosers serve to create a binding between given semantic objects and grammatical constituents to be generated. This is performed by the action (identify function concept). Because of the multifunctionality assumed for the constituent structure in systemic grammar, two grammatical functions can be realized by one and the same constituent with one and the same underlying semantics. The action (copyhub function1 function2) is responsible for identifying the semantics of both grammatical functions.</Paragraph>
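    <Paragraph> To make the decision-tree reading of choosers concrete, here is a small Python sketch, not KPML's actual chooser format. The tuple encoding, the `ask` tag for query nodes, and the query, response and concept names are all assumptions made for illustration; only the actions choose, identify and copyhub are named in the text. The example chooser is attached to a system with the outputs declarative and interrogative and has two different paths that both select declarative.

```python
def run_chooser(node, answer):
    """Walk a chooser and return the actions executed, in order.

    node   -- ("ask", query, {response: [action, ...]})
    answer -- callable mapping a query name to the semantic response
    """
    _, query, branches = node
    executed = []
    for action in branches[answer(query)]:
        if action[0] == "ask":      # embedded query: descend into it
            executed.extend(run_chooser(action, answer))
        else:                       # choose / identify / copyhub
            executed.append(action)
    return executed


# Hypothetical chooser; query and response names are invented.
INDICATIVE_CHOOSER = (
    "ask", "statement-q", {
        "yes": [("choose", "declarative")],
        "no": [
            ("ask", "question-q", {
                "yes": [("identify", "Subject", "topic"),
                        ("choose", "interrogative")],
                "no": [("choose", "declarative")],
            }),
        ],
    },
)

semantics = {"statement-q": "no", "question-q": "yes"}
print(run_chooser(INDICATIVE_CHOOSER, semantics.get))
```

    Since every response leads to exactly one branch, each run ends in exactly one choose action, which is the deterministic behaviour the text asks of choosers.</Paragraph>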
    <Paragraph position="14"> Within such a framework, the first stage of subgrammar extraction is to ascertain a representative set of grammatical types covering the texts for the intended application. This set can be obtained by running the text generation system within the application with the full unconstrained grammar. All grammatical types used during this training stage are collected to form the backbone for the subgrammar to be extracted. We call this cumulative type set the goal-types.</Paragraph>
    <Paragraph position="15"> The list of goal-types then gives the point of departure for the second stage, the automatic extraction of a consistent subgrammar. goal-types is used as a filter against which systems (type axioms) are tested. Types not in goal-types have to be excised from the subgrammar being extracted. This is carried out for the entries of the systems in a preparatory step.</Paragraph>
    <Paragraph position="16"> We assume that the entries are given in disjunctive normal form. First, every conjunction containing a type which is not in goal-types is removed. After this deletion of unsatisfiable conjunctions, every type in an entry which is not in goal-types is removed. The restriction of the outputs of every system to the goal-types is done during a simulated depth-first traversal through the entire grammatical type lattice. The procedure works on the type lattice with the revised entries. Starting with the most general type start (and the most general system, called rank, which is the system with start as entry), a hierarchy traversal looks for systems which, although restricted to the type set goal-types, actually branch, i.e. have more than one type in their output. These systems constitute the new subgrammar.</Paragraph>
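    <Paragraph> The preparatory revision of entries can be sketched as follows. This is an illustration under assumptions, not the implemented procedure: the function name `revise_entry` and the list-of-lists encoding of a disjunctive normal form are invented here, and a single type is encoded as a conjunction of length one.

```python
def revise_entry(entry, goal_types):
    """Restrict a DNF entry to goal_types.

    entry -- list of conjunctions, each a list of type names;
             a single type is a conjunction of length one.
    """
    goal = set(goal_types)
    # Step 1: remove conjunctions of two or more types that contain a
    # type outside goal_types; they can never be satisfied.
    step1 = [c for c in entry if len(c) == 1 or set(c).issubset(goal)]
    # Step 2: remove the remaining single types that are not in goal_types.
    return [c for c in step1 if set(c).issubset(goal)]


goal_types = {"nominal-group", "class-name", "singular"}

# class-name OR wh-nominal: only the first disjunct survives.
print(revise_entry([["class-name"], ["wh-nominal"]], goal_types))

# (class-name AND wh-nominal) OR nominal-group: the conjunction is dropped.
print(revise_entry([["class-name", "wh-nominal"], ["nominal-group"]], goal_types))
```

    In this encoding both steps reduce to the same subset test; they are kept apart only to follow the order of the description. An entry that comes back empty marks a system that can no longer be reached.</Paragraph>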
    <Paragraph position="17"> In essence, each grammatical system s is examined to see how many of its possible subtypes in out(s) are used within the target grammar. Those types which are not used are excised from the subgrammar being extracted. More specific types that are dependent on any excised types are not considered further during the traversal. Grammatical systems where there is only a single remaining unexcised subtype collapse to form a degenerate pseudo-system, indicating that no grammatical variation is possible in the considered application domain. For example, in the application described in section 3 the system indicative = declarative | interrogative.</Paragraph>
    <Paragraph position="18"> collapses into indicative = declarative.</Paragraph>
    <Paragraph position="19"> because questions do not occur in the application domain. Pseudo-systems of this kind are not kept in the subgrammar. The types on their right-hand side (pseudotypes) are excised accordingly, although they are used for deeper traversal, thus defining a path to more specific systems. Such a path can consist of more than one pseudotype if repeated traversal steps find further degenerate systems. Constraints defined for pseudotypes are raised and chooser actions are percolated down; more precisely, constraints belonging to a pseudotype are unified with the constraints of the most general non-pseudo type at the beginning of the path. Chooser actions from systems on the path are collected and extend the chooser associated with the final (and first non-pseudo) system of the path. However, in the case</Paragraph>
    <Paragraph position="21"> that a maximal type is reached which is not in goal-types, chooser actions have to be raised too. The number of goal-types is then usually larger than the number of types in the extracted subgrammar, because all pseudotypes in goal-types are excised. As the recursion criterion in the traversal, we first simply look for a system which has the current type in its revised entry, regardless of whether it occurs in a conjunction or not. This on its own, however, oversimplifies the real logical relations between the types and would create an inconsistent subgrammar. The problem is conjunctive inheritance. If the current type occurs in an entry of another system where it is conjunctively bound, a deeper traversal is in fact only licensed if the other types of the conjunction are chosen as well. To perform such a traversal, a breadth traversal with compilation of all crowns of the lattice (see (Aït-Kaci et al., 1989)) would be necessary. To avoid this potentially very expensive operation without giving up the consistency of the subgrammar, the implemented subgrammar extraction procedure sketched in Figure 1 maintains all systems with complex entries (be they conjunctive or disjunctive) for the subgrammar, even if they do not really branch and collapse to a single-subtype system (see footnote 2). A related approach can be found in (O'Donnell, 1992) for the extraction of smaller systemic subgrammars for analysis.
    [Figure 1, fragment as preserved: for all out ∈ inter do traverse-type(out, out, ∅, goaltypes); constraints(supertype) := unify(constraints(supertype), inheritedrealizations); traverse-type(type, supertype, inheritedconstraints, goaltypes): who := who-has-in-entry(type) ...]</Paragraph>
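    <Paragraph> The per-system decision described above (restrict the output, then keep, collapse or drop the system) can be illustrated with a short Python sketch. It is not a reconstruction of Figure 1: the function name, the status labels and the choice to judge complexity on the revised entry are assumptions made for this example, and the raising of constraints and chooser actions is left out.

```python
def classify_system(entry, out, goal_types):
    """Restrict out(s) to goal_types and decide what happens to the system.

    entry -- revised entry in DNF: list of conjunctions (lists of type names)
    out   -- the output types of the system
    Returns (status, remaining_outputs).
    """
    remaining = [t for t in out if t in goal_types]
    complex_entry = len(entry) > 1 or any(len(c) > 1 for c in entry)
    if not remaining:
        return "unused", remaining   # no output type occurs in the application
    if len(remaining) > 1:
        return "kept", remaining     # the system still branches
    if complex_entry:
        return "kept", remaining     # retained to keep the subgrammar consistent
    return "pseudo", remaining       # collapsed; its one output is a pseudotype


goal_types = {"indicative", "declarative", "class-name", "wh-nominal", "singular"}

print(classify_system([["indicative"]],
                      ["declarative", "interrogative"], goal_types))
print(classify_system([["class-name"], ["wh-nominal"]],
                      ["singular", "plural"], goal_types))
```

    The first call reproduces the indicative example, which collapses to a pseudo-system; the second shows a system that no longer branches but is kept because its entry is a disjunction.</Paragraph>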
    <Paragraph position="22"> If the lexicon is organized as or under a complex type hierarchy, the extraction of an application-tuned lexicon is carried out similarly. This has the effect that closed class words are removed from the lexicon if they are not covered in the application domain. Open class words belonging to word classes not covered by the subgrammar type set are removed. Some applications do not need their own lexicon for open class words because they can be linked to an externally provided domain-specific thesaurus (as is the case for the examples discussed below). In this case, a sublexicon extraction is not necessary.</Paragraph>
    <Paragraph position="23"> 2 Keeping the disjunctive systems is not necessary for the consistency, but saves multiple raising of one and the same constraint.</Paragraph>
  </Section>
  <Section position="5" start_page="49" end_page="230" type="metho">
    <SectionTitle>
3 Application for text type 'lexicon biographies'
</SectionTitle>
    <Paragraph position="0"> The first trial application of the automatic subgrammar extraction tool has been carried out for an information system with an output component that generates integrated text and graphics. This information system has been developed for the domain of art history and is capable of providing short biography articles for around 10 000 artists. The underlying knowledge base, comprising half a million semantic concepts, includes automatically extracted information from 14 000 encyclopedia articles from Macmillan's planned publication &quot;Dictionary of Art&quot;, combined with several additional information sources such as the Getty &quot;Art and Architecture Thesaurus&quot;; the application is described in detail in (Kamps et al., 1996). As input the user clicks on an artist name. The system then performs content selection, text planning, text and diagram generation and page layout automatically. Possible output languages are English and German.</Paragraph>
    <Paragraph position="1"> The grammar necessary for short biographical articles is, however, naturally much more constrained than that supported by general broad-coverage grammars. There are two main reasons for this: first, the relatively fixed text type &quot;encyclopedia biography&quot; involved, and second, particularly in the example information system, the relatively simple nature of the knowledge base, which does not support more sophisticated text generation as might appear in full encyclopedia articles. Without extensive empirical analysis, one can already state that such a grammar is restricted to main clauses, only coordinative complex clauses, and temporal and spatial prepositional phrases. It would probably be possible to produce the generated texts with relatively complex templates and aggregation heuristics, but the full grammars for English and German available in KPML already covered the required linguistic phenomena.</Paragraph>
    <Paragraph position="2"> The application of the automatic subgrammar extraction tool to this scenario is as follows.</Paragraph>
    <Paragraph position="3"> In the training phase, the information system runs with the full generation grammar. All grammatical types used during this stage are collected to yield the cumulative type set goal-types. How many text examples must be generated in this phase depends on the relative increase of new information (occurrence of new types) obtained with every additional sentence generated. We show here the results for two related text types: 'short artist biographies' and 'artist biography notes'.</Paragraph>
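    <Paragraph> The training phase amounts to accumulating a set and watching how fast it grows. The following Python sketch shows one way to compute such a growth curve and to measure how long it has been flat; the function names and the toy input are assumptions for illustration and do not come from KPML or from the experiments reported here.

```python
def growth_curve(types_per_sentence):
    """Cumulative number of distinct grammatical types after each sentence.

    types_per_sentence -- iterable of sets, one per generated sentence
    Returns (curve, goal_types).
    """
    seen, curve = set(), []
    for used in types_per_sentence:
        seen.update(used)
        curve.append(len(seen))
    return curve, seen


def sentences_since_last_new_type(curve):
    """Number of trailing sentences that contributed no new type."""
    quiet = 0
    for prev, cur in zip(reversed(curve[:-1]), reversed(curve[1:])):
        if cur != prev:
            break
        quiet += 1
    return quiet


# Toy training run with invented type names.
run = [
    {"clause", "declarative"},
    {"clause", "nominal-group"},
    {"clause", "declarative"},
    {"nominal-group"},
]
curve, goal_types = growth_curve(run)
print(curve, sorted(goal_types))
print(sentences_since_last_new_type(curve))
```

    A long flat tail suggests that further training sentences are unlikely to add types, although it cannot rule this out.</Paragraph>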
    <Paragraph position="4"> Figure 2 shows the growth curve for the type set (vertical axis) with each additional semantic specification passed from the text planner to the sentence generator (horizontal axis) for the first of these text types.
    [Sample generated text, beginning not preserved:] ... &quot;Oi yoi yoi&quot; and &quot;June 1953 (deep cadmium)&quot;. Anni Albers is American, and she is a textile designer, a draughtsman and a printmaker. She was born in Berlin on 12 June 1899. She studied art in 1916 - 1919 with Brandenburg. Also, she studied art at the Kunstgewerbeschule in Hamburg in 1919 - 1920 and the Bauhaus at Weimar and Dessau in 1922 - 1925 and 1925 - 1929. In 1933 she settled in the USA. In 1933 - 1949 she taught at Black Mountain College in North Carolina.
    [Figure 2: growth curve of the type set for the short biography text type]</Paragraph>
    <Paragraph position="5"> The graph shows the cumulative type usage for the first 90 biographies generated, involving some 230 sentences (see footnote 3). The subgrammar extraction for the &quot;short artist biographies&quot; text type can therefore be performed with respect to the 246 types that are required by the generated texts, applying the algorithm described above. The resulting extracted subgrammar is a type lattice with only 144 types. The size of the extracted subgrammar is only 11% of that of the original grammar. Run times for sentence generation with this extracted grammar typically range from 55%-75% of those of the full grammar (see Table 1); in most cases, therefore, they are less than one second with the regular KPML generation environment (i.e., unoptimized, with full debugging facilities resident).
    3 This represented the current extent of the knowledge base when the test was performed. It is therefore possible that with more texts, the size of the cumulative set would increase slightly, since the curve has not quite 'flattened out'. Explicit procedures for handling this situation are described below.</Paragraph>
    <Paragraph position="6"> [Figure 3: cumulative type use for the note biography text type]
    The generation times are indicative of the style of generation implemented by KPML. Clause types with more subtypes are likely to cause longer processing times than those with fewer subtypes. When there are in any case fewer subtypes available in the full grammar (as in the existential shown in Table 1), there will be a less noticeable improvement with the extracted grammar. In addition, the run times reflect the fact that the number of queries being asked by choosers has not yet been maximally reduced in the current evaluation. Noting the cumulative set of inquiry responses during the training phase would provide sufficient information for more effective pruning of the extracted choosers. The second example shows similar improvements.</Paragraph>
    <Paragraph position="7"> The very short biography entry is more appropriate for figure headings, margin notes, etc. The cumulative type use graph is shown in Figure 3. With this 'smaller' text type, the cumulative use stabilizes very quickly (i.e., after 39 sentences) at 205 types.</Paragraph>
    <Paragraph position="8"> This remained stable for a test set of 500 sentences.</Paragraph>
    <Paragraph position="9"> Extracting the corresponding subgrammar yields a grammar involving only 101 types, which is 7% of the original grammar. Sentence generation time is accordingly faster, ranging from 40%-60% of that of the full grammar. In both cases, it is clear that the size of the resulting subgrammar is dramatically reduced. The generation run-time is cut to 2/3. The run-time space requirements are cut similarly. The processing time for subgrammar extraction is less than one minute, and is therefore not a significant issue for improvement.</Paragraph>
  </Section>
</Paper>