<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0808">
  <Title>Construction of a Spanish Generation module in the framework of a General-Purpose, Multilingual Natural Language Processing System</Title>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2 Syntactic Generation Component
</SectionTitle>
    <Paragraph position="0"> The different language generation modules in our system are syntactic realization components that take as input an LF characteristic of the language to be generated and produce a syntactic tree and surface string for that language. In this sense, they are functionally similar to the REALPRO system (Lavoie and Rambow, 1997).</Paragraph>
    <Paragraph position="1">  English gloss is provided in Figure 2 for readability purposes only.</Paragraph>
    <Paragraph position="2"> The generation modules are not designed specifically for MT, but rather are applicationindependent. They can take as input an LF produced by a dialog application, a critiquing application, a database query application, an MT application, etc. They only require a monolingual dictionary for the language being generated and an input LF that is characteristic of that language. For each language there is only one generation component that is used for all applications, and for MT, it is used for translation from all languages to that language. At the beginning of generation, the input LF is converted into a basic syntactic tree that conforms to the tree geometry of the NLP system. The nodes in LF become subtrees of this tree and the LF relations become complement/adjunct relationships between the subtrees. This basic tree can be set up in different ways. For English, Spanish, and Chinese, we set it up as strictly head-initial with all the complements/adjuncts following the head, resembling the tree of a VSO language.</Paragraph>
    <Paragraph position="3"> For Japanese, we set it up as strictly head-final, with all the complements/adjuncts preceding the head. Figure 3 gives the basic Spanish generation tree produced from the Spanish transferred LF in Figure 2.</Paragraph>
    <Paragraph position="4"> Figure 3 The generation rules apply to the basic tree, transforming it into a target language tree. In the application of the rules, we traverse the tree in a top-down, left-to-right, depth-first fashion, visiting each node and applying the relevant rules. Each rule can perform one or more of the following operations:  (1) Assign a syntactic label to the node. For example, the &amp;quot;DECL&amp;quot; label will be assigned to the root node of a declarative sentence.</Paragraph>
    <Paragraph position="5"> (2) Modify a node by changing some  information within the node. For example, a pronoun might be marked as reflexive if it is found to be co-referential with the subject of the clause it is in.</Paragraph>
    <Paragraph position="6">  (3) Expand a node by introducing new node(s) into the tree. For example, the &amp;quot;Definite&amp;quot; (+Def) feature on a node may become a determiner phrase attached to the syntactic subtree for that node.</Paragraph>
    <Paragraph position="7"> (4) Delete a node. For example, for a pro-drop language, a pronominal subject may be removed from the tree.</Paragraph>
    <Paragraph position="8"> (5) Move a node by deleting it from Position A and inserting it in Position B. For example, for an SVO language, the subject NP of a sentence may be moved from a post-verbal position to a pre-verbal position.</Paragraph>
    <Paragraph position="9"> (6) Ensure grammatical agreement between nodes. For example, if the subject of a sentence is first person singular, those number and person features will be assigned to the main verb.</Paragraph>
    <Paragraph position="10"> (7) Insert punctuation and capitalization.</Paragraph>
    <Paragraph position="11">  The nodes in the generated tree are linked to each other by relations such as &amp;quot;head&amp;quot;, &amp;quot;parent&amp;quot; and &amp;quot;sibling&amp;quot;. The entire tree is thus visible from any given node via these relations. When a rule is applied to a node, the decisions made in that rule can be based not just on features of that node, but also on features of any other node in the tree. This basically eliminates the need for backtracking, which would be necessary only if there were local ambiguities resulting from the absence of global information. In this sense, our approach is similar to that of other large-scale generators (Tomita and Nyberg, 1988).</Paragraph>
    <Paragraph position="12"> The generation rules operate on a single tree. Rule application is deterministic and thus very efficient. If necessary, the tree can be traversed more than once, as is the case in the generation modules for the languages we are currently working on. There is a &amp;quot;feeding&amp;quot; relationship among the rules. The rules that assign punctuation and capitalization, for example, do not apply until all the movement rules have applied, and movement rules do not apply until nodetypes and functional roles are assigned.</Paragraph>
    <Paragraph position="13"> To improve efficiency and to prevent a rule from applying at the wrong time or to the wrong structure, the rules are classified into different groups according to the passes in which they are applied. Each traversal of the tree activates a given group of rules. The order in which the different groups of rules are applied depends on the feeding relations.</Paragraph>
    <Paragraph position="14"> For the simple example in Figure 2 above, the Spanish, Chinese, and Japanese generation components all have an initial pass that assigns nodetypes and functional roles and a final pass that inserts punctuation marks.</Paragraph>
    <Paragraph position="15"> In addition, the Spanish component, in a first pass that identifies syntactic functions, deletes the pronominal subject and inserts a dative clitic pronoun. It also inserts the definite article and the personal marker &amp;quot;a&amp;quot;. In a second pass, it checks agreement between indirect object and doubled clitic as well as between subject and verb, assigning the appropriate person, number, and gender agreement information to the terminal nodes.</Paragraph>
    <Paragraph position="16"> Reordering operations, such as moving the clitic in front of the verb, if the verb is finite, or after, if it is non-finite, come later. The last pass takes care of euphonic issues, such as contractions or apocopated adjectives. Figure 4a shows the resulting tree.</Paragraph>
    <Paragraph position="17"> Figure 4a The Chinese component has a nodemodification pass, which adds the FUNCW node headed by (le) to indicate past tense. In this pass the direct object is also turned into a prepositional phrase introduced by (ba) to show the definiteness of the NP. Following this pass, a movement pass moves the subject in front of the verb.</Paragraph>
    <Paragraph position="18"> Figure 4b The Japanese component has a pass in which case-markers or modifiers are inserted. In Figure 4c, the nominative, the accusative, and the dative case markers are inserted in the subject, direct object, and indirect object NPs, respectively. Also, the demonstrative corresponding to English &amp;quot;that&amp;quot; is inserted at the beginning of the definite NP (pencil).</Paragraph>
    <Paragraph position="19"> Figure 4c After the grammatical rules apply, the morphological rules apply to the leaf nodes of thetree. Sinceeachnodeinthetreeisafeature matrix and agreement information has already been assigned by the generation rules, morphological processing simply turns the feature matrices into inflected forms. For instance, in our Spanish example, the verb &amp;quot;dar&amp;quot; with the &amp;quot;past&amp;quot;, &amp;quot;singular&amp;quot; and &amp;quot;1st person&amp;quot; features is spelled out as &amp;quot;di&amp;quot;. Once all the words are inflected, the inflected form of each leaf node is displayed to produce the surface string. This completes the generation process, as exemplified for Spanish in Figure 5.</Paragraph>
    <Paragraph position="20"> Figure 5</Paragraph>
  </Section>
  <Section position="6" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 Application-Driven Generation
</SectionTitle>
    <Paragraph position="0"> The example used in the previous sections is quite simple, and not representative of the actual problems that arise in MT. Applications, such as MT, that automatically create input for the generation component for a language will not always produce ideal LFs for that language, i.e., LFs that could have been produced by the analysis modules for that language.</Paragraph>
    <Paragraph position="1"> We have designed the generation components, therefore, to add a degree of robustness to our applications. To some extent, and based only on information about the language being generated, the generation components will fix incomplete or inconsistent LFs and will verify that the structures they generate comply with the constraints imposed by the target language.</Paragraph>
    <Paragraph position="2"> The core generation rules are designed to be application-independent and source-languageindependent. Expanding the rule base to cover all the idiosyncrasies of the input would contaminate the core rules and result in loss of generality. In order to maintain the integrity of the core rules while accommodating imperfect input, we have opted to add a pre-generation layer to our generation components.</Paragraph>
    <Paragraph position="3"> Pre-generation rules apply before the basic syntactic tree is built. They can modify the input LF by adding or removing features, changing lemmas, or even changing structural relations. Below we give examples of problems solved in the pre-generation layers of our different language generation modules. These illustrate not just the source-language independence, but also the applicationindependence of the generation modules. We start with the English generation component, which was used in experimental question-answering applications before being used in MT. Among the pre-generation rules in this component is one that removes the marker indicating non-restrictive modification (Nonrest) from LF nodes that are not in a modification relationship to another LF node. So, for example, when the question-answering application is presented with the query &amp;quot;When did Hitler come to power,&amp;quot; the NLP system analyzes the question, produces an LF for it, searches its Encarta Mindnet (which contains the LFs for the sentences in the Encarta encyclopedia), retrieves the LF fragment in  The LF that is the input to generation in this example is a portion of the LF representation of a complete sentence that includes the phrase &amp;quot;Hitler, who came to power in 1933.&amp;quot; The part of that sentence that answers the question is the nonrestrictive relative clause &amp;quot;who came to power in 1933.&amp;quot; Yet, we do not want to generate the answer as a non-restrictive relative clause (as indicated by Nonrest in the LF), but as a declarative sentence. So, rather than pollute the core generation rules by including checks for implausible contexts in the rule for generating nonrestrictive modifiers, a pre-generation rule simply cleans up the input. The rule is application-independent (though motivated by a particular application) and can only serve to clean up bad input, whatever its source.</Paragraph>
    <Paragraph position="4"> An example of a rule motivated by MT, but useful for other applications, is the pre-generation rule that changes the quantifier &amp;quot;less&amp;quot; to &amp;quot;fewer&amp;quot;, and vice versa, in the appropriate situations. When the LF input to the English generation component specifies &amp;quot;less&amp;quot; as a quantifier of a plural count noun such as &amp;quot;car,&amp;quot; this rule changes the quantifier to &amp;quot;fewer&amp;quot;. Conversely, when an input LF has &amp;quot;fewer&amp;quot; specified as a quantifier of a mass noun such as &amp;quot;luck&amp;quot;, the rule changes it to &amp;quot;less.&amp;quot; This rule makes no reference to the source of the input to generation. This has the advantage that it will apply in a grammar-checking application as well as in an MT application (or any other application). If the input to English generation were the LF produced for the ungrammatical sentence &amp;quot;He has less cars,&amp;quot; the generation component would produce the correct &amp;quot;He has fewer cars,&amp;quot; thereby effectively grammar checking the sentence. And, if the ultimate source of the same input LF were the Spanish sentence &amp;quot;Juan tiene menos coches, &amp;quot; the result would be the same, even if &amp;quot;menos&amp;quot; which corresponds to both &amp;quot;less&amp;quot; and &amp;quot;fewer&amp;quot; in English, were not transferred correctly. Another type of problem that a generation component might encounter is the absence of necessary information. The Spanish generation component, for instance, may receive as input underspecified nominal relations, such as the one exemplified in Figure 7, in which a noun (registro) is modified by another noun (programa). The relationship between the two nouns needs to be made explicit, in Spanish, by means of a preposition when the modifying noun is not a proper noun. Absent the necessary  informationintheincomingLF,apregeneration rule introduces the default preposition &amp;quot;de&amp;quot; to specify this relationship.  Another example of a pre-generation rule, this time from Japanese, deals with the unspecified 1st/2nd person pronominal subject for particular types of predicates. The 1st/2nd person pronoun ( ) is not used as the subject in sentences that express the speaker's/the listener's desire (unless there is some focus/contrast on the subject). So, one of the Japanese pre-generation rules deletes the subject in the input LF that involves such a predicate.</Paragraph>
    <Paragraph position="5"> For instance, below is the input LF, the modified LF, and the string produced from the English sentence &amp;quot;I want to read the book.&amp;quot; Figure 8 From Chinese, we give an example of a rule that actually changes the structure of an LF. In our system, it is possible for the source and target languages to have different LF representations for similar structures. In English and other European languages, for example, the verb &amp;quot;BE&amp;quot; is required in sentences like &amp;quot;He is smart&amp;quot;. In Chinese, however, no copula is used. Instead, an adjectival predicate is used. While we might atemptattheLFleveltounifythese representations, we have not yet done so.</Paragraph>
    <Paragraph position="6"> Moreover, the LF in our system is not intended to be an interlingua representation. Differences between languages and their LFs are tolerated.</Paragraph>
    <Paragraph position="7"> Therefore, Chinese uses a pre-generation rule to transform the be-predicate adjective LF into its Chinese equivalent as shown in Figure 9, though we soon expect transfer to automatically do this. Figure 9</Paragraph>
  </Section>
class="xml-element"></Paper>