<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1116"> <Title>Generation that Exploits Corpus-Based Statistical Knowledge</Title> <Section position="4" start_page="704" end_page="705" type="metho"> <SectionTitle> (,,2 I \[dog<canid\[ </SectionTitle> <Paragraph position="0"> : quant plural) IStrings can be used in place of concepts. If the string is not a recognized word/phrase, then the generator will add this ambiguity to the word lattice for the statistical extractor to resolve by proposing all possible part-of-speech tags. We prefer to use concepts because they make the AMR more language-lndependent, and enable semantic reasoning and inference.</Paragraph> <Paragraph position="1"> 2Concept names appear between vertical bars. We use a set of short, unique concept names derived from the structure of WordNet by Jonathan Graehl, and available from http://www.isi.edu/natural-language/GAZELLE.html This narrows the meaning to &quot;the dogs,&quot; or &quot;dogs.&quot; Concepts can be associated with each other in a nested fashion to form more complex meanings. These relations between conceptual meanings are also expressed through keywords. It is through them that our formalism exhibits an appealing flexibility. A client has the freedom to express the relations at various semantic and syntactic levels, using whichever level of representation is most convenient. 3 We have currently implemented shallow semantic versions of roles such as :agent, :patient, :sayer, :sensor, etc., as well as deep syntactic roles such as :obliquel, :oblique2, and :oblique3 (which correspond to deep subject, object, and indirect object respectively, and serve as an abstraction for passive versus active voice) and the straightforward syntactic roles :subject, :directobject, :indirect-object, etc. We explain further how this is implemented later in the paper.</Paragraph> <Paragraph position="2"> Below is an example of a slightly more complex meaning. The root concept is eating, and it has an agent and a patient, which are dogs and a bone (or bones), respectively.</Paragraph> <Paragraph position="4"> Possible output includes &quot;The dogs ate the bone,&quot; &quot;Dogs will eat a bone,&quot; &quot;The dogs eat bones,&quot; &quot;Dogs eat bone,&quot; and &quot;The bones were eaten by dogs.&quot;</Paragraph> </Section> <Section position="5" start_page="705" end_page="705" type="metho"> <SectionTitle> 3 Lexical Knowledge </SectionTitle> <Paragraph position="0"> The Sensus concept ontology is mapped to an English lexicon that is consulted to find words for expressing the concepts in an AMR. The lexicon is a list of 110,000 tuples of the form: (<word> <part-of-speech> <rank> <concept>) Examples: ((&quot;eat&quot; VERB I feat,take in\[) (&quot;eat&quot; VERB 2 Jeat>eat lunch\[) deg.* ) The <rank> field orders the concepts by sense frequency for the given word, with a lower number signifying a more frequent sense.</Paragraph> <Paragraph position="1"> Like other types of knowledge used in Nitrogen, the lexicon is very simple. It contains no 3This flexibility has another advantage from a research point of view. We consider the appropriate level of abstraction an important problem in interlingua-style machine translation. The flexibility of this representation allows us to experiment with various levels of abstraction without changing the underlying system. 
<Paragraph position="1"> Like other types of knowledge used in Nitrogen, the lexicon is very simple. It contains no information about features like transitivity, subcategorization, gradability (for adjectives), or countability (for nouns). Such features are needed in other generators to produce correct grammatical constructions; our statistical post-processor instead softly (and robustly) ranks different grammatical realizations according to their likelihood.</Paragraph> <Paragraph position="2"> At the lexical level, several important issues in word choice arise. WordNet maps a concept to one or more synonyms, but some words may be less appropriate than others, or may actually be misleading in certain contexts. An example is the concept |sell<cozen|, to which the lexicon maps the words "betray" and "sell." It is not very common to use the word "sell" in the sense of "A traitor sells out on his friends"; in the sentence "I cannot |sell<cozen| their trust," the word "sell" is misleading, or at least sounds very strange, and "betray" is more appropriate.</Paragraph> <Paragraph position="3"> This word choice problem occurs frequently, and we deal with it by taking advantage of the word-sense rankings that the lexicon offers. According to the lexicon, the concept |sell<cozen| expresses the second most frequent sense of the word "betray," but only the sixth most frequent sense of the word "sell." To minimize the lexical choice problem, we have adopted a policy of rejecting words whose primary sense is not the given concept when better words are available.4

4 A better "soft" technique would be to accept all words returned by the lexicon for a given concept, but to associate a preference score with each word, using a method such as Bayes' Rule and probabilities computed from a corpus such as SEMCOR, allowing the statistical extractor to choose the best alternative. We plan to implement this in the future.</Paragraph> <Paragraph position="4"> Another issue in word choice relates to the broader problem of preserving ambiguities in MT. In source-language analysis, it is often difficult to determine which concept is intended by a certain word, so the AMR allows several concepts to be listed together in a disjunction, and the lexical lookup attempts to preserve the ambiguity of this *OR*. If several or all of the concepts in the disjunction can be expressed using the same word, then the lookup returns only that word or those words, in preference to the other possibilities. For a disjunction of |sell<cozen| with another sense of "betray," for example, the lookup returns only the word "betray." This also reduces the complexity of the final sentence lattices.</Paragraph>
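<Paragraph> To make the two policies above concrete, here is a minimal sketch of one way to operationalize them, reusing the words_for(concept) lookup sketched earlier; the helper names and the concept |betray<inform| are our own illustrative assumptions, not Nitrogen's actual code.

def prefer_primary_senses(candidates):
    """One reading of the ranking policy: keep only the best-ranked
    words for a concept (rank 1 = the word's primary sense), rejecting
    worse-ranked synonyms when better ones exist."""
    best = min(rank for (word, rank) in candidates)
    return [word for (word, rank) in candidates if rank == best]

def words_for_disjunction(concepts, words_for):
    """For an *OR* of concepts, prefer words that express every disjunct."""
    word_sets = [{word for (word, rank) in words_for(c)} for c in concepts]
    shared = set.intersection(*word_sets)
    return sorted(shared) if shared else sorted(set.union(*word_sets))

demo = {"|sell<cozen|": [("betray", 2), ("sell", 6)],
        "|betray<inform|": [("betray", 1), ("denounce", 2)]}  # hypothetical
print(prefer_primary_senses(demo["|sell<cozen|"]))            # ['betray']
print(words_for_disjunction(list(demo), demo.get))            # ['betray']
</Paragraph>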
</Section> <Section position="6" start_page="705" end_page="706" type="metho"> <SectionTitle> 4 Morphological Knowledge </SectionTitle> <Paragraph position="0"> The lexicon contains words in their root form, so morphological inflections must be generated. The system also performs derivational morphology, such as adjective->noun and noun->verb (ex: "translation"->"translate"), to give the generator more syntactic flexibility in expressing complex AMRs. This flexibility ensures that the generator can find a way to express a complex meaning represented by nested AMRs, and it is also useful for solving problems of syntactic divergence in MT.</Paragraph> <Paragraph position="1"> Both kinds of morphology are handled the same way: rules and exception tables are merged into a single, concise knowledge base. In the table for pluralizing nouns, for example, one rule says: if a noun ends in a consonant followed by "-o," then we compute two plural forms, one ending in "-os" and one ending in "-oes," and put both possibilities in the word lattice for the post-generation statistical extractor to choose between later. Deciding between these usually requires a large word list. However, the statistical extractor already has a strong preference for "photos" and "potatoes" over "photoes" and "potatos," so we do not need to create such a list. Here again corpus-based statistical knowledge greatly simplifies the task of symbolic generation.</Paragraph> <Paragraph position="2"> Derivational morphology raises the issue of meaning shift between different part-of-speech forms (such as "depart"->"departure"/"department"). Errors of this kind are infrequent, and are corrected in the morphology tables.</Paragraph> </Section> <Section position="7" start_page="706" end_page="706" type="metho"> <SectionTitle> 5 Generation Algorithm </SectionTitle> <Paragraph position="0"> An AMR is transformed into word lattices by the keyword-based grammar rules described in Section 7. By contrast, other generators organize their grammar rules around syntactic categories. A keyword-based organization helps keep the input specification simple, since syntactic information is not required from a client. This simplification can make Nitrogen more readily usable by client applications that are not inherently linguistically oriented; decisions about how to syntactically realize a given meaning can be left largely up to the generator.</Paragraph> <Paragraph position="1"> The top-level keywords of an AMR are used to match it with a rule (or rules). The algorithm is compositional, avoiding a combinatorial explosion in the number of rules needed for the various keyword combinations. A matching rule splits the AMR apart, associating a sub-AMR with each keyword and lumping the relations left over into a sub-AMR under the :rest role, using the same root as the original AMR. Each sub-AMR is itself recursively matched against the keyword rules, until the recursion bottoms out at a basic AMR which matches the instance rule.</Paragraph> <Paragraph position="2"> Lexical and morphological knowledge is used to build the initial word lattices associated with a concept when the recursion bottoms out. The instance rule then builds basic noun and verb groups from these, as well as basic word lattices for other syntactic categories. As the algorithm climbs out of the recursion, each rule concatenates together the lattices of its sub-AMRs to form longer phrases. The right-hand side of a rule specifies the needed syntactic category for each sub-lattice and the surface order of the concatenation, as well as the syntactic category of the new resulting lattice. Concatenation is performed by attaching the end state of one sub-lattice to the start state of the next.</Paragraph>
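<Paragraph> Concretely, concatenation as just described might look like the following minimal sketch; the (start, end, edges) representation of a word lattice is ours, for illustration only, since the paper does not specify Nitrogen's internal encoding.

import itertools

_ids = itertools.count()  # fresh state numbers

def word_slot(words):
    """A lattice with one slot offering each word as an alternative."""
    start, end = next(_ids), next(_ids)
    return (start, end, {start: [(w, end) for w in words]})

def concat(a, b):
    """Attach a's end state to b's start state (here via an *empty* arc)."""
    start_a, end_a, edges_a = a
    start_b, end_b, edges_b = b
    edges = {**edges_a, **edges_b}
    edges.setdefault(end_a, []).append(("*empty*", start_b))
    return (start_a, end_b, edges)

# "(the) dog/dogs" as a two-slot lattice with an optional determiner:
np = concat(word_slot(["the", "*empty*"]), word_slot(["dog", "dogs"]))
</Paragraph>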
<Paragraph position="3"> Upon emerging from the top-level rule, the lattice with the desired syntactic category, by default S (sentence), is selected and handed to the statistical extractor for ranking.</Paragraph> <Paragraph position="4"> The next sections describe further how lexical and morphological knowledge are used to build the initial word lattices, how underspecification is handled, and how the grammar is encoded.</Paragraph> </Section> <Section position="8" start_page="706" end_page="707" type="metho"> <SectionTitle> 6 The Instance Rule </SectionTitle> <Paragraph position="0"> The instance rule is the most basic rule, since it is applied to every concept in the AMR. This rule builds the initial word lattices for each lexical item and for basic noun and verb groups. Each concept in the AMR is eventually handed to the instance rule, where word lattices are constructed for all available parts of speech.</Paragraph> <Paragraph position="1"> The relational keywords that apply at the instance level are :polarity, :quant, :tense, and :modal. In cases where a meaning is underspecified and does not include these keywords, the instance rule uses a recasting mechanism (described below) to add some of them. If they are not specified, the system assumes positive polarity, both singular and plural quantities, all possible time frames, and no modality.</Paragraph> <Paragraph position="2"> Japanese nouns are often ambiguous with respect to number, so generating both singular and plural possibilities and allowing the statistical extractor to choose the best one results in better translation quality than rigidly choosing a single default, as traditional generation systems do. Allowing number to be unspecified in the input is useful for general English generation as well: there are many instances where the number of a noun is dictated more by usage convention or grammatical constraint than by semantic content. For example, "The company has (a plan/plans) to establish itself in February," or "This child won't eat any carrots" ("carrots" must be plural by grammatical constraint). It is easier for a client program if the input is not required to specify number in these cases, but is instead allowed to rely on the statistical extractor to supply the best one.</Paragraph> <Paragraph position="3"> In translation, there is frequently no direct correspondence between the tenses of different languages, so in Nitrogen tense can be coarsely specified as past, present, or future, but need not be specified at all. If it is not specified, Nitrogen generates lattices for the most common English tenses and allows the statistical extractor to choose the most likely one.</Paragraph> <Paragraph position="4"> The instance rule is factored into several sub-instance rules in three main categories: nouns, verbs, and miscellaneous. The noun instance rules are subdivided into two rules, one for plural noun phrases and one for singular. The verb instance rules are factored into two categories relating to modality and tense.</Paragraph> <Paragraph position="5"> Polarity can apply across all three main instance categories (noun, verb, and other), but only affects the level it appears in. When applied to nouns or adjectives, the result is "non-" prepended to the word, which conveys the general intention but is not usually very grammatical. Negative polarity is usually most fluently expressed in the verb rules with the word "not," e.g., "does not eat."5

5 We plan to generate more fluent expressions for negative polarity on nouns and adjectives, for example, "unhappy" instead of "non-happy."</Paragraph>
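<Paragraph> As a minimal sketch of how the instance rule can fan an underspecified :quant out into alternatives for the statistical extractor to rank later, consider the following; the function name is hypothetical, and the "+plural" token mirrors the morphological tokens mentioned in Section 7.

def noun_alternatives(root, quant=None):
    """Emit the inflection alternatives for a noun instance."""
    if quant == "singular":
        return [root]
    if quant == "plural":
        return [root + " +plural"]
    # :quant unspecified: emit both forms and let the extractor decide.
    return [root, root + " +plural"]

print(noun_alternatives("dog"))            # ['dog', 'dog +plural']
print(noun_alternatives("dog", "plural"))  # ['dog +plural']
</Paragraph>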
</Section> <Section position="9" start_page="707" end_page="708" type="metho"> <SectionTitle> 7 Grammar Formalism </SectionTitle> <Paragraph position="0"> The grammatical specifications in the keyword rules constitute the main formalism of the generation system. The rules map semantic and syntactic roles to grammatical word lattices. These roles include :agent, :patient, :domain, :range, :source, :destination, :spatial-locating, :subject, :object, :mod, etc.</Paragraph> <Paragraph position="1"> A simplified version of the rule for an AMR with :agent and :patient roles matches those two roles at the top level on its left-hand side, with the :rest keyword serving as a catch-all for the other roles that appear there. The rule specifies two ways to build a sentence, one an active voice version and the other passive. Since at this level the input may be underspecified regarding which voice to use, the statistical extractor is expected to choose the most fluent version later. Note also that this rule builds lattices for other parts of speech in addition to sentences (ex: "the consumption of the bone by the dogs"). In this way the generation algorithm works bottom-up, building lattices for the leaves (the innermost nested levels of the input) first, to be combined at outer levels according to the relations between the leaves. For example, the eating AMR shown earlier matches this rule; the resulting lattice contains *empty* transitions, here indicating an option for the null determiner. Before running, the statistical extractor removes all *empty* transitions by determinizing the word lattice. Note also the insertion of morphological tokens like +plural; inflectional morphology rules also apply during this determinizing stage.</Paragraph> <Paragraph position="2"> The :rest keyword in the rule head provides a handy mechanism for decoupling the possible keyword combinations. By means of this mechanism, keywords which generate relatively independent word lattices can be organized into separate rules, avoiding a combinatorial explosion in the number of rules which need to be written.</Paragraph>
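<Paragraph> The following minimal sketch shows rule matching with a :rest catch-all, using plain dicts for AMRs; the representation and names are ours for illustration, not Nitrogen's actual rule syntax.

def match_agent_patient(amr):
    """Split off :agent and :patient; bundle the leftover roles under
    :rest, keeping the original root ('/' instance) with them."""
    if ":agent" not in amr or ":patient" not in amr:
        return None
    rest = {role: sub for role, sub in amr.items()
            if role not in (":agent", ":patient")}
    return {"agent": amr[":agent"], "patient": amr[":patient"], "rest": rest}

amr = {"/": "|eat,take in|",
       ":agent": {"/": "|dog<canid|"},
       ":patient": {"/": "|bone|"},
       ":tense": "past"}
print(match_agent_patient(amr))  # :tense stays in the :rest sub-AMR
</Paragraph>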
<Section position="1" start_page="708" end_page="708" type="sub_section"> <SectionTitle> 7.1 Recasting Mechanism </SectionTitle> <Paragraph position="0"> The recasting mechanism used in the grammar formalism gives it unique power and flexibility. It enables the generator to transform one semantic representation into another (such as deep to shallow, or instance to sub-instance) and to accept as input a specification anywhere along this spectrum, permitting meaning to be encoded at whatever level is most convenient.</Paragraph> <Paragraph position="1"> The recasting mechanism also makes it possible to handle non-compositional aspects of language. One area in which we use this mechanism is the :domain rule. Take for example the sentence "It is necessary that the dog eat." It is sometimes most convenient to represent this with the modal concept attached directly to the eating instance (call it m9), and at other times as:

(m11 / |have the quality of being|
   :domain (m12 / |eat,take in|
              :agent (d / |dog<canid|))
   :range (m13 / |obligatory<necessary|))

but we can define the two to be semantically equivalent. In our system, both are accepted, and the first is automatically transformed into the second. Other ways to say this sentence include "The dog is required to eat" or "The dog must eat." However, the grammar formalism cannot express these, because doing so would require inserting the word lattice for |obligatory<necessary| within the lattice for m9 or m12, and the formalism can only concatenate lattices. The recasting mechanism solves this problem by recasting the above AMR a second time, so that the modal concept becomes a :modal role on the eating instance itself, which makes it possible to form these sentences.</Paragraph> <Paragraph position="2"> The :new and :add keywords signal an AMR recast; the list after the keyword contains the instructions for doing the recast. In the rule that recasts the first AMR into the second, the :new keyword means: build an AMR with a new root, |have the quality of being|, and two roles, one labeled :domain and assigned sub-AMR x2, the other labeled :range and assigned sub-AMR x1. The question mark causes a direct splice of the results from the recast. In the rule that recasts the second AMR into the third, the :add keyword means: insert into the sub-AMR of x2 a role labeled :modal, and assign to it the sub-AMR of x3, which is itself recast to include the roles in the sub-AMR of x1 but not its root. (This covers the case where other roles, such as polarity or time, need to be included in the new AMR.)</Paragraph> <Paragraph position="3"> In fact, recasting makes it possible to nest modals within modals to any desired depth, and even to attach polarity and tense at any level. For example, "It is not possible that it is required that you are permitted to go" can also be stated (more concisely) as "It cannot be required that you be permitted to go," "It is not possible that you must be permitted to go," or "You cannot have to be permitted to go." This is done by a grammar rule expressing the most deeply nested modal concept as a modal verb and the remaining modal concepts as a combination of regular verbs or adjective phrases. Our grammar includes a fairly complete model of obligation, possibility, permission, negation, tense, and all of their possible interactions.</Paragraph> </Section> </Section> </Paper>