<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1426"> <Title>THE PRACTICAL VALUE OF N-GRAMS IS IN GENERATION</Title> <Section position="4" start_page="0" end_page="253" type="evalu"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"> Nitrogen's symbolic generator works from underspecified input and simple lexical, morphological, and grammatical knowledge bases. Most of the linguistic decisions that need to be made in realizing the input are left up to the extractor. These include word choice, number, determinateness, subject-verb agreement, verb tense, passive versus active voice, verb subcategorization, expression of possessiveness, and others. To illustrate this more concretely, we briefly describe Nitrogen's symbolic knowledge bases and input.</Paragraph> <Paragraph position="1"> Lexical knowledge in Nitrogen is stored as a list of mappings from words to conceptual meanings in the Sensus ontology. The list is composed of tuples in the form:

(<word> <part-of-speech> <rank> <concept>)

Examples:

(&quot;eat&quot; VERB 1 |eat,take in|)
(&quot;eat&quot; VERB 2 |eat>eat lunch|)
(&quot;take in&quot; VERB 14 |eat,take in|)

The <rank> field orders the concepts by sense frequency for the given word, with a lower number signifying a more frequent sense. This simple lexicon contains no information about features such as transitivity, subcategorization, gradability (for adjectives), or countability (for nouns).</Paragraph> <Paragraph position="2"> Only the root forms of words are stored in the lexicon, so morphological variants must be derived. Nitrogen's morphological knowledge base consists of a simple table that concisely merges both rules and exceptions. Here, for example, is a portion of the table for pluralizing nouns:

(&quot;-child&quot; &quot;children&quot;)
(&quot;-person&quot; &quot;people&quot; &quot;persons&quot;)
(&quot;-x&quot; &quot;xes&quot; &quot;xen&quot;) ; boxes/oxen

Nitrogen's list of exceptions is greatly reduced, since the statistical extractor can be relied upon to discard non-word morphological derivations generated by the simplified rules (such as &quot;boxen&quot; or &quot;oxes&quot; generated from the rule above). These words never occur in the corpus, and so are assigned a very low score in the extractor phase.</Paragraph> <Paragraph position="3"> Nitrogen's grammar rules are organized around abstract semantic relations. This allows the details of syntactic structure to be decided internally, rather than requiring a client application to supply them. Nitrogen uses a recasting mechanism to reformulate the input in terms of relations that are less abstract, more syntactic, and closer to the final linear sentence realization. Recasting usually performs a one-to-many mapping (narrowing the final output to a single result is again left up to the statistical extractor). The multiple mappings reflect different constituent orders, different syntactic realizations, and different linguistic decisions for the insertion of non-content words. Examples of the types of relations that Nitrogen handles are: Semantic: :agent, :patient, :domain, :range, :source, :destination, :spatial-locating, and others. [The remainder of the relation list and the figure showing the example input are not reproduced here.] The example input is a labeled directed graph, or feature structure, that encodes relationships between the entities A, A2, and T. Concept names are enclosed in vertical bars; we use an automated form of WordNet 1.5 (Miller, 1990). The slash after a label is shorthand for the :INSTANCE relation.</Paragraph>
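<Paragraph> To make this concrete, here is a hypothetical Python sketch (illustrative only, not Nitrogen's actual implementation) of how such a suffix table can be applied permissively, emitting every candidate and leaving it to the extractor to discard non-words:

PLURAL_TABLE = [
    ('child', ['children']),            # exception, matched by suffix
    ('person', ['people', 'persons']),  # two acceptable plurals
    ('x', ['xes', 'xen']),              # simplified rule: covers box/ox, overgenerates
    ('', ['s']),                        # default rule: append -s
]

def pluralize(noun):
    # Return every plural candidate licensed by the first matching rule;
    # the statistical extractor later assigns non-words a very low score.
    for suffix, endings in PLURAL_TABLE:
        if noun.endswith(suffix):
            stem = noun[:len(noun) - len(suffix)] if suffix else noun
            return [stem + ending for ending in endings]

print(pluralize('box'))  # ['boxes', 'boxen']
print(pluralize('ox'))   # ['oxes', 'oxen']
</Paragraph>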
<Paragraph position="4"> Symbolic knowledge in Nitrogen is used to convert an input to a word lattice, from which the statistical extractor selects the best output. Nitrogen's extraction algorithm is based on a bigram scoring method, where the probability of a sentence w_1 ... w_n is approximated as</Paragraph> <Paragraph position="5"> P(w_1 \cdots w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1}) </Paragraph> <Paragraph position="6"> This method is similar to the approach used in speech recognition. Every sentence path through the word lattice is ranked according to this formula, and the best one is selected as the final output.</Paragraph> <Section position="1" start_page="0" end_page="253" type="sub_section"> <SectionTitle> 2.1 Example 1 </SectionTitle> <Paragraph position="0"> In order to describe the synergy that exists in Nitrogen between symbolic and statistical methods, it seems easiest to look in detail at a few examples of Nitrogen's actual inputs and outputs. As a first example we will use the input above.</Paragraph> <Paragraph position="1"> Symbolic generation for this input produces a word lattice containing 270 nodes, 592 arcs, and 155,764 distinct paths. Space permits only two portions (about one-fifth) of the sentence lattice to be shown in Figure 2. An example of a random path through the lattice not shown is &quot;Betrayal of an trust of them by me is not having a feasibility.&quot; The top 10 paths selected by the statistical extractor are: I cannot betray their trust .</Paragraph> <Paragraph position="2"> I will not be able to betray their trust . I am not able to betray their trust .</Paragraph> <Paragraph position="3"> I are not able to betray their trust .</Paragraph> <Paragraph position="4"> I is not able to betray their trust .</Paragraph> <Paragraph position="5"> I cannot betray the trust of them .</Paragraph> <Paragraph position="6"> I cannot betray trust of them .</Paragraph> <Paragraph position="7"> I cannot betray a trust of them .</Paragraph> <Paragraph position="8"> I cannot betray trusts of them .</Paragraph> <Paragraph position="9"> I will not be able to betray the trust of them .</Paragraph> <Paragraph position="10"> The statistical extractor inherently prefers common words and word combinations. When a subject and a verb are contiguous, it automatically prefers the verb conjugation that agrees with the subject. When a determiner and its head noun are contiguous, it automatically prefers the most grammatical combination (not &quot;a trusts&quot; or &quot;an trust&quot;). For example, here are the corpus counts for some of the subject-verb bigrams and determiner-noun bigrams:</Paragraph> <Paragraph position="11"> [The bigram count table is not reproduced here.] </Paragraph> <Paragraph position="12"> Note that the secondary preference for &quot;I are&quot; over &quot;I is&quot; makes sense for sentences like &quot;John and I are...&quot; Also note the apparent preference for the singular form of &quot;trust&quot; over the plural form, a subtle reflection of the most common meaning of the word &quot;trust&quot;. This same preference is reflected for the bigrams &quot;their trust&quot; versus &quot;their trusts&quot;:

their trust 28
their trusts 8

</Paragraph> <Paragraph position="14"> A purely symbolic generator would need a lot of deep, handcrafted knowledge to infer that the &quot;reliance&quot; meaning of trust must be singular. (The plural form is used in talking about monetary trusts, or as a verb.)</Paragraph> <Paragraph position="15"> Our generator handles it automatically, based simply on the evidence of colloquial frequency.</Paragraph>
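<Paragraph> That evidence can be made concrete with a hypothetical sketch (illustrative Python, not Nitrogen's actual code) that ranks two of the lattice paths above with the bigram formula from the beginning of this section. Most counts below are invented; the &quot;their trust&quot;/&quot;their trusts&quot; bigram counts and the &quot;trust&quot;/&quot;trusts&quot; unigram counts are the ones quoted in the text.

import math

# Toy counts; only the trust-related entries are taken from the text.
UNIGRAMS = {'i': 50000, 'cannot': 3000, 'betray': 300, 'their': 20000,
            'trust': 6100, 'trusts': 1083, 'of': 90000, 'them': 8000}
BIGRAMS = {('i', 'cannot'): 900, ('cannot', 'betray'): 12,
           ('betray', 'their'): 8, ('their', 'trust'): 28,
           ('their', 'trusts'): 8, ('betray', 'trusts'): 1,
           ('trusts', 'of'): 40, ('of', 'them'): 600}

def log_score(path, floor=1e-6):
    # log P(w1..wn) is approximated as log P(w1) plus the sum of
    # log P(wi | wi-1); unseen bigrams get a tiny floor probability,
    # which is how non-words like 'boxen' are effectively discarded.
    words = [w.lower() for w in path.split()]
    total = math.log(UNIGRAMS.get(words[0], 1) / sum(UNIGRAMS.values()))
    for prev, cur in zip(words, words[1:]):
        prob = BIGRAMS.get((prev, cur), 0) / UNIGRAMS.get(prev, 1)
        total += math.log(max(prob, floor))
    return total

paths = ['I cannot betray their trust', 'I cannot betray trusts of them']
print(max(paths, key=log_score))  # 'I cannot betray their trust'

Note that the shorter path also accumulates one fewer log-probability penalty, which is one way a preference for shorter phrases can emerge.</Paragraph>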
<Paragraph position="16"> A heuristic for preferring shorter phrases accounts for the preference of sentence 1 over sentence 2, and also for the preference of &quot;their trust(s)&quot; over &quot;the trust(s) of them&quot;. As can be seen from the lattice in Figure 2, the generator must also make a word choice decision in expressing the concept |trust,reliance|.</Paragraph> <Paragraph position="17"> It has two root choices, &quot;trust&quot; and &quot;reliance&quot;:

reliance 567    reliances 0
trust 6100      trusts 1083

The generator's preference is predicted by these unigram counts: trust(s) is much more common than reliance(s). Though one could modify the symbolic lexicon to suit one's needs more precisely, doing so on a [...]</Paragraph> <Paragraph position="18"> [A stretch of text is missing here, including the introduction of the second example; its top-ranked outputs follow.] Visitors who came in Japan admire Mount Fuji .</Paragraph> <Paragraph position="19"> Visitors who came in Japan admires Mount Fuji .</Paragraph> <Paragraph position="20"> Visitors who arrived in Japan admire Mount Fuji . Visitors who arrived in Japan admires Mount Fuji .</Paragraph> <Paragraph position="21"> Visitors who came to Japan admire Mount Fuji .</Paragraph> <Paragraph position="22"> A visitor who came in Japan admire Mount Fuji .</Paragraph> <Paragraph position="23"> The visitor who came in Japan admire Mount Fuji .</Paragraph> <Paragraph position="24"> Visitors who came to Japan admires Mount Fuji .</Paragraph> <Paragraph position="25"> A visitor who came in Japan admires Mount Fuji . The visitor who came in Japan admires Mount Fuji . Mount Fuji is admired by a visitor who came in Japan .</Paragraph> <Paragraph position="26"> This example offers more word choice decisions to the generator than the previous example. There are two root choices for the concept |visitor|, &quot;visitor&quot; and &quot;visitant,&quot; and two for |arrive,get|. Between &quot;visitor&quot; and &quot;visitant&quot;, it is easy to guess that the generator will prefer &quot;visitor.&quot; However, the choice between singular and plural forms of &quot;visitor&quot; results in a decision opposite to the one made for &quot;trust&quot; above:</Paragraph> <Paragraph position="27">

visitor 575    visitors 1083

In this case the generator prefers the plural form. In choosing between forms of &quot;come&quot; and &quot;arrive,&quot; the generator unsurprisingly prefers the more common word &quot;come&quot; and its derived forms.</Paragraph>
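<Paragraph> These unigram-driven choices are simple enough to sketch directly (a hypothetical illustration, not Nitrogen's code; the counts are the corpus counts quoted above):

COUNTS = {'trust': 6100, 'trusts': 1083, 'reliance': 567, 'reliances': 0,
          'visitor': 575, 'visitors': 1083}

def pick(candidates):
    # Return the surface form with the highest unigram count.
    return max(candidates, key=lambda w: COUNTS.get(w, 0))

print(pick(['trust', 'trusts', 'reliance', 'reliances']))  # 'trust'
print(pick(['visitor', 'visitors']))                       # 'visitors'
</Paragraph>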
<Paragraph position="28"> For relative pronouns, the extractor is given a choice between &quot;that,&quot; &quot;which,&quot; and &quot;who,&quot; and nicely picks &quot;who&quot; in this case, though it has no symbolic knowledge of the grammatical constraint that &quot;who&quot; is only used to refer to people.</Paragraph> <Paragraph position="29"> As can be seen from the lattice in Figure 3, the generator is given a wide variety of choices for verb tense.</Paragraph> <Paragraph position="30"> The heuristic preferring shorter phrases in effect narrows the choice down to the one-word expressions.</Paragraph> <Paragraph position="31"> Between these, the past tense is chosen, which is typical:

who_came 383     who_come 142     who_comes 89
who_arrived 86   who_arrive 15    who_arrives 11

A notable error of the generator in this example is in choosing the preposition &quot;in&quot; to precede &quot;Japan&quot;:

in_Japan 5413    to_Japan 1196

Apparently, in general, Japan is usually preceded by the preposition &quot;in.&quot; It is the context of the verb &quot;come&quot; that makes one wish the generator would choose &quot;to&quot;, because &quot;arrived in Japan&quot; sounds fine, while &quot;came in Japan&quot; does not:

came_to 2443     arrived_in 544
came_in 1498     arrived_to 35
came_into 244    arrived_into 0

However, &quot;in Japan&quot; is so much stronger than &quot;to Japan&quot; when compared with &quot;came to&quot; versus &quot;came in&quot; that &quot;in&quot; wins. The problem here is that bigrams cannot capture dependencies that exist between more than two words. A look at trigrams (not currently used by our system) shows that this dependency does indeed exist.</Paragraph> <Paragraph position="34"> [The trigram counts are not reproduced here.] Later we will discuss possible improvements to Nitrogen that will address this problem.</Paragraph> <Paragraph position="35"> A final aspect of this example worth discussing is the choice of person for the root verb &quot;admire.&quot; In this case, all the relevant bigrams are zero (i.e., those between &quot;Japan&quot; and &quot;admire,&quot; and between &quot;admire&quot; and &quot;Mount&quot;), so the decision essentially defaults to the unigrams:

admire 212    admired 211    admires 107

</Paragraph> <Paragraph position="36"> In this example the generator simply got lucky in its choice: if there had been non-zero bigrams, they would most likely have caused the generator to make the wrong choice, selecting a third person singular conjugation to agree with the contiguous word &quot;Japan&quot; rather than a third person plural conjugation to agree with the true subject &quot;visitors&quot;.</Paragraph> <Paragraph position="37"> Note that another weakness of the current extractor in choosing the conjugation of a verb is that it weighs the bigram between the verb and the preceding word the same as the bigram between the verb and the following word, although the former dependency is more important in choosing grammatical output. In other words, for a phrase like &quot;visitors admire(s) Mount&quot;, the bigrams &quot;admire(s) Mount&quot; are weighed equally with the bigrams &quot;visitors admire(s)&quot;, though obviously &quot;Mount&quot; should be less relevant than &quot;visitors&quot; in choosing between &quot;admire&quot; and &quot;admires&quot;.</Paragraph>
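<Paragraph> This weakness can be illustrated with a small hypothetical sketch (the counts and weights below are invented for illustration and are not Nitrogen's): weighting the verb's left bigram more heavily than its right bigram lets subject-verb agreement dominate verb-object collocation.

BIGRAMS = {('visitors', 'admire'): 10, ('visitors', 'admires'): 1,
           ('admire', 'mount'): 2, ('admires', 'mount'): 20}

def conjugation_score(prev, verb, nxt, left_weight=1.0, right_weight=1.0):
    # The current extractor effectively uses equal weights, so the
    # verb-object bigram counts as much as the subject-verb bigram.
    left = BIGRAMS.get((prev, verb), 0)
    right = BIGRAMS.get((verb, nxt), 0)
    return left_weight * left + right_weight * right

for verb in ('admire', 'admires'):
    print(verb,
          conjugation_score('visitors', verb, 'mount'),                  # equal weights
          conjugation_score('visitors', verb, 'mount', left_weight=3.0)) # favor the subject

# Equal weights pick the ungrammatical 'admires' (21 vs 12); tripling the
# subject-verb weight picks the grammatical 'admire' (32 vs 23).
</Paragraph> </Section> </Section> </Paper>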