<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2026">
  <Title>Trainable Methods for Surface Natural Language Generation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Previous Approaches
</SectionTitle>
    <Paragraph position="0"> Templates are the easiest way to implement surface NLG. A template for describing a flight noun phrase in the air travel domain might be "flight departing from $city-fr at $time-dep and arriving in $city-to at $time-arr", where the words starting with "$" are actually variables (representing the departure city, departure time, arrival city, and arrival time, respectively) whose values will be extracted from the environment in which the template is used. The approach of writing individual templates is convenient, but may not scale to complex domains in which hundreds or thousands of templates would be necessary, and may have shortcomings in maintainability and text quality (e.g., see (Reiter, 1995) for a discussion).</Paragraph>
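As a purely illustrative sketch (not from the paper), the template above could be filled in a few lines of Python; the helper name fill_template and the sample values are hypothetical:

```python
# Hypothetical sketch of template filling for surface NLG.
TEMPLATE = ("flight departing from $city-fr at $time-dep "
            "and arriving in $city-to at $time-arr")

def fill_template(template: str, values: dict) -> str:
    """Replace each $attribute variable with its value from the environment."""
    # Substitute longer attribute names first so one key cannot clobber
    # another key that happens to be its prefix.
    for attr in sorted(values, key=len, reverse=True):
        template = template.replace(attr, values[attr])
    return template

print(fill_template(TEMPLATE, {
    "$city-fr": "New York City", "$time-dep": "6 a.m.",
    "$city-to": "Seattle", "$time-arr": "9 a.m.",
}))
# flight departing from New York City at 6 a.m. and arriving in Seattle at 9 a.m.
```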
    <Paragraph position="1"> There are more sophisticated surface generation packages, such as FUF/SURGE (Elhadad and Robin, 1996), KPML (Bateman, 1996), MUMBLE (Meteer et al., 1987), and RealPro (Lavoie and Rambow, 1997), which produce natural language text from an abstract semantic representation. These packages require linguistic sophistication in order to write the abstract semantic representation, but they are flexible because minor changes to the input can effect major changes in the generated text.</Paragraph>
    <Paragraph position="2"> The only trainable approaches (known to the author) to surface generation are the purely statistical machine translation (MT) systems such as (Berger et al., 1996) and the corpus-based generation system described in (Langkilde and Knight, 1998). The MT systems of (Berger et al., 1996) learn to generate text in the target language straight from the source language, without the aid of an explicit semantic representation. In contrast, (Langkilde and Knight, 1998) uses corpus-derived statistical knowledge to rank plausible hypotheses from a grammar-based surface generation component.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="196" type="metho">
    <SectionTitle>
3 Trainable Surface NLG
</SectionTitle>
    <Paragraph position="0"> In trainable surface NLG, the goal is to learn the mapping from semantics to words that would otherwise need to be specified in a grammar or knowledge base. All systems in this paper use attribute-value pairs as a semantic representation, which suffice as a representation for a limited domain like air travel. For example, the set of attribute-value pairs { $city-fr = New York City, $city-to = Seattle, $time-dep = 6 a.m., $date-dep = Wednesday } represents the meaning of the noun phrase "a flight to Seattle that departs from New York City at 6 a.m. on Wednesday". The goal, more specifically, is then to learn the optimal attribute ordering and lexical choice for the text to be generated from the attribute-value pairs. For example, the NLG system should automatically decide if the attribute ordering in "flights to New York in the evening" is better or worse than the ordering in "flights in the evening to New York". Furthermore, it should automatically decide if the lexical choice in "flights departing to New York" is better or worse than the choice in "flights leaving to New York". The motivation for a trainable surface generator is to solve the above two problems in a way that reflects the observed usage of language in a corpus, but without the manual effort needed to construct a grammar or knowledge base.</Paragraph>
    <Paragraph position="1"> All the trainable NLG systems in this paper assume the existence of a large corpus of phrases in which the values of interest have been replaced with their corresponding attributes, or in other words, a corpus of generation templates. Figure 1 shows a sample of training data, where only words marked with a "$" are attributes. All of the NLG systems in this paper work in two steps as shown in Table 2.</Paragraph>
    <Paragraph position="2"> The systems NLG1, NLG2 and NLG3 all implement step 1; they produce a sequence of words intermixed with attributes, i.e., a template, from the attributes alone. The values are ignored until step 2, when they replace their corresponding attributes in the phrase produced by step 1.</Paragraph>
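The two-step division can be summarized in a short sketch, assuming a hypothetical generate wrapper; step1 stands for any of NLG1, NLG2, or NLG3:

```python
# Sketch of the two-step pipeline: step 1 uses only the attributes to produce
# a template; step 2 substitutes the values into that template.
def generate(attribute_values: dict, step1) -> str:
    template = step1(frozenset(attribute_values))   # step 1: attributes -> template
    phrase = template
    for attr in sorted(attribute_values, key=len, reverse=True):
        phrase = phrase.replace(attr, attribute_values[attr])  # step 2: fill values
    return phrase
```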
    <Section position="1" start_page="194" end_page="194" type="sub_section">
      <SectionTitle>
3.1 NLG1: the baseline
</SectionTitle>
      <Paragraph position="0"> The surface generation model NLG1 simply chooses the most frequent template in the training data that corresponds to a given set of attributes. Its performance is intended to serve as a baseline for the more sophisticated models discussed later. Specifically, nlg1(A) returns the phrase that corresponds to the attribute set A:</Paragraph>
      <Paragraph position="2"> $$\mathrm{nlg1}(A) = \arg\max_{\mathrm{phrase} \in T_A} C(\mathrm{phrase}, A)$$ where T_A are the phrases that have occurred with A in the training data, and where C(phrase, A) is the training data frequency of the natural language phrase phrase and the set of attributes A. NLG1 will fail to generate anything if A is a novel combination of attributes.</Paragraph>
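NLG1 amounts to a frequency-table lookup. A minimal sketch under that reading (the class and method names are invented):

```python
from collections import Counter, defaultdict

class NLG1:
    """Baseline: return the most frequent training template for an attribute set."""
    def __init__(self):
        self.counts = defaultdict(Counter)   # frozenset of attrs -> Counter of templates

    def train(self, corpus):
        # corpus: iterable of (frozenset_of_attributes, template_string) pairs
        for attrs, template in corpus:
            self.counts[attrs][template] += 1

    def generate(self, attrs):
        # argmax over T_A of C(phrase, A); returns None on a novel attribute set
        if attrs not in self.counts:
            return None
        return self.counts[attrs].most_common(1)[0][0]

nlg1 = NLG1()
nlg1.train([
    (frozenset({"$city-fr", "$city-to"}), "flights from $city-fr to $city-to"),
    (frozenset({"$city-fr", "$city-to"}), "flights from $city-fr to $city-to"),
    (frozenset({"$city-fr", "$city-to"}), "$city-fr to $city-to"),
])
print(nlg1.generate(frozenset({"$city-fr", "$city-to"})))
# flights from $city-fr to $city-to
```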
    </Section>
    <Section position="2" start_page="194" end_page="196" type="sub_section">
      <SectionTitle>
3.2 NLG2: n-gram model
</SectionTitle>
      <Paragraph position="0"> The surface generation system NLG2 assumes that the best choice to express any given attribute-value set is the word sequence with the highest probability that mentions all of the input attributes exactly once. When generating a word, it uses local information, captured by word n-grams, together with certain non-local information, namely, the subset of the original attributes that remain to be generated.</Paragraph>
      <Paragraph position="1"> The local and non-local information is integrated via features in a maximum entropy probability model, and a highly pruned search procedure attempts to find the best-scoring word sequence according to the model.</Paragraph>
      <Paragraph position="2">  The probability model in NLG2 is a conditional distribution over V ∪ {*stop*}, where V is the generation vocabulary and *stop* is a special "stop" symbol. The generation vocabulary V consists of all the words seen in the training data. The form of the maximum entropy probability model is identical to the one used in (Berger et al., 1996; Ratnaparkhi, 1998):</Paragraph>
      <Paragraph position="4"> $$p(w_i \mid w_{i-1}, w_{i-2}, attr_i) = \frac{\prod_j \alpha_j^{f_j(w_i, w_{i-1}, w_{i-2}, attr_i)}}{Z(w_{i-1}, w_{i-2}, attr_i)}$$ where w_i ranges over V ∪ {*stop*} and (w_{i-1}, w_{i-2}, attr_i) is the history, where w_i denotes the ith word in the phrase, and attr_i denotes the attributes that remain to be generated at position i in the phrase. The f_j, where f_j(a, b) ∈ {0, 1}, are called features and capture any information in the history that might be useful for estimating p(w_i | w_{i-1}, w_{i-2}, attr_i). The features used in NLG2 are described in the next section, and the feature weights α_j, obtained from the Improved Iterative Scaling algorithm (Berger et al., 1996), are set to maximize the likelihood of the training data. The probability of the sequence W = w_1 ... w_n, given the attribute set A (and also given that its length is n), is: $$\Pr(W \mid A, n) = \prod_{i=1}^{n} p(w_i \mid w_{i-1}, w_{i-2}, attr_i)$$</Paragraph>
      <Paragraph position="6"> The feature patterns used in NLG2 are shown in Table 3. The actual features are created by matching the patterns over the training data; e.g., an actual feature derived from the word bi-gram template might test whether w_{i-1} = "to" and w_i = $city-to.</Paragraph>
      <Paragraph position="8"> [Figure 1: sample training templates, e.g.:
        flights on $air from $city-fr to $city-to the $time-depint of $date-dep
        $trip flights on $air from $city-fr to $city-to leaving after $time-depaft on $date-dep
        flights leaving from $city-fr going to $city-to after $time-depaft on $date-dep
        flights leaving from $city-fr to $city-to the $time-depint of $date-dep
        $air flight $fltnum from $city-fr to $city-to on $date-dep
        $city-fr to $city-to $air flight $fltnum on the $date-dep
        $trip flights from $city-fr to $city-to]
      [Table 2: the two generation steps.
        Input to Step 1: { $city-fr, $city-to, $time-dep, $date-dep }
        Output of Step 1: "a flight to $city-to that departs from $city-fr at $time-dep on $date-dep"
        Input to Step 2: the template above, plus { $city-fr = New York City, $city-to = Seattle, $time-dep = 6 a.m., $date-dep = Wednesday }
        Output of Step 2: "a flight to Seattle that departs from New York City at 6 a.m. on Wednesday"]
      Low frequency features involving word n-grams tend to be unreliable; the NLG2 system therefore only uses features which occur K times or more in the training data.</Paragraph>
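A sketch of the conditional distribution in product-of-weights form may clarify how features and weights interact; the function names below are invented, and training of the weights (Improved Iterative Scaling) is not shown:

```python
def p_next_word(w, w1, w2, attrs, features, weights, vocab):
    """p(w | w_{i-1}, w_{i-2}, attr_i) in maximum entropy (product-of-weights) form.
    `features` is a list of binary functions f_j(w, w1, w2, attrs); `weights`
    holds the corresponding alpha_j, assumed already trained."""
    def score(candidate):
        s = 1.0
        for f, alpha in zip(features, weights):
            if f(candidate, w1, w2, attrs):
                s *= alpha
        return s
    z = sum(score(c) for c in vocab | {"*stop*"})   # normalizer Z(history)
    return score(w) / z
```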
      <Paragraph position="9">  The search procedure attempts to find a word sequence w_1 ... w_n of any length n ≤ M for the input attribute set A such that:
  1. w_n is the stop symbol *stop*
  2. All of the attributes in A are mentioned at least once
  3. All of the attributes in A are mentioned at most once
      where M is a heuristically set maximum phrase length.</Paragraph>
      <Paragraph position="10"> The search is similar to a left-to-right breadth-first search, except that only a fraction of the word sequences are considered. More specifically, the search procedure implements the recurrence:</Paragraph>
      <Paragraph position="12"> $$W_{N,1} = \mathrm{top}(N, \{\, w_1 \mid w_1 \in V \,\}), \qquad W_{N,i+1} = \mathrm{top}(N, \mathrm{next}(W_{N,i}))$$ The set W_{N,i} is the top N scoring sequences of length i, and the expression next(W_{N,i}) returns all sequences w_1 ... w_{i+1} such that w_1 ... w_i ∈ W_{N,i} and w_{i+1} ∈ V ∪ {*stop*}. The expression top(N, next(W_{N,i})) finds the top N sequences in next(W_{N,i}). During the search, any sequence that ends with *stop* is removed and placed in the set of completed sequences. If N completed hypotheses are discovered, or if W_{N,M} is computed, the search terminates. Any incomplete sequence which does not satisfy condition (3) is discarded, and any complete sequence that does not satisfy condition (2) is also discarded.</Paragraph>
      <Paragraph position="13"> When the search terminates, there will be at most N completed sequences, of possibly differing lengths. Currently, there is no normalization for different lengths, i.e., all sequences of length n &lt; M are equiprobable:</Paragraph>
      <Paragraph position="15"> $$\mathrm{nlg2}(A) = \arg\max_{W \in W_{nlg2}} \Pr(W \mid A)$$ where W_{nlg2} are the completed word sequences that satisfy the conditions of the NLG2 search described above.</Paragraph>
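A compact sketch of the pruned search (parameter names follow the text; the prob callback stands for the trained model sketched earlier, and length normalization is omitted, matching the uniform-length assumption):

```python
def nlg2_search(attrs, vocab, prob, N=5, M=20):
    """Pruned left-to-right breadth-first search for NLG2 (illustrative sketch)."""
    attrs = frozenset(attrs)
    beam = [((), 1.0, attrs)]            # (words so far, score, attrs remaining)
    done = []                            # completed sequences ending in *stop*
    for _ in range(M):
        candidates = []
        for words, score, remaining in beam:
            w1 = words[-1] if words else None
            w2 = words[-2] if len(words) > 1 else None
            for w in vocab | {"*stop*"}:
                if w in attrs and w not in remaining:
                    continue             # condition 3: attribute already mentioned
                s = score * prob(w, w1, w2, remaining)
                if w == "*stop*":
                    if not remaining:    # condition 2: all attributes mentioned
                        done.append((words, s))
                else:
                    candidates.append((words + (w,), s, remaining - {w}))
        if len(done) >= N or not candidates:
            break
        beam = sorted(candidates, key=lambda c: -c[1])[:N]   # advance top N only
    return max(done, key=lambda d: d[1])[0] if done else None
```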
      <Paragraph position="16"> 3.3 NLG3: dependency information
      NLG3 addresses a shortcoming of NLG2, namely that the previous two words are not necessarily the best informants when predicting the next word. Instead, NLG3 assumes that conditioning on syntactically related words in the history will result in more accurate surface generation. The search procedure in NLG3 generates a syntactic dependency tree from</Paragraph>
      <Paragraph position="18"> top-to-bottom instead of a word sequence from left-to-right, where each word is predicted in the context of its syntactically related parent, grandparent, and siblings. NLG3 requires a corpus that has been annotated with tree structure, like the sample dependency tree shown in Figure 1.</Paragraph>
      <Paragraph position="19">  The probability model for NLG3, shown in Figure 2, conditions on the parent, the two closest siblings, the direction of the child relative to the parent, and the attributes that remain to be generated.</Paragraph>
      <Paragraph position="20"> Just as in NLG2, p is a distribution over V ∪ {*stop*}, and the Improved Iterative Scaling algorithm is used to find the feature weights α_j. The expression ch_i(w) denotes the ith closest child to the headword w, par(w) denotes the parent of the headword w, dir ∈ {left, right} denotes the direction of the child relative to the parent, and attr_{w,i} denotes the attributes that remain to be generated in the tree when headword w is predicting its ith child. For example, in Figure 1, if w = "flights", then ch_1(w) = "evening" when generating the left children, and ch_1(w) = "from" when generating the right children. As shown in Figure 3, the probability of a dependency tree that expresses an attribute set A can be found by computing, for each word in the tree, the probability of generating its left children and then its right children.[1] In this formulation, the left children are generated independently from the right children. As in NLG2, NLG3 assumes the uniform distribution for the length probabilities Pr(# of left children = n) and Pr(# of right children = n) up to a certain maximum length M' = 10.</Paragraph>
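The left-then-right factorization can be written down directly; the node type and the p callback below are invented for the sketch, and the bookkeeping that updates attr_{w,i} as attributes are emitted is simplified away:

```python
from dataclasses import dataclass, field

STOP = "*stop*"

@dataclass
class Node:
    word: str
    left: list = field(default_factory=list)    # closest child first
    right: list = field(default_factory=list)

def tree_prob(node, attrs, p):
    """Probability of the subtree rooted at `node`: the headword generates its
    left children, then independently its right children, each sequence
    terminated by the stop symbol. `p(child, parent, sib1, sib2, dir, attrs)`
    stands for the trained NLG3 model."""
    total = 1.0
    for direction, children in (("left", node.left), ("right", node.right)):
        sib1, sib2 = None, None                  # two closest siblings so far
        for child in children:
            total *= p(child.word, node.word, sib1, sib2, direction, attrs)
            total *= tree_prob(child, attrs, p)  # recurse into the child's subtree
            sib1, sib2 = child.word, sib1
        total *= p(STOP, node.word, sib1, sib2, direction, attrs)
    return total
```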
      <Paragraph position="21">  The feature patterns for NLG3 are shown in Table 4. As before, the actual features are created by matching the patterns over the training data. The features in NLG3 have access to syntactic information whereas the features in NLG2 do not. Low frequency features involving word n-grams tend to be unreliable; the NLG3 system therefore only uses features which occur K times or more in the training data. Furthermore, if a feature derived from Table 4 looks at a particular word ch_i(w) and attribute a, we only allow it if a has occurred as a descendant of ch_i(w) in some dependency tree in the training set. [1: We use a dummy ROOT node to generate the topmost headword of the phrase.]</Paragraph>
      <Paragraph position="22"> As an example, this condition allows features that look at ch_i(w) = "to" and $city-to ∈ attr_{w,i}, but disallows features that look at ch_i(w) = "to" and $city-fr ∈ attr_{w,i}.</Paragraph>
    </Section>
    <Section position="3" start_page="196" end_page="196" type="sub_section">
      <SectionTitle>
3.4 Search Procedure
</SectionTitle>
      <Paragraph position="0"> The idea behind the search procedure for NLG3 is similar to the search procedure for NLG2, namely, to explore only a fraction of the possible trees by continually sorting and advancing only the top N trees at any given point. However, the dependency trees are not built left-to-right like the word sequences in NLG2; instead they are built from the current head (which is initially the root node) in the following order:
  1. Predict the next left child (call it x_l)
  2. If it is *stop*, jump to (4)
  3. Recursively predict children of x_l. Resume from (1)
  4. Predict the next right child (call it x_r)
  5. If it is *stop*, we are done predicting children for the current head
  6. Recursively predict children of x_r. Resume from (4)
      As before, any incomplete trees that have generated a particular attribute twice, as well as completed trees that have not generated a necessary attribute, are discarded by the search. The search terminates when either N complete trees or N trees of the maximum length M are discovered. NLG3 chooses the best answer to express the attribute set A as follows:</Paragraph>
      <Paragraph position="2"> $$\mathrm{nlg3}(A) = \arg\max_{T \in T_{nlg3}} \Pr(T \mid A)$$ where T_{nlg3} are the completed dependency trees that satisfy the conditions of the NLG3 search described above.</Paragraph>
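The numbered prediction order above can be expressed as a short recursive routine; the predict oracle is hypothetical and the top-N beam bookkeeping is omitted:

```python
def predict_children(head, predict):
    """Grow the tree below `head` in NLG3's order (steps 1-6 above).
    `predict(head, direction)` returns the next child, or None for *stop*."""
    while True:                            # steps 1-3: left children
        x_l = predict(head, "left")
        if x_l is None:                    # *stop*: jump to the right side
            break
        predict_children(x_l, predict)     # recursively expand x_l
    while True:                            # steps 4-6: right children
        x_r = predict(head, "right")
        if x_r is None:                    # *stop*: done with this head
            return
        predict_children(x_r, predict)
```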
    </Section>
  </Section>
  <Section position="5" start_page="196" end_page="198" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> The training and test sets used to evaluate NLG1, NLG2 and NLG3 were derived semi-automatically from a pre-existing annotated corpus of user queries in the air travel domain. The annotation scheme used a total of 26 attributes to represent flights.</Paragraph>
    <Paragraph position="4"> The training set consisted of 6000 templates describing flights while the test set consisted of 1946 templates describing flights. All systems used the same training set, and were tested on the attribute sets extracted from the phrases in the test set. For example, if the test set contains the template "flights to $city-to leaving at $time-dep", the surface generation systems will be told to generate a phrase for the attribute set { $city-to, $time-dep }. The output of NLG3 on the attribute set { $city-to, $city-fr, $time-dep } is shown in Table 9.</Paragraph>
    <Paragraph position="5"> There does not appear to be an objective automatic evaluation method[2] for generated text that correlates with how an actual person might judge the output. Therefore, two judges -- the author and a colleague -- manually evaluated the output of all three systems. Each judge assigned each phrase from each of the three systems one of the following rankings:
  Correct: Perfectly acceptable
  OK: Tense or agreement is wrong, but word choice is correct. (These errors could be corrected by post-processing with a morphological analyzer.)
  Bad: Words are missing or extraneous words are present
  No Output: The system failed to produce any output
      While there were a total of 1946 attribute sets from the test examples, the judges only needed to evaluate the 190 unique attribute sets; e.g., the attribute set { $city-fr $city-to } occurs 741 times in the test data. Subjective evaluation of generation output is</Paragraph>
    <Paragraph position="6"> not ideal, but is arguably superior to an automatic evaluation that fails to correlate with human linguistic judgement. [2: Measuring word overlap or edit distance between the system's output and a "reference" set would be an automatic scoring method. We believe that such a method does not accurately measure the correctness or grammaticality of the text.]</Paragraph>
    <Paragraph position="7"> The results of the manual evaluation, as well as the values of the search and feature selection parameters for all systems, are shown in Tables 5, 6, 7, and 8. (The values for N, M, and K were determined by manually evaluating the output of the 4 or 5 most common attribute sets in the training data.) The weighted results in Tables 5 and 6 account for multiple occurrences of attribute sets, whereas the unweighted results in Tables 7 and 8 count each unique attribute set once; i.e., { $city-fr $city-to } is counted 741 times in the weighted results but once in the unweighted results. Using the weighted results, which represent testing conditions more realistically than the unweighted results, both judges found an improvement from NLG1 to NLG2, and from NLG2 to NLG3. NLG3 cuts the error rate from NLG1 by at least 33% (counting anything without a rank of Correct as wrong). NLG2 cuts the error rate by at least 22% and underperforms NLG3, but requires far less annotation in its training data. NLG1 has no chance of generating anything for 3% of the data; it fails completely on novel attribute sets. Using the unweighted results, both judges found an improvement from NLG1 to NLG2, but, surprisingly, judge A found a slight decrease while judge B found an increase in accuracy from NLG2 to NLG3. The unweighted results show that the baseline NLG1 does well only on the common attribute sets: it correctly generates less than 50% of the unweighted cases but over 80% of the weighted cases.</Paragraph>
  </Section>
  <Section position="6" start_page="198" end_page="199" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> The NLG2 and NLG3 systems automatically attempt to generalize from the knowledge inherent in the training corpus of templates, so that they can generate templates for novel attribute sets. There is some additional cost associated with producing the syntactic dependency annotation necessary for NLG3, but virtually no additional cost is associated with NLG2, beyond collecting the data itself and identifying the attributes.
      [Table 9 input attributes: $time-dep = "10 a.m.", $city-fr = "New York", $city-to = "Miami"]</Paragraph>
    <Paragraph position="1"> The trainable surface NLG systems in this paper differ from grammar-based systems in how they determine the attribute ordering and lexical choice.</Paragraph>
    <Paragraph position="2"> NLG2 and NLG3 automatically determine attribute ordering by simultaneously searching multiple orderings. In grammar-based approaches, such preferences need to be manually encoded. NLG2 and NLG3 solve the lexical choice problem by learning the words (via features in the maximum entropy probability model) that correlate with a given attribute and local context, whereas (Elhadad et al., 1997) uses a rule-based approach to decide the word choice.</Paragraph>
    <Paragraph position="3"> While trainable approaches avoid the expense of crafting a grammar to determine attribute ordering and lexical choice, they are less accurate than grammar-based approaches. For short phrases, accuracy is typically 100% with grammar-based approaches, since the grammar writer can either correct or add a rule to generate the phrase of interest once an error is detected. With NLG2 and NLG3, by contrast, one can tune the feature patterns, search parameters, and training data itself, but there is no guarantee that the tuning will result in 100% generation accuracy.</Paragraph>
    <Paragraph position="4"> Our approach differs from the corpus-based surface generation approaches of (Langkilde and Knight, 1998) and (Berger et al., 1996). (Langkilde and Knight, 1998) maps from semantics to words with a concept ontology, grammar, and lexicon, and ranks the resulting word lattice with corpus-based statistics, whereas NLG2 and NLG3 automatically learn the mapping from semantics to words from a corpus. (Berger et al., 1996) describes a statistical machine translation approach that generates text in the target language directly from the source text.</Paragraph>
    <Paragraph position="5"> NLG2 and NLG3 are also statistical learning approaches but generate from an actual semantic representation. This comparison suggests that statistical MT systems could also generate text from an &amp;quot;interlingua&amp;quot;, in a way similar to that of knowledge-based translation systems.</Paragraph>
    <Paragraph position="6"> We suspect that our statistical generation approach should perform accurately in domains of similar complexity to air travel. In the air travel domain, the length of a phrase fragment to describe an attribute is usually only a few words. Domains which require complex and lengthy phrase fragments to describe a single attribute will be more challenging to model with features that only look at word n-grams for n ∈ {2, 3}. Domains in which there is greater ambiguity in word choice will require a more thorough search, i.e., a larger value of N, at the expense of CPU time and memory. Most importantly, the semantic annotation scheme for air travel has the property that it is both rich enough to accurately represent meaning in the domain, but simple enough to yield useful corpus statistics. Our approach may not scale to domains, such as freely occurring newspaper text, in which the semantic annotation schemes do not have this property.</Paragraph>
    <Paragraph position="7"> Our current approach has the limitation that it ignores the values of attributes, even though they might strongly influence the word order and word choice. This limitation can be overcome by using features on values, so that NLG2 and NLG3 might discover -- to use a hypothetical example -- that "flights leaving $city-fr" is preferred over "flights from $city-fr" when $city-fr is a particular value, such as "Miami".</Paragraph>
  </Section>
</Paper>