<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1601"> <Title>Statistical Generation: Three Methods Compared and Evaluated</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Generate-and-select NLG </SectionTitle> <Paragraph position="0"> Generate-and-select NLG separates the definition of the space of all possible generation processes (the generation space) from the mechanism that controls which (set of) realisation(s) is selected as the output. Generate-and-select methods vary primarily along three dimensions: (i) Number of (partial) solutions generated at each step: some methods generate all possibilities, then select; some select a subset of partial solutions at each step; some use an automatically adaptable decision module to select the (single) next partial solution.</Paragraph> <Paragraph position="1"> (ii) Type of decision-making module and method of construction/adaptation: statistical models, various machine learning techniques or manual construction/adaptation.</Paragraph> <Paragraph position="2"> (iii) Size of the subtask of the generation process that the method is applied to: from the entire generation process, e.g. in text summarisation, to all of surface realisation for domain-independent generation.</Paragraph> <Paragraph position="3"> Methods that can in principle be used to stochastically generate text have existed for a long time, but statistical generation from specified inputs started with Japan-Gloss [Knight et al., 1994; 1995] (which replaced PENMAN's defaults with statistical decisions), while comprehensive statistical generation started with Nitrogen [Knight and Langkilde, 1998] (which represented the set of alternative realisations as a word lattice and selected the best with a 2-gram model) and its successor Halogen [Langkilde, 2000] (where the word lattice was replaced by a more efficient AND/OR-tree representation).</Paragraph> <Paragraph position="4"> Since then, a steady stream of publications has reported work on statistical NLG. In FERGUS, Bangalore et al. used an XTAG grammar to generate a word lattice representation of a small number of alternative realisations, and a 3-gram model to select the best [Bangalore and Rambow, 2000b].</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1 Controlled Generation of Text (CoGenT), http://www.itri.brighton.ac.uk/projects/cogent </SectionTitle> <Paragraph position="2"> Humphreys et al. [2001] reused a PCFG trained for NL parsing to build syntactic generation trees from candidate syntactic nodes.</Paragraph> <Paragraph position="3"> Recently, Habash [2004] reported work using structural 2-grams for lexical/syntactic selection tasks (using the joint probability of word and parent word in dependency structures, instead of the probability of word given preceding word), as well as conventional n-grams for selection among surface strings. Velldal et al.
[2004] compared the performance of a 4-gram model trained on the BNC with a Maximum Entropy model reused from a parsing application and trained on the small, domain-specific LOGON corpus, finding that the domain-specific ME model performs better on the LOGON corpus, but a combined model performs best.</Paragraph> <Paragraph position="4"> Some statistical NLG research has looked at subproblems of language generation, such as ordering of NP premodifiers [Shaw and Hatzivassiloglou, 1999; Malouf, 2000], attribute selection in content planning [Oh and Rudnicky, 2000], NP type determination [Poesio et al., 1999], pronominalisation [Strube and Wolters, 2000], and lexical choice [Bangalore and Rambow, 2000a].</Paragraph> <Paragraph position="5"> In hybrid symbolic-statistical approaches, White [2004] prunes edges in chart realisation using n-gram models, and Varges uses quantitative methods for determining the weights on instance features in instance-based generation [Varges and Mellish, 2001].</Paragraph> <Paragraph position="6"> The likelihood of realisations given concepts or semantic representations has been modelled directly, but is probably limited to small-scale and specialised applications: summarisation construed as term selection and ordering [Witbrock and Mittal, 1999], grammar-free stochastic surface realisation [Oh and Rudnicky, 2000], and surface realisation construed as attribute selection and lexical choice [Ratnaparkhi, 2000]. Some of the above papers compare the purely statistical methods to other machine learning methods such as memory-based learning and reinforcement learning. Some other research has focussed on machine learning methods, e.g. Walker et al. [2001] look at using a boosting algorithm to train a sentence plan ranker on a corpus of labelled examples, and Marciniak & Strube [2004] construe the entire generation process as a sequence of classification problems, solved by corpus-trained feature-vector classifiers.</Paragraph> <Paragraph position="7"> Generate-and-select NLG has been applied either to all of surface realisation, or to a small subproblem in deep or surface realisation (not to the entire generation process); it is either very expensive or not guaranteed to find the optimal solution; and the models it has used are either shallow and unstructured, or require manual corpus annotation.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Treebank-Training of Generators </SectionTitle> <Paragraph position="0"> Treebank-training of generators is a method for modelling likelihoods of realisations in generation. It is introduced in this section first in terms of the general idea behind it (which could be implemented with various formalisms, training methods and generation algorithms), and then in Section 3.1 by a description of the actual technique that was used in the experiments reported below.</Paragraph> <Paragraph position="1"> The generation space of a generator can be seen as a set of decision trees, where the root nodes correspond to the inputs, and the paths down the trees are all possible generation processes in the generator that lead to a realisation (leaf) node (a view discussed in more detail in [Belz, 2004]).</Paragraph> <Paragraph position="2"> Consider the diagrammatic generation space representation in Figure 1.
It shows examples of three realisations and the sequences of generator decisions that generate them, represented as paths connecting decision nodes and leading to realisation nodes. In this view of the generation space, using an n-gram model is equivalent to following all paths (from the given input node) down to the realisations, and then applying the model to select the most likely realisation.</Paragraph> <Paragraph position="3"> An alternative is to estimate the likelihood of a realisation in terms of the likelihoods of the generator decisions that give rise to it, looking at the possible sequences of decisions that generate the realisation (its 'derivations'). One way of doing this is to say the likelihood of a string is the sum of the likelihoods of its derivations, as formalised below. To train such a model on a corpus of raw text, the set of derivations for each sentence is determined, frequencies for individual decisions are added up, and a probability distribution over sets of alternative decisions is estimated.</Paragraph>
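<Paragraph> Stated explicitly (a sketch of the idea just described; the notation is assumed here rather than taken from the paper): if D(s) is the set of derivations of a string s, and each derivation d is a sequence of generator decisions d_1, ..., d_|d| whose probabilities are treated as independent, then

$$ P(s) \;=\; \sum_{d \in D(s)} P(d) \;=\; \sum_{d \in D(s)} \prod_{i=1}^{|d|} p(d_i) $$

</Paragraph>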
<Paragraph position="4"> If a sentence has more than one derivation, as in the example on the right of Figure 1, there are two possibilities: either the frequency counts are evenly divided between them, or disambiguation is carried out to determine the correct derivation. The former uses the surface frequencies of word strings regardless of their meaning and structure, as do n-gram models.</Paragraph> <Paragraph position="5"> The latter is complicated by the fact that there is not always a single 'correct' derivation in generation (e.g. an expression may end up being passivised in more than one way).</Paragraph> <Paragraph position="6"> There are at least three strategies for using a treebank-trained model during generation: (i) select the most likely decision at each choice point; (ii) select the most likely generation process (joint probability of all decisions); or (iii) select the most likely string of words (summed probabilities of all generation processes that generate the string). The first would always make the same decision given the same alternatives, whereas for (ii) and (iii) it would depend also on the other decisions in the derivation(s). On the other hand, the complexity of (iii) is much greater than that of (ii), which in turn is greater than that of (i).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Context-free Generator Treebank-Training without Disambiguation </SectionTitle> <Paragraph position="0"> There are many different ways of representing generator decisions and annotating sentences with derivations, and treebank-training is clearly not equally suitable for all types of generators. In the current version of the method, generation rules must be context-free with atomic arguments.</Paragraph> <Paragraph position="1"> Derivations for sentences in the corpus are then standard context-free derivations, and corpora are annotated in the standard context-free way with brackets and labels.</Paragraph> <Paragraph position="2"> No disambiguation is performed, the assumption being that all derivations are equally good for a sentence. If a sentence has more than one derivation, frequency counts are divided equally between them.</Paragraph> <Paragraph position="3"> The three basic steps in context-free generator treebank-training are: 1. For each sentence in the corpus, find all generation processes that generate it, that is, all the ways in which the generator could have generated it. For each generation process, note the sequence of generator decisions involved in it (together, these are the derivations for the sentence). If there is no complete derivation, maximal partial derivations are used instead.</Paragraph> <Paragraph position="4"> 2. Annotate the (sub)strings in the sentence with the derivation, resulting in a generation tree for the sentence. If there is more than one derivation for the sentence, create a set of annotated trees. The resulting annotated corpus is a generation treebank.</Paragraph> <Paragraph position="5"> 3. Obtain frequency counts for each individual decision from the annotations, adding 1/n to the count for every decision, where n is the number of alternative derivations; convert counts into probability distributions over alternative decisions, smoothing for unseen decisions.</Paragraph> <Paragraph position="6"> The probability distribution is currently smoothed with the simple add-1 method. This is equivalent to Bayesian estimation with a uniform prior probability on all decisions, and is entirely sufficient for present purposes given the very small vocabulary and the good coverage of the data. A standard maximum likelihood estimation is performed: the total number of occurrences of a decision (e.g. passive) is divided by the total number of occurrences of all alternatives (e.g. passive + active). In the context-free setting, a decision type corresponds to a nonterminal N, and decisions correspond to expansion rules N → α. Given a function c(x) which returns the frequency count for a decision x, normalising each occurrence by the number of derivations for the sentence, the probability of a decision is obtained in the standard way (R is the set of all decisions):</Paragraph> <Paragraph position="8"> $$ p(N \to \alpha) \;=\; \frac{c(N \to \alpha)}{\sum_{N \to \beta \,\in\, R} c(N \to \beta)} $$ </Paragraph>
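<Paragraph> The estimation step (step 3 above) is straightforward to implement. The following is a minimal sketch, assuming derivations are represented as lists of (nonterminal, expansion) rule applications; the function and variable names are illustrative, not taken from the paper:

from collections import defaultdict

def train_generation_model(corpus_derivations, all_rules):
    # corpus_derivations: one entry per sentence, each a list of that
    # sentence's alternative derivations; a derivation is a list of
    # (nonterminal, expansion) rule applications.
    # all_rules: maps each nonterminal to the list of its expansions.
    counts = defaultdict(float)
    for derivations in corpus_derivations:
        n = len(derivations)
        for derivation in derivations:
            for rule in derivation:
                counts[rule] += 1.0 / n  # counts divided equally over derivations
    # Add-1 smoothing, then normalise over the alternatives of each
    # nonterminal to obtain p(N -> alpha).
    probs = {}
    for nt, expansions in all_rules.items():
        total = sum(counts[(nt, e)] + 1.0 for e in expansions)
        for e in expansions:
            probs[(nt, e)] = (counts[(nt, e)] + 1.0) / total
    return probs

</Paragraph>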
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Greedy generation </SectionTitle> <Paragraph position="0"> One way of using a treebank-trained generator is to make the single most likely decision at each choice point in a generation process. This is not guaranteed to result in the most likely generation process, but the computational cost in application is exceedingly low.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Viterbi generation </SectionTitle> <Paragraph position="0"> The alternative is to do a Viterbi search of the generation forest for a given input, which maximises the joint likelihood of all decisions taken in the generation process. This is guaranteed to select the most likely generation process, but is considerably more expensive. The efficiency of greedy probabilistic generation, Viterbi generation and 2-gram postselection is compared in Section 4.3 below.</Paragraph> <Paragraph position="1"> A possible alternative to greedy search is to use a non-uniform random distribution proportional to the likelihoods of alternatives. E.g. if there are two alternative decisions D1 and D2, with the model giving p(D1) = .8 and p(D2) = .2, then the generator would decide D1 with probability .8, and D2 with probability .2 for an arbitrary input (instead of always deciding D1, as the greedy generator does). However, such a strategy, while increasing variation, would come at the price of lowering the overall likelihood of making the right decision. With the strategy of always going for the most frequent alternative, the overall likelihood of making the right decision when faced with the choice D1 or D2 is simply .8 in the current example (1 for D1, 0 for D2). With the likelihood-proportional random strategy, however, the overall likelihood of making the right decision is only .68 (.64 for D1, .04 for D2), as the calculation below shows. Variation can alternatively be increased by making the model more fine-grained.</Paragraph>
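<Paragraph> The figures in this example follow from the fact that, under likelihood-proportional sampling, the generator must both pick a decision and have it be the right one. Reading p(D_i) both as the probability of sampling D_i and as the probability that D_i is the right decision, as the example implicitly does:

$$ P(\text{right}) = p(D_1)^2 + p(D_2)^2 = 0.8^2 + 0.2^2 = 0.64 + 0.04 = 0.68 \;<\; 0.8 $$

</Paragraph>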
</Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Evaluation on Weather Forecast Generation </SectionTitle> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Domain and Data: Weather Forecasting </SectionTitle> <Paragraph position="0"> The corpus used in the experiments reported below is the SUMTIME-METEO corpus created by the SUMTIME project team in collaboration with WNI Oceanroutes [Sripada et al., 2002]. The corpus was collected by WNI Oceanroutes from the commercial output of five different (human) forecasters, and each instance in the corpus consists of three numerical data files (output by three different weather simulators) and the weather forecast file written by the forecaster on the evidence of the data files (and sometimes additional resources).</Paragraph> <Paragraph position="1"> Following the SUMTIME work, the experiments reported below focussed on the part of the forecasts that predicts wind characteristics for the next 15 hours. Such 'wind statements' look as follows (for 10-08-01): To keep things simple, only the data file type that contains (virtually all) the information about wind parameters (the .tab file type) was used. Figure 2 is the .tab file corresponding to the above forecast. The first column is the day/hour time stamp; the second the wind direction predicted for the corresponding time period; the third the wind speed at 10m above the ground; the fourth the gust speed at 10m; and the fifth the gust speed at 50m. The remaining columns contain wave data.</Paragraph> <Paragraph position="2"> The mapping from time series data to forecast is not straightforward (even when all three data files are taken into account). An example here is that while the wind direction in the first part of the wind statement is given as WNW-NW, NW does not appear as a wind direction anywhere in the data file. Nor is it obvious why the wind speeds 11, 12 and 7 are mapped to the two ranges 12-15 and 5-10.</Paragraph> <Paragraph position="3"> The SUMTIME project construed the mapping from time series data to weather forecasts as two tasks [Sripada et al., 2003]: selecting a subset of the time series data to be included in the forecast, and expressing this subset of numbers as an NL forecast. The focus of the research reported here is not the numerical summarisation of time series data, but NLG techniques. Therefore, the SUMTIME-METEO corpus was converted into a parallel corpus of wind statements and the wind data included in each statement. The wind data is a vector of time stamps and wind parameters, and was 'reverse-engineered' by automatically aligning wind speeds and wind directions in the forecasts with time stamps in the data file. In order to do this, wind speeds and directions in the data file have to be matched with those in the forecast. This was not straightforward either, because more often than not, there is no exact match in the data file for the wind speeds and directions in the forecast. The strategy adopted in the work reported here (sketched below) was to select the time stamp beside the first exact match, and to leave time undefined if there was no exact match. The instances in the final training corpus look as follows (same example, 10-08-01, resulting in two instances):</Paragraph>
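<Paragraph> The alignment strategy just described can be sketched as follows. This is a rough sketch only: the data layout, function name and exact matching criterion are assumptions for illustration, not taken from the paper:

def first_exact_match_time(value, column, data_rows):
    # data_rows: parsed rows of the .tab file; column 0 is the day/hour
    # time stamp, column 1 the wind direction, column 2 the wind speed.
    # Returns the time stamp beside the first exact match of value in
    # the given column, or None ('undefined') if there is no exact match.
    for row in data_rows:
        if row[column] == value:
            return row[0]
    return None

# Illustrative usage: first_exact_match_time('NW', 1, rows) would return
# the time stamp of the first row whose direction column is exactly NW,
# or None, since NW does not occur in the example data file.
</Paragraph>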
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Automatic Generation of Weather Forecasts </SectionTitle> <Paragraph position="0"> The three generation methods compared below are all generate-and-select methods. The idea was to build a basic generator that for any given input generates a set of alternatives that reflects all the variation found in the corpus (rather than deciding which alternative to select in which context), and then to create statistical decision makers trained on the corpus to select (a subset of) alternatives. The rest of this section describes the basic generator, and the following section describes the experiments that were carried out.</Paragraph> <Paragraph position="1"> The basic generator was written semi-automatically as a set of generation rules with atomic arguments that convert an input vector of numbers, in steps, to a set of NL forecasts.</Paragraph> <Paragraph position="2"> The automatic part was analysing the entire corpus with a set of simple chunking rules that split wind statements into wind direction, wind speed, gust speed, gust statements, time expressions, transition phrases (such as and increasing), pre-modifiers (such as less than for numbers, and mainly for wind direction), and post-modifiers (e.g. in or near the low centre).</Paragraph> <Paragraph position="3"> The manual part was to write the chunking rules themselves, and higher-level rules that combine different sequences of chunks into larger components.</Paragraph> <Paragraph position="4"> The higher-level generation rules were based on an interpretation of wind statements as sequences of fairly independent units of information, each containing as a minimum a wind direction or wind speed range, and as a maximum all the chunks listed above. The only context encoded in the rules was whether a unit of information was the first in a wind statement, and whether a wind statement contained wind direction (only), wind speed (only), or both. The final generator takes as inputs number vectors of length 6 to 36, and has a large amount of non-determinism. For the simplest input (one number), it generates 8 alternative realisations. For the most complex input (36 numbers), it generates 4.96 × 10^40 alternatives (as a tightly packed AND/OR-tree).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Experiments and Results </SectionTitle> <Paragraph position="0"> The converted corpus (as described in Section 4.1 above) consisted of 2,123 instances, corresponding to a total of 22,985 words. This may not sound like much, but considering that the entire corpus only has a vocabulary of about 90 words (not counting wind directions), and uses only a handful of different syntactic structures, the corpus provides extremely good coverage (an initial impression confirmed by the small differences between training and testing data results below).</Paragraph> <Paragraph position="1"> The corpus was divided at random into training and testing data at a ratio of 9:1. The training set was used to treebank-train the weather forecast generation grammar (as described in Section 3.1) and a back-off 2-gram model (using the SRILM toolkit [Stolcke, 2002]). The treebank-trained generation grammar was used in conjunction with a greedy and a Viterbi generation method. The 2-gram model was used in more or less exactly the way reported in the Halogen papers [Langkilde, 2000]. That is to say, the packed AND/OR-tree representation of all alternatives is generated in full, then the 2-gram model is applied to select the single best one.</Paragraph> <Paragraph position="2"> One small difference from Halogen is that a Viterbi algorithm was used to identify the single most likely string. This is achieved as follows. The AND/OR-tree is interpreted directly as a finite-state automaton where the states correspond to words as well as to the nodes in the AND/OR-tree. The transitions are assigned probabilities according to the 2-gram model, and then a straightforward single-best Viterbi algorithm is applied to find the best path through the automaton.</Paragraph> <Paragraph position="3"> Random selection among all alternatives was used as a baseline. All results were evaluated against the gold standard (the human-written forecasts) of the test set. Results were validated with 5-fold cross-validation.</Paragraph> <Paragraph position="4"> In the following overview of the results, the similarity between automatically generated forecasts and the gold standard was measured by conventional string-edit (SE) distance, with substitution at cost 2, and deletion and insertion at cost 1 (a code sketch of this metric follows).</Paragraph>
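<Paragraph> The SE metric with these costs is standard Levenshtein distance over word tokens. A self-contained sketch (not the authors' code; names are illustrative):

def string_edit_distance(hyp, ref):
    # Word-level string-edit distance with substitution at cost 2,
    # deletion and insertion at cost 1, computed by the standard
    # dynamic-programming recurrence.
    h, r = hyp.split(), ref.split()
    # dist[i][j]: distance between the first i words of h and first j of r
    dist = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        dist[i][0] = i
    for j in range(1, len(r) + 1):
        dist[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = 0 if h[i - 1] == r[j - 1] else 2
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[len(h)][len(r)]

With substitution at cost 2, a substitution is never cheaper than the deletion-plus-insertion it replaces, which is what gives the metric its alignment-based interpretation.</Paragraph>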
<Paragraph position="5"> Baseline results are given as absolute SE scores, results for the non-random generators in terms of improvement over the baseline (reduction in string-edit distance, with SE score in brackets).</Paragraph> <Paragraph position="6"> The SE scores show that, as expected, the improvements for the training set are slightly larger in all cases. The greedy generator achieves a significant improvement over random selection, but is outperformed by the Viterbi generator, with 2-gram selection the clear overall winner.</Paragraph> <Paragraph position="7"> The generated strings were also evaluated using the BLEU metric [Papineni et al., 2001], which is ultimately based on n-gram agreement between generated and gold-standard strings.</Paragraph> <Paragraph position="8"> In simple terms, the more 1-grams, 2-grams, ... and n-grams two strings share, the higher their BLEU score. This implies that BLEU with n = 1 is the most closely related to SE scoring (which can also be seen from the similarity between the relative scores assigned by BLEU1 and SE to the generators). Table 1 gives BLEUn scores for the 4 generators, for 1 ≤ n ≤ 4. These scores give a different impression of the results. BLEU consistently scores the Viterbi generator lower than the greedy generator, although the difference between them is only 0.003 on average, less than the average mean deviation for the two generators across the five runs (0.005). The difference between the random generator and the other methods increases significantly with increasing n. For n = 2 and n = 3, the differences between the 2-gram generator and the two treebank-trained generators drop noticeably. Results were stable over the five runs of the cross-validation. For the test set, the mean deviation in SE scores was between 0.13 and 0.35, and in BLEU scores between 0.0034 and 0.0065. The margins by which the 2-gram generator outperformed the other two were nearly identical across the five runs. The SE score of the greedy generator was lower (by nearly identical margins) than that of the Viterbi generator in all runs. All four BLEU scores of the greedy generator, however, were slightly higher than those of the Viterbi generator in four out of five runs, and slightly lower in one run. Small mean deviation figures and consistency of results confirm the significance of the differences between the scores in the SE and BLEU evaluations to some extent.</Paragraph> <Paragraph position="9"> There is a debate over the degree to which an evaluation metric for NLG should be sensitive to word order and word adjacency. The appropriate degree of word-order sensitivity may vary from one evaluation task to the next, but for the task of evaluating weather forecasts it certainly is important. A clear example is time expressions. A forecast can contain up to five different time expressions, which have to observe the correct chronological order. It is not appropriate to reward the mere presence (regardless of place in the string) of, say, by midnight (which is what some evaluation metrics are specifically designed to do, e.g. [Bangalore et al., 2000]). SE scoring has a tendency to reward proximity to the intended place (although not in a very straightforward way), and BLEU is increasingly strict about place with increasing n.</Paragraph> <Paragraph position="10"> The SE score gives an intuitive, initial impression of how much the three methods have learned in comparison to the baseline: it is easy to conceptualise how different one string is from another if you know there are two insert and delete operations between them (not so easy with the BLEU scores).</Paragraph> <Paragraph position="11"> However, by far the more complete picture (and the most appropriate to this evaluation task) is given by BLEU. It shows that the 2-gram generator does better than the other two, but not by a very large margin. The margin is important considering the far greater expense of n-gram generation. The following table shows the total amount of time it took to test the training and test sets with the different methods in one (controlled) run.</Paragraph> <Paragraph position="12">
Total time     Training set   Test set
TT/greedy      23m52s         2m06s
TT/Viterbi     3h22m33s       27m05s
2-gram         24h04m05s      3h35m06s
</Paragraph> <Paragraph position="13"> Testing the training set took all methods about 7 times as long as the test set. The Viterbi generator took about 13 times longer than the greedy generator for the test set, and about 9 times longer for the training set. The 2-gram generator took 6.5 times longer than the Viterbi generator on the test set, and 7.5 times longer on the training set.</Paragraph> <Paragraph position="14"> To make the comparison fair, the Viterbi and the 2-gram generator were implemented identically as far as possible. Both start by generating the packed AND/OR-tree of all alternatives. During this process, the former makes a note of the rule probabilities (at the AND and OR nodes) along with the rules, and then directly identifies the most likely realisation by a Viterbi method. The 2-gram generator first has to annotate the AND/OR-tree with 2-gram probabilities looked up in the 2-gram model before using the same Viterbi method (sketched below). The overhead of looking up the 2-gram model to score the packed representation is the only difference between the two methods, and multiplies out into a large overhead in computing time for the 2-gram generator (even if the lookup can be implemented more efficiently, there will always be some overhead multiplied by the total number of 2-grams in the packed representation), which would make it unsuitable for use in practical applications.</Paragraph>
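<Paragraph> The shared selection step — a single-best Viterbi search over the probability-annotated representation, viewed as an automaton — can be sketched as follows. The flattening of the packed AND/OR-tree into states and scored transitions is assumed here; names and representation are illustrative, not the authors' implementation:

import math

def viterbi_best_path(states, start, final, transitions):
    # transitions: maps a state to a list of (next_state, word, prob)
    # edges, with prob already assigned (from rule probabilities or
    # from the 2-gram model); probs are assumed to be > 0.
    # states are assumed to be given in topological order, which is
    # always possible since the AND/OR-tree is acyclic.
    # best[s]: (log-probability, word list) of the best path to s.
    best = {start: (0.0, [])}
    for state in states:
        if state not in best:
            continue  # state unreachable from start
        logp, words = best[state]
        for nxt, word, prob in transitions.get(state, []):
            cand = (logp + math.log(prob), words + [word])
            if nxt not in best or cand[0] > best[nxt][0]:
                best[nxt] = cand
    return ' '.join(best[final][1])  # highest-probability word string
</Paragraph>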
<Paragraph position="15"> The random generator has no preference for shorter strings at all, and has an average string length almost twice that of the other generators. The 2-gram generator has an almost absolute preference for shorter over longer strings, and so produces the shortest strings. The Viterbi generator does not prefer shorter strings, but does prefer shorter derivations, and there is a correlation between string length and derivation length. The greedy generator does not have a built-in preference for shorter strings or derivations, but it reflects the fact that short (sub)strings were more frequent in the training corpus. The average string lengths are: gold: 10.8; greedy: 9.3; Viterbi: 9.0; 2-gram: 8.7; random: 17.2.</Paragraph> <Paragraph position="16"> In some application domains, the n-gram model's preference for shorter strings is irrelevant: e.g. in speech recognition (where n-gram models are used widely) the alternatives among which the model must choose are always of the same length. In language generation, where equally good alternatives can vary greatly in length, this preference is positively harmful (see the following section for discussion of methods to counteract this bias).</Paragraph> </Section> </Section> </Paper>