<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1611">
  <Title>Discrete Optimization as an Alternative to Sequential Processing in NLG</Title>
  <Section position="5" start_page="22" end_page="22" type="evalu">
    <SectionTitle>
4 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> In order to evaluate our approach we conducted a series of experiments with two implementations of the ILP model and two different pipelines. Each system takes as input the tree-based representation of the semantic content of route directions described in Section 2. The generation process traverses the temporal tree in a depth-first fashion, and for each node a single discourse unit is realized.</Paragraph>
    <Paragraph position="2"> null and as after until T3 Connective  Connective (horizontal) and Verb Form (vertical), computed for all discourse units in a corpus.</Paragraph>
    <Section position="1" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
4.1 Correlations Between Tasks
</SectionTitle>
      <Paragraph position="0"> We started with running the correlation tests for all pairs of tasks. The obtained correlation network is presented in Figure 5. It is interesting to observe that tasks which realize FEs belonging to the same levels of linguistic organization, and have traditionally been handled within the same generation stages (i.e. Text Planning, Microplanning and Realization) are closely correlated with one another. This fact supports empirically some assumptions behind Reiter's consensus model.</Paragraph>
      <Paragraph position="1"> On the other hand, there exist quite a few correlations that extend over the stage boundaries, and all three lexicalization tasks i.e. T3, T6 and T8 are correlated with many tasks of a totally different linguistic character.</Paragraph>
    </Section>
    <Section position="2" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
4.2 ILP Systems
</SectionTitle>
      <Paragraph position="0"> We used the ILP model described in Section 3 to implement two generation systems. To obtain assignment costs, both systems get a probability distribution for each task from basic classifiers trained on the training data. To calculate the separation costs, modeling the stochastic constraints on the co-occurrence of labels, we considered correlated tasks only (cf. Figure 5) and applied two calculation methods, which resulted in two different system implementations.</Paragraph>
      <Paragraph position="1"> In ILP1, for each pair of tasks we computed the joint distribution of the respective labels considering all discourse units in the training data before the actual input was known. Such obtained joint distributions were used for generating all discourse units from the test data. An example matrix with joint distribution for selected labels of tasks Connective and Verb Form is given in Table 3. An advantage of this approach is that the computation can be done in an offline mode and has no impact on the run-time.</Paragraph>
      <Paragraph position="2"> In ILP2, the joint distribution for a pair of tasks was calculated at run-time, i.e. only after the actual input had been null and as after until T3 Connective  Verb Form, considering only disc. units similar to (c): until you see the river side in front of you, at Phi-threshold [?] 0.8. known. This time we did not consider all discourse units in the training data, but only those whose meaning, represented as a feature vector, was similar to the meaning of the input discourse unit. As a similarity metric we used the Phi coefficient7, and set the similarity threshold at 0.8. As can be seen from Table 4, the probability distribution computed in this way is better suited to the specific semantic context. This is especially important if the available corpus is small and the frequency of certain pairs of labels might be too low to have a significant impact on the final assignment.</Paragraph>
    </Section>
    <Section position="3" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
4.3 Pipeline Systems
</SectionTitle>
      <Paragraph position="0"> As a baseline we implemented two pipeline systems. In the first one we used the ordering of tasks that resembles most closely the standard NLG pipeline and which we also used before in [Marciniak and Strube, 2004]8.</Paragraph>
      <Paragraph position="1"> Individual classifiers had access to both the semantic features, and the features output by the previous modules. To train the classifiers, the correct feature values were extracted from the training data and during testing the generated, and hence possibly erroneous, values were taken.</Paragraph>
      <Paragraph position="2"> In the other pipeline system we wanted to minimize the error-propagation effect and placed the tasks in the order of decreasing accuracy. To determine the ordering of tasks we applied the following procedure: the classifier with the highest baseline accuracy was selected as the first one. The remaining classifiers were trained and tested again, but this time they had access to the additional feature. Again, the classifier with the highest accuracy was selected and the procedure was repeated until all classifiers were ordered.</Paragraph>
    </Section>
    <Section position="4" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
4.4 Evaluation
</SectionTitle>
      <Paragraph position="0"> We evaluated our system using leave-one-out crossvalidation, i.e. for all texts in the corpus, each text was used once for testing, and the remaining texts provided the training data. To solve individual classification tasks we used the decision tree learner C4.5 in the pipeline systems and the Naive Bayes algorithm9 in the ILP systems. Both learning schemes  position of the tasks in the pipeline.</Paragraph>
      <Paragraph position="1"> yielded highest results in the respective configurations10. To solve the ILP models we used lp solve, a highly efficient GNU-licence Mixed Integer Programming (MIP) solver11, that implements the Branch-and-Bound algorithm. For each task we applied a feature selection procedure (cf. [Kohavi and John, 1997]) to determine which semantic features should be taken as the input by the basic classifiers.</Paragraph>
      <Paragraph position="2"> To evaluate individual tasks we applied two metrics: accuracy, calculated as the proportion of correct classifications to the total number of instances, and the k statistic, which corrects for the proportion of classifications that might occur by chance. For end-to-end evaluation, we applied the Phi coefficient to measure the degree of similarity between the vector representations of the generated form (i.e. built from the outcomes of individual tasks) and the reference form obtained from the test data. The Phi-based similarity metric is similar to k as it compensates for the fact that a match between two multi-label features is more difficult to obtain than in the case of binary features. This measure tells us how well all the tasks have been solved together, which in our case amounts to generating the whole text.</Paragraph>
      <Paragraph position="3"> The results presented in Table 5 show that the ILP systems achieved highest accuracy and k for most tasks and reached the highest overall Phi score. Notice that ILP2 improved the accuracy of both pipeline systems for the three correlated tasks that we discussed before, i.e. Connective, S Exp. and Verb Form. Another group of correlated tasks for which the results appear interesting are i.e. Verb Lex., Phrase Type and Phrase Rank (cf. Figure 3). Notice that Verb Lex. got higher scores in Pipeline2, with outputs from both Phrase Type and Phrase Rank (see the respective pipeline positions), but the reverse effect did not occur: scores for both phrase tasks were lower in Pipeline1 when they had access to the output from Verb Lex., contrary to what we might expect. Apparently, this was due to the low accuracy for Verb Lex. which caused the 10We have found that in direct comparison C4.5 performs better than Naive Bayes but the probability distribution that it outputs is strongly biased towards the winning label. In this case it is practically impossible for the ILP system to change the classifier's decision, as the costs of other labels get extremely high. Hence the more balanced probability distribution given by Naive Bayes can be easier corrected in the optimization process.</Paragraph>
      <Paragraph position="4"> 11http://www.geocities.com/lpsolve/ already mentioned error propagation. This example shows well the advantage that optimization processing brings: both ILP systems reached much higher scores for all three tasks.</Paragraph>
      <Paragraph position="5"> Finally, it appears as no coincidence that the three tasks involving lexical choice, i.e. Connective, Verb Lex. and Preposition Lex. scored lower than the syntactic tasks in all systems. This can be attributed partially to the limitations of retrieval measures which do not allow for the fact, that in a given semantic content more than one lexical form can be appropriate. null</Paragraph>
    </Section>
  </Section>
</Paper>