<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0618">
<Title>Beyond the Pipeline: Discrete Optimization in NLP</Title>
<Section position="8" start_page="139" end_page="141" type="evalu">
<SectionTitle> 5 Experiments and Results </SectionTitle>
<Paragraph position="0"> In order to evaluate our approach we conducted experiments with two implementations of the ILP model and two different pipelines (presented below).</Paragraph>
<Paragraph position="1"> Each system takes as input a tree structure representing the temporal structure of the text. Individual nodes correspond to single discourse units, and their semantic content is given by the respective feature vectors. Generation occurs in a number of stages, during which the individual discourse units are realized.</Paragraph>
<Section position="1" start_page="139" end_page="140" type="sub_section">
<SectionTitle> 5.1 Implemented Systems </SectionTitle>
<Paragraph position="0"> We used the ILP model described in Section 3 to build two generation systems. To obtain the assignment costs, both systems get a probability distribution for each task from basic classifiers trained on the training data. To calculate the separation costs, which model the stochastic constraints on the co-occurrence of labels, we considered correlated tasks only (cf. Figure 3) and applied two calculation methods, resulting in two different system implementations.</Paragraph>
<Paragraph position="1"> In ILP1, for each pair of tasks we computed the joint distribution of the respective labels over all discourse units in the training data, before the actual input was known. The joint distributions obtained in this way were used for generating all discourse units from the test data. An example matrix with the joint distribution for selected labels of the tasks Connective and Verb Form is given in Table 2. An advantage of this approach is that the computation can be done in an offline mode and has no impact on the run-time.</Paragraph>
Table 2: Joint distribution of selected labels of the tasks Connective (horizontal: null, and, as, after, until) and Verb Form (vertical), computed for all discourse units in the corpus.
Table 3: Joint distribution of selected labels of the tasks Connective (horizontal: null, and, as, after, until) and Verb Form (vertical), considering only discourse units similar to (c): until you see the river side in front of you, at Phi-threshold >= 0.8.
<Paragraph position="3"> In ILP2, the joint distribution for a pair of tasks was calculated at run-time, i.e. only after the actual input was known. This time we did not consider all discourse units in the training data, but only those whose meaning, represented as a feature vector, was similar to the meaning vector of the input discourse unit. As a similarity metric we used the Phi coefficient [9] and set the similarity threshold at 0.8. As can be seen from Table 3, the probability distribution computed in this way is better suited to the specific semantic context. This is especially important if the available corpus is small and the frequency of certain pairs of labels might be too low to have a significant impact on the final assignment.</Paragraph>
<Paragraph position="4"> As a baseline we implemented two pipeline systems. In the first one we used the ordering of tasks that most closely resembles the conventional NLG pipeline (see Figure 4). Individual classifiers had access to both the semantic features and those output by the previous modules; a schematic sketch of such a cascade is given below.
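The following is a minimal, illustrative sketch only: it assumes scikit-learn-style components (DictVectorizer, DecisionTreeClassifier as a stand-in for C4.5) and a hypothetical dict-based data layout, not the actual implementation used in the paper.

# Minimal sketch of the cascaded pipeline baseline (illustrative, not the paper's code).
# Assumed layout: each discourse unit is a dict with
#   unit["features"] -- semantic features (nominal values)
#   unit["labels"]   -- gold labels per task (training data only)
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier  # stand-in for the C4.5 learner

def run_pipeline(tasks, train, test):
    """Run one classifier per task, in the given order; each classifier sees the
    semantic features plus the labels assigned to all earlier tasks."""
    predicted = [dict() for _ in test]
    for i, task in enumerate(tasks):
        earlier = tasks[:i]
        # Training: semantic features + the *correct* labels of the earlier tasks.
        X_train = [dict(u["features"], **{t: u["labels"][t] for t in earlier})
                   for u in train]
        y_train = [u["labels"][task] for u in train]
        # Testing: semantic features + the *generated* labels of the earlier tasks,
        # so classification errors can propagate down the pipeline.
        X_test = [dict(u["features"], **p) for u, p in zip(test, predicted)]
        vec = DictVectorizer()  # dummy-codes nominal features into binary indicators
        clf = DecisionTreeClassifier().fit(vec.fit_transform(X_train), y_train)
        for p, label in zip(predicted, clf.predict(vec.transform(X_test))):
            p[task] = label
    return predicted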
To train the classifiers, the correct feature values were extracted from the training data; during testing, the generated (possibly erroneous) values output by the previous modules were used.</Paragraph>
<Paragraph position="5"> In the other pipeline system we wanted to minimize the error-propagation effect and placed the tasks in order of decreasing accuracy. To determine the ordering of tasks we applied the following procedure: the classifier with the highest baseline accuracy was selected as the first one. The remaining classifiers were trained and tested again, but this time they had access to the additional feature. Again, the classifier with the highest accuracy was selected, and the procedure was repeated until all classifiers were ordered.</Paragraph>
</Section>
<Section position="2" start_page="140" end_page="141" type="sub_section">
<SectionTitle> 5.2 Evaluation </SectionTitle>
<Paragraph position="0"> We evaluated our system using leave-one-out cross-validation: each text in the corpus was used once for testing, with the remaining texts providing the training data. To solve the individual classification tasks we used the decision tree learner C4.5 in the pipeline systems and the Naive Bayes algorithm [10] in the ILP systems. Both learning schemes yielded the highest results in their respective configurations [11]. For each task we applied a feature selection procedure (cf. Kohavi & John (1997)) to determine which semantic features should be taken as input by the respective basic classifiers. We started with an empty feature set and then performed experiments checking classification accuracy with only one new feature at a time. The feature that scored highest was added to the feature set, and the whole procedure was repeated iteratively until no performance improvement took place or no more features were left.</Paragraph>
<Paragraph position="1"> To evaluate the individual tasks we applied two metrics: accuracy, calculated as the proportion of correct classifications to the total number of instances, and the kappa statistic, which corrects for the proportion of classifications that might occur by chance [13] (Siegel & Castellan, 1988). For end-to-end evaluation, we applied the Phi coefficient to measure the degree of similarity between the vector representations of the generated form and the reference form obtained from the test data. The Phi statistic is similar to kappa in that it compensates for the fact that a match between two multi-label features is more difficult to obtain than in the case of binary features. This measure tells us how well all the tasks have been solved together, which in our case amounts to generating the whole text.</Paragraph>
[9] To compare multi-class features on a binary scale we applied dummy coding, which transforms multi-class nominal variables into a set of dummy variables with binary values.
[10] Both implemented in the Weka machine learning software (Witten & Frank, 2000).
[11] We have found that in direct comparison C4.5 reaches higher accuracies than Naive Bayes, but the probability distribution that it outputs is strongly biased towards the winning label. In this case it is practically impossible for the ILP system to change the classifier's decision, as the costs of the other labels get extremely high. Hence the more balanced probability distribution given by Naive Bayes can be corrected more easily.
[13] The corrected quantities can be directly compared, which gives a clear notion of how well individual tasks have been solved.
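To make the per-task and end-to-end metrics concrete, the sketch below computes accuracy and kappa with scikit-learn and approximates the Phi score as the Pearson correlation between dummy-coded binary vectors of the reference and the generated discourse unit; the data layout, helper names, and exact dummy-coding scheme are assumptions, not the authors' scripts.

# Illustrative sketch of the evaluation metrics (assumed dummy-coding scheme).
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

def task_scores(gold, generated, tasks):
    """Per-task accuracy and kappa, over parallel lists of label dicts."""
    return {t: (accuracy_score([g[t] for g in gold], [p[t] for p in generated]),
                cohen_kappa_score([g[t] for g in gold], [p[t] for p in generated]))
            for t in tasks}

def phi_score(reference, generated, tasks, labels_per_task):
    """Phi taken as the Pearson correlation of two dummy-coded (binary) label vectors."""
    def dummy_code(unit):
        # One binary indicator per possible label of every task.
        return np.array([1.0 if unit[t] == lab else 0.0
                         for t in tasks for lab in labels_per_task[t]])
    x, y = dummy_code(reference), dummy_code(generated)
    return float(np.corrcoef(x, y)[0, 1])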
<Paragraph position="4"> The results presented in Table 4 show that the ILP systems achieved the highest accuracy and kappa for most tasks and reached the highest overall Phi score. Notice that for the three correlated tasks considered before, i.e. Connective, S Exp. and Verb Form, ILP2 scored noticeably higher than the pipeline systems. It is interesting to see the effect of sequential processing on the results for another group of correlated tasks, i.e. Verb Lex, Phrase Type and Phrase Rank (cf. Figure 3). Verb Lex got higher scores in Pipeline2, with outputs from both Phrase Type and Phrase Rank available (see the respective pipeline positions), but the reverse effect did not occur: scores for both phrase tasks were lower in Pipeline1 when they had access to the output from Verb Lex, contrary to what we might expect. Apparently, this was due to the low accuracy for Verb Lex, which caused the already mentioned error propagation [14]. This example shows well the advantage that the optimization approach brings: both ILP systems reached much higher scores for all three tasks.</Paragraph>
[14] Apparently, tasks which involve lexical choice get low scores with retrieval measures, as the semantic content typically allows more than one correct form.
</Section>
<Section position="3" start_page="141" end_page="141" type="sub_section">
<SectionTitle> 5.3 Technical Notes </SectionTitle>
<Paragraph position="0"> The size of an LP model is typically expressed in the number of variables and constraints. In the model presented here it depends on the number of tasks in T, the number of possible labels for each task, and the number of correlated tasks. For n different tasks with an average of m labels each, and assuming that every two tasks are correlated with each other, the number of variables in the LP target function is given by num(var) = n*m + (1/2)*n*(n-1)*m^2, and the number of constraints by num(cons) = n + n*(n-1)*m. To solve the ILP models in our system we use lp_solve, an efficient GNU-licensed Mixed Integer Programming (MIP) solver which implements the Branch-and-Bound algorithm. In our application, the models varied in size from 557 variables and 178 constraints to 709 variables and 240 constraints, depending on the number of arguments in a sentence. Generation of a text with 23 discourse units took under 7 seconds on a two-processor 2000 MHz AMD machine.</Paragraph>
</Section>
</Section>
</Paper>