<?xml version="1.0" standalone="yes"?> <Paper uid="I05-5012"> <Title>Information and Communication Technologies</Title> <Section position="6" start_page="91" end_page="94" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> In this section, we describe two small experiments designed to evaluate whether a dependency-based statistical generator improves grammaticality. The first experiment uses a precision- and recall-style metric on verb arguments. We find that our approach performs significantly better than the bigram baseline. The second experiment examines the precision and recall statistics on short and long distance verb arguments. We now describe these two experiments in more detail.</Paragraph> <Section position="1" start_page="91" end_page="93" type="sub_section"> <SectionTitle> 5.1 Improvements in Grammaticality: Verb Argument Precision and Recall </SectionTitle> <Paragraph position="0"> In this evaluation, we want to know what advantages a consideration of input text dependencies affords, compared to just using bigrams from the input text. Given a set of sentences which has been clustered on the basis of similarity of event, the system generates the most probable sentence by recombining words from the cluster (this sentence could be an accurate replica of an original sentence, or a non-verbatim sentence that fuses information from various input sentences). The aim of the evaluation is to measure improvements in grammaticality. To do so, we compare our dependency-based generation method against a bigram model baseline.</Paragraph> <Paragraph position="1"> Since verbs are crucial in indicating the grammaticality of a clause, we examine the verb arguments of the generated sentence. We use a recall and precision metric over verb dependency relations and compare generated verb arguments with those from the input text. For any verbs included in the generated summary, we count how many generated verb-argument relations can be found amongst the input text relations for that verb. A relation match consists of an identical head and an identical modifier. Since word order in English is vital for grammaticality, a matching relation must also preserve the relative order of the two words within the generated sentence.</Paragraph> <Paragraph position="2"> The precision metric is as follows: precision = count(matched-verb-relations) / count(generated-verb-relations). The corresponding recall metric is defined as: recall = count(matched-verb-relations) / count(source-text-verb-relations). The data for our evaluation cases is taken from the information fusion data collected by Barzilay et al. (1999). This data is made up of news articles that have first been grouped by topic, with the component topic sentences then further clustered by similarity of event. We use 100 sentence clusters, with an average of 4 sentences per cluster. Each sentence cluster forms an evaluation case for which the task is to generate a single sentence. For each evaluation case, the baseline method and our method generate a set of answer strings, from 1 to 40 words in length.</Paragraph> <Paragraph position="3"> For each cluster, sentences are parsed using the Connexor dependency parser (www.connexor.com) to obtain dependency relations used to build dependency models for that cluster. In the interests of minimising conflating factors in this comparison, we similarly train bigram language models on the input cluster of text.</Paragraph>
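As an illustration of how this metric can be computed, the following is a minimal Python sketch, not the authors' code: it assumes each verb-argument relation is available as a (head, modifier) pair together with the sentence's tokens, and all function names and the toy data are our own.

# Minimal sketch of the verb-argument precision/recall metric described above.
# Assumes relations are (head, modifier) pairs plus the sentence's token list;
# duplicate word forms are ignored for simplicity. Names are illustrative only.
from collections import Counter

def order_aware_relations(pairs, tokens):
    """Map (head, modifier) pairs to triples recording whether the head
    precedes the modifier, since a match must also preserve relative order."""
    pos = {w: i for i, w in enumerate(tokens)}
    return Counter((h, m, pos[h] < pos[m]) for h, m in pairs if h in pos and m in pos)

def verb_relation_precision_recall(generated, source):
    """Precision = matched / generated relations; recall = matched / source relations."""
    matched = sum((generated & source).values())  # multiset intersection of relation triples
    precision = matched / sum(generated.values()) if generated else 0.0
    recall = matched / sum(source.values()) if source else 0.0
    return precision, recall

# Toy usage: one generated sentence scored against relations from the input cluster.
gen = order_aware_relations([("said", "minister"), ("said", "yesterday")],
                            "the minister said yesterday".split())
src = order_aware_relations([("said", "minister"), ("said", "today")],
                            "the minister said today".split())
print(verb_relation_precision_recall(gen, src))  # (0.5, 0.5)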
<Paragraph position="4"> This provides both the bigram baseline and our system with the best possible chance of producing a grammatical sentence given the vocabulary of the input cluster. Note that the baseline is a difficult one to beat because it is likely to reproduce long sequences from the original sentences of the input cluster. However, the exact regurgitation of input sentences is not necessarily the outcome of the baseline generator since, for each cluster, bigrams from multiple sentences are combined into a single model.</Paragraph> <Paragraph position="5"> We do not currently use any smoothing algorithms for dependency counts in this evaluation. Thus, given the sparseness arising from a small set of sentences, our dependency probabilities tend towards boolean values. For both our approach and the baseline, the bigrams are smoothed using Katz's back-off method.</Paragraph> <Paragraph position="6"> Figure 5 shows the average precision score across sentence lengths. That is, for each sentence length, there are 100 instances whose precisions are averaged. As can be seen, the system almost always achieves a higher precision than the baseline. As expected, precision decreases as sentence length increases.</Paragraph> <Paragraph position="7"> Our approach is designed to minimise the number of spurious dependency relations generated in the resulting sentence. As this is typically measured by precision scores, recall scores are less interesting as a measure of the generated sentence. However, for completeness, they are presented in Figure 6. Results indicate that our system was indistinguishable from the baseline. This is unsurprising as our approach is not designed to increase the retrieval of dependency relations from the source text.</Paragraph> <Paragraph position="8"> Using a two-tailed Wilcoxon test (alpha = 0.05), we find that the differences in precision scores are significant for most sentence lengths, except lengths 17 and 32. The failure to reject the null hypothesis for these lengths is interpreted as idiosyncratic to our data set. In the case of the recall scores, differences are not significant. The results support the claim that a dependency-based statistical generator improves grammaticality by reducing the number of spurious verb-argument dependency relations. It is also possible to treat dependency precision as a superficial measure of content conservation between the generated sentence and the input sentences. Thus, it can also be seen as a poor measure of how well the summary captures the source text.</Paragraph>
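For concreteness, a small sketch of the kind of paired significance test described above, assuming SciPy is available; the per-instance precision scores below are invented placeholders for illustration, not results from the paper.

# Sketch of a two-tailed Wilcoxon signed-rank test over paired precision scores
# (our system vs. the bigram baseline) for the evaluation cases at one sentence
# length. The numbers are placeholders, not the paper's actual scores.
from scipy.stats import wilcoxon

system_precision   = [0.82, 0.75, 0.91, 0.64, 0.77, 0.88, 0.70, 0.83]
baseline_precision = [0.71, 0.69, 0.86, 0.60, 0.74, 0.79, 0.66, 0.72]

# alternative="two-sided" is the default; alpha = 0.05 as in the paper.
statistic, p_value = wilcoxon(system_precision, baseline_precision)
print(f"W = {statistic}, p = {p_value:.4f}, significant = {p_value < 0.05}")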
</Section> <Section position="2" start_page="93" end_page="94" type="sub_section"> <SectionTitle> 5.2 Examining Short and Long Distance Verb Arguments </SectionTitle> <Paragraph position="0"> Intuitively, one would expect the result from the first experiment to be reflected in both short (i.e., adjacent) and long distance verb dependencies.</Paragraph> <Paragraph position="2"> To test this intuition, we examined the precision and recall statistics for the two types of dependencies separately. The same experimental setup is used as in the first experiment.</Paragraph> <Paragraph position="3"> The results for adjacent (short) dependencies echo those of the first experiment. The precision results for adjacent dependencies are presented in Figure 7. Again, our system performs better than the baseline in terms of precision. Our system is indistinguishable in recall performance from the baseline. Due to space constraints, we omit the recall graph. Using the same significance test as before, we find that the differences in precision are generally significant across sentence lengths.</Paragraph> <Paragraph position="4"> That our approach should achieve a better precision for adjacent relations supports the claim of improved grammaticality. The result resonates well with the earlier finding that sentences generated by the dependency-based statistical generator contain fewer instances of fragmented text. If this is so, one would expect that a parser is able to identify more of the original intended dependencies.</Paragraph> <Paragraph position="5"> The results for the long distance verb argument precision and recall tests are slightly different.</Paragraph> <Paragraph position="6"> Whilst the graph of precision scores, presented in Figure 8, shows our system often performing better than the baseline, this difference is not significant. As expected, the recall scores between our system and the baseline are on par, and we again omit the results.</Paragraph> <Paragraph position="7"> This result is interesting because one would expect that what our approach offers most is the ability to preserve long distance dependencies from the input text. However, long distance relations are fewer in number than adjacent relations, which account for approximately 70% of dependency relations (Collins, 1996). As the generator still does not produce perfect text, if the intermediate text between the head and modifier of a long distance relation contains any grammatical errors, the parser will obviously have difficulty in identifying the original intended relation. Given that there are fewer long distance relations, the presence of such errors quickly reduces the performance margin for the precision metric, and hence no significant effect is detected. We expect that as we fine-tune the probabilistic models, the precision of long distance relations is likely to improve.</Paragraph> </Section> </Section></Paper>