<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1026"> <Title>Statistical Sentence Condensation using Ambiguity Packing and Stochastic Disambiguation Methods for Lexical-Functional Grammar</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Experimental Evaluation </SectionTitle> <Paragraph position="0"> The sentences and condensations we used are taken from the data for the experiments of Knight and Marcu (2000), which were provided to us by Daniel Marcu. These data consist of pairs of sentences and their condensed versions that have been extracted from computer-news articles and abstracts of the Ziff-Davis corpus. Out of these data, we parsed and manually disambiguated 500 sentence pairs.</Paragraph>
<Paragraph position="1"> These included a set of 32 sentence pairs that were used for testing purposes in Knight and Marcu (2000). In order to control for the small corpus size of this test set, we randomly extracted an additional 32 sentence pairs from the 500 parsed and disambiguated examples as a second test set. The remaining 436 sentence pairs were used to create training data. For the purpose of discriminative training, a gold standard of transferred f-structures was created from the transfer output and the manually selected f-structures for the condensed strings.</Paragraph>
<Paragraph position="2"> This was done automatically by selecting for each example the transferred f-structure that best matched the f-structure annotated for the condensed string.</Paragraph>
<Paragraph position="3"> In the automatic evaluation of f-structure match, three different system variants were compared. Firstly, randomly chosen transferred f-structures were matched against the manually selected f-structures for the manually created condensations. This evaluation constitutes a lower bound on the F-score against the given gold standard. Secondly, matching results for the transferred f-structures yielding the maximal F-score against the gold standard were recorded, giving an upper bound for the system. Thirdly, the performance of the stochastic model within the range between the lower and upper bounds was measured by recording the F-score of the f-structure that received the highest probability according to the learned distribution on transferred structures.</Paragraph>
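The gold-standard creation and the three selection variants can be summarized schematically. The following sketch is illustrative only and is not code from the system; it assumes that f-structures are compared as sets of dependency triples, that the matching F-score is the harmonic mean of precision and recall over those triples, and that the learned model is available as a scoring function (here called log_prob):

    import random

    def f_score(candidate, gold):
        # Assumed metric: F-score over f-structures represented as sets of triples.
        if not candidate or not gold:
            return 0.0
        overlap = len(candidate & gold)
        precision = overlap / len(candidate)
        recall = overlap / len(gold)
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    def best_match(transferred, gold):
        # Gold-standard creation: the transferred f-structure that best matches
        # the f-structure annotated for the condensed string.
        return max(transferred, key=lambda c: f_score(c, gold))

    def selection_bounds(transferred, gold, log_prob):
        # Lower bound: a randomly chosen transferred f-structure.
        lower = f_score(random.choice(transferred), gold)
        # System selection: the most probable candidate under the learned distribution.
        system = f_score(max(transferred, key=log_prob), gold)
        # Upper bound: the candidate with maximal F-score against the gold standard.
        upper = f_score(best_match(transferred, gold), gold)
        return lower, system, upper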
<Paragraph position="4"> In order to make our results comparable to the results of Knight and Marcu (2000), and also to investigate the correspondence between the automatic evaluation and human judgments, a manual evaluation of the strings generated by these system variants was conducted. Two human judges were presented, for each test example, with the uncondensed surface string and five condensed strings displayed in random order. The five condensed strings presented to the human judges contained (1) strings generated from three randomly selected f-structures, (2) the strings generated from the f-structures selected by the stochastic model, and (3) the manually created gold-standard condensations extracted from the Ziff-Davis abstracts. The judges were asked to rate summarization quality on a scale of increasing quality from 1 to 5 by assessing how well the generated strings retained the most salient information of the original uncondensed sentences. Grammaticality of the system output is optimal and is therefore not reported separately. Results for both evaluations are reported for two test corpora of 32 examples each. Testset I contains the sentences and condensations used to evaluate the system described in Knight and Marcu (2000). Testset II consists of another 32 randomly extracted sentence pairs from the same domain, prepared in the same way.</Paragraph>
<Paragraph position="5"> Fig. 5 shows evaluation results for a sentence condensation run that uses manually selected f-structures for the original sentences as input to the transfer component.</Paragraph>
<Paragraph position="6"> These results demonstrate how the condensation system performs under optimal circumstances, when the parse chosen as input is the best available. Fig. 6 applies the same evaluation data and metrics to a sentence condensation experiment that performs transfer from packed f-structures, i.e., transfer is performed on all parses for an ambiguous sentence instead of on a single manually selected parse. Alternatively, a single input parse could be selected by stochastic models such as the one described in Riezler et al. (2002). A separate phase of parse disambiguation, and perhaps the effects of any errors that this might introduce, can be avoided by transferring from all parses for an ambiguous sentence. This approach is computationally feasible, however, only if condensation can be carried all the way through without unpacking. Our technology is not yet able to do this (in particular, as mentioned earlier, we have not yet implemented a method for stochastic disambiguation on packed f-structures). However, we conducted a preliminary assessment of this possibility by unpacking and enumerating the transferred f-structures. For many sentences this resulted in more candidates than we could operate on in the available time and space, and in those cases we arbitrarily set a cut-off on the number of transferred f-structures we considered.</Paragraph>
<Paragraph position="7"> Since transferred f-structures are produced according to the number of rules applied to transfer them, in this setup the transfer system produces smaller f-structures first, and the cut-off discards the less condensed outputs. The result of this experiment, shown in Fig. 6, thus provides a conservative estimate of the quality of the condensations we might achieve with a full-packing implementation.</Paragraph>
<Paragraph position="8"> In Figs. 5 and 6, the first row shows F-scores for a random selection, the system selection, and the best possible selection from the transfer output, measured against the gold standard. The second row shows summarization quality scores for generations from a random selection and from the system selection, and for the human-written condensation. The third row reports compression ratios.</Paragraph>
[Fig. 5: testset I; lower bound, system selection, upper bound; transfer from the manually selected f-structure for the original uncondensed sentences.]
<Paragraph position="9"> As can be seen from these tables, the ranking of system variants produced by the automatic and manual evaluations confirms a close correlation between the automatic evaluation and human judgments. A comparison of evaluation results across columns, i.e., across selection variants, shows that a stochastic selection of transferred f-structures is indeed important. Even if all f-structures are transferred from the same linguistically rich source, and all generated strings are grammatical, a reduction in error rate of around 50% relative to the upper bound can be achieved by stochastic selection.
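To make the reported quantities concrete, the following minimal sketch gives one plausible reading of the error-rate reduction relative to the upper bound, together with an assumed definition of the compression ratio; neither formula is quoted from the paper, and both are stated here only as illustrative assumptions:

    def relative_error_reduction(f_random, f_system, f_upper):
        # Assumed reading: the "error" of a selection method is its F-score gap to
        # the upper bound; the reduction compares the stochastic selection to a
        # random selection, so a value of 0.5 corresponds to a ~50% reduction.
        return (f_system - f_random) / (f_upper - f_random)

    def compression_ratio(original, condensed):
        # Assumed definition: length of the condensed string relative to the
        # original, measured in whitespace-separated tokens.
        return len(condensed.split()) / len(original.split())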
In contrast, a comparison between transfer runs with and without perfect disambiguation of the original string shows a decrease of about 5% in F-score, and of only 0.1 points in summarization quality, when transferring from packed parses instead of from the manually selected parse. This shows that it is more important to learn what a good transferred f-structure looks like than to have a perfect f-structure to transfer from.</Paragraph>
<Paragraph position="10"> The compression rate associated with the systems that used stochastic selection is around 60%, which is acceptable but not as aggressive as the human-written condensations. Note that in our current implementation, the transfer component was in some cases unable to operate on the packed representation. In those cases a parse was chosen at random, as a conservative estimate of transfer from all parses. This fall-back mechanism explains the drop in F-score for the upper bound when comparing Figs. 5 and 6.</Paragraph> </Section></Paper>