<?xml version="1.0" standalone="yes"?> <Paper uid="J05-4004"> <Title>Induction of Word and Phrase Alignments for Automatic Document Summarization</Title> <Section position="7" start_page="521" end_page="526" type="evalu"> <SectionTitle> 5. Experimental Results </SectionTitle> <Paragraph position="0"> The experiments we perform are on the same Ziff-Davis corpus described in the introduction. In order to judge the quality of the alignments produced, we compare them against the gold-standard references annotated by the humans. The standard precision and recall metrics used in information retrieval are modified slightly to deal with the sure and possible alignments created during the annotation process. Given the set S of sure alignments, the set S [?] P of possible alignments, and a set A of hypothesized alignments, we compute the precision as |A [?] P|/|A |and the recall as |A [?] S|/|S|.</Paragraph> <Paragraph position="1"> One problem with these definitions is that phrase-based models are fond of making phrases. That is, when given an abstract containing the man and a document also containing the man, a human will align the to the and man to man. However, a phrase-based model will almost always prefer to align the entire phrase the man to the man.This is because it results in fewer probabilities being multiplied together.</Paragraph> <Paragraph position="2"> To compensate for this, we define soft precision (SoftP in the tables) by counting alignments where abis aligned to abthe same as ones in which a is aligned to a and b is aligned to b. Note, however, that this is not the same as a aligned to aband b aligned to b. This latter alignment will, of course, incur a precision error. The soft precision metric induces a new, soft F-Score, labeled SoftF.</Paragraph> <Paragraph position="3"> Often, even humans find it difficult to align function words and punctuation. A list of 58 function words and punctuation marks that appeared in the corpus (henceforth called the ignore-list) was assembled. We computed precision and recall scores both on all words and on all words that do not appear in the ignore-list.</Paragraph> <Section position="1" start_page="521" end_page="523" type="sub_section"> <SectionTitle> 5.1 Systems Compared </SectionTitle> <Paragraph position="0"> Overall, we compare various parameter settings of our model against three other systems. First, we compare against two alignment models developed in the context of machine translation. Second, we compare against the Cut and Paste model developed in the context of &quot;summary decomposition&quot; by Jing (2002). Each of these systems will be discussed in more detail shortly. However, the machine translation alignment models assume sentence pairs as input. Moreover, even though the semi-Markov model is based on efficient dynamic programming techniques, it is still too inefficient to run on very long <document, abstract> pairs.</Paragraph> <Paragraph position="1"> To alleviate both of these problems, we preprocess our <document, abstract> corpus down to an <extract, abstract> corpus, and then subsequently apply our models to this smaller corpus (see Figure 7). In our data, doing so does not introduce significant noise. To generate the extracts, we paired each abstract sentence with three sentences from the corresponding document, selected using the techniques described by Marcu (1999). In an informal evaluation, 20 such pairs were randomly extracted and evaluated by a human. 
<Section position="1" start_page="521" end_page="523" type="sub_section"> <SectionTitle> 5.1 Systems Compared </SectionTitle>
<Paragraph position="0"> Overall, we compare various parameter settings of our model against three other systems. First, we compare against two alignment models developed in the context of machine translation. Second, we compare against the Cut and Paste model developed in the context of &quot;summary decomposition&quot; by Jing (2002). Each of these systems will be discussed in more detail shortly. However, the machine translation alignment models assume sentence pairs as input. Moreover, even though the semi-Markov model is based on efficient dynamic programming techniques, it is still too inefficient to run on very long <document, abstract> pairs.</Paragraph>
<Paragraph position="1"> To alleviate both of these problems, we preprocess our <document, abstract> corpus down to an <extract, abstract> corpus and then apply our models to this smaller corpus (see Figure 7). In our data, doing so does not introduce significant noise. To generate the extracts, we paired each abstract sentence with three sentences from the corresponding document, selected using the techniques described by Marcu (1999). In an informal evaluation, 20 such pairs were randomly extracted and evaluated by a human. Each pair was rated 0 (document sentences contain little to none of the information in the abstract sentence), 1 (document sentences contain some of the information in the abstract sentence), or 2 (document sentences contain all of the information).</Paragraph>
Figure 7 Pictorial representation of the conversion of the <document, abstract> corpus to an <extract, abstract> corpus.
<Paragraph position="2"> Of the 20 random examples, none were rated 0; 5 were rated 1; and 15 were rated 2, giving a mean rating of 1.75. We refer to the resulting corpus as the <extract, abstract> corpus, statistics for which are shown in Table 3. Finally, for fair comparison, we also run the Cut and Paste model only on the extracts. (Interestingly, the Cut and Paste method actually achieves higher performance scores when run on only the extracts rather than on the full documents.)</Paragraph>
<Paragraph position="3"> 5.1.1 Machine Translation Models. We compare against two machine translation alignment models: the first is based on the original IBM Model 4 for machine translation (Brown et al. 1993) and the second on the HMM machine translation alignment model (Vogel, Ney, and Tillmann 1996), both as implemented in the GIZA++ package (Och and Ney 2003).</Paragraph>
<Paragraph position="4"> We modified the code slightly to allow for longer inputs and higher fertilities, but otherwise made no changes. In all of these setups, five iterations of Model 1 were run, followed by five iterations of the HMM model. For Model 4, five iterations of Model 4 were subsequently run.</Paragraph>
<Paragraph position="5"> In our model, the distinction between the summary and the document is clear, but when using a model from machine translation, it is unclear which of the summary and the document should be considered the source language and which should be considered the target language. By making the summary the source language, we are effectively requiring that the fertility of each summary word be very high, or that many words be null generated (since we must generate all of the document). By making the document the source language, we are forcing the model to make most document words have zero fertility. We have performed experiments in both directions, but the latter (document as source) performs better in general.</Paragraph>
<Paragraph position="6"> In order to seed the machine translation model so that it knows that word identity is a good solution, we augmented our corpus with sentence pairs consisting of one source word and one identical target word. This is common practice in the machine translation community when one wishes to cheaply encode knowledge from a dictionary into the alignment model.</Paragraph>
<Paragraph position="7"> 5.1.2 Cut and Paste. We also compare against the Cut and Paste summary decomposition method (Jing 2002), based on a non-trainable HMM. Briefly, the Cut and Paste HMM searches for long contiguous blocks of words in the document and abstract that are identical (up to stem). The longest such sequences are aligned. By fixing a length cutoff of n and ignoring sequences of length less than n, one can arbitrarily increase the precision of this method. We found that n = 2 yields the best balance between precision and recall (and the highest F-measure). On this task, this model drastically outperforms the machine translation models.</Paragraph>
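Jing's (2002) actual decomposition HMM involves considerably more machinery than described here; the following is only a rough sketch, under stated assumptions, of the block-matching idea summarized above: find contiguous blocks that are identical up to stem, align the longest ones, and discard blocks shorter than the cutoff n. The stemmer, tie-breaking policy, and data structures are assumptions, not part of the original method.

# Rough sketch of the block-matching idea (not Jing's system): greedily keep
# the longest contiguous blocks that match up to stem, dropping blocks
# shorter than the cutoff n (n = 2 worked best in the experiments above).

def stem(token):
    return token.lower().rstrip("s")  # crude placeholder; a real stemmer is assumed

def align_blocks(abstract, document, n=2):
    a = [stem(t) for t in abstract]
    d = [stem(t) for t in document]
    blocks = []
    for i in range(len(a)):
        for j in range(len(d)):
            k = 0
            while i + k < len(a) and j + k < len(d) and a[i + k] == d[j + k]:
                k += 1
            if k >= n:
                blocks.append((k, i, j))
    alignments, used_a = [], set()
    for k, i, j in sorted(blocks, reverse=True):   # longest blocks first
        span = set(range(i, i + k))
        if not span & used_a:                      # keep each abstract token in at most one block
            alignments.append((i, j, k))           # (abstract start, document start, length)
            used_a |= span
    return alignments

Raising n in a sketch like this trades recall for precision, which matches the observation above that the length cutoff can be used to arbitrarily increase the precision of the method.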
<Paragraph position="8"> 5.1.3 The Semi-Markov Model. While the semi-HMM is based on a dynamic programming algorithm, the effective search space in this model is enormous, even for moderately sized <document, abstract> pairs. The semi-HMM system was therefore trained on the <extract, abstract> corpus. We also restrict the state-space with a beam, sized at 50% of the unrestricted state-space. With this configuration, we run ten iterations of the forward-backward algorithm. The entire computation takes approximately 8 days on a 128-node cluster computer.</Paragraph>
<Paragraph position="9"> We compare three settings of the semi-HMM. The first, semi-HMM-relative, uses the relative movement jump table; the second, semi-HMM-Gaussian, uses the Gaussian parameterized jump table; the third, semi-HMM-syntax, uses the syntax-based jump model.</Paragraph> </Section>
<Section position="2" start_page="523" end_page="524" type="sub_section"> <SectionTitle> 5.2 Evaluation Results </SectionTitle>
<Paragraph position="0"> The results, in terms of precision, recall, and F-score, are shown in Table 4. The first three columns report these statistics computed over all words; the next three report them computed only over words that do not appear in our ignore-list of 58 stop words. Under the methodology of combining the two human annotations by taking their union, either human annotation, evaluated against the combined reference, would achieve a precision and recall of 1.0. To give a sense of how well humans actually perform on this task, we therefore compare each human against the other.</Paragraph>
<Paragraph position="1"> As we can see from Table 4, none of the machine translation models is well suited to this task, achieving, at best, an F-score of 0.298. The flipped models, in which the document sentences are the source language and the abstract sentences are the target language, perform significantly better (comparatively). Since the MT models are not symmetric, going the bad way requires that many document words have zero fertility, which is difficult for these models to cope with.</Paragraph>
<Paragraph position="2"> The Cut and Paste method performs significantly better, which is to be expected, since it is designed specifically for summarization. As one would expect, this method achieves higher precision than recall, though not by very much. The fact that the Cut and Paste model performs so well compared to the MT models, which are able to learn non-identity correspondences, suggests that any successful model should be able to take advantage of both identity matches and learned correspondences, as ours does.</Paragraph>
<Paragraph position="3"> Our methods significantly outperform both the IBM models and the Cut and Paste method, achieving a precision of 0.522 and a recall of 0.712, yielding an overall F-score of 0.606 when stop words are not considered. This is still below the human-against-human F-score of 0.775 (especially considering that the true human-against-human scores are 1.0), but significantly better than any of the other models.</Paragraph>
<Paragraph position="4"> Among the three settings of our jump table, the syntax-based model performs best, followed by the relative jump model, with the Gaussian model coming in worst (though still better than any other approach). Inspecting Figure 2, the fact that the Gaussian model does not perform well is not surprising; the data shown there is very non-Gaussian. A double-exponential model might be a better fit, but it is unlikely that such a model would outperform the syntax-based model, so we did not perform this experiment.</Paragraph> </Section>
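The paper does not report fitting a double-exponential model, so the following is only a hypothetical sketch of how one could check whether a Laplace (double-exponential) distribution fits the observed relative jump distances better than a Gaussian. The input array and the use of closed-form maximum-likelihood fits are assumptions made purely for illustration.

import numpy as np

def compare_jump_fits(jumps):
    """Compare average log-likelihood of Gaussian vs. Laplace (double-exponential)
    fits to observed relative jump distances (a 1-D array; hypothetical input)."""
    jumps = np.asarray(jumps, dtype=float)
    # Gaussian MLE: sample mean and standard deviation.
    mu, sigma = jumps.mean(), jumps.std()
    gauss_ll = np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                       - (jumps - mu) ** 2 / (2 * sigma**2))
    # Laplace MLE: median location, mean absolute deviation as scale.
    loc = np.median(jumps)
    b = np.mean(np.abs(jumps - loc))
    laplace_ll = np.mean(-np.log(2 * b) - np.abs(jumps - loc) / b)
    return gauss_ll, laplace_ll

A higher average log-likelihood for the Laplace fit would support the intuition above; as the text notes, even a better-fitting parametric jump distribution would still lack the information exploited by the syntax-based jump model.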
<Section position="3" start_page="524" end_page="526" type="sub_section"> <SectionTitle> 5.3 Error Analysis </SectionTitle>
<Paragraph position="0"> The first mistake frequently made by our model is failing to align summary words to null.</Paragraph>
<Paragraph position="1"> In effect, this means that our model of null-generated summary words is lacking. An example of this error is shown in Example 1 in Figure 8. In this example, the model has erroneously aligned from DOS in the abstract to from DOS in the document (the error is shown in bold). This alignment is wrong because the context of from DOS in the document is completely different from the context in which it appears in the summary. However, the identity rewrite model has overwhelmed the locality model and forced this incorrect alignment. To measure the frequency of such errors, we have post-processed our system's alignments so that whenever a human alignment contains a null-generated summary word, our model also predicts that this word is null-generated. Doing so will not change our system's recall, but it can improve the precision.</Paragraph>
Figure 8 Erroneous alignments are in bold. (Top) Example of an error made by our model (from file ZF207-585-936). From DOS should be null generated, but the model has erroneously aligned it to an identical phrase that appeared 11 sentences earlier in the document. (Bottom) Error (from ZF207-772-628); The DMP 300 should be aligned to the printer but is instead aligned to a far-away occurrence of The DMP 300.
<Paragraph position="2"> Indeed, in the case of the relative jump model, the precision jumps from 0.456 to 0.523 (F-score increases from 0.548 to 0.594) in the case of all words, and from 0.512 to 0.559 (F-score increases from 0.593 to 0.624) when stop words are ignored. This corresponds to a relative improvement of roughly 8% F-score. Increases in score for the syntax-based model are roughly the same.</Paragraph>
<Paragraph position="3"> The second mistake our model frequently makes is to trust the identity rewrite model too strongly. This problem has to do either with synonyms that do not appear frequently enough for the system to learn reliable rewrite probabilities, or with coreference issues, in which the system chooses to align, for instance, Microsoft to Microsoft, rather than Microsoft to the company, as might be correct in context. As suggested by this example, this problem is typically manifested in the context of coreferential noun phrases. It is difficult to perform an analysis of this problem similar to the one above (to obtain an upper bound on performance), but we can provide some evidence. As mentioned before, in the human alignments, roughly 51% of all aligned phrases are lexically identical. In the alignments produced by our model (on the same documents), this number is 69%. In the case of stem identity, the hand-aligned data suggest that stem identity should hold in 67% of the cases; in our alignments, this number was 81%. An example of this sort of error is shown in Example 2 in Figure 8. Here, the model has aligned The DMP 300 in the abstract to The DMP 300 in the document, while it should have been aligned to the printer due to locality constraints (note that the model also misses the (produces - producing) alignment, likely as a side effect of making the error depicted in bold).</Paragraph>
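The identity statistics quoted above (51% vs. 69% lexical identity, 67% vs. 81% stem identity) can be computed directly from any set of phrase alignments. The following is a small illustrative sketch of that computation, not the authors' evaluation code; the stemmer, the representation of null alignments as empty spans, and the function names are assumptions.

# Illustrative sketch: measure how often aligned phrase pairs are identical,
# lexically or up to stem, in a set of (summary_phrase, document_phrase) pairs.

def stem(token):
    return token.lower().rstrip("s")  # crude placeholder; a real stemmer is assumed

def identity_rates(aligned_pairs):
    """aligned_pairs: iterable of (summary_phrase, document_phrase) token tuples,
    with null alignments represented by empty spans (and skipped here).
    Returns (lexical identity rate, stem identity rate)."""
    pairs = [p for p in aligned_pairs if p[0] and p[1]]
    lexical = sum(1 for s, d in pairs
                  if [t.lower() for t in s] == [t.lower() for t in d])
    stemmed = sum(1 for s, d in pairs
                  if [stem(t) for t in s] == [stem(t) for t in d])
    total = len(pairs) or 1
    return lexical / total, stemmed / total

A model whose identity rates are markedly higher than the human rates, as reported above, is over-trusting the identity rewrite model.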
<Paragraph position="4"> In Table 5, we have shown examples of common errors made by our system (these were randomly selected from a much longer list of errors). These examples are shown out of their contexts, but in most cases the error is clear even so. In the first column, we show the summary phrase in question. In the second column, we show the document phrase to which it should be aligned, and in the third column, we show the document phrase that our model aligned it to (or null). In the right column, we classify the model's alignment as incorrect or partially correct.</Paragraph>
Table 5 Ten example phrase alignments from the hand-annotated corpus; the last column indicates whether the semi-HMM correctly aligned this phrase.
Summary Phrase            | True Phrase             | Aligned Phrase   | Class
to port                   | can port                | to port          | incorrect
OS - 2                    | the OS / 2              | OS / 2           | partial
will use                  | will be using           | will using       | partial
word processing programs  | word processors         | word processing  | incorrect
consists of               | also includes           | null of          | partial
will test                 | will also have to test  | will test        | partial
the potential buyer       | many users              | the buyers       | incorrect
The new software          | Crosstalk for Windows   | new software     | incorrect
are generally powered by  | run on                  | null             | incorrect
Oracle Corp.              | the software publisher  | Oracle Corp.     | incorrect
<Paragraph position="5"> The errors shown in Table 5 illustrate several weaknesses of the model. For instance, in the first example, it aligns to port with to port, which seems correct without context, but the chosen occurrence of to port in the document is in the discussion of a completely different porting process than that referred to in the summary (and is several sentences away). The seventh and tenth examples (The new software and Oracle Corp., respectively) show instances of the coreference error that occurs commonly.</Paragraph> </Section> </Section> </Paper>