<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1209"> <Title>Generating an Entailment Corpus from News Headlines</Title> <Section position="1" start_page="0" end_page="53" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> We describe our efforts to generate a large (10,000-instance) corpus of textual entailment pairs from the lead paragraph and headline of news articles. We manually inspected a small set of news stories in order to locate the most productive source of entailments, then built an annotation interface for rapid manual evaluation of further exemplars. With this training data we built an SVM-based document classifier, which we used for corpus refinement purposes--we believe that roughly three-quarters of the resulting corpus are genuine entailment pairs. We also discuss the difficulties inherent in manual entailment judgment, and suggest ways to ameliorate some of these.</Paragraph> <Paragraph position="1"> MITRE has a long-standing interest in robust text understanding, and, like many, we believe that adequate progress in such an endeavor requires a well-designed evaluation methodology. We have explored in great depth the use of human reading comprehension exams for this purpose (Hirschman et al., 1999; Wellner et al., 2005) as well as TREC-style question answering (Burger, 2004).</Paragraph> <Paragraph position="2"> In this context, the recent Pascal RTE evaluation (Recognizing Textual Entailment, Dagan et al., 2005) captured our interest. The goal of RTE is to assess systems' abilities at judging semantic entailment with respect to a pair of sentences, e.g.: * Fred spilled wine on the carpet.</Paragraph> <Paragraph position="3"> * The rug was wet.</Paragraph> <Paragraph position="4"> In RTE parlance, the antecedent sentence is known as the text, while the consequent sentence is known as the hypothesis. Simply put, the challenge for an RTE system is to judge whether the text entails the hypothesis. Judgments are Boolean, and the primary evaluation metric is simple accuracy, although other, secondary metrics were also used in the evaluation.</Paragraph> <Paragraph position="5"> The RTE organizers provided 567 exemplar sentence pairs. This is adequate for system development, but not for the application of large-scale statistical models. In particular, we wished to cast the problem as one of statistical alignment as used in machine translation. MT systems typically use millions of sentence pairs, and so we decided to find or generate a much larger corpus. This paper describes our efforts along these lines, as well as some observations about the problems of annotating entailment data. In Section 2 we describe our initial search for an entailment corpus. Section 3 briefly describes an annotation interface we devised, as well as our efforts to refine our corpus. Section 4 explains many of the issues and problems inherent in manual annotation of entailment data.</Paragraph> <Section position="1" start_page="0" end_page="50" type="sub_section"> <SectionTitle> 2 Finding Entailment Data </SectionTitle> <Paragraph position="0"> In our study of the Pascal RTE development corpus, we found that a considerable majority of the TRUE pairs exhibit a stronger relationship than entailment; namely, the hypothesis is a paraphrase of a subset of the text. For instance, given the text John murdered Bill yesterday, the hypothesis Bill is dead is an entailment, while the hypothesis Bill was killed by John exhibits the stronger partial paraphrase relationship to the text. 
We found that 94% (131/140) of the TRUE pairs in the Pascal RTE dev2 corpus were these sorts of paraphrases.</Paragraph> <Paragraph position="1"> In our search for an entailment corpus, we observed that the headline of a news article is often a partial paraphrase of the lead paragraph, much like the RTE data, or is sometimes a genuine entailment. We thus deduced that headlines and their corresponding lead paragraphs might provide a readily available source of training data. As an initial test of this hypothesis, we manually inspected over 200 news stories from 11 different sources. We found a great deal of variety in headline formats, and ultimately found the Xinhua News Agency English Service articles from the Gigaword corpus (Graff, 2003) to be the richest source, though somewhat limited in subject domain. We describe here our data collection and analysis process.</Paragraph> <Paragraph position="2"> Because our goal was to automatically generate an extremely large corpus of exemplars, we focused on large data sources. We first examined 111 news stories culled from MiTAP (Damianos et al., 2003), which collects over one million articles per month from approximately 75 different sources. By first counting the number of articles typically collected for each source, we selected a mixture of sources that each had more than 10,000 articles for our sample period of one and a half months. As discussed further below, part way through our investigation it became clear that we needed to include more native English sources, so the Christian Science Monitor articles were added, though they fell below our arbitrary 10K mark.</Paragraph> <Paragraph position="3"> Figure 1 summarizes the MiTAP news sources examined. For each lead paragraph/headline pair, a human rendered a judgment of yes, no, or maybe as to whether the lead paragraph entailed the headline, where maybe meant that the headline was very close to being an entailment or paraphrase. This is likely equivalent to the notion of &quot;more or less semantically equivalent&quot; used in the Microsoft Research Paraphrase Corpus (Dolan et al., 2005).</Paragraph> <Paragraph position="4"> The purpose of maybe in this case was that we thought that many of the near-miss pairs would make adequate training data for statistical algorithms, in spite of being less than perfect.</Paragraph> <Paragraph position="5"> There were many types of news articles in the MiTAP data that did not yield good headline/lead paragraph pairs for our purposes. Many would be difficult to filter out using automated heuristics.</Paragraph> <Paragraph position="6"> Two frequent examples of this were opinion-editorial pieces and daily Wall Street summaries.</Paragraph> <Paragraph position="7"> Others would be more amenable to automatic elimination, including obituaries and collections of news snippets like the Washington Post's &quot;World in Brief&quot;. Articles consisting of personal narratives never yielded good headlines, but these could easily be eliminated by recognizing first person pronouns in the lead paragraph. Figure 2 shows the judgments for all the MiTAP articles examined, where the Filtered row excludes these easily eliminated article types.</Paragraph>
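To make the kind of automatic elimination described above concrete, the following is a minimal sketch of such a pre-filter, assuming plain-text headline and lead-paragraph fields. The specific patterns are illustrative assumptions, not the exact rules we applied.

```python
import re

# Illustrative patterns only; the real filtering criteria are assumptions here.
FIRST_PERSON = re.compile(r"\b(I|we|my|our)\b")               # personal narratives
BRIEF_HEADLINES = re.compile(r"world in brief|news in brief", re.IGNORECASE)
OBITUARY_HEADLINES = re.compile(r"\bobituar(y|ies)\b|\bdies at\b", re.IGNORECASE)

def keep_article(headline: str, lead_paragraph: str) -> bool:
    """Return True if the article is worth judging as a headline/lead pair."""
    if FIRST_PERSON.search(lead_paragraph):        # first-person lead paragraph
        return False
    if BRIEF_HEADLINES.search(headline):           # collections of news snippets
        return False
    if OBITUARY_HEADLINES.search(headline):        # obituaries
        return False
    return True

# Example: a first-person narrative lead is filtered out.
print(keep_article("Mandela's Speech", "I arrived in Johannesburg last week."))  # False
```

Such surface heuristics are deliberately crude; they only remove the article types that are cheap to recognize, leaving the harder cases (op-ed pieces, market summaries) for manual judgment or the classifier described in Section 3.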
<Paragraph position="8"> As Figure 2 shows, the MiTAP data did not yield a high percentage of good pairs. In addition, whether due to poor machine translation or English dialectal differences, our evaluator found it difficult to understand some of the text from sources that were not English-primary. A certain amount of ill-formed text was acceptable, since the Pascal RTE challenge included training and test data drawn from MT scenarios, but we did not wish our data to be too dominated by such sources. Thus, we selected additional native-English articles to add to our sample set.</Paragraph> <Paragraph position="9"> Despite the overall poor yield from this data, it was apparent that some news sources tended to be more fruitful than others. (Figure 1 lists, for each source, the number of articles examined and the number of articles collected in 1.5 months.) For example, 13 out of 18 of the Washington Post articles yielded good pairs, as opposed to only 1 of the 11 Christian Science Monitor articles.</Paragraph> <Paragraph position="10"> This generalization was likewise true in the second corpus we examined, the Gigaword newswire corpus (Graff, 2003). Gigaword contains over 4 million documents from four news sources: * Agence France Presse English Service (AFE) * Associated Press Worldstream English Service (APW) * The New York Times Newswire Service (NYT) * The Xinhua News Agency English Service (XIE) For each source, Gigaword articles are classified into several types, including newswire advisories, etc. We restricted our investigations to actual news stories. As Figure 3 shows, overall results were much the same as for the MiTAP articles, but 85% of the XIE articles yielded adequate pairs. Based on these preliminary results we decided to focus further manual investigations on the XIE articles from Gigaword. We also decided to expend some effort on an annotation tool that would allow us to proceed more quickly than the early annotation experiments described above.</Paragraph> </Section> <Section position="2" start_page="50" end_page="51" type="sub_section"> <SectionTitle> 3 Refining the Data </SectionTitle> <Paragraph position="0"> MITRE has developed a series of annotation tools for a variety of linguistic phenomena (Day et al., 1997; Day et al., 2004), but these are primarily designed for fine-grained tasks such as named entity and syntactic annotation. For our headline corpus, we wanted the ability to rapidly annotate at a document level from a small set of categories. Further, we wanted the interface to easily support distributed annotation efforts.</Paragraph> <Paragraph position="1"> The resulting annotation interface is shown in Figure 4. It is web-based, and annotations and other document information are stored in an SQL database. The document to be evaluated is displayed in the user's chosen browser, with the XML document zoning tags visible so that the user can easily identify the headline and lead paragraph. At the top of the document are three buttons from which to select a yes/no/maybe judgment. The user can also add a comment before moving to the next document. Typically several documents can be judged per minute. The client-server architecture supports multiple annotations of the same document by different annotators--accordingly, it has a mode enabling reconciliation of inter-annotator disagreements. All further annotation efforts discussed below were carried out with this tool.</Paragraph>
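As a concrete illustration of the storage layer behind such an interface, the following is a minimal sketch assuming SQLite. The table layout, column names, and queries are our own illustration, not the schema of the actual tool.

```python
import sqlite3

# Minimal sketch of a judgment store for a document-level annotation interface.
# All names here are illustrative assumptions.
conn = sqlite3.connect("headline_judgments.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS documents (
    doc_id    TEXT PRIMARY KEY,          -- e.g., a Gigaword document id
    headline  TEXT,
    lead_para TEXT
);
CREATE TABLE IF NOT EXISTS judgments (
    doc_id    TEXT,
    annotator TEXT,
    label     TEXT CHECK (label IN ('yes', 'no', 'maybe')),
    comment   TEXT,
    PRIMARY KEY (doc_id, annotator)      -- multiple annotators per document
);
""")

def record_judgment(doc_id, annotator, label, comment=""):
    conn.execute(
        "INSERT OR REPLACE INTO judgments VALUES (?, ?, ?, ?)",
        (doc_id, annotator, label, comment),
    )
    conn.commit()

def disagreements():
    """Documents whose annotators disagree, feeding the reconciliation mode."""
    return conn.execute(
        "SELECT doc_id FROM judgments GROUP BY doc_id "
        "HAVING COUNT(DISTINCT label) > 1"
    ).fetchall()
```

The essential design points are the three-way label, the free-text comment per judgment, and one row per annotator per document so that disagreements can be queried and reconciled later.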
<Paragraph position="2"> Using the tool, we tagged approximately 900 randomly chosen Gigaword documents, including 520 XIE documents. From this, we estimate that 70% of the XIE headlines in Gigaword are entailed by the corresponding lead paragraph. (This is lower than the rough estimate described in Section 2, but that was based on a very small sample.) We decided to explore ways to refine the data in order to arrive at a smaller, but less noisy, subcorpus. We observed that different subgenres within the newspaper corpus evinced the lead-entails-headline quality to different degrees. For example, articles about sports or entertainment often had whimsical (non-entailed) headlines, while articles about politics or business more frequently had the headline quality we sought.</Paragraph> <Paragraph position="3"> Accordingly, we decided to treat the data refinement process as a text classification problem, one of finding the mix of genres or topics that would most likely possess the lead-entails-headline quality. We used SVM-light (Joachims, 2002) as a document classifier, training it on the initial set of annotated articles. (Note that these text classification experiments made use of the entire article, not just the lead and headline.) We experimented with a variety of feature representations and SVM parameters, but found the best performance with a Boolean bag-of-words representation and a simple linear kernel. Leave-one-out estimates indicate that SVM-light could identify documents with the requisite entailment quality with 77% accuracy.</Paragraph> <Paragraph position="4"> We performed one round of active learning (Tong &amp; Koller, 2000), in which we used SVM-light to classify a large subset of the unannotated corpus, and then selected a 100-document subset about which the classifier was least certain. The rationale is that annotating these uncertain documents will be more informative to further learning runs than a randomly selected subset. In the case of large-margin classifiers like SVMs, the natural choice is to select the instances closest to the margin. These were then annotated, and added back to the training data for the next learning run. However, leave-one-out estimates indicated that the classifier benefited little from these new instances.</Paragraph> <Paragraph position="5"> As described above, we estimate that the base rate of the headline entailment property in the XIE portion of Gigaword is 70%. Our hypothesis in training the SVM was that we could identify a smaller but less noisy subset. In order to evaluate this, we ran the trained SVM on all 679,000 of the unannotated XIE documents, and selected the 10,000 &quot;best&quot; instances--that is, the documents most likely (according to the SVM) to evince the headline quality. We selected a random subset of these best documents, and annotated them to evaluate our hypothesis. 74% of these possessed the lead-entails-headline property, a difference of 4% absolute over the XIE base rate. We used the lead-headline pairs from this 10,000-best subset to train our MT-alignment-based system for the RTE evaluation (Bayer et al., 2005). This system was one of the best performers in the evaluation, which we ascribe to our large training corpus. Later examination showed that the 4% &quot;improvement&quot; in purity is not statistically significant. We intend to perform further experiments in data refinement, but this may prove unnecessary. Perhaps the base rate of the entailment phenomenon in the XIE documents is sufficient to train an effective alignment-based entailment system. In this case, all of the XIE documents could be used, perhaps resulting in a more robust, and even better performing, system.</Paragraph>
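The refinement pipeline described in this section can be summarized in a short sketch. Our experiments used SVM-light; the sketch below substitutes scikit-learn's LinearSVC purely for illustration, and the function name, arguments, and batch sizes are illustrative assumptions rather than our exact setup.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# labeled_texts: full article texts; labels: 1 if the lead entails the headline, else 0.
# unlabeled_texts: the remaining pool (e.g., XIE articles). These inputs are assumptions.
def refine_corpus(labeled_texts, labels, unlabeled_texts,
                  n_uncertain=100, n_best=10000):
    # Boolean bag-of-words features with a linear classifier, mirroring the
    # configuration described above.
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(labeled_texts)
    clf = LinearSVC()
    clf.fit(X, labels)

    U = vectorizer.transform(unlabeled_texts)
    margins = clf.decision_function(U)        # signed distance to the hyperplane

    # Active learning: the least certain documents lie closest to the margin.
    uncertain_idx = np.argsort(np.abs(margins))[:n_uncertain]

    # Corpus refinement: keep the documents the classifier is most confident
    # possess the lead-entails-headline quality.
    best_idx = np.argsort(-margins)[:n_best]
    return uncertain_idx, best_idx
```

In our setting the labeled set came from the tool-annotated documents and the unlabeled pool was the 679,000 remaining XIE documents; the top-scoring documents formed the 10,000-best subset evaluated above.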
</Section> <Section position="3" start_page="51" end_page="53" type="sub_section"> <SectionTitle> 4 Judging Headline Entailments </SectionTitle> <Paragraph position="0"> In the process of generating the training data, we doubly-judged an additional 300 XIE documents to measure inter-judge reliability. As in the pilot phase described above, each pair was labeled as yes, no, or maybe. In addition, the judges were given a comment field to record their reasoning and misgivings. The judging was performed in two steps, first on a set of 100 documents and then on a set of 200. One of the judges was already well versed in the RTE task, and had performed the earlier pilot investigations. Prior to judging the first set, the second judge was given a brief verbal overview of the task. After the first 100 documents had been doubly judged, the more experienced judge then reviewed the differences and drafted a set of guidelines. The guidelines provided a synopsis of the official RTE guidelines, plus a few rules unique to headlines. For example, one rule specified what to do when partial entailment only held if the lead were combined with location or date information from the dateline. The two evaluators then judged the second set. The results for both sets are shown in Figure 5.</Paragraph> <Paragraph position="1"> As these results show, the guidelines had only a small effect on the strict measure of agreement.</Paragraph> <Paragraph position="2"> Three problem areas existed: (1) Raw, messy data. The Gigaword corpus was automatically collected and zoned. Thus, the headlines in particular contained a number of irregularities that made it difficult to judge their appropriateness. Such irregularities included truncations, phrases lacking any proposition, prepended alerts like URGENT:, and bylines and datelines miszoned into the headline.</Paragraph> <Paragraph position="3"> (2) Disagreement on what constitutes synonymy. Our judges found they had irreconcilable differences about differences in meaning. For example, in the following pair, the judges disagreed about whether safe operation in the lead paragraph meant the same thing as, and thus entailed, operates smoothly in the headline: * Shanghai's Hongqiao Airport Operates Smoothly * As of Saturday, Shanghai's Hongqiao Airport has performed safe operation for some 2,600 consecutive days, setting a record in the country.</Paragraph> <Paragraph position="4"> (3) Disagreement on the amount of world knowledge permitted. Figure 5 shows that if maybe is counted as equivalent to yes, the agreement level improves significantly. This is likely because there were two important aspects of the RTE definition of entailment that were not imparted to the second judge until the written guidelines: that one can assume &quot;common human understanding of language and some common background knowledge.&quot; However, our judges did not always agree on what counts as &quot;common,&quot; which accounts for much of the high overlap between yes and maybe. Nevertheless, our 90% agreement compares favorably to the 83% agreement rate reported by Dolan et al. (2005) for their judgments on &quot;more or less semantically equivalent&quot; pairs. Our 78% strict agreement compares favorably to the 80% agreement achieved by Dagan et al. (2005), given that our data was messier than the pairs crafted for the RTE challenge.</Paragraph>
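The agreement figures above are simple pairwise percentages. The following is a minimal sketch of this kind of computation, covering both strict agreement and the relaxed variant in which maybe is collapsed into yes; the function name and the tiny example data are ours, for illustration only.

```python
def agreement(judge_a, judge_b, collapse_maybe=False):
    """Fraction of pairs on which two judges give the same label.

    judge_a, judge_b: parallel lists of 'yes' / 'no' / 'maybe' labels.
    With collapse_maybe=True, 'maybe' is treated as equivalent to 'yes',
    as in the relaxed agreement measure discussed above.
    """
    def norm(label):
        return "yes" if collapse_maybe and label == "maybe" else label

    matches = sum(norm(a) == norm(b) for a, b in zip(judge_a, judge_b))
    return matches / len(judge_a)

# Tiny illustrative example (not our actual judgments):
a = ["yes", "maybe", "no", "yes"]
b = ["yes", "yes",   "no", "no"]
print(agreement(a, b))                       # strict: 0.5
print(agreement(a, b, collapse_maybe=True))  # relaxed: 0.75
```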
<Paragraph position="5"> Like Dagan et al. (2005), we did not force resolution on all disagreements. Disagreements over synonymy and common knowledge result in irreconcilable differences, because it is neither possible nor desirable to use guidelines to force a shared understanding of an utterance. Thus, for the first set of data 15 (15%) of the pairs were left unreconciled. In the second set, 42 (21%) were left unreconciled. Eleven (6%) of the irreconcilable pairs in the second set were due to confusion stemming from the telegraphic nature of headlines, which led to misunderstandings about how to judge truncated headlines (Chinese President Vows to Open New Chapters With) vs. headlines lacking propositions (subject headings like Mandela's Speech) vs. well-formed but terse headlines (Crackdown on Auto-Mafia in Bulgaria).</Paragraph> <Paragraph position="6"> Despite the high number of irreconcilable pairs, one encouraging sign was evident from the comment field. The judges' comments revealed that on pairs where they disagreed on how to label the pair, they often agreed on what the problem was.</Paragraph> <Paragraph position="7"> Our experience in generating a training corpus, particularly the number of irreconcilable cases we encountered, raises an important issue, namely, the feasibility of semantic equivalence tasks. We suggest that the optimum method for empirically modeling semantic equivalence is to capture the variation in human judgments. Three judges would evaluate each pair, so that there would always be a tie breaker. After reconciling disagreements arising from human error, each distinct judgment would become part of the data set. We also recommend that where there is genuine disagreement, the questionable portions of each pair be annotated in some way to capture the source of the problem, going one step further than the comment field we found beneficial in our annotation interface. The three judgments would result in a four-way classification of pairs: TTT, TTF, TFF, and FFF.</Paragraph> <Paragraph position="9"> System developers could choose to train on all the data, or limit themselves to the TTT/FFF cases.</Paragraph> <Paragraph position="10"> For evaluation purposes, the systems' results on the TTF/TFF pairs could be evaluated in light of the human variation, providing a more realistic measure of the complexity of the task.</Paragraph> <Paragraph position="11"> 5 Conclusion Given the number of natural language processing applications that require the ability to recognize semantic equivalence and entailment, there is an obvious need for both robust evaluation methodologies and adequate development and test data. We've described here our work in generating supplemental training data for the recent Pascal RTE evaluation, with which we produced a competitive system. Some news corpora provide a rich source of exemplars, and an automatic document classifier can be used to reduce the noisiness of the data.</Paragraph> <Paragraph position="12"> There are lingering difficulties in achieving high inter-judge agreement in determining paraphrase and entailment, and we believe the best way to cope with this is to allow the data to reflect the variance that exists in cross-human judgments.</Paragraph> </Section> </Section> </Paper>