<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2105">
<Title>A Logic-based Semantic Approach to Recognizing Textual Entailment</Title>
<Section position="9" start_page="823" end_page="825" type="evalu">
<SectionTitle>7 Experiments and Results</SectionTitle>
<Paragraph position="0"> The benchmark corpus for the RTE 2005 task consists of seven subsets with a 50%-50% split between positive and negative entailment examples. Each subset corresponds to a different NLP application: Information Retrieval (IR), Comparable Documents (CD), Reading Comprehension (RC), Question Answering (QA), Information Extraction (IE), Machine Translation (MT), and Paraphrase Acquisition (PP). The RTE data set includes 1367 English (T, H) pairs from the news domain (political, economic, etc.). The RTE 2006 data covered only four NLP tasks (IE, IR, QA, and Multi-document Summarization (SUM)), with an identical split between positive and negative examples. Table 2 presents the data statistics.</Paragraph>
<Section position="1" start_page="823" end_page="824" type="sub_section">
<SectionTitle>7.1 COGEX's Results</SectionTitle>
<Paragraph position="0"> Tables 3 and 4 summarize COGEX's performance on the RTE datasets when it received as input the logic forms derived from the different sources. For the RTE 2005 data, we list the confidence-weighted score (cws) (Dagan et al., 2005) and, for the RTE 2006 data, the average precision (ap) measure (Bar-Haim et al., 2006).</Paragraph>
<Paragraph position="1"> On the RTE 2005 data, the overall performance on the test set is similar for the two logic proving runs, COGEX_C (the run on constituency-based logic forms) and COGEX_D (the run on dependency-based logic forms). On the development set, the semantically enhanced logic forms helped the prover better distinguish the positive entailments (COGEX_C has a higher overall precision than COGEX_D). On the test data, COGEX_C performs slightly better on the MT, CD, and PP tasks and worse on the RC, IR, and QA tasks. The major differences between the two logic forms are their semantic content (incomplete for the dependency-derived logic forms) and, because the text's tokenization differs between the two parsers, the number of predicates in the resulting logic forms, which leads to completely different proof scores.</Paragraph>
<Paragraph position="2"> On the RTE 2006 test data, the system which uses the dependency logic forms outperforms COGEX_C. COGEX_D performs better on almost all tasks (except SUM) and brings a significant improvement over COGEX_C on the IR task. Some of the positive examples that the systems did not label correctly require world knowledge that we have not encoded in our axiom set. One example for which both systems returned the wrong answer is pair 353 (test set, RTE 2006), where, from China's decade-long practice of keeping its currency valued at around 8.28 yuan to the dollar, the system should recognize the relation between the yuan and China's currency and infer that the currency used in China is the yuan, because a country's currency entails the currency used in that country. Some of the pairs that the prover currently cannot handle involve numeric calculation and human-oriented estimation. Consider, for example, pair 359 (dev set, RTE 2006), labeled as positive, for which the logic prover could not determine that 15 safety violations entails numerous safety violations.</Paragraph>
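<Paragraph> The two failures above call for world-knowledge axioms of roughly the following form, shown here as a first-order sketch; the predicate names and the threshold are illustrative and do not reproduce the actual logic-form vocabulary or axiom set:
\forall c\, \forall m\; \big( \mathit{country}(c) \wedge \mathit{currency\_of}(c, m) \big) \rightarrow \mathit{currency\_used\_in}(c, m)
\forall x\, \forall n\; \big( \mathit{count}(x, n) \wedge n \geq \theta \big) \rightarrow \mathit{numerous}(x)
where \theta stands for an assumed "large enough" threshold approximating the human judgment behind numerous.</Paragraph>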
<Paragraph position="3"> A deeper analysis of the systems' output showed that, while WordNet lexical chains and NLP axioms are the most frequently used axioms throughout the proofs, the semantic and temporal axioms bring the highest improvement in accuracy on the RTE data.</Paragraph>
</Section>
<Section position="2" start_page="824" end_page="824" type="sub_section">
<SectionTitle>7.2 Lexical Alignment</SectionTitle>
<Paragraph position="0"> Inspired by the positive examples whose hypothesis H is, to a large degree, lexically subsumed by the text T, we developed a shallow system which measures their overlap by computing an edit distance between the text and the hypothesis. The cost of deleting a word from T is 0, while the costs of replacing a word from T with a word from H and of inserting a word from H vary with the part of speech of the inserted word (higher values for WordNet nouns, adjectives, and adverbs; lower values for verbs; and a minimum value for everything else). Table 5 shows a minimum cost alignment.</Paragraph>
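<Paragraph> The alignment can be computed with a standard dynamic-programming edit distance that uses asymmetric costs. The sketch below is illustrative only, not the system's implementation: the per-part-of-speech insertion costs, the treatment of mismatched substitutions, and the normalization are assumed values, since the text specifies only that deletions from T are free, that the remaining costs depend on the part of speech of the inserted word, and that the final score is a normalized edit distance cost.

# Assumed per-POS insertion costs (hypothetical values).
INSERT_COST = {"noun": 1.0, "adj": 1.0, "adv": 1.0, "verb": 0.5}
DEFAULT_INSERT_COST = 0.1  # minimum value for everything else

def insert_cost(pos_tag):
    return INSERT_COST.get(pos_tag, DEFAULT_INSERT_COST)

def alignment_cost(text, hyp, hyp_pos):
    """Minimum edit distance between the token lists `text` and `hyp`,
    where `hyp_pos[j]` is the part-of-speech tag of hyp[j]."""
    n, m = len(text), len(hyp)
    # dp[i][j] = minimum cost of aligning text[:i] with hyp[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):              # hypothesis words must be inserted
        dp[0][j] = dp[0][j - 1] + insert_cost(hyp_pos[j - 1])
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0]            # deleting text words is free
        for j in range(1, m + 1):
            delete = dp[i - 1][j]          # drop text[i-1] at no cost
            insert = dp[i][j - 1] + insert_cost(hyp_pos[j - 1])
            if text[i - 1].lower() == hyp[j - 1].lower():
                match = dp[i - 1][j - 1]   # identical words align for free
            else:
                # Assumption: a mismatched substitution costs as much as
                # inserting the hypothesis word.
                match = dp[i - 1][j - 1] + insert_cost(hyp_pos[j - 1])
            dp[i][j] = min(delete, insert, match)
    return dp[n][m]

def lexalign_score(text, hyp, hyp_pos):
    # Normalizing by the worst-case insertion cost is also an assumption.
    worst = sum(insert_cost(p) for p in hyp_pos) or 1.0
    return 1.0 - alignment_cost(text, hyp, hyp_pos) / worst

Under these assumptions, a hypothesis that is fully contained in the text aligns at zero cost and receives a score of 1, while a hypothesis with no lexical support approaches 0, matching the way the combined classifier in Section 7.3 interprets scores.</Paragraph>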
<Paragraph position="1"> The performance of this lexical method (LEXALIGN) is shown in Tables 3 and 4. The alignment technique performs significantly better on the (T, H) pairs of the CD (RTE 2005) and SUM (RTE 2006) tasks. For these tasks, all three systems performed best because the text of the false pairs does not entail the hypothesis even at the lexical level. For pair 682 (test set, RTE 2006), T and H have very few overlapping words and there are no axioms that can be used to derive knowledge that supports the hypothesis. In contrast, for the IE task, the systems were fooled by the high word overlap between T and H. For example, the text of pair 678 (test set, RTE 2006) contains the entire hypothesis in its if clause. For this task, we had the highest number of false positives, around double compared to the other applications. LEXALIGN works surprisingly well on the RTE data. It outperforms the semantic systems on the 2005 QA test data, but it has its limitations. The logic representations are generated from parse trees which are not always accurate (around 86% accuracy). Once syntactic and semantic parsers are perfected, the logical semantic approach will prove its potential.</Paragraph>
</Section>
<Section position="3" start_page="824" end_page="825" type="sub_section">
<SectionTitle>7.3 Merging three systems</SectionTitle>
<Paragraph position="0"> Because the two logical representations and the lexical method are very different and perform better on different sets of tasks, we combined the scores returned by each system to see whether a mixed approach performs better than each individual method. For each NLP task, we built a classifier based on a linear combination of the three scores. Each task's classifier labels a pair as positive or negative according to its combined score, with a value close to 0 indicating a probable negative example and a value close to 1 indicating a probable positive example. Each (T, H) pair's lexical alignment score, score_LEXALIGN, is the normalized average edit distance cost. The coefficients of the linear combination were determined using a grid search on each development set. Given the different nature of each application, these coefficients vary with each task. For example, the final score given to each IE 2006 pair is highly dependent on the score given by COGEX when it received as input the logic forms created from the constituency parse trees, with a small correction from the dependency parse tree logic form system. For the IE task, the lexical alignment performs the worst among the three systems. On the other hand, for the IR task, the score given by LEXALIGN is taken into account.</Paragraph>
<Paragraph position="1"> Tables 3 and 4 summarize the performance of the three-system combination. This hybrid approach performs better than all the other systems for all measures on all tasks. It displays the same behavior as the systems it combines: high accuracy on the CD and SUM tasks and many false positives for the IE task.</Paragraph>
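<Paragraph> A minimal sketch of the per-task score combination described above is given below. The weight values, the 0.5 decision threshold, and the grid step are illustrative assumptions; the text states only that the coefficients of the linear combination are found by a grid search on each task's development set.

import itertools

# Illustrative sketch of the per-task linear combination of the three system
# scores (COGEX on constituency logic forms, COGEX on dependency logic forms,
# LEXALIGN). All numeric settings below are assumptions for illustration.

def combine(weights, scores):
    """Linear combination of the three normalized scores for one pair."""
    return sum(w * s for w, s in zip(weights, scores))

def classify(weights, scores, threshold=0.5):
    # A combined value close to 1 suggests entailment; close to 0 suggests none.
    return combine(weights, scores) >= threshold

def grid_search(dev_pairs, step=0.1):
    """dev_pairs: list of ((score_c, score_d, score_lex), gold_label) tuples
    for one NLP task. Returns the weight triple with the best dev accuracy."""
    grid = [round(step * k, 2) for k in range(int(round(1 / step)) + 1)]
    best_weights, best_acc = (1.0, 0.0, 0.0), -1.0
    for w in itertools.product(grid, repeat=3):
        correct = sum(classify(w, s) == label for s, label in dev_pairs)
        acc = correct / len(dev_pairs)
        if acc > best_acc:
            best_weights, best_acc = w, acc
    return best_weights

Searching the three weights separately for each task mirrors the observation above that the best mix differs across applications: the constituency-based COGEX score dominates for IE, while the LEXALIGN score contributes for IR.</Paragraph>
</Section>
</Section>
</Paper>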