<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1042">
<Title>Deep Read: A Reading Comprehension System</Title>
<Section position="4" start_page="325" end_page="326" type="intro">
<SectionTitle> 2 Evaluation </SectionTitle>
<Paragraph position="0"> We had three goals in choosing evaluation metrics for our system. First, the evaluation should be automatic. Second, it should maintain comparability with human benchmarks. Third, it should require little or no effort to prepare new answer keys. We used three metrics, P&R, HumSent, and AutSent, which satisfy these constraints to varying degrees.</Paragraph>
<Paragraph position="1"> P&R was the precision and recall on stemmed content words [2], comparing the system's response at the word level to the answer key provided by the test's publisher. HumSent and AutSent compared the sentence chosen by the system to a list of acceptable answer sentences, scoring one point for a response on the list, and zero points otherwise. In all cases, the score for a set of questions was the average of the scores for each question.</Paragraph>
<Paragraph position="2"> For P&R, the answer key from the publisher was used unmodified. The answer key for HumSent was compiled by a human annotator, who examined the texts and chose the sentence(s) that best answered the question, even where the sentence also contained additional (unnecessary) information. For AutSent, an automated routine replaced the human annotator, examining the texts and choosing the sentences, this time based on which one had the highest recall compared against the published answer key.</Paragraph>
<Paragraph position="3"> Footnote 1: These materials consisted of levels 2-5 of &quot;The 5 W's&quot; written by Linda Miller, which can be purchased from Remedia Publications, 10135 E. Via Linda #D124, Scottsdale, AZ 85258.</Paragraph>
<Paragraph position="4"> Footnote 2: Precision and recall are defined as follows: precision = (# matched content words) / (# content words in the system response); recall = (# matched content words) / (# content words in the answer key). Repeated words in the answer key match or fail together. All words are stemmed and stop words are removed. At present, the stop-word list consists of forms of be, have, and do, personal and possessive pronouns, the conjunctions and, or, the prepositions to, in, at, of, the articles a and the, and the relative and demonstrative pronouns this, that, and which.</Paragraph>
<Paragraph position="5"> Figure 2:
Query: What is the name of our national library?
Story extract:
1. But the Library of Congress was built for all the people.
2. From the start, it was our national library.
Answer key: Library of Congress</Paragraph>
<Paragraph position="6"> For P&R we note that in Figure 2, there are two content words in the answer key (library and congress) and sentence 1 matches both of them, for 2/2 = 100% recall. There are seven content words in sentence 1, so it scores 2/7 = 29% precision. Sentence 2 scores 1/2 = 50% recall and 1/6 = 17% precision. The human preparing the list of acceptable sentences for HumSent has a problem. Sentence 2 responds to the question, but requires pronoun coreference to give the full answer (the antecedent of it). Sentence 1 contains the words of the answer, but the sentence as a whole doesn't really answer the question. In this and other difficult cases, we have chosen to list no answers for the human metric, in which case the system receives zero points for the question. This occurs 11% of the time in our test corpus. The question is still counted, meaning that the system receives a penalty in these cases. Thus the highest score a system could achieve for HumSent is 89%.</Paragraph>
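<Paragraph position="7"> To make the P&R computation and the AutSent key selection concrete, the following is a minimal Python sketch of the procedure described above; it is not the authors' implementation. The stemmer is a placeholder and the stop list is abridged, so exact counts can differ slightly from the 2/7 and 1/6 figures quoted above.

# Hedged sketch of the word-level P&R metric and AutSent key selection.
# Assumptions: identity "stemming" and an abridged stop list; a real system
# would plug in a proper morphological stemmer and the full stop-word list.

STOP_WORDS = {
    "be", "am", "is", "are", "was", "were", "been", "being",
    "have", "has", "had", "do", "does", "did",
    "i", "you", "he", "she", "it", "we", "they",
    "my", "your", "his", "her", "its", "our", "their",
    "and", "or", "to", "in", "at", "of", "a", "the",
    "this", "that", "which",
}

def stem(word):
    # Placeholder: the paper stems words; here we only lowercase and trim punctuation.
    return word.lower().strip(".,?!")

def content_words(text):
    # Tokenize, "stem", and drop stop words, as in footnote 2.
    stems = [stem(tok) for tok in text.split()]
    return [s for s in stems if s and s not in STOP_WORDS]

def precision_recall(response, answer_key):
    resp = content_words(response)
    key = set(content_words(answer_key))
    matched = [w for w in resp if w in key]
    precision = len(matched) / len(resp) if resp else 0.0
    recall = len(set(matched)) / len(key) if key else 0.0
    return precision, recall

def autsent_key(story_sentences, answer_key):
    # AutSent answer-key preparation: keep the sentence(s) with the highest recall.
    scored = [(precision_recall(s, answer_key)[1], s) for s in story_sentences]
    best = max(r for r, _ in scored)
    return [s for r, s in scored if r == best]

if __name__ == "__main__":
    key = "Library of Congress"
    sents = [
        "But the Library of Congress was built for all the people.",
        "From the start, it was our national library.",
    ]
    for s in sents:
        p, r = precision_recall(s, key)
        print(f"P={p:.0%} R={r:.0%}  {s}")
    print("AutSent key:", autsent_key(sents, key))
</Paragraph>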
<Paragraph position="8"> Given that our current system can only respond with sentences from the text, this penalty is appropriate. The automated routine for preparing the answer key in AutSent selects as the answer key the sentence(s) with the highest recall (here sentence 1). Thus only sentence 1 would be counted as a correct answer.</Paragraph>
<Paragraph position="9"> We have implemented all three metrics.</Paragraph>
<Paragraph position="10"> HumSent and AutSent are comparable with human benchmarks, since they provide a binary score, as would a teacher grading a student's answer. In contrast, the precision and recall scores of P&R lack such a straightforward comparability.</Paragraph>
<Paragraph position="11"> However, word recall from P&R (called AnsWdRecall in Figure 3) closely mimics the scores of HumSent and AutSent. The correlation coefficient for AnsWdRecall to HumSent in our test set is 98%, and from HumSent to AutSent is also 98%. With respect to ease of answer key preparation, P&R and AutSent are clearly superior, since they use the publisher-provided answer key; HumSent requires human annotation for each question, which we found to be of moderate difficulty. Finally, we note that precision, as well as recall, will be useful for evaluating systems that can return clauses or phrases, possibly constructed, rather than whole-sentence extracts as answers.</Paragraph>
<Paragraph position="12"> Since most national standardized tests feature a large multiple-choice component, many available benchmarks are multiple-choice exams. Also, although our short-answer metrics do not impose a penalty for incorrect answers, multiple-choice exams, such as the Scholastic Aptitude Tests, do. In real-world applications, it might be important that the system be able to assign a confidence level to its answers; penalizing incorrect answers would help guide development in that regard. While we were initially concerned that adapting the system to multiple-choice questions would endanger the goal of real-world applicability, we have experimented with minor changes to handle the multiple-choice format.</Paragraph>
<Paragraph position="13"> Initial experiments indicate that we can use essentially the same system architecture for both short-answer and multiple-choice tests.</Paragraph>
</Section>
</Paper>