File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/03/n03-2037_abstr.xml
Size: 2,929 bytes
Last Modified: 2025-10-06 13:42:47
<?xml version="1.0" standalone="yes"?> <Paper uid="N03-2037"> <Title>Evaluating Answers to Definition Questions</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper describes an initial evaluation of systems that answer questions seeking definitions. The results suggest that humans agree sufficiently on the basic concepts that should be included in the definition of a particular subject to permit the computation of concept recall. Computing concept precision is more problematic, however. Using the length in characters of a definition is a crude approximation to concept precision, yet it is sufficient to correlate with humans' subjective assessment of definition quality.</Paragraph> <Paragraph position="1"> The TREC question answering track has sponsored a series of evaluations of systems' abilities to answer closed-class questions in many domains (Voorhees, 2001). Closed-class questions are fact-based, short-answer questions. The evaluation of QA systems for closed-class questions is relatively simple because a response to such a question can be meaningfully judged on a binary scale of right/wrong. Increasing the complexity of the question type even slightly makes the evaluation significantly more difficult, because partial credit for responses must then be accommodated.</Paragraph> <Paragraph position="2"> The ARDA AQUAINT program is a research initiative sponsored by the U.S. Department of Defense aimed at expanding the kinds and increasing the difficulty of the questions that automatic systems can answer. A series of pilot evaluations has been planned as part of the research agenda of the AQUAINT program. The purpose of each pilot is to develop an effective evaluation methodology for systems that answer a certain kind of question. 
One of the first pilots to be implemented was the Definitions Pilot, which aimed to develop an evaluation methodology for questions such as "What is mold?" and "Who is Colin Powell?".</Paragraph> <Paragraph position="3"> aquaint/index.html.</Paragraph> <Paragraph position="4"> This paper presents the results of the pilot evaluation. The pilot demonstrated that human assessors generally agree on the concepts that should appear in the definition of a particular subject, and can find those concepts in the systems' responses. Such judgments support the computation of concept recall but not concept precision, since it is not feasible to enumerate all concepts contained within a system response. Instead, the length of a response is used to approximate concept precision.</Paragraph> <Paragraph position="5"> An F-measure score combining concept recall and length is used as the final metric for a response. The ranking of systems by average F score correlates well with assessors' subjective opinions of definition quality.</Paragraph> </Section> </Paper>