<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1117">
  <Title>Automatically Evaluating Answers to Definition Questions (HLT/EMNLP 2005, pages 931-938, Vancouver, October 2005)</Title>
  <Section position="3" start_page="931" end_page="932" type="intro">
    <SectionTitle>
2 Evaluating Definition Questions
</SectionTitle>
    <Paragraph position="0"> To date, NIST has conducted two formal evaluations of definition questions, at TREC 2003 and TREC 2004.1 In this section, we describe the setup of the task and the evaluation methodology.</Paragraph>
    <Paragraph position="1"> Answers to definition questions are comprised of an unordered set of [document-id, answer string] pairs, where the strings are presumed to provide some relevant information about the entity being &amp;quot;defined&amp;quot;, usually called the target. Although no explicit limit is placed on the length of the answer string, the final scoring metric penalizes verbosity (discussed below).</Paragraph>
    <Paragraph position="2"> To evaluate system responses, NIST pools answer strings from all systems, removes their association with the runs that produced them, and presents them to a human assessor. Using these responses and researchperformedduringtheoriginaldevelopmentof null the question, the assessor creates an &amp;quot;answer key&amp;quot;-a list of &amp;quot;information nuggets&amp;quot; about the target. An information nugget is defined as a fact for which the assessor could make a binary decision as to whether a response contained that nugget (Voorhees, 2003).</Paragraph>
    <Paragraph position="3"> The assessor also manually classifies each nugget as 1TREC 2004 questions were arranged around &amp;quot;topics&amp;quot;; definition questions were implicit in the &amp;quot;other&amp;quot; questions.  [XIE19971012.0112] The Cassini space probe, due to be launched from Cape Canaveral in Florida of the United States tomorrow, has a 32 kilogram plutonium fuel payload to power its seven year journey to Venus and Saturn.</Paragraph>
    <Paragraph position="4"> Nuggets assigned: 1, 2 [NYT19990816.0266] Early in the Saturn visit, Cassini is to send a probe named Huygens into the smog-shrouded atmosphere of Titan, the planet's largest moon, and parachute instruments to its hidden surface to see if it holds oceans of ethane or other hydrocarbons over frozen layers of methane or water. Nuggets assigned: 4, 5, 6  either vital or okay. Vital nuggets represent concepts that must be present in a &amp;quot;good&amp;quot; definition; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential; cf. (Hildebrandt et al., 2004). As an example, nuggets for the question &amp;quot;What is the Cassini space probe?&amp;quot; are shown in Table 1.</Paragraph>
    <Paragraph position="5"> Once this answer key of vital/okay nuggets is created,theassessorthenmanuallyscoreseachrun. For each system response, he or she decides whether or not each nugget is present. Assessors do not simply perform string matches in this decision process; rather, this matching occurs at the conceptual level, abstracting away from issues such as vocabulary differences, syntactic divergences, paraphrases, etc.</Paragraph>
    <Paragraph position="6"> Two examples of this matching process are shown in Figure 1: nuggets 1 and 2 were found in the top passage, while nuggets 4, 5, and 6 were found in the bottom passage. It is exactly this process of conceptually matching nuggets from the answer key with system responses that we attempt to capture with an automatic scoring algorithm.</Paragraph>
    <Paragraph position="7"> The final F-score for an answer is calculated in the manner described in Figure 2, and the final score of a run is simply the average across the scores of all questions. The metric is a harmonic mean between nugget precision and nugget recall, where recall is heavily favored (controlled by the b parameter, set to five in 2003 and three in 2004). Nugget recall is calculated solely on vital nuggets, while nugget precision is approximated by a length allowance given based on the number of both vital and okay nuggets returned. Early on in a pilot study, researchers discovered that it was impossible for assessors to con-</Paragraph>
    <Paragraph position="9"> in a system response, given that they were usually extracted text fragments from documents (Voorhees, 2003). Thus, a penalty for verbosity serves as a surrogate for precision.</Paragraph>
  </Section>
class="xml-element"></Paper>