<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1117"> <Title>Automatically Evaluating Answers to Definition Questions</Title> <Section position="4" start_page="932" end_page="933" type="metho"> <SectionTitle> 3 Previous Work </SectionTitle> <Paragraph position="0"> The idea of employing n-gram co-occurrence statistics to score the output of a computer system against one or more desired reference outputs was first successfully implemented in the BLEU metric for machine translation (Papineni et al., 2002). Since then, the basic method for scoring translation quality has been improved upon by others, e.g., (Babych and Hartley, 2004; Lin and Och, 2004). The basic idea has been extended to evaluating document summarization with ROUGE (Lin and Hovy, 2003).</Paragraph> <Paragraph position="1"> Recently, Soricut and Brill (2004) employed n-gram co-occurrences to evaluate question answering in a FAQ domain; unfortunately, that task differs from definition question answering, making their results not directly applicable. Xu et al. (2004) applied ROUGE to automatically evaluate answers to definition questions, viewing the task as a variation of document summarization. Because TREC answer nuggets were terse phrases, the authors found it necessary to rephrase them--two humans were asked to manually create "reference answers" based on the assessors' nuggets and IR results, which was a labor-intensive process. Furthermore, Xu et al. did not perform a large-scale assessment of the reliability of ROUGE for evaluating definition answers.</Paragraph> </Section> <Section position="5" start_page="933" end_page="933" type="metho"> <SectionTitle> 4 Criteria for Success </SectionTitle> <Paragraph position="0"> Before proceeding to our description of POURPRE, it is important to first define the basis for assessing the quality of an automatic evaluation algorithm. Correlation between official scores and automatically-generated scores, as measured by the coefficient of determination R², seems like an obvious metric for quantifying the performance of a scoring algorithm.</Paragraph> <Paragraph position="1"> Indeed, this measure has been employed in the evaluation of BLEU, ROUGE, and other related metrics.</Paragraph> <Paragraph position="2"> However, we believe that there are better measures of performance. In comparative evaluations, we ultimately want to determine whether one technique is "better" than another. Thus, the system rankings produced by a particular scoring method are often more important than the actual scores themselves. Following the information retrieval literature, we employ Kendall's τ to capture this insight.</Paragraph> <Paragraph position="3"> Kendall's τ computes the "distance" between two rankings as the minimum number of pairwise adjacent swaps necessary to convert one ranking into the other. This value is normalized by the number of items being ranked such that two identical rankings produce a correlation of 1.0; the correlation between a ranking and its perfect inverse is -1.0; and the expected correlation of two rankings chosen at random is 0.0. Typically, a value greater than 0.8 is considered "good", although 0.9 is the threshold researchers generally aim for.</Paragraph>
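To make this normalization concrete, the following minimal sketch (with made-up scores, not the code used for the experiments reported below) computes Kendall's τ for two sets of per-run scores by counting concordant and discordant pairs:

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Pairwise Kendall's tau between two score lists over the same runs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        a = scores_a[i] - scores_a[j]
        b = scores_b[i] - scores_b[j]
        if a * b > 0:
            concordant += 1
        elif a * b < 0:
            discordant += 1
        # tied pairs contribute to neither count in this simple variant
    n = len(scores_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical scores for five runs under two scoring methods.
official = [0.46, 0.41, 0.38, 0.35, 0.30]
pourpre  = [0.44, 0.43, 0.36, 0.37, 0.29]
print(kendall_tau(official, pourpre))  # 0.8 -- one of the ten pairs is swapped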
<Paragraph position="4"> In this study, we primarily focus on Kendall's τ, but also report R² values where appropriate.</Paragraph> </Section> <Section position="6" start_page="933" end_page="934" type="metho"> <SectionTitle> 5 POURPRE </SectionTitle> <Paragraph position="0"> Previously, it has been assumed that matching nuggets from the assessors' answer key with systems' responses must be performed manually because it involves semantics (Voorhees, 2003). We would like to challenge this assumption and hypothesize that term co-occurrence statistics can serve as a surrogate for this semantic matching process. Experience with the ROUGE metric has demonstrated the effectiveness of matching unigrams, an idea we employ in our POURPRE metric. We hypothesize that matching bigrams, trigrams, or any other longer n-grams will not be beneficial, because they primarily account for the fluency of a response, which is more relevant in a machine translation task. Since answers to definition questions are usually document extracts, fluency is less of a concern.</Paragraph> <Paragraph position="1"> The idea behind POURPRE is relatively straightforward: match nuggets by summing the unigram co-occurrences between terms from each nugget and terms from the system response. We decided to start with the simplest possible approach: count the word overlap and divide by the total number of terms in the answer nugget. The only additional wrinkle is to ensure that all words appear within the same answer string. Since nuggets represent coherent concepts, they are unlikely to be spread across different answer strings (which are usually different extracts of source documents). As a simple example, let's say we're trying to determine if the nugget "A B C D" is contained in the following system response:
1. A
2. B C D
3. D
4. A D
The match score assigned to this nugget would be 3/4, from answer string 2; no other answer string would get credit for this nugget. This provision reduces the impact of coincidental term matches. Once we determine the match score for every nugget, the final F-score is calculated in the usual way, except that the automatically-derived match scores are substituted where appropriate. For example, nugget recall now becomes the sum of the match scores for all vital nuggets divided by the total number of vital nuggets. In the official F-score calculation, the length allowance--for the purposes of computing nugget precision--was 100 non-whitespace characters for every okay and vital nugget returned. Since nugget match scores are now fractional, this required some adjustment. We settled on an allowance of 100 non-whitespace characters for every nugget match that had a non-zero score.</Paragraph> <Paragraph position="2"> A major drawback of this basic unigram overlap approach is that all terms are considered equally important--surely, matching "year" in a system's response should count for less than matching "Huygens" in the example about the Cassini space probe. We decided to capture this intuition using inverse document frequency, a commonly-used measure in information retrieval; idf(t_i) is defined as log(N/c_i), where N is the number of documents in the collection and c_i is the number of documents that contain the term t_i. With scoring based on idf, term counts are simply replaced with idf sums in computing the match score, i.e., the match score of a particular nugget is the sum of the idfs of matching terms in the system response divided by the sum of all term idfs from the answer nugget.</Paragraph>
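The computation just described can be sketched in a few lines of code. This is an illustration of our description rather than a reference implementation; the tokenizer, the treatment of terms missing from the idf table, and the function names are illustrative choices.

```python
import re

def tokens(text):
    """Lowercased word tokens; the exact tokenization is an illustrative choice."""
    return re.findall(r"[a-z0-9]+", text.lower())

def nugget_match(nugget, answer_strings, idf=None):
    """Match score for one nugget against a list of system answer strings.

    Each answer string is scored separately (all matching terms must come from
    the same string) and the best-scoring string determines the nugget score.
    With idf=None every term counts equally; otherwise terms are weighted by
    their idf (terms absent from the table get weight zero here).
    """
    weight = (lambda t: idf.get(t, 0.0)) if idf else (lambda t: 1.0)
    nugget_terms = tokens(nugget)
    denom = sum(weight(t) for t in nugget_terms)
    if denom == 0.0:
        return 0.0
    best = 0.0
    for answer in answer_strings:
        answer_terms = set(tokens(answer))
        overlap = sum(weight(t) for t in nugget_terms if t in answer_terms)
        best = max(best, overlap / denom)
    return best

def pourpre_f(vital, okay, answer_strings, beta=3.0, idf=None):
    """F-score with automatic match scores substituted for human judgments."""
    vital_scores = [nugget_match(n, answer_strings, idf) for n in vital]
    okay_scores = [nugget_match(n, answer_strings, idf) for n in okay]
    recall = sum(vital_scores) / len(vital) if vital else 0.0
    # Length allowance: 100 non-whitespace characters for every nugget match
    # with a non-zero score, as described above.
    allowance = 100 * sum(1 for s in vital_scores + okay_scores if s > 0.0)
    length = sum(len("".join(a.split())) for a in answer_strings)
    precision = 1.0 if length <= allowance else 1.0 - (length - allowance) / length
    if precision + recall == 0.0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

# The running example: the nugget "A B C D" scores 3/4 from answer string 2.
print(nugget_match("A B C D", ["A", "B C D", "D", "A D"]))  # 0.75
```

The final call reproduces the running example above: the nugget "A B C D" receives a match score of 3/4, contributed entirely by answer string 2.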
<Paragraph position="3"> Finally, we examined the effects of stemming, i.e., matching stemmed terms derived from the Porter stemmer.</Paragraph> <Paragraph position="4"> In the next section, results of experiments with submissions to TREC 2003 and TREC 2004 are reported. We attempted two different methods for aggregating results: microaveraging and macroaveraging. For microaveraging, scores were calculated by computing the nugget match scores over all nuggets for all questions. For macroaveraging, scores for each question were first computed, and then averaged across all questions in the test set. With microaveraging, each nugget is given equal weight, while with macroaveraging, each question is given equal weight.</Paragraph> <Paragraph position="5"> As a baseline, we revisited experiments by Xu et al. (2004) in using ROUGE to evaluate definition questions. What if we simply concatenated all the answer nuggets together and used the result as the "reference summary" (instead of using humans to create custom reference answers)?</Paragraph> </Section> <Section position="7" start_page="934" end_page="936" type="metho"> <SectionTitle> 6 Evaluation of POURPRE </SectionTitle> <Paragraph position="0"> We evaluated all definition question runs submitted to the TREC 2003 and TREC 2004 question answering tracks with different variants of our POURPRE metric, and then compared the results with the official F-scores generated by human assessors. (In TREC 2003, the value of β was arbitrarily set to five, which was later determined to favor recall too heavily; as a result, it was readjusted to three in TREC 2004. In our experiments with TREC 2003, we report figures for both values.) The Kendall's τ correlations between rankings produced by POURPRE and the official rankings are shown in Table 2; the R² correlations between the two sets of scores are shown in Table 3. We report four separate variants along two different parameters: scoring by term counts only vs. scoring by term idf, and microaveraging vs. macroaveraging.</Paragraph> <Paragraph position="1"> Interestingly, scoring based on macroaveraged term counts outperformed any of the idf variants.</Paragraph> <Paragraph position="2"> A scatter graph plotting official F-scores against POURPRE scores (macro, count) for TREC 2003 (β = 5) is shown in Figure 3. Corresponding graphs for other variants appear similar, and are not shown here. The effect of stemming on the Kendall's τ correlation between POURPRE (macro, count) and official scores is shown in Table 4. Results from the same stemming experiment on the other POURPRE variants are similarly inconclusive.</Paragraph> <Paragraph position="3"> For TREC 2003 (β = 5), we performed an analysis of rank swaps between official and POURPRE scores. A rank swap is said to have occurred if the relative ranking of two runs is different under different conditions--rank swaps are significant because they might prevent researchers from confidently drawing conclusions about the relative effectiveness of different techniques. We observed 81 rank swaps (out of a total of 1431 pairwise comparisons for 54 runs). A histogram of these rank swaps, binned by the difference in official score, is shown in Figure 4. As can be seen, 48 rank swaps (59.3%) occurred when the difference in official score was less than 0.02; no rank swaps were observed for runs in which the official scores differed by more than 0.061.</Paragraph>
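The rank-swap analysis can be reproduced directly from two sets of per-run scores; the sketch below (the score dictionaries are hypothetical placeholders) counts pairs of runs whose relative order disagrees and bins each swap by the absolute difference in official score.

```python
from itertools import combinations

def rank_swaps(official, automatic, bin_width=0.02):
    """Count pairs of runs ordered differently by two scorings.

    Returns the number of swapped pairs and a histogram of swaps keyed by
    the (binned) absolute difference in official score.
    """
    swaps, histogram = 0, {}
    for run_a, run_b in combinations(sorted(official), 2):
        delta_official = official[run_a] - official[run_b]
        delta_automatic = automatic[run_a] - automatic[run_b]
        if delta_official * delta_automatic < 0:  # relative order disagrees
            swaps += 1
            bin_floor = round(abs(delta_official) // bin_width * bin_width, 4)
            histogram[bin_floor] = histogram.get(bin_floor, 0) + 1
    return swaps, histogram

# Hypothetical per-run scores keyed by run identifier.
official_scores = {"runA": 0.45, "runB": 0.44, "runC": 0.30}
pourpre_scores = {"runA": 0.41, "runB": 0.42, "runC": 0.28}
print(rank_swaps(official_scores, pourpre_scores))  # (1, {0.0: 1})
```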
<Paragraph position="4"> Since measurement error is an inescapable fact of evaluation, we need not be concerned with rank swaps that can be attributed to this factor. For TREC 2003, Voorhees (2003) calculated this value to be approximately 0.1; that is, in order to conclude with 95% confidence that one run is better than another, an absolute F-score difference greater than 0.1 must be observed. As can be seen, all the rank swaps observed can be attributed to error inherent in the evaluation process.</Paragraph> <Paragraph position="5"> From these results, we can see that evaluation of definition questions is relatively coarse-grained. However, TREC 2003 was the first formal evaluation of definition questions; as methodologies are refined, the margin of error should go down. Although a similar error analysis for TREC 2004 has not been performed, we expect a similar result.</Paragraph> <Paragraph position="6"> Given the simplicity of our POURPRE metric, the correlation between our automatically-derived scores and the official scores is remarkable. Starting from a set of questions and a list of relevant nuggets, POURPRE can accurately assess the performance of a definition question answering system without any human intervention.</Paragraph> <Section position="1" start_page="935" end_page="936" type="sub_section"> <SectionTitle> 6.1 Comparison Against ROUGE </SectionTitle> <Paragraph position="0"> We chose ROUGE over BLEU as a baseline for comparison because, conceptually, the task of answering definition questions is closer to summarization than it is to machine translation, in that both are recall-oriented. Since the majority of question answering systems employ extractive techniques, fluency (i.e., precision) is not usually an issue.</Paragraph> <Paragraph position="1"> How does POURPRE stack up against using ROUGE to directly evaluate definition questions? The Kendall's τ correlations between rankings produced by ROUGE (with and without stopword removal) and the official rankings are shown in Table 2; R² values are shown in Table 3. In all cases, ROUGE does not perform as well.</Paragraph> <Paragraph position="2"> We believe that POURPRE correlates better with official scores because it takes into account special characteristics of the task: the distinction between vital and okay nuggets, the length penalty, etc. Beyond the higher correlation, POURPRE offers an advantage over ROUGE in that it provides a better diagnostic than a coarse-grained score, i.e., it can reveal why an answer received a particular score. This allows researchers to conduct failure analyses to identify opportunities for improvement.</Paragraph> </Section> </Section> <Section position="8" start_page="936" end_page="937" type="metho"> <SectionTitle> 7 The Effect of Variability in Judgments </SectionTitle> <Paragraph position="0"> As with many other information retrieval tasks, legitimate differences in opinion about relevance are an inescapable fact of evaluating definition questions--systems are designed to satisfy real-world information needs, and users inevitably disagree on which nuggets are important or relevant.</Paragraph> <Paragraph position="1"> These disagreements manifest as scoring variations in an evaluation setting. The important issue, however, is the degree to which variations in judgments affect the conclusions that can be drawn in a comparative evaluation, i.e., can we still confidently conclude that one system is "better" than another?</Paragraph>
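The experiments described next probe this question by perturbing the vital/okay labels in the answer key and re-ranking all runs. The sketch below illustrates the general procedure under simplifying assumptions: the uniform shuffling is a simpler stand-in for the sampling procedure actually used, and score_runs is a hypothetical callback representing the full nugget-based scoring of every submitted run.

```python
import random

def shuffle_vital_okay(nuggets):
    """Randomly reassign vital/okay labels for one question's nuggets,
    preserving the original number of vital nuggets."""
    texts = [text for text, _ in nuggets]
    n_vital = sum(1 for _, is_vital in nuggets if is_vital)
    vital_indices = set(random.sample(range(len(texts)), n_vital))
    return [(text, i in vital_indices) for i, text in enumerate(texts)]

def rank_distribution(answer_key, score_runs, trials=1000):
    """Tally how often each run lands at each rank over random perturbations.

    answer_key maps a question id to a list of (nugget, is_vital) pairs;
    score_runs is a hypothetical callback that scores every run against a
    (possibly perturbed) answer key and returns a dict of run -> F-score.
    """
    counts = {}
    for _ in range(trials):
        perturbed = {q: shuffle_vital_okay(ns) for q, ns in answer_key.items()}
        scores = score_runs(perturbed)
        ranking = sorted(scores, key=scores.get, reverse=True)
        for rank, run in enumerate(ranking, start=1):
            per_run = counts.setdefault(run, {})
            per_run[rank] = per_run.get(rank, 0) + 1
    return counts
```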
<Paragraph position="2"> For the ad hoc document retrieval task, research has shown that system rankings are stable with respect to disagreements about document relevance (Voorhees, 2000). In this section, we explore the effect of judgment variability on the stability and reliability of TREC definition question answering evaluations.</Paragraph> <Paragraph position="3"> The vital/okay distinction on nuggets is one major source of differences in opinion, as has been pointed out previously (Hildebrandt et al., 2004). In the Cassini space probe example, we disagree with the assessors' assignment in many cases. More importantly, however, there do not appear to be any operationalizable rules for classifying nuggets as either vital or okay. Without any guiding principles, how can we expect our systems to automatically recognize this distinction? How do differences in opinion about vital/okay nuggets impact the stability of system rankings? To answer this question, we measured the Kendall's τ correlation between the official rankings and rankings produced by different variations of the answer key. Three separate variants of the answer key were considered; this experiment was conducted with the manually-evaluated system responses, not our POURPRE metric. For the last condition, we conducted one thousand random trials, taking into consideration the original distribution of the vital and okay nuggets for each question using a simplified version of the Metropolis-Hastings algorithm (Chib and Greenberg, 1995); the standard deviations are reported.</Paragraph> <Paragraph position="4"> These results suggest that system rankings are sensitive to assessors' opinions about what constitutes a vital or okay nugget. In general, the Kendall's τ values observed here are lower than values computed from corresponding experiments in ad hoc document retrieval (Voorhees, 2000). To illustrate, the distribution of ranks for the top two runs from TREC 2004 (RUN-12 and RUN-8) over the one thousand random trials is shown in Figure 5. In 511 trials, RUN-12 was ranked as the highest-scoring run; in 463 trials, RUN-8 was. Factoring in differences of opinion about the vital/okay distinction, one could not conclude with certainty which was the "best" run in the evaluation.</Paragraph> <Paragraph position="5"> It appears that differences between POURPRE and the official scores are about the same as (or in some cases, smaller than) differences between the official scores and scores based on variant answer keys (with the exception of "everything vital"). This means that further refinement of the metric to increase correlation with human-generated scores may not be particularly meaningful; it might essentially amount to overtraining on the whims of a particular human assessor. We believe that sources of judgment variability and techniques for managing it represent important areas for future study.</Paragraph> </Section> </Paper>