<?xml version="1.0" standalone="yes"?>
<Paper uid="M93-1007">
  <Title>MUC-5 EVALUATION METRICS</Title>
  <Section position="5" start_page="286" end_page="286" type="concl">
    <SectionTitle>
CHANGES TO THE METRICS FROM PREVIOUS EVALUATIONS
</SectionTitle>
    <Paragraph position="0"> The changes to the evaluation metrics are expected to enable three different types of evaluation &amp;quot;users &amp;quot; (NLP researchers, program managers, and potential customers) to assess and compare system performance in a meaningful way. It is also hoped that the changes will correct deficiencies in the evaluation that may unwittingl y encourage conservative development strategies on the part of the researchers and that may also limit the evaluation' s meaningfulness to other evaluation users.</Paragraph>
    <Paragraph position="1"> Although the terms recall and precision were borrowed from IR, the metrics themselves represent a significant departure from the contingency table model, which underlies the IR version of the metrics . The task o f extraction is a complex one that includes elements of information detection and classification, plus open-ende d generation of strings and object pointers . The focus on recall and precision as primary metrics for the last few year s has had some advantages, among them the following : * they bring out the fundamental tension between spurious and missing data; * they require that evaluation users view system performance along more than one dimension ; * they present a positive view of system performance, which may have helped to make the NL P researchers more comfortable with the idea of submitting their systems to evaluation .</Paragraph>
    <Paragraph position="2"> However, recall and precision have the disadvantage of making a two-way distinction between error type s (spurious and missing) when in fact there are three types of error. The third kind of error is captured by the substitution metric; it is accounted for by the categories of incorrect and ( .5 times) partial . Substitution errors arc taken into account in the recall-precision metrics to the extent that they contribute to the denominator of both recal l and precision; however, this type of error is not isolated, and its inclusion in the denominator of recall and precision prevents those metrics from revealing to what extent a system's shortfalls are due to substitution rather than t o missing (in the case of recall) or spurious (in the case of precision) .</Paragraph>
    <Paragraph position="3">  In a way, the recall-precision metrics view substitution as a blend of missing and spurious ; a system did no t simply produce the wrong fill, but rather produced a spurious fill on the one hand and missed a fill on the other hand . This is a reasonable model of system behavior in many cases, but not in others, especially when a response is scored partially correct. These deficiencies of the recall and precision metrics make the use of the error per response fil l reasonable, as long as it is accompanied by the secondary metrics of overgeneration (spurious), undergeneratio n (missing), and substitution (incorrect, including half of the partial) .</Paragraph>
    <Paragraph position="4"> The F-measure, which was introduced for MUC-4 in response to needs of researchers and program managers for a ranking metric, has come to be used more generally than just for cross-system comparisons . By becoming the one metric of focus, it has been competing with recall and precision for the role of primary metric , thereby weakening two of the major advantages that recall and precision originally had . Furthermore, now that performance of some systems is in or approaching the 50% range, recall and precision are at a disadvantage fo r motivating researchers to push performance of the top systems through the more difficult stages ahead because the y focus on the positive aspects of performance. These factors make the adoption of error per response fill as the primary metric a reasonable next step in determining the best way to measure performance .</Paragraph>
    <Paragraph position="5"> The statistical significance results from MUC-5 give us feedback on how well the error metric and the F measure distinguish systems . The results show that there are no differences between the rankings determined by erro r per response fi11 2 and the rankings determined by F-measure. The error per response fill distinguishes systems slightly better; four more system pairs were significantly different in their error per response fill than were significantl y different in their F-measure . The error per response fill also shows a tendency towards clustering systems in slightl y clearer groups than the F-measure for EJV due to its ability to distinguish systems slightly better .</Paragraph>
    <Paragraph position="6"> The richness-normalized error represents another change from previous evaluations and was motivated b y the desire for a system-independent metric . The nature of this metric requires that spurious behavior be ignored . The search for such a metric led us to innovate one in which two values, a minimum and maximum, were calculated sinc e language understanding necessarily involves variability in interpretation . It remains to be seen whether ignoring overgeneration interferes with the predictive quality of the richness-normalized error metric .</Paragraph>
  </Section>
class="xml-element"></Paper>