<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0905">
  <Title>Two levels of evaluation in a complex NL system</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> For two years, the TREC Evaluation Conference, (Text REtrieval Conference) has been featuring a Question Answering track, in addition to those already existing. This track involves searching for answers to a list of questions, within a collection of documents provided by NIST, the conference organizer.</Paragraph>
    <Paragraph position="1"> Questions are factual or encyclopaedic, while documents are newspaper articles. The TREC9-QA track, for instance, proposed 700 questions whose answers should be retrieved in a corpus of about one million documents.</Paragraph>
    <Paragraph position="2"> In addition to the evaluation, by human judges, of their systems results (Voorhees and Tice, 2000), TREC participants are also provided with an automated evaluation tool, along with a database. These data consist of a list of judgements of all results sent in by all participants. The evaluation tool automatically delivers a score to a set of answers given by a system to a set of questions. This score is derived from the mean reciprocal rank of the first five answers. For each question, the first correct answers get a mark in reverse proportion to their rank. Those evaluation tool and data are quite useful, since it gives us a way of appreciating what happens when modifying our system to improve it.</Paragraph>
    <Paragraph position="3"> We have been taking part to TREC for two years, with the QALC question-answering system (Ferret et al, 2000), currently developed at LIMSI. This system has following architecture: parsing of the question to find the expected type of the answer, selection of a subset of documents among the approximately one million TREC-provided items, tagging of named entities within the documents, and, finally, search for possible answers. Some of the components serve to enrich both questions and documents, by adding system-readable data into them. Such is the case for the modules that parse questions and tag documents. Other components operate a selection among documents, using added data. One example of such modules are those which select relevant documents, another is the one which extracts the answer from the documents.</Paragraph>
    <Paragraph position="4"> A global evaluation of the system is based on judgement about its answers. This criterion provides only indirect evaluation of each component, via the evolution of the final score when this component is modified. To get a closer evaluation of our modules, we need other criteria. In particular, concerning the evaluation of components for document selection, we adopted an additional criterion about selected relevant documents, that is, those that yield the correct answer.</Paragraph>
    <Paragraph position="5"> This paper describes a quantitative evaluation of various modules in our system, based on two criteria: first, the number of selected relevant documents, and secondly, the number of found answers. The first criterion is used for evaluating locally the modules in the system, which contribute in selecting documents that are likely to contain the answer. The second one provides a global evaluation of the system. It also serves for an indirect evaluation of various modules.</Paragraph>
  </Section>
class="xml-element"></Paper>