<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0604"> <Title>Answer Extraction Towards better Evaluations of NLP Systems</Title> <Section position="4" start_page="23" end_page="23" type="metho"> <SectionTitle> IF Conditions </SectionTitle> <Paragraph position="0"> THEN X has a cover ~-> X is covered.</Paragraph> <Paragraph position="1"> that converts the verbal phrase with the nominal expression into a the corresponding passive construction (and vice versa) taking the present context into consideration.</Paragraph> <Paragraph position="2"> As these concrete examples show, the task of QA over this simple piece of text is frighteningly difficult. Finding the correct answers to the questions requires far more information that one would think at first. Apart from linguistic knowledge a vast amount of world knowledge and a number of bridging inferences are necessary to answer these seemingly simple questions. For human beings bridging inferences are automatic and for the most part unconscious. The hard task consists in reconstructing all this information coming from different knowledge sources and modeling the suitable inference rules in a general way so that the system scales up.</Paragraph> </Section> <Section position="5" start_page="23" end_page="24" type="metho"> <SectionTitle> 3 Answer Extraction as an </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="23" end_page="24" type="sub_section"> <SectionTitle> Alternative Task </SectionTitle> <Paragraph position="0"> An alternative to QA is answer extraction (AE).</Paragraph> <Paragraph position="1"> The general goal of AE is the same as that of QA, to find answers to user queries in textual documents. But the way to achieve this is different. Instead of generating the answer from the information given in the text (possibly in implicit form only), an AE system will retrieve the specific sentence(s) in the text that contain(s) the explicit answer to the query. In addition, those phrases in the sentence that represent the explicit a_nswer to the query may be highlighted.</Paragraph> <Paragraph position="2"> For example, let us assume that the following sentence is in the text (and we are going to use examples from a technical domain, that of the However, an AE system will return all the sentences in the text that directly answer the question, among them (1).</Paragraph> <Paragraph position="3"> Obviously, an AE system is far less powerful than a real QA system. Information that is not explicit in a text will not be found, let alone information that must be derived from textual information together with world knowledge. But AE has a number of important advantages over QA as a test paradigm. First, an obvious advantage of this approach is that the user receives first-hand information, right from the text, rather than system-generated replies.</Paragraph> <Paragraph position="4"> It is therefore much easier for the user to determine whether the result is reliable. Second, it is a realistic task (as the systems we are describing below proves) as there is no need to generate natural language output, and there is less need to perform complex inferences because it merely looks up things in the texts which axe explicitly there. It need not use world knowledge. Third, it requires the solution of a number of well-defined and truly important linguistic problems and is therefore well suited to measure, and advance, progress in these respects. We will come to this later. 
And finally, there is a real demand for working AE systems in technical domains, since the standard IR approaches just do not work in a satisfactory manner in many applications where the user is under pressure to quickly find a specific answer to a specific question, and not just (potentially long) lists of pointers to (potentially large) documents that may (or may not) be relevant to the query. Examples of applications are on-line software help systems, interfaces to machine-readable technical manuals, help desk systems in large organizations, and public enquiry systems accessible over the Web.</Paragraph> <Paragraph position="5"> The basic procedure we use in our approach to AE is as follows: In an off-line stage, the documents are processed and the core meaning of each sentence is extracted and stored as so-called minimal logical forms. In an on-line stage, the user query is also processed to produce a minimal logical form. In order to retrieve answer sentences from the document collection, the minimal logical form of the query is proved, by a theorem prover, over the minimal logical forms of the entire document collection (Mollá et al., 1998). Note that this method will not retrieve patently wrong answer sentences like &quot;bkup files all copies on the hard disk&quot; in response to queries like &quot;Which command copies files?&quot; This is the kind of response we inevitably get if we use some variation of the bag-of-words approach adopted by IR-based systems that do not perform any kind of content analysis.</Paragraph> <Paragraph position="6"> We are currently developing two AE systems. The first, ExtrAns, uses deep linguistic analysis to perform AE over the Unix manpages. The prototype of this system uses 500 Unix manpages, and it can be tested over the Web [http://www.ifi.unizh.ch/cl/extrans]. In the second (new) project, WebExtrAns, we intend to perform AE over the &quot;Aircraft Maintenance Manual&quot; of the Airbus 320 (ADRES, 1996). The larger volume of data (about 900 kg of printed paper!) will represent an opportunity to test the scalability of an AE system that uses deep linguistic analysis.</Paragraph> <Paragraph position="7"> There are a number of important areas of research that ExtrAns and WebExtrAns, and by extension any AE system, have to focus on. First of all, in order to generate the logical form of the sentences, the following must be tackled: finding the verb arguments, performing disambiguation and anaphora resolution, and coping with nominalizations, passives, ditransitives, compound nouns, synonymy, and hyponymy (Mollá et al., 1998; Mollá and Hess, 2000). Second, the very idea of producing the logical forms of real-world text requires the formalization of the logical form notation so that it is expressive enough while still remaining usable (Schwitter et al., 1999).</Paragraph> <Paragraph position="8"> Finally, the goal of producing a practical system for a real-world application needs to address the issues of robustness and scalability (Mollá and Hess, 1999). Note that the fact that AE and QA share the same goal makes it possible to start with a project that initially performs AE and gradually enhance and extend it with inference and generation modules until we arrive at a full-fledged QA system.
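To make the off-line/on-line procedure described above more concrete, here is a minimal Python sketch of the idea. It is not ExtrAns code: the two indexed sentences, their hand-written logical forms, and the function names are illustrative assumptions, and the theorem-proving step is approximated by checking that every query predicate can be matched in a sentence under one consistent variable mapping.

```python
from itertools import permutations

# Off-line stage (assumed toy index): each sentence is stored together with a
# hand-written "minimal logical form", a list of (predicate, arguments) pairs.
INDEX = [
    ("cp copies files.",                         # hypothetical manpage sentence
     [("copy", ("x", "y")), ("cp", ("x",)), ("file", ("y",))]),
    ("bkup files all copies on the hard disk.",  # here "files" is the verb
     [("file", ("x", "y")), ("bkup", ("x",)), ("copy", ("y",)), ("hard-disk", ("z",))]),
]

def provable(query_lf, sentence_lf):
    """Crude stand-in for the theorem prover: the query's logical form is
    'proved' over a sentence if every query predicate occurs in the sentence
    under a single consistent renaming of the query variables."""
    q_vars = sorted({v for _, args in query_lf for v in args})
    s_vars = sorted({v for _, args in sentence_lf for v in args})
    if len(q_vars) > len(s_vars):
        return False
    facts = set(sentence_lf)
    for image in permutations(s_vars, len(q_vars)):
        mapping = dict(zip(q_vars, image))
        if all((p, tuple(mapping[v] for v in args)) in facts for p, args in query_lf):
            return True
    return False

def answer_sentences(query_lf):
    """On-line stage: return every indexed sentence whose logical form proves the query."""
    return [text for text, lf in INDEX if provable(query_lf, lf)]

# "Which command copies files?"  ->  copy(x,y), file(y)
print(answer_sentences([("copy", ("x", "y")), ("file", ("y",))]))
# ['cp copies files.']  -- the bkup sentence is not returned, because there
# "copies" is a noun, so its logical form contains no copy(x,y) relation.
```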
A full-fledged QA system of this kind is the long-term goal of our current series of projects on AE.</Paragraph> </Section> </Section> <Section position="6" start_page="24" end_page="25" type="metho"> <SectionTitle> 4 Evaluating the Results </SectionTitle> <Paragraph position="0"> Instead of using reading comprehension tests that are meant for humans, not machines, we should produce specific tests that evaluate the AE capability of machines. Here is our proposal.</Paragraph> <Paragraph position="1"> Concerning test queries, it is always better to use real-world queries than queries that were artificially constructed to match a portion of text. Experience has shown time and again that real people tend to come up with questions different from those the test designers could think of. By using, as we suggest, manuals of real-world systems, it is possible to tap the interaction of real users with these systems as a source of real questions (we do this by logging the questions submitted to our system over the Web). Another way of finding queries is to consult the FAQ lists that are sometimes available on the Web for a given system. In both cases you will have to filter out those queries that have no answers in the document collection or that are clearly beyond the scope of the system being evaluated (for example, if the inference needed to answer a query is too complex, even for a human judge).</Paragraph> <Paragraph position="2"> Concerning answers, the principal measures for the AE task must be recall and precision, applied to individual answer sentences. Recall is the number of correct answer sentences the system retrieved divided by the total number of correct answers in the entire document collection. Precision is the number of correct answer sentences the system retrieved divided by the total number of answers it returned. As is known all too well, recall is nearly impossible to determine in an exact fashion for all but toy applications, since the totality of correct answers in the entire document collection has to be found mainly by hand. Almost certainly one will have to resort to (hopefully) representative samples of documents to arrive at a reasonable approximation to this value. Precision is easier to determine, although even this step can become very time-consuming in real-world applications.</Paragraph> <Paragraph position="3"> If, on the other hand, one only needs to do an approximate evaluation of the AE system, it would be possible to find a representative set of correct answers by having a person write the ideal answers and then automatically finding the sentences in the documents that are semantically close to these ideal answers. Semantic closeness between a sentence and the ideal answer can be computed by combining the succinctness and correctness of the sentence with respect to the ideal answer. Succinctness and correctness are the counterparts of precision and recall, but on the sentence level. These measures can be computed by checking the overlap of words between the sentence and the ideal answer (Hirschman et al., 1999), but we suggest a more content-based approach.</Paragraph> <Paragraph position="4"> Our proposal is to compare not the words in a sentence, but their logical forms. Of course, this comparison can be done only if it is possible to agree on what logical forms should look like, how to compute them, and how to compare them.
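The following sketch (again Python, and again our own illustration rather than the ExtrAns evaluation code) puts these measures side by side: recall and precision over answer sentences, and succinctness and correctness computed from the overlap of predicates in the way worked through below. The harmonic-mean combination used for semantic closeness is only one possible choice, since the text leaves the exact combination open.

```python
from itertools import permutations

def recall(retrieved, correct):
    """Correct answer sentences retrieved / all correct answer sentences
    (both arguments are sets of sentence identifiers)."""
    return len(retrieved & correct) / len(correct)

def precision(retrieved, correct):
    """Correct answer sentences retrieved / all answer sentences retrieved."""
    return len(retrieved & correct) / len(retrieved)

def overlap(lf_a, lf_b):
    """Largest number of predicates shared by two logical forms under one
    consistent renaming of variables (brute force; the forms are tiny)."""
    va = sorted({v for _, args in lf_a for v in args})
    vb = sorted({v for _, args in lf_b for v in args})
    if len(va) > len(vb):                        # map the smaller form into the larger
        lf_a, lf_b, va, vb = lf_b, lf_a, vb, va
    facts, best = set(lf_b), 0
    for image in permutations(vb, len(va)):
        m = dict(zip(va, image))
        renamed = {(p, tuple(m[v] for v in args)) for p, args in lf_a}
        best = max(best, len(renamed & facts))
    return best

def succinctness(sentence_lf, ideal_lf):         # precision at the sentence level
    return overlap(sentence_lf, ideal_lf) / len(sentence_lf)

def correctness(sentence_lf, ideal_lf):          # recall at the sentence level
    return overlap(sentence_lf, ideal_lf) / len(ideal_lf)

def closeness(sentence_lf, ideal_lf):
    """One possible combined measure: the harmonic mean of the two."""
    s, c = succinctness(sentence_lf, ideal_lf), correctness(sentence_lf, ideal_lf)
    return 2 * s * c / (s + c) if s + c else 0.0

# Ideal answer "rm removes files." and sentences (1) and (2) discussed below.
ideal = [("remove", ("x", "y")), ("rm", ("x",)), ("file", ("y",))]
s1 = list(ideal)                                 # (1) rm removes one or more files.
s2 = [("print", ("x", "y")), ("csplit", ("x",)), ("character-count", ("y",)),
      ("remove", ("x", "z")), ("file", ("z",)), ("create", ("x", "z")),
      ("occur", ("e",)), ("error", ("e",))]      # (2) csplit prints ... removes ...
print(succinctness(s1, ideal), correctness(s1, ideal))   # 1.0 1.0
print(succinctness(s2, ideal), correctness(s2, ideal))   # 0.25 0.666...
```

The same overlap function also yields 2 rather than 3 for the Madrid/Barcelona pair discussed at the end of this section, because no single variable mapping can make all three predicates match.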
The second and third of these conditions can be fulfilled if the logical forms are simple lists of predicates that contain some minimal semantic information, as is the case in ExtrAns (Schwitter et al., 1999). In this paper we will use a simplification of the minimal logical forms used by ExtrAns. Below are two sentences with their logical forms: (1) rm removes one or more files.</Paragraph> <Paragraph position="5"> remove(x,y), rm(x), file(y) (2) csplit prints the character counts for each file created, and removes any files it creates if an error occurs.</Paragraph> <Paragraph position="6"> print(x,y), csplit(x), character-count(y), remove(x,z), file(z), create(x,z), occur(e), error(e) As an example of how to compute succinctness and correctness, take the following question: Which command removes files? The ideal answer is a full sentence that contains the information given by the question and the information requested. Since rm is the command used to remove files, the ideal answer is: rm removes files.</Paragraph> <Paragraph position="7"> remove(x,y), rm(x), file(y) Instead of computing the overlap of words, the succinctness and correctness of a sentence can be determined by computing the overlap of predicates. The overlap of the predicates (overlap henceforth) of two sentences is the maximum set of predicates that can be used as part of the logical form in both sentences. The predicates in boldface in the two examples above indicate the overlap with the ideal answer: 3 for (1), and 2 for (2).</Paragraph> <Paragraph position="8"> Succinctness of a sentence with respect to an ideal answer (precision on the sentence level) is the ratio between the overlap and the total number of predicates in the sentence. Succinctness is, therefore, 3/3=1 for (1), and 2/8=0.25 for (2).</Paragraph> <Paragraph position="9"> Correctness of a sentence with respect to an ideal answer (recall on the sentence level) is the ratio between the overlap and the number of predicates in the ideal answer. In the examples above, correctness is 3/3=1 for (1), and 2/3=0.66 for (2).</Paragraph> <Paragraph position="10"> A combined measure of succinctness and correctness could be used to determine the semantic closeness of the sentences to the ideal answer. By establishing a threshold on the semantic closeness, one can find the sentences in the documents that are answers to the user's query.</Paragraph> <Paragraph position="11"> The advantage of using overlap of predicates over overlap of words is that the relations between the words also affect the measures of succinctness and correctness. We can see this in the following artificial example. Let us suppose that the ideal answer to a query is: Madrid defeated Barcelona.</Paragraph> <Paragraph position="12"> defeat(x,y), madrid(x), barcelona(y) The following candidate sentence produces the same predicates: Barcelona defeated Madrid.</Paragraph> <Paragraph position="13"> defeat(x,y), madrid(y), barcelona(x) However, at most two predicates can be chosen at the same time (in boldface), because of the restrictions on the arguments. In the ideal answer, the first argument of &quot;defeat&quot; is Madrid and the second argument is Barcelona.</Paragraph> <Paragraph position="14"> In the candidate sentence, however, the arguments are reversed (the names of the variables have no effect on this). The overlap is, therefore, 2.</Paragraph> </Section> </Paper>