<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2506"> <Title>A Novel Approach to Focus Identification in Question/Answering Systems</Title> <Section position="6" start_page="8" end_page="8" type="evalu"> <SectionTitle> 5 Experiments </SectionTitle>
<Paragraph position="0"> The aim of the experiments is to show that category information, used as described in the previous section, is useful for Q/A systems. For this purpose we have to show that the performance of a basic Q/A system improves when question classification is adopted. To implement our Q/A and filtering system we used: (1) a state-of-the-art Q/A system, since improving low-accuracy systems is not enough to prove that TC is useful for Q/A; the basic Q/A system that we employed is based on the architecture described in (Pasca and Harabagiu, 2001), which represents the current state of the art; (2) the Reuters collection of categorized documents, on which our basic Q/A system is trained; and (3) a set of questions categorized according to the Reuters categories, a portion of which is used to train the PRTC and QSVM models while the other, disjoint portion is used to measure the performance of the Q/A systems.</Paragraph>
<Paragraph position="1"> The next section describes the technique used to produce the question corpus.</Paragraph>
<Section position="1" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 5.1 Question Set Generation </SectionTitle>
<Paragraph position="0"> The idea of the PRTC and QSVM models is to exploit a set of questions for each category to improve the learning of the PRC and SVM classifiers. Given the complexity of producing any single question, we decided to test our algorithms on only six categories. We chose the Acq, Earn, Crude, Grain, Trade and Ship categories since they have the largest numbers of training documents available. To generate questions we randomly selected a number of documents from each category and then tried to formulate questions related to the pairs <document, category>. Three cases were found: (a) the document does not contain general questions about the target category; (b) the document suggests general questions; in this case some of the question words that appear in the answers are replaced with synonyms to formulate a new (more general) question; (c) the document suggests general questions that are not related to the target category; we add these questions to our data-set associated with their true categories.</Paragraph>
<Paragraph position="1"> Table 3 lists a sample of the questions we derived from the target set of categories:
Acq: Which strategy aimed activities on core businesses? / How could the transpacific telephone cable between the U.S. and Japan contribute to forming a joint venture?
Earn: What was the most significant factor for the lack of the distribution of assets? / What do analysts think about public companies?
Crude: What is Kuwait known for? / What supply does Venezuela give to another oil producer?
Grain: Why do certain exporters fear that China may renounce its contract? / Why did men in the port's grain sector stop work?
Trade: How did the trade surplus and the reserves weaken Taiwan's position? / What are Spain's plans for reaching the European Community export level?
Ship: When did the strikes start in the ship sector? / Who attacked the Saudi Arabian supertanker in the United Arab Emirates sea?
It is worth noting that we also included short queries in order to keep our experimental set-up general.</Paragraph>
<Paragraph position="2"> We generated 120 questions and used 60 for learning and the other 60 for testing. To measure the impact that TC has on Q/A, we first evaluated the question categorization models presented in Section 3.1. Then we compared the performance of the basic Q/A system with the extended Q/A systems that adopt the answer elimination and re-ranking methods.</Paragraph> </Section>
<Section position="2" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 5.2 Performance Measurements </SectionTitle>
<Paragraph position="0"> In Sections 3 and 4 we introduced several models.</Paragraph>
<Paragraph position="1"> From the point of view of accuracy, we can divide them into two categories: the (document and question) categorization models and the Q/A models. The former are usually measured by using Precision, Recall, and f-measure (Yang, 1999); note that questions can be considered as small documents. The latter often provide as output a list of ranked answers. In this case, a good measure of system performance should take into account the order of the correct and incorrect answers.</Paragraph>
<Paragraph position="2"> One measure employed in TREC is the reciprocal value of the rank (RAR) of the highest-ranked correct answer generated by the Q/A system. Its value is 1 if the first answer is correct, 0.5 if the second answer is correct but not the first one, 0.33 when the correct answer is in the third position, 0.25 if the fourth answer is correct, 0.1 when the fifth answer is correct, and so on. If none of the answers is correct, RAR = 0. The Mean Reciprocal Answer Rank (MRAR), used to compute the overall performance of Q/A systems, is defined as $MRAR = \frac{1}{n}\sum_{i} \frac{1}{rank_i}$, where $n$ is the number of questions and $rank_i$ is the rank of the first correct answer for question $i$ (the term is 0 when no correct answer is returned).</Paragraph>
<Paragraph position="3"> Since we believe that TC information is useful for filtering out incorrect answers, we defined a second measure to evaluate Q/A. For this purpose we designed the Signed Reciprocal Answer Rank (SRAR), defined as $SRAR = \frac{1}{n}\sum_{j \in A} \frac{1}{srank_j}$, where $A$ is the set of answers given for the test-set questions, $|srank_j|$ is the rank position of answer $j$, and $srank_j$ is positive if $j$ is correct and negative otherwise. The SRAR can be evaluated over a set of questions as well as over a single question. The SRAR for a single question is 0 only if no answer was provided for it.</Paragraph>
<Paragraph position="4"> For example, given the answer ranking of Table 2 and considering that we have just one test question, the MRAR score is 0.33 while the SRAR is -1 - .5 + .33 - .25 - .1 = -1.52. If answer re-ranking is adopted, the MRAR improves to 1 and the SRAR becomes +1 - .5 - .33 - .25 - .1 = -.18. Answer elimination produces an MRAR and an SRAR of 1.</Paragraph>
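<Paragraph> To make the two measures concrete, here is a minimal Python sketch (not part of the paper) that computes MRAR and SRAR from rank-ordered correctness judgments, using the plain 1/rank weights of the formulas above; note that the worked example above assigns .1 to a fifth-ranked answer, so its totals differ slightly from the strict 1/rank values.

def mrar(rankings):
    # rankings: one list per question; each is a rank-ordered list of booleans
    # (True means the answer at that rank is correct).
    total = 0.0
    for answers in rankings:
        for rank, correct in enumerate(answers, start=1):
            if correct:
                total += 1.0 / rank  # reciprocal rank of the first correct answer
                break
        # if no correct answer was found, the question contributes 0 (RAR = 0)
    return total / len(rankings)

def srar(rankings):
    # Every returned answer contributes 1/rank: positive if correct, negative otherwise.
    total = 0.0
    for answers in rankings:
        for rank, correct in enumerate(answers, start=1):
            total += (1.0 if correct else -1.0) / rank
    return total / len(rankings)

# One test question whose third-ranked answer is the only correct one
# (the situation discussed in the example above).
example = [[False, False, True, False, False]]
print(round(mrar(example), 2))  # 0.33
print(round(srar(example), 2))  # -1 - 1/2 + 1/3 - 1/4 - 1/5 = -1.62 under strict 1/rank weights
</Paragraph>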
</Section>
<Section position="3" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 5.3 Evaluation of Question Categorization </SectionTitle>
<Paragraph position="0"> Table 4 lists the performance of question categorization for each of the models described in Section 3.1. We noticed better results when the PRTC and QSVM models were used. Overall, we find that the performance of</Paragraph> </Section>
<Section position="4" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 5.4 Evaluation of Question Answering </SectionTitle>
<Paragraph position="0"> To evaluate the impact of our filtering methods on Q/A, we first scored the answers of the basic Q/A system on the test set, using both the MRAR and the SRAR measures. Additionally, we evaluated (1) the MRAR when answers were re-ranked based on question and answer category information, and (2) the SRAR when answers extracted from documents with different categories were eliminated. Rows 1 and 2 of Table 5 report the MRAR and SRAR performance of the basic Q/A system.</Paragraph>
<Paragraph position="1"> Columns 2, 3, 4, 5 and 6 show the MRAR and SRAR accuracies (rows 4 and 5) of the Q/A systems that eliminate or re-rank the answers using the RTC0, SVM0, PRTC, QSVM and QATC question categorization models.</Paragraph>
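<Paragraph> For concreteness, the following is a minimal sketch of the two filtering policies compared here: elimination discards answers extracted from documents whose category differs from the predicted question category, while re-ranking moves category-matching answers ahead of the rest, preserving their relative order. The data layout and function names are illustrative assumptions, not the paper's implementation; they only presume that each candidate answer is paired with the Reuters category of its source document and that a question classifier supplies the question category.

from typing import List, Tuple

Answer = Tuple[str, str]  # (answer text, category of the source document)

def eliminate(answers: List[Answer], question_category: str) -> List[Answer]:
    # Keep only answers extracted from documents of the predicted category.
    return [a for a in answers if a[1] == question_category]

def rerank(answers: List[Answer], question_category: str) -> List[Answer]:
    # Stable re-ranking: category-matching answers first, the others afterwards.
    matching = [a for a in answers if a[1] == question_category]
    others = [a for a in answers if a[1] != question_category]
    return matching + others

# Example: a question classified as "Crude" with three ranked candidate answers.
candidates = [("answer 1", "Trade"), ("answer 2", "Crude"), ("answer 3", "Earn")]
print(rerank(candidates, "Crude"))     # the "Crude" answer moves to rank 1
print(eliminate(candidates, "Crude"))  # only the "Crude" answer survives
</Paragraph>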
<Paragraph position="2"> The basic Q/A results show that answering the Reuters-based questions is quite a difficult task [footnote 9: past TREC competition results have shown that Q/A performance strongly depends on the questions/domains used for the evaluation; for example, the more advanced systems of 2001 performed worse than the systems of 1999 because they were evaluated on a more difficult test set], as the MRAR is .662, about 15 percentage points below the best system result obtained in the 2003 TREC competition. Note that the basic Q/A system employed in these experiments uses the same techniques adopted by the best-performing Q/A system of TREC 2003.</Paragraph>
<Paragraph position="3"> The quality of the Q/A results is strongly affected by the question classification accuracy. In fact, RTC0 and QATC, which have the lowest classification f1 (see Table 4), produce very low MRAR (.622 and .607) and SRAR (-.189 and -.320) scores. When the best question classification model, QSVM, is used, the basic Q/A performance improves with respect to both the MRAR (.6635 vs .6619) and the SRAR (-.077 vs -.372) scores.</Paragraph>
<Paragraph position="4"> In order to study how the number of answers impacts the accuracy of the proposed models, we evaluated the MRAR and SRAR scores while varying the maximum number of answers provided by the basic Q/A system.</Paragraph>
<Paragraph position="5"> We adopted answer re-ranking as the filtering policy.</Paragraph>
<Paragraph position="6"> Figure 2 shows that, as the number of answers increases, the MRAR score for QSVM, PRTC and the basic Q/A increases for the first four answers and reaches a plateau afterwards. We also notice that QSVM outperforms both PRTC and the basic Q/A. This figure also shows that question categorization per se does not greatly impact the MRAR score of Q/A.</Paragraph>
<Paragraph position="7"> [Figure 2 caption (fragment): ... answer re-ranking based on question categorization via the PRTC and QSVM models.]</Paragraph>
<Paragraph position="9"> Figure 3 illustrates the SRAR curves under the answer elimination policy. The figure clearly shows that the QSVM and PRTC models for question categorization determine a higher SRAR score, indicating that fewer irrelevant answers are left. Figure 3 thus shows that question categorization can greatly improve the quality of Q/A when irrelevant answers are taken into account. It also suggests that, when evaluating Q/A systems with the MRAR scoring method, an &quot;optimistic&quot; view of Q/A is taken, in which erroneous results are ignored for the sake of emphasizing that an answer was obtained after all, even if it was ranked below several incorrect answers. In contrast, the SRAR score described in Section 5.2 is a &quot;harsher&quot; measure, in which errors are given the same weight as the correct results but negatively affect the overall score. This explains why, even for the baseline Q/A system, we obtained a negative score, as illustrated in Table 5: the Q/A system generates more erroneous answers than correct answers. If only the MRAR scores were considered, we might conclude that TC does not bring significant information to Q/A for precision enhancement by re-ranking answers. However, the results obtained with the SRAR scoring scheme indicate that text categorization does impact Q/A results by eliminating incorrect answers. We plan to further study the question categorization methods and to determine empirically which weighting scheme is ideal.</Paragraph> </Section> </Section> </Paper>