<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1118">
<Title>Integrating linguistic knowledge in passage retrieval for question answering</Title>
<Section position="6" start_page="944" end_page="945" type="evalu">
<SectionTitle>5 Experiments</SectionTitle>
<Paragraph position="0"> We selected a random set of 420 questions from the CLEF data for training and used the remaining 150 questions for evaluation. We ran the optimization algorithm with the settings described above. IR was run in parallel on 3-7 Linux workstations on a local network. We retrieved a maximum of 20 passages per question. For each setting we computed the fitness scores for the training set and the evaluation set using MTRR. The top scores were recorded after every 10 runs and compared to the evaluation scores. Figure 3 plots the development of the fitness score throughout the optimization process together with the corresponding evaluation scores.</Paragraph>
<Paragraph position="1"> The baseline of 0.8799 refers to the retrieval result on evaluation data using traditional IR with plain-text keywords only (i.e., using the text layer, Dutch stemming and stop-word removal). The baseline performance on training data is slightly lower, at 0.8224 MTRR. After 1130 settings, the MTRR scores increased to 0.9446 on training data and 1.0247 on evaluation data. Thereafter we observe a surprising drop in the evaluation score to around 0.97 MTRR. This might be due to over-fitting, although the drop is rather abrupt. The evaluation curve then returns to roughly its previous level, and the training curve appears to level out. After 3200 settings the MTRR score on evaluation data is 1.0169, a statistically significant improvement over the baseline (Wilcoxon matched-pairs signed-ranks test, p < 0.01). MTRR measured on document IDs also increased on evaluation data, from 0.5422 to 0.6215, which is statistically significant at p ≤ 0.02. Coverage on evaluation data went up from 78.68% to 81.62%, and redundancy improved from 3.824 to 4.272 (significance tests were not carried out).</Paragraph>
<Paragraph position="2"> Finally, the QA performance of Joost using only the IR-based strategy increased from 0.289 to 0.331 (CLEF scores). This improvement, however, is not statistically significant according to the Wilcoxon test and may be due to chance.</Paragraph>
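As a minimal sketch (not the authors' code) of the evaluation described above, the following shows how per-question MTRR scores could be computed and then compared with the Wilcoxon matched-pairs signed-ranks test. It assumes that MTRR sums the reciprocal ranks of all retrieved passages containing an answer to a question, capped at the 20 retrieved passages, and averages over questions; all function and variable names are illustrative only.

    from scipy.stats import wilcoxon  # matched-pairs signed-ranks test

    def total_reciprocal_rank(ranked_passages, is_relevant, cutoff=20):
        # Sum 1/rank over all relevant passages within the cutoff.
        return sum(1.0 / rank
                   for rank, passage in enumerate(ranked_passages[:cutoff], start=1)
                   if is_relevant(passage))

    def mtrr(runs, is_relevant):
        # Mean of the per-question total reciprocal ranks.
        scores = [total_reciprocal_rank(passages, is_relevant) for passages in runs]
        return sum(scores) / len(scores), scores

    # baseline_runs / optimized_runs: one ranked passage list per evaluation
    # question (hypothetical variables):
    # _, per_q_baseline = mtrr(baseline_runs, answer_in_passage)
    # _, per_q_optimized = mtrr(optimized_runs, answer_in_passage)
    # stat, p = wilcoxon(per_q_baseline, per_q_optimized)  # paired per question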
<Paragraph position="3"> Table 3 shows the features and weights selected in the training process. The largest weights are given to names in the text layer, to root forms of names in modifier relations, and to plain-text adjectives. Many keyword types use 'name' or 'noun' as a POS restriction. A surprisingly large number of keyword types are marked as required. Some of them overlap with each other and are therefore redundant. For example, all RootPOS keywords are marked as required; the POS restrictions of RootPOS keywords are therefore useless because they do not alter the query. In other cases, however, overlapping keyword type definitions do influence the query. For example, RootRel keywords in general are marked as required, but other type definitions replace some of them with weighted keywords, e.g., RootRel noun keywords. Finally, some of them may be changed back to required keywords, e.g., RootRel keywords of nouns in a modifier relation.</Paragraph>
</Section>
</Paper>
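To make the interplay between overlapping keyword-type definitions in the final paragraph more concrete, here is a small hypothetical sketch, not the paper's implementation: more specific type definitions override the marking assigned by more general ones, and the resulting query mixes required and weighted keywords. The Lucene-style "+term" / "term^weight" notation and all names and weights are assumptions made purely for illustration.

    def build_query(keywords, type_defs):
        marks = {}
        for name, matches, marking in type_defs:           # general -> specific
            for kw in keywords:
                if matches(kw):
                    marks[kw["term"]] = marking            # later definitions override earlier ones
        parts = []
        for term, marking in marks.items():
            if marking == "required":
                parts.append("+" + term)                   # required keyword
            else:
                parts.append("%s^%.2f" % (term, marking))  # weighted keyword
        return " ".join(parts)

    # RootRel keywords required in general, RootRel noun keywords replaced by a
    # weight, RootRel nouns in a modifier relation changed back to required:
    type_defs = [
        ("RootRel",            lambda k: k["rel"] is not None,                        "required"),
        ("RootRel noun",       lambda k: k["rel"] is not None and k["pos"] == "noun", 1.50),
        ("RootRel noun (mod)", lambda k: k["rel"] == "mod" and k["pos"] == "noun",    "required"),
    ]
    keywords = [
        {"term": "regering", "rel": "mod", "pos": "noun"},  # noun in a modifier relation
        {"term": "minister", "rel": "su",  "pos": "noun"},  # noun in another relation
    ]
    print(build_query(keywords, type_defs))                 # -> "+regering minister^1.50"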