<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0704">
<Title>Situated Question Answering in the Clinical Domain: Selecting the Best Drug Treatment for Diseases</Title>
<Section position="9" start_page="28" end_page="29" type="evalu">
<SectionTitle> 8 Results </SectionTitle>
<Paragraph position="0"> The results of our automatic evaluation are shown in Table 2: the rows show average ROUGE scores at one, three, five, and ten hits, respectively. In addition to the PubMed baseline and our complete EBM model, we conducted a component-level analysis of our semantic matching algorithm. Three separate ablation studies isolate the effects of the PICO-based score, the strength of evidence score, and the MeSH-based score (columns &quot;PICO&quot;, &quot;SoE&quot;, and &quot;MeSH&quot;). At all document cutoffs, the quality of the EBM-reranked hits is higher than that of the original PubMed hits, as measured by ROUGE. The differences are statistically significant, according to the Wilcoxon signed-rank test, the standard non-parametric test employed in IR.</Paragraph>
<Paragraph position="1"> Based on the component analysis, we can see that the strength of evidence score is responsible for the largest performance gain, although the combination of all three components outperforms each one individually (for the most part). All three components of our semantic model contribute to the overall QA performance, which is expected because clinical relevance is a multi-faceted property that depends on a multitude of considerations. Evidence-based medicine provides a theory of these factors, and we have shown that a question answering algorithm that operationalizes EBM yields good results.</Paragraph>
<Paragraph position="2"> The distribution of human judgments from our manual evaluation is shown in Figure 2. For the development set, the average human judgment of the original PubMed hits is 1.52 (between &quot;marginally relevant&quot; and &quot;relevant&quot;); after semantic matching, it rises to 2.32 (better than &quot;relevant&quot;). For the test set, the averages are 1.49 before reranking and 2.10 after semantic matching. These results show that our system performs significantly better than the PubMed baseline.</Paragraph>
<Paragraph position="3"> The performance improvement observed in our experiments is encouraging, considering that we started from a strong, state-of-the-art PubMed baseline that leverages MeSH terms. All initial citations retrieved by PubMed were clinical trials and &quot;about&quot; the disease in question, as determined by human indexers. Our work demonstrates that principles of evidence-based medicine can be codified in an algorithm.</Paragraph>
<Paragraph position="4"> Since a number of abstracts were both automatically evaluated with ROUGE and manually assessed, it is possible to determine the degree to which automatic metrics predict human judgments. For the 125 human judgments gathered on the test set, we computed a Pearson's r score of 0.544, which indicates moderate predictiveness. Due to the structure of our PubMed query, the keyword content of retrieved abstracts is relatively homogeneous. Nevertheless, automatic evaluation with ROUGE appears to be useful.</Paragraph>
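To make the statistical machinery above concrete, here is a minimal Python sketch of the two computations this section reports: a Wilcoxon signed-rank test over paired per-question ROUGE scores, and a Pearson correlation between ROUGE scores and human judgments. All numbers, lists, and variable names below are invented placeholders, not the paper's data; only the choice of tests follows the text.

```python
# Sketch of the significance test and correlation reported in Section 8.
# The score lists below are invented placeholders, not the paper's data.
from scipy.stats import pearsonr, wilcoxon

# Hypothetical per-question ROUGE scores for the PubMed baseline and
# the EBM-reranked results (one pair per test question).
rouge_baseline = [0.31, 0.42, 0.28, 0.55, 0.47, 0.36]
rouge_reranked = [0.39, 0.44, 0.35, 0.58, 0.51, 0.43]

# Wilcoxon signed-rank test: a non-parametric paired test, standard in
# IR, for whether reranked scores differ significantly from baseline.
w_stat, w_p = wilcoxon(rouge_baseline, rouge_reranked)
print(f"Wilcoxon signed-rank: W={w_stat}, p={w_p:.4f}")

# Pearson's r between automatic ROUGE scores and manual judgments,
# assuming a graded relevance scale (e.g., 1 = marginally relevant,
# 2 = relevant), measuring how well the metric predicts human opinion.
human_judgments = [1.0, 2.0, 1.0, 3.0, 2.0, 2.0]
r, r_p = pearsonr(rouge_reranked, human_judgments)
print(f"Pearson's r = {r:.3f} (p = {r_p:.4f})")
```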
</Section>
<Section position="10" start_page="29" end_page="30" type="evalu">
<SectionTitle> 9 Discussion and Related Work </SectionTitle>
<Paragraph position="0"> Recently, researchers have become interested in restricted-domain question answering because it provides an opportunity to explore the use of knowledge-rich techniques without having to tackle the commonsense reasoning problem.</Paragraph>
<Paragraph position="1"> Knowledge-based techniques dependent on rich semantic representations contrast with TREC-style factoid question answering, which is primarily driven by keyword matching and named-entity detection.</Paragraph>
<Paragraph position="2"> Our work represents a successful case study of how semantic models can be employed to capture domain knowledge (the practice of medicine, in our case). The conception of question answering as the matching of knowledge frames provides us with an opportunity to experiment with semantic representations that capture the content of both documents and information needs. In our case, PICO-based scores were found to have a positive impact on performance. The strength of evidence and MeSH-based scores represent attempts to model user requirements by leveraging meta-level information not directly present in either questions or candidate answers. Both contribute positively to performance. Overall, the construction of our semantic model is enabled by the UMLS ontology, which provides an enumeration of relevant concepts (e.g., the names of diseases, drugs, etc.) and semantic relations between those concepts.</Paragraph>
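As an illustration of how such a frame-based semantic match can drive reranking, the sketch below combines three component scores into a single value and sorts citations by it. The class, the weights, and the PMIDs are all hypothetical stand-ins; the paper's actual PICO, strength-of-evidence, and MeSH scorers are defined in earlier sections, and its weights were set heuristically rather than learned (as noted below).

```python
# Hypothetical sketch of the reranking step: three component scores are
# combined linearly and citations are sorted by the result. Scorers,
# weights, and identifiers are placeholders, not the authors' values.
from dataclasses import dataclass

@dataclass
class CitationScores:
    pico: float  # match between the question's PICO frame and the abstract
    soe: float   # strength of evidence of the cited study
    mesh: float  # MeSH-term overlap with the query

# Heuristic weights (placeholders); the paper notes its weights were
# chosen by hand rather than optimized in a principled manner.
W_PICO, W_SOE, W_MESH = 1.0, 1.0, 1.0

def ebm_score(s: CitationScores) -> float:
    """Single reranking score from the three semantic components."""
    return W_PICO * s.pico + W_SOE * s.soe + W_MESH * s.mesh

# Rerank a hypothetical PubMed hit list by descending EBM score.
hits = {
    "PMID-0001": CitationScores(pico=0.8, soe=0.6, mesh=0.7),
    "PMID-0002": CitationScores(pico=0.5, soe=0.9, mesh=0.4),
    "PMID-0003": CitationScores(pico=0.7, soe=0.7, mesh=0.9),
}
reranked = sorted(hits, key=lambda pmid: ebm_score(hits[pmid]), reverse=True)
print(reranked)  # ['PMID-0003', 'PMID-0001', 'PMID-0002']
```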
<Paragraph position="3"> Question answering in the clinical domain is an emerging area of research that has only recently begun to receive serious attention. As a result, the research space is sparsely populated, and there exist relatively few points of comparison to our own work.</Paragraph>
<Paragraph position="4"> The idea that information systems should be sensitive to the practice of evidence-based medicine is not new. Many researchers have studied MeSH terms associated with basic clinical tasks (Mendonça and Cimino, 2001; Haynes et al., 1994). Although PICO frames were originally developed as a tool to assist in query formulation, Booth (2000) pointed out that they can be employed to structure IR results for improving precision; PICO-based querying is merely an instance of faceted querying, which has been widely used by librarians since the invention of automated retrieval systems. The feasibility of automatically identifying outcome statements in secondary sources has been demonstrated by Niu and Hirst (2004), but our work differs in its focus on the primary medical literature. Approaching clinical needs from a different perspective, the PERSIVAL system leverages patient records to rerank search results (McKeown et al., 2003). Since its primary focus is on personalization, this work can be viewed as complementary to our own.</Paragraph>
<Paragraph position="5"> The dearth of related work and the lack of a pre-existing clinical test collection explain, to a large extent, the ad hoc nature of some aspects of our semantic matching algorithm: all weights were heuristically chosen to reflect our understanding of the domain and were not optimized in a principled manner. Nevertheless, the performance gains observed on the development set carried over to the blind held-out test collection, providing confidence in the generality of our methods. Developing a more formal scoring model for evidence-based medicine will be the subject of future work.</Paragraph>
</Section>
</Paper>