<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1802">
<Title>Linguistic Knowledge and Question Answering</Title>
<Section position="2" start_page="0" end_page="0" type="metho">
<SectionTitle> 1 Linguistically Informed IR </SectionTitle>
<Paragraph position="0"> Information retrieval is used in most QA systems to filter relevant passages out of large document collections and thereby narrow down the search space for the answer extraction modules. Given a full syntactic analysis of the text collection, it becomes feasible to exploit linguistic information as a knowledge source for IR. Using Apache's IR system Lucene, we can index the document collection along various linguistic dimensions, such as part-of-speech tags, named-entity classes, and dependency relations. Tiedemann (2005) uses a genetic algorithm to optimize the use of such an extended IR index, and shows that it leads to significant improvements in IR performance.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 Acquisition of Lexical Knowledge </SectionTitle>
<Paragraph position="0"> Syntactic similarity measures can be used for the automatic acquisition of lexical knowledge required for QA, as well as for answer extraction and ranking. For instance, van der Plas and Bouma (2005) show that automatically acquired class labels for named entities improve the accuracy of answering general WH-questions (e.g. Which ferry sank in the Baltic Sea?) and questions which ask for the definition of a named entity (e.g. Who is Nelson Mandela? or What is MTV?).</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Off-line answer extraction </SectionTitle>
<Paragraph position="0"> Off-line extraction of answers to frequent question types can be based on dependency patterns and coreference resolution (Bouma et al., 2005; Mur and van der Plas, 2006), leading to higher recall than systems using surface patterns. Closed-domain (medical) QA can benefit from the fact that dependency relations allow answers to be identified for questions which are not restricted to specific named-entity classes, such as questions asking for definitions, causes, or symptoms. Answering definition questions, for instance, is a task which has motivated approaches that go well beyond the techniques used for answering factoid questions.</Paragraph>
<Paragraph position="1"> Fahmi and Bouma (2006) show that syntactic patterns can be used to extract potential definition sentences from Wikipedia, and that syntactic features of these sentences (in combination with obvious clues such as the position of the sentence in the document) can be used to improve the accuracy of an automatic classifier which distinguishes definitions from non-definitions in the extracted data set.</Paragraph>
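<Paragraph position="2"> To make the extract-then-classify approach concrete, the Python sketch below mimics it on plain text. It is only an illustration: the syntactic patterns used by Fahmi and Bouma (2006) are replaced here by a crude surface pattern for copular sentences, and the function and feature names are invented for this example rather than taken from their system.

import re

# Crude surface stand-in for a copular definition pattern: "X is a/an/the Y ...".
# The actual extraction in Fahmi and Bouma (2006) operates on parsed sentences.
COPULA = re.compile(r"^([A-Z][\w \-]{0,40}?)\s+is\s+(?:a|an|the)\s+(\w+)")

def match_definition(sentence):
    """Return a regex match if the sentence looks like a definition candidate."""
    return COPULA.match(sentence.strip())

def features(sentence, position_in_doc):
    """Features for a definition vs. non-definition classifier (illustrative only)."""
    m = match_definition(sentence)
    return {
        "matches_copula_pattern": m is not None,
        "sentence_position": position_in_doc,  # early sentences are more likely definitions
        "genus_term": m.group(2).lower() if m else "",
        "length_in_tokens": len(sentence.split()),
    }

if __name__ == "__main__":
    article = [
        "MTV is an American cable channel devoted to music videos.",
        "It was launched in 1981.",
    ]
    for i, sentence in enumerate(article):
        print(features(sentence, i))

The resulting feature dictionaries could be fed to any off-the-shelf classifier; the point is only that pattern matches and simple positional clues combine naturally as classifier features, as described above.</Paragraph>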
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Joost </SectionTitle>
<Paragraph position="0"> Joost is a QA system for Dutch which incorporates the features mentioned above, using the Alpino parser for Dutch to parse the document collections (offline) as well as user questions (interactively). It has been used for the open-domain monolingual QA task of CLEF 2005, as well as for closed-domain medical QA. For CLEF, the full Dutch text collection (4 years of newspaper text, approximately 80 million words) has been parsed.</Paragraph>
<Paragraph position="1"> For the medical QA system, we have been using a mixture of texts from general and medical encyclopedias, medical reference works, and web pages dedicated to medical topics. The medical data are from mixed sources and contain a fair amount of domain-specific terminology. Although the Alpino system is robust enough to deal with such material, we believe that the accuracy of linguistic analysis on this task can be further improved by incorporating domain-specific terminological resources. We are currently investigating methods for acquiring such knowledge automatically from the encyclopedia sources.</Paragraph>
</Section>
</Paper>