<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1187"> <Title>Web-Based List Question Answering</Title> <Section position="2" start_page="0" end_page="1" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> The Text REtrieval Conference (TREC) series has greatly encouraged Question Answering (QA) research in recent years. The main QA task in TREC-12 involved retrieving short, concise answers to factoid and list questions, and answer nuggets for definition questions (Voorhees, 2003).</Paragraph>
<Paragraph position="1"> The list task in TREC-12 required systems to assemble a set of distinct and complete exact answers in response to questions like &quot;What are the brand names of Belgian chocolates?&quot;. Unlike the questions in previous TREC conferences, TREC-12 list questions did not specify a target number of instances to return, but instead expected all answers contained in the corpus. Current QA systems (Harabagiu et al., 2003; Katz et al., 2003) usually extract a ranked list of factoid answers from the top documents returned by retrieval engines. This is the traditional way to find factoid answers. The only difference between answering list questions and factoid questions here is that list QA systems allow multiple answers whose scores are above a cut-off threshold.</Paragraph>
<Paragraph position="2"> An analysis of the results of TREC-12 list QA systems (Voorhees, 2003) reveals that many of them suffer severely from two general problems: low recall and non-distinctive answers. The median average F score of list runs was only 21.3%, while the best performer achieved only 39.6% (Table 1). This unsatisfactory performance exposes the limitation of using only traditional Information Retrieval and Natural Language Processing techniques when the goal is an exhaustive set of factoid answers rather than a single one.</Paragraph>
<Paragraph position="4"> In contrast to the traditional techniques, the Web is used extensively in systems that answer factoid questions. QA researchers have explored a variety of uses of the Web, ranging from surface pattern mining (Ravichandran et al., 2002), query formulation (Yang et al., 2003), and answer validation (Magnini et al., 2002), to directly finding answers on the Web by data redundancy analysis (Brill et al., 2001). These systems demonstrated that, with the help of the Web, they could generally boost baseline performance by 25%-30% (Lin, 2002).</Paragraph>
<Paragraph position="5"> The well-known redundancy-based approach identifies the factoid answer as the N-gram appearing most frequently on the Web (Brill et al., 2001). This idea works well on factoid questions because such questions require only one instance, and web documents contain a large amount of repeated information about possible answers. However, when dealing with list questions, we need to find all distinct instances and hence cannot ignore the less frequent answer candidates. The redundancy-based approach fails to spot novel or unexpectedly valuable information in lower-ranked web pages with few occurrences, as the sketch below illustrates.</Paragraph>
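<Paragraph> The following minimal Python sketch illustrates the redundancy-based idea and its failure mode for list questions; it is not the implementation from any cited system, and the function names and example snippets are purely illustrative. Candidate answers are n-grams counted across retrieved snippets, so rare but valid list answers fall below the cut-off:

  # Minimal sketch of redundancy-based answer extraction, in the spirit of
  # Brill et al. (2001). Assumes `snippets` already holds retrieved web text.
  from collections import Counter

  def ngrams(tokens, n):
      """Return all n-grams of a token list as space-joined strings."""
      return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

  def redundancy_answers(snippets, max_n=3, top_k=5):
      """Rank candidate answers by how often each n-gram recurs across snippets."""
      counts = Counter()
      for snippet in snippets:
          tokens = snippet.lower().split()
          for n in range(1, max_n + 1):
              counts.update(ngrams(tokens, n))
      # The most frequent n-grams are kept as answers; infrequent but
      # distinct instances (needed for list questions) are discarded.
      return counts.most_common(top_k)

  # Hypothetical snippets: "godiva" recurs and dominates, while equally
  # valid answers such as "neuhaus" or "leonidas" appear only once.
  snippets = [
      "Godiva is a famous Belgian chocolate brand",
      "Belgian chocolate brands include Godiva and Neuhaus",
      "Godiva and Leonidas are Belgian chocolatiers",
  ]
  print(redundancy_answers(snippets))
</Paragraph>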
<Paragraph position="6"> In this paper, we propose a novel framework that employs the Web to support list question answering. It is based on the observation that multiple answer instances often appear in a list or table on a single web page, while multiple web pages may also contain information about the same instance; we therefore differentiate these two types of web pages. For the first category, which we call Collection Page (CP), we need to extract table/list content from the web page.</Paragraph>
<Paragraph position="7"> For the second category, which we call Topic Page (TP), we need to find distinct web pages relating to different answer instances. We will demonstrate that the resulting system, FADA (Find All Distinct Answers), achieves effective list question answering on the TREC corpus.</Paragraph>
<Paragraph position="8"> The remainder of this paper is organized as follows. Section 2 gives the design considerations of our approach. Section 3 details our question analysis and web query formulation. Section 4 describes the web page classification and the web document features used in FADA. Section 5 presents the topic page clustering algorithm, while Section 6 details the answer extraction process. Section 7 discusses experimental results. Section 8 concludes the paper.</Paragraph> </Section> </Paper>