<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0703"> <Title>Question Pre-Processing in a QA System on Internet Discussion Groups</Title> <Section position="3" start_page="0" end_page="16" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Question answering has been an active research topic in recent years. Large-scale QA evaluation projects (e.g., TREC QA-Track, QAC, and CLQA Tracks) have helped advance question answering.</Paragraph> <Paragraph position="1"> However, truly automatic QA services are not yet available on the Internet. One popular way for Internet users to ask questions and get answers is to visit discussion groups, such as Usenet newsgroups or Yahoo! Answers (http://answers.yahoo.com/). You can post your question in a related discussion group and wait for other users to provide answers. Some discussion groups provide search toolbars so that you can search for your question first to see whether similar postings already ask the same question. In Yahoo! Answers, you can also judge answers offered by other users and mark the best one.</Paragraph> <Paragraph position="2"> Postings in discussion groups are good material for developing an FAQ-style QA system on the Internet. When questions similar to a new posting are found in the discussion groups, the responses to those questions can provide answers or relevant information.</Paragraph> <Paragraph position="3"> But without pre-processing, measuring similarity on the original texts raises some problems: 1. Some phrases such as &quot;many thanks&quot; or &quot;help me please&quot; are not part of a question. Such phrases introduce noise and harm matching performance.</Paragraph> <Paragraph position="4"> 2. Quite often there is more than one question in one posting. 
If the question most similar to the user's question appears in an existing posting together with other, different questions, it will receive a lower similarity score than it should because of the other questions.</Paragraph> <Paragraph position="5"> Therefore, inappropriate phrases should be removed and the different questions in one posting should be separated before questions are compared. No previous research has focused on this topic. FAQ finders (Lai et al., 2002; Lytinen and Tomuro, 2002; Burke, 1997) are closely related to it, but they differ in two ways. First, questions in a FAQ set are usually written grammatically and contain no garbage text. Second, each question is usually paired with its own answer; that is, a QA pair often contains only one question.</Paragraph> <Paragraph position="6"> Some research groups have divided questions into segments. Soricut and Brill (2004) chunked questions and used the chunks as queries to search engines. Saquete et al. (2004) focused on decomposing a complex question into several sub-questions. In this paper, question segmentation means identifying the different questions posed in one posting.</Paragraph> </Section> </Paper>