<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0108"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics Cluster-based Language Model for Sentence Retrieval in Chinese Question Answering</Title> <Section position="3" start_page="0" end_page="58" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> To facilitate answer extraction in question answering, the task of the retrieval module is to find the passages or sentences most relevant to the question. The retrieval module therefore plays a very important role in a question answering system, influencing both its performance and its speed. In this paper, we focus on improving the performance of sentence retrieval in Chinese question answering.</Paragraph> <Paragraph position="1"> Many retrieval approaches have been proposed for sentence retrieval in English question answering. For example, Ittycheriah [Ittycheriah, et al. 2002] and H. Yang [Hui Yang, et al. 2002] proposed vector space models, while Andres [Andres, et al. 2004] and Vanessa [Vanessa, et al. 2004] proposed a language model and a translation model respectively. Compared to the vector space model, the language model is theoretically attractive and a potentially very effective probabilistic framework for studying information retrieval problems [Jian-Yun Nie. 2005].</Paragraph> <Paragraph position="2"> However, language modeling for sentence retrieval is not yet mature, and several difficult problems remain open, such as how to incorporate structural information and how to resolve the data sparseness problem. In this paper, we focus on the smoothing of the language model, because the sparseness problem is more serious for sentence retrieval than for document retrieval. 
At present, the most popular smoothing approaches for language models are the Jelinek-Mercer method, Bayesian smoothing with Dirichlet priors, absolute discounting, and so on [C. Zhai, et al. 2001]. The main disadvantage of all these smoothing approaches is that each document model (estimated from a single document) is interpolated with the same collection model (estimated from the whole collection) through a single parameter. Therefore, when none of the documents contains a query term, smoothing does not make any one particular document more probable than any other. In other words, if a document is relevant but does not contain the query term, it is still no more probable, even though it may be topically related.</Paragraph> <Paragraph position="3"> As we know, most smoothing approaches for sentence retrieval in question answering are borrowed from document retrieval without much adaptation. In fact, question answering has some characteristics that differ from traditional document retrieval and that could be used to improve the performance of sentence retrieval.</Paragraph> <Paragraph position="4"> These characteristics are: 1. The input of question answering is a natural language question, which is less ambiguous than a query in traditional document retrieval.</Paragraph> <Paragraph position="5"> In traditional document retrieval, it is difficult to identify which kind of information the users want. For example, if a user submits the query {Fa Ming /invent, Dian Hua /telephone}, the search engine does not know what information is needed: who invented the telephone, when the telephone was invented, or something else. On the other hand, in a question answering system, if the user submits the question {Shui Fa Ming Liao Dian Hua ?/who invented the telephone?}, it is easy to see that the user wants to know the person who invented the telephone, and nothing else.</Paragraph> <Paragraph position="6"> 2. 
Candidate answers, extracted according to the semantic category of the question's expected answer, can be used for sentence clustering in question answering.</Paragraph> <Paragraph position="7"> Although the top retrieved sentences are related to the question, they usually deal with one or more topics. That is, the relevant sentences for a question may be distributed over several topics.</Paragraph> <Paragraph position="8"> Therefore, it is unreasonable to treat the question's words equally across retrieved sentences with different topics. One solution is to organize the related sentences into several clusters, where a sentence can belong to one or more clusters and each cluster is regarded as a topic. This is sentence clustering. Obviously, cluster and topic have the same meaning here and can be used interchangeably. In other words, a particular entity type is expected for each question, and every specific entity of that type found in a retrieved sentence is regarded as a cluster/topic.</Paragraph> <Paragraph position="9"> In this paper, we propose two novel approaches to sentence clustering. The main idea is to cluster sentences according to the candidate answers, which also serve as the names of the clusters.</Paragraph> <Paragraph position="10"> For example, given the question {Shui Fa Ming Liao Dian Hua ?/who invented the telephone?}, the top ten retrieved sentences and the corresponding candidate answers are shown in Table 1. Thus, we can cluster the sentences according to the candidate answers, that is, {Bei Er /Bell, Xi Men Zi Based on the above analysis, this paper presents a cluster-based language model for sentence retrieval in Chinese question answering. It differs from most previous approaches mainly as follows. 1. Sentence clustering is conducted according to the candidate answers extracted from the top 1000 sentences. 2. 
The cluster information of each sentence, also called its topic, is incorporated into the language model through an aspect model. For sentence clustering, we propose two novel approaches, called One-Sentence-Multi-Topics and One-Sentence-One-Topic respectively. The experimental results show that the performance of the cluster-based language model for sentence retrieval is improved significantly.</Paragraph> <Paragraph position="11"> The framework of the cluster-based language model for sentence retrieval is shown in Figure 1. Figure 1 The Framework of Cluster-based Language Model for Sentence Retrieval 2 Language Model for Information Retrieval The language model for information retrieval was presented by Ponte & Croft in 1998 [J. Ponte, et al. 1998] and has several advantages over the vector space model. After that, many improved models were proposed, such as those of J.F. Gao [J.F Gao, et al. 2004], C. Zhai [C. Zhai, et al. 2001], and so on. In 1999, Berger & Lafferty [A. Berger, et al. 1999] presented a statistical translation model for information retrieval.</Paragraph> <Paragraph position="12"> The basic approach of the language model for information retrieval is to model the process of generating the query Q. The approach has two steps.</Paragraph> <Paragraph position="13"> 1. Constructing a document model for each document in the collection; 2. Ranking the documents according to the probabilities p(Q|D). A classical unigram language model for IR can be expressed as equation (1): p(Q|D) = ∏_{q_i ∈ Q} p(q_i|D) (1)</Paragraph> <Paragraph position="15"> where p(q_i|D) is the document model, which represents the term distribution over the document.</Paragraph> <Paragraph position="17"> Obviously, estimating the probability p(q_i|D) is the key of the document model. 
To solve the sparseness problem, Jelinek-Mercer smoothing is commonly used, which can be expressed as equation (2): p(q_i|D) = (1 - λ) p_ml(q_i|D) + λ p_ml(q_i|C) (2)</Paragraph> <Paragraph position="19"> where p_ml(q_i|D) and p_ml(q_i|C) are the document model and the collection model respectively, estimated via maximum likelihood.</Paragraph> <Paragraph position="20"> As described above, the disadvantage of the standard language model is that it does not make any one particular document more probable than any other when none of the documents contains the query term. In other words, if a document is relevant but does not contain the query term, it is still no more probable, even though it may be topically related. Thus, the smoothing approaches based on the standard language model are inadequate. In this paper, we propose a novel cluster-based language model to overcome this problem.</Paragraph> </Section></Paper>