<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1052">
  <Title>Language Model Information Retrieval with Document Expansion</Title>
  <Section position="2" start_page="0" end_page="407" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Information retrieval with statistical language models (Lafferty and Zhai, 2003) has recently attracted much attention because of its solid theoretical foundations as well as its good empirical performance. In this approach, queries and documents are assumed to be sampled from hidden generative models, and the similarity between a document and a query is then calculated through the similarity between their underlying models.</Paragraph>
    <Paragraph position="1"> Clearly, good retrieval performance relies on accurate estimation of the query and document models. Indeed, smoothing of document models has proved critical (Chen and Goodman, 1998; Kneser and Ney, 1995; Zhai and Lafferty, 2001b). The need for smoothing originates from the zero-count problem: when a term does not occur in a document, the maximum likelihood estimator assigns it zero probability. This is unreasonable because a zero count is often due to insufficient sampling, and a larger sample of the data would likely contain the term. Smoothing addresses this problem by redistributing some probability mass to unseen terms.</Paragraph>
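The interpolation-based smoothing mentioned above can be illustrated with a minimal sketch. The example below uses Jelinek-Mercer smoothing, one of the standard methods studied in (Zhai and Lafferty, 2001b): the document's maximum likelihood estimate is linearly interpolated with a collection language model, so a term absent from the document still receives nonzero probability. The mixing weight lam=0.5 is an arbitrary illustrative value, not a tuned setting from the paper.

```python
from collections import Counter

def smoothed_prob(word, doc_tokens, collection_tokens, lam=0.5):
    """Jelinek-Mercer smoothing: interpolate the document's maximum
    likelihood estimate with the collection model so that unseen terms
    get nonzero probability. Illustrative sketch only."""
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(collection_tokens)
    p_ml = doc_counts[word] / len(doc_tokens)        # zero if word unseen in doc
    p_coll = coll_counts[word] / len(collection_tokens)
    return (1 - lam) * p_ml + lam * p_coll
```

Without the collection term, a query word missing from a document would zero out the whole query likelihood; the interpolation keeps every observed collection term scorable.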
    <Paragraph position="2"> While most smoothing methods utilize the global collection information with a simple interpolation (Ponte and Croft, 1998; Miller et al., 1999; Hiemstra and Kraaij, 1998; Zhai and Lafferty, 2001b), several recent studies (Liu and Croft, 2004; Kurland and Lee, 2004) have shown that local corpus structures can be exploited to improve retrieval performance.</Paragraph>
    <Paragraph position="3"> In this paper, we further study the use of local corpus structures for document model estimation and propose to use document expansion to better exploit local corpus structures for estimating document language models.</Paragraph>
    <Paragraph position="4"> According to statistical principles, the accuracy of a statistical estimator is largely determined by the size of the observed sample; a small data set generally yields estimates with large variance, and thus cannot be fully trusted. Unfortunately, in retrieval we often have to estimate a model from a single document. Since a document is a small sample, the estimate is unlikely to be accurate.</Paragraph>
    <Paragraph position="5"> A natural improvement is to enlarge the data sample, ideally in a document-specific way. Ideally, the enlarged sample should come from the same generative model as the original document. In reality, however, since the underlying model is unknown to us, we cannot obtain such extra data directly. The essence of this paper is to use document expansion to obtain high-quality extra data that enlarges the sample for a document and thereby improves the accuracy of the estimated document language model. Document expansion was previously explored in (Singhal and Pereira, 1999) in the context of the vector space retrieval model, mainly by selecting additional terms from similar documents. Our work differs from this previous work in that we study document expansion in the language modeling framework and implement the idea quite differently.</Paragraph>
    <Paragraph position="6"> Our main idea is to augment a document probabilistically with potentially all other documents in the collection that are similar to it. The probability associated with each neighbor document reflects how likely that neighbor is to have been generated from the underlying distribution of the original document; the result is a probabilistic neighborhood, which can serve as extra data for estimating the document's underlying language model. From the viewpoint of smoothing, our method extends the existing work on using clusters for smoothing (Liu and Croft, 2004) by allowing each document to have its own cluster for smoothing.</Paragraph>
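The probabilistic-neighborhood idea above can be sketched as follows. Each neighbor contributes pseudo-counts to the document in proportion to its normalized similarity, so the expanded count vector blends the original document with its neighborhood. Note that cosine similarity and the mixing weight alpha are illustrative assumptions here, not necessarily the exact similarity measure or parameterization used in the paper.

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(c1[t] * c2[t] for t in c1 if t in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def expanded_counts(doc_tokens, neighbor_docs, alpha=0.5):
    """Sketch of document expansion: blend the document's own counts
    with neighbor counts weighted by normalized similarity. alpha and
    the use of cosine similarity are illustrative choices."""
    d = Counter(doc_tokens)
    neighbors = [Counter(n) for n in neighbor_docs]
    sims = [cosine(d, n) for n in neighbors]
    z = sum(sims) or 1.0  # normalize similarities over the neighborhood
    pseudo = Counter()
    for w, c in d.items():
        pseudo[w] += alpha * c
    for sim, n in zip(sims, neighbors):
        for w, c in n.items():
            pseudo[w] += (1 - alpha) * (sim / z) * c
    return pseudo
```

The expanded pseudo-counts can then feed any standard estimator, giving each document, in effect, its own smoothing cluster.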
    <Paragraph position="7"> We evaluated our method on six representative retrieval test sets. The experimental results show that document expansion smoothing consistently outperforms the baseline smoothing methods on all the data sets. It also outperforms a state-of-the-art clustering smoothing method. Analysis shows that the improvement tends to be more significant for short documents, indicating that the improvement indeed comes from better estimation of the document language model, since a short document presumably benefits more from neighborhood smoothing. Moreover, since document expansion and pseudo feedback exploit different corpus structures, they can be combined to further improve performance. Because document expansion can be done at the indexing stage, it is scalable to large collections.</Paragraph>
  </Section>
</Paper>