<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4002"> <Title>MMR-based feature selection for text categorization</Title>
<Section position="3" start_page="0" end_page="21" type="metho"> <SectionTitle> 2 Maximal Marginal Relevance </SectionTitle>
<Paragraph position="0"> Most modern IR search engines produce a ranked list of retrieved documents ordered by declining relevance to the user's query. In contrast, 'relevant novelty' has been motivated as a potentially superior ranking criterion. A first approximation to relevant novelty is to measure relevance and novelty independently and provide a linear combination as the metric.</Paragraph>
<Paragraph position="1"> The linear combination is called 'marginal relevance' - i.e., a document has high marginal relevance if it is both relevant to the query and has minimal similarity to previously selected documents. In document retrieval and summarization, we strive to maximize marginal relevance, hence the method is labeled 'maximal marginal relevance' (MMR):</Paragraph>
<Paragraph position="2"> MMR = Arg max_{D_i in R\S} [ λ Sim_1(D_i, Q) - (1 - λ) max_{D_j in S} Sim_2(D_i, D_j) ]</Paragraph>
<Paragraph position="3"> where C = {D_1, ..., D_i, ...} is a document collection (or document stream); Q is a query or user profile; R = IR(C, Q, th), i.e., the ranked list of documents retrieved by an IR system, given C, Q, and a relevance threshold th below which it will not retrieve documents (th can be a degree of match or a number of documents); S is the subset of documents in R already selected; R\S is the set difference, i.e., the set of as yet unselected documents in R; Sim_1 is the similarity metric used in document retrieval and relevance ranking between documents (passages) and a query; and Sim_2 can be the same as Sim_1 or a different metric.</Paragraph> </Section>
<Section position="4" start_page="21" end_page="21" type="metho"> <SectionTitle> 3 MMR-based Feature Selection </SectionTitle>
<Paragraph position="0"> We propose an MMR-based feature selection method which selects each feature according to a combined criterion of information gain and novelty of information.
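The greedy selection loop implied by the MMR formula above can be sketched as follows. This is an illustrative sketch, not code from the paper; `mmr_rank`, `sim1`, and `sim2` are hypothetical names, and the similarity functions (Sim_1, Sim_2 in the formula) are assumed to be supplied by the caller:

```python
def mmr_rank(R, Q, sim1, sim2, lam=0.7, k=10):
    """Greedy MMR selection: repeatedly pick the document in R\\S that
    maximizes lam * sim1(d, Q) - (1 - lam) * max_{s in S} sim2(d, s)."""
    S = []                      # selected documents, in selection order
    candidates = list(R)        # R \ S: as yet unselected documents
    while candidates and len(S) < k:
        def score(d):
            # novelty penalty: similarity to the closest already-selected doc
            novelty = max((sim2(d, s) for s in S), default=0.0)
            return lam * sim1(d, Q) - (1 - lam) * novelty
        best = max(candidates, key=score)
        S.append(best)
        candidates.remove(best)
    return S

# Toy usage with 2-d vectors and dot-product similarity (hypothetical data):
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
dot = lambda a, b: a[0] * b[0] + a[1] * b[1]
print(mmr_rank(docs, (1.0, 0.0), dot, dot, lam=0.3, k=2))
# -> [(1.0, 0.0), (0.0, 1.0)]  (second pick favors novelty over relevance)
```

With λ = 1 the loop degenerates into plain relevance ranking; lowering λ makes the second and later picks trade relevance for diversity, as the example output shows.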
We define MMR-based feature selection as follows:</Paragraph>
<Paragraph position="1"> MMR_FS = Arg max_{w_i in R\S} [ λ IG(w_i) - (1 - λ) max_{w_j in S} IGpair(w_i, w_j) ]</Paragraph>
<Paragraph position="2"> where C is the set of class labels, R is the set of candidate features, S is the subset of features in R already selected, R\S is the set difference, i.e., the set of as yet unselected features in R, IG(w_i) is the information gain score of feature w_i, and IGpair(w_i, w_j) is the information gain score of the co-occurrence of the word (feature) pair (w_i, w_j). IG and IGpair are defined as follows:</Paragraph>
<Paragraph position="3"> IG(w) = - sum_k p(C_k) log p(C_k) + p(w) sum_k p(C_k|w) log p(C_k|w) + p(~w) sum_k p(C_k|~w) log p(C_k|~w), with IGpair defined analogously using the co-occurrence of the word pair in place of the single word w,</Paragraph>
<Paragraph position="4"> where p(C_k) is the probability of the k-th class value, and p(C_k|w) is the conditional probability of the k-th class value given that the word w appears.</Paragraph>
<Paragraph position="8"> Given the above definition, MMR_FS incrementally computes the information gain scores when the parameter λ = 1, and computes a maximal diversity among the features in R when λ = 0. For intermediate values of λ in the interval [0,1], a linear combination of both criteria is optimized.</Paragraph> </Section> </Paper>
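The MMR_FS loop can be sketched in the same greedy style. This is an illustrative sketch, not the paper's implementation: it assumes boolean word-presence features, estimates IG with the standard entropy-based formula, and uses the logical AND of two presence vectors for the co-occurrence pair; all names (`mmr_fs`, `info_gain`, `occurs`) are hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, present):
    """IG of a boolean feature: class entropy minus conditional entropy."""
    n = len(labels)
    cond = 0.0
    for flag in (True, False):
        sub = [y for y, f in zip(labels, present) if f == flag]
        if sub:
            cond += (len(sub) / n) * entropy(sub)
    return entropy(labels) - cond

def mmr_fs(occurs, labels, lam=0.7, k=3):
    """Greedy MMR_FS sketch.  occurs maps each word to a per-document
    presence list; at every step pick the unselected feature maximizing
    lam * IG(w) - (1 - lam) * max_{s in S} IGpair(w, s), where IGpair is
    the IG of the co-occurrence (AND) of the two presence vectors."""
    ig = {w: info_gain(labels, flags) for w, flags in occurs.items()}
    S, R = [], set(occurs)
    while R and len(S) < k:
        def score(w):
            pair = max((info_gain(labels,
                                  [a and b for a, b in zip(occurs[w], occurs[s])])
                        for s in S), default=0.0)
            return lam * ig[w] - (1 - lam) * pair
        best = max(sorted(R), key=score)  # sorted() makes ties deterministic
        S.append(best)
        R.remove(best)
    return S
```

With λ = 1 the score reduces to plain IG ranking, and with λ = 0 only the diversity term drives selection, matching the two limiting cases described in the text.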