File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/93/h93-1070_intro.xml
Size: 4,924 bytes
Last Modified: 2025-10-06 14:05:28
<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1070">
<Title>The Importance of Proper Weighting Methods</Title>
<Section position="2" start_page="0" end_page="349" type="intro">
<SectionTitle> 1. INTRODUCTION </SectionTitle>
<Paragraph position="0"> Other than experimental results, the first part of this paper contains little new material. Instead, it is an attempt to demonstrate the relative importance and difficulty of the common information retrieval tasks of forming document and query representatives and weighting features. This is the sort of knowledge that tends to be passed along by word of mouth, if at all, and never gets published. However, there is currently a tremendous revival of interest in information retrieval; hence this attempt to help the many people just starting out in experimental information retrieval.</Paragraph>
<Paragraph position="1"> A common approach in many areas of natural language processing is to
1. Find &quot;features&quot; of a natural language excerpt
2. Determine the relative importance of those features within the excerpt
3. Submit the weighted features to some task-appropriate decision procedure
This presentation focuses on the second subtask above: the process of weighting features of a natural language representation. Features here could be things like single-word occurrences, phrase occurrences, other relationships between words, occurrence of a word in a title, part-of-speech of a word, automatically or manually assigned categories of a document, citations of a document, and so on. The particular overall task addressed here is that of information retrieval - finding textual documents (from a large set of documents) that are relevant to a user's information need. Weighting features is something that many information retrieval systems seem to regard as being of minor importance compared to finding the features in the first place; but the experiments described here suggest that weighting is considerably more important than additional feature selection.</Paragraph>
<Paragraph position="2"> This is not an argument that feature selection is unimportant, but rather that the development of feature selection and of methods for weighting those features must proceed hand-in-hand if there is to be any hope of improving performance. There have been many papers (and innumerable unpublished negative-result experiments) in which the authors devoted tremendous resources and intellectual insight to finding good features to represent a document, but then weighted those features in a haphazard fashion and ended up with little or no improvement. This makes it extremely difficult for a reader to judge the worth of a feature approach, especially since the weighting methods are very often not described in detail.</Paragraph>
<Paragraph position="3"> In the long term, the best weighting methods will obviously be those that can adapt weights as more information becomes available. Unfortunately, in information retrieval it is very difficult to learn anything useful from one query that will be applicable to the next. In the routing or relevance feedback environments, weights can be learned for a query and then applied to that same query. But in general there is not enough overlap in vocabulary (and uses of vocabulary) between queries to learn much about the usefulness of particular words. The second half of this paper discusses an approach that learns the important characteristics of a good term. Those characteristics can then be used to properly weight all terms.</Paragraph>
<Paragraph position="4"> Several sets of experiments are described, with each set using different types of information to determine the weights of features. All experiments were done with the SMART information retrieval system, most using the TREC/TIPSTER collections of documents, queries, and relevance judgements. Each run is evaluated using the &quot;11-point recall-precision average&quot; evaluation method that was standard at the TREC-1 conference.</Paragraph>
<Paragraph position="5"> The basic SMART approach is a completely automatic indexing of the full text of both queries and documents. Common meaningless words (like 'the' or 'about') are removed, and all remaining words are stemmed to a root form. Term weights are assigned to each unique word (or other feature) in a vector by the statistical/learning processes described below. The final form of a representative for a document (or query) is a vector $D_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,n})$, where $D_i$ represents a document (or query) text and $w_{i,k}$ is the weight of term $T_k$ in document $D_i$.</Paragraph>
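A minimal Python sketch of the indexing steps just described may help readers new to experimental IR. The tiny stopword list, the crude suffix-stripping stemmer, and the raw term-frequency weights below are illustrative stand-ins, not SMART's actual components; the paper's experiments replace the weighting step with the statistical/learning processes it discusses.

```python
import re
from collections import Counter

# Stand-in stopword list; SMART's actual list is far larger.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "are", "about"}

def stem(word):
    # Crude suffix stripping as a placeholder for a real stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index(text):
    """Turn raw text into a sparse vector {term: weight}.

    Weights here are raw term frequencies; the experiments in this
    paper replace this step with statistically learned weights.
    """
    words = re.findall(r"[a-z]+", text.lower())
    terms = [stem(w) for w in words if w not in STOPWORDS]
    return dict(Counter(terms))

vec = index("Weighting methods are important; weighting is studied here.")
# {'weight': 2, 'method': 1, 'important': 1, 'studi': 1, 'here': 1}
```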
<Paragraph position="6"> The similarity between a query and a document is set to the inner product of the query vector and the document vector; the information retrieval system as a whole will return those documents with the highest similarity to the query.</Paragraph>
</Section>
</Paper>
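To round out the picture, here is a companion sketch of the inner-product retrieval step from the final paragraph above. The sparse {term: weight} dictionaries and the toy collection are assumptions for illustration only; a real system would score documents through an inverted index rather than iterating over the whole collection.

```python
def inner_product(vec_a, vec_b):
    """Inner product of two sparse {term: weight} vectors."""
    # Iterate over the smaller vector; the product is symmetric.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    return sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)

def retrieve(query_vec, doc_vectors, k=10):
    """Rank documents by similarity to the query; return the top k."""
    scored = sorted(((doc_id, inner_product(query_vec, vec))
                     for doc_id, vec in doc_vectors.items()),
                    key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy example (hypothetical weights, e.g. produced by the sketch above):
query = {"weight": 1.0, "method": 1.0}
docs = {
    "d1": {"weight": 2.0, "method": 1.0, "studi": 1.0},
    "d2": {"retriev": 1.0, "document": 2.0},
}
print(retrieve(query, docs, k=2))  # [('d1', 3.0), ('d2', 0.0)]
```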