<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1315"> <Title>Empirical Term Weighting and Expansion Frequency</Title> <Section position="8" start_page="122" end_page="122" type="concl"> <SectionTitle> 6 Conclusions </SectionTitle> <Paragraph position="0"> This paper introduced an empirical, histogram-based supervised learning method for estimating term weights, λ. Terms are assigned to bins based on features such as inverse document frequency, burstiness and expansion frequency. A different λ is estimated for each bin and each tf by counting the number of relevant and irrelevant documents associated with the bin and tf value. Regression techniques are used to interpolate between bins, but care is taken so that the regression cannot do too much harm (or too much good). Three variations were considered: fit-G, fit-B and fit-E. The performance of query expansion (fit-E) is particularly encouraging. Using simple, purely statistical methods, fit-E is nearly comparable to JCB1, a sophisticated natural language processing system developed by Just System, a leader in the Japanese word processing industry.</Paragraph> <Paragraph position="1"> In addition to performance, we are also interested in the interpretation of the weights. Empirical weights tend to lie between 0 and idf. These limits are surprising, given that standard term weighting formulas such as tf·idf generally do not conform to them. In addition, we find that λ generally grows linearly with idf, and that the slope is between 0 and 1. We interpret the slope as a statistical shrink. The larger slopes are associated with very robust conditions, e.g., terms mentioned explicitly in all three areas of interest: (1) the query (where = D), (2) the document (tf ≥ 1) and (3) the expansion (ef > 1). 
There is generally more shrinking for terms brought in by query expansion (where = E), but if a term is mentioned in several documents in the expansion (ef > 2), then it is not as essential that the term be mentioned explicitly in the query. The interactions among tf, idf, where, B, ef, etc., are complicated; we have therefore found it safer and easier to use histogram methods than to try to account for all of the interactions at once in a single multiple regression.</Paragraph> </Section> </Paper>
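<!--
The counting step summarized in the Conclusions might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the log-likelihood-ratio form of the weight, the add-0.5 smoothing, and the names `estimate_weights`, `observations` and `smoothing` are all hypothetical, and the regression-based interpolation between bins is omitted.

```python
import math
from collections import defaultdict


def estimate_weights(observations, smoothing=0.5):
    """Estimate one weight per (bin, tf) cell from labeled documents.

    observations: iterable of (bin_id, tf, is_relevant) tuples, one per
    (query term, document) pair. The bin_id stands in for whatever
    features define a bin (e.g. idf range, burstiness, expansion frequency).

    ASSUMPTION: the weight is taken to be a smoothed log2 likelihood ratio
    of relevant to irrelevant counts; the paper's exact formula may differ.
    """
    rel = defaultdict(int)    # relevant-document counts per cell
    irrel = defaultdict(int)  # irrelevant-document counts per cell
    for bin_id, tf, is_relevant in observations:
        if is_relevant:
            rel[(bin_id, tf)] += 1
        else:
            irrel[(bin_id, tf)] += 1

    # Totals are computed before any lookups that could add empty cells.
    total_rel = sum(rel.values())
    total_irrel = sum(irrel.values())
    cells = set(rel) | set(irrel)
    n_cells = len(cells)

    weights = {}
    for cell in cells:
        # Additive smoothing keeps both probabilities strictly positive.
        p_rel = (rel[cell] + smoothing) / (total_rel + smoothing * n_cells)
        p_irrel = (irrel[cell] + smoothing) / (total_irrel + smoothing * n_cells)
        weights[cell] = math.log2(p_rel / p_irrel)
    return weights


if __name__ == "__main__":
    # Toy data: one bin dominated by relevant documents, one by irrelevant.
    toy = (
        [("high_idf", 1, True)] * 3 + [("high_idf", 1, False)]
        + [("low_idf", 1, True)] + [("low_idf", 1, False)] * 3
    )
    for cell, weight in sorted(estimate_weights(toy).items()):
        print(cell, round(weight, 3))
```

A cell seen mostly with relevant documents receives a positive weight and a cell seen mostly with irrelevant documents a negative one; a histogram of this kind makes no assumption about how the bin features interact.
-->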