File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/95/w95-0110_abstr.xml
Size: 3,006 bytes
Last Modified: 2025-10-06 13:48:28
<?xml version="1.0" standalone="yes"?>
<Paper uid="W95-0110">
<Title>Inverse Document Frequency (IDF): A Measure of Deviations from Poisson</Title>
<Section position="1" start_page="0" end_page="0" type="abstr">
<SectionTitle> Abstract </SectionTitle>
<Paragraph position="0"> Low frequency words tend to be rich in content, and vice versa. But not all equally frequent words are equally meaningful. We will use inverse document frequency (IDF), a quantity borrowed from Information Retrieval, to distinguish words like somewhat and boycott. Both somewhat and boycott appeared approximately 1000 times in a corpus of 1989 Associated Press articles, but boycott is a better keyword because its IDF is farther from what would be expected by chance (Poisson).</Paragraph>
<Paragraph position="1"> 1. Document frequency is similar to word frequency, but different
Word frequency is commonly used in all sorts of natural language applications. The practice implicitly assumes that words (and ngrams) are distributed by a single-parameter distribution such as a Poisson or a Binomial. But we find that these distributions do not fit the data very well. Both the Poisson and the Binomial assume that the variance over documents is no larger than the mean, and yet we find that it can be quite a bit larger, especially for interesting words such as boycott, where hidden variables such as topic conspire to undermine the independence assumption behind the Poisson and the Binomial. Much better fits are obtained by introducing a second parameter such as inverse document frequency (IDF).</Paragraph>
<Paragraph position="2"> Inverse document frequency (IDF) is commonly used in Information Retrieval (Sparck Jones, 1972). IDF is defined as -log2(dfw/D), where D is the number of documents in the collection and dfw is the document frequency, the number of documents that contain w. Obviously, there is a strong relationship between document frequency, dfw, and word frequency, fw. The relationship is shown in Figure 1, a plot of log10(fw) and IDF for 193 words selected from a 50 million word corpus of 1989 Associated Press (AP) Newswire stories (D = 85,432 stories).</Paragraph>
<Paragraph position="3"> Although log10(fw) is highly correlated with IDF (ρ = -0.994), it would be a mistake to assume that the two variables are completely predictable from one another. Indeed, the experience of the Information Retrieval community has indicated that IDF is a very useful quantity. Attempts to replace IDF with fw (or some simple transform of fw) have not been very successful.</Paragraph>
<Paragraph position="4"> Figure 2 shows one such attempt. It compares the observed IDF with an estimate of IDF based on fw. Assume that a document is merely a &quot;bag of words&quot; with no interesting structure (content). Words are randomly generated by a Poisson process: the probability of k instances of a word w in a document is π(θ, k) = e^(-θ) θ^k / k!, where θ = fw/D.</Paragraph>
</Section>
</Paper>
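
Editor's illustration (not part of the original paper): the abstract compares the observed IDF, -log2(dfw/D), with a Poisson-based estimate derived from the bag-of-words model it sketches. Under that model a document misses word w with probability π(θ, 0) = e^(-θ), θ = fw/D, so the predicted document frequency is D(1 - e^(-θ)) and the predicted IDF is -log2(1 - e^(-θ)). The Python sketch below computes both quantities from a hypothetical toy corpus (the corpus, counts, and function names are stand-ins, not the paper's AP data or code).

    # Hypothetical sketch: observed IDF vs. the Poisson-based prediction.
    import math
    from collections import Counter

    def idf(df_w: int, D: int) -> float:
        """Observed IDF: -log2(dfw / D)."""
        return -math.log2(df_w / D)

    def idf_poisson(f_w: int, D: int) -> float:
        """Poisson prediction: a document misses w with probability
        exp(-theta), theta = fw / D, so predicted dfw/D = 1 - exp(-theta)."""
        theta = f_w / D
        return -math.log2(1.0 - math.exp(-theta))

    # Toy document collection (stand-in for the 85,432 AP stories).
    docs = [["the", "boycott", "spread"], ["somewhat", "later"], ["the", "end"]]
    D = len(docs)
    f = Counter(tok for doc in docs for tok in doc)        # word frequency fw
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency dfw

    for w in ("boycott", "somewhat"):
        print(w, idf(df[w], D), idf_poisson(f[w], D))

Words whose observed IDF sits far from the Poisson prediction are the ones the abstract singles out as good keywords.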
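
The abstract's empirical claim, that the variance of a word's per-document counts can be much larger than its mean (which a Poisson or Binomial forbids), can be checked directly. The sketch below is an editor's illustration on a made-up toy collection, not the paper's experiment; the example words are chosen so that a topical, bursty word is over-dispersed while an evenly spread one is not.

    # Hypothetical sketch: per-word mean vs. variance of per-document counts.
    # Under a Poisson (or Binomial) model the variance should not exceed the
    # mean; topic-dependent 'bursty' words violate this.
    from statistics import mean, pvariance

    docs = [  # toy stand-in for a document collection
        ["boycott", "boycott", "boycott", "talks"],
        ["somewhat", "cloudy", "today"],
        ["somewhat", "warmer", "tomorrow"],
        ["markets", "closed", "early"],
    ]

    vocab = {tok for doc in docs for tok in doc}
    for w in sorted(vocab):
        per_doc = [doc.count(w) for doc in docs]
        m, v = mean(per_doc), pvariance(per_doc)
        flag = "over-dispersed" if v > m else ""
        print(f"{w:10s} mean={m:.2f} variance={v:.2f} {flag}")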