<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1133"> <Title>Are These Documents Written from Different Perspectives? A Test of Different Perspectives Based On Statistical Distribution Divergence</Title> <Section position="5" start_page="1057" end_page="1058" type="metho"> <SectionTitle> 3 Statistical Distribution Divergence </SectionTitle> <Paragraph position="0"> We take a model-based approach to measuring the degree, if any, to which two document collections differ. A document is represented as a point in a V-dimensional space, where V is the vocabulary size. [Footnote 2: However, the close subjectivity ratio does not mean that subjectivity can never help identify document collections of opposing perspectives. For example, the accuracy of the test of different perspectives may be improved by focusing on only subjective sentences.]</Paragraph> <Paragraph position="1"> Each coordinate is the frequency of a word in a document, i.e., its term frequency. Although this vector representation, commonly known as a bag of words, is oversimplified and ignores rich syntactic and semantic structures, more sophisticated representations require more data to obtain reliable models. 
Practically, the bag-of-words representation has been very effective in many tasks, including text categorization (Sebastiani, 2002) and information retrieval (Lewis, 1998).</Paragraph> <Paragraph position="2"> We assume that a collection of N documents, $y_1, y_2, \ldots, y_N$, is sampled from the following process: $\theta \sim \mathrm{Dirichlet}(\alpha)$, $y_i \sim \mathrm{Multinomial}(n_i, \theta)$.</Paragraph> <Paragraph position="4"> We first sample a V-dimensional vector $\theta$ from a Dirichlet prior distribution with hyperparameter $\alpha$, and then repeatedly sample a document $y_i$ from a Multinomial distribution conditioned on the parameter $\theta$, where $n_i$ is the length of the $i$th document in the collection and is assumed to be known and fixed.</Paragraph> <Paragraph position="5"> We are interested in comparing the parameter $\theta$ after observing document collections A and B: $p(\theta|A) = \mathrm{Dirichlet}(\theta \mid \alpha + \sum_{y_i \in A} y_i)$ and $p(\theta|B) = \mathrm{Dirichlet}(\theta \mid \alpha + \sum_{y_i \in B} y_i)$.</Paragraph> <Paragraph position="7"> The posterior distribution $p(\theta \mid \cdot)$ is a Dirichlet distribution because the Dirichlet distribution is a conjugate prior for the Multinomial distribution.</Paragraph> <Paragraph position="8"> How should we measure the difference between the two posterior distributions $p(\theta|A)$ and $p(\theta|B)$? One common measure of the difference between two distributions is the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951), defined as follows:</Paragraph> <Paragraph position="10"> $D(p(\theta|A) \,\|\, p(\theta|B)) = \int p(\theta|A) \log \frac{p(\theta|A)}{p(\theta|B)} \, d\theta$. (5) Directly calculating the KL divergence according to (5) involves a difficult high-dimensional integral. As an alternative, we approximate the KL divergence using Monte Carlo methods as follows: 1. Sample $\theta_1, \theta_2, \ldots, \theta_M$ from $\mathrm{Dirichlet}(\theta \mid \alpha + \sum_{y_i \in A} y_i)$.</Paragraph> <Paragraph position="11"> 2. Return $\hat{D} = \frac{1}{M} \sum_{i=1}^{M} \log \frac{p(\theta_i|A)}{p(\theta_i|B)}$ as a Monte Carlo estimate of $D(p(\theta|A) \,\|\, p(\theta|B))$.</Paragraph> <Paragraph position="12"> Algorithms for sampling from a Dirichlet distribution can be found in Ripley (1987). 
As $M \to \infty$, the Monte Carlo estimate converges to the true KL divergence by the Law of Large Numbers.</Paragraph> </Section> <Section position="6" start_page="1058" end_page="1059" type="metho"> <SectionTitle> 4 Corpora </SectionTitle> <Paragraph position="0"> To evaluate how well the KL divergence between posterior distributions can discern a pair of document collections written from different perspectives, we collect two corpora of documents that were written or spoken from different perspectives and one newswire corpus that covers various topics, as summarized in the accompanying table. [Table caption fragment: document length $\overline{|d|}$ and vocabulary size $V$ of the three corpora.]</Paragraph> <Paragraph position="1"> The first perspective corpus consists of articles published on the bitterlemons website from late 2001 to early 2005. The website was set up to &quot;contribute to mutual understanding [between Palestinians and Israelis] through the open exchange of ideas&quot;. Every week an issue about the Israeli-Palestinian conflict is selected for discussion (e.g., &quot;Disengagement: unilateral or coordinated?&quot;), and a Palestinian editor and an Israeli editor each contribute one article addressing the issue. In addition, the Israeli and Palestinian editors each interview a guest, who expresses his or her views on the issue, resulting in a total of four articles in a weekly edition. The perspective from which each article is written is labeled as either Palestinian or Israeli by the editors.</Paragraph> <Paragraph position="2"> The second perspective corpus consists of the transcripts of the three Bush-Kerry presidential debates in 2004. The transcripts are from the website of the Commission on Presidential Debates. Each spoken document is roughly an answer to a question or a rebuttal. The transcripts are segmented by the speaker tags already present in them. All words from moderators are discarded.</Paragraph> <Paragraph position="3"> The topical corpus contains newswire from Reuters in 1987. 
Reuters-21578 is one of the most common testbeds for text categorization.</Paragraph> <Paragraph position="4"> Each document belongs to none, one, or more of the 135 categories (e.g., &quot;Mergers&quot; and &quot;U.S. Dollars&quot;). Documents are not evenly distributed across categories (median 9.0, mean 105.9 documents per category).</Paragraph> <Paragraph position="5"> To estimate statistics reliably, we consider only categories with more than 500 documents, resulting in a total of seven categories (ACQ, CRUDE, EARN, GRAIN, INTEREST, MONEY-FX, and TRADE).</Paragraph> </Section> </Paper>
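The two-step Monte Carlo estimate of Section 3 can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prior uses the same scalar hyperparameter in every coordinate, each document is a list of term frequencies over a shared vocabulary, and Dirichlet draws are obtained by normalizing independent Gamma variates (a standard method, not necessarily the one given by Ripley). The function names and the toy counts are hypothetical.

```python
import math
import random


def posterior_concentration(docs, alpha):
    """Conjugate update: add every term-frequency vector to the prior alpha."""
    vocab_size = len(docs[0])
    return [alpha + sum(doc[k] for doc in docs) for k in range(vocab_size)]


def sample_dirichlet(conc, rng):
    """Draw one theta by normalizing independent Gamma(conc_k, 1) variates."""
    draws = [rng.gammavariate(c, 1.0) for c in conc]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_logpdf(theta, conc):
    """Log density of Dirichlet(conc) evaluated at the point theta."""
    log_norm = math.lgamma(sum(conc)) - sum(math.lgamma(c) for c in conc)
    return log_norm + sum((c - 1.0) * math.log(t) for c, t in zip(conc, theta))


def mc_kl_divergence(docs_a, docs_b, alpha=1.0, num_samples=2000, seed=0):
    """Monte Carlo estimate of D(p(theta|A) || p(theta|B))."""
    rng = random.Random(seed)
    conc_a = posterior_concentration(docs_a, alpha)
    conc_b = posterior_concentration(docs_b, alpha)
    total = 0.0
    for _ in range(num_samples):
        # Step 1: draw theta from the posterior given collection A.
        theta = sample_dirichlet(conc_a, rng)
        # Step 2: accumulate the log ratio of the two posterior densities.
        total += dirichlet_logpdf(theta, conc_a) - dirichlet_logpdf(theta, conc_b)
    return total / num_samples


# Toy term-frequency vectors over a 3-word vocabulary (made-up data).
collection_a = [[5, 1, 0], [4, 2, 1]]
collection_b = [[0, 2, 6], [1, 1, 5]]
print(mc_kl_divergence(collection_a, collection_b))
print(mc_kl_divergence(collection_a, collection_a))
```

Comparing a collection with itself gives an estimate of exactly zero, since both posteriors coincide, while collections with different word usage give a positive estimate. Working with log densities via `math.lgamma` avoids overflow in the Dirichlet normalizing constant when the vocabulary is large.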