<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3403"> <Title>Computational Measures for Language Similarity across Time in Online Communities</Title> <Section position="4" start_page="16" end_page="17" type="metho"> <SectionTitle> 3 The Current Study </SectionTitle> <Paragraph position="0"> In this paper, we examine entrainment among 419 of the 1000 users (those who wrote in English) and among the 15,366 messages they wrote over a six-week period (with participants divided into 20 topic groups, for an average of 20.95 English writers per group). We ask whether the young people's language converges over time in an online community. Is the similarity between the texts produced by the young people greater between adjacent weeks than between more distant weeks? Furthermore, what computational tools can effectively measure trends in similarity over time?</Paragraph> <Section position="1" start_page="16" end_page="17" type="sub_section"> <SectionTitle> 3.1 Hypotheses </SectionTitle> <Paragraph position="0"> To address these questions, we examine change in similarity scores along two dimensions: (1) at the level of the individual; and (2) across the group as a whole. More specifically, we compare similarity between all pairs of individuals in a given topic group over time, and we compare similarity across the entire group at different time periods.</Paragraph> <Paragraph position="1"> We first look at pairwise comparisons between the messages of participants in a particular topic group within a given time period T_k (one week). For every pair of participants in a group, we calculate the similarity between two documents, each comprising all messages written by one participant of the pair during T_k. We then average the scores computed for all topic groups within a time period to obtain a mean pairwise similarity score for T_k. Our first hypothesis is that this average pairwise similarity will increase over time, such that the mean score for each week is greater than that of the preceding week (greater for T_2 than for T_1, greater for T_3 than for T_2, and so on through T_6).</Paragraph> <Paragraph position="2"> For our second set of tests, we compared all messages from a single time period to all messages of a previous time period within a single topic group. Our hypothesis was that temporal proximity would correlate with mean similarity, such that the messages of two adjacent time periods would exhibit more similarity than those of more distant time periods. To examine this, we perform two individual hypothesis tests, where M_k is the document containing all the messages produced in time period T_k, and S(X,Y) is the similarity score for the two documents X and Y.</Paragraph> <Paragraph position="3"> Finally, we posit that SCC, Zipping and LSA will yield similar results for these tests.</Paragraph> </Section> </Section>
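To make the pairwise procedure above concrete, here is a minimal sketch, in Python, of how the weekly mean similarity underlying the first hypothesis could be computed for a single topic group. It assumes per-participant documents have already been assembled for each week, and the similarity argument stands in for any of the three measures introduced in Section 4 (SCC, Zipping, LSA); the function and variable names are illustrative, not taken from the study's code.

# Mean pairwise similarity per week for one topic group (illustrative sketch).
from itertools import combinations
from statistics import mean

def weekly_pairwise_similarity(docs_by_week, similarity):
    # docs_by_week: {week: {participant_id: all messages by that participant that week}}
    # similarity: function mapping two document strings to a similarity score
    scores = {}
    for week, docs in docs_by_week.items():
        pair_scores = [similarity(docs[a], docs[b])
                       for a, b in combinations(sorted(docs), 2)]
        if pair_scores:
            scores[week] = mean(pair_scores)
    return scores

# Hypothesis 1 then corresponds to these weekly means (averaged over all topic
# groups) increasing from week 1 through week 6.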
<Section position="5" start_page="17" end_page="18" type="metho"> <SectionTitle> 4 Method </SectionTitle> <Paragraph position="0"> To prepare the data, we wrote a script to remove the parts of messages that could interfere with computing their similarity, in particular quoted messages and binary attachments, which are common in a corpus of email-like messages. We also removed punctuation and special characters.</Paragraph> <Section position="1" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 4.1 Spearman's Correlation Coefficient </SectionTitle> <Paragraph position="0"> SCC is calculated as in Kilgarriff (2001). First, we compile a list of the words the two documents have in common. The statistic can be calculated on the n most common words, or on all common words (i.e., n = total number of common words).</Paragraph> <Paragraph position="1"> We applied the latter approach, using all the words in common for each document pair. For each document, the n common words are ranked by frequency, with the lowest-frequency word ranked 1 and the highest ranked n. For each common word, d is the difference in rank orders for the word in the two documents. SCC is a normalized sum of the squared differences, SCC = 1 - (6 * sum(d^2)) / (n(n^2 - 1)), where the sum is taken over the n most frequent common words. In the case of ties in rank, where more than one word in a document occurs with the same frequency, the average of the ranks is assigned to the tying words. (For example, if words w1 and w2 occur with the same frequency and are tied for ranks 2 and 3, each is assigned a rank of 2.5.)</Paragraph> </Section> <Section position="2" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 4.2 Zipping </SectionTitle> <Paragraph position="0"> When compressing a document, the resulting compression ratio provides an estimate of the document's entropy. Many compression algorithms generate a dictionary of sequences, based on frequency, that is used to compress the document.</Paragraph> <Paragraph position="1"> One can therefore estimate the similarity between two documents by assessing how well the dictionary generated when compressing one document works when applied to the other. We used GZIP for compression, which employs a combination of the LZ77 algorithm and Huffman coding. We based our approach on the algorithm of Benedetto, Caglioti, and Loreto (2002), where the cross-entropy per character is defined as (length(zip(A+B)) - length(zip(A))) / length(B). Here, A and B are documents; A+B is document B appended to document A; zip(A) is the zipped document; and length(A) is the length of the document. It is important to note that the test document (B) needs to be small enough that it does not cause the dictionary to adapt to the appended piece; Benedetto, Caglioti, and Loreto (2002) refer to this threshold as the crossover length. The more similar the appended portion is to A, the more it will compress, and vice versa. We extended the basic algorithm to handle the extremely varied document sizes found in our data. Our algorithm performs two one-way comparisons and returns the mean score. Each one-way comparison between two documents, A and B, is computed by splitting B into 300-character chunks; for each chunk, we calculate the cross-entropy per character when appending the chunk to A, and the one-way score is the mean over all chunks.</Paragraph> <Paragraph position="2"> We fine-tuned the window size on a small, hand-built corpus of news articles. The differences are slightly more pronounced with larger window sizes, but that trend starts to taper off between window sizes of 300 and 500 characters. In the end we chose 300 as our window size, because it provided sufficient contrast and still yielded a few samples from even the smallest documents in our primary corpus.</Paragraph> </Section>
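The chunked zipping comparison lends itself to a short sketch. The Python version below uses the standard gzip module (the same DEFLATE combination of LZ77 and Huffman coding) and follows the description above: 300-character chunks, cross-entropy per character for each chunk, and the mean of the two one-way scores. It is an illustrative reconstruction under those assumptions, not the authors' original script, and it returns a cross-entropy, so lower values mean more similar documents.

import gzip

CHUNK_SIZE = 300  # window size chosen in Section 4.2

def zipped_length(text):
    # Length in bytes of the compressed text.
    return len(gzip.compress(text.encode("utf-8")))

def one_way_score(a, b):
    # Mean cross-entropy per character of 300-character chunks of b appended to a.
    # Assumes both documents are non-empty strings.
    base = zipped_length(a)
    chunks = [b[i:i + CHUNK_SIZE] for i in range(0, len(b), CHUNK_SIZE)]
    entropies = [(zipped_length(a + c) - base) / len(c) for c in chunks]
    return sum(entropies) / len(entropies)

def zipping_score(a, b):
    # Mean of the two one-way comparisons between documents a and b.
    return (one_way_score(a, b) + one_way_score(b, a)) / 2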
<Section position="3" start_page="17" end_page="18" type="sub_section"> <SectionTitle> 4.3 Latent Semantic Analysis (LSA) </SectionTitle> <Paragraph position="0"> For a third approach, we used LSA to analyze the semantic similarity between messages across different periods of time. We explored three implementations of LSA: (a) the traditional algorithm described by Foltz et al. (1998) with one semantic space per topic group, (b) the same algorithm but with one semantic space for all topic groups, and (c) an implementation based on Word Space (Schütze, 1993) called Infomap. All three were tested with several settings, such as variations in the number of dimensions and in the handling of stop words, and all three produced similar results. For this paper, we present the Infomap results, due to its wide acceptance among scholars as a successful implementation of LSA.</Paragraph> <Paragraph position="1"> To account for nuances of the lexicon used in the Junior Summit data, we built a semantic space from a subset of this data comprising 7000 small messages (under 1 KB each), using 100 dimensions and without removing stop words. We then built vectors for each document and compared them using cosine similarity (Landauer, Foltz, & Laham, 1998).</Paragraph> </Section> </Section> </Paper>
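As a rough illustration of the vector comparison described in Section 4.3, the sketch below builds a reduced semantic space with scikit-learn and compares two documents by cosine similarity. TruncatedSVD over a term-document count matrix is used here as a stand-in for the Infomap/Word Space implementation actually used in the study; apart from the 100 dimensions and the retention of stop words, the names and parameters are illustrative assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def build_semantic_space(training_messages, n_dims=100):
    # Fit a term-document model and a 100-dimensional reduction on a training
    # subset (e.g., a collection of small messages); stop words are kept.
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(training_messages)
    svd = TruncatedSVD(n_components=n_dims).fit(counts)
    return vectorizer, svd

def document_similarity(doc_a, doc_b, vectorizer, svd):
    # Project both documents into the reduced space and return their cosine similarity.
    vectors = svd.transform(vectorizer.transform([doc_a, doc_b]))
    return float(cosine_similarity(vectors[:1], vectors[1:])[0, 0])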