File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/06/w06-3403_concl.xml
Size: 5,702 bytes
Last Modified: 2025-10-06 13:55:46
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3403"> <Title>Computational Measures for Language Similarity across Time in Online Communities</Title> <Section position="7" start_page="19" end_page="20" type="concl"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> This work presents several novel contributions to the analysis of text-based messages in online communities. Using three separate measures, Spearman's Correlation Coefficient, Zipping and Latent Semantic Analysis, we found that members of an online community diverge over time in the language they use. More specifically, a comparison of the words contributed by any pair of users in a particular topic group shows increasing dissimilarity over the six-week period.</Paragraph> <Paragraph position="1"> This finding seems counter-intuitive given work in linguistics and psychology, which shows that dyads and communities converge, entrain and echo each other's lexical choices and communication styles. Similarly, our own temporal proximity results appear to indicate convergence, since closer time periods are more similar than more distant ones. Finally, previous hand-coding of these data revealed convergence, for example between boys and girls in the use of emotion words, and between older and younger children in talk about the future (Cassell &amp; Tversky, 2005). So we ask: why do our tools show this divergent trend? We believe that one answer comes from the fact that, while the young people may be discussing a more restricted range of topics, they are contributing a wider variety of vocabulary. To examine whether there were indeed more unique words over time, we first manually compared word frequencies across time periods and found that, on the contrary, there are consistently fewer unique words by T , which suggests convergence.</Paragraph> <Paragraph position="2"> However, there are also fewer and fewer total words by the end of the forum. 
This is due to the number of participants who left the forum after they were not elected to go to Boston. If we divide the number of unique words by the total number of words, we find that the ratio of unique words consistently increases over time (see Figure 4). It is likely that this ratio contributes to our results of divergence.</Paragraph> <Paragraph position="3"> In order to further examine the role of increasing vocabulary in the Junior Summit as a whole, we also created several control groups composed of random pairs of users (i.e., users that had never written to each other), and measured their pairwise similarity across time. The results were similar to those of the experimental groups, demonstrating a slope with roughly the same shape. This suggests that convergence and divergence are affected by something at a broader, community level, such as an increase in vocabulary.</Paragraph> <Paragraph position="4"> This result is interesting for an additional reason. Some users, perhaps particularly non-native speakers or younger adolescents, may be learning new vocabulary from other speakers, which they begin to introduce at later time periods. An increasingly diversified vocabulary could conceivably result in differences in word frequency among speakers. This leads us to some key questions: to what extent does the language of individuals change over time? Is individual language influenced by the language of the community? This is the heart of entrainment.</Paragraph> <Paragraph position="5"> In conclusion, we have shown that SCC, Zipping and LSA can be used to assess message similarity over time, although they may be somewhat blunt instruments for our purposes. In addition, while Zipping is somewhat contentious and not as widely accepted as SCC or LSA, we found that the three tools provide very similar results. 
This is particularly interesting given that, while all three methods take into account word or word-sequence frequencies, LSA is designed to also take into account aspects of semantics beyond the surface level of lexical form.</Paragraph> <Paragraph position="6"> All in all, these tools not only contribute to ways of measuring similarity across documents, but can also be used to measure shorter texts, such as online messages or emails. Most importantly, these tools remind us how complex and dynamic everyday language really is, and how much this complexity must be taken into account when building computational tools for the analysis of text and conversation.</Paragraph> <Section position="1" start_page="20" end_page="20" type="sub_section"> <SectionTitle> 6.1 Future Directions </SectionTitle> <Paragraph position="0"> In future work, we intend to find ways to compare the results obtained from different topic groups and also to examine differences among individual users, including re-running our analyses after removing outliers. We also hope to explore the interplay between individuals and the community and changes in language similarity. In other words, can we find those individuals who may be acquiring new vocabulary? Are there &quot;language leaders&quot; responsible for language change online? We also plan to analyze words in terms of their local contexts, to see whether these contexts change over time and how that affects our results. Furthermore, we intend to go beyond word frequency to classify topic changes over time to get a better understanding of the dynamics of the groups (Kaufmann, 1999).</Paragraph> <Paragraph position="1"> Finally, as we have done in the past with our analyses of this dataset, we would like to perform hand-coded, human content analysis on a percentage of the data to check the reliability of these statistical methods.</Paragraph> </Section> </Section> </Paper>