<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0310"> <Title>Computation of Word Associations Based on the Co-Occurrences of Words in Large Corpora</Title> <Section position="4" start_page="84" end_page="85" type="metho"> <SectionTitle> 3 Association norms used </SectionTitle> <Paragraph position="0"> For the comparison between the predicted associations and those of human subjects we have used the association norms collected by Russell & Jenkins (Jenkins, 1970). They have the advantage that translations of the stimulus words were also given to German subjects (Russell & Meseck, 1959, and Russell, 1970), so that our model could be tested for English as well as for German.</Paragraph> <Paragraph position="1"> The Russell & Jenkins association norms, also referred to as the Minnesota word association norms, were collected in 1952. The 100 stimulus words from the Kent-Rosanoff word association test (Kent & Rosanoff, 1910) were presented to 1008 students of two large introductory psychology classes at the University of Minnesota. The subjects were instructed to write after each word &quot;the first word that it makes you think of&quot;. Seven years later, Russell & Meseck (1959) repeated the same experiment in Germany with a carefully translated list of the stimulus words. The subjects were 331 students and pupils from the area near Würzburg.</Paragraph> <Paragraph position="2"> The quantitative results reported later are based on comparisons with these norms.</Paragraph> <Paragraph position="3"> The American as well as the German association norms were collected more than 30 years ago. The texts which were used to simulate these associations are more recent. One might therefore expect this discrepancy to impair the agreement between the observed and the predicted responses.
Better predictions might be attained if the observed associations had been produced by the same subjects as the texts from which the predictions are computed.</Paragraph> <Paragraph position="4"> However, such a procedure is hardly realizable, and our results will show that, despite these discrepancies, associations to common words can be predicted successfully.</Paragraph> </Section> <Section position="5" start_page="85" end_page="86" type="metho"> <SectionTitle> 4 Text corpora </SectionTitle> <Paragraph position="0"> In order to get reliable estimates of the co-occurrences of words, large text corpora have to be used. Since associations of the &quot;average subject&quot; are to be simulated, the texts should not be specific to a certain domain, but should reflect the wide distribution of different types of texts and speech as perceived in everyday life.</Paragraph> <Paragraph position="1"> The following selection of some 33 million words of machine-readable English texts used in this study is a modest attempt to achieve this goal:
* Brown corpus of present-day American English (1 million words)
* LOB corpus of present-day British English (1 million words)
* Belletristic literature from Project Gutenberg (1 million words)
* Articles from the New Scientist from the Oxford Text Archive (1 million words)
* Wall Street Journal from the ACL/DCI (selection of 6 million words)
* Hansard Corpus.
Proceedings of the Canadian Parliament (selection of 5 million words from the ACL/DCI corpus)
* Grolier's Electronic Encyclopedia (8 million words)
* Psychological Abstracts from PsycLIT (selection of 3.5 million words)
* Agricultural abstracts from the Agricola database (3.5 million words)
* DOE scientific abstracts from the ACL/DCI (selection of 3 million words)
To compute associations for German, the following corpora comprising about 21 million words were used:
* LIMAS corpus of present-day written German (1.1 million words)
* Freiburger Korpus from the Institute for German Language (IDS), Mannheim (0.5 million words of spoken German)
* Mannheimer Korpus 1 from the IDS (2.2 million words of present-day written German from books and periodicals)
* Handbuchkorpora 85, 86 and 87 from the IDS (9.3 million words of newspaper texts)
* German abstracts from the psychological database PSYNDEX (8 million words)
For technical reasons, not all words occurring in the corpora have been used in the simulation. The vocabulary used consists of all words which appear more than ten times in the English or German corpus. It also includes all 100 stimulus words and all responses in the English or German association norms. This leads to an English vocabulary of about 72,000 words and a German vocabulary of about 65,000 words. Here, a word is defined as a string of alphabetic characters separated by non-alphabetic characters. Punctuation marks and special characters are treated as words.</Paragraph> </Section> <Section position="6" start_page="86" end_page="86" type="metho"> <SectionTitle> 5 Computation of the association strengths </SectionTitle> <Paragraph position="0"> The text corpora were read word by word. Whenever one of the 100 stimulus words occurred, it was determined which other words occurred within a distance of twelve words to the left or to the right of the stimulus word, and for every such pair a counter was updated.
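The counting procedure just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ±12-word window, the alpha-character definition of a word, and the counted quantities H(i&j) and H(i) follow the text, while the function names and the toy sentence are our own.

```python
import re
from collections import Counter

def tokenize(text):
    # Per the text: a word is a maximal run of alphabetic characters;
    # punctuation marks and special characters are treated as words too.
    return re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text.lower())

def count_cooccurrences(tokens, stimuli, window=12):
    """For each stimulus word i, count how often every other word j
    occurs within `window` positions to its left or right."""
    pair_counts = Counter()        # H(i & j): co-occurrence frequencies
    word_counts = Counter(tokens)  # H(i): single-word frequencies
    for pos, word in enumerate(tokens):
        if word in stimuli:
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for j in range(lo, hi):
                if j != pos:
                    pair_counts[(word, tokens[j])] += 1
    # In the study, the vocabulary was then restricted to words occurring
    # more than ten times, plus all stimulus and response words.
    return pair_counts, word_counts

text = "the quick fox saw the lazy dog near the quick river"
tokens = tokenize(text)
pairs, words = count_cooccurrences(tokens, stimuli={"fox"})
```

In this toy example the window covers the whole sentence, so every other token co-occurs with the stimulus "fox"; on a real corpus the tables H(i&j), H(i), and the corpus size would then feed the relative-frequency estimates of formula (5).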
The co-occurrence frequencies H(i&j) so defined, the frequencies of the single words H(i), and the total number of words in the corpus were stored in tables. Using these tables, the probabilities in formula (4) can be replaced by relative frequencies:</Paragraph> <Paragraph position="2"> In this formula the first term on the right side does not depend on j and therefore has no effect on the prediction of the associative response. With H(j) in the denominator of the second term, estimation errors have a strong impact on the association strengths for rare words. Therefore, by modifying formula (5), words with low corpus frequencies had to be de-emphasized.</Paragraph> <Paragraph position="4"> According to our model, the word j with the highest associative strength to the stimulus word i should be the associative response. The best results were observed when parameter α was chosen to be 0.66. Parameters β and γ turned out to be relatively uncritical, and to simplify parameter optimization were therefore both set to the same value of 0.00002.</Paragraph> <Paragraph position="5"> Ongoing research shows that formula (6) has a number of weaknesses, for example that it does not discriminate between words with co-occurrence frequency zero, as discussed by Gale & Church (1990) in a comparable context. However, since the results reported later are acceptable, it probably gets the major issues right. One is that subjects usually respond with common, i.e. frequent, words in the free association task. The other is that estimates of co-occurrence frequencies for low-frequency words are too poor to be useful.</Paragraph> </Section> </Paper>