<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0902"> <Title>Comparing corpora with WordSmith Tools: How large must the reference corpus be?</Title> <Section position="4" start_page="0" end_page="8" type="intro"> <SectionTitle> 2 Methodology </SectionTitle> <Paragraph position="0"> In order to answer this question, the following English corpora were used: * Corpus of job application letters, taken from the DIRECT Corpus 2 .</Paragraph> <Paragraph position="1"> * Corpus of newspaper editorials, from the Brown Corpus ('B' subcorpus). * Corpus of newspaper reviews, from the Brown Corpus ('C' subcorpus).</Paragraph> <Paragraph position="2"> * Corpus of mystery fiction, from the Brown Corpus ('L' subcorpus).</Paragraph> <Paragraph position="3"> * Corpus of science fiction, from the Brown Corpus ('M' subcorpus).</Paragraph> <Paragraph position="4"> The reference corpora were compiled from texts published in The Guardian. Newspaper text was chosen because it is the most typical kind of reference corpus used by applied linguists, mainly because it is easy to obtain. Therefore, the results obtained here would be relevant to the typical user of KeyWords. The Guardian specifically was chosen because Mike Scott, the author of WordSmith Tools, makes available on his website a word list of 95 million tokens of The Guardian text. This has become a popular choice for several WordSmith Tools users investigating English keywords. Once again, it was hoped that by using The Guardian, the investigation would mirror a typical choice of WordSmith users. For the present study, a portion of The Guardian word list was used, drawn at random from texts published in 1994.</Paragraph> <Paragraph position="5"> The size of the reference corpora varied according to the size of the study corpora. For each study corpus, 18 reference corpora were created.
Each one was n times the size of the study corpus, with n being 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100. For instance, the letters corpus had 11,761 tokens, and so for n=2 the size of the reference corpus was 23,522 tokens (11,761 x 2); for n=3, it was 35,283 tokens (11,761 x 3); for n=4, 47,044 tokens; and so on, up to n=100, for which the reference corpus had 1,176,100 tokens.</Paragraph> <Paragraph position="6"> The KeyWords settings used for the comparisons were as follows: The table below shows the size of all of the reference corpora used in the study:</Paragraph> </Section> </Paper>