<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0902">
  <Title>Comparing corpora with WordSmith Tools: How large must the reference corpus be?</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> WordSmith Tools (Scott, 1998) offers a program for comparing corpora, known as KeyWords. KeyWords compares a word list extracted from what has been called 'the study corpus' (the corpus which the researcher is interested in describing) with a word list made from a reference corpus. The only requirement for a word list to be accepted as a reference corpus by the software is that it must be larger than the study corpus. One of the most pressing questions with respect to using KeyWords seems to be what the ideal size of a reference corpus would be. The aim of this paper is thus to propose answers to this question. Five English corpora were compared to reference corpora of various sizes (varying from two to 100 times larger than the study corpus).</Paragraph>
    <Paragraph position="1"> The results indicate that a reference corpus that is five times as large as the study corpus yielded a larger number of keywords than a smaller reference corpus. Reference corpora larger than five times the size of the study corpus yielded similar numbers of keywords. The implication is that, for WordSmith Tools KeyWords analysis, a larger reference corpus is not always better than a smaller one, while a reference corpus that is less than five times the size of the study corpus may not be reliable. There seems to be no need for extremely large reference corpora, given that the number of keywords yielded does not seem to change when using corpora larger than five times the size of the study corpus.</Paragraph>
    <Paragraph position="2"> Introduction WordSmith Tools (Scott, 1998) offers a program for comparing corpora, known as KeyWords. This tool has been used in several studies as a means of describing various lexico-grammatical characteristics of different genres (Barbara and Scott, 1999; Batista, 1998; Berber Sardinha, 1995, 1999a, b; Berber Sardinha and  1998). The keywords identified by the program are not necessarily the 'most important words' in the corpus (Scott, 1997), or those that correspond to readers' intuitions as to what the topics of the texts are. It is generally thought that a set of WordSmith Tools keywords indicates 'aboutness' (Phillips, 1989).</Paragraph>
    <Paragraph position="3"> KeyWords compares a word list extracted from what has been called 'the study corpus' (the corpus which the researcher is interested in describing) with a word list made from a reference corpus. The result is a list of keywords, that is, words whose frequencies are statistically higher in the study corpus than in the reference corpus. The software also identifies words whose frequencies are statistically lower in the study corpus, which are called 'negative keywords', in contrast to positive keywords, which have higher frequencies in the study corpus. Negative keywords, though, will not be discussed in the present paper. Hence, whenever 'keyword' is mentioned in this paper, it will mean 'positive keyword'.</Paragraph>
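    <Paragraph> The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the WordSmith Tools implementation: the text here does not say which statistic KeyWords applies, so the sketch assumes a log-likelihood score, and the names log_likelihood and positive_keywords, the threshold of 6.63 (roughly p = 0.01 at one degree of freedom) and the toy word lists are all hypothetical. <![CDATA[

```python
import math
from collections import Counter


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood (G2) score for one word observed in two corpora.

    Assumption: log-likelihood is used purely for illustration here.
    """
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Expected counts if the word were equally likely in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    g2 = 0.0
    # A zero observed count contributes nothing to the sum.
    if freq_study:
        g2 += freq_study * math.log(freq_study / expected_study)
    if freq_ref:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * g2


def positive_keywords(study, reference, threshold=6.63):
    """Return (word, score) pairs that are unusually frequent in the study list.

    The threshold is a hypothetical cut-off, not a value taken from the paper.
    """
    size_study = sum(study.values())
    size_ref = sum(reference.values())
    found = []
    for word, freq_study in study.items():
        freq_ref = reference.get(word, 0)
        # Positive keyword: relative frequency is higher in the study corpus.
        if freq_study / size_study > freq_ref / size_ref:
            score = log_likelihood(freq_study, size_study, freq_ref, size_ref)
            if score >= threshold:
                found.append((word, score))
    # Highest-scoring keywords first.
    return sorted(found, key=lambda pair: pair[1], reverse=True)


# Toy word lists: a 100-token study corpus and a 1,000-token reference corpus.
study = Counter({"corpus": 30, "the": 60, "of": 10})
reference = Counter({"corpus": 5, "the": 600, "of": 395})
keywords = positive_keywords(study, reference)
```

]]> In this toy example only 'corpus' is returned: 'the' has the same relative frequency in both lists, and 'of' is relatively less frequent in the study list, so it would count as a negative keyword rather than a positive one.</Paragraph>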
    <Paragraph position="4"> The only requirement for a word list to be accepted as a reference corpus by the software is that it must be larger than the study corpus. Thus, the composition and length of KeyWord lists can vary according to at least six parameters:  Since WordSmith Tools is Windows software, it has appealed to a large audience of applied linguists willing to do corpus-based research, for whom this platform is generally the only one they know how to use. To them, one of the most pressing questions with respect to using KeyWords seems to be what the ideal size of a reference corpus would be. The aim of this paper is thus to propose answers to this question.</Paragraph>
  </Section>
</Paper>