<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0122">
  <Title>Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora.</Title>
  <Section position="2" start_page="0" end_page="231" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> How similar are two corpora? The question arises on many occasions. Does it matter whether lexicographers use this corpus or that, or are they similar enough for it to make no difference? (The original impetus for the research was the question, &quot;are the fiction and journalism parts of the Longman Lancaster Corpus and British National Corpus (BNC) interchangeable?&quot;) In NLP, many useful results can be generated from corpora, but when can the results developed using one corpus be applied to another? There are also questions of more general interest. Looking at British national newspapers: is the Independent more like the Guardian or the Telegraph? There are many ways in which the question could be addressed, but the approach we take here is to take texts from each newspaper and compare the frequencies of words used. Given an accurately part-of-speech-tagged or parsed corpus, the same method could be applied to frequency lists of parts-of-speech or syntactic constructions, and the methodological part of the paper would still be salient. Section 2 presents the case for using word frequencies.</Paragraph>
    <Paragraph position="1"> How homogeneous is a corpus? The question is of interest both in its own right and as a preliminary to any quantitative approach to corpus similarity. It is of interest in its own right because a sublanguage corpus, or one containing only a specific language variety, has very different characteristics to a general corpus (Biber, 1993), yet it is not obvious how a corpus's position on this scale can be assessed. It is of interest as a preliminary to a measure of corpus similarity because it is unclear what such a measure would mean if a homogeneous corpus were being compared with a heterogeneous one. For the statistical language modelling community, the preferred approach to assessing homogeneity is by calculating perplexity, and the approach can be extended to measuring similarity by calculating cross-entropy (Charniak, 1993). In section 6 we compare these methods with the approach developed here.</Paragraph>
    <Paragraph position="2"> In this paper we present a method for measuring both corpus similarity and corpus homogeneity. In brief, the method (for the homogeneity case) is as follows:  * Divide the corpus into two halves by randomly placing texts in one of two subcorpora; * Produce a word frequency list for each subcorpus; * Calculate the χ² statistic for the difference between the two subcorpora; * Normalise; * Iterate (to give different random halves); * Interpret result by comparing values for different corpora.</Paragraph>
    <Paragraph position="3">  The only differences for the corpus-similarity case are that (1) one subcorpus is taken from the first corpus and the other from the second, and (2) the similarity value is to be interpreted by reference to the homogeneity measure for each corpus.</Paragraph>
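The steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: the function names, whitespace tokenisation, the `top_n` cut-off, the number of iterations, and in particular the normalisation (dividing by the number of words compared) are all assumptions, since this section does not specify them.

```python
import random
from collections import Counter


def word_freqs(texts):
    """Word frequency list for a subcorpus (assumed: lowercase, whitespace tokens)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def chi_square(freq_a, freq_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of the joint list."""
    size_a, size_b = sum(freq_a.values()), sum(freq_b.values())
    if size_a == 0 or size_b == 0:
        raise ValueError("both subcorpora must contain at least one word")
    total = size_a + size_b
    stat = 0.0
    for word, count in (freq_a + freq_b).most_common(top_n):
        for observed, size in ((freq_a[word], size_a), (freq_b[word], size_b)):
            expected = count * size / total  # expected count if the halves did not differ
            stat += (observed - expected) ** 2 / expected
    return stat


def normalised_chi_square(texts_a, texts_b, top_n=500):
    """Assumed normalisation: divide by the number of words compared."""
    freq_a, freq_b = word_freqs(texts_a), word_freqs(texts_b)
    n_words = min(top_n, len(freq_a + freq_b))
    return chi_square(freq_a, freq_b, top_n) / n_words


def homogeneity(texts, iterations=10, top_n=500, seed=0):
    """Mean normalised chi-square over repeated random halvings of one corpus."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        shuffled = list(texts)
        rng.shuffle(shuffled)  # random assignment of texts to the two halves
        half = len(shuffled) // 2
        scores.append(normalised_chi_square(shuffled[:half], shuffled[half:], top_n))
    return sum(scores) / len(scores)


def similarity(texts_1, texts_2, iterations=10, top_n=500, seed=0):
    """As homogeneity, but one subcorpus is drawn from each corpus."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        sub_1 = rng.sample(list(texts_1), len(texts_1) // 2)
        sub_2 = rng.sample(list(texts_2), len(texts_2) // 2)
        scores.append(normalised_chi_square(sub_1, sub_2, top_n))
    return sum(scores) / len(scores)
```

Under these assumptions, a lower score indicates more similar word frequency lists. A call such as `similarity(corpus_1, corpus_2)` would then be read against `homogeneity(corpus_1)` and `homogeneity(corpus_2)`, in line with point (2) above.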
    <Paragraph position="4"> After arguing the case for using word frequency lists and describing related work, the paper describes the various pitfalls the measure must avoid and presents some first results.</Paragraph>
    <Paragraph position="6"/>
  </Section>
</Paper>