<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0122">
  <Title>Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora.</Title>
  <Section position="3" start_page="231" end_page="231" type="metho">
    <SectionTitle>
2 Why word frequency lists?
</SectionTitle>
    <Paragraph position="0"> Are word frequency lists interesting? Many think not. There are two recurring themes amongst the noes. Firstly, what is important about texts is their meaning. Since the message is thrown out when a text is reduced to a frequency list, the heart of the text is jettisoned. This argument comes from all quarters; the second comes mainly from linguists. It is that, if we are to count, the objects we should be counting are ones with a linguistic pedigree. In relation to content, we should be counting word senses, or lexical units, since any list will be compromised if money bank and river bank are counted together. In relation to form, we should be counting grammatical constructions: numbers of relative clauses or passives tell us far more about the linguistic character of a text than numbers of occurrences of who or which. Taking the general argument first, firstly, a text without its context is itself an abstraction. A transcript of a conversation is a more concise version than an audio tape (which is itself more concise than a video tape). A newspaper article is understood more fully if the reader is well-versed in the political or other circumstances of its publication. There is not a complete break between texts, which present meaning, and frequency lists, which do not. Secondly, our concern is with language corpora, not with texts. While a text may be coherent in its meaning, a corpus comprising multiple texts can scarcely be. The objective in gathering multiple-text corpora is to identify a linguistic object in which the individual meanings of texts are taken out of focus, to be replaced by the character of the whole. Thirdly, as will be evident to all workers in corpus-based computational linguistics, frequency lists are very useful representations of meaning for information retrieval, text categorisation, and numerous other purposes. They are useful because they are a representation of the text which is susceptible to automatic, objective manipulation. The full text is very rich in information, but</Paragraph>
    <Paragraph position="2"> that information cannot be readily used to make, e.g., similarity judgements. When a text or corpus is represented as a frequency list, much information is lost, but the tradeoff is an object that is susceptible to statistical processing.</Paragraph>
    <Paragraph position="3"> To move on to the concerns regarding what is counted: in exploring word frequency lists we are also investigating a hypothesis. Sinclair has postulated &amp;quot;Every distinct sense of a word is associated with a distinction in form&amp;quot;.2 We take this one step further and postulate &amp;quot;no linguistic distinction without a word frequency distinction&amp;quot;: any difference in the linguistic character of two corpora will leave its trace in a difference between their word frequency lists. It may not be evident which words will be more frequent, and which less, if one corpus uses more relative clauses and fewer passives than another, but, on this hypothesis, some will be.</Paragraph>
    <Paragraph position="4"> An advantage of using word frequency lists is that there is so much data: two corpora can be compared in respect of thousands of data points (e.g., words). Although money bank and river bank are counted together, corpora using the one and corpora using the other will tend to be discriminated because the one corpus will use money, account and Barclays more, the other, river and grassy. It is a research question to determine which words' frequencies vary for a given variation in linguistic structures (see the section on newspapers for an indication of how this can proceed). For current purposes, we can happily pool the data, referring only to individual words when we seek further insight into why we get the results we do. Biber's work (see below) shows how quantitative methods can be used to discover and capture register differences, and some of the objects he counts are words (others being grammatical constructions), so his work provides some grounds for optimism.</Paragraph>
    <Paragraph position="5"> The hypothesis would of course be very useful, if true. Words are far easier to count accurately than syntactic categories or word senses. To count syntactic categories requires linguistic theory to identify precisely what the syntactic category is; empirical research to identify the features that indicate where it is present; and a computer program, to automatically identify occurrences. The first two stages are likely to introduce theoretical disagreements, and the last two, errors. The prospects for two independent teams arriving at the same syntactic-construction frequency list for the same corpus are slim. By contrast, if agreement is reached on a few tokenisation issues (hyphens, clitics), the chances of two groups arriving at identical word frequency lists are very good.3 The rule that any string of alphanumerics surrounded by whitespace or punctuation is a word may have its shortcomings, but it makes word-counting very reliable.</Paragraph>
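    A minimal sketch of that word-counting rule, assuming plain-text input; the function name, the regular expression and the lower-casing choice are illustrative assumptions, not taken from the paper:

        import re
        from collections import Counter

        def word_frequency_list(text):
            # A 'word' is any maximal string of alphanumerics; whitespace and
            # punctuation act only as separators. Hyphens and clitics are thus
            # split apart, which is one of the tokenisation issues noted above.
            return Counter(re.findall(r"[0-9A-Za-z]+", text.lower()))

        # Usage: word_frequency_list(open("corpus.txt").read()).most_common(20)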
    <Paragraph position="6"> Word frequency lists are cheap and easy to generate, so a measure of corpus similarity based on them would be of use as a quick guide in many circumstances where a more extensive analysis of the two corpora was not viable; for example, to judge how a newly available corpus related to existing resources, so a decision about buying it or installing it could be made, or as a preliminary assessment of how much customisation was likely to be necessary to port an NLP application from one domain (and corpus) to another.</Paragraph>
  </Section>
  <Section position="4" start_page="231" end_page="237" type="metho">
    <SectionTitle>
3 Related Work
</SectionTitle>
    <Paragraph position="0"> The only other piece of work the author has found which aims to measure similarity between corpora asks which genres, within a corpus, most resemble each other. The authors take the 89 most common words in the corpus, find their rank within each genre, and calculate a Spearman rank correlation statistic. This method is compared empirically with the χ² method in some detail in section 6 below. There is a large body of work aiming to find words which are particularly characteristic of one text, or corpus, in contrast to another.4 This includes work on linguistic variation, author identification (Mosteller and Wallace, 1964) and information retrieval (Salton, 1989). (Dunning, 1993) and (Pedersen, 1996) show how some of the methods which have been used in the past (particularly mutual information scores) are invalid for rare events, and introduce accurate measures of how 'surprising' rare events are. (Church and Gale, 1995a) show how Inverse Document Frequency, a measure based on the proportion of documents a word occurs in, can be used alongside word frequency to estimate how distinctive a word is of the texts it occurs in. (Church and Gale, 1995b) extend this work, showing how to model word frequency distributions in a manner consistent with the fact that some words are evenly spread, while others tend to occur often in documents where they occur at all. As most of this work aims to find good indexing terms for information retrieval, it is mostly concerned with middle-to-low frequency items, and differences in topic rather than differences in register.</Paragraph>
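    For illustration, a sketch of this rank-correlation idea applied to a pair of corpora (not the cited authors' own code): the most common words of the joint corpus are ranked within each corpus and the Spearman correlation of the two rankings is taken as the similarity score. The cut-off of 89 words follows the description above; the function names are made up here:

        from collections import Counter
        from scipy.stats import spearmanr

        def rank_correlation_similarity(freq1, freq2, n=89):
            # Most common words in the joint corpus.
            words = [w for w, _ in (Counter(freq1) + Counter(freq2)).most_common(n)]
            def ranks(freq):
                # Rank the chosen words within one corpus (1 = most frequent there).
                ordered = sorted(words, key=lambda w: -freq.get(w, 0))
                return {w: r for r, w in enumerate(ordered, start=1)}
            r1, r2 = ranks(freq1), ranks(freq2)
            rho, _ = spearmanr([r1[w] for w in words], [r2[w] for w in words])
            return rho  # close to 1 = very similar rankings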
    <Paragraph position="1"> There is a growing body of work which explores and quantifies the differences between corpora.</Paragraph>
    <Paragraph position="2"> Pre-eminent in this field is Biber (Biber, 1988; Biber, 1995), in whose studies the objective is to identify the major dimensions of linguistic variation across languages, and to identify the linguistic and functional characteristics which co-occur in the different registers of a language. His method involves counting a range of linguistic features in each text, and then using factor analysis to determine which of the features co-occur. Co-occurring features are then grouped together to give the dimensions of variation, and the texts (or corpora) of different registers can be identified by their location with respect to these dimensions.</Paragraph>
    <Paragraph position="3"> A recent paper by (Sekine, 1997) explores the domain dependence of parsing. He parses corpora of various text genres, identifies the subtrees of depth 1 in each corpus, and counts the number of occurrences of each subtree. This gives him a subtree frequency list for each corpus, and he is then able to investigate which subtrees are markedly different in frequency between corpora.</Paragraph>
    <Paragraph position="4"> Such work is highly salient for customizing parsers for particular domains. In the current context, Sekine's subtree frequency lists can readily be compared with word frequency lists to determine which lists are better for measuring corpus similarity and homogeneity.</Paragraph>
    <Paragraph position="5"> Within the literature on statistical language modelling, there is much discussion of related questions. From an information-theoretic point of view, the theoretical answer to the problem is simple: entropy is a measure of a corpus's homogeneity, and the cross-entropy between two corpora quantifies their similarity. Entropy is not a quantity that can be directly measured. The standard problem for statistical language modelling is to find the model for which the cross-entropy of the model for the corpus is as low as possible. For a perfect language model, the cross-entropy would be the entropy of the corpus (Church and Mercer, 1993; Charniak, 1993).</Paragraph>
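    As a concrete illustration of that information-theoretic view, the following minimal sketch estimates a unigram model from one corpus and computes its cross-entropy on another. The unigram assumption and the add-one smoothing (needed so that unseen words do not give infinite values) are choices made here for the sketch, not choices made in the paper:

        import math
        from collections import Counter

        def unigram_cross_entropy(train_tokens, test_tokens):
            # Unigram model estimated from the training corpus, add-one smoothed.
            counts = Counter(train_tokens)
            vocab = set(train_tokens) | set(test_tokens)
            total = sum(counts.values()) + len(vocab)
            # Cross-entropy in bits per token of the test corpus under this model;
            # lower values indicate that the two corpora are more alike.
            return -sum(math.log2((counts[w] + 1) / total) for w in test_tokens) / len(test_tokens)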
    <Paragraph position="6"> The potential for using information-theoretic constructs to measure corpus similarity is a topic for current research. The Known Similarity Corpora evaluation methodology presented in Section 6 will be applicable to the issue of assessing how well cross-entropy captures pre-theoretical notions of corpus similarity and homogeneity.</Paragraph>
    <Paragraph position="8"> A corpus is a collection of texts. The definition only serves to show how heterogeneous a collection of objects the word denotes. One may contain hundreds of words, another, hundreds of millions.</Paragraph>
    <Paragraph position="9"> One may include a very small number of texts, with a one-text corpus as the limiting case; another may contain thousands of texts.5 These factors present problems for a measure of corpus similarity. It is not clear what, if anything, a measure of the similarity of a thousand-word corpus and a million-word corpus, or a one-text corpus and a thousand-text corpus, would mean. Also, most corpora contain some texts that are much bigger than others. Thus, in the BNC, the shortest file (which approximates to a 'text') contains 25 words, and the longest, a hundred thousand times that many. Two corpora of the same size and the same number of texts may still have a very different shape, if, in one, one of the texts accounts for most of the corpus, whereas in the other, they are all of similar size.</Paragraph>
    <Paragraph position="10"> Like a corpus, a text can be large or small, heterogeneous or uniform. A corpus can contain complete texts or sampled texts, as in the Brown corpus.</Paragraph>
    <Paragraph position="11"> How homogeneous is a corpus? The first point to make is that there is no obvious way to approach the question. It is clear that the British National Corpus is less homogeneous than a corpus of software manuals, but it is not clear how to measure the difference. The second is that it is very similar to the question, &amp;quot;how similar are two corpora?&amp;quot; Our approach to measuring homogeneity is to divide a corpus into two random halves and measure the similarity of the two halves, thus emphasising the relation between the two questions. The third point is that a measure of homogeneity is a pre-requisite to a measure of corpus similarity: a judgement of similarity runs the risk of meaninglessness if a homogeneous corpus is compared with a heterogeneous one.</Paragraph>
    <Paragraph position="12"> Our method provides figures which can be directly compared for corpus homo(/hetero)geneity and for corpus (dis)similarity (high scores correspond to heterogeneous corpora and to dissimilar corpora). The possible outcomes, for various permutations of the scores for homogeneity of corpus 1 (corp1), homogeneity of corpus 2 (corp2), and corpus dissimilarity (dis), are presented in Table 1.

      corp1   corp2   dis            Comment
      equal   equal   equal          same language variety/ies
      equal   equal   much higher    different language varieties
      high    low     high           corp2 is homogeneous and falls within the range of 'general' corp1
      high    low     higher         corp2 is homogeneous and falls outside the range of 'general' corp1
      high    high    low            impossible
      high    high    a bit higher   overlapping; share some varieties
      low     low     a bit higher   similar varieties

      Table 1: Possible outcomes; similarity scores are interpreted with respect to homogeneity.</Paragraph>
    <Paragraph position="13"> The last two lines in the table point to the differences between general corpora and specific corpora. High scores for heterogeneity will be for general corpora, which embrace a number of language varieties. Footnote 5: A corpus may contain texts in different languages: here, we only consider corpora which are essentially all in the same language.</Paragraph>
    <Paragraph position="14"> Corpus similarity between general corpora will be a matter of whether all the same language varieties are represented in each corpus, and in what proportions. Low heterogeneity scores will typically relate to corpora of a single language variety, so here, similarity scores may be interpreted as a measure of the distance between the two varieties. From the point of view of measuring corpus homogeneity or similarity, it is desirable to use a method which minimises the significance of the division of a corpus into texts. 'Text' and 'document' are problematic constructs: any corpus-building project has to make a range of practical decisions about what is to be considered a text, determining, for example, whether all the poems in a book of poetry count as one text, and how newspapers are going to be divided.6 The one point at which our method uses the division into texts is in identifying the chunks of the corpus to be randomly placed in a subcorpus. Any subdivisions of the corpus which tended to keep contiguous material together and which gave an appropriate number of chunks (say, between 20 and 200), all of approximately the same size, would be satisfactory. One possibility is to treat a corpus as a single text, with chunks specified as &amp;quot;first 5,000 words&amp;quot;, &amp;quot;next 5,000 words&amp;quot;, etc., the strategy adopted in the experiments described below.</Paragraph>
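    A minimal sketch of that chunk-and-split step, under the 5,000-word chunking strategy mentioned above; the function name, the fixed random seed and the use of Counter objects are assumptions made for illustration:

        import random
        from collections import Counter

        def random_halves(tokens, chunk_size=5000, seed=0):
            # Treat the corpus as one token stream and cut it into consecutive chunks.
            chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
            random.Random(seed).shuffle(chunks)
            half = len(chunks) // 2
            # Each half becomes a subcorpus, represented as a word frequency list.
            freq1 = Counter(w for chunk in chunks[:half] for w in chunk)
            freq2 = Counter(w for chunk in chunks[half:] for w in chunk)
            return freq1, freq2

        # Homogeneity of the corpus = similarity (e.g. the CBDF score of Section 5) of freq1 and freq2.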
    <Paragraph position="15"> At a first pass, it would appear that the chi-square test will serve to indicate whether two corpora are drawn from the same population, or whether two or more phenomena are significantly different in their distributions between two corpora. For a contingency table of dimensions m x n, if the null hypothesis is true, the statistic χ² = Σ (O-E)²/E</Paragraph>
    <Paragraph position="17"> (where O is the observed value, E is the expected value calculated on the basis of the joint corpus, and the sum is over the cells of the contingency table) will be χ²-distributed with (m - 1) x (n - 1) degrees of freedom.7 (Hofland and Johansson, 1982) use the test to identify where words are significantly more frequent in the LOB corpus (of British English) than in the Brown corpus (of American English). In the table where they make the comparison, the χ²-value for each word is given, with the value marked 1, 2 or 3 if it exceeds the critical value of the statistic at any of three different significance levels, so one might infer that the LOB-Brown difference was non-random.</Paragraph>
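    For illustration only (this is a common reconstruction of such a per-word test, not Hofland and Johansson's own procedure), each word can be tested with a 2 x 2 contingency table (this word vs. all other words, corpus 1 vs. corpus 2), applying Yates' correction as mentioned in the footnote below:

        def word_chi2(count1, total1, count2, total2, yates=True):
            # 2 x 2 table: rows are the two corpora, columns are (this word, any other word).
            observed = [[count1, total1 - count1], [count2, total2 - count2]]
            n = total1 + total2
            col_totals = [count1 + count2, n - count1 - count2]
            chi2 = 0.0
            for row, row_total in zip(observed, (total1, total2)):
                for o, col_total in zip(row, col_totals):
                    e = row_total * col_total / n      # expected value from the joint corpus
                    d = abs(o - e)
                    if yates:                          # Yates' correction (one degree of freedom)
                        d = max(d - 0.5, 0.0)
                    chi2 += d * d / e
            return chi2

        # word_chi2(...) > 3.84 would reject the null hypothesis at the 5% level (1 d.f.).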
    <Paragraph position="18"> Looking at the LOB-Brown comparison, we find that very many words, including most very common words, are marked. Much of the time, the null hypothesis is defeated. Does this show that all those words have systematically different patterns of usage in British and American English? To test this, we took two corpora which were indisputably of the same language type: each was a random subset of the BNC. The sampling was as follows: all texts shorter than 20,000 words were excluded and all others were truncated at 20,000 words. The truncated texts were randomly assigned to either corpus 1 or corpus 2, and frequency lists for each corpus were generated.</Paragraph>
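    A sketch of this sampling procedure, assuming each BNC text is available as a list of word tokens; the 20,000-word threshold follows the text, while the function name and the random seed are illustrative:

        import random
        from collections import Counter

        def two_random_bnc_subsets(texts, min_len=20000, seed=0):
            # Exclude texts shorter than min_len words; truncate the rest at min_len.
            kept = [t[:min_len] for t in texts if len(t) >= min_len]
            rng = random.Random(seed)
            freq1, freq2 = Counter(), Counter()
            for text in kept:
                # Randomly assign each truncated text to corpus 1 or corpus 2.
                (freq1 if rng.random() < 0.5 else freq2).update(text)
            return freq1, freq2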
    <Paragraph position="19"> As in the LOB-Brown comparison, for very many words, including most common words, the null hypothesis was defeated. This reveals a bald, obvious fact about language. Words are not selected at random. There is no a priori reason to expect them to behave as if they had been, and indeed they do not. Footnote 6: The appropriate theoretical response, as taken in the Text Encoding Initiative, is that texts are hierarchically structured, so 'same text' does not have a unique interpretation.</Paragraph>
    <Paragraph position="20"> Footnote 7: Provided all expected values are over a threshold of 5. Where there is just one degree of freedom, Yates' correction is applied.</Paragraph>
    <Paragraph position="22"> The LOB-Brown differences cannot in general be interpreted as British-American differences: it is in the nature of language that any two collections of texts covering a range of registers (and comprising, say, less than a thousand samples of over a thousand words each) will show such differences. While it might seem plausible that oddities would in some way balance out to give a population that was indistinguishable from one where the individual words (as opposed to the individual texts) had been randomly selected, this turns out not to be the case.</Paragraph>
    <Paragraph position="23"> Let us look closer at why this occurs. A key word in the last paragraph is 'indistinguishable'. The null hypothesis we are testing is that both frequency lists were the outcome of random selections from the same source. Since words in a text are not selected at random, we know that our corpora are not randomly generated. The only question, then, is whether there is enough evidence to say, with confidence, that they are not. In general, where a word is more common, there is more evidence. This is why a higher proportion of common words than of rare ones defeats the null hypothesis. As one statistics textbook puts it: None of the null hypotheses we have considered with respect to goodness of fit can be exactly true, so if we increase the sample size (and hence the value of χ²) we would ultimately reach the point when all null hypotheses would be rejected. All that the χ² test can tell us, then, is that the sample size is too small to reject the null hypothesis! (Owen and Jones, 1977, p 359) For large corpora and common words, the sample size is no longer too small. On the null hypothesis, the expected value for the (O-E)²/E term would be 0.5 (see Footnote 8) and would not vary with word frequency. Table 2 shows that this term tends to be substantially higher than 0.5 and increases with word frequency.</Paragraph>
    <Paragraph position="24"> Table 2 was generated from a list, ordered by frequency, giving the term's value for each word. The first line of the table then states that the average of these values, for the first 20 items on the list (the first of which was 'the'), was 55.1.</Paragraph>
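    A sketch of how such a table can be produced from two word frequency lists: for each word, the (O-E)²/E term is computed for both corpora and averaged, words are taken in order of joint frequency, and the per-word values are averaged over successive bands of 20 words. The band size and number of bands are illustrative parameters:

        from collections import Counter

        def mean_error_term_by_band(freq1, freq2, band_size=20, n_bands=5):
            n1, n2 = sum(freq1.values()), sum(freq2.values())
            joint = Counter(freq1) + Counter(freq2)
            per_word = []
            for w, c in joint.most_common(band_size * n_bands):
                e1, e2 = c * n1 / (n1 + n2), c * n2 / (n1 + n2)  # expected counts from the joint corpus
                # Average of the two cells' (O-E)^2/E terms; expected to be about 0.5 under the null hypothesis.
                per_word.append(((freq1.get(w, 0) - e1) ** 2 / e1 + (freq2.get(w, 0) - e2) ** 2 / e2) / 2)
            return [sum(per_word[i:i + band_size]) / band_size
                    for i in range(0, len(per_word), band_size)]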
    <Paragraph position="25"> SO.5 rather than 1 because there are two cells in the contingency table for each degree of freedom.</Paragraph>
    <Section position="1" start_page="237" end_page="237" type="sub_section">
      <SectionTitle>
5.1 χ² without the null hypothesis
</SectionTitle>
      <Paragraph position="0"> We cannot use the χ² statistic for testing the null hypothesis, but nonetheless it does come close to meeting our requirements. The (O-E)²/E term gives a measure of the difference in a word's frequency between two corpora, and, while the measure tends to increase with word frequency, it does not increase by orders of magnitude. The strategy we adopt is therefore to calculate χ² for (sub)corpus pairs, and then to use this as the measure of corpus similarity and homogeneity.</Paragraph>
      <Paragraph position="1"> The score is then normalised by the number of words used for the comparison (equivalent to the number of degrees of freedom) to give a measure we shall call CBDF (Chi By Degrees of Freedom).</Paragraph>
      <Paragraph position="2"> The question arises, which words, and how many, should be used in the comparison. Since the error term tends to increase with frequency, CBDF scores will only be comparable if words of the same span of frequencies are used in the comparisons. We simply used the N most frequent words in the union of the two corpora to be compared. The experiments below explore different values for N.</Paragraph>
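      Putting the pieces of this section together, a minimal sketch of the CBDF computation under the assumptions above (expected values estimated from the joint corpus); the default N of 500 is purely illustrative, since the experiments explore different values:

          from collections import Counter

          def cbdf(freq1, freq2, n_words=500):
              # Chi By Degrees of Freedom: chi-square summed over the N most frequent
              # words of the joint corpus, normalised by the number of words compared.
              n1, n2 = sum(freq1.values()), sum(freq2.values())
              joint = Counter(freq1) + Counter(freq2)
              chi2 = 0.0
              for w, c in joint.most_common(n_words):
                  e1, e2 = c * n1 / (n1 + n2), c * n2 / (n1 + n2)  # expected counts
                  chi2 += (freq1.get(w, 0) - e1) ** 2 / e1 + (freq2.get(w, 0) - e2) ** 2 / e2
              return chi2 / n_words  # higher = more dissimilar (or, for two halves, less homogeneous)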
    </Section>
    <Section position="2" start_page="237" end_page="237" type="sub_section">
      <SectionTitle>
5.2 Normalisation
</SectionTitle>
      <Paragraph position="0"> At a first pass, a measure of corpus homogeneity or similarity should be able to compare corpora of different sizes. As we have seen, for all but purely random populations, (O-E)²/E tends to increase with frequency. Where corpora are larger, words will tend to be more frequent, so, for the same level of corpus similarity or homogeneity and the same number of degrees of freedom, χ² will be larger. There is also a theoretical problem: it is not clear what it means to say that corpora of different sizes are equally homogeneous. If corpus 1 is twice as large as corpus 2, do we call them 'equally homogeneous' if corpus 1 contains twice as many language varieties as corpus 2, or the same number of language varieties but twice as much of each? Is a corpus as homogeneous as a subcorpus we produce from it which contains a randomly selected half of its texts, or is it as homogeneous as one that contains half of each of its texts? It is not obvious, and I am currently investigating the question further. The experiments described below all use same-size corpora.</Paragraph>
    </Section>
  </Section>
</Paper>