<?xml version="1.0" standalone="yes"?> <Paper uid="H90-1055"> <Title>Deducing Linguistic Structure from the Statistics of Large Corpora</Title> <Section position="3" start_page="277" end_page="277" type="metho"> <SectionTitle> 2.4 Results </SectionTitle> <Paragraph position="0"> A careful evaluation of this parser, like any other, requires some &quot;gold standard&quot; against which to judge its output. Soon, we will be able to use the skeletal parsing of the Penn Treebank we are about to begin producing to evaluate this work (although evaluating this parser against materials which we ourselves provide is admittedly problematic). For the moment, we have simply graded the output of the parser by hand ourselves.</Paragraph> <Paragraph position="1"> While the error rate for short sentences (15 words or less) with simple constructs is a reliable figure, the error rate for longer sentences is more of an approximation than a rigorous value.</Paragraph> <Paragraph position="2"> On unconstrained free text from a reserved test corpus, the parser averages about two errors per sentence for sentences under 15 words in length. On sentences between 16 and 30 tokens in length, it averages between 5 and 6 errors per sentence. In nearly all of these longer sentences, and in many of the shorter ones, at least one of the errors is caused by confusion about conjuncts.</Paragraph> <Paragraph position="3"> One interesting possibility is to use the generalized mutual information statistic to extract a grammar from a corpus. Since the statistic is consistent, and its window can span more than two constituents, it could be used to find constituent units which occur with the same distribution in similar contexts. Given the results of the next section, it may well be possible to use automatic techniques to determine a first approximation to the set of word classes of a language, given only a large corpus of text, and then to extract a grammar for that set of word classes. Such a goal is very difficult, of course, but we believe that it is worth pursuing. In the end, we believe that this, like many problems in natural language processing, can be solved neither efficiently by grammar-based algorithms nor accurately by purely stochastic algorithms. We believe strongly that the solution to some of these problems may well be a combination of both approaches.</Paragraph> </Section> <Section position="4" start_page="277" end_page="278" type="metho"> <SectionTitle> 3 Discovering the Word Classes of a Language </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="277" end_page="277" type="sub_section"> <SectionTitle> 3.1 Introduction </SectionTitle> <Paragraph position="0"> As we asked immediately above, to what extent is it possible to discover by some kind of distributional analysis the kind of part-of-speech tags upon which our mutual information parser depends? In this section, we examine the possibility of using distributional analysis to discover the word classes of a language.</Paragraph> </Section> <Section position="2" start_page="277" end_page="278" type="sub_section"> <SectionTitle> 3.2 The Algorithm </SectionTitle> <Paragraph position="0"> The feature discovery system works as follows. First, a large amount of text is examined to discover the frequency of occurrence of different bigrams.4 Based upon this data, the system groups words into classes. Two words are in the same class if they can occur in the same contexts. In order to determine whether x and y belong to the same class, the system first examines all bigrams containing x. If for a high percentage of these bigrams, the corresponding bigram with y substituted for x exists in the corpus, then it is likely that y has all of the features that x has (and maybe more). If, upon examining the bigrams containing y, the system is able to conclude that x also has all of the features that y has, it then concludes that x and y are in the same class.</Paragraph>
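As an illustration of the substitution test just described, here is a minimal Python sketch that counts bigrams and measures how often a bigram containing x still occurs when y is substituted for x. The function names and corpus handling are our own illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+

def bigram_counts(tokens):
    """Count every adjacent pair of tokens in the corpus."""
    return Counter(pairwise(tokens))

def substitutable(x, y, counts):
    """Frequency-weighted fraction of bigrams containing x whose
    y-substituted counterpart also occurs in the corpus."""
    hits = total = 0
    for (a, b), n in counts.items():
        if a == x:                       # bigram of the form: x b
            total += n
            hits += n if (y, b) in counts else 0
        elif b == x:                     # bigram of the form: a x
            total += n
            hits += n if (a, y) in counts else 0
    return hits / total if total else 0.0

# A high value in both directions suggests shared class membership:
# substitutable('a', 'the', counts) should be near 1, while the
# reverse may be lower, since 'the' occurs in contexts (e.g., before
# plural nouns) where 'a' cannot.
```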
<Paragraph position="1"> For every pair of bigrams, the system must determine how much to weigh the presence of those bigrams as evidence that two words have features in common. For instance, assume: (a) the bigram the boy appears many times in the corpus being analyzed, while the sits never occurs. Also assume: (b) the bigram boy the (as in the boy the girl kissed ...) occurs once and sits the never occurs. Case (a) should be much stronger evidence that boy and sits are not in the same class than case (b).</Paragraph> <Paragraph position="2"> For each bigram αx occurring in the corpus, evidence offered by the presence (or absence) of the bigram αy is scaled by the frequency of αx in the text divided by the total number of bigrams containing x on their right hand side. Since the end-of-phrase position is less restrictive, we would expect each bigram involving this position and the word to the right of it to occur less frequently than bigrams of two phrase-internal words. By weighing the evidence in this way, bigrams which cross phrase boundaries will be weighed less than those which do not.</Paragraph> <Paragraph position="3"> The function implies(x,y) calculates the likelihood (on a scale of [0..1]) that word y contains all of the features of word x. For example, we would expect the value of implies('a', 'the') to be close to 1, since 'the' can occur in any context in which 'a' can occur. Note that: implies(x,y) ∧ implies(y,x) iff x and y are in the same class.</Paragraph> <Paragraph position="4"> 4 We consider the set of features of a particular language to be all attributes which that language makes reference to in its syntax.</Paragraph> <Paragraph position="5"> The function leftimply(x,y) is the likelihood (on a scale of [0..1]) that y contains all of the features of x, where this likelihood is derived from looking at bigrams of the form xα. rightimply(x,y) derives the likelihood by examining all bigrams of the form αx.</Paragraph> <Paragraph position="6"> bothoccur(α,β) is 1 if both bigrams α and β occur in the corpus, and β occurs with a frequency at least 1/THRESHOLD of that of α, for some THRESHOLD.5 bothoccur accounts for the fact that we cannot expect the distribution of two equivalent words over bigrams to be precisely the same, but we would not expect the two distributions to be too dissimilar either.</Paragraph> <Paragraph position="7"> leftimply(x,y) = Σ_α percentage(xα) × bothoccur(xα, yα), rightimply(x,y) = Σ_α percentage(αx) × bothoccur(αx, αy)</Paragraph> <Paragraph position="8"> When computing the relation between x and all other words, we use the following function, percentage, to weigh the evidence (as described above), where count(αβ) is the number of occurrences of the bigram αβ in the corpus, and numright(x) (numleft(x)) is the total number of bigrams with x on their right hand side (left hand side).</Paragraph> <Paragraph position="9"> percentage(αx) = count(αx) / numright(x), percentage(xα) = count(xα) / numleft(x)</Paragraph> <Paragraph position="10"> For all pairs of words, x and y, we calculate implies(x,y) and implies(y,x). We can then find word classes in the following way. We first determine a threshold value, where a stronger value will result in more specific classes. Then, for each word x, we find all words y such that both implies(x,y) and implies(y,x) are greater than the threshold. We next take the transitive closure of pairs of sets with nonempty intersection over all of these sets, and the result is a set of sets, where each set is a word class. Classes of different degrees of specificity are found by varying the degree of similarity between distributions needed to conclude that two words are in the same class. If a high degree of similarity is required, all words in a class will have the same features. If a lower degree of similarity is required, then words in a class must have most, but not all, of the same features.</Paragraph> <Paragraph position="11"> 5 In the experiments we ran, we found THRESHOLD = 6 to give the best results. This value was found by examining the values of implication found between the, a and an.</Paragraph> </Section>
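Putting these pieces together, the following Python sketch shows one way the scoring and class-formation steps could be realized. How leftimply and rightimply combine into implies, and the use of union-find to take the transitive closure, are our assumptions about details the text leaves open; all names are illustrative.

```python
from collections import Counter, defaultdict
from itertools import pairwise

THRESHOLD = 6  # the value the paper reports working best (footnote 5)

def tabulate(tokens):
    """Bigram counts plus the numleft/numright totals used by percentage."""
    counts = Counter(pairwise(tokens))
    numleft, numright = Counter(), Counter()
    for (a, b), n in counts.items():
        numleft[a] += n    # bigrams with a on their left-hand side
        numright[b] += n   # bigrams with b on their right-hand side
    return counts, numleft, numright

def bothoccur(n_alpha, n_beta):
    """1 if both bigrams occur and the second is at least
    1/THRESHOLD as frequent as the first, else 0."""
    return int(n_alpha > 0 and n_beta > 0 and n_beta * THRESHOLD >= n_alpha)

def rightimply(x, y, counts, numright):
    """Evidence from bigrams of the form (a, x), percentage-weighted."""
    if not numright[x]:
        return 0.0
    return sum((n / numright[x]) * bothoccur(n, counts.get((a, y), 0))
               for (a, b), n in counts.items() if b == x)

def leftimply(x, y, counts, numleft):
    """Evidence from bigrams of the form (x, b), percentage-weighted."""
    if not numleft[x]:
        return 0.0
    return sum((n / numleft[x]) * bothoccur(n, counts.get((y, b), 0))
               for (a, b), n in counts.items() if a == x)

def implies(x, y, counts, numleft, numright):
    # Assumption: require support in both directions by taking the minimum.
    return min(leftimply(x, y, counts, numleft),
               rightimply(x, y, counts, numright))

def word_classes(words, counts, numleft, numright, threshold):
    """Union-find realizes the transitive closure over mutually
    implying pairs; each resulting set is one word class."""
    parent = {w: w for w in words}
    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w
    for i, x in enumerate(words):
        for y in words[i + 1:]:
            if (implies(x, y, counts, numleft, numright) > threshold and
                    implies(y, x, counts, numleft, numright) > threshold):
                parent[find(x)] = find(y)
    classes = defaultdict(set)
    for w in words:
        classes[find(w)].add(w)
    return list(classes.values())
```

Raising the threshold yields the smaller, more specific classes the text describes; lowering it merges words that share most but not all of their features.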
<Section position="3" start_page="278" end_page="278" type="sub_section"> <SectionTitle> 3.3 The Experiment </SectionTitle> <Paragraph position="0"> To test the algorithm discussed above, we ran the following experiment. First, the number of occurrences of each bigram in the corpus was determined. Statistics on distribution were determined by examining the complete Brown Corpus (Francis 82), where infrequently occurring open-class words were replaced with their part-of-speech tag. We then ran the program on a group of words including all closed-class words which occurred more than 250 times in the corpus, and the most frequently occurring open-class words. Note that the system attempted to determine the relations between these words; this does not mean that it only considered bigrams αβ where both α and β were from the list of words being partitioned. All bigrams which occurred more than 5 times were considered in the distributional analysis.</Paragraph> </Section>
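A sketch of this preprocessing, under our own assumptions about what counts as &quot;infrequently occurring&quot; (the paper gives no word-frequency cutoff), might look as follows:

```python
from collections import Counter
from itertools import pairwise

MIN_BIGRAM_FREQ = 5  # the paper keeps bigrams occurring more than 5 times

def preprocess(tagged_tokens, open_class_tags, min_word_freq=100):
    """Replace infrequent open-class words with their POS tag.
    min_word_freq is our assumption; the paper says only that
    'infrequently occurring' open-class words were replaced."""
    freq = Counter(w for w, _ in tagged_tokens)
    return [tag if tag in open_class_tags and freq[w] < min_word_freq else w
            for w, tag in tagged_tokens]

def usable_bigrams(tokens):
    """Keep only bigrams frequent enough to support the analysis."""
    counts = Counter(pairwise(tokens))
    return {bg: n for bg, n in counts.items() if n > MIN_BIGRAM_FREQ}
```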
<Section position="4" start_page="278" end_page="278" type="sub_section"> <SectionTitle> 3.4 Analysis of the Experiment </SectionTitle> <Paragraph position="0"> The program successfully partitioned words into word classes.6 In addition, it was able to find more fine-grained features. Among the features found were: [possessive-pronoun], [singular-determiner], [definite-determiner], [wh-adjunct] and [pronoun+be]. A description of some of the word classes the program discovered can be found in Appendix A.</Paragraph> <Paragraph position="1"> 6 One exception was the class of pronouns. Since [+nominative] and [-nominative] pronouns do not have similar distribution, they were not found to be in the same class.</Paragraph> </Section> <Section position="5" start_page="278" end_page="278" type="sub_section"> <SectionTitle> 3.5 The Psychological Plausibility of Distributional Analysis </SectionTitle> <Paragraph position="0"> If a child does not know a priori what features are used in her language, there are two ways in which she can acquire this information: by using either syntactic or semantic cues. The child could use syntactic cues, such as the method of distributional analysis described in this paper, or she might rely upon semantic cues.</Paragraph> <Paragraph position="1"> There is evidence that children use syntactic rather than semantic cues in classifying words. Peter Gordon (Gordon 85) ran an experiment in which the child was presented with an object which was given a made-up name. For objects with the semantic properties of count nouns (mass nouns), the word was used in lexical environments in which only mass nouns (count nouns) are permitted to occur.</Paragraph> <Paragraph position="2"> Gordon showed that the children overwhelmingly used the distributional cues and not the semantic cues in classifying the words. Virginia Gathercole (Gathercole 85) found that &quot;children do not approach the co-occurrence conditions of much and many with various nouns from a semantic point of view, but rather from a morphosyntactic or surface-distributional one.&quot; Yonata Levy (Levy 83) examined the mistakes young children make in classifying words. The mistakes made were not those one would expect the child to make if she were using semantic cues to classify words.</Paragraph> </Section> </Section> </Paper>