<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2019"> <Title>An Unsupervised Method for Detecting Grammatical Errors</Title> <Section position="3" start_page="140" end_page="143" type="intro"> <SectionTitle> 2 ALEK Architecture </SectionTitle> <Paragraph position="0"> What kinds of anomalous elements does ALEK identify? Writers sometimes produce errors that violate basic principles of English syntax (e.g., a desks), while other mistakes show a lack of information about a specific vocabulary item (e.g., a knowledge). In order to detect these two types of problems, ALEK uses a 30-million-word general corpus of English from the San Jose Mercury News (hereafter referred to as the general corpus) and, for each target word, a set of 10,000 example sentences from North American newspaper text¹ (hereafter referred to as the word-specific corpus).</Paragraph> <Paragraph position="1"> ¹ The corpora are extracted from the ACL-DCI corpora. In selecting the sentences for the word-specific corpora, we tried to minimize the mismatch between the domains of newspapers and TOEFL essays. For example, in the newspaper domain, concentrate is usually used as a noun, as in orange juice concentrate, but in TOEFL essays it is a verb 91% of the time. Sentence selection for the word-specific corpora was constrained to reflect the distribution of part-of-speech tags for the target word in a random sample of TOEFL essays. ALEK infers negative evidence from the contextual cues that do not co-occur with the target word, either in the word-specific corpus or in the general English one. It uses two kinds of contextual cues in a ±2 word window around the target word: function words (closed-class items) and part-of-speech tags (Brill, 1994). The Brill tagger output is post-processed to "enrich" some closed-class categories of its tag set, such as subject versus object pronoun and definite versus indefinite determiner. The enriched tags were adapted from Francis and Kučera (1982).</Paragraph> <Paragraph position="2"> After the sentences have been preprocessed, ALEK counts sequences of adjacent part-of-speech tags and function words (such as determiners, prepositions, and conjunctions). For example, the sequence a/AT full-time/JJ job/NN contributes one occurrence each to the bigrams AT+JJ, JJ+NN, and a+JJ, and to the part-of-speech tag trigram AT+JJ+NN. Each individual tag and function word also contributes to its own unigram count. These frequencies form the basis for the error detection measures.</Paragraph> <Paragraph position="3"> From the general corpus, ALEK computes a mutual information measure to determine which sequences of part-of-speech tags and function words are unusually rare and are, therefore, likely to be ungrammatical in English (e.g., a singular determiner preceding a plural noun, as in *a desks). Mutual information has often been used to detect combinations of words that occur more frequently than we would expect based on the assumption that the words are independent.</Paragraph> <Paragraph position="4"> Here we use this measure for the opposite purpose: to find combinations that occur less often than expected. ALEK also looks for sequences that are common in general but unusual in the word-specific corpus (e.g., the singular determiner a preceding a singular noun is common in English but rare when the noun is knowledge). These divergences between the two corpora reflect syntactic properties that are peculiar to the target word.</Paragraph>
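To make the counting step concrete, the following is a minimal sketch (not the paper's code) of how the unigram, bigram, and trigram counts described above could be accumulated in Python. It assumes tagged sentences arrive as lists of (word, tag) pairs, and FUNCTION_WORDS is a hypothetical stand-in for the full closed-class list; since the text names only the tag trigram AT+JJ+NN, this sketch restricts trigrams to part-of-speech tags.

from collections import Counter
from itertools import product

FUNCTION_WORDS = {"a", "an", "the", "of", "in", "and", "or"}  # assumed subset

def position_reps(word, tag):
    # Each position is represented by its POS tag and, for closed-class
    # items, also by the word itself (so a/AT yields both "AT" and "a").
    r = [tag]
    if word.lower() in FUNCTION_WORDS:
        r.append(word.lower())
    return r

def count_ngrams(tagged_sentences):
    unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        reps = [position_reps(w, t) for w, t in sent]
        tags = [t for _, t in sent]
        for r in reps:
            unigrams.update(r)
        for a, b in zip(reps, reps[1:]):
            bigrams.update(product(a, b))  # yields AT+JJ, a+JJ, JJ+NN, ...
        trigrams.update(zip(tags, tags[1:], tags[2:]))  # tag trigrams only
    return unigrams, bigrams, trigrams

# The example from the text: a/AT full-time/JJ job/NN.
uni, bi, tri = count_ngrams([[("a", "AT"), ("full-time", "JJ"), ("job", "NN")]])
assert bi[("AT", "JJ")] == bi[("a", "JJ")] == bi[("JJ", "NN")] == 1
assert tri[("AT", "JJ", "NN")] == 1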
<Section position="1" start_page="141" end_page="141" type="sub_section"> <SectionTitle> 2.1 Measures based on the general corpus </SectionTitle> <Paragraph position="0"> The system computes mutual information comparing the proportion of observed occurrences of bigrams in the general corpus to the proportion expected based on the assumption of independence, as shown below:

MI = \log_2 \frac{P(AB)}{P(A) \times P(B)}

Here, P(AB) is the probability of the occurrence of the AB bigram, estimated from its frequency in the general corpus, and P(A) and P(B) are the probabilities of the first and second elements of the bigram, also estimated from the general corpus. Ungrammatical sequences should produce bigram probabilities that are much smaller than the product of the unigram probabilities (the value of MI will be negative). Trigram sequences are also used, but in this case the mutual information computation compares the co-occurrence of ABC to a model in which A and C are assumed to be conditionally independent given B (see Lin, 1998):</Paragraph> <Paragraph position="1">

MI = \log_2 \frac{P(ABC)}{P(B) \times P(A \mid B) \times P(C \mid B)}

Once again, a negative value is often indicative of a sequence that violates a rule of English.</Paragraph> </Section>
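A minimal sketch of the two measures above, assuming Counter-based unigram, bigram, and trigram counts like those from the counting step (function names are illustrative; smoothing of zero counts, which a real system would need for unseen sequences, is omitted):

import math

def bigram_mi(unigrams, bigrams, a, b):
    # MI = log2( P(AB) / (P(A) * P(B)) ), all estimated from the general corpus.
    n1 = sum(unigrams.values())   # total unigram tokens
    n2 = sum(bigrams.values())    # total bigram tokens
    p_ab = bigrams[(a, b)] / n2
    p_a, p_b = unigrams[a] / n1, unigrams[b] / n1
    return math.log2(p_ab / (p_a * p_b))

def trigram_mi(unigrams, bigrams, trigrams, a, b, c):
    # MI = log2( P(ABC) / (P(B) * P(A|B) * P(C|B)) ); A and C are modeled as
    # conditionally independent given B.
    p_abc = trigrams[(a, b, c)] / sum(trigrams.values())
    p_b = unigrams[b] / sum(unigrams.values())
    p_a_given_b = bigrams[(a, b)] / unigrams[b]
    p_c_given_b = bigrams[(b, c)] / unigrams[b]
    return math.log2(p_abc / (p_b * p_a_given_b * p_c_given_b))

A strongly negative return value flags a sequence that occurs far less often than the independence model predicts, which is the signal ALEK treats as evidence of ungrammaticality.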
<Section position="2" start_page="141" end_page="142" type="sub_section"> <SectionTitle> 2.2 Comparing the word-specific corpus to the general corpus </SectionTitle> <Paragraph position="0"> ALEK also uses mutual information to compare the distributions of tags and function words in the word-specific corpus to the distributions that are expected based on the general corpus. The measures for bigrams and trigrams are similar to those given above, except that the probability in the numerator is estimated from the word-specific corpus and the probabilities in the denominator come from the general corpus. To return to a previous example, the phrase a knowledge contains the tag bigram for singular determiner followed by singular noun (AT NN).</Paragraph> <Paragraph position="1"> This sequence is much less common in the word-specific corpus for knowledge than would be expected from the general corpus unigram probabilities of AT and NN.</Paragraph> <Paragraph position="2"> In addition to bigram and trigram measures, ALEK compares the target word's part-of-speech tag in the word-specific corpus and in the general corpus. Specifically, it looks at the conditional probability of the part-of-speech tag given the major syntactic category (e.g., plural noun given noun) in both distributions, by computing the following value:</Paragraph> <Paragraph position="3">

\log_2 \frac{P_{word\text{-}specific}(tag \mid category)}{P_{general}(tag \mid category)}

</Paragraph> <Paragraph position="4"> For example, in the general corpus, about half of all noun tokens are plural, but in the training set for the noun knowledge, the plural knowledges occurs rarely, if at all.</Paragraph> <Paragraph position="5"> The mutual information measures provide candidate errors, but this approach overgenerates: it finds rare, but still quite grammatical, sequences. To reduce the number of false positives, no candidate found by the MI measures is considered an error if it appears in the word-specific corpus at least two times. This increases ALEK's precision at the price of reduced recall. For example, a knowledge will not be treated as an error because it appears in the training corpus as part of the longer sequence a knowledge of (as in a knowledge of mathematics).</Paragraph> <Paragraph position="6"> ALEK also uses another statistical technique for finding rare and possibly ungrammatical tag and function word bigrams: computing the χ² (chi-square) statistic for the difference between the bigram proportions found in the word-specific and in the general corpus:

\chi^2 = \frac{(P_{word\text{-}specific} - P_{general})^2}{P_{general}\,(1 - P_{general}) \,/\, N_{word\text{-}specific}}

The χ² measure faces the same problem of overgenerating errors. Due to the large sample sizes, extreme values can be obtained even though the effect size may be minuscule. To reduce false positives, ALEK requires that effect sizes be at least in the moderate-to-small range (Cohen and Cohen, 1983).</Paragraph> <Paragraph position="7"> Direct evidence from the word-specific corpus can also be used to control the overgeneration of errors. For each candidate error, ALEK compares the larger context in which the bigram appears to the contexts that have been analyzed in the word-specific corpus. From the word-specific corpus, ALEK forms templates, sequences of words and tags that represent the local context of the target. If a test sentence contains a low-probability bigram (as measured by the χ² test), the local context of the target is compared to all the templates of which it is a part. Exceptions to the error, that is, longer grammatical sequences that contain rare subsequences, are found by examining conditional probabilities. To illustrate this, consider the example of a knowledge and a knowledge of.</Paragraph> <Paragraph position="8"> The conditional probability of of given a knowledge is high, as it accounts for almost all of the occurrences of a knowledge in the word-specific corpus. Based on this high conditional probability, the system will use the template for a knowledge of to keep it from being marked as an error. Other function words and tags in the +1 position have much lower conditional probabilities, so, for example, a knowledge is will not be treated as an exception to the error.</Paragraph> </Section>
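A sketch of the χ² filter just described, under stated assumptions: the 12.82 critical value and the 0.30 effect-size floor are the thresholds reported later in Section 2.4, and Cohen's h for a difference between proportions is used here as an assumed effect-size measure (the paper cites Cohen and Cohen (1983) but does not name the exact statistic). The sketch also assumes the bigram occurs at least once in the general corpus.

import math

CHI2_CRITICAL = 12.82      # critical value from Section 2.4
MIN_EFFECT_SIZE = 0.30     # effect-size floor from Section 2.4

def chi_square(p_specific, p_general, n_specific):
    # (P_word-specific - P_general)^2 / (P_general * (1 - P_general) / N_word-specific)
    return (p_specific - p_general) ** 2 / (p_general * (1 - p_general) / n_specific)

def cohens_h(p1, p2):
    # Assumed effect-size measure: Cohen's h for two proportions.
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

def is_candidate_error(bigram, specific_bigrams, general_bigrams):
    n_spec = sum(specific_bigrams.values())
    p_spec = specific_bigrams[bigram] / n_spec
    p_gen = general_bigrams[bigram] / sum(general_bigrams.values())
    # Flag bigrams that are rarer in the word-specific corpus than the
    # general corpus predicts, with a large chi-square value and a
    # non-trivial effect size.
    return (p_spec < p_gen
            and chi_square(p_spec, p_gen, n_spec) > CHI2_CRITICAL
            and cohens_h(p_spec, p_gen) >= MIN_EFFECT_SIZE)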
<Section position="3" start_page="142" end_page="142" type="sub_section"> <SectionTitle> 2.3 Validity of the n-gram measures </SectionTitle> <Paragraph position="0"> TOEFL essays are graded on a 6-point scale, where 6 demonstrates "clear competence" in writing on rhetorical and syntactic levels and 1 demonstrates "incompetence in writing". If low-probability n-grams signal grammatical errors, then we would expect TOEFL essays that received lower scores to have more of these n-grams. To test this prediction, we randomly selected from the TOEFL pool 50 essays for each of the 6 score values from 1.0 to 6.0. For each score value, all 50 essays were concatenated to form a super-essay. In every super-essay, for each adjacent pair and triple of tags containing a noun, verb, or adjective, the bigram and trigram mutual information values were computed based on the general corpus.</Paragraph> <Paragraph position="1"> Table 1 shows the proportions of bigrams and trigrams with mutual information less than -3.60, by score point. As predicted, there is a significant negative correlation between the score and the proportion of low-probability bigrams (r_s = -.94, n = 6, p < .01, two-tailed) and trigrams (r_s = -.84, n = 6, p < .05, two-tailed).</Paragraph> </Section> <Section position="4" start_page="142" end_page="143" type="sub_section"> <SectionTitle> 2.4 System development </SectionTitle> <Paragraph position="0"> ALEK was developed using three target words that were extracted from TOEFL essays: concentrate, interest, and knowledge. These words were chosen because they represent different parts of speech and varying degrees of polysemy. Each also occurred in at least 150 sentences in what was then a small pool of TOEFL essays. Before development began, each occurrence of these words was manually labeled as an appropriate or inappropriate usage, without taking into account grammatical errors that might have been present elsewhere in the sentence but that were not within the target word's scope.</Paragraph> <Paragraph position="1"> Critical values for the statistical measures were set during this development phase. The settings were based empirically on ALEK's performance so as to optimize precision and recall on the three development words. Candidate errors were those local context sequences that produced a mutual information value of less than -3.60 based on the general corpus; a mutual information value of less than -5.00 for the specific/general comparisons; or a χ² value greater than 12.82 with an effect size greater than 0.30. Precision and recall for the three words are shown below.</Paragraph> </Section> </Section> </Paper>