<?xml version="1.0" standalone="yes"?> <Paper uid="E87-1007"> <Title>How to Detect Grammatical Errors in a Text without Parsing it</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> HOW TO DETECT GRAMMATICAL ERRORS IN A TEXT WITHOUT PARSING IT </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> The Constituent Likelihood Automatic Word-tagging </SectionTitle>
<Paragraph position="0"> System (CLAWS) was originally designed for the low-level grammatical analysis of the million-word LOB Corpus of English text samples. CLAWS does not attempt a full parse, but uses a first-order Markov model of language to assign word-class labels to words. CLAWS can be modified to detect grammatical errors, essentially by flagging unlikely word-class transitions in the input text. This may seem to be an intuitively implausible and theoretically inadequate model of natural language syntax, but nevertheless it can successfully pinpoint most grammatical errors in a text.</Paragraph>
<Paragraph position="1"> Several modifications to CLAWS have been explored. The resulting system cannot detect all errors in typed documents; but then neither do far more complex systems, which attempt a full parse, requiring much greater computation.</Paragraph>
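<Paragraph> To make the error-detection idea concrete, the short Python sketch below flags word pairs whose tag-to-tag transition is improbable under a first-order (bigram) model. It is only an illustrative sketch under stated assumptions, not the CLAWS implementation: the toy lexicon, the hand-picked transition likelihoods and the threshold are invented for the example, each word is given a single unambiguous tag, and unknown words are skipped, whereas CLAWS itself resolves ambiguous word-tags and estimates its transition likelihoods from the tagged LOB Corpus.

# Illustrative sketch only: flag unlikely word-class (tag) transitions,
# in the spirit of the CLAWS-based error detector described above.
# TOY_LEXICON, TRANSITION_LIKELIHOOD and THRESHOLD are invented values.

TOY_LEXICON = {
    "the": "ART", "a": "ART", "dog": "NOUN", "contract": "NOUN",
    "barks": "VERB", "signed": "VERB", "loudly": "ADV", "i": "PRON",
}

# Likelihood of the second tag immediately following the first.
TRANSITION_LIKELIHOOD = {
    ("PRON", "VERB"): 0.40, ("VERB", "ART"): 0.35, ("ART", "NOUN"): 0.60,
    ("NOUN", "VERB"): 0.50, ("VERB", "ADV"): 0.30, ("ART", "ART"): 0.001,
}

THRESHOLD = 0.05  # transitions rarer than this are flagged as suspicious


def flag_unlikely_transitions(words):
    """Return (position, previous word, word) triples where the
    word-class transition is improbable under the bigram model."""
    tags = [TOY_LEXICON.get(w.lower(), "UNKNOWN") for w in words]
    flagged = []
    for i in range(1, len(words)):
        if "UNKNOWN" in (tags[i - 1], tags[i]):
            continue  # the sketch only scores words in the toy lexicon
        likelihood = TRANSITION_LIKELIHOOD.get((tags[i - 1], tags[i]), 0.0)
        if likelihood >= THRESHOLD:
            continue  # transition common enough; nothing to report
        flagged.append((i, words[i - 1], words[i]))
    return flagged


if __name__ == "__main__":
    print(flag_unlikely_transitions("the the dog barks loudly".split()))
    # expected output: [(1, 'the', 'the')]

Run on the string &quot;the the dog barks loudly&quot;, the sketch flags the repeated article, which is the kind of low-level slip a transition model can catch without any parsing.</Paragraph>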
<Paragraph position="2"> Checking Grammar in Texts
A number of researchers have experimented with ways to cope with grammatically ill-formed English input (for example, \[Carbonell and Hayes 83\], \[Charniak 83\], \[Granger 83\], \[Hayes and Mouradian 81\], \[Heidorn et al 82\], \[Jensen et al 83\], \[Kwasny and Sondheimer 81\], \[Weischedel and Black 80\], \[Weischedel and Sondheimer 83\]). However, the majority of these systems are designed for Natural Language interfaces to software systems, and so can assume a restricted vocabulary and syntax; for example, the system discussed by \[Fass 83\] had a vocabulary of less than 50 words. This may be justifiable for an NL front-end to a computer system such as a Database Query system, since even an artificial subset of English may be more acceptable to users than a formal command or query language.</Paragraph>
<Paragraph position="3"> However, for automated text-checking in Word Processing, we cannot reasonably ask the WP user to restrict their English text in this way. This means that WP text-checking systems must be extremely robust, capable of analysing a very wide range of lexical and syntactic constructs.</Paragraph>
<Paragraph position="4"> Otherwise, the grammar checker is liable to flag many constructs which are in fact acceptable to humans, but happen not to be included in the system's limited grammar.</Paragraph>
<Paragraph position="5"> A system which not only performs syntactic analysis of text, but also pinpoints grammatical errors, must be assessed along two orthogonal scales rather than a single 'accuracy' measure:</Paragraph> </Section> </Section>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> RECALL = </SectionTitle>
<Paragraph position="0"> &quot;number of words/constructs correctly flagged as errors&quot; divided by &quot;total number of 'true' errors that should be flagged&quot;</Paragraph> </Section>
<Section position="4" start_page="0" end_page="38" type="metho"> <SectionTitle> PRECISION = </SectionTitle>
<Paragraph position="0"> &quot;number of words/constructs correctly flagged as errors&quot; divided by &quot;total number of words/constructs flagged by the system&quot; It is easy to optimise one of these performance measures at the expense of the other: flagging (nearly) ALL words in a text will guarantee optimal recall (i.e. (nearly) all actual errors will be flagged) but at a low precision; and conversely, reducing the number of words flagged to nearly zero should raise the precision but lower the recall. The problem is to balance this trade-off to arrive at recall AND precision levels acceptable to WP users. A system which can accept a limited subset of English (and reject (or flag as erroneous) anything else) may have a reasonable recall rate; that is, most of the 'true' errors will probably be included in the rejected text. However, the precision rate is liable to be unacceptable to the WP user: large amounts of the input text will effectively be marked as potentially erroneous, with no indication of where within this text the actual errors lie. One way to deal with this problem is to increase the size and power of the parser and underlying grammar to deal with something nearer the whole gamut of English syntax; this is the approach taken by IBM's EPISTLE project (see \[Heidorn et al 82\], \[Jensen et al 83\]). Unfortunately, this can lead to a very large and computationally expensive system: \[Heidorn et al 82\] reported that the EPISTLE system required a 4Mb virtual machine (although a more efficient implementation under development should require less memory).</Paragraph>
<Paragraph position="1"> The UNIX Writer's Workbench collection of programs (see \[Cherry and Macdonald 83\], \[Cherry et al 83\]) is probably the most widely-used system for WP text-checking (and also one of the most widely-used NLP systems overall; see \[Atwell 86\], \[Hubert 85\]). This system includes a number of separate programs to check for different types of faults, including misspellings, cliches, and certain stylistic infelicities such as overly long (or short) sentences.</Paragraph>
<Paragraph position="2"> However, it lacks a general-purpose grammar checker; the nearest program is a tool to filter out doubled words (as in &quot;I signed the the contract&quot;).</Paragraph>
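<Paragraph> As an indication of how lightweight such a filter can be, the following Python sketch flags immediately repeated words. It is an illustrative reconstruction of the idea, not the Writer's Workbench program itself; the function name, the case-insensitive matching and the matching across line breaks are assumptions made for the example.

import re

# Illustrative doubled-word filter: report immediately repeated words,
# as in "I signed the the contract". Not the Writer's Workbench tool itself.
DOUBLED = re.compile(r"\b(\w+)(\s+\1)\b", re.IGNORECASE)


def find_doubled_words(text):
    """Return (character offset, word) pairs for each doubled word."""
    return [(m.start(), m.group(1)) for m in DOUBLED.finditer(text)]


if __name__ == "__main__":
    print(find_doubled_words("I signed the the contract."))
    # expected output: [(9, 'the')]

A checker of this kind can be very precise for the one error type it covers, but it says nothing about the many other kinds of grammatical error.</Paragraph>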
<Paragraph position="3"> Although there is a program PARTS which assigns a part-of-speech tag to each word in the text (as a precursor to the stylistic analysis programs), this program uses a set of localized heuristic rules to disambiguate words according to context; and these rules are based on the underlying assumption that the input sentences are grammatically well-formed. So, there is no clear way to modify PARTS to flag grammatical errors, unless we introduce a radically different mechanism for disambiguating word-tags according to context.</Paragraph> </Section> </Paper>