<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1024">
  <Title>EasyEnglish: A Tool for Improving Document Quality</Title>
  <Section position="4" start_page="159" end_page="159" type="metho">
    <SectionTitle>
2 Controlled Language Checker or
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="159" end_page="159" type="sub_section">
      <SectionTitle>
Grammar Checker?
</SectionTitle>
      <Paragraph position="0"> The emphasis of a CL compliance checker is on ensuring that the input text (document) conforms to the restrictions imposed by the definition of the CL, whereas the emphasis of a standard grammar checker is on ensuring that the text is not ungrammatical. Controlled Languages have been invented to solve the problems associated with readability and translatability, with slight regard to ensuring grammaticality. In fact, the point has been made that it is up to the writer to ensure that the text is grammatical (Hayes et al. 1996). Or, in the words of Goyvaerts (1996): " ... it is still possible to write controlled non-English." A similar point has been made for GIFAS Rationalized French (Lux and Dauphin 1996).</Paragraph>
      <Paragraph position="1"> However, the more grammatical the text is, the easier it is to read and translate, so it seems that this concept of a CL checker is too narrow. On the other hand, in many applications it may not be necessary for writers to restrict themselves to a very limited subset of English in order to write easily understandable and translatable documents. In this sense the concept of a CL checker may be too broad.</Paragraph>
      <Paragraph position="2"> We have developed a system that we believe strikes a useful balance between CL checking and standard grammar checking. It consists in restricting the CL checking to the detection of structural (syntactic) ambiguity, complexity, and violations of vocabulary constraints. This view is in line with the description of Dokumentationsdeutsch in (Schachtl 1996). Dokumentationsdeutsch is not defined by a list of allowed constructions, but rather by a list of forbidden constructions, allowing most of standard German syntax. In the same way, EasyEnglish allows most of standard English syntax. Lux and Dauphin (1996) also point out the importance of the linguistic coverage being as broad as possible. At the same time, we perform some of the checks that a standard grammar checker would perform. 3 The CL checks of EasyEnglish do work better when the text is not too ill-formed grammatically, since ill-formedness reduces the chances of the parser making good sense of the input. Most grammar checkers seem to have a problem with precision, 4 and this evidently stems from the inability of the system to make sense of the input. This is caused not only by too narrow coverage of the parser, but also by the ill-formed input that a standard grammar checker tries to deal with; it is harder to parse non-standard constructions correctly. It has been pointed out time and again (Richardson and Braden-Harder 1993; Wojcik and Holmback 1996; Clémencin 1996) that user acceptance depends on suitably high precision. Of course, the user also wants the checker to find the problems that need to be corrected, 5 but this seems to take much lower precedence (Wojcik and Holmback 1996; Clémencin 1996).</Paragraph>
      <Paragraph position="3"> 3 We have conflated the notions of grammar errors and style weaknesses. For a good discussion of the differences, see (Ravin 1993).</Paragraph>
      <Paragraph position="4"> 4 We define precision to be the number of relevant error reports divided by the total number of error reports. In other words, it is a measure of how many irrelevant error reports the user will be bothered with. The higher the precision, the better.</Paragraph>
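The precision measure defined in the text, together with the usual companion measure of recall (problems found divided by problems actually present), can be sketched as follows. This is only an illustration: the function names and the counts are hypothetical and are not taken from the study.

```python
def precision(relevant_reports: int, total_reports: int) -> float:
    """Relevant error reports divided by all error reports issued."""
    if total_reports == 0:
        raise ValueError("no error reports were issued")
    return relevant_reports / total_reports


def recall(problems_found: int, problems_present: int) -> float:
    """Problems the checker found divided by problems present in the text."""
    if problems_present == 0:
        raise ValueError("the text contains no problems to find")
    return problems_found / problems_present


# Hypothetical counts, chosen only to illustrate the arithmetic.
print(precision(40, 50))  # 0.8
print(recall(40, 64))     # 0.625
```

A checker with low precision issues many irrelevant reports even if its recall is high, which is the situation the text describes as harmful to user acceptance.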
      <Paragraph position="5"> We conducted a small, preliminary study comparing the quality of EasyEnglish with that of Grammatik and the grammar checker in AmiPro. For the study, we used a variety of text types, including technical documents, a manager's speech, and an on-line job advertisement written by a non-native English speaker. The (Precision, Recall) figures were: EasyEnglish (0.81, 0.87), Grammatik (0.51, 0.86), AmiPro (0.50, 0.69). There is overlap in the kinds of checks made by these three systems, but we attempted to evaluate each system on its own terms, i.e. on the basis of the collection of checks that it tries to do.</Paragraph>
      <Paragraph position="6"> That is, these figures show how well each system does what it tries to do, rather than how useful what it tries to do is. (The recall figures for Grammatik and AmiPro may be artificially high, since we may not have been able to identify all the problems that these grammar checkers intend to address.) Of course, high precision and recall alone are not enough to ensure the usefulness of an authoring tool such as EasyEnglish. We agree with (Adriaens and Macken 1995; Wojcik and Holmback 1996) that it is also necessary to evaluate how well writers can use the system to arrive at a satisfactory document. It is our claim that the types of checks EasyEnglish performs are vastly more relevant for ensuring high document quality than a majority of the checks in the above-mentioned grammar checkers (e.g. most of the lexically-based checks). It has been claimed that standard grammar checkers typically check for stylistic issues that are relevant for writers of fiction (Goyvaerts 1996). But, as Goyvaerts (1996) puts it: "Industry does not need Shakespeare or Chaucer, industry needs clear, concise communicative writing -- in one word Controlled Language." Of course, standard grammar checkers do also try to supply checks that are relevant for non-fictional genres. However, some of the standard stylistic recommendations are not entirely relevant, at least for technical documents. It is, for example, rather common for a standard grammar checker to discourage repetition. For a company that has to pay for document translation on a per word basis, every repeti-</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="159" end_page="161" type="metho">
    <SectionTitle>
3 Resolution of Ambiguity
</SectionTitle>
    <Paragraph position="0"> EasyEnglish identifies a number of structurally ambiguous constructions and supplies suggestions for unambiguous rephrasings. It is then up to the user to decide which interpretation is intended. Some systems support automatic substitution; since we deal with truly ambiguous constructions, we have to involve the user in making the choice. The EasyEnglish editor interface, however, does allow the user to select an offered rephrasing by mouse-clicking and have the selection substituted automatically.</Paragraph>
    <Paragraph position="1"> Other systems, e.g. the Attempto System (Fuchs and Schwitter 1996), present the user with a rephrasing that illustrates which interpretation the system arrived at. If that interpretation is not the desired one, it is up to the user to construct a rephrasing that will result in the desired interpretation. We think it is more user-friendly to show the user exactly how the construction may be ambiguous and let her make her own choice.</Paragraph>
    <Paragraph position="2"> In order for the disambiguation rules to work properly, it is crucial to have a deep analysis of the text. This deep analysis is provided by English Slot Grammar (ESG) (McCord 1980, 1990, 1993) in the form of parse trees expressed as a network structure. The disambiguation rules explore the network to spot ambiguous and potentially ambiguous constructions. ESG often provides more than one parse, ranked according to a specific numerical ranking system (McCord 1990, 1993). But, unlike some other systems, e.g. the Boeing Simplified English Checker (Wojcik and Holmback 1996), which look at a whole forest of trees, it is only necessary for EasyEnglish to look at the highest-ranked parse. ESG parsing heuristics often arrive at correct attachments in the highest-ranked parse. But even when the attachment is off, EasyEnglish can often point out other attachment possibilities to the writer. For example, if a present participial clause is attached to the object of a verb, there will also be the possibility that the participial clause actually should modify the subject instead. However, it is not necessary for the parse to reflect this, since this can be reflected in the EasyEnglish rule instead. A simplistic view of this rule would be: "If a present participial clause modifies the object, suggest two rephrasings, one that forces the attachment to the subject, and one that forces the attachment to the object".</Paragraph>
    <Paragraph position="3"> An example taken from an IBM manual: "Different system users may operate on different objects using the same application program." This sentence generates the following message: Ambiguous attachment of verb phrase: "using the same application program".</Paragraph>
    <Paragraph position="4"> Who/what is "using the same application program", "Different system users" or "different objects"? If "Different system users", a possible rephrasing would be: "by using the same application program"; If "different objects", a possible rephrasing would be: "different objects that use the same application program".</Paragraph>
    <Paragraph position="5"> Notice the additional benefit we get from basing the suggestion on a parse: the correct subject-verb agreement can be inferred for use in the suggestions.</Paragraph>
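The simplistic view of the participial-attachment rule can be sketched in a few lines. This is a minimal illustration, not the EasyEnglish implementation: the record type, its field names, and the naive agreement step are assumptions, and in the real system the corresponding information would come from the ESG parse.

```python
from dataclasses import dataclass


@dataclass
class ParticipialAttachment:
    """Hypothetical summary of what a parse would supply for this rule."""
    subject: str          # e.g. "Different system users"
    obj: str              # e.g. "different objects"
    participle: str       # e.g. "using"
    verb_base: str        # e.g. "use"
    rest: str             # e.g. "the same application program"
    obj_is_plural: bool   # drives subject-verb agreement in the rephrasing


def attachment_message(p: ParticipialAttachment) -> str:
    clause = f"{p.participle} {p.rest}"
    # Naive agreement: add "s" when the object is singular.
    finite = p.verb_base if p.obj_is_plural else p.verb_base + "s"
    lines = [
        f'Ambiguous attachment of verb phrase: "{clause}".',
        f'Who/what is "{clause}", "{p.subject}" or "{p.obj}"?',
        f'If "{p.subject}", a possible rephrasing would be: "by {clause}";',
        f'If "{p.obj}", a possible rephrasing would be: '
        f'"{p.obj} that {finite} {p.rest}".',
    ]
    return "\n".join(lines)


example = ParticipialAttachment(
    subject="Different system users",
    obj="different objects",
    participle="using",
    verb_base="use",
    rest="the same application program",
    obj_is_plural=True,
)
print(attachment_message(example))
```

The plural flag stands in for the agreement information that, as noted above, can be inferred from the parse.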
    <Paragraph position="6"> Coordination is another source of ambiguity, since the scope is not always clear. One type of ambiguity occurs when a conjoined noun phrase premodifies a noun, as in this example from an IBM manual: "It is the number defined in the file or result field definition." The phrase "file or result field definition" is ambiguous in many ways, as is shown by the output from EasyEnglish: Ambiguity in: "the file or result field definition". Possible rephrasings: "the result field definition or the file" or "the file definition or the result field definition" or "the file field definition or the result field definition" or "the definition of the file or of the result field" or "the field definition of the file or of the result". Another type of ambiguity in coordination concerns combinations of coordinating conjunctions, as illustrated by the following example: "The cat and the rat or the mat sat." Ambiguous coordination; possible rephrasings: "Either the cat and the rat or the mat" or "The cat and either the rat or the mat". The above cases illustrate constructions that are definitely ambiguous; however, some common problems involve modification that may or may not be correct, depending on domain knowledge, which we do not attempt to make use of at present.</Paragraph>
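For the mixed-conjunction case, the two rephrasings correspond to the two ways of grouping "A and B or C". A minimal sketch, assuming the three conjuncts have already been identified (the function name is hypothetical):

```python
def coordination_rephrasings(first: str, second: str, third: str) -> list[str]:
    """Return one rephrasing per grouping of 'first and second or third'."""
    capitalized = first[:1].upper() + first[1:]
    return [
        # Grouping 1: (first and second) or third
        f"Either {first} and {second} or {third}",
        # Grouping 2: first and (second or third)
        f"{capitalized} and either {second} or {third}",
    ]


for option in coordination_rephrasings("the cat", "the rat", "the mat"):
    print(option)
# Either the cat and the rat or the mat
# The cat and either the rat or the mat
```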
    <Paragraph position="7"> For example, the implicit subject in a nonfinite clause premodifying the main clause should be the same as the subject of the main clause. It is generally not possible to tell, on the basis of syntax alone, whether the author has adhered to this rule. But it is possible to alert the user to the potential problem. The following two examples illustrate the problem.</Paragraph>
    <Paragraph position="8"> The first example, taken from an IBM manual, is okay, whereas the second example, taken from (Lederer 1989), is not okay.</Paragraph>
    <Paragraph position="9"> "After signing on, the user has access to all objects on the system." Potentially wrong modification: "signing on". Okay if subject of "signing on" is "the user". "As a baboon who grew up wild in the jungle, I realized that Wiki had special nutritional needs." Potentially wrong modification; okay if "I" is "a baboon who grew up wild in the jungle".</Paragraph>
    <Paragraph position="10"> An earlier version of EasyEnglish, written in Prolog, included a pronoun resolution module, RAP (Lappin and McCord 1990a,b; Lappin and Leass 1994). This module, originally written for use with LMT, was modified slightly to point out ambiguous pronominal references. It has not yet been included in the C version of EasyEnglish, and we give here an example of its use produced by the Prolog version. The example is taken from (Lederer 1989): "Guilt, vengeance, and bitterness can be emotionally destructive to you and your children. You must get rid of them." This generates the following message: Ambiguous pronoun reference: "them".</Paragraph>
  </Section>
  <Section position="6" start_page="161" end_page="161" type="metho">
    <SectionTitle>
4 Vocabulary Functions
</SectionTitle>
    <Paragraph position="0"> EasyEnglish comes with a built-in general English dictionary of about 80,000 words. In addition, EasyEnglish has a flexible system for using dictionaries as it does its analysis. Users can specify in a user profile which dictionaries they want to call up.</Paragraph>
    <Paragraph position="1"> The specification can include any number of term dictionaries, any number of abbreviation dictionaries, any number of non-allowed word dictionaries, and any number of controlled vocabulary dictionaries. There are EasyEnglish commands for compiling a user-maintainable format of these different kinds of dictionaries into efficiently usable forms, and for creating abbreviation dictionaries from terminology dictionaries in maintenance form.</Paragraph>
    <Paragraph position="2"> The dictionaries support three different types of vocabulary checks. The first vocabulary check looks for restricted words, i.e. words that the writer either should never use, or that the writer should only use as certain parts-of-speech. The user may specify these words in a specific user dictionary along with preferred alternatives. In addition, this category includes slang words, a list of which is system-supplied. The second type of vocabulary check identifies acronyms or abbreviations in the text and checks to see that the first occurrence is properly spelled out according to the definition supplied in the user dictionary for acronyms. The third check gives the user the option to specify a controlled vocabulary; all words that are not in the controlled-vocabulary file or that are improperly used with respect to part-of-speech will be flagged, should the user decide to turn this check on. User dictionaries for restricted words, acronyms, and controlled vocabulary have been built for the IDWB for certain domains.</Paragraph>
    <Paragraph position="3"> The vocabulary checks rely on two things: the parser and user dictionaries. It is crucial to be able to determine the applicable part of speech with accuracy. Take for example the word "beef". If this is used as a verb ("they beef a lot"), it should be flagged as slang; on the other hand, if it is used as a noun ("he ate beef"), it should not be flagged. A full parse helps decide on this.</Paragraph>
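The part-of-speech-sensitive restricted-word check can be illustrated with a small sketch. It assumes that each word arrives already tagged with its part of speech (in EasyEnglish this would come from the full parse); the dictionary layout, tag names, and message wording are hypothetical.

```python
# Hypothetical restricted-word dictionary keyed on (word, part of speech).
RESTRICTED = {
    ("beef", "verb"): "slang",
}


def check_restricted(tagged_words, restricted=RESTRICTED):
    """Return one message per word that is restricted in its tagged part of speech."""
    messages = []
    for word, pos in tagged_words:
        reason = restricted.get((word.lower(), pos))
        if reason is not None:
            messages.append(f'Restricted word ({reason}): "{word}" used as {pos}')
    return messages


# "they beef a lot": the verb use is flagged.
print(check_restricted([("they", "pronoun"), ("beef", "verb"),
                        ("a", "determiner"), ("lot", "noun")]))
# "he ate beef": the noun use is not flagged.
print(check_restricted([("he", "pronoun"), ("ate", "verb"), ("beef", "noun")]))
```

Keying the lookup on the word and its part of speech together is what lets the same spelling be acceptable in one use and flagged in another.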
    <Paragraph position="4"> User dictionaries may be built with the help of the separate terms module, ETerms, which is run independently of EasyEnglish. ETerms identifies candidates for new terms by looking for words not found in any of the dictionaries 6 as well as multinoun terms.</Paragraph>
    <Paragraph position="5"> The output from ETerms is very accurate due to the use of full ESG parsing. For each term, the frequency is stated, and the user has the choice between having the terms sorted either in frequency order or alphabetical order. The terms file has a format that is directly usable as a user dictionary; however, to keep terminology consistent and remove misspellings, it is necessary that a terminologist approve the content before actual use.</Paragraph>
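The unknown-word part of term extraction, with the two sort orders, might look like the sketch below. It is an assumption-laden illustration: the function names and sample words are invented, and the multi-noun terms that ETerms also finds are not covered, since identifying them needs the parse.

```python
from collections import Counter


def candidate_terms(words, known_words):
    """Count words that appear in none of the known dictionaries."""
    known = {w.lower() for w in known_words}
    return Counter(w.lower() for w in words if w.lower() not in known)


def sort_terms(counts, order="frequency"):
    """Return (term, frequency) pairs in frequency or alphabetical order."""
    if order == "frequency":
        # Most frequent first; ties are broken alphabetically.
        return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if order == "alphabetical":
        return sorted(counts.items())
    raise ValueError(f"unknown sort order: {order}")


words = ["the", "spoolfile", "the", "journal", "spoolfile", "receiver", "spoolfile"]
counts = candidate_terms(words, known_words=["the", "journal"])
print(sort_terms(counts, "frequency"))     # [('spoolfile', 3), ('receiver', 1)]
print(sort_terms(counts, "alphabetical"))  # [('receiver', 1), ('spoolfile', 3)]
```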
    <Paragraph position="6"> The terms file may also be sent to the IBM translation centers at an early stage. This speeds up the task of translation considerably, since their terminologists can decide on the proper translations before the translators actually start the translation process.</Paragraph>
    <Paragraph position="7"> This list is also a good start on an online bilingual dictionary for an MT system.</Paragraph>
  </Section>
  <Section position="7" start_page="161" end_page="162" type="metho">
    <SectionTitle>
5 Standard Grammar Checking
</SectionTitle>
    <Paragraph position="0"> In addition to spotting ambiguity and providing terminological support, EasyEnglish also performs more traditional grammar checking. It is a delicate balance to process text that has grammatical errors; the parser needs to be able to make reasonably good sense of the text in order for the checking component not to overflag problems. The grammatical checks fall into three different categories, which we will treat separately: Syntactic problems, lexical problems, and punctuation problems.</Paragraph>
    <Section position="1" start_page="161" end_page="162" type="sub_section">
      <SectionTitle>
5.1 Syntactic problems
</SectionTitle>
      <Paragraph position="0"> This category is obviously the category most sensitive to parsing problems. However, we have found that a number of checks can be implemented successfully, including, but not limited to, checks for lack of parallelism in coordination and in list elements, passives, double negatives, long sentences, incomplete sentences, wrong pronoun case, and long noun strings.</Paragraph>
      <Paragraph position="1"> To illustrate the function of these checks, let us look at the checks for passives. When a passive construction is encountered, an active transformation provides the desired suggested rephrasing, provided the logical subject is available. If the logical subject is not available, the passive is pointed out, but no rephrasing is offered. Some standard grammar checkers insist on supplying an active rephrasing even in this case, and they do that by introducing a fake subject "I", "they", or "he". In our view, this rarely provides a reasonable rephrasing.</Paragraph>
      <Paragraph position="2"> The following sentence from an IBM manual illustrates both cases: "The format is defined in the file which was not included by the header file." This sentence generates two messages for the passives, one without a rephrasing, and one with a rephrasing: Passive construction: "is defined in the file which was not included by the header file".</Paragraph>
      <Paragraph position="3"> Passive construction: "was not included by the header file". Possible rephrasing: "which the header file did not include" The parse supplies the information necessary to decide on the correct word order and tense used in the rephrasing.</Paragraph>
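The policy for passives (offer an active rephrasing only when the logical subject is available) can be sketched as follows. The record type and its fields are hypothetical stand-ins for information that the parse would supply, including the already tensed active verb group.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Passive:
    """Hypothetical summary of a passive construction found in a parse."""
    text: str                              # the passive phrase as written
    active_verb: str                       # active verb group, already tensed
    logical_subject: Optional[str] = None  # None when no by-phrase is present
    prefix: str = ""                       # e.g. a relative pronoun kept in front


def passive_message(p: Passive) -> str:
    message = f'Passive construction: "{p.text}".'
    if p.logical_subject is None:
        # No logical subject: point out the passive, offer no rephrasing.
        return message
    parts = [p.prefix, p.logical_subject, p.active_verb]
    rephrasing = " ".join(part for part in parts if part)
    return f'{message} Possible rephrasing: "{rephrasing}"'


outer = Passive(
    text="is defined in the file which was not included by the header file",
    active_verb="defines",
)
inner = Passive(
    text="was not included by the header file",
    active_verb="did not include",
    logical_subject="the header file",
    prefix="which",
)
print(passive_message(outer))
print(passive_message(inner))
```

No fake subject is ever invented here: when the logical subject is missing, the function simply returns the bare message.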
      <Paragraph position="4"> In the case of a double passive, there is the additional problem of ambiguity, as illustrated by the following example from (Lederer 1989): "Two cars were reported stolen by the Groveton police yesterday." 7 Ambiguous passive construction. Is the subject of "stolen": "the Groveton police"? In contrast to this group of syntactic problems, a check for subject-verb agreement is much harder to implement reliably. This is due to the ambiguity of part-of-speech that is so prevalent in English. Many verbs can also be nouns and vice versa. When there is then a mistake in subject-verb agreement, it becomes very hard to produce a reliable parse. (We are assuming a strictly syntactic approach.) Standard grammar checkers seem to have even worse problems with this check (on the order of a precision of less than 10 percent).</Paragraph>
    </Section>
    <Section position="2" start_page="162" end_page="162" type="sub_section">
      <SectionTitle>
5.2 Lexical Problems
</SectionTitle>
      <Paragraph position="0"> Lexical problems, on the other hand, are not very much affected by bad parses and can be spotted with a high degree of reliability. These include misspelled or unknown words, duplicated words, and the like.</Paragraph>
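A duplicated-word check illustrates why lexical problems need no parse: comparing each word with its predecessor is enough. This is a minimal sketch with a hypothetical function name and a deliberately simple tokenizer; it would also flag legitimate repetitions such as "had had".

```python
import re


def duplicated_words(text: str) -> list[str]:
    """Return each word that immediately repeats the word before it."""
    words = re.findall(r"[A-Za-z']+", text)
    return [cur for prev, cur in zip(words, words[1:])
            if prev.lower() == cur.lower()]


print(duplicated_words("The the format is is defined."))  # ['the', 'is']
```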
    </Section>
    <Section position="3" start_page="162" end_page="162" type="sub_section">
      <SectionTitle>
5.3 Punctuation
</SectionTitle>
      <Paragraph position="0"> Using a full parse, EasyEnglish is able to spot a variety of punctuation errors, including, but not limited to, missing commas in conjoined clauses and noun phrases, comma splices, missing hyphens, missing punctuation at the end of a segment, and questions with a final period instead of a question mark.</Paragraph>
      <Paragraph position="1"> 7 This sentence is actually ambiguous in many ways; here, we shall not address the other ambiguities.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="162" end_page="162" type="metho">
    <SectionTitle>
6 The Use of Formatting Tags
</SectionTitle>
    <Paragraph position="0"> EasyEnglish works with SGML, Bookmaster, or IPF formats as well as with plain text. Dealing with formatting tags is a necessary, but rather complex, task, which is often underestimated (as pointed out by Clémencin (1996)). But the trouble of building a good tag-handling system is well rewarded. Formatting tags are of great help in the segmentation process and may be enlisted for identifying conditions such as missing periods (or other sentence delimiters) and lack of parallelism in lists, both of which are handled by EasyEnglish. It is also useful to be able to identify tables and displays, thereby allowing differential treatment of them. Furthermore, it can be helpful for the parser to take the tags into account, especially quote and highlighting tags, which may delimit complete phrases; header tags can influence the parser to prefer noun phrase analyses over sentence analyses.</Paragraph>
    <Paragraph position="1"> Another, very important, use of formatting tags is checking of revised text only. The so-called revision tags indicate revisions to earlier versions of the document. Being able to properly identify revised parts means that the user can elect to check only revised parts. This is a great time saver, considering the extensive use of previously written documents in a technical environment (Means and Godden 1996).</Paragraph>
  </Section>
</Paper>