<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0815"> <Title>Evaluating text quality: judging output texts without a clear source</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Evaluating intelligibility </SectionTitle> <Paragraph position="0"> The use of rating scales to assess the intelligibility of MT output has been widespread since the early days of the field. Typically, monolingual raters assign a score to each sentence in the output text. However, this does not amount to an agreed methodology, since the number of points on the scale and their definitions have varied considerably. For example, Carroll (1966) used a nine-point scale where point 1 was defined as &quot;hopelessly unintelligible&quot; and point 9 as &quot;perfectly clear and intelligible&quot;; Nagao and colleagues (Nagao et al., 1985), in contrast, used a five-point scale, while Arnold and his colleagues (Arnold et al., 1994) suggested a four-point discrimination. In evaluating the intelligibility of the AGILE output, we asked professional translators and authors who were native speakers of the languages concerned--Bulgarian, Czech and Russian--to score individual text fragments on a four-point scale. The evaluators were also asked to give a summative assessment of the output's suitability as the first draft of a manual.</Paragraph> <Paragraph position="1"> In a single pass, AGILE is capable of generating several types of text, each constituting a section of a typical software user manual--i.e., overview, short instructions, full instructions, and functional descriptions--and appearing in one of two styles (personal/direct or impersonal/indirect). We evaluated all of these text types using the same method. The intelligibility evaluation was complemented by an assessment of the grammaticality of the output, conducted by independent native speakers trained in linguistics. Following an approach widely used in MT (e.g., Lehrberger and Bourbeau, 1987), the judges referred to a list of error categories for their annotations.</Paragraph> </Section> <Section position="5" start_page="0" end_page="3" type="metho"> <SectionTitle> 3 Evaluating fidelity </SectionTitle> <Paragraph position="0"> In MT, evaluating fidelity (or &quot;accuracy&quot;) entails a judgment about the extent to which two texts &quot;say the same thing&quot;. Usually, the two texts in question are the source (i.e., original) text and the (machine-)translated text, and the judges are expert translators who are again invited to rate the relative information content of pairs of sentences on an anchored scale (e.g., Nagao et al., 1985). But others (e.g., Carroll, 1966) have also compared the informativeness of the machine translation and a human translation deemed to serve as a benchmark.</Paragraph> <Paragraph position="1"> Interestingly, both of these researchers found a high correlation between the intelligibility evaluations and the fidelity evaluations, which suggests that it may be possible to infer fidelity from the (less costly) evaluation of intelligibility. However, at the current state of the art, this approach is not guaranteed to detect cases where the translation is perfectly fluent but also quite wrong.</Paragraph> <Paragraph position="2"> For NLG, the story is rather different.</Paragraph> <Paragraph position="3"> Lacking a source text, we are denied the relatively straightforward approach of detecting discrepancies between artifacts of the same type: texts. 
The question is, instead, whether the generated text &quot;says the same thing&quot; as the message -- i.e., the model of the intended semantic content together with the pragmatic force of the utterance.</Paragraph> <Paragraph position="4"> The message is clearly only available through an external representation. In translation generally, this external representation is the source text, and the task is commonly characterized as identifying the message--which originates in the writer's mental model--in order to re-express it in the target language. In an NLG system, the one external representation that is commonly available is the particular domain model that serves as input to the generation system. This model may have been provided directly by an artificial agent, such as an expert system. Alternatively, it may have been constructed by a human agent as the intended instantiation of their mental model.</Paragraph> <Paragraph position="5"> Yet, whatever its origins, directly comparing this intermediate representation to the output text is problematic.</Paragraph> <Paragraph position="6"> A recent survey of complete NLG systems (Cahill et al., 1999) found that half of the 18 systems examined accepted input directly from another system. (By complete systems, we refer to systems that determine both &quot;what to say&quot; and &quot;how to say it&quot;, taking as input a specification that is not a hand-crafted simulation of some intermediate representation.) A typical example is the Caption Generation System (Mittal et al., 1998), which produces paragraph-sized captions to accompany the complex graphics generated by SAGE (Roth et al., 1994). The input to generation includes definitions of the graphical constituents that are used by SAGE to convey information: &quot;spaces (e.g., charts, maps, tables), graphemes (e.g., labels, marks, bars), their properties (e.g., color, shape) and encoders--the frames of reference that enable their properties to be interpreted/translated back to data values (e.g., axes, graphical keys)&quot; (Mittal et al., 1998, pg. 438). For obvious reasons, this does not readily lend itself to direct comparison with the generated text caption.</Paragraph> <Paragraph position="7"> In the remaining half of the systems covered, the domain model is constructed by the user (usually a domain expert) through a technique that has come to be known as symbolic authoring (see Scott, Power and Evans, 1998): the 'author' uses a specially-built knowledge editor to construct the symbolic source of the target text. These editors are interfaces that allow authors to build the domain model using a representation that is more 'natural' to them than the artificial language of the knowledge base.</Paragraph> <Paragraph position="8"> The purpose of these representations is to provide feedback intended to make the content of the domain model more available to casual inspection than the knowledge representation language of the domain model. As such, they are obvious candidates as the standard against which to measure the content of the texts that are generated from them.</Paragraph> <Paragraph position="9"> We first consider the case of feedback presented in graphical mode, and then the option of textual feedback, using the WYSIWYM technology (Power and Scott, 1998; Scott, Power and Evans, 1998). 
We go on to make recommendations concerning the desirable properties of the feedback text.</Paragraph> </Section> <Section position="6" start_page="3" end_page="6" type="metho"> <SectionTitle> 4 Graphical representations of content </SectionTitle> <Paragraph position="0"> Symbolic authoring systems typically make use of graphical representations of the content of the domain model--for example, conceptual graphs (Caldwell and Korelsky, 1994). Once trained in the language of the interface, the domain specialist uses standard text-editing devices such as menu selection and navigation with a cursor, together with standard text-editing actions (e.g., select, copy, paste, delete), to create and edit the content specification of the text to be generated in one or several selected languages.</Paragraph> <Paragraph position="1"> The user of AGILE, conceived to be a specialist in the domain of the particular software for which the manual is required (i.e., CAD/CAM), models the procedures for using the software. AGILE's graphical user interface (Hartley, Power et al., 2000) closely resembles the interface that was developed for an earlier system, DRAFTER, which generates software manuals in English and French (Paris et al., 1995). The design of the interface represents the components of the procedures (e.g., goals, methods, preconditions, sub-steps, side-effects) as differently coloured boxes. The user builds a model of the procedures for using the software by constructing a series of nested boxes and assigning labels to them via menus that enable the selection of concepts from the underlying domain ontology.</Paragraph> <Section position="1" start_page="3" end_page="6" type="sub_section"> <SectionTitle> 4.1 The input specification for the user </SectionTitle> <Paragraph position="0"> As part of our evaluation of AGILE, we asked 18 IT professionals to construct a number of predetermined content models of various degrees of complexity and to have the system generate text from them in specified styles in their native language. (There were six evaluators for each of the three Eastern European languages; all had some, albeit limited, experience of CAD/CAM systems and were fluent speakers of English.) Since the evaluation was not conducted in situ with real CAD/CAM system designers creating real draft manuals, we needed to find a way to describe to the evaluators what domain models we wanted them to build. The possible options included giving them a copy of one of the following: * the desired model as it would appear to them in the interface (e.g., Figure 1); * the target text that would be produced from the model (e.g., Figure 2); * a 'pseudo-text' that described the model in a form of English that was closer to the language of the AGILE interface than to fluent English (e.g., Figure 3).</Paragraph> <Paragraph position="1"> (Figure 2: the target text produced from the model -- &quot;Draw a line by specifying its start and end points. To draw a line: Specify the start point of the line. Specify the end point of the line.&quot;)</Paragraph>
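<Paragraph> As a purely illustrative sketch (this is not AGILE code, and the class and attribute names below are our own invention), the kind of nested procedure model the evaluators were asked to build--goals, methods, preconditions, sub-steps and side-effects, with labels drawn from a domain ontology--can be pictured as a small tree of typed objects. The instance shown encodes the 'draw a line' example from the figures above.</Paragraph> <Paragraph>
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    action: str                            # label chosen from the domain ontology
    sub_steps: List["Step"] = field(default_factory=list)

@dataclass
class Method:
    precondition: Optional[str] = None     # optional constraint on applying the method
    steps: List[Step] = field(default_factory=list)
    side_effects: List[str] = field(default_factory=list)

@dataclass
class Procedure:
    goal: str                              # what the user wants to achieve
    methods: List[Method] = field(default_factory=list)

# The 'draw a line' model, expressed as nested objects rather than nested coloured boxes.
draw_line = Procedure(
    goal="draw a line",
    methods=[Method(steps=[Step("specify the start point of the line"),
                           Step("specify the end point of the line")])],
)
</Paragraph> <Paragraph> A symbolic authoring interface exposes exactly this structure, but through boxes, menus and anchors rather than through a programming notation.</Paragraph>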
<Paragraph position="2"> We rejected the first option because it amounted to a task of replication which could be accomplished successfully even without users having any real understanding of the meaning of the model they were building. Therefore, it would shed no light on how users might be able to build a graphical model externalising their own mental model.</Paragraph> <Paragraph position="3"> We discarded the second because a text may not necessarily make any explicit linguistic distinction between different components of the model--for example, between a precondition on a method and the first step in a method consisting of several steps (compare &quot;To cook a goose: Before starting, ensure the goose has been plucked. Put the goose in a medium oven for 1.5 hours.&quot; with &quot;To cook a goose: First pluck the goose. Then put it in a medium oven for 1.5 hours.&quot;). Thus, in general, target texts may not reflect every distinction available in the underlying domain model (without this necessarily causing any confusion in the mind of the reader). As a result of such underspecification, they are ill-suited to serving as a starting point from which a symbolic author could build a formal model.</Paragraph> <Paragraph position="4"> We opted, then, for providing our evaluators with a pseudo-text in which there was an explicit and regular relationship between the components of the procedures and their textual expression. Figure 4 is one of the texts used in the evaluation.</Paragraph> <Paragraph position="5"> (Figure 4: fragment of a typical pseudo-text -- &quot;Draw an arc. First, start-tool the ARC command. M1. Using the Windows operating system: choose the 3 Points option from the Arc flyout on the Draw toolbar. M2. Using the DOS or UNIX operating system: choose the Arc option from the Draw menu; choose 3 Points option. Specify the start point of the arc.&quot;)</Paragraph> </Section> <Section position="2" start_page="3" end_page="6" type="sub_section"> <SectionTitle> 4.2 Evaluating the fidelity of the output </SectionTitle> <Paragraph position="0"> Our particular set-up afforded us the possibility of judging the fidelity of the 'translation' between the following representations: a) the desired model and the model produced; b) the model produced and the output text; c) the pseudo-text and the model produced; d) the pseudo-text and the output text.</Paragraph> <Paragraph position="1"> We focused on (a), which was of course mediated by (c); that is, we focused on the issue of creating an accurate model. This is an easier issue than that of the fidelity of the output text to the model (b), while the representations in (d) are too remote from one another to permit useful comparison.</Paragraph> <Paragraph position="2"> To measure the correspondence between the actual models and the desired/target models, we adopted the Generation String Accuracy (GSA) metric (Bangalore, Rambow and Whittaker, 2000; Bangalore and Rambow, 2000) used in evaluating the output of an NLG system. It extends the simple Word Accuracy metric suggested in the MT literature (Alshawi et al., 1998), based on the string edit distance between some reference text and the output of the system. As it stands, that metric fails to account for some of the special properties of the text generation task, which involves ordering word tokens; corrections may therefore involve re-ordering tokens. In order not to penalise a misplaced constituent twice--as both a deletion and an insertion--the generation accuracy metric treats the deletion (D) of a token from one location and its insertion (I) at another location as a single movement (M). The remaining deletions, insertions, and substitutions (S) are counted separately. Generation accuracy is then given by GSA = 1 - (M + I + D + S) / R, where R is the number of (word) tokens in the reference.</Paragraph>
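<Paragraph> To make the computation concrete, the following Python sketch implements the metric as just defined. It is an illustration rather than the code used in the AGILE evaluation: the greedy pairing of a deleted token with an inserted token of the same value into a single movement is one simple realisation of the M count, and the function and variable names are invented for this example. Since the metric only compares token sequences, the tokens can just as well be semantic entities of a domain model as words.</Paragraph> <Paragraph>
from collections import Counter

def edit_ops(reference, produced):
    """One minimal-cost Levenshtein alignment between two token sequences.
    Returns (substitutions, deleted_tokens, inserted_tokens)."""
    n, m = len(reference), len(produced)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if reference[i - 1] == produced[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub_cost,   # match / substitute
                             dist[i - 1][j] + 1,              # delete
                             dist[i][j - 1] + 1)              # insert
    # Trace back one optimal alignment, collecting the edit operations.
    subs, deleted, inserted = 0, [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub_cost = 0 if (i > 0 and j > 0 and reference[i - 1] == produced[j - 1]) else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + sub_cost:
            subs += sub_cost
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            deleted.append(reference[i - 1])
            i -= 1
        else:
            inserted.append(produced[j - 1])
            j -= 1
    return subs, deleted, inserted

def generation_string_accuracy(reference, produced):
    """GSA = 1 - (M + I + D + S) / R, where a paired deletion and insertion
    of the same token count as a single movement M."""
    subs, deleted, inserted = edit_ops(reference, produced)
    del_counts, ins_counts = Counter(deleted), Counter(inserted)
    moves = sum(min(del_counts[t], ins_counts[t]) for t in del_counts)
    dels, ins = len(deleted) - moves, len(inserted) - moves
    return 1.0 - (moves + subs + dels + ins) / len(reference)

# Toy example: a constituent moved from the front to the end of the sequence
# costs one movement, not a deletion plus an insertion.
ref = ["specify", "the", "start", "point", "of", "the", "line"]
out = ["the", "start", "point", "of", "the", "line", "specify"]
print(generation_string_accuracy(ref, out))   # 1 - 1/7, roughly 0.857
</Paragraph>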
<Paragraph position="3"> For Bangalore and his colleagues, the reference text is the desired text; it is a gold standard given a priori by a corpus representing the target output of the system. The generation accuracy of a string from the actual output of the system is computed on the basis of the number of movements, substitutions, deletions and insertions required to edit the string into the desired form.</Paragraph> <Paragraph position="4"> In our case, the correspondence was measured between models rather than texts, but we found the metric 'portable': the tokens are no longer textual strings but semantic entities. Although this method provided a useful quantitative measure of the closeness of the fit of the actual generated text to what was intended, it is not without problems, some of which apply irrespective of whether the metric is applied to texts or to semantic models. For example, it does not capture qualitative differences between the generated object and the reference object; that is, it does not distinguish trivial from serious mistakes. Thus, representing an action as the first step in a procedure rather than as a precondition would have less impact on the end-reader's ability to follow the instructions than would representing a goal as a side-effect.</Paragraph> </Section> </Section> <Section position="7" start_page="6" end_page="8" type="metho"> <SectionTitle> 5 Textual representations of content </SectionTitle> <Paragraph position="0"> Once the model they represent becomes moderately complex, graphical representations prove to be difficult to interpret and unwieldy to visualise and manipulate (Kim, 1990; Petre, 1995). WYSIWYM offers an alternative, textual modality of feedback, which is more intuitive and natural. As we will discuss below, there is a sense in which, in its current form, the feedback text may be too natural.</Paragraph> <Section position="1" start_page="6" end_page="7" type="sub_section"> <SectionTitle> 5.1 Current status of WYSIWYM feedback text </SectionTitle> <Paragraph position="0"> The main purpose of the text generated in feedback mode, as currently conceived, is to show the symbolic author the possibilities for further expanding the model under development.</Paragraph> <Paragraph position="1"> As with AGILE's box representation, clicking on a coloured 'anchor' brings up a menu of legitimate fillers for that particular slot in the content representation. Instantiating green anchors is optional, but all red anchors must be instantiated for a model to be potentially complete (Figure 5). Once this is the case, authors tend to switch to output mode, which produces a natural text reflecting the specified model and nothing else. (See Hartley et al., 2000, for further discussion of this issue and the results of the AGILE evaluation.)</Paragraph> <Paragraph position="2"> (Figure 5: successive states of a WYSIWYM feedback text, with red anchors marking obligatory slots and green anchors optional ones -- &quot;1. Do <red>this action</red> by using <green>this method</green>. 2. Schedule <red>this event</red> by using <green>this method</green>. 3. Schedule the appointment by using ...&quot;)</Paragraph>
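<Paragraph> By way of illustration only (this is not WYSIWYM's actual implementation, and every name below is invented for the example), the anchor behaviour just described can be captured by a very small model: a feedback template whose unfilled slots render as coloured anchors, and a completeness test that succeeds once every required (red) slot has a filler. The example sentences mirror the successive states shown for Figure 5.</Paragraph> <Paragraph>
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Slot:
    label: str                  # phrase shown at the anchor, e.g. "this method"
    required: bool              # True = red anchor, False = green anchor
    filler: Optional[str] = None

@dataclass
class FeedbackNode:
    template: str               # feedback-text template with {slot} holes
    slots: Dict[str, Slot] = field(default_factory=dict)

    def feedback_text(self) -> str:
        # Unfilled slots surface as anchors the author can click to get a menu of fillers.
        rendered = {name: slot.filler if slot.filler
                    else "[{}: {}]".format("RED" if slot.required else "GREEN", slot.label)
                    for name, slot in self.slots.items()}
        return self.template.format(**rendered)

    def potentially_complete(self) -> bool:
        # Output mode only makes sense once all red (obligatory) anchors are instantiated.
        return all(slot.filler is not None for slot in self.slots.values() if slot.required)

node = FeedbackNode("Schedule {event} by using {method}.",
                    {"event": Slot("this event", required=True),
                     "method": Slot("this method", required=False)})
print(node.feedback_text())          # Schedule [RED: this event] by using [GREEN: this method].
node.slots["event"].filler = "the appointment"
print(node.potentially_complete())   # True: the only red anchor is now filled
</Paragraph>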
<Paragraph position="3"> In WYSIWYM systems the same generator is used to produce both the feedback and output texts; this means that the feedback text can be as fluent as the output text. In its current instantiations, this is precisely what is produced, even when the generator is capable of producing texts of rather different styles for the different purposes (as, for example, in the ICONOCLAST system; see http://www.itri.bton.ac.uk/projects/iconoclast).</Paragraph> </Section> <Section position="2" start_page="7" end_page="8" type="sub_section"> <SectionTitle> 5.2 Feedback in a controlled language </SectionTitle> <Paragraph position="0"> The motivation for generating a new type of feedback text comes from two sources.</Paragraph> <Paragraph position="1"> The first is the pseudo-texts that we constructed by hand for the AGILE evaluation. As far as the form of the models actually constructed is concerned, they proved consistently reliable guides for the symbolic authors. Where they proved inadequate was in their identification of multiple references to the same domain model entity; several authors tended to create multiple instances of an entity rather than multiple pointers to a single instance. Let us now turn from the testing scenario, where authors have a defined target to hit, and consider instead a production setting where the author is seeking to record a mental model. It is a simple matter to have the system generate a second feedback text, complementing the present one, this time in the style of the pseudo-texts, for the purpose of describing unambiguously, if rebarbatively, the state of a potentially complete model (modulo the reference problems, for which a solution is indicated below).</Paragraph> <Paragraph position="2"> The second is Attempto Controlled English (ACE: Fuchs and Schwitter, 1996; Fuchs, Schwertel and Schwitter, 1999), which allows domain specialists to interactively formulate software requirements specifications. The specialists are required to learn a number of compositional rules which they must then apply when writing their specifications. These are parsed by the system.</Paragraph> <Paragraph position="3"> For all sentences that it accepts, the system creates a paraphrase (Figure 6) that indicates its interpretations by means of brackets. These interpretations concern phenomena like anaphoric reference, conjunction and disjunction, attachment of prepositional phrases, relative clauses and quantifier scope.</Paragraph> <Paragraph position="4"> The principle of making interpretations explicit appears to be a good one in the NLG context too, especially for the person constructing the domain model. Moreover, in the context where the output text is required to be in a controlled language, the use of WYSIWYM relieves the symbolic author of the burden of learning the specialized writing rules of the given controlled language.</Paragraph> <Paragraph position="5"> Optimising the formulation of the controlled-language feedback is a matter of iteratively revising it via the testing scenario, using GSA as the metric, until authors consistently achieve total fidelity between the models they construct and the reference models.</Paragraph> </Section> </Section> </Paper>