File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-1419_metho.xml
Size: 7,352 bytes
Last Modified: 2025-10-06 14:10:43
<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1419">
<Title>Evaluations of NLG Systems: common corpus and tasks or common dimensions and metrics?</Title>
<Section position="5" start_page="0" end_page="127" type="metho">
<SectionTitle> 2 Enlarging the view of system-oriented evaluations </SectionTitle>
<Paragraph position="0"> The comparison of NLG systems should not be limited to a particular task in a specific context. Most systems are designed for specific applications in specific domains and tend to be tuned for these applications. Evaluating them in the context of a specific common evaluation task might de-contextualise them and might encourage fine-tuning for this task, which might not be useful in general. Furthermore, the evaluation of a system should not be limited to its performance in a specific context but should address characteristics such as:
* Cost of building (time and effort);
* Ease of extension, maintainability and customisability to handle new requirements (time, effort and expertise required);
* Cost of porting to a new domain or application (time, effort and expertise required);
* Cost of data capture if required (how expensive, expertise required);
* Coverage issues (users, tasks, dimensions of context); and
* Ease of integration with other software.</Paragraph>
<Paragraph position="1"> These dimensions are important if we want the technology to be adopted and if we want potential users of the technology to be able to make an informed choice as to which approach to choose when.</Paragraph>
<Paragraph position="2"> Most NLG systems are built around a specific application. Using them in the context of a different application or domain might be difficult. While one can argue that basic techniques do not differ from one application to another, the cost of the modifications required and the expertise and skills needed may not be worth the trouble. It may simply be cheaper and more convenient to rebuild everything. However, firstly, this might not be an option, and, secondly, it may increase the cost of using an NLG approach to such an extent as to make it unaffordable. In addition, applications evolve over time and often require quick deployment. It is thus increasingly desirable to be able to change (update) an application, enabling it to respond appropriately to the new situations it must now handle: this may require the ability to handle new situations (e.g., generate new texts) or the ability to respond differently than originally envisaged to known situations. This is important for at least two reasons:
(1) We are designers, not domain experts.</Paragraph>
<Paragraph position="3"> Although we usually carry out a domain/corpus/task analysis beforehand to acquire the domain knowledge and understand the users' needs in terms of the text to be generated, it is almost impossible to become a domain expert and know what is most appropriate in each situation. Thus, the design of a specific application should allow the experts to take control and ensure that the application is configured appropriately. This imposes the additional constraint that an application should be maintainable directly by a requirements specialist, an author, an expert or, potentially, the reader/listener;
(2) Situations are dynamic - what is satisfactory today may be unsatisfactory tomorrow.
We must be prepared to take on board new requirements as they come in.</Paragraph>
<Paragraph position="4"> These requirements, of course, come at a cost.</Paragraph>
<Paragraph position="5"> With this in mind, then, we believe that there is another side to system-oriented evaluation which we, as designers of NLG systems, need to consider: the ease or cost of developing flexible applications that can be easily configured and maintained to meet changing requirements. As a start towards this goal, we looked more closely at one of the characteristics mentioned above, the cost of maintaining and extending an application, trying to understand what we should take into account when evaluating a system along that dimension. We believe asking the following questions might be useful. When there are new requirements:
(1) What changes are needed, and do the modifications require the development of new resources, the implementation of additional functionality in the underlying architecture, or both?
(2) Who can do it, and what is the expertise required? - NLG systems are now quite complex and require a great deal of expertise, which may be shared among several individuals (e.g., software engineering, computational linguistics, domain expertise, etc.).
(3) How hard is it? - How much effort and time would be required to modify/update the system to meet the new requirements?
In asking these questions, we believe it is also useful to decouple a specific system from its underlying architecture, and to ask the appropriate questions of both.</Paragraph>
</Section>
<Section position="6" start_page="127" end_page="128" type="metho">
<SectionTitle> 3 Usability Evaluations of NLG Systems </SectionTitle>
<Paragraph position="0"> When talking about the evaluation of NLG systems, we should also remember that usability evaluations are crucial, as they can confirm the usefulness of a system for its purpose and look at the impact of the generated text on its intended audience. There has been an increasing number of such evaluations - e.g., (Reiter et al., 2001; Paris et al., 2001; Colineau et al., 2002; Kushniruk et al., 2002; Elhadad et al., 2005) - and we should continue to encourage them, as well as develop and share methodologies (and pitfalls) for performing these evaluations. It is interesting, in fact, to note that communities that have emphasized common task and corpus evaluations, such as the IR community, are now turning their attention to stakeholder-based evaluations such as task-based evaluations. In looking at ways to evaluate NLG systems, we might again enlarge our view beyond reader/listener-oriented usability evaluations, as readers are not the only people potentially affected by our technology. When doing our evaluations, then, we must also consider other parties. Considering NLG systems as information systems, we might consider the following stakeholders beyond the reader:
* The creators of the information: for some applications, this may refer to the person creating the resources or the information required for the NLG system. This might be, for example, the people writing the fragments of text that will later be assembled automatically. Or it might include the person who will author the discourse rules or the templates required. With respect to these people, we might ask questions such as: "Does employing this NLG system/approach save them time?", "Is it easy for them to update the information?"
* The "owners" of the information.
We refer here to the organisation choosing to employ an NLG system. Possible questions here might be: "Does the automatically generated text achieve its purpose with respect to the organisation?", "Can the organisation convey similar messages with the automated system?" (e.g., branding issues).</Paragraph>
</Section>
</Paper>