<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1421"> <Title>Shared-task Evaluations in HLT: Lessons for NLG</Title> <Section position="3" start_page="0" end_page="133" type="metho"> <SectionTitle> 2 Shared-task evaluation in HLT </SectionTitle>
<Paragraph position="0"> Over the past twenty years, virtually every field of research in human language technology (HLT) has introduced STECs. A small selection is presented in the table below (apologies for omissions, and for a bias towards English). NLG researchers have tended to be somewhat unconvinced of the benefits of comparative evaluation in general, and of the kind of competitive, numbers-driven STECs that have been typical of NLU in particular. Yet STECs do not have to be hugely competitive events fixated on a single task, with associated input/output data and a single evaluation metric, static over time.</Paragraph>
<Paragraph position="1"> Tasks: There is a distinction between (i) evaluations designed to help potential users decide whether the technology will be valuable to them, and (ii) evaluations designed to help system developers improve the core technology (Spärck Jones and Galliers, 1996). In the former, the application context is a critical variable in the task definition; in the latter it is fixed. Developer-oriented evaluation promotes focus on the task in isolation, but if the context is fixed badly, or if the outside world changes but the evaluation does not, then it becomes irrelevant. NLP STECs have so far focused on developer-oriented evaluation, but there are increasing calls for more 'embedded', more task-based types of evaluation.</Paragraph>
<Paragraph position="2"> Existing NLP STECs show that tasks need to be broadly based and continuously evolving. To begin with, the task needs to be simple, easy to understand and easy for people to recognise as their task. Over time, as the limitations of the simple task are noted and a more substantial community is 'on board', tasks can multiply, diversify and become more sophisticated. This is something that TREC has been good at (still going strong 14 years on), and that the parsing community has failed to achieve (see the notes to the table).</Paragraph>
<Paragraph position="3"> Evaluation: NLP STECs have tended to use automatic evaluations because of their speed and reproducibility, but some have used human evaluators, in particular in fields where language is generated (MT, summarisation, speech synthesis).</Paragraph>
<Paragraph position="4"> Evaluation scores are not independent of the task and context for which they are calculated.</Paragraph>
<Paragraph position="5"> This is clearly true of human-based evaluation, but even scores from a simple metric like word error rate in speech recognition are not comparable unless certain parameters are the same: background noise, language, whether or not speech is controlled. Development of evaluation methods and benchmark tasks must therefore go hand in hand.</Paragraph>
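To make the word-error-rate example concrete (the formula below is the standard definition and is not given in the paper itself), WER is the length-normalised edit distance between the recogniser output and the reference transcription:

\[ \mathrm{WER} = \frac{S + D + I}{N} \]

where S, D and I are the numbers of substituted, deleted and inserted words and N is the number of words in the reference. Because the expected error counts depend directly on parameters such as background noise, language and speaking style, identical WER values obtained under different conditions do not indicate comparable performance.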
<Paragraph position="6"> Evaluation methods have to be accepted by the research community as providing a true approximation of quality. For example, BLEU is strongly disliked in the non-statistical part of the MT community because it is biased in favour of statistical MT systems, and PARSEVAL stopped being used when the parsing community moved towards dependency parsing and related approaches. Notes to the table: (1) PARSEVAL is an evaluation measure, not a full STEC. This has proved problematic: the parsing community no longer accepts the PARSEVAL measure, but there has been no organisational framework for establishing an alternative. (2) SEMEVAL did not proceed, largely because it was too ambitious and agreement could not be reached between people with different interests and theoretical positions. It was eventually reduced in scope, and aspects of it were incorporated in MUC, SUMMAC and SENSEVAL. (3) MT has been transformed by corpus methods, which have shifted it from a backwater to perhaps the most vibrant area of NLP of the last five years. (4) In TC-STAR, the SST task is broken down into numerous subtasks; the modules and systems that meet the given criteria are exchanged among the participants, lowering the barrier to entry.</Paragraph>
<Paragraph position="9"> Sharing: As PARSEVAL shows, measures and resources alone are not enough. Also required are (i) an event (or better, a cycle of events) so that people can attend and feel part of a community; (ii) a forum for reviewing task definitions and evaluation methods; and (iii) a committee which 'owns' the STEC and organises the next campaign.</Paragraph>
<Paragraph position="10"> Funding is usually needed for gold-standard corpus creation but rarely for anything else (Kilgarriff, 2003). Participants can be expected to cover the cost of system development and workshop attendance. A funded project is best seen as supporting and enabling the STEC (especially during the early stages) rather than constituting it.</Paragraph>
<Paragraph position="11"> In sum, STECs are good for community building. They produce energy (as we saw when the possibility was raised for NLG at UCNLG'05 and ENLG'05) which can lead to rapid scientific and technological progress. They make the field look like a game and draw people in.</Paragraph> </Section>
<Section position="4" start_page="133" end_page="134" type="metho"> <SectionTitle> 3 Towards an NLG STEC </SectionTitle>
<Paragraph position="0"> In 1981, Spärck Jones wrote that IR lacked consolidation and the ability to build new work on old, and that this was substantially because there was no commonly agreed framework for describing and evaluating systems (Spärck Jones, 1981, p. 245). Since 1981, various NLP sub-disciplines have consolidated results and progressed collectively through STECs, and have seen successful commercial deployment of NLP technology (e.g. speech recognition software, document retrieval and dialogue systems).</Paragraph>
<Paragraph position="2"> However, Spärck Jones's 1981 analysis could be said to still hold for NLG today. There has been little consolidation of results or collective progress, and there is still virtually no commercial deployment of NLG systems or components.</Paragraph>
<Paragraph position="3"> We believe that comparative evaluation is key if NLG is to consolidate and progress collectively.</Paragraph>
<Paragraph position="4"> Conforming to the evaluation paradigm now common to the rest of NLP will also help re-integration, and open up the field to new researchers.</Paragraph>
<Paragraph position="5"> Tasks: In defining sharable tasks with associated data resources for NLG, the core problem is deciding what inputs should look like. There is a real risk that agreement cannot be achieved on this, so that not many groups participate, or that the plan never reaches fruition (as happened with SEMEVAL).</Paragraph>
<Paragraph position="6"> There are, however, ways in which this problem can be circumvented. One is to use a more abstract task specification describing system functionality, so that participants can use their own inputs and systems are compared in task-based evaluations, in keeping with the traditions and standards of software evaluation (as in Morpholympics). An alternative is to approach the issue through tasks with inputs and outputs that 'occur naturally', so that participants can use their own NLG-specific representations. Examples include data-to-text mappings, where e.g. time-series data or a data repository are mapped to fault reports, forecasts, etc.; a minimal illustration is sketched below.</Paragraph>
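To make the 'naturally occurring' data-to-text setting concrete, here is a minimal sketch of such a mapping, from a wind-speed time series to a one-sentence forecast fragment. It is our own illustration, not part of the paper: the thresholds, descriptive terms and wording rules are hypothetical stand-ins for the choices a real forecast generator would make.

```python
# Minimal data-to-text sketch: map a wind-speed time series to a short
# forecast fragment. Thresholds and wording rules are illustrative only.
from typing import List, Tuple

BEAUFORT_TERMS = [   # (upper bound in knots, descriptive term); hypothetical cut-offs
    (10, "light"),
    (21, "moderate"),
    (33, "fresh to strong"),
    (47, "gale force"),
]

def describe_speed(knots: float) -> str:
    for upper, term in BEAUFORT_TERMS:
        if knots <= upper:
            return term
    return "storm force"

def forecast(series: List[Tuple[str, float]]) -> str:
    """series is a list of (time label, wind speed in knots) pairs."""
    start_time, start_speed = series[0]
    end_time, end_speed = series[-1]
    if end_speed - start_speed > 5:
        trend = f"increasing {describe_speed(end_speed)} by {end_time}"
    elif start_speed - end_speed > 5:
        trend = f"easing {describe_speed(end_speed)} by {end_time}"
    else:
        trend = f"remaining {describe_speed(end_speed)}"
    return f"Wind {describe_speed(start_speed)} at {start_time}, {trend}."

if __name__ == "__main__":
    data = [("0600", 12.0), ("1200", 18.0), ("1800", 26.0)]
    print(forecast(data))  # Wind moderate at 0600, increasing fresh to strong by 1800.
```

The attraction of such a task for a STEC is that both the input (raw time-series data) and the output (forecast text) occur naturally, so participants can keep their own internal representations and still be evaluated on the same data.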
<Paragraph position="7"> Both data-independent task definitions and tasks with naturally occurring data have promise, but we propose the second as the simpler solution, and the easier one to organise, at least initially. A specific proposal for a set of tasks can be found elsewhere in this volume (Reiter and Belz, 2006). An interesting idea (recommended by ELRA/ELDA) is to break down the input-output mapping into stages (as in the TC-STAR workshops, see table) and then, in a second round of evaluations, to make available intermediate representations from the most successful systems of the first round. In this way, standardised representations might develop almost as a side-effect of STECs.</Paragraph>
<Paragraph position="8"> Evaluation: As in MT, there are at least two criteria of quality for NLG systems: language quality (fluency in MT) and correctness of content (adequacy in MT). In NLG, these have mostly been evaluated directly using human scores or preference judgments, although recently automatic metrics such as BLEU have been used. They have also been evaluated indirectly, e.g. by measuring reading speeds and manual post-processing (as in the SkillSum and SumTime projects at Aberdeen). A more user-oriented type of evaluation has been to assess real-world usefulness, in other words whether the generated texts achieve their purpose (e.g. whether users learn more with NLG techniques than with cheaper alternatives, as in the evaluation of the NL interface of the DIAG intelligent tutoring system by di Eugenio et al.).</Paragraph>
<Paragraph position="9"> The majority of NLP STECs have used automatic evaluation methods, and the ability to produce results 'at the push of a button', quickly and reproducibly, is ideal in the context of STECs. However, existing metrics are unlikely to be suitable for NLG (Belz and Reiter, 2006), and there is a lot of scepticism among NLG researchers regarding automatic evaluation. We believe that NLG should develop its own automatic metrics (development of such metrics is part of the proposal by Reiter and Belz, this volume), but for the time being an NLG STEC needs to involve human-based evaluations of the intrinsic as well as the extrinsic type.</Paragraph>
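As an illustration of how an automatic metric of this kind can be applied to generated text, the following is a minimal sketch using the sentence-level BLEU implementation in NLTK; the choice of library and the example strings are ours, not the paper's.

```python
# Minimal sketch: score one generated sentence against reference texts
# with sentence-level BLEU (NLTK). The example strings are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "wind will be moderate at first , increasing fresh to strong by evening".split(),
    "moderate winds early , becoming fresh to strong later".split(),
]
candidate = "wind moderate at first , becoming fresh to strong by evening".split()

# Smoothing avoids zero scores when some higher-order n-grams have no match.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```

Scores of this kind depend heavily on how many and which reference texts are available, which is one source of the scepticism, noted above, about transferring MT-style metrics to NLG.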
<Paragraph position="11"> Sharing: A recent survey conducted on the main NLG and corpus-based NLP mailing lists revealed that there are virtually no data resources that could be used directly in shared tasks. Considerable investment has to go into developing such resources, and direct funding is necessary. This points to a funded project, but we recommend direct involvement of the NLG community and SIGGEN. Other aspects of organisation are not NLG-specific, so the general recommendations of the preceding section apply.</Paragraph> </Section> </Paper>