<?xml version="1.0" standalone="yes"?> <Paper uid="C88-2112"> <Title>Evaluating Natural Language Systems: A Sourcebook Approach *</Title> <Section position="1" start_page="0" end_page="532" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper reports progress in the development of evaluation methodologies for natural language systems. Without a common classification of the problems in natural language understanding, authors have no way to specify clearly what their systems do, potential users have no way to compare different systems, and researchers have no way to judge the advantages or disadvantages of different approaches to developing systems.</Paragraph> <Paragraph position="1"> Introduction.</Paragraph> <Paragraph position="2"> Recent years have seen a proliferation of natural language systems. These include both applied systems, such as database front-ends, expert system interfaces and on-line help systems, and research systems developed to test particular theories of language processing. Each system comes with a set of claims about what types of problems the system can &quot;handle&quot;. But what does &quot;handles ellipsis&quot; or &quot;resolves anaphoric reference&quot; actually mean? All and any such cases? Certain types? And what classification of 'types' of ellipsis is the author using? Without a common classification of the problems in natural language understanding, authors have no way to specify clearly what their systems do, potential users have no way to compare different systems, and researchers have no way to judge the advantages or disadvantages of different approaches to developing systems. While these problems have been noted over the last 10 years (Woods, 1977; Tennant, 1979), research developing specific criteria for evaluation of natural language systems has appeared only recently.</Paragraph> <Paragraph position="3"> This paper reports progress in the development of evaluation methodologies for natural language systems. This work is part of the Artificial Intelligence Measurement System (AIMS) project of the Center for the Study of Evaluation at UCLA.</Paragraph> <Paragraph position="4"> The AIMS project is developing evaluation criteria for expert systems, vision systems and natural language systems.</Paragraph> <Paragraph position="5"> Previous Work on Natural Language Evaluation.</Paragraph> <Paragraph position="6"> Woods (1977) discussed a number of dimensions along which progress in the development of natural language systems can be measured. In particular, he considered approaches via a &quot;taxonomy of linguistic phenomena&quot; covered, the convenience and perspicuity of the model used, and the time used in processing.</Paragraph> <Paragraph position="7"> * The work reported here is part of the Artificial Intelligence Measurement Systems (AIMS) Project, which is supported in part by ONR contract number N00014-86-K-0395.</Paragraph> <Paragraph position="8"> As Woods points out, the difficulty of a taxonomic approach is that the taxonomy will always be incomplete. Any particular phenomenon will have many subclasses, and it often turns out that the published examples cover only a small part of the problem. A system might claim &quot;handles pronoun reference&quot; when its examples cover only parallel constructions. To make such a taxonomy useful, we have to identify as many subclasses as possible. On the positive side, if we can build such a taxonomy, it will allow authors to state clearly just what phenomena they are making claims about. It could serve not only as a description of what has been achieved but as a guide to what still needs to be done.</Paragraph>
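To make the taxonomic approach concrete, here is a minimal sketch of how such a taxonomy could be used to state coverage claims precisely; the phenomena and subtype names are invented placeholders, not the classification developed in this paper:

```python
# A sketch of a coverage taxonomy; the phenomena and subtype names
# are invented placeholders, not the classification built in this paper.
TAXONOMY = {
    "ellipsis": ["verb-phrase", "gapping", "sluicing", "fragment-answer"],
    "pronoun reference": ["parallel-construction", "discourse-object", "plural"],
}

def coverage_report(claims):
    """For each phenomenon, list the subtypes a system claims and those still open."""
    for phenomenon, subtypes in TAXONOMY.items():
        claimed = claims.get(phenomenon, set())
        handled = [s for s in subtypes if s in claimed]
        missing = [s for s in subtypes if s not in claimed]
        print(f"{phenomenon}: handles {handled}; still open: {missing}")

# A system whose "handles pronoun reference" claim rests only on
# parallel constructions, as in the example above:
coverage_report({"pronoun reference": {"parallel-construction"}})
```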
<Paragraph position="9"> Woods provides a useful discussion of the difficulties involved in each of these approaches but offers no specific evaluative criteria. He draws attention to the great effort involved in doing evaluation by any of these methods and to the importance of a &quot;detailed case-by-case analysis&quot;. Our present work is an implementation and extension of some of these ideas.</Paragraph> <Paragraph position="10"> Tennant and others (Tennant, 1979; Finin, Goodman & Tennant, 1979) make a distinction between conceptual coverage and linguistic coverage of a natural language system and argue that systems have to be measured on each of these dimensions. Conceptual coverage refers to the range of concepts handled by the system, and linguistic coverage to the range of language used to discuss the concepts. Tennant suggests a possible experimental separation between conceptual and linguistic coverage.</Paragraph> <Paragraph position="11"> The distinction these authors make is important and useful, in part for emphasizing the significance of the knowledge base for the usability of a natural language system. But the examples that Tennant gives for conceptual completeness -- presupposition, reference to discourse objects -- seem to be part of a continuum with topics like ellipsis and anaphora, which are more clearly linguistic. For this reason we don't draw a sharp distinction here. We prefer to look at the broadest possible range of language use. Insofar as recognizing presuppositions depends on the structure of the knowledge base, we note that in the examples. In any case, the question of evaluating linguistic coverage is still open.</Paragraph> <Paragraph position="12"> Bara and Guida (1984) give a general overview of issues in evaluation of natural language systems. They emphasize the importance of measuring competence, what the system is capable of doing, over performance, what users actually do with the system. We agree with the emphasis. But how do we measure competence? Guida and Mauri (1984, 1986) present the most formal and detailed approach to evaluation of natural language systems. They consider a natural language system as a function from inputs to (sets of) outputs. Assuming a measure of error (the closeness of the output to the correct output) and a measure of the importance of each input, they evaluate the system by the sum of the errors weighted by the importance of the input. It is assumed that the user can assign these measures in some reasonable way. They give some suggestions for this assignment and work out a small example in detail.</Paragraph> <Paragraph position="13"> The advantage of a careful, formal analysis is that it focuses attention on the key role of the 'importance' and 'error' measures. In practice, the importance measure has to be given over categories of input. The difficulty is determining what these categories are for a natural language. A system that handled five types of ellipsis but not the type the user most needs would be of little use. If the user has a description of the varieties of issues involved, he can define his specific needs and give his own weights to the different categories.</Paragraph>
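Guida and Mauri's proposal can be read as a simple formula: the system's score is the sum, over test inputs, of the error of the system's output weighted by the importance of the input. A minimal sketch in Python, with invented category names, weights and an exact-match error measure standing in for the user-assigned measures:

```python
# Guida and Mauri's scheme, read as a formula: the system's score is the
# sum over inputs of importance(input) * error(output, correct output).
# The categories, weights and error measure below are invented for
# illustration; Guida and Mauri leave their assignment to the user.

def weighted_error(cases, importance, error):
    """cases: iterable of (category, system_output, correct_output) triples."""
    return sum(importance[cat] * error(out, gold) for cat, out, gold in cases)

def exact_match_error(out, gold):
    return 0.0 if out == gold else 1.0

# A user who mainly needs one type of ellipsis weights it heavily:
importance = {"vp-ellipsis": 5.0, "gapping": 1.0}
cases = [
    ("vp-ellipsis", "wrong reading", "right reading"),  # fails on the needed type
    ("gapping", "right reading", "right reading"),      # succeeds on the other
]
print(weighted_error(cases, importance, exact_match_error))  # 5.0
```

This makes visible the point made above: the score is dominated by whichever categories the user weights heavily, so the usefulness of the measure depends entirely on having a sensible set of categories to weight.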
<Section position="1" start_page="530" end_page="532" type="sub_section"> <SectionTitle> The Sourcebook Project </SectionTitle> <Paragraph position="0"> The natural language part of the AIMS project has two parts.</Paragraph> <Paragraph position="1"> The first task is to develop methods for describing the coverage of natural language systems. To this end, we are building a database of 'exemplars' of representative problems in natural language understanding, mostly from the computational linguistics literature. Each exemplar includes a piece of text (sentence, dialogue fragment, etc.), a description of the conceptual issue represented, a detailed discussion of the problems in understanding the text, and a reference to a more extensive discussion in the literature. (See Appendix A for examples.) The Sourcebook consists of a large set of these exemplars and a conceptual taxonomy of the types of issues represented in the database. The exemplars are indexed by source in the literature and by conceptual class of the issue so that the user can readily access the relevant examples.</Paragraph> <Paragraph position="2"> The Sourcebook provides a structural representation of the coverage that can be expected of a natural language system.</Paragraph> <Paragraph position="3"> The second task of our group is to develop methods for a 'process evaluation' of natural language systems. A process evaluation includes questions of efficiency, perspicuity and conceptual coverage in the sense of Tennant. We are interested in the learnability of a system, in how well the model is documented, in how easily the system can be extended, etc. Generally, we are interested in how the system actually works, including the user interface. The criteria we develop will be applied to representative existing systems. In this paper we focus on the Sourcebook.</Paragraph> <Paragraph position="4"> Why a Sourcebook?</Paragraph> <Paragraph position="5"> In developing evaluative criteria for linguistic coverage we had several goals we wanted to achieve. First, the criteria should be applicable over the broadest possible range of systems and still provide comparability of the systems. The criteria should be relevant to even very innovative approaches; in fact, they should let the developers of a system describe exactly what is innovative about it. Second, the criteria should be independent of implementation issues, including programming language. A complete analysis of a particular system would of course include implementation details, but it should be possible to describe the coverage independently of such details. Only in this way do we have a basis for claiming an advantage for new implementations or representations. Third, the criteria shouldn't just rate the system on a pass/fail count. They should outline areas of competence so that implementers and researchers can see where further work is needed within their system or their paradigm. They should be able to say &quot;this approach handles types 1, 2 and 3 of ellipsis but not yet types 4 and 5&quot; rather than &quot;this approach handles ellipsis&quot;. Fourth, the criteria should be comprehensible to the general user and to researchers outside computational linguistics. For one thing, as Tennant noted, users are less deterred by, say, syntactic limitations than by limitations in the system's concepts, discourse ability, ability to understand the user's goals, etc. We need to present the issues in such a way that the user can make judgments about the importance of different components of the evaluation. This means presenting the issues in terms of the general principles involved and giving concrete examples. This approach also allows us to bring in information from areas like psychology, sociology, law and literary analysis, and enables researchers in those areas to contribute to the evaluation. A fifth point is negative: we don't expect to be able to judge any system by one or even a few numbers. Our goal is to find a way to describe and to compare the coverage of systems.</Paragraph>
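To make the structure of an exemplar concrete, here is a minimal sketch of such a record and its two indexes; the field names and the sample entry are our own illustrative assumptions, not the project's actual schema:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Exemplar:
    """One Sourcebook entry. Field names are illustrative, not the project's schema."""
    text: str        # the piece of text: sentence, dialogue fragment, etc.
    issue: str       # conceptual class of the issue represented
    discussion: str  # detailed discussion of the understanding problem
    reference: str   # pointer to a more extensive discussion in the literature

# The two indexes described above: by source and by conceptual class.
by_source = defaultdict(list)
by_issue = defaultdict(list)

def add_exemplar(e):
    by_source[e.reference].append(e)
    by_issue[e.issue].append(e)

add_exemplar(Exemplar(
    text="(a sentence exhibiting ellipsis)",  # placeholder text
    issue="ellipsis",
    discussion="What licenses the elided material here?",
    reference="(placeholder citation)",
))
print(len(by_issue["ellipsis"]))  # 1
```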
<Paragraph position="6"> One method often used in computer science to test programs is a test suite, and test suites have been used for natural language evaluation. They have the advantage of simplicity and precision. Hewlett-Packard presented one such suite, covering a variety of tests of English syntax, at the 1987 Association for Computational Linguistics meeting. But this approach is very limited. Although a parser passed one example of a &quot;Bach-Peters sentence (periphery)&quot;, it might fail on another very similar sentence which is conceptually different. (The test suite doesn't measure how well the system understands what's going on.) The categories are derived from a particular syntactic theory, rather than being categories that users work with. The suite tests only a very limited range of linguistic phenomena, and the test is simply pass/fail. And when a sentence fails to pass, it's not always clear why without looking at the implementation. For these reasons, we looked for a more generally useful method than test suites.</Paragraph> <Paragraph position="7"> Rather than start with a particular theory of language, we began with a search of the computational linguistics literature. While no one would claim that computational linguistics has discovered, let alone solved, every problem in language use, twenty-five years of research has covered a broad range of problems. Looking at language use computationally focuses attention on phenomena that are often neglected in more theoretical analyses. Building systems intended to read real text or interact with real users raises complex problems of interaction among linguistic phenomena. The exemplars are mostly taken from the literature, although we have added examples to fill gaps where we felt the published examples were incomplete. Because many of the published cases involved particular systems, the examples are often discussed in the literature in relation to those systems. In the exemplars, we analyze each example in terms of the general issue represented. The exemplars are then grouped into categories of related problems, and this grouping generates the hierarchical classification of the issues. We don't start with an a priori theory for this classification but rather look for patterns in the exemplars. (A summary of the first two layers of the hierarchical classification is in Appendix 2.) By drawing examples from the full range of the literature, including not only successful examples but unsuccessful ones, the Sourcebook gives a broad view of linguistic phenomena. Although published examples are often about implementations, we have focused on examples that illustrate more general issues. The classification of the examples maps the overall topology of the issues and describes both areas covered and areas not covered. Finally, by defining the issues through specific examples and conceptual classification, rather than implementation details or linguistic theories, the Sourcebook is accessible to non-specialists in computational linguistics.</Paragraph>
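The grouping itself is editorial work, done by reading the exemplars for patterns; the sketch below only illustrates the kind of two-layer structure that results, with invented labels (the actual first two layers are given in Appendix 2):

```python
from collections import defaultdict

# Merging fine-grained issue labels into broader classes yields a two-layer
# hierarchy of the kind summarized in Appendix 2. All labels here are
# invented placeholders, not the Sourcebook's actual classes.
BROADER_CLASS = {
    "vp-ellipsis": "ellipsis",
    "gapping": "ellipsis",
    "pronominal-anaphora": "reference",
    "definite-NP-reference": "reference",
}

def two_layer_hierarchy(exemplar_labels):
    """Group fine-grained labels under their broader class."""
    hierarchy = defaultdict(set)
    for label in exemplar_labels:
        hierarchy[BROADER_CLASS.get(label, "unclassified")].add(label)
    return dict(hierarchy)

print(two_layer_hierarchy(
    ["vp-ellipsis", "gapping", "pronominal-anaphora", "metaphor"]
))
# {'ellipsis': {...}, 'reference': {...}, 'unclassified': {'metaphor'}}
```

A label that fits no existing class ('metaphor' above) is exactly what signals a gap in the classification and prompts a new category, in keeping with the bottom-up approach described here.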
<Paragraph position="8"> In the hierarchical classification, groups I, II and III roughly match stages of development in natural language systems. They correspond to simple database query systems (I), database systems capable of extended interaction (II), and systems where knowledge flows between user and system in both directions (III). Type III systems will be needed for, e.g., intelligent interfaces to expert systems. Progress on problems in areas I, II and III can be considered as describing first-, second- and third-generation natural language systems, respectively.</Paragraph> <Paragraph position="9"> Continuing and Future Work.</Paragraph> <Paragraph position="10"> We are continuing to add exemplars to the Sourcebook and are elaborating the classification scheme. We will be making the Sourcebook available to other researchers for comment and analysis.</Paragraph> <Paragraph position="11"> We have several hundred exemplars, and we estimate that we have covered 10 per cent of the relevant literature (journals, proceedings volumes, dissertations, major textbooks) in computational linguistics, artificial intelligence and cognitive science. Our intention is to be as exhaustive as possible, which leaves us with a very ambitious project.</Paragraph> <Paragraph position="12"> We are also continuing work on the process evaluation methodologies.</Paragraph> <Paragraph position="13"> In (1), we know the speaker is the one singing in the shower. How? Because we know that earthquakes don't sing. So it is likely that there is a missing &quot;while&quot; and the speaker heard an earthquake while singing in the shower. However, that reasoning fails on (2). In that sentence, the earthquake is singing, not the person in the shower. A selectional restriction that says earthquakes don't sing will work in understanding (1) but fail for (2). How is the correct actor for actions like singing determined?</Paragraph>
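The exemplar above can be turned into a small sketch of the selectional-restriction strategy it discusses, and of exactly where that strategy breaks; everything in the sketch is an illustrative assumption, since sentences (1) and (2) themselves are not reproduced in this section:

```python
# A sketch of the naive selectional-restriction heuristic discussed above:
# rule out actor/action pairings that violate a restriction ("earthquakes
# don't sing") and fall back to another candidate actor. The semantic types
# and candidate lists are invented; sentences (1) and (2) are not
# reproduced in this section.
CAN_SING = {"person"}

def pick_singer(candidates):
    """Return the first candidate actor whose semantic type permits singing."""
    for actor, sem_type in candidates:
        if sem_type in CAN_SING:
            return actor
    return None

# A (1)-style case: the restriction correctly shifts 'singing' to the speaker.
print(pick_singer([("earthquake", "event"), ("speaker", "person")]))  # speaker

# A (2)-style case, where the earthquake really is the one singing: the
# hard-coded restriction rules out the correct actor, so the heuristic fails.
print(pick_singer([("earthquake", "event")]))  # None
```

</Section> </Section> </Paper>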