<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1022">
  <Title>Definition of &amp;quot;Class A&amp;quot; Class A queries will be identified by exception. Class A</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. A Brief History
</SectionTitle>
    <Paragraph position="0"> The work reported in this paper began in 1988, when people working on the DARPA Spoken Language Systems program and others working on other aspects of natural language processing met at a workshop on NL evaluation \[1\]. At that meeting, considerable time was devoted to clarifying the issues of black-box evaluation and glass-box evaluation.</Paragraph>
    <Paragraph position="1"> Since that meeting, the SLS community formed a committee on evaluation, chaired by Dave Pallett of NIST. The charge of this committee was to develop a methodology for data collection, training data dissemination, and testing for SLS systems under development (see \[2\] and \[3\]). The emphasis of the committee's work has been on automatic evaluation of quefies to an air travel information system. The first community-wide evaluation using the first version of methodology developed by this committee took place in June, 1990. It is reported in \[4\].</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. The Issues
</SectionTitle>
    <Paragraph position="0"> Why are systems for NL understanding, or speech understanding, more difficult to evaluate than SR systems? The key difference is that that the output of speech recognition is easy to specify -- it is a character string containing the words that were spoken as input to the system - and it is trivially easy to determine the &amp;quot;fight&amp;quot; answer and to compare it to the output of a particular SR system. Each of these steps,  1. specifying the form that output should take, 2. determining the right output for particular input, and 3. comparing the fight answer to the output of a particular system, is very problematic for NL and SU systems.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="102" type="metho">
    <SectionTitle>
3. The Goal
</SectionTitle>
    <Paragraph position="0"> The goal of the work was to produce a well-defined, meaningful evaluation methodology (implemented using an automatic evaluation system) which will both permit meaningful comparisons between different systems and also allow us to track the improvement in a single NL system over time. The systems are assumed to be front ends to an interactive application (database inquiry) in a particular domain (ATIS - an air travel information system).</Paragraph>
    <Paragraph position="1"> The intent is to evaluate specifically NL understanding capabilities, not other aspects of a system, such as the user interface, or the utility (or speed) of performing a particular task with a system that includes a NL component.</Paragraph>
  </Section>
  <Section position="5" start_page="102" end_page="103" type="metho">
    <SectionTitle>
4. The Evaluation Framework
</SectionTitle>
    <Paragraph position="0"> The methodology that was developed is very similar in style to that which has been used for speech recognition systems for several years. It is:  1. Collect a set of data as large as feasible, under conditions as realistic as possible.</Paragraph>
    <Paragraph position="1"> 2. Reserve some of that data as a test set, and distribute the rest as a training set.</Paragraph>
    <Paragraph position="2"> 3. Develop answers for the items in the test set, and an automatic comparison program to compare those &amp;quot;right&amp;quot; answers with the answers produced by various systems. 4. Send the test set to the sites, where they will be processed unseen and without modifications to the system.  The answers are then returned and run through the evaluation proc~ure, and the results reported.</Paragraph>
    <Paragraph position="3"> Figure 2 illustrates the relationship between an SLS system and the evaluation system.</Paragraph>
    <Section position="1" start_page="102" end_page="102" type="sub_section">
      <SectionTitle>
4.1 Collecting Data
</SectionTitle>
      <Paragraph position="0"> A method of data collection called &amp;quot;Wizard scenarios&amp;quot; was used to collect raw data (speech and transcribed text). This system is described in \[5\]. It resulted in the collection of a number of human-machine dialogues. This data exhibits some interesting characteristics.</Paragraph>
      <Paragraph position="1"> Because much of the data in the database is represented in encoded form (e.g., &amp;quot;L&amp;quot; for &amp;quot;Lunch&amp;quot;, and &amp;quot;F&amp;quot; for &amp;quot;First Class&amp;quot;), and because the names of the database fields (which show up as column headers in the answers shown to the users) are often abbreviated or cryptic, many of the users' questions centered on finding out the meaning of some code which was printed as part of the answer to a previous question. 1 This is a characteristic of this particular database, and is not necessarily shared by other databases or domains.</Paragraph>
      <Paragraph position="2"> The amount of data shown to the user influences the range of subsequent queries. For example, when the user asks for a list of flights and the answer includes not just the flight numbers but also the departarture and arrival times and locations, the meal service, and so on, there is never any need for a follow-up question like &amp;quot;When does flight AA123 get in?&amp;quot; or &amp;quot;Is lunch served on that flight?&amp;quot;.</Paragraph>
      <Paragraph position="3"> The language obtained in Wizard scenarios is very strongly influenced by the particular task, the domain and database being used, and the amount and form of data returned to the user. Therefore, restricting the training and test sets to data collected in Wizard scenarios in the AIrS domain means that the language does not exhibit very many examples of some phenomena (such as quantification, negation, and complex conjunction) which are known to appear frequently in database interfaces to other domains.</Paragraph>
    </Section>
    <Section position="2" start_page="102" end_page="102" type="sub_section">
      <SectionTitle>
4.2 Classifying Data
</SectionTitle>
      <Paragraph position="0"> One of the first things to become clear was that not all of the collected data was suitable as test data, and thus it was desirable that the training data be marked to indicate which queries one might reasonably expect to find in the test set.</Paragraph>
      <Paragraph position="1"> The notion emerged of having a number of classes of data, each more inclusive than the last, so that we could begin with a core (Class A) which was clearly definable and possible to evaluate automatically, and, as we came to understand the evaluation process better, which could be extended to other types of queries (Classes B, C, etc.).</Paragraph>
      <Paragraph position="2"> Several possible classification systems were presented and discussed in great detail. The one which has been agreed on at this time, Class A, is defined in Appendix A; Classes B, C, etc. have not yet been defined.</Paragraph>
    </Section>
    <Section position="3" start_page="102" end_page="103" type="sub_section">
      <SectionTitle>
4.3 Agreeing on Meaning
</SectionTitle>
      <Paragraph position="0"> How do you know whether a system has given the fight answer to a question like &amp;quot;List the mid-day flights from Boston to Dallas&amp;quot;, &amp;quot;Which non-stop flights from Boston to Dallas serve meals?&amp;quot;, or &amp;quot;What's the fare on AA 167?&amp;quot; It is necessary, for cross-site comparisons, to agree on the meaning of terms such as &amp;quot;mid-day&amp;quot;, &amp;quot;meals&amp;quot; (e.g., does that  mean any food at all, or does it include full meals but exclude snacks), and &amp;quot;the fare of a flight&amp;quot; (because in this database, flights don't have simple fares, but a flight together with a class determine the fare).</Paragraph>
      <Paragraph position="1"> The process of defining precisely what is meant by many words and phrases is still not complete, but Appendix B lists the current points of agreement. Without this agreement, many systems would produce very different answers for the same questions, all of them equally fight according to the systems' own definitions of the terms, but not amenable to automatic evaluation.</Paragraph>
    </Section>
    <Section position="4" start_page="103" end_page="103" type="sub_section">
      <SectionTitle>
4.4 Developing Canonical Answers
</SectionTitle>
      <Paragraph position="0"> It is not enough to agree on meaning, it is also necessary to have a common understanding of what is to be produced as the answer, or part of the answer, to a question.</Paragraph>
      <Paragraph position="1"> For example, if a user asks &amp;quot;What is the departure time of the earliest flight from San Francisco to Atlanta?&amp;quot;, one system might reply with a single time and another might reply with that time plus additional columns containing the carrier and flight number, a third system might also include the arrival time and the origin and destination airports.</Paragraph>
      <Paragraph position="2"> None of these answers could be said to be wrong, although one might argue about the advantages and disadvantages of terseness and verbosity.</Paragraph>
      <Paragraph position="3"> It was agreed that, for the sake of automatic evaluation, a canonical answer (the minimum &amp;quot;fight&amp;quot; answer) should be developed for each Class A query in the training set, and that the canonical answer should be that answer retrieved by a canonical SQL expression. That is, the fight answer was defined by the expression which produces the answer from the database, as well as the answer retrieved. This ensures A) that it is possible to retrieve the canonical answer via SQL, B) that even if the answer is empty or otherwise limited in content, it is possible for system developers to understand what was expected by looking at the SQL, and C) the canonical answer contains the least amount of information needed to determine that the system produced the fight answer.</Paragraph>
      <Paragraph position="4"> Thus it was agreed that for identifying a &amp;quot;flight&amp;quot; the unique flight id would be used, not the carrier and flight number (since there may be several different flights called AA 123 connecting different cities with different departure times andother characteristics).</Paragraph>
      <Paragraph position="5"> What should be produced for an answer is determined both by domain-independent linguistic principles \[2\] and domain-specific stipulation (Appendix B). The language used to express the answers is defined in Appendix C.</Paragraph>
    </Section>
    <Section position="5" start_page="103" end_page="103" type="sub_section">
      <SectionTitle>
4.5 Developing Comparators
</SectionTitle>
      <Paragraph position="0"> A final necessary component is, of course, a program to compare the canonical answers to those produced by various systems. Two programs were constructed to do this, one written in Common Lisp by BBN and one written in C by NIST. Their functionality is substantially similar; anyone interested in obtaining the code for these comparators should contact Bill Fisher at NIST or Sean Boisen at BBN.</Paragraph>
      <Paragraph position="1"> The task of answer comparison is complicated substantially by the fact that the canonical answer is intentionally minimal, but the answer supplied by a system may contain extra information. Some intelligence is needed to determine when two answer match (i.e. simple identity tests won't work).</Paragraph>
    </Section>
    <Section position="6" start_page="103" end_page="103" type="sub_section">
      <SectionTitle>
4.6 Choosing a Test Set
</SectionTitle>
      <Paragraph position="0"> For the first evaluation, a test set of 90 Class A queries was chosen from dialogues collected by 4 subjects who were not represented in the training set. We believe that makes the test set harder than necessary, since it is clear that there is not enough training data to illustrate many of the linguistic forms used in this domain, and there is also a strong indication that new users tend to stick with a particular way of asking questions.</Paragraph>
    </Section>
    <Section position="7" start_page="103" end_page="103" type="sub_section">
      <SectionTitle>
4.7 Presenting Results
</SectionTitle>
      <Paragraph position="0"> Expressing results can be almost as complicated as obtaining them. Originally it was thought that a simple &amp;quot;X percent correct&amp;quot; measure would be sufficient, however it became clear that not all systems could answer all questions, and that there was a significant difference between giving a wrong answer and giving no answer at all, so the results were presented as: Number fight, Number wrong, Number not answered. How harshly systems should be judged for giving wrong answers was not determined.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="103" end_page="104" type="metho">
    <SectionTitle>
5. Strengths of this Methodology
</SectionTitle>
    <Paragraph position="0"> It forces advance agreement on the meaning of critical terms and on at least minimal information to be included in the answer.</Paragraph>
    <Paragraph position="1"> It is objective, to the extent that a method for selecting testable queries can be defined, and to the extent that the agreements mentioned above can be reached.</Paragraph>
    <Paragraph position="2"> It requires less human effort (primarily in the creating of canonical examples and answers) than non-automatic, more subjective evaluation.. It is thus better suited to large test sets.</Paragraph>
    <Paragraph position="3"> Flexible comparison means system developers need not develop a completely separate system for evaluation purposes. With only minor formatting changes, substantially the same system can be used for other purposes.</Paragraph>
    <Paragraph position="4"> It can be easily extended, as discussed in section 7 below.</Paragraph>
  </Section>
  <Section position="7" start_page="104" end_page="104" type="metho">
    <SectionTitle>
6. Weaknesses of this Methodology
</SectionTitle>
    <Paragraph position="0"> It does not distinguish between merely acceptable answers and very good answers (although the comparators could be made to take this into account if multiple canonical answers with associated acceptability levels could be provided).</Paragraph>
    <Paragraph position="1"> It does not distinguish between some cases, and may thus give undue credit to a system that &amp;quot;over answers&amp;quot;. For example, if the system prints out the carrier, flight number, arrival and departure times and locations, and meal service every time it is asked about a flight, then the answers to &amp;quot;What is the arrival time of AA 123&amp;quot;, &amp;quot;What is the destination of AA 123&amp;quot;, &amp;quot;What meal is served on AA 123&amp;quot; and &amp;quot;List flight AA 123&amp;quot; could all produce exactly the same answer and be scored correct on all of them, since the canonical answers would be a subset of the information printed (note that it still must correctly distinguish flight AA 123 from other flights).</Paragraph>
    <Paragraph position="2"> It cannot tell if a system gets the right answer for the wrong reason. This is an&amp;quot; 11 unavoidable problem with &amp;quot;black-box&amp;quot; evaluation, but it can be mitigated by use of larger test sets.</Paragraph>
    <Paragraph position="3"> It does not adequately measure the handling of some phenomena, such as extended dialogues.</Paragraph>
    <Paragraph position="4"> 7. Suggestions for The Future Our experience thus far has shown that the methodology of developing well-defined test set criteria, combined with automatic evaluation of canonical answers, is a useful one.</Paragraph>
    <Section position="1" start_page="104" end_page="104" type="sub_section">
      <SectionTitle>
7.1 Challenge Training Set
</SectionTitle>
      <Paragraph position="0"> One of the strongest needs of the SLS community at this time is more training data. Instead of a few hundred training queries, several thousand are needed. One way of obtaining more data is simply to continue to run Wizard scenarios to collect them, but this process is rather slow, and tends to yield numerous examples of very similar queries.</Paragraph>
      <Paragraph position="1"> We suggest that a separate training set, called the Challenge Set be created. To form this set, each of the five system-developing sites would create, using any means they wish, 500 Class A queries, together with canonical SQL and answers for them. Each site would be encouraged to include queries that show the scope of their system and that create a challenge for other sites to match. Every site would then be required to report on the results of running their system on the challenge set. This use of &amp;quot;crucial examples&amp;quot; is similar to more traditional linguistic methods.</Paragraph>
    </Section>
    <Section position="2" start_page="104" end_page="104" type="sub_section">
      <SectionTitle>
7.2 Beyond Class A
</SectionTitle>
      <Paragraph position="0"> 7.2.1Context The existing Class A definition excludes all sentences whose interpretation requires context outside the sentence itself, i.e. &amp;quot;Which of those flights are non-stop?&amp;quot;. The only obstacle to including such sentences is agreement on what discourse phenomenon to allow, what the meanings in context should be, and when the context should be reset (because sentences have been excluded for other reasons). The existing evaluation framework naturally extends to such cases, assuming the system can produce answers without user interaction.</Paragraph>
      <Paragraph position="1"> A proposal has been made \[6\] to standardize output displays in an attempt to reset context for the evaluation of discourse. This would make it possible to evaluate queries containing references that are display-specific, but not many queries in the training data are of this type, and we believe that there are simpler ways of evaluating common discourse phenomena.</Paragraph>
      <Paragraph position="2"> 7.2.2Ambiguity A simple extension to the language for expressing answers would allow more than one answer to be returned for a query. At a minimum, this could be used to give several alternatives: an answer matching any alternative would then be scored as correct. For example, the answer to the query &amp;quot;What is the distance from the San Francisco airport to downtown&amp;quot; could be either the distance to San Francisco or the distances to San Francisco and Oakland (since both of those cities are served by the San Francisco airport). A more sophisticated approach would be to assign different weights to these alternatives, so systems obtaining the preferred reading would score the highest.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>