File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/92/a92-1023_metho.xml
Size: 35,999 bytes
Last Modified: 2025-10-06 14:12:53
<?xml version="1.0" standalone="yes"?> <Paper uid="A92-1023"> <Title>A Practical Methodology for the Evaluation of Spoken Language Systems</Title> <Section position="2" start_page="0" end_page="164" type="metho"> <SectionTitle> 2 The Evaluation Framework </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="162" type="sub_section"> <SectionTitle> 2.1 Characteristics of the Methodology </SectionTitle> <Paragraph position="0"> The goal of this research has been to produce a well-defined, meaningful evaluation methodology which is *The work reported here was supported by the Advanced Research Projects Agency and was monitored by the Off'ice of Naval Research under Contract No. 00014-89-C-0008. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the United States Government.</Paragraph> <Paragraph position="1"> * automatic, to enable evaluation over large quantities of data based on an objective assessment of the understanding capabilities of a NL system (rather than its user interface, portability, speed, etc.) * capable of application to a wide variety of NL systems and approaches * suitable for blind testing * as non-intrusive as possible on the system being evaluated (to decrease the costs of evaluation) * domain independent.</Paragraph> <Paragraph position="2"> The systems are assumed to be front ends to an interactive database query system, implemented in a particular common domain.</Paragraph> <Paragraph position="3"> The methodology can be described as &quot;black box&quot; in thai there is no attempt to evaluate the internal representations (syntactic, semantic, etc.) of a system. Instead, only the content of an answer relrieved from the database is evaluated: if the answer is correct, it is assumed that the system understood the query correctly. Comparing answers has the practical advantage of being a simple way to give widely var. ied systems a common basis for comparison. Although some recent work has suggested promising approaches (Black e, al., 1991), system-internal representations are hard to com. pare, or even impossible in some cases where System X hm no level of representation corresponding to System Y's. I is easy, however, to define a simple common language fo~ representing answers (see Appendix A), and easy to ma~ system-specific representations into this common language. This methodology has been successfully applied in the context of cross-site blind tests, where the evaluation i: based on input which the system has never seen before This type of evaluation leaves out many other important as pects of a system, such as the user interface, or the utilit: (or speed) of performing a particular task with a system tha includes a NL component (work by Tennant (1981), Bate: and Rettig (1988), and Neal et al. (1991) addresses some o these other factors).</Paragraph> <Paragraph position="4"> Examples below will be taken from the current DARPt SLS application, the Airline Travel Information Systen (ATIS). 
ATIS is a database of flights with information on the aircraft, stops and connections, meals, etc.</Paragraph> </Section>
<Section position="2" start_page="162" end_page="164" type="sub_section"> <SectionTitle> 2.2 Evaluation Architecture and Common Resources </SectionTitle>
<Paragraph position="0"> We assume an evaluation architecture like that in Figure 1.</Paragraph>
<Paragraph position="1"> The shaded components are common resources of the evaluation, and are not part of the system(s) being evaluated.</Paragraph>
<Paragraph position="2"> Specifically, it is assumed there is a common database which all systems use in producing answers, which defines both the data tuples (rows in tables) and the data types for elements of these tuples (string, integer, etc.).</Paragraph>
<Paragraph position="3"> Queries relevant to the database are collected under conditions as realistic as possible (see 2.4). Answers to the corpus of queries must be provided, expressed in a common standard format (Common Answer Specification, or CAS): one such format is exemplified in Appendix A. Some portion of these pairs of queries and answers is then set aside as a test corpus, and the remainder is provided as training material.</Paragraph>
<Paragraph position="4"> In practice, it has also proved useful to include in the training data the database query expression (for example, an SQL expression) which was used to produce the reference answer: this often makes it possible for system developers to understand what was expected for a query, even if the answer is empty or otherwise limited in content.</Paragraph>
<Paragraph position="5"> While the pairing of queries with answers provides the training and test corpora, these must be augmented by common agreement as to how queries should be answered. In practice, agreeing on the meaning of queries has been one of the hardest tasks. The issues are often extremely subtle, and interact with the structure and content of the database in sometimes unexpected ways.</Paragraph>
<Paragraph position="6"> As an example of the problem, consider the following request to an airline information system: List the direct flights from Boston to Dallas that serve meals.</Paragraph>
<Paragraph position="7"> It seems straightforward, but should this include flights that might stop in Chicago without making a connection there? Should it include flights that serve a snack, since a snack is not considered by some people to be a full meal? Without some common agreement, many systems would produce very different answers for the same questions, all of them equally right according to each system's own definitions of the terms, but not amenable to automatic intersystem comparison. To implement this methodology for such a domain, therefore, it is necessary to stipulate the meaning of potentially ambiguous terms such as &quot;mid-day&quot;, &quot;meals&quot;, &quot;the fare of a flight&quot;. The current list of such &quot;principles of interpretation&quot; for the ATIS domain contains about 60 specifications, including things like: * which tables and fields in the database identify the major entities in the domain (flights, aircraft, fares, etc.) * how to interpret fare expressions like &quot;one-way fare&quot;, &quot;the cheapest fare&quot;, &quot;excursion fare&quot;, etc. * which cities are to be considered &quot;near&quot; an airport.
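To suggest the flavor of such stipulations, the fragment below encodes a few of them as data that an annotation or answer-generation tool might consult. The names and specific values are hypothetical stand-ins, not the actual ATIS principles or schema; the point is only that each vague term is pinned down to one checkable definition:

    # Hypothetical encoding of a few stipulated interpretations (illustrative only).
    PRINCIPLES = {
        # A "direct" flight may make stops but must not involve a connection.
        "direct_flight": {"max_connections": 0},
        # "Serves meals" is stipulated to include snacks.
        "meal_codes": ["BREAKFAST", "LUNCH", "DINNER", "SNACK"],
        # A vague time expression pinned to an explicit interval (values invented).
        "mid_day": {"after": 1100, "before": 1400},
    }

    def serves_meals(meal_code):
        """True if a flight's meal code counts as serving a meal under the stipulation."""
        return meal_code in PRINCIPLES["meal_codes"]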
Some other examples from the current principles of interpretation are given in Appendix B.</Paragraph>
<Paragraph position="8"> It is not enough to agree on the meaning of queries in the chosen domain. It is also necessary to develop a common understanding of precisely what is to be produced as the answer, or part of the answer, to a question.</Paragraph>
<Paragraph position="9"> For example, if a user asks &quot;What is the departure time of the earliest flight from San Francisco to Atlanta?&quot;, one system might reply with a single time, another might reply with that time plus additional columns containing the carrier and flight number, and a third system might also include the arrival time and the origin and destination airports. None of these answers could be said to be wrong, although one might argue about the advantages and disadvantages of terseness and verbosity.</Paragraph>
<Paragraph position="10"> While it is technically possible to mandate exactly which columns from the database should be returned for expressions, this is not practical: it requires agreement on a much larger set of issues, and conflicts with the principle that evaluation should be as non-intrusive as possible. Furthermore, it is not strictly necessary: what matters most is not whether a system provided exactly the same data as some reference answer, but whether the correct answer is clearly among the data provided (as long as no incorrect data was returned).</Paragraph>
<Paragraph position="11"> For the sake of automatic evaluation, then, a canonical reference answer (the minimum &quot;right answer&quot;) is developed for each evaluable query in the training set. The content of this reference answer is determined both by domain-independent linguistic principles (Boisen et al., 1989) and domain-specific stipulation. The language used to express the answers for the ATIS domain is presented in Appendix A.</Paragraph>
<Paragraph position="12"> Evaluation using the minimal answer alone makes it possible to exploit the fact that extra fields in an answer are not penalized. For example, the answer ((&quot;AA&quot; 152 0920 1015 &quot;BOS&quot; &quot;CHI&quot; &quot;SNACK&quot;)) could be produced for any of the following queries: * &quot;When does American Airlines flight 152 leave?&quot; * &quot;What's the earliest flight from Boston to Chicago?&quot; * &quot;Does the 9:20 flight to Chicago serve meals?&quot; and would be counted correct. Taken to an extreme, a system could return large amounts of extra data with every answer, gaining credit for queries it did not really understand.</Paragraph>
<Paragraph position="13"> For the ATIS evaluations, it was necessary to rectify this problem without overly constraining what systems can produce as an answer. The solution arrived at was to have two kinds of reference answers for each query: a minimum answer, which contains the absolute minimum amount of data that must be included in an answer for it to be correct, and a maximum answer (that can be automatically derived from the minimum) containing all the &quot;reasonable&quot; fields that might be included, but no completely irrelevant ones.</Paragraph>
<Paragraph position="14"> For example, for a question asking about the arrival time of a flight, the minimum answer would contain the flight ID and the arrival time. The maximum answer would also contain the airline name and flight number, but not the meal service or any fare information. In order to be counted correct, the answer produced by a system must contain at least the data in the minimum answer, and no more than the data in the maximum answer; if additional fields are produced, the answer is counted as wrong.
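A minimal sketch of this check, assuming an answer has already been reduced to the set of database fields it exposes (the field names are invented for illustration):

    def within_bounds(answer_fields, minimum, maximum):
        """Correct only if the answer covers every minimum field and includes
        nothing outside the (automatically derived) maximum set."""
        answer = set(answer_fields)
        return set(minimum).issubset(answer) and answer.issubset(set(maximum))

    # "What time does flight X arrive?"
    minimum = {"flight_id", "arrival_time"}
    maximum = minimum | {"airline_name", "flight_number"}   # but not meals or fares
    print(within_bounds({"flight_id", "arrival_time", "airline_name"}, minimum, maximum))  # True
    print(within_bounds({"flight_id", "arrival_time", "meal_code"}, minimum, maximum))      # False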
The minimum/maximum scheme successfully reduced the incentive for systems to overgenerate answers in hope of getting credit for answering queries that they did not really understand.</Paragraph>
<Paragraph position="15"> Another common resource is software to compare the reference answers to those produced by various systems. 1 This task is complicated substantially by the fact that the reference answer is intentionally minimal, but the answer supplied by a system may contain extra information, and cannot be assumed to have the columns or rows in the same order as the reference answer. Some intelligence is therefore needed to determine when two answers match: simple identity tests won't work.</Paragraph>
<Paragraph position="16"> In the general case, comparing the atomic values in an answer expression just means an identity test. The only exception is real numbers, for which an epsilon test is performed, to deal with round-off discrepancies arising from different hardware precision. 2 The number of significant digits that are required to be the same is a parameter of the comparator. Answer comparison at the level of tables requires more sophistication, since column order is ignored, and the answer may include additional columns that are not in the specification. Furthermore, those additional columns can mean that the answer will include extra whole tuples not present in the specification. For example, in the ATIS domain, if the Concorde and Airbus are both aircraft whose type is &quot;JET&quot;, they would together contribute only one tuple (row) to the simple list of aircraft types below.</Paragraph>
<Paragraph position="18"/>
<Paragraph position="17"> On the other hand, if aircraft names were included in the table, they would each appear, producing a larger number of tuples overall.</Paragraph>
<Paragraph position="20"/>
<Paragraph position="19"> With answers in the form of tables, the algorithm explores each possible mapping from the required columns found in the reference answer (henceforth REF) to the actual columns found in the answer being evaluated (HYP). (Naturally, there must be at least as many columns in HYP as in REF, or the answer is clearly wrong.) For each such mapping, it reduces HYP according to the mapping, eliminating any duplicate tuples in the reduced table, and then compares REF against that reduced table, testing set-equivalence between the two.</Paragraph>
<Paragraph position="21"> Special provision is made for single element answers, so that a scalar REF and a HYP which is a table containing a single element are judged to be equivalent. (The comparison of strings is also relaxed somewhat so that, e.g., strings need not have quotes around them if they do not contain &quot;white space&quot; characters; see Appendix A for further details.)</Paragraph>
<Paragraph position="22"> That is, a scalar REF will match either a scalar or a single element table for HYP, and a REF which is a single element table specification will also match either kind of answer.</Paragraph>
<Paragraph position="23"> For the ATIS evaluations, two extensions were made to this approach. A REF may be ambiguous, containing several subexpressions each of which is itself a REF: in this case, if HYP matches any of the answers in REF, the comparison succeeds. A special answer token (NO_ANSWER) was also agreed to, so that when a system can detect that it doesn't have enough information, it can report that fact rather than guessing.
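The comparison software itself is one of the common resources; purely as an illustration of the core matching step just described (enumerate column mappings, reduce HYP, drop duplicate tuples, test set-equivalence), here is a minimal sketch. It deliberately omits scalar answers, tolerant number and string comparison, ambiguous REFs joined by OR, and NO_ANSWER handling:

    from itertools import permutations

    def tables_match(ref, hyp):
        """ref and hyp are relations given as lists of tuples (rows).  HYP matches
        when some mapping of the REF columns onto HYP columns yields, after duplicate
        rows are dropped, exactly the set of REF tuples (extra HYP columns are free)."""
        if not ref:
            return not hyp
        width = len(ref[0])
        hyp_width = len(hyp[0]) if hyp else 0
        if width > hyp_width:          # HYP needs at least as many columns as REF
            return False
        target = set(ref)
        for cols in permutations(range(hyp_width), width):
            reduced = {tuple(row[c] for c in cols) for row in hyp}   # reduce and dedupe
            if reduced == target:
                return True
        return False

    # REF lists only the aircraft type; HYP adds a name column and so has more rows.
    ref = [("JET",)]
    hyp = [("CONCORDE", "JET"), ("AIRBUS", "JET")]
    print(tables_match(ref, hyp))   # True: the extra column and its extra rows are ignored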
Reporting NO_ANSWER rather than guessing reflects the assumption that failing to answer is less serious than answering incorrectly.</Paragraph> </Section>
<Section position="3" start_page="164" end_page="164" type="sub_section"> <SectionTitle> 2.3 Scoring Answers </SectionTitle>
<Paragraph position="0"> Expressing results can be almost as complicated as obtaining them. Originally it was thought that a simple &quot;X percent correct&quot; measure would be sufficient; however, it became clear that there was a significant difference between giving a wrong answer and giving no answer at all, so the results are now presented as: Number right, Number wrong, Number not answered, Weighted Error Percentage (weighted so that wrong answers are twice as bad as no answer at all), and Score (100 - weighted error).</Paragraph>
<Paragraph position="1"> Whenever numeric measures of understanding are presented, they should in principle be accompanied by some measure of the significance and reliability of the metric. Although precise significance tests for this methodology are not yet known, it is clear that &quot;black box&quot; testing is not a perfect measure. In particular, it is impossible to tell whether a system got a correct answer for the &quot;right&quot; reason, rather than through chance: this is especially true when the space of possible answers is small (yes-no questions are an extreme example). Since more precise measures are much more costly, however, the present methodology has been considered adequate for the current state of the art in NL evaluation.</Paragraph>
<Paragraph position="2"> Given that current weighted error rates for the DARPA ATIS evaluations range from 55% to 18%, we can roughly estimate the confidence interval to be approximately 8%. 3 Another source of variation in the scoring metric is the fact that queries taken from different speakers can vary widely in terms of how easy it is for systems to understand and answer them correctly. For example, in the February 1991 ATIS evaluations, the performance of BBN's Delphi SLS on text input from individual speakers ranged from 75% to 10% correct. The word error from speech recognition was also the highest for those speakers with the highest NL error rates, suggesting that individual speaker differences can strongly impact the results.</Paragraph>
<Paragraph position="3"> 3 Assuming there is some probability of error in each trial (query), the variance in this error rate can be estimated using the formula e(1 - e)/n, where e is the error rate expressed as a decimal (so 55% error = 0.55), and n is the size of the test set. Taking e = 0.45 (one of the better scores from the February 91 ATIS evaluation), and n = 145, differences in scores greater than 0.08 (8%) have a 95% likelihood of being significant.</Paragraph> </Section>
<Section position="4" start_page="164" end_page="164" type="sub_section"> <SectionTitle> 2.4 Evaluation Data 2.4.1 Collecting Data </SectionTitle>
<Paragraph position="0"> The methodology presented above places no a priori restrictions on how the data itself should be collected. For the ATIS evaluations, several different methods of data collection, including a method called &quot;Wizard scenarios&quot;, were used to collect raw data, both speech and transcribed text (Hemphill, 1990). This resulted in the collection of a number of human-machine dialogues. One advantage of this approach is that it produced both the queries and draft answers at the same time.
It also became clear that the language obtained is very strongly influenced by the particular task, the domain and database being used, the amount and form of data returned to the user, and the type of data collection methodology used. This is still an area of active research in the DARPA SLS community.</Paragraph>
<Paragraph position="1"> Typically, some of the data which is collected is not suitable as test data, because: * the queries fall outside the domain or the database query application * the queries require capabilities beyond strict NL understanding (for example, very complex inferencing or the use of large amounts of knowledge outside the domain) * the queries are overly vague (&quot;Tell me about ...&quot;) It is also possible that phenomena may arise in test data which fall outside the agreement on meanings derived from the training data (the &quot;principles of interpretation&quot;). Such queries should be excluded from the test corpus, since it is not possible to make a meaningful comparison on answers unless there is prior agreement on precisely what the answer should be.</Paragraph>
<Paragraph position="2"> The methodology of comparing paired queries and answers assumes the query itself contains all the information necessary for producing an answer. This is, of course, often not true in spontaneous goal-directed utterances, since one query may create a context for another, and the full context is required to answer (e.g., &quot;Show me the flights ...&quot;, &quot;Which of THEM ...&quot;). Various means of extending this methodology for evaluating context-dependent queries have been proposed, and some of them have been implemented in the ATIS evaluations (Boisen et al. (1989), Hirschman et al. (1990), Bates and Ayuso (1991), Pallett (1991)).</Paragraph> </Section> </Section>
<Section position="3" start_page="164" end_page="165" type="metho"> <SectionTitle> 3 The DARPA SLS Evaluations </SectionTitle>
<Paragraph position="0"> The goal of the DARPA Spoken Language Systems program is to further research and demonstrate the potential utility of speech understanding. Currently, at least five major sites (AT&T, BBN, CMU, MIT, and SRI) are developing complete SLS systems, and another site (Paramax) is integrating its NL component with other speech systems. Representatives from these and other organizations meet regularly to discuss program goals and to evaluate progress.</Paragraph>
<Paragraph position="1"> This DARPA SLS community formed a committee on evaluation 4, chaired by David Pallett of the National Institute of Standards and Technology (NIST). The committee was to develop a methodology for data collection, training data dissemination, and testing for SLS systems under development. The first community-wide evaluation using the first version of this methodology took place in June 1990, with subsequent evaluations in February 1991 and February 1992.</Paragraph>
<Paragraph position="2"> The emphasis of the committee's work has been on automatic evaluation of queries to an air travel information system (ATIS). Air travel was chosen as an application that is easy for everyone to understand. The methodology presented here was originally developed in the context of the need for SLS evaluation, and has been extended in important ways by the community based on the practical experience of doing evaluations.</Paragraph>
<Paragraph position="3"> As a result of the ATIS evaluations, a body of resources has now been compiled and is available through NIST.
This includes the ATIS relational database, a corpus of paired queries and answers, protocols for data collection, software for automatic comparison of answers, the &quot;Principles of Interpretation&quot; specifying domain-specific meanings of queries, and the CAS format (Appendix A is the current version). Interested parties should contact David Pallett of NIST for more information (National Institute of Standards and Technology, Technology Building, Room A216, Gaithersburg, MD 20899, (301)975-2944). 4 Advantages and Limitations of the Methodology Several benefits come from the use of this methodology: * It forces advance agreement on the meaning of critical terms and on some information to be included in the answer.</Paragraph>
<Paragraph position="4"> * It is objective, to the extent that a method for selecting testable queries can be defined, and to the extent that the agreements mentioned above can be reached.</Paragraph>
<Paragraph position="5"> * It requires less human effort (primarily in creating canonical examples and answers) than non-automatic, more subjective evaluation. It is thus better suited to large test sets.</Paragraph>
<Paragraph position="6"> * It can be easily extended.</Paragraph>
<Paragraph position="7"> Most of the weaknesses of this methodology arise from the fact that the answers produced by a database query system are only an approximation of its understanding capabilities. As with any black-box approach, it may give undue credit to a system that gets the right answer for the wrong reason (i.e., without really understanding the query), although this should be mitigated by using larger and more varied test corpora.</Paragraph>
<Paragraph position="8"> It does not distinguish between merely acceptable answers and very good answers.</Paragraph>
<Paragraph position="9"> Another limitation of this approach is that it does not adequately measure the handling of some phenomena, such as extended dialogues.</Paragraph>
This differs from the MUC evaluations, where an answer template is a composite of many bits of information, and is scored along the dimensions of recall, precision, and overgeneration.</Paragraph>
<Paragraph position="3"> Rome Laboratory has also sponsored a recent effort to define another approach to evaluating NL systems (Neal et al., 1991; Walter, 1992). This methodology is focused on human evaluation of interactive systems, and is a &quot;glass-box&quot; method which looks at the performance of the linguistic components of the system under review.</Paragraph> </Section>
<Section position="6" start_page="165" end_page="166" type="metho"> <SectionTitle> 6 Future Issues </SectionTitle>
<Paragraph position="0"> The hottest topic currently facing the SLS community with respect to evaluation is what to do about dialogues. Many of the natural tasks one might do with a database interface involve extended problem-solving dialogues, but no methodology exists for evaluating the capabilities of systems attempting to engage in dialogues with users.</Paragraph>
<Paragraph position="1"> A Common Answer Specification (CAS) for the ATIS Application (Note: this is the official CAS specification for the DARPA ATIS evaluations, as distributed by NIST. It is domain independent, but not necessarily complete: for example, it assumes that the units of any database value are unambiguously determined by the database specification. This would not be sufficient for applications that allowed unit conversion, e.g. &quot;Show me the weight of ...&quot; where the weight could be expressed in tons, metric tons, pounds, etc. This sort of extension should not affect the ease of automatically comparing answer expressions, however.)
Basic Syntax in BNF
answer ::= cas1 | ( cas1 OR answer )
cas1 ::= scalar-value | relation | NO_ANSWER | no_answer
scalar-value ::= boolean-value | number-value | string
boolean-value ::= YES | yes | TRUE | true | NO | no | FALSE | false
number-value ::= integer | real-number
integer ::= [sign] digit+
sign ::= + | -
digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
real-number ::= sign digit+ . digit* | digit+ . digit*
string ::= char_except_whitespace+ | &quot; char* &quot;
relation ::= ( tuple* )
tuple ::= ( value+ )
value ::= scalar-value | NIL
Standard BNF notation has been extended to include two other common devices: &quot;A+&quot; means &quot;one or more A's&quot; and &quot;A*&quot; means &quot;zero or more A's&quot;. The formulation given above does not define char_except_whitespace and char. All of the standard ASCII characters count as members of char, and all but &quot;white space&quot; are counted as char_except_whitespace. Following ANSI &quot;C&quot;, blanks, horizontal and vertical tabs, newlines, formfeeds, and comments are, collectively, &quot;white space&quot;. The only change in the syntax of CAS itself from the previous version is that now a string may be represented as either a sequence of characters not containing white space or as a sequence of any characters enclosed in quotation marks. Note that only non-exponential real numbers are allowed, and that empty tuples are not allowed (but empty relations are).</Paragraph>
<Section position="1" start_page="166" end_page="166" type="sub_section"> <SectionTitle> Additional Syntactic Constraints </SectionTitle>
<Paragraph position="0"> The syntactic classes boolean-value, string, and number-value define the types &quot;boolean&quot;, &quot;string&quot;, and &quot;number&quot;, respectively.
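For concreteness, a few answer expressions that are legal under this grammar, shown here as Python string literals with a note on the production each exercises (the data is invented for illustration):

    # Each string is a legal CAS form under the BNF above (data invented for illustration).
    EXAMPLES = [
        'YES',                                   # answer -> cas1 -> boolean-value
        '1015',                                  # number-value (an integer)
        '"SAN FRANCISCO"',                       # quoted string (may contain white space)
        'no_answer',                             # the system declines to answer
        '( ("AA" 152 0920) )',                   # relation containing one three-value tuple
        '( ("L" 5.00) ("R" NIL) )',              # NIL marks missing data within a tuple
        '( ( (1015) ) OR ( (1015) (1020) ) )',   # ambiguous REF: a HYP may match either
    ]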
All the tuples in a relation must have the same number of values, and those values must be of the same respective types (boolean, string, or number).</Paragraph>
<Paragraph position="1"> If a token could represent either a string or a number, it will be taken to be a number; if it could represent either a string or a boolean, it will be taken to be a boolean. Interpretation as a string may be forced by enclosing a token in quotation marks.</Paragraph>
<Paragraph position="2"> In a tuple, NIL as the representation of missing data is allowed as a special case for any value, so a legal answer indicating the costs of ground transportation in Boston would be ((&quot;L&quot; 5.00) (&quot;R&quot; nil) (&quot;A&quot; nil) (&quot;R&quot; nil))
Elementary Rules for CAS Comparisons
String comparison is case-sensitive, but the distinguished values (YES, NO, TRUE, FALSE, NO_ANSWER, and NIL) may be written in either upper or lower case.</Paragraph>
<Paragraph position="3"> Each indexical position for a value in a tuple (say, the ith) is assumed to represent the same field or variable in all the tuples in a given relation.</Paragraph>
<Paragraph position="4"> Answer relations must be derived from the existing relations in the database, either by subsetting and combining relations or by operations like averaging, summation, etc. In matching a hypothesized (HYP) CAS form with a reference (REF) one, the order of values in the tuples is not important; nor is the order of tuples in a relation, nor the order of alternatives in a CAS form using OR. The scoring algorithm will use the re-ordering that maximizes the indicated score. Extra values in a tuple are not counted as errors, but distinct extra tuples in a relation are. A tuple is not distinct if its values for the fields specified by the REF CAS are the same as another tuple in the relation; these duplicate tuples are ignored. CAS forms that include alternate CAS's connected with OR are intended to allow a single HYP form to match any one of several REF CAS forms. If the HYP CAS form contains alternates, the score is undefined.</Paragraph>
<Paragraph position="5"> In comparing two real number values, a tolerance will be allowed; the default is ±.01%. No tolerance is allowed in the comparison of integers. In comparing two strings, initial and final sub-strings of white space are ignored. In comparing boolean values, TRUE and YES are equivalent, as are FALSE and NO.</Paragraph>
<Paragraph position="6"> B Some Examples from the Principles of Interpretation Document for the ATIS Application (Note: these are excerpted from the official Principles of Interpretation document dated 11/20/91.
The entire document comprises about 60 different points, and is available from David Pallett at NIST.</Paragraph>
<Paragraph position="7"> The term &quot;annotator&quot; below refers to a human preparing training or test data by reviewing reference answers to queries.)</Paragraph> </Section> </Section>
<Section position="6" start_page="166" end_page="167" type="metho"> <SectionTitle> INTERPRETING ATIS QUERIES RE THE DATABASE </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="166" end_page="167" type="sub_section"> <SectionTitle> 1.3 Each interpretation must be expressible as one SQL </SectionTitle> <Paragraph position="0"> statement.</Paragraph>
<Paragraph position="1"> At present (11/18/91) a few specified exceptions to this principle are allowed, such as allowing boolean answers for yes/no questions.</Paragraph>
<Paragraph position="2"> 1.4 All interpretations meeting the above rules will be used by the annotators to generate possible reference answers. A query is thus ambiguous iff it has two interpretations that are fairly represented by distinct SQL expressions. The reference SQL expression stands as a semantic representation or logical form. If a query has two interpretations that result in the same SQL, it will not be considered ambiguous. The fact that the two distinct SQL expressions may yield the same answer given the database is immaterial.</Paragraph>
<Paragraph position="3"> The annotators must be aware of the usual sources of ambiguity, such as structural ambiguity, exemplified by cases like &quot;the prices of flights, first class, from X to Y&quot;, in which the attachment of a modifier that can apply to either prices or flights is unclear. (This should be (ambiguously) interpreted both ways, as both &quot;the first-class prices on flights from X to Y&quot; and &quot;the prices on first-class flights from X to Y&quot;.) More generally, if structural ambiguities like this could result in different (SQL) interpretations, they must be treated as ambiguous.</Paragraph> </Section> </Section>
<Section position="7" start_page="167" end_page="167" type="metho"> <SectionTitle> 2 Specific Principles: </SectionTitle>
<Paragraph position="0"> In this arena, certain English expressions have special meanings, particularly in terms of the database distributed by TI in the spring of 1990 and revised in November 1990 and May 1991. Here are the ones we have agreed on: (In the following, &quot;A.B&quot; refers to field B of table A.)</Paragraph> </Section>
<Section position="8" start_page="167" end_page="167" type="metho"> <SectionTitle> 2.1 Requests for enumeration. </SectionTitle>
<Paragraph position="0"> A large class of tables in the database have entries that can be taken as defining things that can be asked for in a query. In the answer, each of these things will be identified by giving a value of the primary key of its table.
These tables are: ... A &quot;one-way&quot; fare is a fare for which round_trip_required = &quot;NO&quot;.</Paragraph>
<Paragraph position="1"> A &quot;round-trip&quot; fare is a fare with a non-null value for fare.round_trip_cost.</Paragraph>
<Paragraph position="2"> The &quot;cheapest fare&quot; means the lowest one-direction fare.</Paragraph>
<Paragraph position="3"> ...</Paragraph>
<Paragraph position="4"> Questions about fares will always be treated as fares for flights in the maximal answer.</Paragraph>
<Paragraph position="5"> The normal answer to otherwise unmodified &quot;when&quot; queries will be a time of day, not a date or a duration.</Paragraph>
<Paragraph position="6"> The answer to queries like &quot;On what days does flight X fly&quot; will be a list of days.day.name fields. Queries that refer to a time earlier than 1300 hours without specifying &quot;a.m.&quot; or &quot;p.m.&quot; are ambiguous and may be interpreted as either.</Paragraph>
<Paragraph position="7"> Periods of the day.</Paragraph>
<Paragraph position="8"> The following table gives precise interpretations for some vague terms referring to time periods. The time intervals given do not include the end points. Items flagged with &quot;*&quot; are in the current ... Requests for the &quot;meaning&quot; of something will only be interpretable if that thing is a code with a canned decoding definition in the database. In case the code field is not the key field of the table, information should be returned for all tuples that match on the code field. Here are the things so defined, with the fields containing their decoding: ... Yes-no questions are considered to be ambiguous between interpretation as a yes-or-no question and interpretation as the corresponding wh-question. For example, &quot;Are there flights from Boston to Philly?&quot; may be answered by either a boolean value (&quot;YES/TRUE/NO/FALSE&quot;) or a table of flights from Boston to Philadelphia.</Paragraph>
<Paragraph position="9"> 2.15 When a query refers to an aircraft type such as &quot;BOEING 767&quot;, the manufacturer (if one is given) must match the aircraft.manufacturer field and the type may be matched against either the aircraft.code field or the aircraft.basic_type field, ambiguously.</Paragraph>
<Paragraph position="10"> 2.16 Utterances whose answers require arithmetic computation are not now considered to be interpretable; this does not apply to arithmetic comparisons, including computing the maximum or minimum value of a field, or counting elements of a set of tuples.</Paragraph> </Section> </Paper>