<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2019">
  <Title>A PROPOSAL FOR SLS EVALUATION</Title>
  <Section position="3" start_page="135" end_page="138" type="metho">
    <SectionTitle>
2 THE PROPOSAL
</SectionTitle>
    <Paragraph position="0"> Here we present the basic substance of our proposal for a Common Answer Specification (CAS), deferring some details and elaboration to Section 3. The CAS, which is designed to support system evaluation using the Comparator, covers four basic areas: the notation for answers, the minimal content of answers, the composition of test corpora, and the procedures used by the Comparator for comparing canonical answers to system output.</Paragraph>
    <Paragraph position="1"> We assume an evaluation architecture like that in Figure 1. Everything on the right hand side of the figure is the developer's responsibility.1 Items on the left side of the diagram will be provided as part of the evaluation process.</Paragraph>
    <Paragraph position="2"> The Common Answer Specification was devised to enable common, automatic evaluation: it is expressly not designed to meet the needs of human users, and should be considered external. This means that developers are free to implement any output format they find convenient for their own use: we only propose to dictate the form of what is supplied as input to the Comparator. It is assumed that some simple post-processing of system output by the developer will be required to conform to the CAS.</Paragraph>
    <Paragraph position="3"> While the central points of the CAS are domainindependent, certain details relating to content can only be determined relative to a specific domain. These portions of the proposal are specified for the personnel domain and SLS Personnel Database (Boisen, 1989), and all examples are taken from that domain as well.</Paragraph>
    <Section position="1" start_page="135" end_page="135" type="sub_section">
      <SectionTitle>
2.1 Notation
</SectionTitle>
      <Paragraph position="0"> S-1 The basic CAS expression will be a relation, that is, a set of tuples. The types of the elements of tuples will be one of the following: Boolean, number, string, or date. Relations consisting of a single tuple with only one element can alternatively be represented as a scalar.</Paragraph>
      <Paragraph position="1"> 1BBN has offered their ERL interpreter (Ramshaw, 1989) as a backend database interface for those who desire one: use of the ERL interpreter is explicitly not required for common evaluation, however, and developers are free to use whatever database interface they find suitable.</Paragraph>
      <Paragraph position="2"> S-2 CAS expressions will be expressed as a Lisp-style parenthesized list of non-empty lists. Scalar answers may alternatively be represented as atomic expressions.</Paragraph>
      <Paragraph position="3"> The two Boolean values are true and false. Numeric answers can be either integers or real numbers. Dates will be a 9-character string like &amp;quot;01-JAN-80&amp;quot;. The number and types of the elements of tuples must be the same across tuples. An empty relation will be represented by the empty list. Alphabetic case, and white-space between elements, will be disregarded, except in strings.</Paragraph>
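As an illustration of the notation just described, here is a minimal sketch of reading a CAS expression into nested lists. This is our own Python code for exposition only, not part of the proposal or of the Comparator; the function name `parse_cas` is ours.

```python
# Illustrative sketch only (not part of the proposal): read a CAS
# answer string into Python values, following the rules above --
# Lisp-style parenthesized lists; atoms that are booleans, numbers,
# or quoted strings (dates are 9-character strings like "01-JAN-80");
# case and inter-element whitespace disregarded, except inside strings.
import re

TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()]+')

def parse_cas(text):
    tokens = TOKEN.findall(text)
    pos = 0

    def atom(tok):
        if tok.startswith('"'):           # strings keep their case
            return tok[1:-1]
        low = tok.lower()                 # case disregarded elsewhere
        if low in ("true", "false"):
            return low == "true"
        try:
            return int(tok)
        except ValueError:
            try:
                return float(tok)         # reals, e.g. 53200.0
            except ValueError:
                return low                # any other bare symbol

    def expr():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return atom(tok)
        items = []
        while tokens[pos] != ")":
            items.append(expr())
        pos += 1                          # consume the ")"
        return items

    return expr()
```

Under this reading, `parse_cas('((4322) (5267))')` yields `[[4322], [5267]]`, and the empty relation `()` becomes the empty list.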
      <Paragraph position="4"> A BNF specification for the syntax of the Common Answer Specification is found in Appendix A. Here are some examples of answers in the CAS format:</Paragraph>
      <Paragraph position="6"/>
    </Section>
    <Section position="2" start_page="135" end_page="137" type="sub_section">
      <SectionTitle>
2.2 Minimal Content
</SectionTitle>
      <Paragraph position="0"> Certain queries only require scalar answers, among them yes/no questions, imperatives like &amp;quot;Count&amp;quot; and &amp;quot;Sum&amp;quot;, questions like &amp;quot;How much/many __ ?&amp;quot;, &amp;quot;How long ago __ ?&amp;quot;, and &amp;quot;When __ ?&amp;quot;. Other queries may require a relation for an answer: for these cases, the CAS must specify (in linguistic terms) which of the entities referred to in the English expression are required to be included in the answer. For required entities, it is also necessary to specify the database fields that identify them.</Paragraph>
      <Paragraph position="1"> S-3 For WH-questions, the required entity is the syntactic head of the WH noun phrase. For imperatives, the required NP is the head of the object NP. The nouns &amp;quot;list&amp;quot;, &amp;quot;table&amp;quot;, and &amp;quot;display&amp;quot; will not be considered &amp;quot;true&amp;quot; heads.</Paragraph>
      <Paragraph position="2"> Examples: For the query &amp;quot;Which chief scientists in department 45 make more than $70000?&amp;quot;, the required entity is the scientists, not the department or their salaries.</Paragraph>
      <Paragraph position="3"> In the case of &amp;quot;Give me every employee's phone number&amp;quot;, the only required entity is the phone number.</Paragraph>
      <Paragraph position="4">  For &amp;quot;Show me a list of all the departments&amp;quot;, the required entity is the departments, not a list.</Paragraph>
      <Paragraph position="5"> For the query &amp;quot;Count the people in department 43&amp;quot;, only the scalar value is required.</Paragraph>
      <Paragraph position="6"> Entities often can be identified by more than one database field: for example, a person could be represented by their employee identification number, their first and last names, their Social Security Number, etc. For any given domain, therefore, the fields that identify entities must be determined. If only one database identifier is available (i.e., only the field salary can represent a person's salary), the choice is clear: in other cases, it must be stipulated.</Paragraph>
      <Paragraph position="7"> S-4 (Personnel) In any evaluation domain, canonical database identifiers for entities in the domain must be determined. For the personnel domain, Table 1 will specify the database fields that are canonical identifiers for entities with multiple database representations. Example: For the query &amp;quot;List department 44 employees&amp;quot;, the required database field is employee-id, so a suitable answer would be</Paragraph>
      <Paragraph position="9"> where 4322, 5267, etc. are employee identification numbers.</Paragraph>
      <Paragraph position="10"> Certain English expressions in the personnel domain are vague, in that they provide insufficient information to determine their interpretation. We therefore stipulate how such expressions are to be construed.</Paragraph>
      <Paragraph position="11"> S-5 (Personnel) In any evaluation domain, the interpretation of vague references must be determined. For the personnel domain, Table 2 will designate the referents for several vague nominal expressions.</Paragraph>
      <Paragraph position="12"> Example: For the query &amp;quot;What is Paul Tai's phone number?&amp;quot;, the expression will be interpreted as a request for Tai's work phone number, not his home phone.</Paragraph>
      <Paragraph position="13">  Another case involves the specificity of job titles. In the Common Personnel Database, there are several job titles that are related to a specific type of profession: for example, &amp;quot;STAFF SCIENTIST&amp;quot;, &amp;quot;SCIENTIST&amp;quot;, &amp;quot;CHIEF SCIENTIST&amp;quot;, etc. We propose that all these be considered scientists in the generic sense for queries like &amp;quot;Is Mary Graham a scientist?&amp;quot; and &amp;quot;How many scientists are there in department 45?&amp;quot;.</Paragraph>
      <Paragraph position="14"> S-6 (Personnel) Someone will be considered a scientist if and only if that person's job-title contains the string &amp;quot;SCIENTIST&amp;quot;. The same is true for the following generic profession titles: engineer, programmer, clerk, and accountant. The terms manager and supervisor will be treated interchangeably, and someone will be considered a manager or supervisor only if that person is the supervisor of some other person.</Paragraph>
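The S-6 convention can be sketched as two simple predicates. The Python below is ours for illustration; the function names and the `supervises` mapping are assumptions, not the actual Common Personnel Database schema.

```python
# Sketch of the S-6 convention (names are illustrative, not the actual
# database schema): a generic profession test looks for the profession
# string inside the job title, while manager/supervisor status depends
# on actually supervising someone.
GENERIC_TITLES = ("SCIENTIST", "ENGINEER", "PROGRAMMER", "CLERK", "ACCOUNTANT")

def holds_profession(job_title, profession):
    """True iff the job title contains the generic profession string."""
    return profession.upper() in job_title.upper()

def is_manager(employee_id, supervises):
    """True iff the employee supervises at least one other person.
    `supervises` maps an employee id to the set of ids supervised."""
    return bool(supervises.get(employee_id))
```

Under this reading, `holds_profession("CHIEF SCIENTIST", "scientist")` holds, while an employee whose title happens to contain &amp;quot;MANAGER&amp;quot; counts as a manager only if `is_manager` is true of them.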
    </Section>
    <Section position="3" start_page="137" end_page="137" type="sub_section">
      <SectionTitle>
2.3 Test Corpora
</SectionTitle>
      <Paragraph position="0"> S-7 The primary corpus will be composed of simple pairs of queries and their answers. In addition, a distinct corpus will be used for evaluating the simple discourse capability of reference to an immediately preceding query. This corpus will contain triplets of (query1, query2, answer2), where query1 contains the referent of some portion of query2. The canonical answer will then be compared to answer2.</Paragraph>
      <Paragraph position="1"> Example: One entry in the discourse corpus might be the following triplet: query1: &amp;quot;What is Mary Smith's job title?&amp;quot; query2: &amp;quot;What is her salary?&amp;quot; answer2: ((52000)) S-8 For the time being, we propose to exclude from the test corpora queries that:</Paragraph>
    </Section>
    <Section position="4" start_page="137" end_page="138" type="sub_section">
      <SectionTitle>
2.4 The Comparator
</SectionTitle>
      <Paragraph position="0"> S-9 The Comparator will use an &amp;quot;epsilon&amp;quot; measure for comparing real numbers. The number in the canonical answer will be multiplied by the epsilon figure to determine the allowable deviation: numbers that differ by more than this amount will be scored as incorrect.</Paragraph>
      <Paragraph position="1"> The value of the epsilon figure will be initially set at 0.0001.</Paragraph>
      <Paragraph position="2"> Example: If the canonical answer is 53200.0, the maximum deviation allowed will be (53200.0 x 0.0001), or approximately 5.32.</Paragraph>
      <Paragraph position="3"> Thus a system answer of 53198.8 would score as correct, but 53190.9 would be incorrect.</Paragraph>
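The epsilon test above reduces to a one-line comparison. This sketch is ours for illustration; it is not the Comparator's actual code.

```python
# Sketch of the S-9 epsilon test: the canonical value scaled by
# epsilon (initially 0.0001) gives the allowable deviation.
EPSILON = 0.0001

def reals_match(canonical, answer, epsilon=EPSILON):
    return abs(canonical - answer) <= abs(canonical) * epsilon
```

With a canonical answer of 53200.0 the tolerance is about 5.32, so `reals_match(53200.0, 53198.8)` holds but `reals_match(53200.0, 53190.9)` does not.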
      <Paragraph position="4">  S-10 Extra fields may be present in an answer relation, and will be ignored by the Comparator. The order of fields is also not specified: the mapping from fields in a system answer to fields in the canonical answer will be determined by the Comparator.</Paragraph>
      <Paragraph position="5"> Example: For the query &amp;quot;Show Paul Tai's name and employee id&amp;quot;, with the canonical</Paragraph>
      <Paragraph position="7"> any of the following would be an acceptable answer:</Paragraph>
      <Paragraph position="9"> S-11 The only output of the Comparator will be &amp;quot;correct&amp;quot; or &amp;quot;incorrect&amp;quot;. Capturing more subtle degrees of correctness will be accomplished by the quantity and variety of the test data.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="138" end_page="140" type="metho">
    <SectionTitle>
3 DISCUSSION
</SectionTitle>
    <Paragraph position="0"> This section presents some of the justification supporting the proposal in Section 2, as well as amplifying several points and discussing some possible shortcomings and extensions. It may be usefully omitted by those who are not interested in these details. The organization follows the four areas of the proposal: notation, minimal content, test corpora, and the Comparator procedures.</Paragraph>
    <Section position="1" start_page="138" end_page="138" type="sub_section">
      <SectionTitle>
3.1 Notation
</SectionTitle>
      <Paragraph position="0"> The proposal in S-2 allows scalar values to be represented either as relations or as atomic expressions, i.e., as either false or ((false)). Treating the answer to questions like &amp;quot;Is Paul Tai a clerk?&amp;quot; as a relation seems somewhat unfortunate. This allows for a completely homogeneous representation of answers, however, and is permitted for this reason. The range of types in S-1, while adequate for the personnel domain, may need to be enlarged for other domains. One obvious omission is a type for amounts: units are not specified for numbers. In the personnel domain, there are only two amounts, years and dollars, neither of which is likely to require expression in other units (though one could conceivably ask &amp;quot;How many days has Paul Tai worked for the company?&amp;quot;). Other domains would not be similarly restricted, and might require unit labels for amounts of length, speed, volume, etc. Of course, it is always possible to simply specify canonical units for the domain and require all numbers to be in those units: this would increase the effort required to conform to the CAS, however.</Paragraph>
      <Paragraph position="1"> Answers to requests for percentages should express the ratio directly, not multiplying by 100: so if 45 out of 423 employees have PhD's, the answer to &amp;quot;What percentage of employees have PhD's?&amp;quot; should be</Paragraph>
    </Section>
    <Section position="2" start_page="138" end_page="139" type="sub_section">
      <SectionTitle>
3.2 Minimal Content
</SectionTitle>
      <Paragraph position="0"> Under section S-3 of the proposal, one should note the possibility of a required NP being conjoined or modified by a &amp;quot;conjoining&amp;quot; PP modifier, as in the following examples: List the clerks and their salaries.</Paragraph>
      <Paragraph position="1"> List the clerks with their salaries.</Paragraph>
      <Paragraph position="2"> Clearly in both of these cases the salaries as well as the employee IDs should be required in the answer.</Paragraph>
      <Paragraph position="3"> One possible objection to the approach proposed in S-3 is that it ignores some well-documented concerns about the &amp;quot;informativeness&amp;quot; or pragmatic capabilities of question-answering systems. For example, in a normal context, a suitably informative answer for List the salary of each employee.</Paragraph>
      <Paragraph position="4"> would provide not just a list of salaries but some identification of the employees as well, under the assumption that this is what the user really wants. Since the purpose of the CAS is automatic evaluation rather than user convenience, this objection seems irrelevant, at least until a metric for measuring pragmatic capabilities can be defined. Note that S-10 means developers are free to include extra fields for pragmatic purposes: such fields are simply not required for correctness. A similar point can be made concerning vague expressions (S-5). Only the street field is explicitly required for references to an address, since that should be sufficient to determine correctness, but developers may also include city and state if they wish.</Paragraph>
      <Paragraph position="5">  One might argue that the proposed treatment of manager/supervisor in S-6 is inconsistent with the approach taken for scientists, engineers, programmers, etc. Our decision is essentially to consider manager and supervisor as technical terms which indicate a supervisory relation to another employee, rather than generic descriptions of a profession. This has the possibly unfortunate consequence that employees in the Common Personnel Database with job titles like &amp;quot;SUPERVISOR&amp;quot; and &amp;quot;PROJECT MANAGER&amp;quot; may not be considered supervisors or managers in this technical sense. Nevertheless, given that these titles seem less like descriptions of a profession, the approach is not inconsistent.</Paragraph>
      <Paragraph position="6"> There are probably other vague expressions which will have to be dealt with on a case-by-case basis, either by agreeing on a referent or eliminating them from the corpora.</Paragraph>
    </Section>
    <Section position="3" start_page="139" end_page="140" type="sub_section">
      <SectionTitle>
3.3 Test Corpora
</SectionTitle>
      <Paragraph position="0"> Some comments are in order about various items excluded by section S-8 of the proposal. For example, ambiguity (as opposed to vagueness) is barred at present, primarily to simplify the Common Answer Specification.</Paragraph>
      <Paragraph position="1"> It would not be difficult to enhance the canonical answer to include several alternatives for queries which were genuinely ambiguous: then the Comparator could see if a system answer matched any of them, counting at least one match as correct. A more challenging test would be to only score answers as correct that matched all reasonable answers. This would obviously require substantial consensus on which queries were ambiguous and what the possible readings were.</Paragraph>
      <Paragraph position="2"> Presupposition is another area where one could imagine extensions to the proposal: for queries like List the female managers who make more than $50000.</Paragraph>
      <Paragraph position="3"> there is an implicit assumption that there are, in fact, female managers. If no managers are female, the answer would presumably be an empty relation. The reason the answer set would be empty, however, would have nothing to do with the data failing to meet some set of restrictions, as in ordinary cases. Rather, the object NP would have no referent: this is a different kind of failure, and systems that can detect it have achieved a greater degree of sophistication, which presumably ought to be reflected in their evaluation. Such failure to refer can be subdivided into two cases: necessary failure: cases which are impossible by definition. For example, the set of men and the set of women are disjoint: therefore &amp;quot;the female men&amp;quot; necessarily fails to refer to any entity.</Paragraph>
      <Paragraph position="4"> contingent failure: cases which happen to fail because of the state of the world (as in the example above).</Paragraph>
      <Paragraph position="5"> One modification to the CAS that would include such cases would be to extend S-1 to include types for failure, with several possible values indicating the failure encountered. Then the canonical answer to the example above might be the special token contingent-failure rather than simply (). Until a broader consensus on this issue is achieved, however, we consider the best approach to be the elimination of such cases from the test corpora.</Paragraph>
      <Paragraph position="6"> Section S-8 excludes queries whose answer is not available due to missing data. As an example, in the SLS Personnel Database the home phone number for Richard Young, an attorney, is missing, and is therefore entered as NIL. We therefore propose to exclude queries like &amp;quot;What is Richard Young's home phone?&amp;quot; from test corpora, since no data is available to answer the question. On the other hand, the query &amp;quot;List the home phone numbers of all the attorneys&amp;quot; would not be excluded. The answer here would be the set of phone numbers, including Richard Young's:</Paragraph>
      <Paragraph position="8"> Queries involving sorting are currently omitted pending resolution of several technical problems: * Given that extra fields are allowed, the primary sort keys would have to be specified by the CAS.</Paragraph>
      <Paragraph position="9"> Different sites might have different and incompatible approaches to sub-sorts, if the primary keys are not unique.</Paragraph>
      <Paragraph position="10"> * Since relations are assumed to not be ordered, a different notation for sorted answers would be needed.  In addition to these problems, evaluating queries that require sorting would seem to contribute little to understanding the key technological capabilities of an SLS, and is therefore at best a marginal issue. In light of these points, we consider it expedient to simply omit such cases for the present.</Paragraph>
    </Section>
    <Section position="4" start_page="140" end_page="140" type="sub_section">
      <SectionTitle>
3.4 Comparator
</SectionTitle>
      <Paragraph position="0"> The epsilon measure proposed in S-9 assumes that, if an SLS does any rounding of real numbers, it will not exceed the epsilon figure. Two particular cases where this might be problematic are percentages and years: repeating an earlier example, the correct answer to a query like &amp;quot;What percentage of employees have PhDs?&amp;quot; might be  and rounding this to 0.11 would score as incorrect. Similarly, for a query like &amp;quot;How many years has Sharon Lease worked here?&amp;quot;, the years must not be treated as whole units: an answer like  would score as correct, but 37 would be incorrect. One consequence of S-10 is the small possibility of an incorrect system answer being spuriously scored as correct. This is especially likely when there are only a few tuples and elements, and the range of possible values is small. Yes/no questions are an extreme example: an unscrupulous developer could always get such queries correct by simply answering</Paragraph>
      <Paragraph position="2"> since the Comparator would generously choose the right case. Eliminating such aberrations would require distinguishing queries whose answers are relations from those that produce scalars, and imposing this distinction in the CAS. We therefore assume for the time being that developers will pursue more principled approaches, and we rely on the variety of test corpora to de-emphasize the significance of these marginal cases.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="140" end_page="141" type="metho">
    <SectionTitle>
4 THE COMPARATOR
</SectionTitle>
    <Paragraph position="0"> In this section we describe the Comparator, a Common Lisp program for comparing system output (conforming to the CAS) with canonical answers. We have chosen this name to reflect an important but subtle point: evaluation requires human judgement, and therefore the best we can expect from a program is comparison, not evaluation. Since the degree to which system output reflects system capabilities is always imperfect, we view the results of the Comparator as only one facet of the entire effort of evaluating Spoken Language Systems. The Comparator software is available without charge from BBN Systems and Technologies Corporation, which reserves all rights to the software. To obtain a copy, contact sboisen@bbn.com. The Comparator takes two inputs: the answer from a particular SLS, and the canonical answer. The output is a Boolean value indicating whether the system-supplied answer matched the canonical one. To make that judgement, the Comparator needs to perform type-appropriate comparisons on the individual data items, and to handle correctly system answers that contain extra values.</Paragraph>
    <Paragraph position="1"> As described in S-9, real numbers are compared using an epsilon test that compares only a fixed number of the most significant digits of the two answers. The number of digits compared is intended to generously reflect the accuracy range of the least accurate systems involved.</Paragraph>
    <Paragraph position="2"> Note that there is still some danger of numeric imprecision causing an answer to be counted wrong if the test set includes certain pathological types of queries, like those asking for the difference between two very similar real numbers.</Paragraph>
    <Paragraph position="3"> The other, more major issue for the Comparator concerns the fact that table answers are allowed to include extra columns of information, as long as they also include the minimal information required by the canonical answer (S-10). Note that these additional columns can mean that the system answer will also include extra tuples not present in the canonical answer. For example, if Smith and Jones both make exactly $40,000, they would contribute only one tuple to a simple list of salaries, but if a column of last names were included in the answer table, there would be two separate tuples.</Paragraph>
    <Paragraph position="4"> What the Comparator does with table answers is to explore each possible mapping from the required columns found in the canonical answer to the actual columns found in the system-supplied answer. (Naturally, there must be at least as many columns as in the canonical answer, or the system answer is clearly incorrect.) Applying each mapping in turn to the provided answer, the Comparator builds a reduced answer containing only  those columns indicated by the mapping, with any duplicate tuples in the reduced answer eliminated. It is this reduced answer that is compared with the canonical answer, in terms of set equivalence.</Paragraph>
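The mapping search just described can be sketched as follows. This is our Python rendering for exposition only; the actual Comparator is the Common Lisp program described above, and for brevity this sketch compares elements exactly rather than applying the epsilon test to reals.

```python
# Sketch of the column-mapping comparison: try each injective mapping
# from canonical columns to system-answer columns, project the system
# answer onto the mapped columns, drop duplicate tuples, and compare
# the reduced answer to the canonical answer as a set.
from itertools import permutations

def tables_match(canonical, system):
    if not canonical or not system:
        return canonical == system        # only empty matches empty
    n_can = len(canonical[0])
    n_sys = len(system[0])
    if n_sys < n_can:                     # too few columns: incorrect
        return False
    target = {tuple(row) for row in canonical}
    for cols in permutations(range(n_sys), n_can):
        reduced = {tuple(row[c] for c in cols) for row in system}
        if reduced == target:
            return True
    return False
```

For the Smith/Jones example above, `tables_match([[40000]], [["SMITH", 40000], ["JONES", 40000]])` holds: projecting onto the salary column and removing the duplicate tuple leaves exactly the canonical relation.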
    <Paragraph position="5"> Finally, it should be stressed that the Comparator works within the context of relational database principles. It treats answer tables as sets of tuples, rather than lists. This means first that order of tuples is irrelevant. It also means that duplicate tuples are given no semantic weight; any duplicates in a provided answer will be removed by the Comparator before the comparison is made.</Paragraph>
  </Section>
  <Section position="6" start_page="141" end_page="141" type="metho">
    <SectionTitle>
5 CORPUS DEVELOPMENT AND TAGGING
</SectionTitle>
    <Paragraph position="0"> Any corpus which is collected for SLS development and testing will be more useful if it is easily sub-divided into easier and harder cases. Different systems have different capabilities, particularly with respect to handling discourse phenomena: the ideal corpus will therefore include both the most basic case (i.e., no discourse phenomena) and enough difficult cases to drive more advanced research.</Paragraph>
    <Paragraph position="1"> We propose the tagging of corpora using a hierarchical categorization that reflects the richness of context required to handle queries. These categories primarily distinguish levels of effort for Spoken Language Systems, such that the lowest category should be attempted by every site, and the highest category attempted only by the most ambitious. Two other criteria for the categorization are the following: * Categories should be maximally distinctive: there is no need for fine distinctions that in practice only separate a few cases.</Paragraph>
    <Paragraph position="2"> * Categories should be easily distinguishable by those who will do the tagging. That is, they should be objective and clearly defined rather than relying on sophisticated linguistic judgements.</Paragraph>
    <Paragraph position="3"> Here is a candidate categorization, where the category number increases as the context required becomes successively richer: Category 0: no extra-sentential context is required (i.e., &amp;quot;0&amp;quot; context): the sentence can be understood in isolation. This is the default case.</Paragraph>
    <Paragraph position="4"> Category 1: &amp;quot;local&amp;quot; extra-sentential reference, excluding reference to answers: that is, the sentence can be understood if the text of the previous question is available. One is not allowed to go back more than one question, or look at the answer, to find the referent.</Paragraph>
    <Paragraph position="5"> Category 2: ellipsis cases, such as the following sequence: &amp;quot;What's Sharon Lease's salary?&amp;quot; &amp;quot;How about Paul Tai?&amp;quot; Category 3: &amp;quot;non-local&amp;quot; reference. The referent is in the answer to the previous query, or in the text of a query earlier than the previous one. This probably includes several other kinds of phenomena that would be usefully separated out at a later date.</Paragraph>
    <Paragraph position="6"> Category X: all cases excluded from corpora, both for discourse and other reasons (see S-8).</Paragraph>
    <Paragraph position="7"> We propose two initial uses of this categorization for SLS evaluation: creating basic test corpora of Category 0 queries, and designing simple discourse corpora that include Category 1 queries (see S-7). The other categories would enable developers either to focus on or to eliminate from consideration more difficult kinds of discourse phenomena.</Paragraph>
    <Paragraph position="8"> There may be other categories which are of interest to particular developers: for such cases, it is suggested that the developer do their own site-specific tagging, using those features which seem reasonable to them.</Paragraph>
    <Paragraph position="9"> This scheme is offered solely to expedite community-wide evaluation: there are many possible categorizations which might be useful for other purposes, but they are independent of the considerations of common evaluation.</Paragraph>
  </Section>
</Paper>