<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1008">
  <Title>SESSION 2: DARPA RESOURCE MANAGEMENT AND ATIS BENCHMARK TEST POSTER SESSION</Title>
  <Section position="1" start_page="0" end_page="50" type="metho">
    <SectionTitle>
SESSION 2: DARPA RESOURCE MANAGEMENT AND
ATIS BENCHMARK TEST POSTER SESSION
</SectionTitle>
    <Paragraph position="0"> DARPA contractors and others prior to the February 1991 meeting. Results were reported to NIST and scored using &amp;quot;official&amp;quot; scoring software and reference answers and the results were reported to the participants.</Paragraph>
    <Paragraph position="1"> All papers in this poster session at the DARPA speech workshop reported results obtained using the Benchmark Test Material. The Workshop Planning Committee suggested a three-part session format consisting of: (1) Introductory remarks, (2) one hour to review and discuss posters, and (3) open discussion.</Paragraph>
    <Paragraph position="2"> Section H of this paper presents an overview describing the approaches used by the participants in this session, while Section III summarizes the open discussion. Section IV describes the benchmark test material selection process and benchmark test protocols. Section V presents tabulations of these results, and discussion of these results is included in Section VI.</Paragraph>
    <Paragraph position="3"> II. SESSION OVERVIEW A total of fourteen papers were presented in poster form. Eleven of the papers were presented by DARPA SLS contractors, and three were from non-DARPA sites.</Paragraph>
    <Paragraph position="4"> Five papers dealt with speech recognition systems: (1) The group at Dragon Systems reported speaker-dependent system results for the Resource Management (RM) test set, using the word-pair grammar \[1\]. Dragon's results were obtained on a 25 Mhz 80486-based PC, with an RM vocabulary modelled using &amp;quot;roughly 30,000 phonemes in context or PICs&amp;quot;, and making use of the Dragon rapid match module.</Paragraph>
    <Paragraph position="5"> (2) Doug Paul of MIT Lincoln Laboratory reported speech recognition results for both the RM and ATIS SPREC test material \[2\]. Recent work includes: variations in semiphone modelling, a &amp;quot;very simple improved duration model&amp;quot; responsible for reducing the error rate by about 10%, a new training strategy, and modifications to the recognizer to use back-off bigrarn language models.</Paragraph>
    <Paragraph position="6"> (3) The Spoken Language Group at MIT's Laboratory for Computer Science also reported results for both the RM and ATIS SPREC test material \[3\]. The MIT SUMMIT system is a &amp;quot;segment-based&amp;quot; speech recognition system, including a front end that incorporates a model of the auditory system, a hierarchical segmentation algorithm to identify a network of possible acoustical segments, segmental measurements, and a statistical classifier to produce a phonetic network. The best-scoring word sequence is derived by matching the phonetic network against a pronunciation network.</Paragraph>
    <Paragraph position="7"> Recent developments have incorporated more complex context-dependency modelling as well as an improved corrective training procedure.</Paragraph>
    <Paragraph position="8"> (4) Francis Kubala et al. reported on BBN's BYBLOS results for both the RM and ATIS SPREC test material \[4\]. The reported RM speaker-independent results include results for a SI model built using only 12 training speakers. BBN's ATIS results include speaker-independent results for two conditions. &amp;quot;The first is a controlled condition using a specific training set and bigram grammar&amp;quot; \[similar to that used by Paul \[2\]\]. The second condition makes use of augmented training data (collected at BBN) and a 4-grarn class grammar.</Paragraph>
    <Paragraph position="9"> (5) A collaborative effort involving Marl Ostendorf and her colleagues at Boston University and others at BBN makes use of a general formalism for integrating two or more speech recognition technologies \[5\]. &amp;quot;In this formalism, one system uses the N-best search strategy to generate a list of candidate sentences; the list is restored by other systems; and the different scores are combined to optimize perforrnanee.&amp;quot; Ostendorf et al. &amp;quot;report on combining the BU system based on stochastic segment models and the BBN system based on hidden Markov models.&amp;quot; Six papers were presented by DARPA contractors describing integrations of speech and natural language processing into ATIS systems.</Paragraph>
    <Paragraph position="10"> (1) The Spoken Language Group at MIT's Laboratory for Computer Science presented a status report on the MIT ATIS system \[6\]. A context-independent version of the SUMMIT system (described in \[3\]) including a word-pair grammar with perplexity 92 has been incorporated. The back-end has been redesigned, and the parser now produces an intermediate semantic-frame representation &amp;quot;which serves as the focal-point for all back- end operations.&amp;quot; Results are reported for both the February '91 ATIS benchmark test set and for a test set collected at MIT.</Paragraph>
    <Paragraph position="11"> (2) The Speech and Natural Language Groups at SRI reported results for both the RM and ATIS SPREC speech recognition test sets and for the ATIS NL and SLS tests \[7\]. The primary emphasis of the SRI presentation was to describe improvements to the SRI DECIPHER speech recognition system, a component in SRI's ATIS system.</Paragraph>
    <Paragraph position="12">  Recent &amp;quot;significant&amp;quot; performance improvements are attributed to the addition of tied-mixture HMM modelling. Other approaches discussed include experiments with male-female separation, speaker adaptation, rejection of out-of-vocabulary input, and language modelling (including the use of multi-word lexical units). SRI's &amp;quot;simple serial integration of speech and natural language processing&amp;quot; is said to work well &amp;quot;because the speech recognition system uses a statistical language model to improve recognition performance, and because the natural language processing uses a template matching approach (described elsewhere in this proceedings) that makes it somewhat insensitive to recognition errors&amp;quot;.</Paragraph>
    <Paragraph position="13"> (3) Wayne Ward presented one of two papers from CMU describing the CMU ATIS System, &amp;quot;PHOENIX&amp;quot; \[8\]. The speech recognition component consists of a recent vocabulary-independent version of SPHINX, presently without incorporation of out-of- vocabulary models.</Paragraph>
    <Paragraph position="14"> PHOENIX's &amp;quot;concept of flexible parsing combines frame-based semantics with a semantic phrase grammar,&amp;quot; so that the &amp;quot;operation of the parser can be viewed as 'phrase spotting.'&amp;quot; Language modelling included a bigram model for the recognizer and a grammar for the parser.</Paragraph>
    <Paragraph position="15">  (4) The second paper from CMU, by Sheryl Young, described the &amp;quot;structure and operation of SOUL (for Semantically-Oriented Understanding of Language)&amp;quot; \[9\]. SOUL can use semantic and pragmatic knowledge to correct, reject and/or clarify the outputs of the PHOENIX case frame parser in the ATIS domain.</Paragraph>
    <Paragraph position="16"> (5) BBN's NL group reported on the BBN DELPHI natural language system and the integration of this system with the BBN BYBLOS system (described in \[4\]), using an N-best architecture \[10\]. The BBN authors cite a number of improvements to the DELPHI system that are described in other papers in this Proceedings.</Paragraph>
    <Paragraph position="17"> (6) Recent work on the Unisys ATIS Spoken Language System  was described by Norton et al. \[11\]. &amp;quot;Enhancements to the system's semantic processing for handling non-transparent argument structure and enhancements to the system's pragmatic processing of material in answers displayed to the user&amp;quot; are described. In addition to the Unisys system's NL results, results were reported for the case of SLS systems consisting of the Unisys natural language system coupled with two ATIS speech recognition systems: (1) the MIT SUMMIT system (described in \[3\]) and (2) the MIT Lincoln Labs system (described in \[2\]). The Unisys system's natural language constraints were also used to select the first-best of N-best speech recognition results (for the SPREC tests) based on syntactic, semantic and pragmatic knowledge.</Paragraph>
    <Paragraph position="18"> Three papers were presented by non-DARPA sites.</Paragraph>
    <Paragraph position="19"> (1) Douglas O'Shaughnessy described &amp;quot;the initial development of a natural language text processor, as the first step in an INRS \[INRS-Telecommunications, University of Quebec\] dialogue- by-voice system \[12\]. A keyword slot-filling approach is used, rather than a &amp;quot;standard parser for English.&amp;quot; (2) In one of two papers from AT&amp;T Bell Laboratories included in this session, Evelyne Tzoukermarm described &amp;quot;The Use of a Commercial Natural Language Interface in the ATIS Task&amp;quot; \[13\]. Tzoukermarm relates their &amp;quot;experience in adapting \[a commercial natural language interface\] to handle domain dependent ATIS queries.&amp;quot; The discussion of error analysis notes that, in contrast to the &amp;quot;well-formed&amp;quot; written English for which the commercial product was designed, spontaneous speech contains repetitions, restarts, deletions, interjections and ellipsis, as well as the omission of punctuation marks that &amp;quot;might give the  system information&amp;quot;.</Paragraph>
    <Paragraph position="20"> (3) The second AT&amp;T Bell Laboratories paper, by Pieraccini,  Levin and Lee, proposes &amp;quot;a model for a statistical representation of the conceptual structure in a restricted subset of spoken natural language&amp;quot; \[14\]. The &amp;quot;technique of ease decoding&amp;quot; is applied to the Class A sentences in the ATIS domain, with sentences analyzed in terms of 7 general cases: QUERY, OBJECT, ATTRIBUTE, RESTRICTION, Q\[UERY\] ATTRIBUTE, AND, and DUMMY. Unlike other papers in this session, this paper implements a non-standard test paradigm that prevents explicit comparisons with the results cited for other systems. To address this shortcoming, the authors indicate that they &amp;quot;are developing a module that translates the conceptual representation into an SQL query&amp;quot;. Presumably the SQL queries, in conjunction with the ATIS relational database, will permit use of existing DARPA ATIS queryanswer performance evaluation procedures.</Paragraph>
    <Paragraph position="21"> HI. DISCUSSION Following review of the posters, a number of issues were discussed.</Paragraph>
    <Paragraph position="22"> (1) Differences between ATIS Test Sets: It was noted that there were a number of differences between the June 1990 and February 1991 ATIS test sets, including evidence of greater-than-expected incidence of dysfluencies in the speech and &amp;quot;skewed&amp;quot; or disproportionate representation of some syntactic/semantic phenomena. Doug Paul noted that the test set perplexity for the June 1990 &amp;quot;Class A&amp;quot; test set was 18, in contrast with 22 for the present &amp;quot;Class A&amp;quot; test set, and a perplexity of 45 for the &amp;quot;non-Class A&amp;quot; test material (i.e., all other utterances). Inferences about &amp;quot;progress&amp;quot; or &amp;quot;trends&amp;quot; may thus be complicated by these differences between test sets. (2) Limited training material: Also noted was the fact that only a limited amount of fully &amp;quot;canonized&amp;quot; training material--for training acoustic models and for studying such phenomena as dialogue modelling--was available prior to this meeting, in some cases limiting system development. This factor was cited in a number of papers e.g., \[2, 4, 8, 10\]).</Paragraph>
    <Paragraph position="23">  (3) Limitations on the future value of the Resource</Paragraph>
    <Section position="1" start_page="49" end_page="50" type="sub_section">
      <SectionTitle>
Management Corpora:
</SectionTitle>
      <Paragraph position="0"> Hy Murveit noted his belief that demonstrable progress in recognizing speaker independent RM1 speech was limited by &amp;quot;how much information we can tease out of \[3990\] training  utterances&amp;quot;. Richard Schwartz took exception to this, citing IV. BENCHMARK TEST MATERIAL AND steady progress in recognizing RM speech. PROTOCOLS  (4) Properties of ATIS-domain speech: Richard Schwartz shared some analysis of the ATIS-domain test set speakers. He noted that there was one speaker in the test set  with &amp;quot;24 instances of 'uh' in 12 sentences&amp;quot;, \[which leads to\] &amp;quot;a 50% word error rate&amp;quot; for that speaker. On the basis of his analysis, he noted that &amp;quot;people don't know how to talk to a system&amp;quot;, and suggested that there ought to be more user/speaker feedback during the data collection process so that the incidence of dysfluencies would be reduced. In response, Hy Murveit noted that if we regard the two worst speakers in the test material as atypical, then &amp;quot;the current word error rate is close to 15%, and with some success in modelling the 'urns' and 'ers', the error rate may be only 10%, or abeut twice as bad as for RM&amp;quot;. Correlation was noted between difficulty in recognizing both the speech and \[in understanding\] the natural language for the &amp;quot;bad&amp;quot; speakers, so that the suggestion that these speakers may be atypical may be warranted.</Paragraph>
      <Paragraph position="1"> Patti Price noted that the \[speech recognition\] error rates suggest that &amp;quot;ATIS is more difficult, but we don't know why&amp;quot;. It may be that ATIS speech is &amp;quot;more casual&amp;quot;, but we need to study these issues in more detail, especially as they affect data collection.</Paragraph>
      <Paragraph position="2"> (5) Selection of the February ATIS test material: Victor Zue and Rich Schwartz asked about selection of the February 1991 ATIS test set, asking if there had been screening to select or reject potential test material on the basis of the incidence of dysfluencies noted in the transcriptions. NIST noted that the only such screening was to partition some of the utterances into the &amp;quot;Optional&amp;quot; categories on the basis of evidence of verbal deletions in the &amp;quot;lexical SNOR&amp;quot; transcriptions, since this evidence does not appear in the conventional SNOR transcriptions. For the June test set, there was no such screening, since attention had not been directed to the subset containing verbal deletions.</Paragraph>
      <Paragraph position="3"> (6) Use of &amp;quot;Baseline&amp;quot; or &amp;quot;Reference&amp;quot; Conditions: John Makhoul noted that there &amp;quot;too many uncontrolled variables&amp;quot; (e.g., algorithms, training materials, grammar) to make comparisons of the ATIS speech recognition systems beneficial using the SPREC results. BBN had advocated use of a &amp;quot;baseline&amp;quot; condition and provided SPREC data for both a &amp;quot;baseline&amp;quot; and an &amp;quot;augmented&amp;quot; training condition to permit such comparisons \[4\]. MIT/LL also made use of this &amp;quot;baseline&amp;quot; condition \[2\]. Makhoul noted that a similar situation (i.e., &amp;quot;too many uncontrolled variables&amp;quot;) applies for the case of the NL results. Hy Murveit noted that SRI's reluctance to &amp;quot;lock into a baseline condition&amp;quot; was based on a reluctance to choose one with the 'wrong operating point'&amp;quot;, based on inadequate training. Francis Kubala noted, however, that choosing a &amp;quot;baseline that undershoots&amp;quot; \[performance\] ought not to be a problem if one wished to &amp;quot;demonstrate clear wins&amp;quot;, and that such a baseline could be changed over time. John Makhoul also noted that reporting error rates is in general preferable to reporting &amp;quot;scores&amp;quot;.</Paragraph>
    </Section>
    <Section position="2" start_page="50" end_page="50" type="sub_section">
      <SectionTitle>
IV. BENCHMARK TEST MATERIAL AND PROTOCOLS
Benchmark Test Material
</SectionTitle>
      <Paragraph position="0"> One portion of the test material consisted of beth &amp;quot;Speaker-Dependent&amp;quot; and &amp;quot;Speaker-Independent&amp;quot; test sets from the Resource Management (RM1) Corpus, for use in tests of speech recognition technology. Each of these test sets consisted of 300 sentence utterances. The most recent tests using the RM1 corpus were conducted prior to the October 1989 Meeting, sixteen months ago. A second portion of the test material consisted of Air Travel Information System (ATIS) domain speech material and related transcriptions. This material was collected at TI in recent months, using the &amp;quot;Wizard&amp;quot; protocol described by Hemphill at the June Meeting \[15\]. There were a total of 9 speakers in the ATIS test set.</Paragraph>
      <Paragraph position="1"> This material was partitioned into four subsets: one subset consisting of an extension of the &amp;quot;Class A&amp;quot; category used at the June meeting (expanded to include &amp;quot;testably ambiguous&amp;quot; queries) and containing 145 queries, a second subset consisting of 38 Class D1 query pairs, and two additional smaller &amp;quot;optional&amp;quot; subsets that included examples of &amp;quot;verbal deletion&amp;quot; and/or &amp;quot;verbal correction&amp;quot; (i.e., Optional Class A and Optional Class D1). The transcriptions used as input to the NL systems and for scoring the ATIS SPREC tests were provided using a recently developed &amp;quot;lexical SNOR&amp;quot; format.</Paragraph>
      <Paragraph position="2"> CMU reported benchmark Resource Management results that were not represented in the poster session. These data from CMU are included in the tables of reported results. A paper describing how these results were achieved appears in \[17\].</Paragraph>
      <Paragraph position="3"> Benchmark Test Protocols In addition to the Resource Management speech recognition tests, for which there is considerable precedent, the ATIS material could be used for three tests: (1) spontaneous ATIS-domain SPeech RECognition component tests (designated as SPREC in this paper), (2) ATIS-domain Natural Language system component tests (designated as NL), and (3) complete ATIS-domain Spoken Language System tests (designated as SLS). During the June meeting, several sites reported results for NL tests, with CMU being the sole site to report complete SLS test results at that meeting \[16\].</Paragraph>
      <Paragraph position="4"> The SPREC test design was outlined by an ad hoc Working Group chaired by Victor Zue, with scoring software adapted for this purpose by NIST. This is the first time that the SPREC test has been implemented.</Paragraph>
      <Paragraph position="5"> In computing results tabulated for the NL and SLS tests, the most reeeent version of the NIST &amp;quot;comparator&amp;quot; was used to compare the hypothesized CAS-format answers against NIST's &amp;quot;canonical&amp;quot; reference answers, as described in a previous paper \[16\]. Answers are scored as either &amp;quot;True&amp;quot;, &amp;quot;False&amp;quot;, or (if the No_Answer option has been exercised) &amp;quot;No Answer&amp;quot;.</Paragraph>
      <Paragraph position="6"> A DARPA SLS Coordinating Committee decision in November, 1990 suggested computation of a &amp;quot;weighted error percentage&amp;quot; on the basis that (on &amp;quot;intuitive grounds&amp;quot;) &amp;quot;a false answer is twice as bad as no answer&amp;quot;. The weighted error so defmed consists of two times the percentage of total queries in the subset that are scored &amp;quot;False&amp;quot; plus the percentage scored 53.</Paragraph>
      <Paragraph position="7"> &amp;quot;No Answer&amp;quot;. A single-number &amp;quot;score&amp;quot; may be derived by subtracting the weighted error from 100%, providing a single-number score that may range from -100%, for the case of all false answers, to +100% for all true answers.</Paragraph>
      <Paragraph position="8"> A &amp;quot;Class D1 test protocol&amp;quot; was developed and used on a trial basis for these tests. Class D1 consists of query pairs for which the second query (&amp;quot;Q2&amp;quot;) has been classified as &amp;quot;context dependent&amp;quot;, and for which an answerable prior query (&amp;quot;QI&amp;quot;) has been identified as defining the context for Q2. Scoring of Class D1 query pairs was for the answers provided only for Q2, regardless of the answers provided for the context-setting query, Q1.</Paragraph>
      <Paragraph position="9"> The Class D1 test protocol had never previously been implemented, and its usage was regarded by many participants more as a &amp;quot;debugging of a test protocol&amp;quot; than as a valid indicator of systems' abilities to handle context-dependent queries. It is also the case that the amount of labelled &amp;quot;Class Di&amp;quot; training material was extremely small and that it was not widely available until shortly before the test --thus limiting system developers' abilities to make adequate use of the training material. Future implementations of the Class D1 test protocol will undoubtedly yield more significant results.</Paragraph>
      <Paragraph position="10"> The &amp;quot;optional&amp;quot; test subsets are not discussed extensively in Section VI since these subsets axe too small, and their usage too limited, to have significance.</Paragraph>
      <Paragraph position="11"> V. BENCHMARK TEST RESULTS Tables 1 - 4 (included at the end of the text of this paper) present tabulations of results reported to NIST for uniform scoring against the final &amp;quot;official&amp;quot; sets of reference transcriptions and reference answers.</Paragraph>
      <Paragraph position="12"> Some of these numbers may differ slightly from those reported at the meeting or in some of the papers in this proceedings, since earlier results reported at the Asilomar meeting were derived with: (1) a slightly larger Class A test set (148 vs. 145 queries), since the classification of 3 utterances, originally included in the Class A subset, was reconsidered, after the meeting, and determined to be &amp;quot;unanswerable&amp;quot; and thus not Class A, and (2) the reference answers for several utterances were corrected andlor modified in response to comments from the participants in these tests. However, these differences are not likely to be statistically significant.</Paragraph>
      <Paragraph position="13"> Designation of a set of results as &amp;quot;LATE&amp;quot; signifies that the results were received at NIST some time after midnight on February 6th, 1991. &amp;quot;COB&amp;quot; on that date had been designated as the due date for submission of results. In some cases prior notice had been given to NIST that results would arrive &amp;quot;late&amp;quot;, and in a few cases, late results were invited for the sake of completeness and to permit informative comparisons with earlier results.</Paragraph>
    </Section>
    <Section position="3" start_page="50" end_page="50" type="sub_section">
      <SectionTitle>
Resource Management (RM1) Speech
Recognition Tests
</SectionTitle>
      <Paragraph position="0"> Table 1 presents a tabulation of speech recognition system results for the (read speech) RM1 test material.</Paragraph>
    </Section>
  </Section>
  <Section position="2" start_page="50" end_page="50" type="metho">
    <SectionTitle>
ATIS Spontaneous Speech Recognition
Component Tests (SPREC)
</SectionTitle>
    <Paragraph position="0"> Table 2 presents a tabulation of SPREC results for speech recognition systems (or SLS speech recognition components) results for the spontaneous speech in the ATIS domain.</Paragraph>
  </Section>
  <Section position="3" start_page="50" end_page="50" type="metho">
    <SectionTitle>
ATIS Natural Language Component Tests
(NL)
</SectionTitle>
    <Paragraph position="0"> Table 3 presents a tabulation of natural language system results for the ATIS NL system components (or systems).</Paragraph>
    <Paragraph position="1"> In Tables 3 and 4, both the number of queries (and the corresponding percentage of the total number of queries in a given category) are shown for the categories &amp;quot;True&amp;quot; (or correct), &amp;quot;False&amp;quot; (incorrect) and &amp;quot;No Answer&amp;quot;. The &amp;quot;Weighted Error&amp;quot; percentage was computed by multiplying the percentage of False answers by 2 and adding the percentage of &amp;quot;No_Answer&amp;quot; responses. The column labelled &amp;quot;Score&amp;quot; was computed by subtracting the Weighted Error (%) from 100%.</Paragraph>
  </Section>
  <Section position="4" start_page="50" end_page="50" type="metho">
    <SectionTitle>
ATIS Spoken Language Systems Tests (SLS)
</SectionTitle>
    <Paragraph position="0"> Table 4 presents a tabulation of spoken language system results for complete ATIS systems.</Paragraph>
  </Section>
  <Section position="5" start_page="50" end_page="52" type="metho">
    <SectionTitle>
VI. DISCUSSION OF BENCHMARK TEST
RESULTS
RM1 Speech Recognition Results (Table 1)
</SectionTitle>
    <Paragraph position="0"> Focusing on the Speaker Independent test set results, with use of the Word Pair grammar, the word error ranges from 9.7% to 3.6%, while the sentence error ranges from 47.3% to 19.3%.</Paragraph>
    <Paragraph position="1"> Using the NIST implementation of the McNemar test used in previous tests \[16\], the differences between the sentence errorlevel results for the system with the lowest reported word and sentence error rates (sys4, the CMU system described in reference \[18\]) and all other systems in this category are significant for all but sysl0 and sysll (two BBN systems described in reference \[4\]). The sentence- error-levelperformance differences between the CMU system and the two BBN systems are not significant.</Paragraph>
    <Paragraph position="2"> There are three sets of results for the BU-BBN collaborative effort described in \[5\]. The first of these (designated sys7), with a santenee error rate of 27.0%, results from the hybrid BU-BBN system. The second of these (sysl2), with a sentence error rate of 27.7%, results from the top answer from the BBN N-best system used for this study. The third (sysl3), with a sentence error rate of 47.3%, results from the top answer of the BU context-independent, gender-dependent segment model system.</Paragraph>
    <Paragraph position="3"> Lowest overall word and sentence error rates (1.8% and 12.0%, respectively) are reported for the case of the speaker-dependent Word-Pair grammar system results (sys5) reported by Paul, at MIT/LL, described in \[2\].</Paragraph>
    <Paragraph position="4"> In addition to results reported in this session, note that results were reported to NIST for two systems not described in this session. Huang et al. at CMU reported results for an HMM system incorporating a &amp;quot;shared semi-continuons model&amp;quot;. That  system is described in a paper to be presented at ICASSP-91 \[17\]. Gauvain and Lee at AT&amp;T Bell Laboratories reported results for an investigation &amp;quot;into the use of Bayesian learning of the parameters of a Gaussian mixture density&amp;quot;, and this study is reported in another paper in this proceedings \[18\].</Paragraph>
  </Section>
  <Section position="6" start_page="52" end_page="52" type="metho">
    <SectionTitle>
ATIS SPREC Results (Table 2)
</SectionTitle>
    <Paragraph position="0"> Focusing on the word error for the 145 utterances in the Class A test set, the range is from 46.1% to 15.7%, while the sentence error ranges from 91.0% to 52.4%.</Paragraph>
    <Paragraph position="1"> The McNemar sentence-error-level significance test (not shown) indicates that the system with the lowest reported word and sentence error rate for the Class A utterances (sys24-a, the Unisys implementation of syntactic, semantic and pragmatic constraints in selecting the first candidate from an N-best listing provided by BBN, described in \[11\]) has an error rate that is significantly less than all but two other systems, (sysl8-a, the BBN &amp;quot;augmented training&amp;quot; system, and sys06-a, the SRI system). Performance differences (at the sentence error level) between these three systems, however, are not significant.</Paragraph>
    <Paragraph position="2"> Comparison of the results for the BBN &amp;quot;baseline&amp;quot; and &amp;quot;augmented&amp;quot; training condition (sysl8 and sysl9) gives some indication of the benefits of additional (in this case, domainspecific) training and a more powerful 4-gram statistical class grammar. The McNemar test indicates that the difference in performance between sysl8-a and 19-a is significant.</Paragraph>
    <Paragraph position="3"> Comparisons of results for similar systems for the two larger test subsets (i.e., Class A results vs. Class D1 results) suggest that the Class D1 material is somewhat more difficult to recognize (i.e., the error rates are higher). An interesting hypothesis that may account, in part, for this phenomenon is offered by Norton et al.: &amp;quot;...this higher error rate in context dependent spontaneous utterances may be due in part to the presence of prosodic phenomena common in dialogue such as destressing 'old' information&amp;quot; \[11\].</Paragraph>
    <Paragraph position="4"> Typical SPREC error rates are higher still for the two &amp;quot;optional&amp;quot; test subsets. This ought not to be surprising in view of the fact that these utterance subsets are, by definition and selection, more dysfluent (i.e., contain verbal deletions).</Paragraph>
    <Paragraph position="5"> Not shown in Table 2, but indicated by other analyses, is high inter-subject variability for the SPREC tests as well as for the NL and SiS tests.</Paragraph>
  </Section>
  <Section position="7" start_page="52" end_page="52" type="metho">
    <SectionTitle>
ATIS NL Results (Table 3)
</SectionTitle>
    <Paragraph position="0"> For the Class A subset, results are tabulated for eight NL systems at 5 DARPA contractors' sites, and at AT&amp;T Bell Laboratories and at INRS-Telecom. For the DARPA contractor's systems, the weighted error ranges from 51.7% to 31.0%.</Paragraph>
    <Paragraph position="1"> The two sets of CMU results include data for the PHOENIX system described in \[8\] (sysOl), and for the PHOENIX system integrated with the SOUL module described by Young in \[9\] (sys02).</Paragraph>
    <Paragraph position="2"> For the Class A test material, the lowest weighted error figures (31.0%) are found for both the SRI system described by Murveit et al. in \[7\] (sys13-a), and for the CMU PHOENIX + SOUL system of \[9\] (sys02-a).</Paragraph>
    <Paragraph position="3"> For the Class D1 and Optional Class D1 subsets, the weighted error percentages are substantially higher than for the Class A results. For the Class D1 test material, the lowest weighted error figures (36.8%) are found for the Unisys system described by Norton et al. in \[11\] (sys09-d).</Paragraph>
    <Paragraph position="4"> Note that two sets of results are reported for the Class D1 material for BBN (denoted syslS-d and sys23-d). Subsequent to submission of the initial results for sysl5-d, BBN's representatives notified NIST that &amp;quot;...there was a small bug in the component that translates the result of the understanding (i.e., the output of the discourse component) into SQL... \[and that since\] the bug in our system.., was NOT in the UNDERSTANDING or the DISCOURSE component but between the output of those components and the SQL backend and ...</Paragraph>
    <Paragraph position="5"> \[since\] one small quick fix in the backend corrected the problem, we concluded that it is reasonable to send you new answers for our Class D test&amp;quot; \[19\]. The data designated as sys23-d is derived from these &amp;quot;new answers&amp;quot;.</Paragraph>
  </Section>
  <Section position="8" start_page="52" end_page="58" type="metho">
    <SectionTitle>
ATIS SLS Results (Table 4)
</SectionTitle>
    <Paragraph position="0"> For the Class A subset, results are tabulated for 7 SLS systems at 5 DARPA contractors' sites. Non-DARPA contractors declined to participate in the SLS tests. The weighted error ranges considerably, from 90.3% to 41.4%, with the best (lowest weighted error) results for the SRI system described in \[7\] and in other SRI papers in this proceedings.</Paragraph>
    <Paragraph position="1"> The low SRI SLS weighted error rate (41.4%) appears to be a consequence of both a well-performing ATIS speech recognition component and a well-performing natural language component (i.e., a SPREC test word error rate of 18.0% and an NL weighted error rate of 31.0%).</Paragraph>
    <Paragraph position="2"> Not surprisingly, weighted error figures for complete SLS systems are higher than for corresponding NL components (processing the lexical SNOR formatted versions of the same utterances). The relative increase in weighted error rate appears to correspond to the relative performance of the speech recognition component.</Paragraph>
    <Paragraph position="3"> By comparing comparable data from Tables 3 and 4, note that for the SRI system the weighted error rate for the Class A subset increases from 31.0% (for the NL component) to 41.4% (for the complete SLS system.</Paragraph>
    <Paragraph position="4"> Two SLS systems made use of BBN's BYBLOS ATIS SPREC data: the BBN HARC system (sys16-a) and the Unisys-BBN SPREC system (sys22-a). Comparing the increases of weighted error rates for NL vs. SLS systems, one can note an approximate increase in weighted error rate of only 8 or 9 percentage points for these systems (i.e., from 49.0% for the BBN DELPHI NL system to 57.2% for the BBN HARC SIS system, and from 51.7% for the Unisys NL system to 60.7% for the Unisys-((BBN SPREC)) SLS system). This relatively small increase in error rate is probably attributable to the BBN &amp;quot;augmented training&amp;quot; (sysl8-a) SPREC test word error rate of (only) 16.1%, which is not significantly different from SRI's SPREC test results of 18.0%.</Paragraph>
    <Paragraph position="5">  In contrast, a substantially larger increase in error rate can be noted for the CMU systems (i.e., 35.9% and 31.0% for the two CMU NL systems vs. 65.5% for the SLS system), probably due to performance of the CMU SPREC system with error rates that are significantly higher than for the SRI SPREC system. Unisys reported results for three system configurations: using speech recognition results provided by the MIT/LCS ATIS SPREC system (designated sysl0-a), by the MIT/LL ATIS SPREC system (sys ll-a), and by the BBN BYBLOS/ATIS system (sys22-a). In this case, better performance on the SLS test (i.e., lower weighted error) correlates with better performance on the SPREC results, as would be expected.</Paragraph>
    <Paragraph position="6"> As was also the case for the NL results, the weighted error results for the Class D1 test subset are substantially higher than for the Class A results.</Paragraph>
    <Paragraph position="7"> VII. ACKNOWLEDGEMENT Too many individuals have served as points-of-contact at the research sites involved in these benchmark tests to be individually thanked, but their efforts and patience in seeing that information and data are made available are greatly appreciated. My colleagues at NIST deserve special thanks for their efforts and effieiancy in making these tests possible and in tabulation of the results. In particular, Bill Fisher has had a key role, both as Chairman of the DARPA SLS Performance</Paragraph>
    <Section position="1" start_page="53" end_page="58" type="sub_section">
      <SectionTitle>
Evaluation Working Group and as the individual responsible for
</SectionTitle>
      <Paragraph position="0"> ATIS test material selection and in reviewing the &amp;quot;canonical&amp;quot; auxiliary files. Jon Fiscus and John Garofolo, also at NIST, have been responsible for implementation of scoring software and for preparation of corpora on CD-ROM.</Paragraph>
      <Paragraph position="1"> VIII. REFERENCES  1. Baker, J., et al., &amp;quot;Dragon Systems Resource Management Benchmark Test Results--February 1991&amp;quot; (in this Proceedings).</Paragraph>
      <Paragraph position="2"> 2. Paul, D. B. &amp;quot;New Results with the Lincoln Tied-Mixture HMM CSR System&amp;quot; (in this Proceedings).</Paragraph>
      <Paragraph position="3"> 3. Phillips, M., Glass, J. and Zue, V., &amp;quot;Modelling Context Dependency in Acoustic-Phonetic and Lexical Representations&amp;quot; (in this Proceedings).</Paragraph>
      <Paragraph position="4"> 4. Kubala, F. et al., &amp;quot;BYBLOS Speech Recognition Benchmark Results&amp;quot; (in this Proceedings).</Paragraph>
      <Paragraph position="5"> 5. Ostendorf, M. et al., &amp;quot;Integration of Diverse Recognition Methodologies Through Reevaluation of N-Best Sentence Hypotheses&amp;quot; (in this Proceedings).</Paragraph>
      <Paragraph position="6"> 6. Seneff, S. et al., &amp;quot;Development and Preliminary Evaluation of the M1T ATIS System&amp;quot; (in this Proceedings).</Paragraph>
      <Paragraph position="7"> 7. Murveit, H. et al., &amp;quot;SRI's Speech and Natural Language Evaluation&amp;quot; (in this Proceedings).</Paragraph>
      <Paragraph position="8"> 8. Ward, W., &amp;quot;Current Status of the CMU ATIS System&amp;quot; (in this Proceedings).</Paragraph>
      <Paragraph position="9"> 9. Young, S., &amp;quot;Using Semantics to Correct Parser Output for ATIS Utterances&amp;quot; (in this Proceedings).</Paragraph>
      <Paragraph position="10"> 10. Austin, S. et al., &amp;quot;BBN HARC and Delphi Results on the ATIS Benchmarks--February 1991&amp;quot; (in this Proceedings). 11. Norton, L. et al., &amp;quot;Augmented Role filling Capabilities for Semantic Interpretation of Spoken Language&amp;quot; (in this Proceedings).</Paragraph>
      <Paragraph position="11"> 12. O'Shaughnessy, D., &amp;quot;A Textual Processor to Handle ATIS Queries&amp;quot; (in this Proceedings).</Paragraph>
      <Paragraph position="12"> 13. Tzoukermann, E., &amp;quot;The Use of a Commercial Natural Language Interface in the ATIS Task&amp;quot; (in this Proceedings). 14. Pieraccini, R., Levin, E. and Lee, C.H., &amp;quot;Stochastic Representation of Conceptual Structure in the ATIS Task&amp;quot; (in this Proceedings).</Paragraph>
      <Paragraph position="13"> 15. Hemphill, C.T., Godfrey, J.J., and Doddington, G.R., &amp;quot;The ATIS Spoken Language System Pilot Corpus&amp;quot; in Proceedings of the DARPA\] Speech and Natural Language Workshop&amp;quot; June 1990, pp. 96 -101.</Paragraph>
      <Paragraph position="14"> 16. Pallett, D.S., et al. &amp;quot;DARPA ATIS Test Results: June 1990&amp;quot; in Proceedings of the DARPA\] Speech and Natural Language Workshop&amp;quot; June 1990, pp. 114 - 121.</Paragraph>
      <Paragraph position="15"> 17. Huang, X. et al., &amp;quot;Improved Acoustic Modelling for the SPHINX Speech Recognition System&amp;quot;, (to be presented at ICASSP-91).</Paragraph>
      <Paragraph position="16"> 18. Gauvaln, J. and Lee, C.H., &amp;quot;Bayesian Learning of Gaussian Mixture Densities for Hidden Marker Models&amp;quot; (in this Proceedings).</Paragraph>
      <Paragraph position="17"> 19. ARPANET communication from M. Bates and R. Ingria (BBN) to Dave Pallett (NIST), February 13, 1991.</Paragraph>
      <Paragraph position="18">  Key to Tables 2, 3 and 4: The following key is provided as an aid in cross-referencing the NIST-ID numbers to the sites submitting ATIS results and to descriptions of the systems in the references cited in this paper. Note: key for these tables differs from that for the RM1 results of Table 1. KEY: ATIS SPREC, NL, AND SLS TEST REFERENCES</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>